# Module 1: Pandas Fundamentals Review

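## Learning Objectives
- Understand the core Pandas data structures: Series and DataFrame
- Learn how to load data from a CSV file
- Master the common methods for inspecting basic DataFrame information
- Become familiar with selecting, indexing, and filtering data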

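## Introduction: Why Start with Pandas?

Pandas is one of the cornerstones of the Python data analysis ecosystem. It provides high-performance, easy-to-use data structures and analysis tools that make working with structured data (such as tables and time series) intuitive. Before we dive into exploratory data analysis (EDA) and feature engineering, a solid command of Pandas is an essential first step.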

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np


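## 1. Pandas Data Structures

Pandas has two main data structures:
- **Series**: a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, and so on). It is like a single column in Excel.
- **DataFrame**: a two-dimensional labeled data structure whose rows and columns align on their labels. It is like an Excel worksheet or a SQL table.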

In [2]:
# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8], name='MySeries')
print("This is a Series:")
print(s)



This is a Series:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: MySeries, dtype: float64


In [3]:
# Create a DataFrame
data = {'State': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'Year': [2000, 2001, 2002, 2001, 2002, 2003],
        'Pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
print("\nThis is a DataFrame:")
print(df)




This is a DataFrame:
    State  Year  Pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
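
The relationship between the two structures can be sketched as follows. This is a minimal illustration, assuming a small toy table (`df_demo`, `a`, and `b` are hypothetical names, not part of the course data): selecting one column of a DataFrame yields a Series that shares the DataFrame's index, and arithmetic between two Series aligns on labels rather than positions.

```python
import pandas as pd

# Hypothetical toy data that mirrors the DataFrame built above
df_demo = pd.DataFrame({
    "State": ["Ohio", "Ohio", "Nevada"],
    "Year": [2000, 2001, 2001],
    "Pop": [1.5, 1.7, 2.4],
})

# Selecting a single column returns a Series that shares the DataFrame's index
pop = df_demo["Pop"]
print(type(pop).__name__)                # Series
print(pop.index.equals(df_demo.index))   # True

# Arithmetic aligns on labels: a label missing on either side produces NaN
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b
print(total)
```

Here the label `x` exists only in `a`, so its sum is `NaN`, while `y` and `z` are added label by label.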


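## 2. Loading and Inspecting Data

In practice we rarely build a DataFrame by hand; far more often we read data from an external source. We will use the Titanic dataset as our example.

Following the folder structure laid out in our `.cursorrules` file, we will read the data from the `datasets` folder.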

In [4]:
pd.read_csv(r"../../../../data_mining_course/datasets/raw/titanic/titanic.csv")


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [5]:
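# Load the data from a CSV file.
# We read the file through a relative path: starting from the `notebooks` directory,
# the leading `../` segments climb toward the `iSpan_python-FE_DM-cookbooks` root directory.
path = r"../../../../data_mining_course/datasets/raw/titanic/titanic.csv"
titanic_df = pd.DataFrame()  # start with an empty DataFrame in case the read fails
try:
    titanic_df = pd.read_csv(path)
    print("Successfully loaded the Titanic dataset!")
except FileNotFoundError:
    print(f"Could not find titanic.csv at '{path}'. Please confirm that data_download.py ran successfully.")
    # Fallback for an alternative folder structure
    try:
        path = 'data_mining_course/datasets/titanic.csv'
        titanic_df = pd.read_csv(path)
        print("Successfully loaded the Titanic dataset from the fallback path!")
    except FileNotFoundError:
        print(f"Could not find the Titanic dataset at '{path}' either. Please check the file location.")




Successfully loaded the Titanic dataset!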
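
If the file is not available locally, the behaviour of `pd.read_csv` can still be tried out on in-memory text. The sketch below is only an illustration: `csv_text` is a hypothetical three-row sample, not the real dataset, and it is wrapped in `io.StringIO` so that no file path is needed.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for titanic.csv
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts any file-like object, not only a path
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)     # (3, 5)
print(sample_df.dtypes)    # the empty Age field becomes NaN, so Age is float64

# index_col promotes a column to the row index instead of the default 0..n-1
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(indexed_df.shape)    # (3, 4)
```

The empty `Age` field in the last row is parsed as a missing value, which is why the column is read as floating point rather than integer.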


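### 2.1 Basic Data Inspection

Once the data is loaded, the first step is to get a quick sense of what it looks like. This corresponds to the "initial data understanding and structuring" step described in our guide.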

In [6]:
# Check the data dimensions (rows, columns)
print(f"Data dimensions (rows, columns): {titanic_df.shape}")



Data dimensions (rows, columns): (891, 12)


In [7]:
# View the first 5 rows
print("\nHead of the dataset (first 5 rows):")
titanic_df.head()




Head of the dataset (first 5 rows):


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
# View the last 5 rows
print("\nTail of the dataset (last 5 rows):")
titanic_df.tail()




Tail of the dataset (last 5 rows):


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [9]:
# Get a summary of the DataFrame
# This shows the index type, columns, non-null counts, dtypes, and memory usage
print("\nDataFrame summary (.info()):")
titanic_df.info()





DataFrame summary (.info()):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


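From the `.info()` output we can quickly see that the `Age`, `Cabin`, and `Embarked` columns contain missing values (their non-null counts are below the total of 891 rows). This is an issue that deserves close attention during the EDA stage.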
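
The same check can be made explicit by counting missing values per column. The sketch below assumes a small hypothetical frame (`demo`) with the same three column names; on the real data one would call the same methods on `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with gaps in each column
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_count = demo.isna().sum()

# mean() of the same boolean mask gives the share of missing values
missing_share = demo.isna().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the count puts the columns with the most gaps first, which is usually where cleaning decisions start.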

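In [10]:
# Get descriptive statistics for the numeric columns
# This includes the count, mean, standard deviation, minimum, 25th/50th/75th percentiles, and maximum
print("\nDescriptive statistics for numeric columns (.describe()):")
titanic_df.describe()




Descriptive statistics for numeric columns (.describe()):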


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [11]:
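# Get descriptive statistics for the categorical columns
print("\nDescriptive statistics for categorical columns:")
titanic_df.describe(include=['O'])





Descriptive statistics for categorical columns: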


Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


## 3. Selecting and Filtering Data

Pandas offers several ways to select a subset of the data.

In [12]:
# Select a single column (returns a Series)
ages = titanic_df['Age']
print("Selecting the 'Age' column (Series):")
ages.head()



Selecting the 'Age' column (Series):


0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [13]:
# Select multiple columns (returns a DataFrame)
subset = titanic_df[['Name', 'Age', 'Sex']]
print("\nSelecting the 'Name', 'Age', 'Sex' columns (DataFrame):")
subset.head()




Selecting the 'Name', 'Age', 'Sex' columns (DataFrame):


Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22.0,male
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female
2,"Heikkinen, Miss. Laina",26.0,female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female
4,"Allen, Mr. William Henry",35.0,male


### 3.1 Indexing with `.loc` and `.iloc`

- `.loc[]`: **label-based** indexing.
- `.iloc[]`: **position-based** (integer) indexing.

In [14]:
# .loc: select the rows labeled 0 through 4 (end label included) and the 'Pclass' and 'Fare' columns
titanic_df.loc[0:4, ['Pclass', 'Fare']]



Unnamed: 0,Pclass,Fare
0,3,7.25
1,1,71.2833
2,3,7.925
3,1,53.1
4,3,8.05


In [15]:
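# .iloc: select the rows at positions 0 through 4 (5 is excluded) and the columns at positions 2 and 3 (Pclass, Name)
# Note that integer-position slicing follows Python's convention and excludes the end point.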
titanic_df.iloc[0:5, [2, 3]]




Unnamed: 0,Pclass,Name
0,3,"Braund, Mr. Owen Harris"
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,"Heikkinen, Miss. Laina"
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,3,"Allen, Mr. William Henry"
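
With the default `RangeIndex`, labels and positions coincide, which can hide the difference between the two indexers. The sketch below assumes a hypothetical frame (`fares`) whose index labels are 10, 20, 30, 40, so that labels and positions no longer match.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
fares = pd.DataFrame(
    {"Pclass": [3, 1, 3, 1], "Fare": [7.25, 71.2833, 7.925, 53.1]},
    index=[10, 20, 30, 40],
)

# .loc slices by label and includes the end label: rows 10, 20, 30
by_label = fares.loc[10:30]
print(len(by_label))       # 3

# .iloc slices by position and excludes the end position: rows at 0 and 1
by_position = fares.iloc[0:2]
print(len(by_position))    # 2

# The same cell reached both ways: label 20 sits at position 1,
# and the 'Fare' column sits at column position 1
print(fares.loc[20, "Fare"] == fares.iloc[1, 1])   # True
```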


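### 3.2 Conditional Filtering (Boolean Indexing)

Boolean indexing keeps only the rows for which a condition evaluates to True. It is one of the most powerful and frequently used features in data analysis.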

In [16]:
# Select all female passengers
female_passengers = titanic_df[titanic_df['Sex'] == 'female']
print(f"There were {len(female_passengers)} female passengers aboard the Titanic.")
female_passengers.head()



There were 314 female passengers aboard the Titanic.


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [17]:
# Select all first-class passengers (Pclass=1) older than 50
senior_first_class = titanic_df[(titanic_df['Pclass'] == 1) & (titanic_df['Age'] > 50)]
print(f"\nThere are {len(senior_first_class)} senior first-class passengers.")
senior_first_class.head()




There are 39 senior first-class passengers.


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
124,125,0,1,"White, Mr. Percival Wayland",male,54.0,0,1,35281,77.2875,D26,S
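
Beyond `&`, conditions can be combined with `|` (or), negated with `~`, or expressed through `isin` and `query`. The following is a minimal sketch on a hypothetical five-row frame (`people`), not on the real dataset. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame; one Age value is missing on purpose
people = pd.DataFrame({
    "Pclass": [1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [54.0, 26.0, np.nan, 30.0, 14.0],
})

# OR: first class, or younger than 18
first_or_child = people[(people["Pclass"] == 1) | (people["Age"] < 18)]

# NOT: everyone outside third class
not_third = people[~(people["Pclass"] == 3)]

# isin: membership test against a list of values
upper_classes = people[people["Pclass"].isin([1, 2])]

# A comparison against NaN is False, so the row with a missing Age is dropped here
over_50 = people[people["Age"] > 50]

# query expresses the same kind of filter as a string
senior_first = people.query("Pclass == 1 and Age > 50")

print(len(first_or_child), len(not_third), len(upper_classes), len(over_50), len(senior_first))
```

The missing `Age` matters in practice: a filter such as `Age > 50` silently excludes passengers whose age is unknown.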


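## Summary

In this notebook we reviewed the fundamentals of Pandas, including:
- The core data structures `Series` and `DataFrame`.
- Reading data from a file and inspecting it quickly with methods such as `.head()`, `.info()`, and `.describe()`.
- Selecting and filtering data with `[]`, `.loc`, `.iloc`, and boolean conditions.

These are essential skills for any data analysis project. In the next notebook we will learn how to visualize this data.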