### Hierarchical Indexing

Hierarchical indexing (also called multi-indexing, and implemented by the `MultiIndex` type) in Pandas is a way of creating multiple levels of indexing in a DataFrame or Series.

It lets you organize data in a nested format, similar to grouped or pivot tables.

#### Why Use Hierarchical Indexing?
- To work with high-dimensional data in a 2D format.
- To perform grouped operations and analysis.
- To slice and dice data more efficiently.

`High-dimensional` data refers to datasets that have a large number of features (columns or variables), not just a large number of rows.

#### Where You See High-Dimensional Data:
| Field            | Example                                                     |
| ---------------- | ----------------------------------------------------------- |
| Bioinformatics   | Thousands of gene expressions per sample                    |
| Image processing | Each pixel is a feature (e.g. 64x64 image = 4,096 features) |
| Text data (NLP)  | Each word can be a feature (vector of 10,000+ words)        |
| Finance          | Many stock indicators and time series features              |
| Machine Learning | When using many features for prediction                     |

We’ll explore the direct creation of `MultiIndex` objects; considerations when indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of your data.

In [3]:
import pandas as pd
import numpy as np

#### A Multiply Indexed Series

Let’s start by considering how we might represent two-dimensional data within a one-dimensional Series.

We'll consider a series of data where each point has a character and numerical key.

#### The bad way

Suppose you would like to track data about states from two different years.

You might be tempted to simply use Python tuples as keys:

In [4]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
data = pd.Series(populations, index=index)
data

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [5]:
data[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there. For example, if you need to select all values from 2010, you’ll need to do some messy (and potentially slow) munging to make it happen:

In [6]:
data

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [20]:
data[[i for i in index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

This produces the desired result, but it is neither clean nor efficient for large datasets.

### The Better Way: Pandas MultiIndex

The Pandas `MultiIndex` type gives us the type of operations we wish to have.

We can create a multi-index from the tuples as follows:

In [7]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [9]:
data.reindex(index)

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [11]:
data = data.reindex(index)

In [12]:
data

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Now, to access all data for which the second index is 2000, we can simply use the Pandas slicing notation:

In [14]:
data[:, 2000]  # Slicing by year

California    33871648
New York      18976457
Texas         20851820
dtype: int64
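The same kind of cross-section can be spelled more explicitly with `.loc` or `xs()`. The sketch below rebuilds the population Series so that it runs on its own (the name `pop` is only illustrative) and selects the year 2010:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('New York', 2000), ('New York', 2010),
    ('Texas', 2000), ('Texas', 2010),
])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

# Two equivalent ways to select every state for the year 2010
by_loc = pop.loc[:, 2010]        # slice the first level, fix the second
by_xs = pop.xs(2010, level=1)    # cross-section on the second level

print(by_xs)
```

Both results are indexed by state only, because the year level has been consumed by the selection.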

#### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels.

In fact, Pandas is built with this equivalence in mind. The `unstack()` method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [15]:
data

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [16]:
data_df = data.unstack()
data_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


Naturally, the `stack()` method provides the opposite operation:

In [17]:
data_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
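As a quick check that the two methods are inverses, the following sketch uses a cut-down copy of the population data (the names `pop`, `wide`, and `roundtrip` are illustrative) and confirms that unstacking and then stacking recovers the original values and keys:

```python
import pandas as pd

pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102],
    index=pd.MultiIndex.from_tuples([
        ('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
    ]),
)

wide = pop.unstack()        # rows: state, columns: year
roundtrip = wide.stack()    # back to a two-level Series

print(wide.shape)
print(roundtrip.tolist() == pop.tolist())
```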

Seeing this, you might wonder why we would bother with hierarchical indexing at all.

The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional `Series`, we can also use it to represent data of three or more dimensions in a `Series` or `DataFrame`.

Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.

Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a `MultiIndex` this is as easy as adding another column to the DataFrame:

In [18]:
data_df = data_df.stack()

In [19]:
data_df

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [20]:
data

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [21]:
data_df = pd.DataFrame({'total': data,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})

data_df

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In addition, all the `ufuncs` and other functionality work with hierarchical indices as well.

In [24]:
# compute the fraction of people under 18 by year

f_u18 = data_df['under18'] / data_df['total']
# f_u18
f_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


This allows us to easily and quickly manipulate and explore even high-dimensional data.

### Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. 

In [25]:
np.random.rand(4, 2)

array([[0.37865487, 0.24229909],
       [0.17204509, 0.91680819],
       [0.05432806, 0.18511918],
       [0.20653883, 0.73374693]])

In [33]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                    columns=['data1', 'data2'])

df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.753938,0.22166
a,2,0.959429,0.396387
b,1,0.047121,0.644294
b,2,0.211551,0.887456


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a `MultiIndex` by default:

In [26]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

You can construct the MultiIndex from a simple list of arrays giving the index values within each level:

In [35]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point:

In [36]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
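When the index is simply every combination of the level values, `pd.MultiIndex.from_product` is a shorter alternative. It is not used elsewhere in these notes, so treat this as a supplementary sketch:

```python
import pandas as pd

# Cartesian product of the level values: every letter paired with every number
idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
print(list(idx))
```

This produces the same four keys as the `from_arrays` and `from_tuples` calls above.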

### MultiIndex level names

Sometimes it is convenient to name the levels of the MultiIndex.

This can be accomplished by passing the `names` argument to any of the above MultiIndex constructors, or by setting the `names` attribute of the index. 

In [28]:
data_df.index.names = ['state', 'year']
data_df

Unnamed: 0_level_0,Unnamed: 1_level_0,total,under18
state,year,Unnamed: 2_level_1,Unnamed: 3_level_1
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014
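Names can also be supplied when the index is built. The sketch below passes `names` to `from_arrays` and then refers to levels by name; the level names `letter` and `number` and the values are made up for illustration:

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['letter', 'number'],   # hypothetical level names
)
s = pd.Series([10, 20, 30, 40], index=idx)

# Named levels can be referenced by name instead of by position
second = s.xs(2, level='number')
totals = s.groupby(level='letter').sum()

print(second)
print(totals)
```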


#### Indexing and Slicing a MultiIndex

In [29]:
data_df

Unnamed: 0_level_0,Unnamed: 1_level_0,total,under18
state,year,Unnamed: 2_level_1,Unnamed: 3_level_1
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


We can select all rows for one value of the outer level by indexing with that label alone, and a single row by indexing with a full tuple:

In [30]:
data_df.loc['California']


Unnamed: 0_level_0,total,under18
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,33871648,9267089
2010,37253956,9284094


In [31]:
data_df

Unnamed: 0_level_0,Unnamed: 1_level_0,total,under18
state,year,Unnamed: 2_level_1,Unnamed: 3_level_1
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [32]:
data_df.loc[('California', 2010)]

total      37253956
under18     9284094
Name: (California, 2010), dtype: int64
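Selecting on the inner level, or picking out a single cell, needs slightly more syntax. The following sketch rebuilds the same table under the illustrative name `frame` and uses `pd.IndexSlice`, which is one of several ways to do this:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
frame = pd.DataFrame(
    {'total': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
     'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]},
    index=idx,
)

# One scalar: full row key plus a column label
ca_2010_total = frame.loc[('California', 2010), 'total']

# All states for one year: slice the outer level, fix the inner level
idx_slice = pd.IndexSlice
year_2010 = frame.loc[idx_slice[:, 2010], :]

print(ca_2010_total)
print(year_2010)
```

Slicing like this expects the index to be sorted, which holds here because the level values are already in order.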

### Titanic Dataset Exploration

#### Import Libraries

In [33]:
import pandas as pd
import numpy as np

#### Load Dataset

In [36]:
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Explore the Data

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [38]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [39]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [40]:
df.shape

(891, 12)

### Handle Missing Values

Real-world datasets are often incomplete.

Missing values are empty or null entries in your dataset. They may appear as:
- NaN (Not a Number)
- None

#### Types of Missing Values

There are **three main types of missing values** in data analysis. Understanding them helps you choose the right strategy to handle them.

---

#### 1. Missing Completely at Random (MCAR)

**What it means:**  
The missingness is **completely unrelated** to any other data. It’s purely random.

**Example:**  
In a survey, a respondent accidentally skips a question.

**How to handle:**  
- You can **drop** rows or columns without introducing bias.
- Or use **simple imputation** (mean/median/mode) safely.

---

#### 2. Missing at Random (MAR)

**What it means:**  
The missingness is **systematic**, but related to **observed** data, not the missing data itself.

**Example:**  
Older respondents are less likely to answer income questions. Missingness depends on age (which is known).

**How to handle:**  
- Use **imputation based on other variables** (e.g., regression, KNN, multiple imputation).
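A lightweight way to impute from an observed variable is a group-wise fill. This is far simpler than regression, KNN, or multiple imputation and is shown here only as a sketch on made-up data, using passenger class as the observed variable:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age is missing for some passengers
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age within each class, broadcast back to the original row order
class_median = toy.groupby('Pclass')['Age'].transform('median')
toy['Age_filled'] = toy['Age'].fillna(class_median)

print(toy)
```

Each missing age is replaced by the median of its own class rather than by one global value.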

---

#### 3. Missing Not at Random (MNAR)

**What it means:**  
The missingness is **related to the missing value itself**. This is the most problematic type.

**Example:**  
People with high incomes are less likely to report their income. The reason it’s missing is the value itself.

**How to handle:**  
- May require **domain knowledge**, **modeling the missingness**, or **collecting more data**.
- In some cases, treat missingness as a **feature**.
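Treating missingness as a feature usually means adding an indicator column. A minimal sketch on made-up data (the column name `Cabin_missing` is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'C123', np.nan]})

# 1 where the value is missing, 0 where it is present
toy['Cabin_missing'] = toy['Cabin'].isna().astype(int)
print(toy)
```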

---

#### Summary Table

| Type | Depends on...            | Safe to Drop? | Example                            |
| ---- | ------------------------ | ------------- | ---------------------------------- |
| MCAR | Nothing (pure random)    | Yes           | A skipped question by mistake      |
| MAR  | Other known variables    | No            | Income missing depends on age      |
| MNAR | The missing value itself | No            | High earners skipping income field |


##### Why Handle Missing Values?
- They can bias your analysis
- They can distort statistics (like mean, median)

##### Check for Missing Values

In [42]:
# View total missing values per column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
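Raw counts are easier to judge as a share of all rows. Because `isnull()` returns booleans, taking the mean gives the fraction missing per column. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Mean of the boolean mask = fraction of missing entries per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```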

### Ways to Handle Missing Values
#### 1. Remove Missing Data
Use this when the row or column isn't critical to the analysis.

In [43]:
# df.dropna()                      # Drops rows with any missing values
df.drop(columns='Cabin')        # Drops specific column

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


#### 2. Fill Missing Data (Imputation)
For Numerical Columns:

In [44]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
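# Pick one strategy and assign the result back
# (calling fillna with inplace=True on a selected column is deprecated in recent Pandas)
# df['Age'] = df['Age'].fillna(df['Age'].mean())    # Replace with mean
df['Age'] = df['Age'].fillna(df['Age'].median())    # Replace with median (more robust to outliers)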

##### When to Use Mean vs Median for Imputation

| Method     | Use When...                                                                             | Why                                                                   |
| ---------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Mean**   | The data is **normally distributed** (symmetrical, bell-shaped) and has **no outliers** | Mean gives a good estimate of the "typical" value                     |
| **Median** | The data is **skewed** or contains **outliers**                                         | Median is more **robust** because it’s not affected by extreme values |
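The difference is easy to see on a handful of numbers. The values below are made up, with one extreme entry:

```python
import pandas as pd

fares = pd.Series([7.0, 8.0, 8.0, 10.0, 512.0])  # hypothetical values, one extreme

print(fares.mean())    # 109.0, pulled up by the single large value
print(fares.median())  # 8.0, unaffected by it
```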


For Categorical Columns:

In [None]:
# Replace with the most frequent value, assigning the result back
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

### Filter / Indexing

In [46]:
# All passengers over 60
df[df['Age']>60].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
170,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S


In [66]:
# First-class survivors
df[(df['Pclass'] == 1) & (df['Survived'] == 1)].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
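When combining conditions, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>` in Python. A small sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Survived': [1, 0, 1, 0],
})

# Each comparison is wrapped in its own parentheses before combining
first_class_survivors = toy[(toy['Pclass'] == 1) & (toy['Survived'] == 1)]

# | is "or", ~ is "not"
not_first_or_survived = toy[~(toy['Pclass'] == 1) | (toy['Survived'] == 1)]

print(first_class_survivors)
print(not_first_or_survived)
```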


#### Add / Modify Columns

In [67]:
# Create age group
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100], labels=['Child', 'Young Adult', 'Adult', 'Senior'])

In [68]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Young Adult
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Adult
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Young Adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Young Adult
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Young Adult
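Note that a 35-year-old lands in 'Young Adult' above. This is because `pd.cut` bins are right-inclusive by default, and missing ages stay missing. A short sketch with made-up ages at the bin edges:

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 19, 35, 36, np.nan])
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=['Child', 'Young Adult', 'Adult', 'Senior'])

# Bins are (0, 18], (18, 35], (35, 60], (60, 100]
print(groups.tolist())
```

Passing `right=False` would make the bins left-inclusive instead.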


#### What are SibSp and Parch?

| Column  | Meaning                                                  |
| ------- | -------------------------------------------------------- |
| `SibSp` | Number of **siblings or spouses** a passenger had aboard |
| `Parch` | Number of **parents or children** a passenger had aboard |

In [47]:
# Family onboard
df['FamilySize'] = df['SibSp'] + df['Parch']

In [48]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


#### GroupBy and Aggregation

##### What is GroupBy?
`groupby()` in Pandas lets you split your data into groups, apply a function, and then combine the results.

This is known as the Split-Apply-Combine strategy:
- Split the data into groups (e.g., by gender)
- Apply a function (like mean, sum)
- Combine the results into a new table
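The three steps can be spelled out by hand to see what `groupby()` does. This is an illustrative sketch on made-up data, not how Pandas implements it internally:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [38.0, 22.0, 26.0, 35.0],
})

# Split: one sub-frame per group; Apply: mean of Age; Combine: collect into a Series
manual = pd.Series({sex: part['Age'].mean() for sex, part in toy.groupby('Sex')})

# groupby performs all three steps in one call
auto = toy.groupby('Sex')['Age'].mean()

print(manual)
print(auto)
```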

1. Average Age by Gender

In [79]:
df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

2. Survival Rate by Passenger Class

In [80]:
df.groupby('Pclass')['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

3. Count of Passengers per Embarkation Port

In [81]:
df.groupby('Embarked')['PassengerId'].count()

Embarked
C    168
Q     77
S    644
Name: PassengerId, dtype: int64

#### Multi-Column Aggregation

In [51]:
df.groupby('Sex').agg({
    'Age': 'median',
    'Fare': ['mean', 'max'],
    'Survived': 'sum'
})

Unnamed: 0_level_0,Age,Fare,Fare,Survived
Unnamed: 0_level_1,median,mean,max,sum
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,27.0,44.479818,512.3292,233
male,29.0,25.523893,512.3292,109


In [72]:
df.groupby(['Sex', 'Pclass'])['Survived'].agg(['mean', 'count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,count
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1,0.968085,94
female,2,0.921053,76
female,3,0.5,144
male,1,0.368852,122
male,2,0.157407,108
male,3,0.135447,347


#### Resetting Index
After a groupby, you may want to turn the grouped column back into a normal column:

In [52]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [85]:
df.groupby('Pclass')['Fare'].mean().reset_index()

Unnamed: 0,Pclass,Fare
0,1,84.154687
1,2,20.662183
2,3,13.67555


#### Common Aggregation Functions

| Function    | Description                 |
| ----------- | --------------------------- |
| `.mean()`   | Average value               |
| `.sum()`    | Total sum                   |
| `.count()`  | Number of non-null values   |
| `.max()`    | Maximum value               |
| `.min()`    | Minimum value               |
| `.median()` | Middle value                |
| `.agg()`    | Apply multiple aggregations |
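With `.agg()`, named aggregation gives flat, readable column names instead of a two-level column header. The output names `mean_fare`, `max_fare`, and `survivors` below are hypothetical, and the data is made up:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Fare': [71.0, 7.0, 53.0, 9.0],
    'Survived': [1, 0, 1, 0],
})

# Each keyword becomes an output column: name=(source column, function)
summary = toy.groupby('Sex').agg(
    mean_fare=('Fare', 'mean'),
    max_fare=('Fare', 'max'),
    survivors=('Survived', 'sum'),
)
print(summary)
```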

Use `groupby()` when you want to analyze data by categories, like:
- Average survival rate by gender
- Total fare by class
- Passenger counts by port

#### Sorting and Reshaping

In [73]:
# Sort by Fare
df.sort_values(by='Fare', ascending=False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeGroup,FamilySize
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,Adult,1
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,Young Adult,0
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,Young Adult,0
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S,Young Adult,5
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S,Senior,5


#### Pivot Tables

A `pivot table` lets you summarize, group, and rearrange data in a DataFrame — similar to Excel pivot tables.

You can use it to:
- Show relationships between multiple variables
- Compare averages, counts, or totals
- Reshape data from long to wide format

Syntax
```python
pd.pivot_table(data, index=..., columns=..., values=..., aggfunc=...)
```

- `index`: What you want on the rows (e.g., 'Sex')
- `columns`: What you want on the columns (e.g., 'Pclass')
- `values`: What to calculate (e.g., 'Survived', 'Fare')
- `aggfunc`: Aggregation function ('mean', 'sum', 'count', etc.)

1. Survival Rate by Gender and Class

In [86]:
pd.pivot_table(df, index='Sex', columns='Pclass', values='Survived', aggfunc='mean')

Pclass,1,2,3
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


2. Average Fare by Class and Embarkation Port

In [87]:
pd.pivot_table(df, index='Pclass', columns='Embarked', values='Fare', aggfunc='mean')

Embarked,C,Q,S
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,104.718529,90.0,70.364862
2,25.358335,12.35,20.327439
3,11.214083,11.183393,14.644083


#### Difference Between groupby() and pivot_table()

| Feature      | `groupby()`                      | `pivot_table()`     |
| ------------ | -------------------------------- | ------------------- |
| Output Shape | Long format                      | Wide format         |
| Syntax       | More code (often requires reset) | Clean and intuitive |

In `wide` format, each category has its own column.

Survival Rate by Gender and Class (Wide):

| Sex    | Pclass\_1 | Pclass\_2 | Pclass\_3 |
| ------ | --------- | --------- | --------- |
| female | 0.97      | 0.92      | 0.50      |
| male   | 0.37      | 0.16      | 0.14      |

In `long` format, there's one column for values and another column indicating the category or group they belong to.

Same Data in Long Format:

| Sex    | Pclass | Survived |
| ------ | ------ | -------- |
| female | 1      | 0.97     |
| female | 2      | 0.92     |
| female | 3      | 0.50     |
| male   | 1      | 0.37     |
| male   | 2      | 0.16     |
| male   | 3      | 0.14     |
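The two shapes are closely related: a two-key `groupby()` gives the long form, and unstacking it gives the same wide table as `pivot_table()`. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 3, 1, 3, 3],
    'Survived': [1, 0, 1, 0, 1],
})

# Long: one row per (Sex, Pclass) pair
long_form = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Wide: Pclass values become columns
wide_form = long_form.unstack()
pivot = pd.pivot_table(toy, index='Sex', columns='Pclass',
                       values='Survived', aggfunc='mean')

print(wide_form)
print(pivot)
```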

### Converting Between Formats
Wide to Long:

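Use `melt()` on a wide table. Calling it on the raw `df` fails because `Survived` is already a column there, so we melt the pivot table instead:

In [54]:
wide = pd.pivot_table(df, index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
wide.reset_index().melt(id_vars='Sex', var_name='Pclass', value_name='Survived')

Unnamed: 0,Sex,Pclass,Survived
0,female,1,0.968085
1,male,1,0.368852
2,female,2,0.921053
3,male,2,0.157407
4,female,3,0.5
5,male,3,0.135447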

Long to Wide:

Use `pivot_table()` or `pivot()`:

In [55]:
pd.pivot_table(df, index='Sex', columns='Pclass', values='Survived')

Pclass,1,2,3
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


#### Practice Exercises
- How many passengers embarked from each port?
- Who is the oldest passenger who survived?
- What’s the average fare per passenger class?
- What percentage of women survived?

1. How many passengers embarked from each port?

In [None]:
# value_counts() returns the count of each unique value
df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

2. Who is the oldest passenger who survived?

In [76]:
df[df['Survived'] == 1].sort_values('Age', ascending=False).head(1)[['Name', 'Age']]

Unnamed: 0,Name,Age
630,"Barkworth, Mr. Algernon Henry Wilson",80.0


3. What’s the average fare per passenger class?

In [77]:
df.groupby('Pclass')['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

4. What percentage of women survived?

In [78]:
# Total number of women
total_women = df[df['Sex'] == 'female'].shape[0]

# Number of women who survived
women_survived = df[(df['Sex'] == 'female') & (df['Survived'] == 1)].shape[0]

# Percentage
percentage = (women_survived / total_women) * 100
print(f"{percentage:.2f}% of women survived")


74.20% of women survived
