# 📅 Day 3 — Exploratory Data Analysis (EDA)



## 🛳 The Titanic Dataset

The **Titanic dataset** records information about passengers aboard the RMS Titanic. 
It contains a mix of **numerical** and **categorical** variables, as well as **missing data**, **outliers**, 
and a **binary target variable** (survival).

We will use Seaborn's built-in Titanic dataset (891 rows, 15 columns).


In [None]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load Titanic dataset from seaborn
titanic = sns.load_dataset("titanic")
titanic.head()



### Key Features in the Seaborn Version
- **survived**: Whether the passenger survived (0 = No, 1 = Yes)  
- **pclass**: Passenger class (1st, 2nd, 3rd)  
- **sex**: Male or female  
- **age**: Age of passenger in years  
- **sibsp**: Number of siblings/spouses aboard  
- **parch**: Number of parents/children aboard  
- **fare**: Ticket fare (in British pounds)  
- **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)  
- **class**: Categorical version of `pclass` (First, Second, Third)  
- **who**: Man, woman, or child (based on sex and age)  
- **adult_male**: Whether the passenger is an adult male  
- **deck**: Cabin deck (many missing)  
- **embark_town**: Full name of embarkation port  
- **alive**: Text version of `survived` (yes/no)  
- **alone**: Whether the passenger travelled without family (no siblings, spouse, parents, or children aboard)  


## 📚 Instruction (3h)

### 1. Descriptive Statistics


#### Measures of Central Tendency: Mean, Median, Mode
The mean, median, and mode summarize the central tendency of a variable, 
for example the average passenger age or the most common embarkation port.


In [None]:

# Mean, Median, Mode examples
print("Mean age:", titanic['age'].mean())
print("Median age:", titanic['age'].median())
print("Mode embark_town:", titanic['embark_town'].mode()[0])
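
The `[0]` in the mode line is needed because `Series.mode()` returns a Series rather than a single value: when several values tie for most frequent, all of them come back in sorted order. A minimal sketch on a made-up toy Series (the values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

# Toy port codes (made-up values): "S" and "C" both appear twice, so they tie.
toy = pd.Series(["S", "C", "S", "C", "Q"])

modes = toy.mode()          # Series holding every tied mode, sorted
first_mode = modes.iloc[0]  # same idea as mode()[0]: take the first one

print(modes.tolist())
print(first_mode)
```

If ties matter for your analysis, inspect the full result of `mode()` instead of keeping only its first entry.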



#### Measures of Spread: Variance, Standard Deviation, Quartiles
Spread measures help us understand variability in the dataset. 
Titanic fares vary widely across classes.


In [None]:

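# Sample variance and standard deviation (pandas uses ddof=1 by default)
print("Fare variance:", titanic['fare'].var())
print("Fare standard deviation:", titanic['fare'].std())
# Quartiles: 25th, 50th (median), and 75th percentiles
print("Fare quartiles:")
print(titanic['fare'].quantile([0.25, 0.5, 0.75]))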



#### Skewness & Kurtosis
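Skewness measures the asymmetry of a distribution (positive values indicate a long right tail), while kurtosis measures how heavy the tails are. 
Pandas' `kurt()` reports excess kurtosis, so a normal distribution scores about 0.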


In [None]:

print("Fare skewness:", titanic['fare'].skew())
print("Fare kurtosis:", titanic['fare'].kurt())
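
To build intuition for what a positive skew means, here is a small sketch on a made-up toy sample (the values are illustrative, not Titanic fares). One very large value drags the mean well above the median, and `skew()` comes out positive:

```python
import pandas as pd

# Toy right-skewed sample (made-up values): mostly small, one very large.
toy = pd.Series([5, 7, 8, 8, 10, 12, 15, 200])

toy_mean = toy.mean()      # pulled upward by the single large value
toy_median = toy.median()  # barely affected by it
toy_skew = toy.skew()      # positive for a long right tail

print("mean:", toy_mean)
print("median:", toy_median)
print("skewness is positive:", toy_skew > 0)
```

A large gap between mean and median in the same direction as the skew is a quick sign that the median may be the better summary of a "typical" value.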


### 2. Distribution Exploration

#### Histograms

In [None]:
sns.histplot(titanic['age'].dropna(), bins=20, kde=False);

#### Density Plots

In [None]:
sns.kdeplot(data=titanic, x='age', hue='survived', common_norm=False);

#### Boxplots

In [None]:
sns.boxplot(data=titanic, x='class', y='fare');

### 3. Outlier Detection

#### Boxplots & Scatterplots

In [None]:
sns.scatterplot(data=titanic, x='age', y='fare', hue='class');

#### IQR Method

In [None]:

Q1 = titanic['fare'].quantile(0.25)
Q3 = titanic['fare'].quantile(0.75)
IQR = Q3 - Q1
outliers = titanic[(titanic['fare'] < Q1 - 1.5*IQR) | (titanic['fare'] > Q3 + 1.5*IQR)]
print("Outliers count:", len(outliers))
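
The same rule can be wrapped in a small helper so that it can be reused on other columns. This is only a sketch: the function name `iqr_bounds` and the parameter `k` are hypothetical, and the toy values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)  # quantile() ignores missing values
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy sample (made-up values) with one obviously extreme point.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]

print("Fences:", lower, upper)
print("Flagged values:", flagged.tolist())
```

In the notebook, the helper could be called as `iqr_bounds(titanic['fare'])`. The multiplier 1.5 is a convention, not a law: points beyond the fences are candidates for inspection, not automatically errors.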


### 4. Correlation Analysis

In [None]:

corr = titanic[['age','fare','sibsp','parch']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm');
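
`DataFrame.corr()` computes Pearson correlation by default, which only captures linear relationships. Passing `method="spearman"` switches to a rank-based correlation that also picks up monotonic but non-linear patterns. A minimal sketch on made-up toy data (not Titanic values):

```python
import pandas as pd

# Toy data (made-up): y always grows with x, but not along a straight line.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr().loc["x", "y"]                    # default method="pearson"
spearman = toy.corr(method="spearman").loc["x", "y"]  # correlation of the ranks

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman reaches 1 here because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the relationship is curved. For skewed variables such as fare, comparing both can be informative.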


### 5. Grouping & Aggregation

In [None]:

# Survival rates by sex
print(titanic.groupby('sex')['survived'].mean())

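# Multiple statistics on fares by class
# (observed=True keeps only categories present in the data and avoids a pandas FutureWarning)
print(titanic.groupby('class', observed=True)['fare'].agg(['mean','median','count']))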


#### Pivot Tables

In [None]:

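# Survival rate by sex (rows) and class (columns)
# pd.crosstab(..., values=..., aggfunc='mean') would build the same table
titanic.pivot_table(values='survived', index='sex', columns='class', aggfunc='mean', observed=True)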


### 6. Missing Data Analysis

In [None]:
titanic.isnull().sum()
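
Raw counts are easier to judge as a share of all rows. Because `isnull()` returns booleans, taking the column-wise mean gives the fraction missing directly. A sketch on a made-up toy frame (the column names mirror the Titanic ones, the values are illustrative):

```python
import pandas as pd

# Toy frame (made-up values) with gaps in two of its three columns.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Mean of a boolean mask = fraction of True values, i.e. fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)

print((missing_share * 100).round(1))
```

Applied to the real data, the equivalent call would be `titanic.isnull().mean()`. Sorting puts the most incomplete columns first, which helps decide whether to impute, drop, or flag them.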


## 📝 Exercise (30–45 min)

Using the Titanic dataset, complete the following tasks:

1. Compute survival rates by **embark_town** and **class**.  
2. Plot the age distribution for survivors vs non-survivors using **histograms** or **density plots**.  
3. Detect outliers in **age** using the IQR method and comment on findings.  
4. Create a new feature `family_size = sibsp + parch + 1`. Analyze its relationship with survival.  
5. Use a **pivot table** to display survival rates by **sex** and **class**. What patterns do you see?  
6. (Optional, advanced) Perform a **chi-square test** of independence between survival and sex.  



## 🔄 Reflection

- What surprised you most about Titanic passenger survival patterns?  
- How do categorical vs numerical features require different EDA approaches?  
- How could EDA guide the next steps in modeling or prediction?  



## 📚 Additional Sources for Further EDA Work

- Wes McKinney, *Python for Data Analysis*  
- Jake VanderPlas, *Python Data Science Handbook*  
- YData Profiling: [https://ydata-profiling.ydata.ai](https://ydata-profiling.ydata.ai)  
- Seaborn Documentation: [https://seaborn.pydata.org](https://seaborn.pydata.org)  
