# Titanic Dataset — Exploratory Data Analysis (EDA)
Skillytixs Task 2 — EDA Notebook

This notebook performs a comprehensive EDA on the Titanic dataset (train.csv). It is written for beginners and includes step-by-step code and explanations.

Place `train.csv` in the same folder as this notebook before running. The loading cell reads only this local file and raises an error if it is missing.

## 1. Setup & Load Data
Install the required libraries if needed (pandas, numpy, matplotlib, seaborn), then load the dataset from the local file.

In [None]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', 200)
%matplotlib inline

In [None]:
# Load dataset: try local file first
import os
local_path = 'train.csv'
if os.path.exists(local_path):
    df = pd.read_csv(local_path)
else:
    # No download fallback is implemented here: place train.csv in the same
    # directory as this notebook, or point local_path at the file's location.
    raise FileNotFoundError('train.csv not found in notebook directory. Please place train.csv here before running.')

print('Dataset loaded. Shape:', df.shape)
df.head()
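
If a download fallback is wanted, the loading logic above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `load_titanic` and its `url` parameter are hypothetical names, and no URL is supplied here because the notebook does not name one. It relies on `pd.read_csv` accepting an http(s) URL in place of a file path.

```python
import os
import pandas as pd

def load_titanic(local_path='train.csv', url=None):
    """Read the CSV from local_path if it exists, otherwise from url (if given)."""
    if os.path.exists(local_path):
        return pd.read_csv(local_path)
    if url is not None:
        # pd.read_csv also accepts an http(s) URL; the caller must supply a valid one
        return pd.read_csv(url)
    raise FileNotFoundError(f'{local_path} not found and no url was provided.')
```

Calling `df = load_titanic()` behaves like the cell above; passing `url=...` only matters when the local file is absent.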

## 2. Quick Data Overview
Use `.info()`, `.describe()`, and `.isnull().sum()` to inspect column types, summary statistics, and missing values.

In [None]:
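df.info()

# Only the last expression in a cell is rendered automatically,
# so the summary table is displayed explicitly.
from IPython.display import display
display(df.describe(include='all'))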

print('Missing values per column:')
print(df.isnull().sum())
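
Raw counts are easier to judge next to percentages. The helper below is a small sketch (the name `missing_summary` is made up for this example) that keeps only columns with gaps and sorts them worst first. The `toy` frame holds invented rows purely to show the output shape; in the notebook you would call `missing_summary(df)`.

```python
import pandas as pd

def missing_summary(frame):
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'missing_pct': (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have missing values
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'Age': [22.0, None, 35.0, None],
    'Cabin': [None, None, None, 'C85'],
    'Fare': [7.25, 8.05, 53.1, 71.28],
})
print(missing_summary(toy))
```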

## 3. Data Cleaning (Suggested Steps)
- Standardize column names (trim whitespace, lowercase, replace spaces and hyphens with underscores)
- Convert data types (e.g., age to numeric)
- Handle missing values (Age, Cabin, Embarked)
- Remove or keep PassengerId as needed

In [None]:
# Standardize column names and basic cleaning

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-','_')
# Ensure age numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')
# Mark missing 'embarked' values with an explicit 'Unknown' category
df['embarked'] = df['embarked'].fillna('Unknown')

print(df.columns.tolist())
df.head()
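
The cell above lowercases names but does not split CamelCase words, so `PassengerId` becomes `passengerid`. If true snake_case is preferred, a regex-based helper is one option. This is a sketch under assumptions: `to_snake_case` is a hypothetical helper, and the sample names are assumed to match the columns of the standard Kaggle `train.csv`.

```python
import re

def to_snake_case(name):
    """Convert a CamelCase or spaced column name to snake_case."""
    name = name.strip()
    # Insert an underscore at each lower-to-upper boundary, e.g. 'SibSp' -> 'Sib_Sp'
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Collapse spaces and hyphens into underscores, then lowercase
    return re.sub(r'[\s\-]+', '_', name).lower()

# Assumed column names, used only to demonstrate the conversion
original = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Embarked']
print([to_snake_case(c) for c in original])
```

Applying it with `df.columns = [to_snake_case(c) for c in df.columns]` leaves `pclass`, `sex`, `age`, `fare`, `embarked`, and `survived` unchanged, so the later cells still run; only multi-word names such as `PassengerId` come out differently (`passenger_id`).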

## 4. Univariate Analysis
Plot distributions for numerical features and counts for categorical features.

In [None]:
# Histograms for numeric features
numeric_cols = ['age','fare']
for col in numeric_cols:
    plt.figure(figsize=(6,3))
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Boxplots to check outliers
for col in numeric_cols:
    plt.figure(figsize=(6,3))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

In [None]:
# Categorical counts
cat_cols = ['sex','pclass','embarked']
for c in cat_cols:
    plt.figure(figsize=(5,3))
    sns.countplot(data=df, x=c)
    plt.title(f'Countplot of {c}')
    plt.show()
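
Count plots show relative bar heights but not exact figures. A tabular companion can help; the sketch below uses a hypothetical helper, `category_shares`, that reports counts and percentage shares and keeps missing values as their own row. The `toy` frame is invented for illustration; in the notebook the call would be `category_shares(df, 'sex')` and so on for each entry in `cat_cols`.

```python
import pandas as pd

def category_shares(frame, column):
    """Return counts and percentage share for each category, including NaN."""
    counts = frame[column].value_counts(dropna=False)
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({'count': counts, 'share_pct': shares})

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({'sex': ['male', 'male', 'female', 'male']})
print(category_shares(toy, 'sex'))
```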

## 5. Bivariate Analysis
Investigate relationships between features and the target (`survived`).

In [None]:
# Survival rate by sex
print(pd.crosstab(df['sex'], df['survived'], normalize='index')*100)
plt.figure(figsize=(5,3))
sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Sex')
plt.show()

In [None]:
# Survival rate by Pclass
plt.figure(figsize=(5,3))
sns.barplot(x='pclass', y='survived', data=df)
plt.title('Survival Rate by Passenger Class')
plt.show()

In [None]:
# Age distribution split by survival
plt.figure(figsize=(6,3))
sns.kdeplot(df.loc[df['survived']==1,'age'].dropna(), label='Survived')
sns.kdeplot(df.loc[df['survived']==0,'age'].dropna(), label='Died')
plt.legend()
plt.title('Age distribution by Survival')
plt.show()
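
Overlapping density curves can be hard to compare by eye. One complementary approach is to bin age and compute the survival rate per bin. The sketch below assumes the lowercase `age` and `survived` columns from the cleaning step; the helper name `survival_by_age_band` and the cut points are arbitrary choices for illustration, not values taken from the notebook.

```python
import pandas as pd

def survival_by_age_band(frame, age_col='age', target_col='survived'):
    """Return survival rate (mean) and group size (count) per age band."""
    bins = [0, 12, 18, 40, 60, 120]  # assumed cut points, illustration only
    labels = ['child', 'teen', 'adult', 'middle_aged', 'senior']
    # right=False makes each band include its lower edge: [0, 12), [12, 18), ...
    bands = pd.cut(frame[age_col], bins=bins, labels=labels, right=False)
    # Rows with missing age fall into no band and are left out of the result
    return frame.groupby(bands, observed=False)[target_col].agg(['mean', 'count'])

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'age': [5, 30, 30, 70, None],
    'survived': [1, 0, 1, 0, 1],
})
print(survival_by_age_band(toy))
```

The `count` column matters as much as the rate: a band with only a handful of passengers gives an unreliable percentage.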

## 6. Correlation & Heatmap
Compute pairwise Pearson correlations between the numeric columns and show them as a heatmap.

In [None]:
# Correlation matrix for numeric columns (passengerid is an identifier, so it is excluded)
num = df.select_dtypes(include=[np.number]).drop(columns=['passengerid'], errors='ignore')
plt.figure(figsize=(8,6))
sns.heatmap(num.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()
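
When the target is the main interest, it can be quicker to read one sorted column than the full matrix. The sketch below uses a hypothetical helper, `correlations_with_target`, that ranks numeric features by the absolute value of their Pearson correlation with `survived`. The toy columns `x_up` and `x_down` are invented; in the notebook the call would be `correlations_with_target(df)`.

```python
import numpy as np
import pandas as pd

def correlations_with_target(frame, target='survived'):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    num = frame.select_dtypes(include=[np.number])
    corr = num.corr()[target].drop(target)
    # Sort by magnitude so strong negative correlations are not pushed to the bottom
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'survived': [0, 1, 0, 1],
    'x_up': [1, 2, 1, 2],
    'x_down': [4, 3, 2, 1],
})
print(correlations_with_target(toy))
```

A correlation here describes a linear association only; it says nothing about cause, and it can miss non-linear patterns such as an effect limited to young children.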

## 7. Grouping & Business Insights
Examples: survival rate by combination of features

In [None]:
# Group by multiple columns
insights = df.groupby(['pclass','sex'])['survived'].mean().reset_index()
insights['survived_pct'] = insights['survived']*100
insights = insights.sort_values('survived_pct', ascending=False)
# The table lists every class/sex group, sorted from highest to lowest survival rate
print('Survival rate by class and sex (highest first):')
insights
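
The same grouping can be laid out as a two-way table, with group sizes alongside so that rates based on few passengers stand out. This is a sketch assuming the lowercase `pclass`, `sex`, and `survived` columns; `survival_pivot` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd

def survival_pivot(frame):
    """Return (survival rate in %, group size) as class-by-sex tables."""
    rate = frame.pivot_table(index='pclass', columns='sex',
                             values='survived', aggfunc='mean') * 100
    size = frame.pivot_table(index='pclass', columns='sex',
                             values='survived', aggfunc='count')
    return rate.round(1), size

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'pclass': [1, 1, 3, 3, 3],
    'sex': ['female', 'male', 'female', 'male', 'male'],
    'survived': [1, 0, 1, 0, 1],
})
rate, size = survival_pivot(toy)
print(rate)
print(size)
```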

## 8. Dealing with Missing Values
Common approaches and code examples.

In [None]:
# Imputing age with median grouped by pclass and sex
median_age = df.groupby(['pclass','sex'])['age'].transform('median')
df['age_imputed'] = df['age'].fillna(median_age)
print('Missing age before:', df['age'].isnull().sum())
print('Missing age after (age_imputed):', df['age_imputed'].isnull().sum())
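
The cleaning checklist also names Cabin, which the cell above does not touch. When a column is mostly empty, one common option is to keep only a presence flag instead of imputing values. The sketch below shows that idea, plus a mode-based fill as an alternative to the 'Unknown' label for a categorical column. Both helper names (`add_cabin_flag`, `fill_with_mode`) and the `has_cabin` column are assumptions for this example, and the toy rows are invented.

```python
import pandas as pd

def add_cabin_flag(frame, cabin_col='cabin'):
    """Return a copy with has_cabin = 1 where a cabin value is recorded, else 0."""
    out = frame.copy()
    out['has_cabin'] = out[cabin_col].notna().astype(int)
    return out

def fill_with_mode(series):
    """Fill missing entries with the most frequent value, if one exists."""
    mode = series.mode(dropna=True)
    return series.fillna(mode.iloc[0]) if not mode.empty else series

# Toy rows for illustration only (not real Titanic records)
toy = pd.DataFrame({
    'cabin': [None, 'C85', None],
    'embarked': ['S', 'S', None],
})
flagged = add_cabin_flag(toy)
print(flagged['has_cabin'].tolist())
print(fill_with_mode(toy['embarked']).tolist())
```

Whether a flag or an imputed value is more useful depends on the later modeling step, so treat this as one option to compare, not a recommendation.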

## 9. Outlier Detection and Skewness
Use `skew()` to measure asymmetry, and consider a log transform (`np.log1p`) for right-skewed features like `fare`.

In [None]:
# Skewness of fare
print('Fare skewness:', df['fare'].skew())
plt.figure(figsize=(6,3))
sns.histplot(np.log1p(df['fare'].dropna()), kde=True)
plt.title('Log-transformed fare distribution')
plt.show()
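
The cell above covers skewness; for the outlier half of this section, the rule that boxplot whiskers are conventionally drawn with (1.5 times the interquartile range) can be applied directly to flag rows. This is a minimal sketch: `iqr_outliers` is a hypothetical helper, `k=1.5` is the conventional default and not a value from the notebook, and the toy numbers are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    return (series < lower) | (series > upper)

# Toy values for illustration only (not real Titanic fares)
toy_fare = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 500.0])
mask = iqr_outliers(toy_fare)
print(toy_fare[mask])
```

In the notebook, `df[iqr_outliers(df['fare'])]` would list the flagged rows. Flagged values are candidates for inspection, not automatic deletions: very high fares may be genuine.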

## 10. Final Summary & Next Steps
- Key findings (to be confirmed by running the notebook):
  - Females had higher survival rates than males.
  - First-class passengers had better survival rates.
  - Age and fare show relationships with survival.
- Next: Feature engineering, modeling, cross-validation, and evaluation.

---

**Save cleaned dataset**

In [None]:
# Save cleaned dataset
out_name = 'titanic_train_cleaned.csv'
df.to_csv(out_name, index=False)
print('Saved cleaned dataset as', out_name)