### Author : __AbdulRehman__
Created on : 16-07-2025

# Data Transformation with Pandas: A Beginner's Guide (Titanic Dataset)

This Jupyter Notebook demonstrates essential data transformation techniques using the Pandas library in Python, applied to the well-known Titanic dataset.

## 1. Loading and Understanding Your Data

First, we load the Titanic dataset using Seaborn. It's crucial to inspect the data's structure, identify missing values, and understand data types before any transformations.

In [None]:
import seaborn as sns
import pandas as pd

In [None]:
df = sns.load_dataset('titanic')

In [None]:
print("### First 5 rows of the dataset:")
print(df.head())

In [None]:
print("\n### Dataset Information (Data Types and Non-Null Counts):")
df.info()  # info() prints its report directly and returns None, so no print() is needed

In [None]:
print("\n### Descriptive Statistics:")
print(df.describe(include='all'))
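
As a complement to the checks above, a per-column missing-value summary makes the gaps easy to rank. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not rows from the Titanic data) so it runs without downloading anything; the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; the values are made up.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck": [None, "C", None, None],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

Columns near the top of this table are the ones to fill or drop first.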

## 2. Handling Missing Values

Missing data is a common challenge. We'll demonstrate strategies for numerical and categorical columns.

### Handling Missing 'Age' Values

For numerical columns like 'age', filling missing values with the median is a robust strategy as it's less sensitive to outliers than the mean.

In [None]:
print("\n### Missing values in 'age' before filling:", df['age'].isnull().sum())
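# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['age'] = df['age'].fillna(df['age'].median())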
print("### Missing values in 'age' after filling:", df['age'].isnull().sum())
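
To see why the median is the safer fill value, here is a minimal sketch on a handful of made-up ages (the `ages` Series is hypothetical, not taken from the dataset). One extreme value is enough to pull the mean well away from the typical passenger, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Made-up ages with one extreme value (80) and one missing entry.
ages = pd.Series([22.0, 25.0, 30.0, 28.0, 80.0, np.nan])

mean_age = ages.mean()      # NaN is skipped; the outlier pulls this upward
median_age = ages.median()  # middle value; barely affected by the outlier

# fillna returns a new Series, so `ages` itself is left unchanged.
filled = ages.fillna(median_age)

print("mean:", mean_age, "median:", median_age)
print(filled.tolist())
```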

### Handling Missing 'Embarked' and 'Deck' Values

For categorical columns like 'embarked', filling with the mode (most frequent value) is often appropriate. Columns with a very high percentage of missing values, like 'deck', are often best dropped.

In [None]:
print("\n### Missing values in 'embarked' before filling:", df['embarked'].isnull().sum())
# mode() returns a Series (ties are possible), so take its first entry
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
print("### Missing values in 'embarked' after filling:", df['embarked'].isnull().sum())

In [None]:
print("\n### Missing values in 'deck' before dropping:", df['deck'].isnull().sum())
df = df.drop(columns='deck')
print("### 'deck' column dropped. Current columns:", df.columns.tolist())
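
The fill-or-drop decision can also be made from the data rather than by eye. The following is a rough sketch on a made-up frame; the `max_missing` cutoff of 0.5 is an assumption for illustration, not a rule from this notebook.

```python
import pandas as pd

# Made-up frame: 'deck' is mostly missing, 'embarked' has a single gap.
toy = pd.DataFrame({
    "embarked": ["S", "S", "C", None, "Q"],
    "deck": [None, None, "B", None, None],
})

# Fill the categorical gap with the most frequent value.
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

# Drop every column whose share of missing values exceeds the cutoff.
max_missing = 0.5  # hypothetical threshold; tune it for your data
missing_share = toy.isna().mean()
toy = toy.drop(columns=missing_share[missing_share > max_missing].index)

print(toy)
```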

## 3. Creating New Features (Feature Engineering)

Feature engineering involves creating new, more informative features from existing ones. Well-chosen features can improve model performance because they state explicitly what the raw columns only imply.

### Creating 'Family Size'

We'll create a 'family_size' feature by summing 'sibsp' (siblings/spouses) and 'parch' (parents/children) and adding 1 for the individual themselves.

In [None]:
df['family_size'] = df['sibsp'] + df['parch'] + 1
print("\n### 'family_size' feature created:")
print(df[['sibsp', 'parch', 'family_size']].head())

### Creating 'Is Alone'

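Building on 'family_size', we can derive a binary 'is_alone' feature, indicating whether a passenger was traveling without family. (Seaborn's copy of the dataset already includes a boolean 'alone' column; we derive our own here to show the technique.)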

In [None]:
df['is_alone'] = (df['family_size'] == 1).astype(int)
print("\n### 'is_alone' feature created:")
print(df[['family_size', 'is_alone']].head())
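
If the same two features are needed on more than one DataFrame (for example a separate test split), the steps above can be wrapped in a small helper. This is only a sketch: `add_family_features` is a hypothetical name, and the passengers in `toy` are made up.

```python
import pandas as pd

def add_family_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with 'family_size' and 'is_alone' added."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    out["family_size"] = out["sibsp"] + out["parch"] + 1  # +1 for the passenger
    out["is_alone"] = out["family_size"].eq(1).astype(int)
    return out

# Made-up passengers: one with a spouse, one solo, one with a spouse and two children.
toy = pd.DataFrame({"sibsp": [1, 0, 1], "parch": [0, 0, 2]})
features = add_family_features(toy)
print(features)
```

Returning a copy keeps the original frame unchanged, which makes the cell safe to re-run.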

## 4. Encoding Categorical Variables

Most machine learning models require numerical input, so categorical variables must be converted to numbers first.

### One-Hot Encoding 'Sex' and 'Embarked'

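One-hot encoding is a common technique where each category becomes a new binary (0 or 1) column. We use `drop_first=True` to drop one column per variable: that column is fully determined by the others, and keeping it causes perfect multicollinearity (the "dummy variable trap") in linear models.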

In [None]:
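# dtype=int yields 0/1 columns; recent pandas versions default to True/False
df = pd.get_dummies(df, columns=['sex', 'embarked'], drop_first=True, dtype=int)
print("\n### After One-Hot Encoding 'sex' and 'embarked':")
print(df[['sex_male', 'embarked_Q', 'embarked_S']].head())
df.info()  # info() prints directly and returns None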
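
To make the effect of `drop_first=True` concrete, the sketch below encodes a tiny made-up frame both ways and compares the resulting columns. The `toy` rows are illustrative only.

```python
import pandas as pd

# Made-up rows covering every category of both columns.
toy = pd.DataFrame({
    "age": [22, 38, 26],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# Full encoding: one column per category.
full = pd.get_dummies(toy, columns=["sex", "embarked"], dtype=int)

# Reduced encoding: the first category of each variable is dropped.
reduced = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the full encoding the columns of each variable always sum to 1 per row, so the dropped category ('female', 'C') is still recoverable as the row where the remaining columns are all 0.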

## 5. Data Type Conversion

Converting data types can optimize memory usage or prepare data for specific operations.

### Converting 'Fare' to Integer

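If high precision isn't required, converting a float column like 'fare' to an integer can simplify analysis. Two caveats: `astype(int)` truncates the fractional part rather than rounding, and memory is saved only if the integer type is narrower than the original 64-bit float (for example `int32`).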

In [None]:
# astype(int) truncates toward zero, e.g. 7.925 becomes 7
df['fare'] = df['fare'].astype(int)
print("\n### 'fare' column after converting to integer:")
print(df['fare'].head())
print(df['fare'].dtype)
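
The difference between truncating and rounding, and the effect of the integer width on memory, can be checked on a few values. A minimal sketch, assuming a short illustrative `fares` Series rather than the full column:

```python
import pandas as pd

# Illustrative fares (stand-ins for the real column).
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

truncated = fares.astype(int)            # drops the fractional part
rounded = fares.round().astype("int32")  # rounds first, then stores 4-byte integers

print(truncated.tolist())
print(rounded.tolist())

# Bytes used by the values only (index excluded).
float_bytes = fares.memory_usage(index=False)
int32_bytes = rounded.memory_usage(index=False)
print(float_bytes, int32_bytes)
```

Rounding before the cast keeps values such as 7.925 closer to the original, and choosing `int32` explicitly is what actually halves the storage.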

In [None]:
print("\n### Final DataFrame Info after all transformations:")
df.info()  # info() prints directly and returns None