# Data Cleaning Notebook
Contributors:

**Egemen Alkan**: Designed and implemented the data cleaning steps, including removing irrelevant columns, handling missing values, and addressing outliers.

Objective:

Prepare the Titanic dataset for analysis by cleaning and preprocessing the data. This includes renaming columns, dropping irrelevant columns, handling missing values, addressing outliers, and saving the cleaned dataset for use in machine learning models.

Overview of Tasks:

- Rename Columns: Standardize column names to make them easier to use programmatically.
- Drop Irrelevant Columns: Remove unnecessary columns such as name, ticket, and cabin that are not useful for predictive modeling.
- Handle Missing Values: Fill missing values in key columns (age, embarked) and remove rows with missing fare values.
- Handle Outliers: Identify and filter out extreme values in the age column using the IQR (Interquartile Range) method.
- Save the Cleaned Dataset: Export the cleaned dataset to a CSV file for use in future analysis or machine learning tasks.

# Kaggle Dataset Link:
**https://www.kaggle.com/datasets/yasserh/titanic-dataset**

## 1. Load the Titanic Dataset

- Objective: Load the Titanic dataset from a CSV file into a pandas DataFrame named titanic_df.
- Why: The raw dataset needs to be cleaned and prepared for analysis or machine learning.

In [None]:
import pandas as pd
titanic_df = pd.read_csv('../data/titanic.csv')
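
A quick audit right after loading can show which columns need cleaning. The sketch below is only an illustration: it reads a tiny in-memory CSV whose rows and headers are invented stand-ins for the real file, then reports the shape and the number of missing values per column.

```python
import io

import pandas as pd

# Invented stand-in for ../data/titanic.csv (not real passenger data)
csv_text = """PassengerId,Age,Fare,Embarked
1,22,7.25,S
2,,71.28,C
3,26,,
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns that were loaded
n_rows, n_cols = demo_df.shape

# Count of missing values in each column
missing_counts = demo_df.isna().sum()

print(n_rows, n_cols)
print(missing_counts)
```

Running the same two checks on titanic_df would give a first picture of which columns the later steps have to address.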

## 2. Rename Columns

- Objective: Standardize column names by:
    - Converting all column names to lowercase.
    - Replacing spaces with underscores (_).
- Why: Consistent lowercase names without spaces are easier to type and less error-prone in code, and they allow attribute-style access such as titanic_df.age.

In [None]:
titanic_df.columns = titanic_df.columns.str.lower().str.replace(' ', '_')
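
As a small illustration of the renaming step, the sketch below applies the same expression to an empty toy DataFrame. The column names are made up for the example, and `regex=False` is added here only to make the literal-space replacement explicit.

```python
import pandas as pd

# Toy frame with hypothetical column names (not the real Titanic headers)
demo_df = pd.DataFrame(columns=['PassengerId', 'Passenger Class', 'Fare'])

# Same transformation as the cell above: lowercase, then spaces -> underscores
demo_df.columns = demo_df.columns.str.lower().str.replace(' ', '_', regex=False)

renamed = list(demo_df.columns)
print(renamed)
```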

## 3. Drop Irrelevant Columns

- Objective: Remove columns that are deemed irrelevant for analysis or modeling:
    - name: Not typically useful for prediction.
    - ticket: Values are mostly unique identifiers, so the raw column contributes little useful information.
    - cabin: Has many missing values and may not be predictive.
- Why: Reduces noise and keeps the dataset focused on relevant features.

In [None]:
titanic_df.drop(columns=['name', 'ticket', 'cabin'], inplace=True)
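
One way to check the reasoning behind dropping a column is to look at its share of missing values and its share of distinct values. The sketch below does this on an invented toy frame; the `profile` table and its column labels are hypothetical names chosen for this example, not part of the notebook.

```python
import pandas as pd

# Toy data; all values are invented for illustration
demo_df = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'name': ['A', 'B', 'C', 'D'],
    'ticket': ['T1', 'T2', 'T3', 'T3'],
    'cabin': [None, 'C85', None, None],
})

# Per-column share of missing values and share of distinct non-missing values
profile = pd.DataFrame({
    'missing_frac': demo_df.isna().mean(),
    'unique_frac': demo_df.nunique() / len(demo_df),
})
print(profile)

# Drop the same three columns, returning a new frame instead of mutating in place
trimmed_df = demo_df.drop(columns=['name', 'ticket', 'cabin'])
```

A column with a high `unique_frac` (such as name) behaves like an identifier, and a column with a high `missing_frac` (such as cabin) leaves little data to learn from.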

## 4. Handle Missing Values

- Objective: Address missing data for key columns:
    - age: Missing values are replaced with the median age, which is less sensitive to outliers than the mean.
    - embarked: Missing values are filled with the mode (most frequent value), assuming it is representative of most passengers.
    - fare: Rows with missing values in the fare column are dropped because fare is crucial for analysis and cannot easily be inferred.
- Why: Missing data can cause issues in analysis or modeling, so these steps leave no missing values in these three columns.

In [None]:
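# Fill missing 'age' with the median and missing 'embarked' with the mode.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which is deprecated in recent pandas versions.
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])

# Drop rows with missing 'fare'
titanic_df = titanic_df.dropna(subset=['fare'])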
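
The three rules above can be traced by hand on a toy frame. The sketch below uses invented values and assigns each filled column back to the frame, which is one way to write the fill without relying on in-place modification of a column selection.

```python
import numpy as np
import pandas as pd

# Toy data; all values are invented for illustration
demo_df = pd.DataFrame({
    'age': [22.0, np.nan, 30.0, 80.0],
    'embarked': ['S', 'C', None, 'S'],
    'fare': [7.25, 71.28, np.nan, 8.05],
})

# Median of the observed ages (22, 30, 80) is 30, so the gap becomes 30
demo_df['age'] = demo_df['age'].fillna(demo_df['age'].median())

# 'S' is the most frequent observed port, so the gap becomes 'S'
demo_df['embarked'] = demo_df['embarked'].fillna(demo_df['embarked'].mode()[0])

# The row without a fare is removed entirely
demo_df = demo_df.dropna(subset=['fare'])

print(demo_df)
```

Note that the extreme age of 80 does not pull the fill value upward, whereas the mean of the observed ages (44) would have been affected by it.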

## 5. Handle Outliers in the age Column

- Objective: Remove extreme outliers in the age column using the IQR (Interquartile Range) method:
    - Calculate Q1 (25th percentile) and Q3 (75th percentile).
    - Compute the IQR as Q3 - Q1.
    - Define acceptable values for age:
        - Lower bound: Q1 - 1.5 * IQR.
        - Upper bound: Q3 + 1.5 * IQR.
    - Keep only rows where age falls within these bounds.
- Why: Outliers can distort analysis and models, so removing them makes the age distribution more representative of the majority of passengers.

In [None]:
# Calculate the interquartile range (IQR)
Q1 = titanic_df['age'].quantile(0.25)
Q3 = titanic_df['age'].quantile(0.75)
IQR = Q3 - Q1

# Determine lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out rows where 'age' is outside these bounds
titanic_df = titanic_df[(titanic_df['age'] >= lower_bound) & (titanic_df['age'] <= upper_bound)]
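
To make the IQR rule concrete, the sketch below wraps the bound calculation in a small helper and applies it to a made-up series of ages. The function name `iqr_bounds` and its parameter `k` are hypothetical, introduced only for this example.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds as Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented ages: eight ordinary values and one extreme value
ages = pd.Series([20, 22, 24, 26, 28, 30, 32, 34, 90], dtype=float)

# Q1 = 24 and Q3 = 32, so IQR = 8 and the bounds are 12 and 44
lower, upper = iqr_bounds(ages)

# Keep only values inside the bounds (inclusive); 90 falls outside
kept = ages[(ages >= lower) & (ages <= upper)]

print(lower, upper, len(kept))
```

A larger `k` widens the bounds and removes fewer rows; 1.5 is the conventional choice used in the cell above.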

## 6. Save the Cleaned Dataset

- Objective: Save the cleaned and preprocessed dataset to a new CSV file named titanic_cleaned.csv in the ../data/ directory.
- Why: The cleaned dataset can be used for further analysis, visualization, or machine learning without repeating the cleaning process.

In [None]:
titanic_df.to_csv('../data/titanic_cleaned.csv', index=False)
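
A simple round-trip check can confirm that the saved file reads back as expected. The sketch below writes a toy frame to a temporary directory rather than to ../data/, so the path and file name are placeholders for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned dataset; values are invented
demo_df = pd.DataFrame({'age': [22.0, 30.0], 'embarked': ['S', 'C']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'demo_cleaned.csv'

    # index=False keeps the row index out of the file
    demo_df.to_csv(out_path, index=False)

    # Read the file back before the temporary directory is removed
    reloaded_df = pd.read_csv(out_path)

# The reloaded frame should have the same columns and shape as the original
same_columns = list(reloaded_df.columns) == list(demo_df.columns)
same_shape = reloaded_df.shape == demo_df.shape
print(same_columns, same_shape)
```

Without `index=False`, the reloaded file would contain an extra unnamed column holding the old row index.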