In [22]:
import pandas as pd

In [23]:
file_dir = "../data/raw/raw-titanic-dataset.csv"

df = pd.read_csv(file_dir)

In [24]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [25]:
print("The dataset columns are:", end="\n")
df.columns

The dataset columns are:


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')

## This dataset contains 891 passengers and 12 features 

In [26]:
print(f"This is the shape of the Dataframe: {df.shape}")

This is the shape of the Dataframe: (891, 12)


In [27]:
print("List of dataset data types: ", end="\n")
df.dtypes

List of dataset data types: 


PassengerId      int64
Survived         int64
Pclass           int64
Name               str
Sex                str
Age            float64
SibSp            int64
Parch            int64
Ticket             str
Fare           float64
Cabin              str
Embarked           str
dtype: object

In [28]:
missing_values = df.isnull().sum()
print(missing_values)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


## Some columns contain null values:
- Age: 177 rows
- Cabin: 687 rows
- Embarked: 2 rows
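
To put these counts in proportion, the short sketch below converts them into percentages of the 891 rows. The numbers are copied from the outputs above; the variable names are only illustrative.

```python
# Null counts and row total copied from the outputs above.
total_rows = 891
null_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Share of rows with a missing value, as a percentage rounded to one decimal.
null_share = {col: round(100 * n / total_rows, 1) for col, n in null_counts.items()}

for col, pct in null_share.items():
    print(f"{col}: {pct}%")
```

Roughly a fifth of `Age` and about three quarters of `Cabin` are missing, while `Embarked` is missing in only 2 rows.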

### Next steps:
1. Replace the missing values in the `Age` column with the median age
2. Replace the missing values in the `Embarked` column with the mode (the most frequent value)
3. Drop the `Cabin` & `Ticket` columns
4. Verify that the dataset has no more `NaN` values & that the columns are correct
5. Save the cleaned dataset

### 1️⃣ Filling Missing Values in Age Using Median
The Age column contains missing values that must be handled before performing analysis.

Since Age is a numerical variable, we use the median to fill missing values. The median is preferred over the mean because it is less affected by outliers and skewed distributions.

By using the median:
- We maintain a realistic central value.
- We reduce the impact of extreme age values.
- We keep the center of the Age distribution unchanged, although filling many rows with a single value reduces its spread.

After filling the missing values, we verify the results by checking the remaining null values in the dataset.
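
As a small illustration of why the median is less sensitive to outliers than the mean, here is a sketch on a made-up list of ages (not taken from the Titanic data) that uses only the standard library. The last lines mimic what `fillna` does for the `Age` column.

```python
from statistics import mean, median

# Hypothetical ages, not taken from the Titanic data; 80 acts as an outlier.
ages = [22, 24, 26, 28, 30, 80]

print(mean(ages))    # pulled upward by the single large value
print(median(ages))  # stays close to the bulk of the values

# Fill a gap (None) with the median of the known values.
ages_with_gap = [22, None, 26, 28, 30, 80]
known = [a for a in ages_with_gap if a is not None]
fill_value = median(known)
filled = [fill_value if a is None else a for a in ages_with_gap]
print(filled)
```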

In [42]:
median_age = df['Age'].median()
print(median_age)  # median of the non-null ages
df.fillna({'Age': median_age}, inplace=True)

28.0


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [30]:
missing_values = df.isnull().sum()
print(missing_values)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### 2️⃣ Filling Missing Values in Embarked Using Mode
The Embarked column contains missing values that need to be handled before further analysis.

Since Embarked is a categorical variable, it is appropriate to fill missing values using the mode, which represents the most frequently occurring category in the dataset.

Using the mode ensures that:
- We assign the most common embarkation port, which barely changes the category proportions because only 2 rows are affected.
- We avoid introducing a value that does not already occur in the column.
- The dataset remains consistent for categorical analysis.

After applying the imputation, we verify that the missing values have been successfully replaced by checking the null value count again.

In [43]:
# mode() returns a Series, so take its first entry
frq_embarked = df['Embarked'].mode()[0]
df.fillna({'Embarked': frq_embarked}, inplace=True)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


As you can see, we index the result of `mode()` with `[0]` to get the value. This is because `mode()` does not return a scalar but a `Series` (several values can be tied for most frequent), so we take its first entry.
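
The behaviour described above can be checked on a small made-up column. This is only a sketch: the `ports` values are hypothetical and deliberately contain a tie, so that `mode()` returns more than one entry.

```python
import pandas as pd

# Hypothetical toy column with a tie between "S" and "C" and one missing value.
ports = pd.Series(["S", "C", "S", "C", "Q", None])

modes = ports.mode()      # Series holding every tied mode, in sorted order
most_frequent = modes[0]  # first entry of that Series
filled = ports.fillna(most_frequent)

print(list(modes))
print(most_frequent)
print(filled.isna().sum())
```

When there is a tie, `[0]` simply picks the first of the sorted modes, so it is worth looking at the full result of `mode()` before relying on it.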

In [44]:
missing_values = df.isnull().sum()
print(missing_values)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


### 3️⃣ Drop Unused Columns
In data preprocessing, it is important to remove irrelevant or low-value columns to simplify analysis and reduce noise.

For this dataset, we drop the following columns:
- Cabin
- Ticket

#### 🔎 Why Drop These Columns?
- **Cabin**
  - Contains a large number of missing values (687 null entries).
  - Highly sparse and difficult to impute reliably.
  - High cardinality with limited predictive value at this stage.
- **Ticket**
  - Mostly unique values.
  - High-cardinality feature.
  - Difficult to extract meaningful patterns without additional feature engineering.
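
The two criteria above, sparsity and cardinality, can be measured per column. The sketch below does this on a hypothetical five-row frame (not the real data); the column names `missing_ratio` and `unique_ratio` are our own labels.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few Titanic columns.
toy = pd.DataFrame({
    "Cabin": ["C85", None, None, "C123", None],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450", "211536"],
    "Sex": ["male", "female", "female", "female", "male"],
})

summary = pd.DataFrame({
    # Fraction of rows that are null in each column.
    "missing_ratio": toy.isnull().mean(),
    # Number of distinct non-null values divided by the number of rows.
    "unique_ratio": toy.nunique() / len(toy),
})
print(summary)
```

A column with a high `missing_ratio` (like `Cabin` here) or a `unique_ratio` close to 1 (like `Ticket` here) is a candidate for dropping or for later feature engineering.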


In [45]:
df.drop(columns=['Cabin', 'Ticket'], inplace=True)
missing_values = df.isnull().sum()
print(missing_values)

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64


### 4️⃣ Verify Columns and Data Integrity
After cleaning and transformation, we must verify that:
- All required columns exist.
- No missing values remain.
- Data types are correct.
- The dataset structure is ready for modeling.
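
Besides inspecting the outputs by eye, the first two checks can be automated. The helper below is a minimal sketch: `find_problems` is a hypothetical name, not part of pandas, and it is demonstrated on made-up frames rather than on `df`. It only covers missing columns and remaining nulls, not data types.

```python
import pandas as pd


def find_problems(frame, expected_columns):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []

    # Check that every expected column is present.
    absent = [col for col in expected_columns if col not in frame.columns]
    if absent:
        problems.append(f"missing columns: {absent}")

    # Check that no column still contains null values.
    null_counts = frame.isnull().sum()
    with_nulls = null_counts[null_counts > 0]
    if not with_nulls.empty:
        problems.append(f"columns with nulls: {list(with_nulls.index)}")

    return problems


# Hypothetical toy frames, not the real Titanic data.
clean = pd.DataFrame({"Age": [22.0, 28.0], "Embarked": ["S", "C"]})
dirty = pd.DataFrame({"Age": [22.0, None], "Embarked": ["S", "C"]})

print(find_problems(clean, ["Age", "Embarked"]))
print(find_problems(dirty, ["Age", "Embarked", "Fare"]))
```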

In [50]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked'],
      dtype='str')

In [51]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [52]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    str    
 4   Sex          891 non-null    str    
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Fare         891 non-null    float64
 9   Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(3)
memory usage: 69.7 KB


### 5️⃣ Save the Cleaned Dataset
The final step is to export the cleaned dataset into a new file for future use.

We save the dataset in CSV format.

In [48]:
df.to_csv("../data/processed/cleaned-titanic-dataset.csv", index=False, encoding='utf-8')
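
The `index=False` argument keeps the row labels out of the file. The sketch below shows the difference on a hypothetical two-row frame, writing to in-memory buffers instead of a real path so that nothing is created on disk.

```python
import io

import pandas as pd

# Hypothetical toy frame, not the real Titanic data.
toy = pd.DataFrame({"PassengerId": [1, 2], "Embarked": ["S", "C"]})

with_index = io.StringIO()
toy.to_csv(with_index)  # default index=True also writes the row labels

without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Rewind the buffers and read both versions back.
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # contains an extra unnamed column for the old index
print(cols_without)  # only the original columns
```

Without `index=False`, reading the file back produces an extra `Unnamed: 0` column that would have to be dropped again.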