This notebook demonstrates how to preprocess data for machine learning tasks. 

We'll cover the following steps:
1. Data Cleaning
2. Feature Scaling
3. Encoding Categorical Features
4. Data Splitting

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import seaborn as sns


In [7]:
df = sns.load_dataset('titanic')

df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Data Cleaning

First, we'll check for any missing values and handle them if necessary.


In [8]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [9]:
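# 'survived' is the target; drop any row where it is missing (none in this dataset)
df = df.dropna(subset=['survived']).copy()

# fill missing values in the feature columns:
# median for numeric columns, most frequent value for categorical ones
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['fare'] = df['fare'].fillna(df['fare'].median())  # no-op here: 'fare' has no missing values
# 'deck' and 'embark_town' still contain missing values, but neither is used as a feature below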

df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
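The imputation step can be seen more clearly on a small frame where the gaps are visible. The sketch below uses made-up values (the `toy` frame is hypothetical, not part of the Titanic data) and applies the same median and mode fills as the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
age_median = toy["age"].median()
toy["age"] = toy["age"].fillna(age_median)

# Categorical column: mode() returns a Series (ties are possible), so take its first entry.
embarked_mode = toy["embarked"].mode()[0]
toy["embarked"] = toy["embarked"].fillna(embarked_mode)

print(toy)
```

Both `median()` and `mode()` skip missing values by default, so the fill values are computed from the observed entries only.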


## Feature Scaling

We'll use `StandardScaler` to scale the numerical feature columns, standardizing each one to a mean of 0 and a standard deviation of 1. The same `ColumnTransformer` also one-hot encodes the categorical columns, so both steps run in a single call.
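As a quick check of what standardization does, the sketch below compares `StandardScaler` with a manual z-score, z = (x - mean) / std, on a handful of made-up fare-like values. The array `values` is hypothetical and only serves the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fare-like values, shaped (n_samples, 1) as scikit-learn expects.
values = np.array([[7.25], [71.28], [7.93], [53.10], [8.05]])

# Manual z-score: subtract the column mean and divide by the population
# standard deviation (ddof=0), which is the estimator StandardScaler uses.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a manual check done with pandas will differ slightly unless `ddof=0` is passed.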


In [10]:
# define features and target variable
features = ['pclass', 'age', 'sibsp', 'parch', 'fare', 'embarked']
target = 'survived'

# separate features and target variable
X = df[features]
y = df[target]

# define numerical and categorical columns
numerical_features = ['age', 'sibsp', 'parch', 'fare']
categorical_features = ['pclass', 'embarked']

# ColumnTransformer: scale the numerical features and one-hot encode the categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

X_processed = preprocessor.fit_transform(X)

# keep X's index so the rows stay aligned with y
X_processed_df = pd.DataFrame(X_processed, columns=preprocessor.get_feature_names_out(), index=X.index)
X_processed_df.head()


,num__age,num__sibsp,num__parch,num__fare,cat__pclass_1,cat__pclass_2,cat__pclass_3,cat__embarked_C,cat__embarked_Q,cat__embarked_S
0,-0.565736,0.432793,-0.473674,-0.502445,0.0,0.0,1.0,0.0,0.0,1.0
1,0.663861,0.432793,-0.473674,0.786845,1.0,0.0,0.0,1.0,0.0,0.0
2,-0.258337,-0.474545,-0.473674,-0.488854,0.0,0.0,1.0,0.0,0.0,1.0
3,0.433312,0.432793,-0.473674,0.42073,1.0,0.0,0.0,0.0,0.0,1.0
4,0.433312,-0.474545,-0.473674,-0.486337,0.0,0.0,1.0,0.0,0.0,1.0


## Encoding Categorical Features

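The `ColumnTransformer` above already one-hot encoded the categorical features `pclass` and `embarked` with `OneHotEncoder`: each category becomes its own 0/1 column, such as `cat__pclass_3` or `cat__embarked_S`.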


In [12]:
# We have already applied OneHotEncoder in the preprocessing step above.
# We'll check the resulting columns in X_processed_df to see the encoded categorical features.
X_processed_df.head()

,num__age,num__sibsp,num__parch,num__fare,cat__pclass_1,cat__pclass_2,cat__pclass_3,cat__embarked_C,cat__embarked_Q,cat__embarked_S
0,-0.565736,0.432793,-0.473674,-0.502445,0.0,0.0,1.0,0.0,0.0,1.0
1,0.663861,0.432793,-0.473674,0.786845,1.0,0.0,0.0,1.0,0.0,0.0
2,-0.258337,-0.474545,-0.473674,-0.488854,0.0,0.0,1.0,0.0,0.0,1.0
3,0.433312,0.432793,-0.473674,0.42073,1.0,0.0,0.0,0.0,0.0,1.0
4,0.433312,-0.474545,-0.473674,-0.486337,0.0,0.0,1.0,0.0,0.0,1.0


## Data Splitting

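Finally, we split the data into training (70%) and testing (30%) sets, fixing `random_state` so the split is reproducible. Note that the imputation values and the scaler were computed on the full dataset before this split, so statistics from the test rows leak into the training features; for an unbiased evaluation, split first and fit the preprocessing on the training set only.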


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_processed_df, y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((623, 10), (268, 10), (623,), (268,))
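The order of operations matters for evaluation: fitting the imputer and scaler after the split keeps test-set statistics out of the training features. The following is a minimal sketch of that alternative order, assuming synthetic stand-in data (the `data` and `target` objects are hypothetical) and using `SimpleImputer` and `Pipeline`, neither of which the cells above rely on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few Titanic columns (hypothetical values).
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "pclass": rng.integers(1, 4, n),
    "embarked": rng.choice(["S", "C", "Q"], n),
})
data.loc[data.index % 10 == 0, "age"] = np.nan  # inject some missing ages
target = pd.Series(rng.integers(0, 2, n), name="survived")

# Split first, so no statistic is computed from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=42
)

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer(
    [
        ("num", numeric, ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["pclass", "embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# fit_transform learns medians, means, and categories from the training set only;
# transform reuses those learned values on the test set.
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(X_train_processed.shape, X_test_processed.shape)
```

After this, the scaled training columns have mean 0 by construction, while the test columns generally do not, which is expected: the test set is transformed with the training statistics.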

## Summary

In this notebook, we have:
1. **Loaded the Titanic dataset**: We used `seaborn` to load the dataset.
2. **Cleaned the data**: Handled missing values by imputing with median or mode.
3. **Scaled the features**: Applied `StandardScaler` to numerical features to standardize them.
4. **Encoded categorical features**: Used `OneHotEncoder` to convert categorical features into numerical format.
5. **Split the data**: Divided the dataset into training and testing sets for model evaluation.

The preprocessed Titanic dataset is now ready for machine learning algorithms.
