1. Load Data (sns.load_dataset("titanic"))
2. Impute Missing Values:

    Fill age with median.
    Fill embarked with mode.
3. Categorical Encoding:

    One-hot encode sex, embarked.
4. Feature Splitting:

    Split name into first_name and title using string operations.
5. Discretization:

    Bin age into categories (e.g., child, adult, senior).
6. Scaling:

    Use MinMaxScaler on fare.
7. Handling Outliers:

    Clip extreme fare values above the 95th percentile.
8. Variable Transformation:

    Apply log transformation to fare for skewness reduction.
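
Step 4 can be sketched without the dataset, using only plain string operations. This is a minimal sketch, assuming names follow the "Surname, Title. Given names" pattern; the helper `split_name` and the sample strings are hypothetical and used only for illustration.

```python
def split_name(name):
    """Return (first_name, title) from a 'Surname, Title. Given names' string."""
    # Missing or unexpected values fall back to placeholders.
    if not isinstance(name, str) or "," not in name:
        return ("Unknown", "Unknown")
    rest = name.split(",", 1)[1].strip()      # e.g. "Mr. Owen Harris"
    if "." not in rest:
        return ("Unknown", "Unknown")
    title, given = rest.split(".", 1)         # "Mr", " Owen Harris"
    given_parts = given.split()
    first_name = given_parts[0] if given_parts else "Unknown"
    return (first_name, title.strip())


# Hypothetical sample names in the assumed format.
samples = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", None]
for s in samples:
    print(split_name(s))
```

On a DataFrame that has a name column, the same function could be applied with `df['name'].apply(split_name)` and the resulting tuples unpacked into first_name and title columns.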

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

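# Step 1: load the dataset.
df=sns.load_dataset('titanic')

# Step 2: impute missing values. Assign the result back instead of calling
# fillna(inplace=True) on df['col'], which acts on a copy and triggers a
# FutureWarning in recent pandas versions.
df['age']=df['age'].fillna(df['age'].median())
df['embarked']=df['embarked'].fillna(df['embarked'].mode()[0])

# Step 3: one-hot encode sex and embarked (first level dropped).
df=pd.get_dummies(df,columns=['sex','embarked'],drop_first=True)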



# Step 4 is left commented out: seaborn's titanic dataset has no 'name' column.
# def extract_name_title(name):
#     if pd.isnull(name) or ',' not in name:
#         return pd.Series(['Unknown', 'Unknown'])
#     rest = name.split(',', 1)[1]                      # e.g. ' Mr. Owen Harris'
#     title = rest.split('.')[0].strip() if '.' in rest else 'Unknown'
#     given = rest.split('.', 1)[1].split() if '.' in rest else []
#     first_name = given[0] if given else 'Unknown'
#     return pd.Series([first_name, title])

# df[['first_name', 'title']] = df['name'].apply(extract_name_title)



# def age_bin(age):
#     if age < 18:
#         return 'child'
#     elif age < 60:
#         return 'adult'
#     else:
#         return 'senior'

# df['age_category'] = df['age'].apply(age_bin)

# Step 5: bin age into child (<18), adult (18-59) and senior (60+).
# right=False makes each bin include its left edge, e.g. [18, 60).
labels=['child','adult','senior']
bins=[0,18,60,np.inf]
df['age_category']=pd.cut(df['age'],bins=bins,labels=labels,right=False)

# Step 6: rescale fare to the [0, 1] range.
scaler=MinMaxScaler()
df['fare_scaled']=scaler.fit_transform(df[['fare']])

# Step 7: cap fare at its 95th percentile.
fare_95 = df['fare'].quantile(0.95)
df['fare_clipped'] = np.clip(df['fare'], a_min=None, a_max=fare_95)

df.head()

index,survived,pclass,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alive,alone,sex_male,embarked_Q,embarked_S,age_category,fare_scaled,fare_clipped
0,0,3,22.0,1,0,7.25,Third,man,True,,Southampton,no,False,True,False,True,adult,0.014151,7.25
1,1,1,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,yes,False,False,False,False,adult,0.139136,71.2833
2,1,3,26.0,0,0,7.925,Third,woman,False,,Southampton,yes,True,False,False,True,adult,0.015469,7.925
3,1,1,35.0,1,0,53.1,First,woman,False,C,Southampton,yes,False,False,False,True,adult,0.103644,53.1
4,0,3,35.0,0,0,8.05,Third,man,True,,Southampton,no,True,True,False,True,adult,0.015713,8.05
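
Step 8 (log transformation of fare) has no counterpart in the cell above. The following is a minimal sketch, assuming `log1p` (that is, log(1 + x)) is an acceptable choice because a fare can be zero, where a plain logarithm is undefined. The fare values are illustrative, and plain Python is used so the snippet runs without the dataset.

```python
import math

# Illustrative fares; 0.0 is included to show why log1p is used instead of log.
fares = [0.0, 7.25, 71.2833, 512.3292]

# log1p(x) = log(1 + x): defined at zero and compresses large values.
fare_log = [math.log1p(f) for f in fares]

# Ratio of the largest to the smallest non-zero value, before and after.
raw_spread = fares[-1] / fares[1]
log_spread = fare_log[-1] / fare_log[1]

print(fare_log)
print(raw_spread, log_spread)
```

In the notebook, the equivalent would likely be `df['fare_log'] = np.log1p(df['fare'])`, applied to the raw fare column rather than to fare_scaled.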
