# Preprocessing and Pipelines

## Preprocessing

**Transforming categorical features into binary features**

To include categorical features in a model, we first need to convert them into a form the model can process. Not every model can handle categorical (non-numeric) data. Still, categorical features often carry important information that improves the model's ability to generalize to unseen data, so making them accessible to the model is worth the effort.

There are many methods for transforming categorical features. One of them is creating dummy variables (one binary 0/1 column per category) with the pandas `get_dummies` function.
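Before applying this to a real dataset, here is a minimal sketch of what dummy variables look like. The `toy` frame and its `color` and `price` columns are hypothetical and exist only for illustration. Passing `dtype=int` is a choice made here to get 0/1 values, since recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical toy data, not part of the Titanic dataset
toy = pd.DataFrame({"color": ["red", "green", "red"], "price": [10, 12, 9]})

# Each distinct value of "color" becomes its own 0/1 column;
# dtype=int forces integers instead of the boolean default in recent pandas
toy_encoded = pd.get_dummies(toy, columns=["color"], dtype=int)

print(toy_encoded)
```

The numeric `price` column is left untouched, while `color` is replaced by `color_green` and `color_red`.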

In [1]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset from seaborn
titanic_data = sns.load_dataset('titanic')
# Select relevant columns for demonstration
selected_columns = ['sex', 'class', 'embark_town', 'survived']

In [9]:
titanic_data[selected_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   sex          891 non-null    object  
 1   class        891 non-null    category
 2   embark_town  889 non-null    object  
 3   survived     891 non-null    int64   
dtypes: category(1), int64(1), object(2)
memory usage: 22.0+ KB


In [10]:
titanic_data[selected_columns].head()

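|   | sex    | class | embark_town | survived |
|---|--------|-------|-------------|----------|
| 0 | male   | Third | Southampton | 0        |
| 1 | female | First | Cherbourg   | 1        |
| 2 | female | Third | Southampton | 1        |
| 3 | female | First | Southampton | 1        |
| 4 | male   | Third | Southampton | 0        |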


In [11]:
titanic_data[selected_columns].shape

(891, 4)

In [6]:
df = titanic_data[selected_columns].copy()

# Transform categorical features using get_dummies()
df_encoded = pd.get_dummies(df, columns=['sex', 'class', 'embark_town'])

In [7]:
df_encoded.shape

(891, 9)
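The growth from 4 to 9 columns can be checked by hand: `survived` stays as it is (1 column), while `sex`, `class`, and `embark_town` contribute 2, 3, and 3 dummy columns respectively. The sketch below generalizes that arithmetic. The helper `expected_dummy_width` and the `demo` frame are hypothetical names introduced for illustration, and the helper assumes plain object columns (a `category` column also produces dummies for categories that never occur).

```python
import pandas as pd

def expected_dummy_width(df, columns):
    """Predict the column count of pd.get_dummies(df, columns=columns)."""
    untouched = df.shape[1] - len(columns)
    # nunique() skips NaN, matching get_dummies' default of dummy_na=False
    return untouched + sum(df[col].nunique() for col in columns)

# Hypothetical stand-in with the same category counts as the Titanic columns
demo = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "class": ["Third", "First", "Second"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
    "survived": [0, 1, 1],
})
cols = ["sex", "class", "embark_town"]

predicted = expected_dummy_width(demo, cols)          # 1 + 2 + 3 + 3
actual = pd.get_dummies(demo, columns=cols).shape[1]
print(predicted, actual)
```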

In [8]:
df_encoded.head()

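|   | survived | sex_female | sex_male | class_First | class_Second | class_Third | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|----------|------------|----------|-------------|--------------|-------------|-----------------------|------------------------|-------------------------|
| 0 | 0        | 0          | 1        | 0           | 0            | 1           | 0                     | 0                      | 1                       |
| 1 | 1        | 1          | 0        | 1           | 0            | 0           | 1                     | 0                      | 0                       |
| 2 | 1        | 1          | 0        | 0           | 0            | 1           | 0                     | 0                      | 1                       |
| 3 | 1        | 1          | 0        | 1           | 0            | 0           | 0                     | 0                      | 1                       |
| 4 | 0        | 0          | 1        | 0           | 0            | 1           | 0                     | 0                      | 1                       |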
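
The `info()` output earlier shows that `embark_town` has only 889 non-null values out of 891. By default, `get_dummies` does not create a column for missing values, so those rows get 0 in every `embark_town_*` column. The sketch below shows this on a hypothetical three-row frame and contrasts it with `dummy_na=True`, which adds an explicit indicator column for missing values. Whether that extra column is useful depends on the model, so treat this as one option rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
towns = pd.DataFrame({"embark_town": ["Southampton", np.nan, "Cherbourg"]})

# Default behavior: the row with NaN gets 0 in every dummy column
default = pd.get_dummies(towns, columns=["embark_town"], dtype=int)

# dummy_na=True appends an indicator column that flags the missing value
with_na = pd.get_dummies(towns, columns=["embark_town"], dummy_na=True, dtype=int)

print(default)
print(with_na)
```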
