# Preprocessing and Pipelines

## Preprocessing

**Transforming categorical features into binary features**

To include categorical features in a model, we first need to convert them into a form the model can process. Not every model can handle categorical (non-numeric) data directly, yet categorical features often carry information that improves the model's ability to generalize to unseen data. Making them usable by the model is therefore worth the effort.

There are many ways to transform categorical features. One of them is to create dummy variables with the pandas get_dummies() function.

In [1]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset from seaborn
titanic_data = sns.load_dataset('titanic')
# Select relevant columns for demonstration
selected_columns = ['sex', 'class', 'embark_town', 'survived']

In [2]:
titanic_data[selected_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   sex          891 non-null    object  
 1   class        891 non-null    category
 2   embark_town  889 non-null    object  
 3   survived     891 non-null    int64   
dtypes: category(1), int64(1), object(2)
memory usage: 22.0+ KB


In [3]:
titanic_data[selected_columns].head()

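|   | sex | class | embark_town | survived |
|---|-----|-------|-------------|----------|
| 0 | male | Third | Southampton | 0 |
| 1 | female | First | Cherbourg | 1 |
| 2 | female | Third | Southampton | 1 |
| 3 | female | First | Southampton | 1 |
| 4 | male | Third | Southampton | 0 |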


In [4]:
titanic_data[selected_columns].shape

(891, 4)

In [5]:
df = titanic_data[selected_columns].copy()

# Transform categorical features using get_dummies()
df_encoded = pd.get_dummies(df, columns=['sex', 'class', 'embark_town'])

In [6]:
df_encoded.shape

(891, 9)

In [7]:
df_encoded.head()

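|   | survived | sex_female | sex_male | class_First | class_Second | class_Third | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |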
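
The encoded frame above keeps one column per category, so each group of dummy columns is redundant by one (for example, sex_male is always 1 - sex_female). As a minimal sketch, assuming a small hand-made frame in place of the Titanic data, the drop_first option of get_dummies() removes the first level of every encoded feature:

```python
import pandas as pd

# Toy frame standing in for the Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
    "survived": [0, 1, 1, 0],
})

# One binary column per category, as in the cell above.
full = pd.get_dummies(toy, columns=["sex", "class"], dtype=int)

# drop_first=True drops the first (alphabetically sorted) level of each feature.
reduced = pd.get_dummies(toy, columns=["sex", "class"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Whether to drop a level depends on the model: linear models without regularization are usually the ones that benefit, while tree-based models tend to work with either encoding.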


## Missing data

Before we can use the data in a model, we need to deal with missing values in the dataset. A common rule of thumb is to remove the observations (rows) that have missing values in a column when those missing values make up less than 5% of all the data.

In [8]:
import pandas as pd
import numpy as np

In [9]:
df = pd.read_csv('music_clean.csv')
df.head()

|   | Unnamed: 0 | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36506 | 60.0 | 0.896 | 0.726 | 214547.0 | 0.177 | 2e-06 | 0.116 | -14.824 | 0.0353 | 92.934 | 0.618 | 1 |
| 1 | 37591 | 63.0 | 0.00384 | 0.635 | 190448.0 | 0.908 | 0.0834 | 0.239 | -4.795 | 0.0563 | 110.012 | 0.637 | 1 |
| 2 | 37658 | 59.0 | 7.5e-05 | 0.352 | 456320.0 | 0.956 | 0.0203 | 0.125 | -3.634 | 0.149 | 122.897 | 0.228 | 1 |
| 3 | 36060 | 54.0 | 0.945 | 0.488 | 352280.0 | 0.326 | 0.0157 | 0.119 | -12.02 | 0.0328 | 106.063 | 0.323 | 1 |
| 4 | 35710 | 55.0 | 0.245 | 0.667 | 273693.0 | 0.647 | 0.000297 | 0.0633 | -7.787 | 0.0487 | 143.995 | 0.3 | 1 |


In [10]:
# Select random row and column indices
num_rows_to_replace = 10
num_cols_to_replace = 2
random_rows = np.random.choice(df.index, num_rows_to_replace, replace=False)
random_cols = np.random.choice(df.columns, num_cols_to_replace, replace=False)

# Replace the selected cells with null values
df.loc[random_rows, random_cols] = np.nan


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1000 non-null   int64  
 1   popularity        1000 non-null   float64
 2   acousticness      1000 non-null   float64
 3   danceability      990 non-null    float64
 4   duration_ms       1000 non-null   float64
 5   energy            1000 non-null   float64
 6   instrumentalness  1000 non-null   float64
 7   liveness          1000 non-null   float64
 8   loudness          1000 non-null   float64
 9   speechiness       990 non-null    float64
 10  tempo             1000 non-null   float64
 11  valence           1000 non-null   float64
 12  genre             1000 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 101.7 KB


In [12]:
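# Drop rows that have a missing value in any of the listed columns.
# In this run the injected NaNs fell in 'danceability' and 'speechiness', so no rows are removed.
ls_to_clean = ['genre', 'popularity', 'loudness', 'liveness', 'tempo']
cln_df = df.dropna(subset=ls_to_clean)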

In [13]:
cln_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1000 non-null   int64  
 1   popularity        1000 non-null   float64
 2   acousticness      1000 non-null   float64
 3   danceability      990 non-null    float64
 4   duration_ms       1000 non-null   float64
 5   energy            1000 non-null   float64
 6   instrumentalness  1000 non-null   float64
 7   liveness          1000 non-null   float64
 8   loudness          1000 non-null   float64
 9   speechiness       990 non-null    float64
 10  tempo             1000 non-null   float64
 11  valence           1000 non-null   float64
 12  genre             1000 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 109.4 KB
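
The 5% rule of thumb from this section can also be applied programmatically instead of listing columns by hand. The following is a minimal sketch on a toy frame; the column names, the amount of missing data, and the `threshold` variable are assumptions made for illustration, not part of the music dataset:

```python
import numpy as np
import pandas as pd

# Toy data: 100 rows; 'a' has 2% missing values, 'b' has 30% (illustrative).
toy = pd.DataFrame({
    "a": np.arange(100, dtype=float),
    "b": np.arange(100, dtype=float),
})
toy.loc[[3, 50], "a"] = np.nan
toy.loc[list(range(10, 40)), "b"] = np.nan

threshold = 0.05  # assumed cut-off taken from the rule of thumb

# Share of missing values per column.
missing_share = toy.isna().mean()

# Columns that have some missing values, but fewer than the threshold.
cols_to_clean = missing_share[(missing_share > 0) & (missing_share < threshold)].index.tolist()

# Drop only the rows with missing values in those columns.
cleaned = toy.dropna(subset=cols_to_clean)

print(cols_to_clean, cleaned.shape)
```

Column 'b' is left untouched here because dropping 30% of the rows would discard too much data; columns like that are better handled by imputation.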


## Imputing values

Another way to deal with missing values is to replace them with other values. Choosing good replacement values usually requires some domain expertise or a detailed analysis of the data beforehand.

A simpler option is to impute numeric features with the mean or the median, and categorical features with the most frequent value.
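
As a small illustration of these strategies, here is a sketch on hand-made arrays (the values are assumptions chosen so the results are easy to check by hand). It uses the `strategy` parameter of scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column and one numerically coded categorical column, each with a gap.
num = np.array([[1.0], [2.0], [np.nan], [9.0]])
cat = np.array([[1.0], [1.0], [np.nan], [0.0]])

# Numeric column: fill with the mean or the median of the observed values.
mean_filled = SimpleImputer(strategy="mean").fit_transform(num)
median_filled = SimpleImputer(strategy="median").fit_transform(num)

# Categorical column: fill with the most frequent value.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(mean_filled.ravel(), median_filled.ravel(), mode_filled.ravel())
```

The median is less sensitive to outliers than the mean: here the single large value 9.0 pulls the mean up to 4.0 while the median stays at 2.0.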

In [14]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Split the features into categorical and numeric parts
X_cat = df['genre'].values.reshape(-1,1)
X_num = df.drop(['genre', 'popularity'], axis=1).values
y = df['popularity'].values

In [15]:
# Same test_size and random_state in both calls, so the rows line up across the two splits
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=42)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=42)

In [16]:
# Impute categorical variables with the most frequent value
imp_cat = SimpleImputer(strategy='most_frequent')
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat) # transform only: the imputer was already fitted on the training set


In [18]:
# Impute numeric variables (SimpleImputer defaults to the mean)
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# Combine the imputed numeric and categorical features
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)
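
The split, impute and recombine steps above can also be expressed in one object. The sketch below uses scikit-learn's ColumnTransformer, which the original cells do not use; it is shown only as one possible alternative, on a toy array whose column layout (two numeric columns followed by one numerically coded categorical column) is an assumption:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Columns 0-1 are numeric, column 2 is a numerically coded category (illustrative).
X_train = np.array([
    [1.0, 10.0, 1.0],
    [3.0, np.nan, 1.0],
    [np.nan, 30.0, 0.0],
    [5.0, 20.0, np.nan],
])
X_test = np.array([[np.nan, np.nan, np.nan]])

# Each imputer is applied only to the columns listed for it.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), [0, 1]),
    ("cat", SimpleImputer(strategy="most_frequent"), [2]),
])

# Fit on the training set only, then reuse the learned statistics on the test set.
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

print(X_test_imp)
```

The output columns follow the order of the transformers, so the numeric columns come first and the categorical column last, matching the layout produced by np.append above.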

## Imputing within pipeline

Pipelines streamline this process. A Pipeline is an object that chains the transformation steps and the final model, so that they can be fitted and applied in a single workflow.

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [22]:
# Construct the pipeline steps
steps = [('imputation', SimpleImputer()),
         ('log_reg', LogisticRegression())]

pip = Pipeline(steps)

In [23]:
X = df.drop('genre', axis=1).values
y = df['genre'].values

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,
                                                    random_state=42)
pip.fit(X_train, y_train)
pip.score(X_test, y_test)

0.8033333333333333
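
To make the mechanics of the pipeline easier to inspect, here is a self-contained sketch on synthetic data. The data, the seed, and the choice of the median strategy are assumptions made for illustration; the music dataset is not used. It shows that calling fit on the pipeline fits the imputer on the training data only, and that the learned statistics can be read back through named_steps:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary classification problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 5% of the cells to simulate missing values.
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ("imputation", SimpleImputer(strategy="median")),
    ("log_reg", LogisticRegression()),
])

# fit() fits the imputer on X_train, transforms X_train, then fits the model.
pipe.fit(X_train, y_train)

# score() only transforms X_test with the medians learned from X_train.
accuracy = pipe.score(X_test, y_test)
learned_medians = pipe.named_steps["imputation"].statistics_

print(accuracy, learned_medians)
```

Because the imputer never sees the test rows during fitting, the pipeline avoids leaking test-set information into the preprocessing step.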