In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Preprocessing Data and Building Pipelines for Machine Learning Projects


- [Facing Reality: Missing Values and Mixed Data Types](#Facing-Reality:-Missing-Values-and-Mixed-Data-Types)
- [Handling Missing Data (Part 1): Identification and Deletion](#Handling-Missing-Data-(Part-1):-Identification-and-Deletion)
- [Converting Texts into Numerical Values](#Converting-Texts-into-Numerical-Values)
- [Handling Categorical Features](#Handling-Categorical-Features)
- [Handling Missing Data (Part 2): Imputation](#Handling-Missing-Data-(Part-2):-Imputation)
- [Bringing Features onto the Same Scale](#Bringing-Features-onto-the-Same-Scale)
- [Streamlining Processes Using Pipelines](#Streamlining-Processes-Using-Pipelines)
- [Putting Everything Together](#Putting-Everything-Together)


## <font color="#0000E0">Facing Reality: Missing Values and Mixed Data Types</font>

<div class="alert alert-block alert-info"><font color="#000000">

In the following simple dataset,

- color and neck-style are <b>nominal</b> categorical features;
- size is an <b>ordinal</b> categorical feature;
- price, cotton, and sales are <b>numerical</b> or <b>continuous</b> variables;
- many values are missing, with various missing-value codes: NaN, price = -1, and neckstyle = 99;
- the data format is not ideal: the cotton percentage should be numeric but is stored as text.

Note: If a cell in a .csv file is empty, the imported value will be <b>NaN</b>, which marks a missing value.  Although NaN literally stands for 'not a number', it is not text either: pandas stores it as a special floating-point value.
</font></div>
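As a small illustration of the note above, the sketch below reads a made-up two-row CSV (the text and column names are hypothetical, not taken from clothing_simple.csv) and inspects the value pandas stores for the empty cell.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical two-row CSV; the 'price' cell in the second row is empty.
csv_text = "color,price\nred,10.5\nblue,\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)                     # 'price' remains a floating-point column
print(type(toy.loc[1, 'price']))      # a float type, not a string
print(np.nan == np.nan)               # False: NaN is not equal to anything, itself included
print(pd.isna(toy.loc[1, 'price']))   # True: test for NaN with isna()/isnull(), not with ==
```

Because NaN never compares equal to itself, expressions such as `df == np.nan` cannot locate missing values, which is why the dedicated `isna()` and `isnull()` functions exist.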

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, linewidth=200)

# For other input functions, see https://pandas.pydata.org/pandas-docs/stable/reference/io.html
df0 = pd.read_csv('clothing_simple.csv')
df = df0.copy()  # Making a copy will facilitate comparison. We will not change df0 and only work on df.

print(df.dtypes)
df

<div class="alert alert-block alert-success"><font color="#000000">
<b><font color="#008000" size=3> Review: Type of target variable determines the type of machine learning task</font></b>

When the target variable is continuous, we have a numerical prediction problem that can be analyzed using <b>regressions</b>.  When the target variable is categorical with two classes, we have a <b>binary classification</b> problem.  When the target variable is categorical with three or more classes, we need to distinguish between <b>nominal vs. ordinal classification</b>.

Recall that if a <b>nominal</b> target variable is in text form (e.g., name of the iris classes), we can use <b>label encoder</b> to encode the target variable into numerical values.  But as we will see, nominal feature values need to be treated differently.
</font></div>

## <font color="#0000E0">Handling Missing Data (Part 1): Identification and Deletion</font>

Reference: Chapter 4 of <i>Python Machine Learning</i>

<div class="alert alert-block alert-info"><font color="#000000">
<b><font color="#0000E0" size=3>Machine learning models require no missing data</font></b>

We have learned that machine learning models analyze feature values in various ways: logistic regression, support vector machines, and linear regressions compute <b>weighted feature values</b>, k-nearest neighbors model calculates <b>distance between feature values</b>, and decision tree model finds <b>threshold feature values</b>.

Missing values cannot pass through model training.  For example, weighted feature value $w_0 + w_1 x_1 + w_2 \text{NaN}$ is meaningless unless we replace NaN by some value.
In fact, most of scikit-learn's machine learning models require that no missing values be present in the input data. See https://scikit-learn.org/stable/modules/impute.html

When some feature values are missing, could we discard entire rows and/or columns containing missing values?  Yes, but only sometimes, because this may come at the price of losing valuable data.
Another strategy is to <b>impute</b> the missing values, i.e., to infer them from the known part of the data, which we will discuss in Part 2.
</font></div>
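A minimal sketch of why a missing value cannot pass through training, using made-up weights and data: the NaN contaminates the weighted sum, and scikit-learn's LinearRegression rejects the input when fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up weights and one instance whose second feature is missing.
w0, w = 0.5, np.array([2.0, -1.0])
x = np.array([3.0, np.nan])
z = w0 + w @ x
print(z)   # nan: one missing feature makes the whole weighted sum undefined

# LinearRegression refuses input that contains NaN.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print('ValueError:', str(err).splitlines()[0])
```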



<div class="alert alert-block alert-info"><font color="#000000">

<b><font color="#0000E0" size=3> Identify missing values </font></b>
    
Special codes are often used to indicate missing values, e.g., in a survey dataset, 95 may represent "prefer not to answer", 99 may represent "I don't know".  It is important <b>not to mistake these codes for actual values</b>.

A dataframe's <b>replace</b> function can replace all the missing-value codes by <b>np.nan</b> (from numpy), which pandas treats as a missing value.  
For using the replace function, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html
</font></div>

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas">df = df.replace( { 'price': -1, 'neckstyle': 99 }, np.nan ) </font>
</p>

In [None]:
# {'price': -1, 'neckstyle': 99} is a dictionary. It informs the replace function to 
# look for -1 in column 'price' and 99 in column 'neckstyle' and then replace these values with the second argument
# Here, we overwrite df with replaced values
df = df.replace({'price': -1, 'neckstyle': 99}, np.nan)
df
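As an alternative sketch, the same codes can be declared when the file is read, through the `na_values` argument of `read_csv`. The CSV text below is made up for illustration; only the column names and codes mirror the ones used above.

```python
import io
import pandas as pd

# Hypothetical CSV that uses the same missing-value codes as above.
csv_text = "price,neckstyle,sales\n20,1,99\n-1,99,15\n35,2,30\n"

# A dict restricts each code to its own column, so the genuine 99 in 'sales' survives.
toy = pd.read_csv(io.StringIO(csv_text),
                  na_values={'price': [-1], 'neckstyle': [99]})
print(toy)
print(toy.isna().sum())
```

Declaring the codes at read time keeps the raw file untouched, while `replace` is handier when the codes are discovered only after inspecting the data.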

In [None]:
# The following shortcut is okay only if -1 and 99 represent missing values for the entire dataset
# Caution: It will cause mistakes if 99 appears in 'sales' column.

print(((df0 == -1) | (df0 == 99)).sum())  # per-column counts, to check whether -1 and 99 appear somewhere else
df0.replace([-1, 99], np.nan)

<div class="alert alert-block alert-info"><font color="#000000">
It is useful to get a sense of how many values are missing.
For a pandas dataframe, you can use either <b>.isnull()</b> or <b>.isna()</b> to identify missing values.  Then, you can count missing values by columns or by rows.  If an instance (row) has too many missing values, we can consider dropping that instance, while if it has only a few missing values, we can consider imputing missing values.
</font></div>

In [None]:
df.isnull()

In [None]:
# count missing values by columns
df.isnull().sum()

In [None]:
# count missing values by rows
df.isnull().sum(axis=1)

In [None]:
# total number of missing values
df.isnull().sum().sum()

<div class="alert alert-block alert-info"><font color="#000000">

<b><font color="#0000E0" size=3> Delete instances (rows) or features (columns) that meet given criteria </font></b>

We can require each row or column to have a given minimum number of non-missing values, and drop those that don't meet this minimum requirement.
</font></div>
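To make the meaning of the minimum requirement concrete, here is a small sketch on a made-up frame (not the clothing data). It checks that `thresh` counts non-missing values per row and that the two extreme thresholds coincide with `how='all'` and `how='any'`.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is entirely missing, row 2 has a single valid value.
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                    'b': [2.0, np.nan, 5.0, np.nan],
                    'c': [3.0, np.nan, np.nan, 6.0]})

# thresh counts NON-missing values in each row.
n_valid = toy.notna().sum(axis=1)
print(n_valid.to_list())

kept_thresh = toy.dropna(thresh=2)
kept_manual = toy[n_valid >= 2]
print(kept_thresh.equals(kept_manual))

# The two extremes match how='all' and how='any'.
same_as_all = toy.dropna(thresh=1).equals(toy.dropna(how='all'))
same_as_any = toy.dropna(thresh=toy.shape[1]).equals(toy.dropna(how='any'))
print(same_as_all, same_as_any)
```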

In [None]:
df

In [None]:
# Keep rows that have at least 'thresh' non-missing values 
# Change 'thresh' to test
df.dropna(thresh=1)

In [None]:
# Drop a row if all values are missing
# Equivalent to dropna(thresh=1)
df.dropna(how='all')

In [None]:
# Drop a row if any value is missing
# Equivalent to dropna(thresh=number of columns)
df.dropna(how='any')

In [None]:
# We can ask dropna() to check the requirement only on a given subset of columns
# For example, require no missing values in specific columns
df.dropna(subset=['color','size','price'], how='any')

In [None]:
# Keep rows that have at least 'thresh' non-missing values in specific columns
df.dropna(subset=['color','size','price'], thresh=2)

In [None]:
# Keep features (columns) that have at least 'thresh' non-missing values 
df.dropna(thresh=5, axis='columns')

In [None]:
# Drop a column if all values in that column are missing
df.dropna(how='all', axis='columns')

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas">df = df.dropna( how='all' ).dropna( how='all', axis='columns' ) </font>
</p>

In [None]:
# Deleting a combination of rows and columns
# This time, we overwrite the original df
df = df.dropna(how='all').dropna(how='all', axis='columns')
df

## <font color="#0000E0">Converting Texts into Numerical Values</font>

<div class="alert alert-block alert-info"><font color="#000000">

Feature values don't always come in the format desirable for machine learning.  Numerical features may be stored as texts (due to symbols such as %, $, and thousands separator), while categorical features may be stored as numbers.  

To check whether each column's data type is consistent with its meaning, use the dataframe's <b>dtypes</b>.  
See data types at https://pbpython.com/pandas_dtypes.html
</font></div>

In [None]:
# In the output below, 'color', 'size', and 'price' have data types consistent with their meanings.
# ('object' type contains text or mixed text and numbers.)  
# However, 'neckstyle' should be categorical and 'cotton' should be numerical.
df.dtypes

In [None]:
# Check 'cotton' column.
# We will convert the texts in 'cotton' column into numerical values.
df['cotton']

In [None]:
# We can delete '%', but the data type is still 'object'
df['cotton'].str.replace('%', '')

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas">df['cotton'] = pd.to_numeric( df['cotton'].str.replace('%', '') ) </font>
</p>

In [None]:
# We use pandas to_numeric function to convert text into numbers.
# We overwrite the 'cotton' column
df['cotton'] = pd.to_numeric(df['cotton'].str.replace('%',''))
df

In [None]:
# check the data types again
df['cotton'].dtypes

<div class="alert alert-block alert-info"><font color="#000000">
<b><font color="#0000E0" size=3> Other Resources </font></b>

Converting texts into numbers can be tricky, regardless of what software you use.  Here are a few more references:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
<b>read_excel</b> is sometimes useful because Excel specifies whether each cell contains a number or text, but if a column has some numbers in text format, the resulting dataframe from read_excel needs to be checked and cleaned. 

https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html Pay particular attention to the section on <b>extracting substrings</b>.

In working with texts, you may encounter unfamiliar syntax related to <b>regular expressions</b>.  A good tutorial is at  
https://www.dataquest.io/blog/regular-expressions-data-scientists/

</font></div>
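As a sketch of how regular expressions and substring extraction help with such conversions, the example below cleans two made-up columns (the values are hypothetical). It uses `str.replace` with a character class, `pd.to_numeric` with `errors='coerce'`, and `str.extract`.

```python
import pandas as pd

# Hypothetical messy column: currency symbols, thousands separators, and a stray label.
raw = pd.Series(['$1,200', '$950', 'n/a', '2,050.50'])

# The character class [$,] matches either symbol; regex=True makes the intent explicit.
stripped = raw.str.replace(r'[$,]', '', regex=True)

# errors='coerce' turns anything that is still not a number into NaN
# instead of raising an exception.
values = pd.to_numeric(stripped, errors='coerce')
print(values)

# str.extract returns the first match of the capture group, here the first run of digits.
pct = pd.Series(['80%', '100 %', 'cotton 65%'])
digits = pct.str.extract(r'(\d+)')[0].astype(float)
print(digits.to_list())
```

With `errors='coerce'`, unparseable entries surface as NaN, so they can then be handled by the same missing-value tools discussed earlier.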

## <font color="#0000E0">Handling Categorical Features</font>

Reference: Chapter 4 of <i>Python Machine Learning</i>

<div class="alert alert-block alert-info"><font color="#000000">
<b><font color="#0000E0" size=3> Types of categorical features </font></b>

Ordinal features (e.g., XS, S, M, L, XL in clothing sizes) can be mapped into numerical values that preserve their order.  Nominal features, however, cannot simply be encoded as integers; otherwise the learning model will mistake nominal features for ordinal features. 
</font></div>

In [None]:
# This is our dataset
df

<div class="alert alert-block alert-warning"><font color="#000000">
    
<b><font color="#800000" size=3> Incorrect mapping of ordinal features</font></b>

What mistake will we make if we use a label encoder on an ordinal feature ('size')?   Try the following.
</font></div>

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

size_num = le.fit_transform( df['size'].dropna() )
print("Original size data:", df['size'].dropna().to_numpy())
print("Encoded size data: ", size_num)
print("Label encoder orders labels alphabetically:", le.classes_)

<div class="alert alert-block alert-info"><font color="#000000">
    
<b><font color="#0000E0" size=3> Map ordinal features: Correct approach </font></b>

To make sure that our learning models interpret ordinal features correctly, we need to <b>define the mapping from the categorical texts to numerical values</b>, and then use the pandas series' <b>map</b> function to apply the mapping.  It is similar to <b>replace</b>, but <b>map</b> applies to series only, and any value that is absent from the mapping becomes NaN.
</font></div>

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas">df['size'] = df['size'].map({'S':1, 'M':2, 'L':3, 'XL':4 }) </font>
</p>

In [None]:
# Overwrite the dataframe with encoded 'size' column
df['size'] = df['size'].map({'S':1, 'M':2, 'L':3, 'XL':4 })
df

In [None]:
# check data types
df.dtypes
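
One thing to watch when applying a mapping: labels that are not in the dictionary are silently turned into NaN. The sketch below uses a made-up series (the label 'XXL' is hypothetical) to show the effect and a simple way to list the labels that were lost.

```python
import numpy as np
import pandas as pd

size = pd.Series(['M', 'XL', np.nan, 'S', 'XXL'])   # 'XXL' is not in the mapping

size_map = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
encoded = size.map(size_map)
print(encoded.to_list())      # the unmapped label and the original NaN both end up as NaN

# Labels that the mapping turned into NaN even though they were present in the data
unmapped = size[encoded.isna() & size.notna()].unique()
print(list(unmapped))
```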

<div class="alert alert-block alert-warning"><font color="#000000">
    
<b><font color="#800000" size=3> Incorrect encoding of nominal features </font></b>

In our dataset, 'color' and 'neckstyle' are nominal features.  If we encode colors into numbers (say 1 for blue, 2 for green, 3 for red) and feed these numbers into a training process, we will make one of the most common mistakes in dealing with categorical data, because we are imposing 'blue' < 'green' < 'red', which should not be assumed.  Same for 'neckstyle', which is already encoded and will be interpreted by computers as 1 < 2 < 3, which could imply 'V-neck' < 'crew neck' < 'scoop neck'.  These unintended orders can degrade model performance.
</font></div>

<div class="alert alert-block alert-info"><font color="#000000">
    
<b><font color="#0000E0" size=3> Correct approach: "One-hot" encoding on nominal features </font></b>

One-hot encoding is a simple idea that is illustrated in the block below.  For each category, we create an indicator feature. Only one indicator feature is set to 1 in each row (hence the name 'one-hot'), and 0 elsewhere.

Note that one-hot encoding is the same idea as using dummy variables in statistics.
</font></div>

In [None]:
df1 = df.copy()
df1 = df1.replace({'color': np.nan}, 'green')
df1 = df1.replace({'neckstyle': np.nan}, 3)
df1[['color', 'neckstyle']]

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas"> OneHotEncoder( handle_unknown='ignore' ) </font>
</p>

In [None]:
# handle_unknown='ignore':  When an unknown category is encountered during transform, 
# the resulting one-hot code for this category will be all zeros.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit_transform( df1[['color']].to_numpy() ).toarray()

In [None]:
# Encoding scheme
ohe.transform([['blue'], ['green'], ['red']]).toarray()
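
To see what `handle_unknown='ignore'` does in practice, the following sketch fits an encoder on a made-up list of colors and then transforms a color that was never seen during fitting ('purple' is a hypothetical value).

```python
from sklearn.preprocessing import OneHotEncoder

train_colors = [['blue'], ['green'], ['red'], ['green']]
enc = OneHotEncoder(handle_unknown='ignore').fit(train_colors)
print(enc.categories_)

# 'purple' was not seen during fit, so its row of indicators is all zeros.
codes = enc.transform([['red'], ['purple']]).toarray()
print(codes)
```

With the default `handle_unknown='error'`, the same call to `transform` would raise an error instead.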

<div class="alert alert-block alert-info"><font color="#000000">
<b>Caveat</b>: One-hot encoding creates the 'multicollinearity' problem, which can lead to arbitrarily large weights in ordinary linear regressions. (For example, if a model involves the weighted feature value $w_0 + w_1 \text{blue} + w_2 \text{green} + w_3 \text{red} + w_4 x_4 + \dots$, then, because exactly one of the three indicators equals 1 in every row, increasing $w_1, w_2, w_3$ by some amount while decreasing $w_0$ by the same amount will not change the weighted feature value.)  Some people suggest dropping one indicator feature.
    
But if our model is trained with regularization (ridge, lasso, elastic net), keeping all indicator features won't be an issue.  In fact, we should keep all indicator features if our model prediction depends on which column we drop.
See more discussions at https://inmachineswetrust.com/posts/drop-first-columns/
</font></div>
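The redundancy behind this caveat can be checked numerically. The sketch below uses made-up colors and weights, writes the intercept as $+w_0$, and shows that the indicator columns add up to the intercept column, that shifting the weights leaves predictions unchanged, and that `drop='first'` removes the redundancy.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['blue'], ['green'], ['red'], ['green'], ['blue']], dtype=object)
onehot = OneHotEncoder().fit_transform(colors).toarray()

# Every row of indicators sums to 1, which duplicates the intercept column.
print(onehot.sum(axis=1))

design = np.column_stack([np.ones(len(onehot)), onehot])   # columns: 1, blue, green, red
rank_full = np.linalg.matrix_rank(design)
print(design.shape[1], rank_full)                          # 4 columns but rank 3

# Raising w1..w3 by c and lowering w0 by c leaves every prediction unchanged.
w = np.array([1.0, 2.0, 3.0, 4.0])        # made-up [w0, w1, w2, w3]
c = 10.0
w_shifted = w + np.array([-c, c, c, c])
same_predictions = np.allclose(design @ w, design @ w_shifted)
print(same_predictions)

# Dropping one indicator restores full column rank.
dropped = OneHotEncoder(drop='first').fit_transform(colors).toarray()
design_dropped = np.column_stack([np.ones(len(dropped)), dropped])
rank_dropped = np.linalg.matrix_rank(design_dropped)
print(design_dropped.shape[1], rank_dropped)
```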

## <font color="#0000E0">Handling Missing Data (Part 2): Imputation</font>

<div class="alert alert-block alert-info"><font color="#000000">

All of the above data preprocessing steps (deleting rows and columns, converting texts into numerical values, and encoding categorical features) may be performed before splitting data into training and test sets.

Imputation replaces missing data with substituted values, e.g., the mean, median, or mode (most frequent value) of the non-missing values. The mean, median, or mode should be calculated based on the training set only, or based on 9 folds during the 10-fold cross validation; otherwise imputed values may contain information from the test or validation set.

To illustrate imputation, we don't split data below, but treat our dataset as the training set.  We will use scikit-learn's <b>SimpleImputer</b>. Its usage is consistent with all other data transformers.   
See https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
</font></div>

<p style="background-color: #FFFF88; font-weight:bold">
    <font color="#0000E0">Essential code: </font>
    <font style="font-family:Consolas"> SimpleImputer( strategy='mean' or 'median' or 'most_frequent')</font>
</p>

In [None]:
X_train_num = df[['cotton','price']]
print(X_train_num,  '  Original data\n')

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X1 = imp.fit_transform(X_train_num)
print(X1, ' Mean imputation\n')

imp = SimpleImputer(strategy='median')
X1 = imp.fit_transform(X_train_num)
print(X1, ' Median imputation')

In [None]:
X_train_cat = df[['size', 'neckstyle', 'color']]
print(X_train_cat,  ' Original data\n')

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='most_frequent')
X2 = imp.fit_transform(X_train_cat)
print(X2, ' Mode imputation')
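
For contrast with the unsplit illustration, here is a minimal sketch of the training-set-only rule on made-up numbers with a hand-made split. The imputer learns its statistic from the training rows and reuses it on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-feature data; the split is fixed by hand for clarity.
X_tr = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_te = np.array([[np.nan], [1000.0]])

imp_split = SimpleImputer(strategy='mean')
imp_split.fit(X_tr)                   # the statistic comes from the training rows only
print(imp_split.statistics_)          # mean of 10, 20, and 30

X_tr_imp = imp_split.transform(X_tr)
X_te_imp = imp_split.transform(X_te)  # the test NaN is filled with the TRAINING mean
print(X_te_imp.ravel())
```

Calling `fit_transform` on the test set instead would compute a new mean from the test rows, which is exactly the leakage the rule is meant to avoid.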

<div class="alert alert-block alert-info"><font color="#000000">    

<b>Caveats</b>: If a nominal feature has missing values, an alternative to imputation is to let 'missing value' be another category. After one-hot encoding, the column corresponding to the 'missing value' category can be dropped. 


Note that SimpleImputer does not allow different imputation strategies for different columns.  We could do this manually, but scikit-learn already has a solution: <b>Column Transformer</b>, which we will introduce in [Streamlining Processes Using Pipelines](#Streamlining-Processes-Using-Pipelines).
</font></div>
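One possible way to treat 'missing value' as its own category is sketched below, assuming the placeholder label 'missing' (a hypothetical name) and a made-up color column. SimpleImputer's `constant` strategy fills in the label before one-hot encoding.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], [np.nan], ['blue'], ['red']], dtype=object)

missing_as_category = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),  # 'missing' is a placeholder label
    OneHotEncoder(handle_unknown='ignore'),
)
codes = missing_as_category.fit_transform(colors).toarray()
categories = list(missing_as_category[-1].categories_[0])
print(categories)
print(codes)
```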

<div class="alert alert-block alert-info"><font color="#000000"> 
   
<b><font color="#0000E0" size=3> More resources </font></b>

Imputing missing values can be a learning process by itself.  For example, we can try to estimate a missing value as a function of other features; we can also apply the idea of k-nearest neighbors: identify k neighbors based on the non-missing values, and then impute the missing value based on the feature values of the neighbors.  More resources can be found at

https://scikit-learn.org/stable/modules/impute.html

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute

</font></div>
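The k-nearest-neighbors idea is available in scikit-learn as `KNNImputer`. The sketch below applies it to a small made-up array with two neighbors; each missing entry is replaced by the average of that feature over the two nearest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_knn = np.array([[1, 2, np.nan],
                  [3, 4, 3],
                  [np.nan, 6, 5],
                  [8, 8, 7]])

# Distances are computed on the features that both rows have observed.
knn_imp = KNNImputer(n_neighbors=2)
X_filled = knn_imp.fit_transform(X_knn)
print(X_filled)
```

Here the missing value in the first row becomes 4 (the average of 3 and 5), and the one in the third row becomes 5.5 (the average of 3 and 8).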

## <font color="#0000E0">Bringing Features onto the Same Scale</font>

<div class="alert alert-block alert-info"><font color="#000000">

Many machine learning models (for example, k-nearest neighbors, support vector machines, and regularized linear models) perform much better if features are on the same scale.
There are two common approaches: 

- Standardization: Rescale a feature so that it has mean 0 and std dev 1: $x^{(i)}_{std} = (x^{(i)} - \mu_x)\ /\ {\sigma_x}$ (equivalent to the z-score)

- Normalization: Rescale a feature to the unit range [0,1]:
$x^{(i)}_{norm} = (x^{(i)} - x_{min})\ /\ ( x_{max} - x_{min}) $

</font></div>
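The two formulas can be checked by hand against scikit-learn's scalers. The sketch below uses a made-up array and assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [6.0, 700.0]])

# Column-wise statistics; np.std defaults to the population formula (ddof=0).
mu, sigma = X_demo.mean(axis=0), X_demo.std(axis=0)
X_std = (X_demo - mu) / sigma
X_norm = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

std_matches = np.allclose(X_std, StandardScaler().fit_transform(X_demo))
norm_matches = np.allclose(X_norm, MinMaxScaler().fit_transform(X_demo))
print(std_matches, norm_matches)
```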

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit_transform(X1)

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit_transform(X1)

## <font color="#0000E0">Streamlining Processes Using Pipelines</font>

<div class="alert alert-block alert-info"><font color="#000000">

The following two blocks of code put together all of the above data preprocessing steps.

- The first block collects the first four lines of "essential code" highlighted in yellow;

- The second block utilizes pipelines to streamline data transformations.
</font></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, linewidth=200)

df = pd.read_csv('clothing_simple.csv')

df = df.replace({'price': -1, 'neckstyle': 99}, np.nan)         # Identify missing data
df = df.dropna(how='all').dropna(how='all', axis='columns')     # Delete some missing data
df['cotton'] = pd.to_numeric(df['cotton'].str.replace('%',''))  # Convert texts into numbers
df['size'] = df['size'].map({'S':1, 'M':2, 'L':3, 'XL':4 })     # Map ordinal categories to numbers

df

<div class="alert alert-block alert-info"><font color="#000000">

<b><font color="#0000E0" size=3> Data preprocessing in the machine learning framework</font></b>

Steps that are independent of how we split data can be performed before splitting data:
 
- Deleting rows and columns that have no valid data
- Converting text into numbers
- Mapping ordinal features into numbers
- Other nonlinear transformations, such as a logarithmic transformation 

Steps that depend on how we split data should be performed after splitting data:

- Imputing missing data by mean, median, or mode
- One-hot encoding (in principle it can be done early, but it is better done after imputation)
- Standardization and normalization 

</font></div>
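One way to respect this ordering automatically is to place the split-dependent steps inside a pipeline and hand the whole pipeline to cross validation. The sketch below uses synthetic data and an arbitrary Ridge model, both chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with roughly 10% of the entries knocked out.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

model = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(), Ridge(alpha=1.0))

# In every fold, the imputer, the scaler, and the model are re-fit on the training part only.
scores = cross_val_score(model, X_syn, y_syn, cv=5)
print(scores)
```

Imputing and scaling the full dataset before calling `cross_val_score` would let each validation fold influence the means and standard deviations used in training.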

<img src="Overall_process.png" width=600>

<div class="alert alert-block alert-info"><font color="#000000">

<b><font color="#0000E0" size=3> Pipeline and Column Transformer</font></b>

We will build a <b>main pipeline</b> with three <b>parallel branches</b> processing three different types of features.
A <b>column transformer</b> will send the right columns to the right branches.  Explanations are provided along with the code below.   
Refer to https://scikit-learn.org/stable/data_transforms.html for more details.


</font></div>

<img src="Pipeline_branches.png" width=500>

In [None]:
# Pipeline and column transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer

# Data transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Learning model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

nom_col = ['color','neckstyle'] # nominal features
ord_col = ['size']              # ordinal features
num_col = ['cotton', 'price']   # numerical features

# We assume df is the training data, otherwise split data here
X_train = df[nom_col + ord_col + num_col]
y_train = df['sales']

# Make three branches (you can make more branches to meet different preprocessing needs)
# Branch for nominal features
nom_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore')
                        )
# Branch for ordinal features
ord_pipe = make_pipeline(SimpleImputer(strategy='median'),
                         StandardScaler()
                        )
# Branch for numerical features
num_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                         MinMaxScaler()
                        )
# Make the main pipe, in which a column transformer sends 'nom_col' into 'nom_pipe', etc.
pipe = make_pipeline(ColumnTransformer( [ ('nom', nom_pipe, nom_col),
                                          ('ord', ord_pipe, ord_col),
                                          ('num', num_pipe, num_col) ] ),
                     # PCA( n_components = 3 ),
                     # Can use any suitable learning model here
                     Lasso( alpha=0.001, max_iter=5000 )
                     # LinearRegression()
                    )

pipe.fit(X_train, y_train)

with np.printoptions(precision=2, floatmode='fixed'):
    print('Predicted y:', pipe.predict(X_train))
    print('   Actual y:', y_train.values)
print('Training score:', pipe.score(X_train,y_train))

print('\n blue    green    red     sty1    sty2    sty3     size   cotton   price')

with np.printoptions(precision=3):
    print(pipe.named_steps.lasso.coef_, '(coef), %.4f (intercept)'%pipe.named_steps.lasso.intercept_)


<div class="alert alert-block alert-info"><font color="#000000">
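To see the data that is fed into the ML model, first <b>comment out</b> the machine learning model in the previous block and re-run it, and then run the following code. (A pipeline offers <b>transform</b> only when its last step is a transformer rather than a learning model.)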
</font></div>

In [None]:
print(' blue    green    red     sty1    sty2    sty3     size   cotton   price')
with np.printoptions(precision=4):
    print( pipe.transform(X_train) )
X_train  # Compare with input data into the main pipe

## <font color="#0000E0">Putting Everything Together</font>

<div class="alert alert-block alert-info"><font color="#000000">

We look at a machine learning project using the <b>Titanic dataset</b> from Kaggle (https://www.kaggle.com/c/titanic).
We will use the above pipeline framework to build our machine learning model, which will be trained and cross validated.
</font></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, linewidth=200)

df = pd.read_csv('titanic_train.csv')

<div class="alert alert-block alert-info"><font color="#000000"> 
   
<b><font color="#0000E0" size=3> Exploratory Data Analysis </font></b>

</font></div>

In [None]:
df.dtypes

In [None]:
df.isnull().sum()  # identify missing values

In [None]:
df.hist(bins=12, figsize=(14,6), layout=(2,-1))
plt.show()

In [None]:
df.describe()  # descriptive statistics of the numerical features

In [None]:
df.describe(include=['O'])    # descriptive statistics of the rest

In [None]:
df['Embarked'].value_counts()  # counts for categorical variables

In [None]:
df['Sex'].value_counts()

In [None]:
df['Gender'] = df['Sex'].map({'male':0, 'female':1 })

In [None]:
import seaborn as sns
sns.pairplot(df)
plt.show()

In [None]:
import seaborn as sns
columns = df.describe().columns
cm = df[columns].corr()   # Correlation matrix
sns.set(font_scale=1)
plt.figure(figsize=(6,6))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=columns, xticklabels=columns)
plt.show()

<div class="alert alert-block alert-info"><font color="#000000"> 
   
<b><font color="#0000E0" size=3> Building Pipeline Framework: Titanic Project </font></b>

</font></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, linewidth=200)

# Pipeline and column transformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer

# Data transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Data splitter and model evaluator
from sklearn.model_selection import train_test_split
from sklearn.model_selection import validation_curve

# Learning models (use one of them or any other model)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('titanic_train.csv')       # import data

nom_col = ['Sex', 'Embarked']               # nominal features
ord_col = ['Pclass']                        # ordinal features
num_col = ['Age', 'SibSp', 'Parch', 'Fare'] # numerical features

# We assume df is the training data, otherwise split data here
X_train = df[nom_col + ord_col + num_col]
y_train = df['Survived']

# Branch for nominal features
nom_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore') )
# Branch for ordinal features
ord_pipe = make_pipeline(SimpleImputer(strategy='median'),
                         StandardScaler())
# Branch for numerical features
num_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                         StandardScaler())
# Main pipe 
pipe = make_pipeline(ColumnTransformer( [ ('nom', nom_pipe, nom_col),
                                          ('ord', ord_pipe, ord_col),
                                          ('num', num_pipe, num_col) ] ),
                     #PCA(n_components=3),
                     SVC(kernel='rbf', C=1000, gamma=1)
                     #LogisticRegression(solver='lbfgs', C=0.01)
                     #DecisionTreeClassifier(criterion='gini', max_depth=3)
                     #RandomForestClassifier(criterion='gini', n_estimators=20, random_state=1)
                     #KNeighborsClassifier(n_neighbors=5, p=2)
                    )

pipe.fit(X_train,y_train)

print('Training score:', pipe.score(X_train,y_train))

#pipe.named_steps.logisticregression.coef_

In [None]:
param_name  = 'svc__gamma'
param_range = np.logspace(-5, 1, 13)
#param_name  = 'logisticregression__C'
#param_range = np.logspace(-4, 2, 13)
#param_name  = 'decisiontreeclassifier__max_depth'
#param_range = np.arange(1,15)
#param_name  = 'randomforestclassifier__max_depth'
#param_range = np.arange(1,15)
#param_name  = 'kneighborsclassifier__n_neighbors'
#param_range = np.arange(1,26,2)


train_scores, val_scores = validation_curve(estimator=pipe, X=X_train, y=y_train, 
                                            cv=10,
                                            param_name=param_name, 
                                            param_range=param_range)

trn_mean = np.mean(train_scores, axis=1)
trn_std  = np.std (train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std  = np.std (val_scores, axis=1)

plt.figure(figsize=(12,6))
plt.plot(param_range, trn_mean, 'bo-',  markersize=5, label='training accuracy')
plt.fill_between(param_range, trn_mean+trn_std, trn_mean-trn_std, alpha=0.25, color='blue')

plt.plot(param_range, val_mean, 'gs--', markersize=5, label='validation accuracy')
plt.fill_between(param_range, val_mean+val_std, val_mean-val_std, alpha=0.15, color='green')

plt.grid()
plt.xscale('log')  # Use this only when param_range = np.logspace(...). Comment this out otherwise.
plt.legend(loc='lower center', fontsize=14)
plt.xlabel(param_name, fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
#plt.savefig('val_curve')
plt.show()
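
The parameter names used above, such as `svc__gamma`, follow a convention: `make_pipeline` names each step after its lower-cased class name, and a double underscore links the step name to one of its parameters. The sketch below shows the same convention with `GridSearchCV` on synthetic data; the data, grid, and `C` value are arbitrary choices for illustration, not tuned for the Titanic project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
step_names = list(toy_pipe.named_steps)
print(step_names)        # step names generated by make_pipeline

# '<step name>__<parameter>' addresses a parameter inside the pipeline.
search = GridSearchCV(toy_pipe,
                      param_grid={'svc__gamma': np.logspace(-3, 1, 5)},
                      cv=5)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

A validation curve shows how the score varies along one parameter, while a grid search picks the best value (or combination of values) by the same cross-validated score.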