# <font color=red>Tutorial 10 - Cross Validation and Pre-Processing </font>

## K-fold Cross-validation

The train-test split we used in the last tutorial to evaluate our model depends on how the data happens to be divided into training and test sets, so a single split can give an overly optimistic or pessimistic score. <br>
Cross-validation reduces this dependence by averaging the score over several different splits.<br>
K-fold cross-validation works as follows:
* Randomly split the dataset into K groups (folds).
* Use one of the groups as the test set and the remaining K-1 groups as the training set.
* Train the model on the training set and score it on the test set.
* Repeat the process until each group has been used as the test set exactly once.
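
The steps above can be sketched by hand with scikit-learn's `KFold`. This is only an illustration of the procedure, not the code used in the rest of the tutorial; the names `X_demo`, `y_demo` and `fold_scores`, and the choice `random_state=0`, are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Load the iris features and labels (the dataset used later in this tutorial)
iris = datasets.load_iris()
X_demo, y_demo = iris.data, iris.target

# Randomly split the sample indices into K = 5 folds
# (random_state=0 is an arbitrary choice that makes the split repeatable)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # One fold is the test set, the remaining four form the training set
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score (accuracy) on the held-out fold
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print('Scores per fold:', np.round(fold_scores, 3))
print('Mean accuracy:', np.mean(fold_scores))
```

Note that `cross_val_score` with an integer `cv` and a classifier uses stratified folds without shuffling by default, so its scores can differ slightly from this shuffled sketch.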

<img src="cross_validation.jpg" width=750 height=750>

Let's try k-fold cross-validation with 5 folds (k=5) on our model from the last tutorial.

### Loading iris data

In [2]:
import pandas as pd
from sklearn import datasets

iris_data = datasets.load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['class'] = iris_data.target

X = iris_df.iloc[:, :-1].values # features
Y = iris_df.iloc[:, 4].values # labels

iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


### Using cross-validation for evaluation

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=5)

# Evaluate the model with 5-fold cross-validation (cv=5)
cv_scores = cross_val_score(knn_cv, X, Y, cv=5) # arguments: (model, features, labels, number of folds)
# For more details, see the documentation: cross_val_score?

In [4]:
import numpy as np
print('Mean : ' + str(np.mean(cv_scores)) + ', STD: ' + str(np.std(cv_scores)))

Mean : 0.9733333333333334, STD: 0.02494438257849294


## **Case Study - Titanic**

In the following classification task, we will use the Titanic dataset that ships with Seaborn. We will try to build a KNN model that predicts whether a passenger survived, based on the provided features.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
df = sns.load_dataset("titanic")
df.head()

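| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |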


In [7]:
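df = df[df.columns[:9]].copy() # keep the first 9 columns; copy so later in-place changes do not act on a view
del df['pclass'] # 'pclass' carries the same information as 'class'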
df.head()

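| | survived | sex | age | sibsp | parch | fare | embarked | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | S | Third |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | S | Third |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First |
| 4 | 0 | male | 35.0 | 0 | 0 | 8.05 | S | Third |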


* **sibsp** - Number of siblings / spouses aboard the Titanic
* **parch** - Number of parents / children aboard the Titanic
* **fare** - Passenger fare
* **embarked** - Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
* **class** - A proxy for socio-economic status (SES)
    * First = Upper
    * Second = Middle
    * Third = Lower

In [8]:
df.shape

(891, 8)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   survived  891 non-null    int64   
 1   sex       891 non-null    object  
 2   age       714 non-null    float64 
 3   sibsp     891 non-null    int64   
 4   parch     891 non-null    int64   
 5   fare      891 non-null    float64 
 6   embarked  889 non-null    object  
 7   class     891 non-null    category
dtypes: category(1), float64(2), int64(3), object(2)
memory usage: 49.8+ KB


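Some values are missing: 'age' has only 714 non-null entries and 'embarked' has 889, out of 891 rows. We have to handle them because our KNN model cannot deal with missing data. Here we simply drop every row that contains a missing value.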

In [10]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   survived  712 non-null    int64   
 1   sex       712 non-null    object  
 2   age       712 non-null    float64 
 3   sibsp     712 non-null    int64   
 4   parch     712 non-null    int64   
 5   fare      712 non-null    float64 
 6   embarked  712 non-null    object  
 7   class     712 non-null    category
dtypes: category(1), float64(2), int64(3), object(2)
memory usage: 45.3+ KB
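
Before dropping rows, it can help to count how many values each column is missing. The sketch below does this on a small made-up table; `toy_df` and its values are hypothetical and only stand in for the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as the Titanic table
toy_df = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0],
    'embarked': ['S', 'C', None, 'S'],
    'fare': [7.25, 71.28, 7.93, 53.10],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
clean_df = toy_df.dropna()
print(clean_df.shape)
```

Dropping rows is the simplest option, but it discards data: in the Titanic table above it reduces 891 rows to 712.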


In [11]:
df.head()

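| | survived | sex | age | sibsp | parch | fare | embarked | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | S | Third |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | S | Third |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First |
| 4 | 0 | male | 35.0 | 0 | 0 | 8.05 | S | Third |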


### Pre-Processing

For our KNN model to perform well, it is important to pre-process the data. Pre-processing includes the following two stages:
1. Prepare non-numerical (categorical) data for the KNN model. Our KNN model does not know how to handle strings: it cannot compute the distance between 'male' and 'female'. Therefore, we have to convert categorical features to a numeric scale.
2. Since KNN is based on a distance metric, it is important to normalize the features to a similar scale. Otherwise, features with large values (such as 'fare') dominate the distance.
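
To see why the scale matters for a distance-based model, consider the rough sketch below. The three passengers and the assumed fare range of 0 to 100 are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

# Three hypothetical passengers described by (sex code, fare)
a = np.array([0.0, 10.0])  # female, fare 10
b = np.array([1.0, 12.0])  # male, fare 12
c = np.array([0.0, 60.0])  # female, fare 60

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(p - q)

# Unscaled: the fare difference dominates, so a looks much closer to b
unscaled_ab = euclidean(a, b)
unscaled_ac = euclidean(a, c)
print(unscaled_ab, unscaled_ac)

# Scale the fare to [0, 1], assuming (hypothetically) fares range from 0 to 100
scale = np.array([1.0, 100.0])
scaled_ab = euclidean(a / scale, b / scale)
scaled_ac = euclidean(a / scale, c / scale)
print(scaled_ab, scaled_ac)
```

With the raw values, the nearest neighbor of `a` is `b`; after scaling, it is `c`. The choice of scale can therefore change which neighbors KNN uses.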

#### 1. Convert categorical features to a numeric scale

There are two main types of categorical features:
1. **Nominal Categories** - Nominal categories are unordered e.g. colours, sex, nationality.
2. **Ordinal Categories** - Ordinal categories are ordered, e.g. school grades, price ranges, salary bands.
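
The difference between the two types shows up in how pandas assigns integer codes. A minimal sketch, using made-up series (`sex`, `size`) and a hypothetical ordering `small < medium < large`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Nominal feature: pandas sorts the categories alphabetically,
# so the codes are arbitrary labels with no meaningful order
sex = pd.Series(['male', 'female', 'male'])
sex_codes = sex.astype('category').cat.codes
print(sex_codes.tolist())

# Ordinal feature: we state the order explicitly, from lowest to highest
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
size = pd.Series(['medium', 'large', 'small'])
size_codes = size.astype(size_type).cat.codes
print(size_codes.tolist())
```

For the ordinal feature the codes follow the declared order, so a larger code really means a larger category.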

In [12]:
df.head()

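| | survived | sex | age | sibsp | parch | fare | embarked | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.25 | S | Third |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.925 | S | Third |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First |
| 4 | 0 | male | 35.0 | 0 | 0 | 8.05 | S | Third |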


In [13]:
df.describe(include='all')

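| | survived | sex | age | sibsp | parch | fare | embarked | class |
|---|---|---|---|---|---|---|---|---|
| count | 712.0 | 712 | 712.0 | 712.0 | 712.0 | 712.0 | 712 | 712 |
| unique | NaN | 2 | NaN | NaN | NaN | NaN | 3 | 3 |
| top | NaN | male | NaN | NaN | NaN | NaN | S | Third |
| freq | NaN | 453 | NaN | NaN | NaN | NaN | 554 | 355 |
| mean | 0.404494 | NaN | 29.642093 | 0.514045 | 0.432584 | 34.567251 | NaN | NaN |
| std | 0.491139 | NaN | 14.492933 | 0.930692 | 0.854181 | 52.938648 | NaN | NaN |
| min | 0.0 | NaN | 0.42 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 25% | 0.0 | NaN | 20.0 | 0.0 | 0.0 | 8.05 | NaN | NaN |
| 50% | 0.0 | NaN | 28.0 | 0.0 | 0.0 | 15.64585 | NaN | NaN |
| 75% | 1.0 | NaN | 38.0 | 1.0 | 1.0 | 33.0 | NaN | NaN |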


We have three categorical features: 'sex', 'embarked' and 'class'. 'sex' and 'embarked' are nominal (unordered), while 'class' is ordinal (First > Second > Third). Let's convert the categorical features to a numerical scale so that our KNN model can process them.

In [14]:
# convert nominal categorical features
df['sex'] = df['sex'].astype('category').cat.codes
df['embarked'] = df['embarked'].astype('category').cat.codes

# convert ordinal categorical features
from pandas.api.types import CategoricalDtype

ordinal_cat_type = CategoricalDtype(categories=['Third', 'Second', 'First'], ordered=True)
df['class'] = df['class'].astype(ordinal_cat_type).cat.codes

df.head()

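| | survived | sex | age | sibsp | parch | fare | embarked | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 22.0 | 1 | 0 | 7.25 | 2 | 0 |
| 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 2 |
| 2 | 1 | 0 | 26.0 | 0 | 0 | 7.925 | 2 | 0 |
| 3 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 2 | 2 |
| 4 | 0 | 1 | 35.0 | 0 | 0 | 8.05 | 2 | 0 |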
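
One caveat: integer codes give a nominal feature with more than two values an artificial order. For 'embarked', the codes C=0, Q=1, S=2 place Queenstown "between" the other two ports as far as the distance metric is concerned. A common alternative, which this tutorial does not use, is one-hot encoding. Below is a minimal sketch with `pd.get_dummies` on a made-up column (`toy_ports` is hypothetical):

```python
import pandas as pd

# Hypothetical toy column with the three ports of embarkation
toy_ports = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

# One 0/1 column per category, with no artificial order between the ports
one_hot = pd.get_dummies(toy_ports, columns=['embarked'], dtype=int)
print(one_hot)
```

Each row has exactly one column set to 1, so every pair of different ports is equally far apart.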


#### 2. Normalize

Min-max scaling is a common feature pre-processing technique that rescales the values of each feature to the range [0,1]. Each value $x_i$ of a feature $X$ is transformed as follows:

$x_i' = \dfrac{x_i - \min(X)}{\max(X) - \min(X)}$
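
As a quick check of the formula, the sketch below applies it by hand to a made-up column of ages and compares the result with scikit-learn's `MinMaxScaler`. The array `ages` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature (a single column of ages)
ages = np.array([[22.0], [38.0], [26.0], [35.0]])

# Apply the formula directly: (x - min) / (max - min)
manual_scaled = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation with scikit-learn
sklearn_scaled = MinMaxScaler().fit_transform(ages)

print(manual_scaled.ravel())
print(np.allclose(manual_scaled, sklearn_scaled))
```

Here the minimum is 22 and the maximum is 38, so 26 maps to (26 - 22) / 16 = 0.25.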

In [21]:
from sklearn import preprocessing

df_columns = df.columns
scaler = preprocessing.MinMaxScaler()
scaled_np_matrix = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_np_matrix, columns=df_columns)
scaled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  712 non-null    float64
 1   sex       712 non-null    float64
 2   age       712 non-null    float64
 3   sibsp     712 non-null    float64
 4   parch     712 non-null    float64
 5   fare      712 non-null    float64
 6   embarked  712 non-null    float64
 7   class     712 non-null    float64
dtypes: float64(8)
memory usage: 44.6 KB


## <font color=blue>**Exercise**</font>

Now that the data is processed, we are ready to run our model:

Run a KNN model to predict whether a passenger survived. Use cross-validation to evaluate your model. You can use K=5 for the number of neighbors in the KNN model and k=5 for the number of folds in the cross-validation.

In [26]:
X = scaled_df.iloc[:, 1:].values # features (every column except 'survived')
Y = scaled_df.iloc[:, 0].values # labels ('survived' is the first column)

# Create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=5)

# Evaluate the model with 5-fold cross-validation (cv=5)
cv_scores = cross_val_score(knn_cv, X, Y, cv=5) # (model, features, labels, number of folds)
# For more details, see the documentation: cross_val_score?
print('Mean : ' + str(np.mean(cv_scores)) + ', STD: ' + str(np.std(cv_scores)))
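
A possible refinement: the tutorial scales the whole dataset before cross-validation, so the minimum and maximum of each test fold influence the scaling of the training data. One way to avoid this is to put the scaler and the model into a scikit-learn pipeline, so that the scaler is fitted on the training folds only. The sketch below uses the iris data as a stand-in because it ships with scikit-learn; `X_demo`, `y_demo` and `pipeline_scores` are names chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: iris ships with scikit-learn, so the sketch runs offline
X_demo, y_demo = datasets.load_iris(return_X_y=True)

# The scaler is re-fitted on the training folds only, inside each CV round
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print('Mean : ' + str(np.mean(pipeline_scores)) + ', STD: ' + str(np.std(pipeline_scores)))
```

The same pattern should carry over to the Titanic features by passing the unscaled feature matrix and the 'survived' labels instead of the iris data.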