# Lab 9: Laboratory Notes - Week 9: More Model Building and Serialising/De-serialising Model Objects

We have covered:

* Data Collection
* Data Wrangling
* Data Preparation & Data Analysis
* Data Visualisation

What do we do with our machine learning models?  This week's laboratory has a few objectives:

* To provide more experience with classifiers
* To vary the training dataset size; more training data is generally better
* To vary the number of features; more is generally better, up to a point
* To serialise/deserialise your model
* To serialise and share it with another analyst

## Data Preparation

We will use the titanic dataset that we used before.

<span style="color:red"># Import libraries  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
from sklearn.tree import DecisionTreeClassifier  
from sklearn.preprocessing import StandardScaler  # Imported but not used here; you can try it to get better results  
from sklearn.model_selection import train_test_split  
%matplotlib inline</span>  

<span style="color:red">dataset = pd.read_csv('titanic.csv')</span>

You can now audit the data to get a first view of the dataset.

<span style="color:red">dataset.shape  
dataset.head()</span>

<span style="color:red">X = dataset.iloc[:,2:11].values  # The other unprocessed features  
y = dataset.iloc[:,1].values # We want the "Survived" as the label as we are predicting if the passenger survived the disaster</span>  

Let's inspect a row of the features X, and a row of the label y.

<span style="color:red">X[0]  
y[3]</span>

We proceed to split our dataset for training and testing.

<span style="color:red">"""We start with 20% for training dataset"""  
"""Do vary the random_state"""  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.8, random_state = 0)</span>

Then we build our model

<span style="color:red">classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)  
classifier.fit(X_train, y_train)</span>

#### Exercise 9.1:

Discuss in class and among yourselves why this raises an error. The first error reported is

ValueError: could not convert string to float: 'Kelly, Mr. James'

What does this mean?  There are a few things that we need to take into consideration when we are doing the model building. Let's look at the features available and decide which we should include or which feature would not make a difference in the prediction. The features

* Passenger ID
* Name
* Ticket

would not have an impact on whether the person survived or not. On the other hand,

* SibSp (siblings and/or spouse)
* Parch (parents and/or children)

may have an influence on survivability. What about

* Fare
* Embarked

We aren't sure.  Let's remove the 3 that will not impact the prediction. We get the names of the columns for clarity.

<span style="color:red">dataset.columns</span>

Instead of subsetting using the column index, let's use the column names.

<span style="color:red">dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket'])</span>

Let's attempt to feed this into the decision tree classifier again.

<span style="color:red">X = dataset.iloc[:,1:8].values  # The other unprocessed features  
y = dataset.iloc[:,0].values</span>

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.8, random_state = 0)</span>

<span style="color:red">classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)  
classifier.fit(X_train, y_train)</span>

#### Exercise 9.2:

You still have errors. If you didn't figure out the actual error in Exercise 9.1, try again.

If you have worked it out from the earlier error message, you will know that the Decision Tree algorithm (DecisionTreeClassifier) needs its input to be of datatype 'float'. Categorical data stored as strings cannot be used as input directly; the input has to be numeric, specifically real numbers, or floating-point numbers in computing terms.
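The requirement can be reproduced on a tiny example. This is a minimal sketch using made-up rows rather than the Titanic file (the second name is invented): fitting on an array that contains a name string raises the same kind of ValueError, while fitting on a purely numeric column works.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up rows for illustration: [Pclass, Name]
X_with_strings = np.array(
    [[3, 'Kelly, Mr. James'], [1, 'Doe, Mrs. Jane']], dtype=object
)
y_toy = np.array([0, 1])

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)

try:
    clf.fit(X_with_strings, y_toy)
    error_message = None
except ValueError as err:
    # scikit-learn tries to convert every value to float and fails on the name
    error_message = str(err)

print(error_message)

# Keeping only the numeric column leaves input that can be converted to float
X_numeric = X_with_strings[:, :1].astype(float)
clf.fit(X_numeric, y_toy)
print(clf.predict(X_numeric))
```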

We will need to represent these categorical data as unique numbers in the columns. Let's look at the column types of the DataFrame.

dataset.dtypes

'Sex', 'Cabin' and 'Embarked' have dtype object, which here means they hold strings, i.e. categorical data. How do we represent them? pandas has functions for this, but first let's look at the possible values in each column.

pd.unique(dataset['Sex'])

pd.unique(dataset['Cabin'])

pd.unique(dataset['Embarked'])
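A quick way to audit all the text columns at once, sketched here on a small made-up DataFrame (the column names mirror the Titanic ones, the values are invented), is to pick out the non-numeric columns and count their distinct values and NaN entries.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the column names follow the Titanic dataset
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Cabin': [np.nan, 'C85', np.nan, 'E46'],
    'Embarked': ['S', 'C', np.nan, 'Q'],
    'Age': [22.0, 38.0, 26.0, np.nan],
})

# Non-numeric columns need encoding before they can be used for modelling
text_columns = [
    col for col in toy.columns if not pd.api.types.is_numeric_dtype(toy[col])
]

# Number of distinct non-NaN values and number of NaN values per column
summary = pd.DataFrame({
    'unique_values': toy[text_columns].nunique(),
    'missing': toy[text_columns].isna().sum(),
})
print(summary)
```

Running the same two counts on the real `dataset` would show early on which columns need attention.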

For 'Sex', it is straightforward enough, with just 2 options. We can simply represent 'female' with 0 and 'male' with 1. We will take this opportunity to introduce the pandas Series method map(), which replaces each value using a lookup such as a dictionary. (We introduced filter() earlier.)

dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

dataset.head()
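One behaviour of map() worth knowing: any value that has no entry in the dictionary becomes NaN, which is why a following astype(int) would fail on such a column. The sketch below uses invented values to show this.

```python
import pandas as pd

# Invented values; 'unknown' is deliberately missing from the mapping
sex = pd.Series(['male', 'female', 'female', 'unknown'])

encoded = sex.map({'female': 0, 'male': 1})
print(encoded.tolist())

# Values without an entry in the dictionary become NaN, so check before casting
unmapped = int(encoded.isna().sum())
print(unmapped)
```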

There are many cabin values, and 'Embarked' has 3 embarkation ports plus NaN.

#### Exercise 9.3:

Should we decide to do something with the NaN, or do we remove these?  (Please discuss and/or ask in class)

Let's find out how many of the rows have NaN in the 'Embarked' column

dataset['Embarked'].isna().sum()

There are only 2! Let's remove those rows.

dataset = dataset.dropna(subset=['Embarked'])
dataset.shape

dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q':2} ).astype(int)
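The integer codes 0, 1 and 2 imply an order among the ports that does not really exist. One common alternative, shown here only as an optional sketch on invented rows, is one-hot encoding with pd.get_dummies(), which creates one indicator column per port. If you try it on the lab data, apply it instead of the map() step above, not after it, and remember that the column positions used later would change.

```python
import pandas as pd

# Invented rows; only the column names follow the Titanic dataset
toy = pd.DataFrame({'Fare': [7.25, 71.28, 8.05], 'Embarked': ['S', 'C', 'Q']})

# One indicator column per port replaces the original 'Embarked' column
encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```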

What about 'Cabin'?

dataset['Cabin'].isna().sum()

That's almost 80% of the dataset.

#### Exercise 9.4:

Should we use 'Cabin'?

Look at the cabin identifiers for 1st class and 2nd class passengers. We could reduce 'Cabin' to whether the passenger has a cabin or not. However, for this class, let's just drop the 'Cabin' column, as it is likely to be strongly correlated with 1st and 2nd class anyway.  (We are taking a shortcut here; we should verify that the correlation really is strong before removing the column.)

dataset = dataset.drop(columns=['Cabin'])
dataset.shape
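The "has a cabin or not" idea, and the check against passenger class, could be sketched as follows. This uses invented rows, and 'HasCabin' is a hypothetical column name; on the lab data it would have to run before 'Cabin' is dropped.

```python
import numpy as np
import pandas as pd

# Invented rows; 'HasCabin' is a hypothetical column name, not part of the dataset
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Cabin': ['C85', 'E46', np.nan, np.nan, 'G6', np.nan],
})

# 1 if a cabin is recorded for the passenger, 0 if the value is NaN
toy['HasCabin'] = toy['Cabin'].notna().astype(int)

# Share of passengers with a recorded cabin in each class
cabin_rate_by_class = toy.groupby('Pclass')['HasCabin'].mean()
print(cabin_rate_by_class)
```

If the share differs sharply between classes, 'Pclass' already carries most of the information and dropping 'Cabin' loses little.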

dataset.head(30)

On checking, we have not cleared all the NaN values, and there appear to be many left.

dataset['Age'].isna().sum()

We could impute the missing values, for example by using the standard deviation to pick values at a suitable distance from the mean. However, since removing 177 out of 889 rows still leaves us with 700+ entries, let's remove those rows.  (Again, we are taking a shortcut here; we should check whether imputing values builds better models.)

dataset = dataset.dropna(subset=['Age'])
dataset.shape
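As an alternative to dropping the rows, imputation could look roughly like this. It is a sketch on invented ages, not a recommendation: option 1 fills with the median, and option 2 is one possible reading of the "deviation from the mean" idea, drawing each replacement at random from the interval mean ± one standard deviation.

```python
import numpy as np
import pandas as pd

# Invented ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])

# Option 1: replace every NaN with the median of the known ages
filled_median = ages.fillna(ages.median())

# Option 2: draw each replacement at random from [mean - std, mean + std]
rng = np.random.default_rng(0)
mean, std = ages.mean(), ages.std()
missing = ages.isna()
filled_random = ages.copy()
filled_random[missing] = rng.uniform(mean - std, mean + std, size=missing.sum())

print(filled_median.tolist())
print(filled_random.round(1).tolist())
```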

Let's try again! Let's hope we are ready! (You can also try this without removing those Age rows that have NaN and look at the new error message).

X = dataset.iloc[:,1:7].values  # Columns 1 to 6, i.e. 6 features; the last column (index 7) is not included
y = dataset.iloc[:,0].values

## Training Data - More is better

Repeat the following for various training and testing dataset splits

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.8, random_state = 0
)
classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

Let's test our model, which was trained on 20% of the 712 rows (about 142 rows).

y_pred = classifier.predict(X_test)  # Predict the test set results

from sklearn.metrics import confusion_matrix # You can include this when you imported all the libraries earlier
cm = confusion_matrix(y_test, y_pred)
cm

Accuracy = (TP + TN) / Population = 428 / 570 ≈ 75%
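The accuracy formula can be computed directly from a confusion matrix. The sketch below uses invented counts, not the lab's results, and relies on the layout returned by sklearn's confusion_matrix for binary labels: rows are the true labels and columns are the predictions.

```python
import numpy as np

# Invented counts, not the lab's results.
# Rows are true labels (0, 1); columns are predicted labels (0, 1)
cm_example = np.array([[50, 10],   # true 0: TN, FP
                       [5, 35]])   # true 1: FN, TP

tn, fp, fn, tp = cm_example.ravel()

# Correct predictions sit on the diagonal of the matrix
accuracy = (tp + tn) / cm_example.sum()
print(accuracy)
```

On your own matrix, replace `cm_example` with `cm`.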

Go back and repeat the training and testing split to use 80% of the data for training.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0
)
classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

The total is smaller because the testing dataset is now just 20% of the 712 rows. The accuracy is now 112 / 143 ≈ 78%.

Although it does not seem like a large improvement, more training data will usually result in a better model. Do note that the model accuracy may plateau around 75% - 78%. You can try to build your best model.
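Repeating the split by hand gets tedious, so the comparison can be put in a loop. The following is a sketch under stated assumptions: it uses synthetic data from make_classification as a stand-in so that it runs without titanic.csv, and the names X_demo, y_demo and results are made up for the example. Averaging over several random_state values gives a steadier picture than a single split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without titanic.csv;
# replace X_demo and y_demo with your own X and y from the steps above
X_demo, y_demo = make_classification(
    n_samples=700, n_features=6, n_informative=4, random_state=0
)

results = {}
for test_size in (0.8, 0.5, 0.2):
    scores = []
    for seed in range(5):  # several random splits per size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed
        )
        model = DecisionTreeClassifier(criterion='entropy', random_state=0)
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    # Mean accuracy over the splits for this training fraction
    results[round(1 - test_size, 1)] = sum(scores) / len(scores)

for train_fraction, mean_accuracy in results.items():
    print(f"train fraction {train_fraction:.1f}: mean accuracy {mean_accuracy:.3f}")
```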

Beyond the general guideline that more data results in a better model, more features will similarly tend to result in better models. The caveat is that this depends heavily on whether the features actually influence the prediction; as a commercial rule of thumb, we usually use about 12 - 18 features. Let's illustrate this with fewer features. We will assume that where they embarked from has little impact, and how much they paid also has little impact. So, we choose the last 3 features of the previous selection. (With the standard Titanic column order, the slice 4:7 below selects SibSp, Parch and Fare, so 'Fare' is still included; adjust the indices if you want to leave it out.)

X = dataset.iloc[:,4:7].values  # We now only have 3 features
y = dataset.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0
)

classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

#### Exercise 9.5:

Build your best model!  Hint: you can probably get better results by grouping 'Age' into ranges (e.g., 0 - 18, 19 - 25, 26 - 35, etc.).  This discretises the column, so it becomes categorical data.  This is also a form of normalisation.  The other column that you may want to process is 'Fare'.
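One way to group ages into ranges is pd.cut(). This is a minimal sketch on invented ages; the bin edges and the column name 'AgeBand' are assumptions made for the example, so choose your own when you experiment.

```python
import pandas as pd

# Invented ages; 'AgeBand' is a hypothetical column name
toy = pd.DataFrame({'Age': [0.42, 18, 19, 25, 26, 40, 80]})

# Bin edges are an assumption; each bin includes its right edge, e.g. (18, 25]
bin_edges = [0, 18, 25, 35, 50, 100]

# labels=False returns the integer index of the bin instead of an interval
toy['AgeBand'] = pd.cut(toy['Age'], bins=bin_edges, labels=False)
print(toy)
```

The same call with different edges could be tried on 'Fare'.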

## Serialisation and De-serialisation

Equipped with your best model, let's exchange your model with your classmates WITHOUT disclosing your training parameters!

import pickle # It should already be included in your Python installation.
filename = 'final_model.sav'
with open(filename, 'wb') as f: # The 'w' is for write, and the 'b' is for binary
    pickle.dump(classifier, f) # The file is closed automatically after the block

You can also view the serialised bytes using the dumps() function instead of dump().

pickle.dumps(classifier)

You now have a file that is serialised called 'final_model.sav'. You can share this with your classmate and ask them to test it.

with open('final_model.sav', 'rb') as f: # The file name may differ; make sure you don't load your own model again
    friends_classifier = pickle.load(f)
y_pred = friends_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

Note that you can use any other test dataset that you have created.
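The full round trip can be checked in one go. The sketch below fits a model on a tiny invented dataset, writes it to a temporary folder (the file name demo_model.sav is made up), loads it back and confirms that the restored model gives the same predictions. Bear in mind that unpickling can run arbitrary code, so only load files from people you trust, and that a pickled scikit-learn model is best loaded with the same scikit-learn version that created it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset, just enough to fit a model
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y_demo = np.array([0, 1, 0, 1])
model = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X_demo, y_demo)

# Write to a temporary folder so the sketch does not overwrite final_model.sav
path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored object should behave like the original
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```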

#### Exercise 9.6:

Did your friend's model have better accuracy? Was it the same for them as well?

In this laboratory, you have seen that:

* Data preparation takes most of the effort.
* The map() function can re-encode the values of a column.
* Learning algorithms generally need the input to be in numeric format.
* More data will generally result in better models as the algorithm has more to learn from.
* More features will also generally result in better models, but there are limits to how many features help.
* Normalisation, in this case categorisation, helps.
* Serialisation and de-serialisation of model objects lets you share a model without disclosing how you built it, or simply keep it for later use.

Hope you have had fun so far.  We will have one more laboratory session where we will move away from using the Jupyter Notebook and will serve our model as a web service.

## My code part

#### Understanding Clustering (k-means)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# This is a library we import to run the K-means clustering algorithm as a blackbox
# For more information please see: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
from sklearn.cluster import KMeans