# Lab 9: Laboratory Notes - Week 9: More Model Building and Serialising/De-serialising Model Objects

We have covered:

* Data Collection
* Data Wrangling
* Data Preparation & Data Analysis
* Data Visualisation

What do we do with our machine learning models?  This week's laboratory has a few objectives:

* To provide more experience with classifiers
* To vary the training dataset size (generally, the more training data the better)
* To vary the number of features (more is generally better, up to a point)
* To serialise/deserialise your model
* To serialise and share it with another analyst

## Data Preparation

We will use the Titanic dataset that we used before.

<span style="color:red">"""import libraries"""  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
from sklearn.tree import DecisionTreeClassifier  
from sklearn.preprocessing import StandardScaler  # imported here but not used; you can try it to get better results  
from sklearn.model_selection import train_test_split  
%matplotlib inline</span>  

<span style="color:red">dataset = pd.read_csv('titanic.csv')</span>

You can proceed to do the data auditing to view the dataset

<span style="color:red">dataset.shape  
dataset.head()</span>

<span style="color:red">X = dataset.iloc[:,2:11].values  # The other unprocessed features  
y = dataset.iloc[:,1].values # We want the "Survived" as the label as we are predicting if the passenger survived the disaster</span>  

Let's inspect a row of the features X, and a row of the label y.

<span style="color:red">X[0]  
y[3]</span>

We proceed to split our dataset for training and testing.

<span style="color:red">"""We start with 20% for training dataset"""  
"""Do vary the random_state"""  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.8, random_state = 0)</span>

Then we build our model

<span style="color:red">classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)  
classifier.fit(X_train, y_train)</span>

#### Exercise 9.1:

Discuss in class and among yourselves why this raises an error. The first error reported is:

ValueError: could not convert string to float: 'Kelly, Mr. James'

What does this mean?  There are a few things that we need to take into consideration when we are doing the model building. Let's look at the features available and decide which we should include or which feature would not make a difference in the prediction. The features

* Passenger ID
* Name
* Ticket

would not have an impact on whether the person survived or not. On the other hand,

* SibSp (siblings and/or spouse)
* Parch (parents and/or children)

may have an influence on survivability. What about

* Fare
* Embarked

We aren't sure.  Let's remove the 3 that will not impact the prediction. We get the names of the columns for clarity.

<span style="color:red">dataset.columns</span>

Instead of subsetting using the column index, let's use the column names.

<span style="color:red">dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket'])</span>

Let's attempt to feed this into the decision tree classifier again.

<span style="color:red">X = dataset.iloc[:,1:8].values  # The other unprocessed features  
y = dataset.iloc[:,0].values</span>

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.8, random_state = 0)</span>

<span style="color:red">classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)  
classifier.fit(X_train, y_train)</span>

#### Exercise 9.2:

You still have errors. If you didn't figure out the actual error in Exercise 9.1, try again.

If you have figured it out from the earlier error message, the Decision Tree algorithm (DecisionTreeClassifier) needs its input to be of datatype 'float'. This means you cannot pass categorical text data as input; the input has to be numeric, specifically real numbers (floating point numbers in computing terms).
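To see this behaviour in isolation, here is a minimal sketch on a made-up three-row array (not the Titanic data). The names X_toy and y_toy are hypothetical, and the exact wording of the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up rows: one text column and one numeric column (illustrative only).
X_toy = np.array([['male', 22.0], ['female', 38.0], ['male', 35.0]], dtype=object)
y_toy = np.array([0, 1, 0])

try:
    DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
    error_message = None
except ValueError as err:
    # scikit-learn tries to convert every value to float and fails on the text.
    error_message = str(err)

print(error_message)
```

The numeric column alone would be accepted; it is the text column that triggers the failure.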

We will need to represent these categorical data as unique numbers in the columns. Let's look at the column types of the DataFrame.

<span style="color:red">dataset.dtypes</span>

We have 'Sex', 'Cabin' and 'Embarked' as object, which here means they hold text, that is, categorical data. How do we represent them? We have some Python functions for this, but first let's look at the possible values in the columns.

<span style="color:red">pd.unique(dataset['Sex'])</span>

<span style="color:red">pd.unique(dataset['Cabin'])</span>

<span style="color:red">pd.unique(dataset['Embarked'])</span>

For 'Sex', it seems straightforward enough, with just 2 options. We can simply represent 'female' with 0 and 'male' with 1. We will take this opportunity to also introduce the function map(), which maps the values. (We introduced filter() in an earlier lab.)

<span style="color:red">dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)  
dataset.head()</span>
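As a small sketch of how map() behaves with a dictionary, the example below uses a made-up Series (not the Titanic column). The value 'unknown' is a deliberate assumption to show what happens when a value is missing from the dictionary.

```python
import pandas as pd

# Made-up values; 'unknown' is deliberately missing from the mapping.
sex = pd.Series(['male', 'female', 'unknown'])
encoded = sex.map({'female': 0, 'male': 1})

print(encoded.tolist())       # values not in the dictionary become NaN
print(encoded.isna().sum())   # number of unmapped values
```

Because unmapped values become NaN, a following .astype(int) would raise an error if any value were left out of the dictionary, so it is worth checking the unique values first.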

There are many cabin types and there are 3 embarkation ports and an NaN.

#### Exercise 9.3:

Should we decide to do something with the NaN, or do we remove these?  (Please discuss and/or ask in class)

Let's find out how many of the rows have NaN in the 'Embarked' column

<span style="color:red">dataset['Embarked'].isna().sum()</span>

There seem to be only 2! Why don't we remove those rows?

<span style="color:red">dataset = dataset.dropna(subset=['Embarked'])  
dataset.shape</span>

<span style="color:red">dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q':2} ).astype(int)</span>

What about 'Cabin'?

<span style="color:red">dataset['Cabin'].isna().sum()</span>

That's about 77% of the dataset (687 of 889 rows).

#### Exercise 9.4:

Should we use 'Cabin'?

Look at the 1st class cabin identifier and the 2nd class cabin identifier. We can think of 'Cabin' as whether the passenger has a cabin or not. However, for this class, let's just drop the column 'Cabin', as it would have a very strong correlation with 1st and 2nd class anyway. (We are taking a shortcut here; properly, we should verify that the correlation is strong before removing the column.)

<span style="color:red">dataset = dataset.drop(columns=['Cabin'])  
dataset.shape</span>
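For reference, the alternative mentioned above (treating 'Cabin' as "has a cabin or not") could be sketched as follows. The data is made up and the name has_cabin is hypothetical; this lab does not use it.

```python
import numpy as np
import pandas as pd

# Made-up cabin values; NaN stands for "no cabin recorded".
cabin = pd.Series(['C85', np.nan, 'C123', np.nan])

# 1 if a cabin identifier is present, 0 otherwise.
has_cabin = cabin.notna().astype(int)
print(has_cabin.tolist())
```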

<span style="color:red">dataset.head(30)</span>

On checking, we have not cleared all the NaN values, and it looks like there are many.

<span style="color:red">dataset['Age'].isna().sum()</span>

We could impute the missing values, for example by using the standard deviation to work out a suitable deviation from the mean. However, since removing 177 out of 889 rows still leaves us with 700+ entries, let's remove those. (Again, we are taking a shortcut here; we should check whether imputing values builds better models.)

<span style="color:red">dataset = dataset.dropna(subset=['Age'])  
dataset.shape</span>
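To compare the two options on a small scale, the sketch below uses a made-up 'Age' column. Filling with the median is a simpler choice swapped in for illustration; it is not the mean and standard deviation approach mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries (illustrative only).
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

dropped = toy.dropna(subset=['Age'])              # the shortcut used in this lab
imputed = toy['Age'].fillna(toy['Age'].median())  # one simple imputation choice

print(len(dropped), imputed.tolist())
```

Dropping keeps only the rows with a known age, while imputing keeps every row at the cost of inventing plausible values.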

Let's try again! Let's hope we are ready! (You can also try this without removing those Age rows that have NaN and look at the new error message).

<span style="color:red">X = dataset.iloc[:,1:7].values  # The dataset now has 7 features; this slice keeps the first 6 ('Embarked' is left out)  
y = dataset.iloc[:,0].values</span>

## Training Data - More is better

Repeat the following for various training and testing dataset splits

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(</span>  
<p style="margin-left: 40px;"><span style="color:red">X, y, test_size = 0.8, random_state = 0</span></p>  
<span style="color:red">)  
classifier = DecisionTreeClassifier(</span>  
<p style="margin-left: 40px;"><span style="color:red">criterion = 'entropy', random_state = 0</span></p>  
<span style="color:red">)  
classifier.fit(X_train, y_train)</span>

Let's try to test our model that used 20% of 712 rows to train, about 142 rows.

<span style="color:red">""" Predicting the Test set results"""  
y_pred = classifier.predict(X_test)</span>

<span style="color:red">from sklearn.metrics import confusion_matrix  # You can include this with the other imports earlier  
cm = confusion_matrix(y_test, y_pred)  
cm</span>

Accuracy = (TP + TN) / Population = 428 / 570 = 75%
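The same arithmetic can be checked in code. This sketch assumes the confusion matrix this run produced, [[280, 54], [88, 148]], and scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the 20%-training run above.
cm = np.array([[280, 54],
               [88, 148]])

correct = np.trace(cm)   # TN + TP sit on the diagonal
total = cm.sum()         # every test row
accuracy = correct / total

print(correct, total, round(accuracy, 4))
```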

Go back and repeat the training and testing split to use 80% of the data for training.

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(</span>  
<p style="margin-left: 40px;"><span style="color:red">X, y, test_size = 0.2, random_state = 0</span></p>  
<span style="color:red">)  
classifier = DecisionTreeClassifier(</span>  
<p style="margin-left: 40px;"><span style="color:red">criterion = 'entropy', random_state = 0</span></p>  
<span style="color:red">)  
classifier.fit(X_train, y_train)</span>

<span style="color:red">y_pred = classifier.predict(X_test)  
cm = confusion_matrix(y_test, y_pred)  
cm</span>

The total is smaller because the testing dataset is now just 20% of the 712 rows. The accuracy is now 112 / 143 = 78%.

Although it does not seem like a large improvement, more training data will usually result in a better model. Do note that the model accuracy may plateau around 75% - 78%. You can try to build your best model.
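Repeating the split by hand gets tedious, so here is a sketch of a loop over several split sizes. It runs on synthetic stand-in data from make_classification, not the Titanic dataset, and the names X_demo, y_demo and results are hypothetical. The accuracies it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=700, n_features=6, random_state=0)

results = {}
for test_size in (0.8, 0.6, 0.4, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0
    )
    model = DecisionTreeClassifier(criterion='entropy', random_state=0)
    model.fit(X_tr, y_tr)
    results[test_size] = accuracy_score(y_te, model.predict(X_te))

for test_size, acc in results.items():
    print(f"train fraction {1 - test_size:.0%}: accuracy {acc:.3f}")
```

To use it on the lab data, replace X_demo and y_demo with your own X and y, and vary random_state as well to see how much the result depends on the particular split.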

Besides the general guideline that more data results in a better model, more features will similarly tend to result in better models. The caveat is that the useful number of features depends heavily on whether they influence the prediction; as a commercial rule of thumb, we usually use about 12 - 18 features. Let's illustrate this with fewer features. We will assume that where they embarked from has little impact and keep only 3 features. Note that the slice below (columns 4 to 6) selects SibSp, Parch and Fare, so Fare is still included.

<span style="color:red">X = dataset.iloc[:,4:7].values  # We now only have 3 features  
y = dataset.iloc[:,0].values</span>

<span style="color:red">X_train, X_test, y_train, y_test = train_test_split(</span>
<p style="margin-left: 40px;"><span style="color:red">X, y, test_size = 0.2, random_state = 0</span></p>  
<span style="color:red">)</span>

<span style="color:red">classifier = DecisionTreeClassifier(</span>
<p style="margin-left: 40px;"><span style="color:red">criterion = 'entropy', random_state = 0</span></p>  
<span style="color:red">)  
classifier.fit(X_train, y_train)</span>

<span style="color:red">y_pred = classifier.predict(X_test)  
cm = confusion_matrix(y_test, y_pred)  
cm</span>

#### Exercise 9.5:

Build your best model! Hint: you can probably get better results by categorising 'Age' into ranges (e.g., 0 - 18, 19 - 25, 26 - 35, etc.). The feature is then discretised and hence becomes categorical data. This is also a form of normalisation. The other column that you may want to process is the 'Fare' feature.
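One possible way to do the binning is pandas' cut() function. The sketch below uses made-up ages, and the bin edges and integer labels are assumptions for illustration; choose your own for the exercise.

```python
import pandas as pd

ages = pd.Series([5, 18, 19, 30, 70])   # made-up ages

# Bins are right-inclusive by default: (0, 18], (18, 25], (25, 35], (35, 100]
age_group = pd.cut(ages, bins=[0, 18, 25, 35, 100], labels=[0, 1, 2, 3]).astype(int)
print(age_group.tolist())
```

Note that an age of exactly 18 falls in the first group because each bin includes its upper edge.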

## Serialisation and De-serialisation

Equipped with your best model, let's exchange your model with your classmates WITHOUT disclosing your training parameters!

<span style="color:red">import pickle  # It should already be included in your Python installation.  
filename = 'final_model.sav'  
pickle.dump(classifier, open(filename, 'wb')) # The 'w' is for write, and the 'b' is for binary</span>  

You can also view it using the dumps() function instead of dump().

<span style="color:red">pickle.dumps(classifier)</span>  

You now have a file that is serialised called 'final_model.sav'. You can share this with your classmate and ask them to test it.

<span style="color:red">friends_classifier = pickle.load(open('final_model.sav','rb'))  # name may differ and make sure you don't load your own again  
y_pred = friends_classifier.predict(X_test)  
cm = confusion_matrix(y_test, y_pred)  
cm</span>

Note that you can use any other test dataset that you have created.
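The round trip can also be written with `with` blocks so that the files are closed automatically. This is a sketch on a tiny made-up training set; the file name demo_model.sav and the temporary directory are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set (illustrative only).
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_demo = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo_model.sav')

with open(path, 'wb') as f:   # the file is closed automatically
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two practical cautions: loading a pickle can execute arbitrary code, so only open files from people you trust, and a model is most reliably loaded with the same scikit-learn version that saved it.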

#### Exercise 9.6:

Did your friend's model have better accuracy? Was it the same for them as well?

In this laboratory, you will have experienced:

* Data preparation takes most of the effort.
* The map() function.
* Learning algorithms generally need the input to be in numeric format.
* More data will generally result in better models as the algorithm has more to learn from.
* More features will also generally result in better models, but there are limits to how many features help.
* Normalisation (in this case, categorisation) helps.
* Serialisation and de-serialisation of model objects to share without disclosing how you built the model, or simply when you want to use it later.

Hope you have had fun so far.  We will have one more laboratory session where we will move away from using the Jupyter Notebook and will serve our model as a web service.

## My code part

#### Data Preparation

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler # import here but not used, you can try to use it to get better results
from sklearn.model_selection import train_test_split
%matplotlib inline

In [2]:
dataset = pd.read_csv('data/titanic.csv')

In [5]:
dataset.shape

(891, 12)

In [6]:
dataset.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:

X = dataset.iloc[:,2:11].values  # The other unprocessed features
y = dataset.iloc[:,1].values # We want the "Survived" as the label as we are predicting if the passenger survived the disaster

In [9]:
X[0]

array([3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0, 'A/5 21171',
       7.25, nan], dtype=object)

In [10]:
y[3]

np.int64(1)

In [11]:
# We start with 20% for training dataset
# Do vary the random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.8, random_state = 0
)

In [12]:
classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

ValueError: could not convert string to float: 'Kelly, Mr. James'

#### Exercise 9.1:

In [13]:
dataset.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [14]:
dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket'])

In [15]:
X = dataset.iloc[:,1:8].values  # The other unprocessed features
y = dataset.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.8, random_state = 0
)

classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

ValueError: could not convert string to float: 'male'

#### Exercise 9.2:

In [16]:
dataset.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [17]:
pd.unique(dataset['Sex'])

pd.unique(dataset['Cabin'])

pd.unique(dataset['Embarked'])

array(['S', 'C', 'Q', nan], dtype=object)

In [18]:
dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

dataset.head()

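| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | NaN | S |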


#### Exercise 9.3:

In [19]:
dataset['Embarked'].isna().sum()

np.int64(2)

In [20]:
dataset = dataset.dropna(subset=['Embarked'])
dataset.shape

(889, 9)

In [21]:
dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q':2} ).astype(int)

In [22]:
dataset['Cabin'].isna().sum()

np.int64(687)

#### Exercise 9.4:

In [23]:
dataset = dataset.drop(columns=['Cabin'])
dataset.shape

(889, 8)

In [24]:
dataset.head(30)

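| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 0 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 0 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 0 |
| 5 | 0 | 3 | 1 | NaN | 0 | 0 | 8.4583 | 2 |
| 6 | 0 | 1 | 1 | 54.0 | 0 | 0 | 51.8625 | 0 |
| 7 | 0 | 3 | 1 | 2.0 | 3 | 1 | 21.075 | 0 |
| 8 | 1 | 3 | 0 | 27.0 | 0 | 2 | 11.1333 | 0 |
| 9 | 1 | 2 | 0 | 14.0 | 1 | 0 | 30.0708 | 1 |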


In [25]:
dataset['Age'].isna().sum()

np.int64(177)

In [26]:
dataset = dataset.dropna(subset=['Age'])
dataset.shape

(712, 8)

In [27]:
X = dataset.iloc[:,1:7].values  # The dataset now has 7 features; this slice keeps the first 6 ('Embarked' is left out)
y = dataset.iloc[:,0].values

#### Training Data - More is better

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.8, random_state = 0
)
classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

In [29]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [30]:
from sklearn.metrics import confusion_matrix # You can include this when you imported all the libraries earlier
cm = confusion_matrix(y_test, y_pred)
cm

array([[280,  54],
       [ 88, 148]])

In [31]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0
)
classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[61, 21],
       [10, 51]])

In [32]:
X = dataset.iloc[:,4:7].values  # We now only have 3 features
y = dataset.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 0
)

classifier = DecisionTreeClassifier(
    criterion = 'entropy', random_state = 0
)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[65, 17],
       [32, 29]])

#### Exercise 9.5:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
dataset = pd.read_csv('data/titanic.csv')

# Drop non-relevant columns
dataset = dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Encode categorical variables
dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])  # assign back instead of inplace=True on a column
dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Categorize Age
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
age_bins = [0, 18, 25, 35, 45, 55, 65, 100]
age_labels = [0, 1, 2, 3, 4, 5, 6]
dataset['Age'] = pd.cut(dataset['Age'], bins=age_bins, labels=age_labels).astype(int)

# Categorize Fare using quantiles
dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
dataset['Fare'] = pd.qcut(dataset['Fare'], 4, labels=[0, 1, 2, 3]).astype(int)

# Prepare feature matrix and target variable
X = dataset.drop(columns=['Survived'])
y = dataset['Survived']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train classifier with a limited tree depth (max_depth=5)
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42)
classifier.fit(X_train, y_train)

# Evaluate model
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", cm)

# Perform cross-validation
cv_scores = cross_val_score(classifier, X, y, cv=5)
print("Cross-validation accuracy:", np.mean(cv_scores))


Accuracy: 0.7932960893854749
Confusion Matrix:
 [[90 15]
 [22 52]]
Cross-validation accuracy: 0.7979850605737242


#### Serialisation and De-serialisation

In [3]:
import pickle # It should already be included in your Python installation.
filename = 'final_model.sav'
pickle.dump(classifier, open(filename, 'wb')) # The 'w' is for write, and the 'b' is for binary

In [4]:
pickle.dumps(classifier)

b'\x80\x04\x95\xcb\x14\x00\x00\x00\x00\x00\x00\x8c\x15sklearn.tree._classes\x94\x8c\x16DecisionTreeClassifier\x94\x93\x94)\x81\x94}\x94(\x8c\tcriterion\x94\x8c\x07entropy\x94\x8c\x08splitter\x94\x8c\x04best\x94\x8c\tmax_depth\x94K\x05\x8c\x11min_samples_split\x94K\x02\x8c\x10min_samples_leaf\x94K\x01\x8c\x18min_weight_fraction_leaf\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x0cmax_features\x94N\x8c\x0emax_leaf_nodes\x94N\x8c\x0crandom_state\x94K*\x8c\x15min_impurity_decrease\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\x0cclass_weight\x94N\x8c\tccp_alpha\x94G\x00\x00\x00\x00\x00\x00\x00\x00\x8c\rmonotonic_cst\x94N\x8c\x11feature_names_in_\x94\x8c\x16numpy._core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x07\x85\x94h\x18\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c\x06Pclass\x94\x8c\x03Sex\x94\x8c\x03Age\x94\x8c

In [5]:
friends_classifier = pickle.load(open('final_model.sav','rb')) # name may differ and make sure you don't load your own again
y_pred = friends_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[90, 15],
       [22, 52]])

#### Exercise 9.6: