# Execute the code below

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors  import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score, r2_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
titanic = sns.load_dataset('titanic')

In [None]:
titanic.shape

(891, 15)

In [None]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
print(len(titanic))

891


**We are going to learn machine learning concepts with the Titanic dataset, which describes the passengers of one of the most infamous shipwrecks in history.**

Fun fact: alongside MNIST and Iris, this is one of the most famous datasets in machine learning!

# Data preprocessing


Machine learning is nothing without careful data preprocessing.  
Execute the code below, which modifies the Titanic dataset by:

* Selecting a subset of useful features (i.e. columns)
* Removing rows with NaN data
* Using `factorize` to recode the features `sex` (gender) and `embark_town` (the harbour city) into numerical data, because ML needs (and loves) numerical data.
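To make the `factorize` step concrete, here is a minimal sketch on an invented toy column (the values are assumptions for illustration, not Titanic rows). It shows that codes are assigned in order of first appearance and that a missing value receives the code `-1`, which is one reason to remove NaN rows before encoding.

```python
import pandas as pd

# Toy column standing in for titanic['sex'] (invented values, one missing)
sex = pd.Series(['male', 'female', None, 'male'])

# factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = sex.factorize()
print(codes)          # one integer code per row, -1 marks the missing value
print(list(uniques))  # the label behind each non-negative code
```

Keeping `uniques` around is handy later, when you need to translate a code back into its original label.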

In [None]:
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'embark_town']]
titanic.head()

Unnamed: 0,survived,pclass,sex,age,embark_town
0,0,3,male,22.0,Southampton
1,1,1,female,38.0,Cherbourg
2,1,3,female,26.0,Southampton
3,1,1,female,35.0,Southampton
4,0,3,male,35.0,Southampton


In [None]:
titanic['embark_town'].unique()

array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

In [None]:
def filling_rate(dataframe):
    # Percentage of non-missing values in each column of the dataframe passed in
    percentage = (1 - dataframe.isna().sum() / len(dataframe)) * 100
    return percentage
filling_rate(titanic)

survived       100.000000
pclass         100.000000
sex            100.000000
age             80.134680
embark_town     99.775533
dtype: float64

In [None]:
# It's up to you:
titanic = titanic.dropna()  # reassign instead of inplace=True: titanic is a slice of the original frame

In [None]:
# filling_rate is already defined above, so it only needs to be called again
filling_rate(titanic)

survived       100.0
pclass         100.0
sex            100.0
age            100.0
embark_town    100.0
dtype: float64

In [None]:
titanic['embark_town'].unique()

array(['Southampton', 'Cherbourg', 'Queenstown'], dtype=object)

In [None]:
titanic['sex'] = titanic['sex'].factorize()[0]
titanic['sex']


In [None]:
# Encode each town as an integer code (0 = Southampton, 1 = Cherbourg, 2 = Queenstown)
titanic['embark_town'] = titanic['embark_town'].factorize()[0]
titanic['embark_town']

0      0
1      1
2      0
3      0
4      0
      ..
885    2
886    0
887    0
889    1
890    2
Name: embark_town, Length: 712, dtype: int64

# KNN classification with Scikit-Learn

## Train Test Split Data


First you have to divide the titanic dataframe into 2 separate objects:
  - `y` with the feature to be predicted (i.e. `survived`)
  - `X` with the other features that will be used by the model (all numeric features + `sex` recoded with factorize + `embark_town` recoded with factorize)

Then split `X` and `y` into training and testing sets:
* Use 75% of the data for training, the rest for testing
* Please split the data with `random_state = 36`

[See the previous quest on train-test split if needed](https://odyssey.wildcodeschool.com/quests/581)

In [None]:
# Your code here
from sklearn.model_selection import train_test_split
y = titanic['survived']
X = titanic[['pclass', 'sex', 'age', 'embark_town']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=36, train_size=0.75)
print("The length of the initial dataset is :", len(X))
print("The length of the train dataset is   :", len(X_train))
print("The length of the test dataset is    :", len(X_test))

The length of the initial dataset is : 712
The length of the train dataset is   : 534
The length of the test dataset is    : 178


## Model initialization

CONGRATS!!! You are going to develop your first ML model for KNN classification.  
For that, please create a `model` object that initialises a KNN classifier.


[More info here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [None]:
# Your code here
# Note: .fit() is chained here, so the model is initialised and fitted in one step
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

## Model fitting


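Now you have to fit your model on the training data with `fit(X_train, y_train)`. In the solution cell above, this call is chained directly onto the constructor, so the model is already fitted.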

[More info here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

## Make predictions

Your model is ready for prediction!

In [None]:
# Your code here
predictions = model.predict(X_test)
print(predictions)

[0 0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0
 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 1
 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1
 0 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0]


In [None]:
model.classes_

array([0, 1])

Make a prediction for yourself!  
Fill in the data below and evaluate your chance of survival ...

In [None]:
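# Your code here
my_class = 2  # passenger class (1, 2 or 3)
my_sex = 1    # factorize codes: 0 = male, 1 = female
my_age = 39
my_town = 2   # factorize codes: 0 = Southampton, 1 = Cherbourg, 2 = Queenstown
# A DataFrame with the training column names avoids scikit-learn's feature-name warning
my_data = pd.DataFrame([[my_class, my_sex, my_age, my_town]], columns=X.columns)
model.predict(my_data)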

array([1])
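
`predict` only returns the winning class. To read something closer to a "chance of survival", `KNeighborsClassifier` also offers `predict_proba`, which with uniform weights returns the share of the k nearest neighbours that belong to each class. Below is a minimal sketch on invented toy passengers (the `toy_*` names and all values are assumptions for illustration, not Titanic rows).

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Invented toy passengers using the same column layout as X
toy_X = pd.DataFrame({
    'pclass':      [1, 1, 2, 3, 3, 3],
    'sex':         [1, 1, 1, 0, 0, 0],
    'age':         [30, 40, 35, 25, 45, 50],
    'embark_town': [0, 1, 0, 0, 2, 0],
})
toy_y = [1, 1, 1, 0, 0, 0]  # 1 = survived, 0 = did not survive

toy_model = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

# One new passenger, built with the same column names as the training data
me = pd.DataFrame([[2, 1, 39, 2]], columns=toy_X.columns)

# One row per passenger, one column per class, in the order of toy_model.classes_
proba = toy_model.predict_proba(me)
print(toy_model.classes_, proba)
```

With `n_neighbors=3` and uniform weights, the probabilities can only be multiples of 1/3, so a larger k gives a finer-grained estimate.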

## Model evaluation

Last but not least, you should evaluate the accuracy of your model.  
The metric `accuracy_score` is imported directly from `sklearn.metrics`.  
Please remember that other metrics are available to evaluate classification models, such as precision, recall and F1 score; all of them can be derived from the `confusion matrix`.

In [None]:
# Your code here
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:

print("accuracy score on train set:",model.score(X_train, y_train))
print("accuracy score on test set:",model.score(X_test, y_test))


accuracy score on train set: 0.8651685393258427
accuracy score on test set: 0.7303370786516854


In [None]:
accuracy = accuracy_score (y_test , predictions)
matrix = confusion_matrix(y_true = y_test, y_pred = predictions)

print("accuracy = %.3f" % accuracy)
print("Confusion matrix:\n", matrix)
print("f1 score:", f1_score(y_test, predictions))

accuracy = 0.730
Confusion matrix:
 [[90 23]
 [25 40]]
f1 score: 0.625
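
To see how these metrics relate to each other, the sketch below recomputes them by hand from the confusion matrix printed above. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes, so the layout is `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 90, 23
fn, tp = 25, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted survivors who really survived
recall = tp / (tp + fn)                      # share of real survivors the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("accuracy = %.3f" % accuracy)
print("precision = %.3f" % precision)
print("recall = %.3f" % recall)
print("f1 = %.3f" % f1)
```

The hand-computed accuracy and F1 match the values returned by `accuracy_score` and `f1_score`.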


The performance of our model is fairly poor (73% test accuracy versus 87% on the training set), which could partly be explained by the original dataset.  
Another way to improve it is to tune hyperparameters such as the number of neighbors ...

## Hyperparameter


Let's play with the `n_neighbors` and `weights` hyperparameters of the model.  
* Evaluate the score of your models by adjusting `n_neighbors` from 2 to 10.
* Which values of `n_neighbors` and `weights` lead to the best score?

In [None]:
# Your code here
# weights='distance', every n_neighbors from 3 to 10 (range(3, 11) stops before 11)
accuracy = 0
n=0
for i in range(3,11):
  model = KNeighborsClassifier(n_neighbors = i, weights = 'distance')
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  score= accuracy_score (y_test,predictions)
  if score> accuracy:
    accuracy = score
    n = i
print(accuracy, n)

0.7752808988764045 6


In [None]:
# weights='uniform', every n_neighbors from 3 to 10 (range(3, 11) stops before 11)
accuracy = 0
n=0
for i in range(3,11):
  model = KNeighborsClassifier(n_neighbors = i, weights = 'uniform')
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  score= accuracy_score (y_test,predictions)
  if score> accuracy:
    accuracy = score
    n = i
print(accuracy, n)

0.7696629213483146 8
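
The two loops above can also be folded into a single nested loop that keeps every score. The sketch below is one possible way to do it. To stay runnable on its own it uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic set), and the names `X_demo`, `y_demo` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the Titanic set) so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=36)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.75, random_state=36)

# Store the test accuracy of every (n_neighbors, weights) combination
results = {}
for weights in ('uniform', 'distance'):
    for k in range(2, 11):  # 2 to 10 inclusive
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights).fit(Xtr, ytr)
        results[(k, weights)] = knn.score(Xte, yte)

# Pick the combination with the highest test accuracy
best = max(results, key=results.get)
print(best, results[best])
```

Keeping all scores in a dictionary, rather than only the running maximum, makes it easy to inspect how close the runner-up settings are.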


# Let's go back to data processing to improve our model

Please do the same data processing as previously, but encode `embark_town` (the harbour city) with `get_dummies` (and not `factorize`).
Then initialize, fit and score your model. Is it better?
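
Why might the encoding matter for KNN? `factorize` puts the three towns on a single numeric axis, so a distance-based model treats Queenstown (2) as farther from Southampton (0) than Cherbourg (1) is, although the towns have no natural order. `get_dummies` gives each town its own 0/1 column instead. A minimal sketch on an invented toy column (values chosen for illustration):

```python
import pandas as pd

# Toy column standing in for titanic['embark_town'] (invented values)
towns = pd.Series(['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
                  name='embark_town')

# factorize: a single integer column, which implies an artificial ordering
codes = towns.factorize()[0]

# get_dummies: one 0/1 indicator column per town, with no implied ordering
dummies = pd.get_dummies(towns, dtype=int)

print(codes)
print(dummies)
```

With the dummy columns, any two different towns differ in exactly two columns, so they are all equally far apart for the classifier.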

In [None]:
# It's up to you:
titanic1 = sns.load_dataset('titanic')

In [None]:
# Caution: this drops NaN over all 15 columns (incl. `deck`), keeping 182 rows instead of 712
titanic1.dropna(inplace=True)

In [None]:
titanic1['sex'] = titanic1['sex'].factorize()[0]
titanic1['sex']

1      0
3      0
6      1
10     0
11     0
      ..
871    0
872    1
879    0
887    0
889    1
Name: sex, Length: 182, dtype: int64

In [None]:
titanicdummies=titanic1['embark_town'].str.get_dummies()
titanicdummies

Unnamed: 0,Cherbourg,Queenstown,Southampton
1,1,0,0
3,0,0,1
6,0,0,1
10,0,0,1
11,0,0,1
...,...,...,...
871,0,0,1
872,0,0,1
879,1,0,0
887,0,0,1


In [None]:
titanic1 = pd.concat([titanic1, titanicdummies], axis = 1)

titanic1.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Cherbourg,Queenstown,Southampton
1,1,1,0,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1,0,0
3,1,1,0,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0,0,1
6,0,1,1,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True,0,0,1
10,1,3,0,4.0,1,1,16.7,S,Third,child,False,G,Southampton,yes,False,0,0,1
11,1,1,0,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True,0,0,1


In [None]:
from sklearn.model_selection import train_test_split
y2 =titanic1['survived']
X2 =titanic1[['pclass', 'sex', 'age', 'Cherbourg', 'Queenstown','Southampton']]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state = 36, train_size = 0.75)
print("The length of the initial dataset is :", len(X2))
print("The length of the train dataset is   :", len(X2_train))
print("The length of the test dataset is    :", len(X2_test))

The length of the initial dataset is : 182
The length of the train dataset is   : 136
The length of the test dataset is    : 46


In [None]:
model2=KNeighborsClassifier(n_neighbors=5, weights='distance')
model2.fit(X2_train, y2_train)
print(model2)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='distance')


In [None]:
predictions2 = model2.predict(X2_test)
predictions2

array([0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1])

In [None]:
print("accuracy score on train set:",model2.score(X2_train, y2_train))
print("accuracy score on test set:",model2.score(X2_test, y2_test))

accuracy score on train set: 0.9411764705882353
accuracy score on test set: 0.8043478260869565


In [None]:
y_pred2=model2.predict(X2_test)

In [None]:
accuracy2= accuracy_score (y2_test , predictions2)
confusion_matrix2= confusion_matrix(y2_test, predictions2)

print("accuracy = %.3f" % accuracy2)
print("Confusion matrix:\n", confusion_matrix2)
print("f1 score:", f1_score(y2_test, y_pred2))

accuracy = 0.804
Confusion matrix:
 [[10  4]
 [ 5 27]]
f1 score: 0.8571428571428571


# Conclusions
* Congrats!!! You just landed on the MACHINE LEARNING planet
* The KNN classifier is a supervised learning algorithm
* scikit-learn is the to-know-and-to-love toolbox for ML
* Our KNN classifier could be improved with hyperparameter tuning
* Other algorithms should be tested to select the best one, but that is another story ... to be continued, ML Data Wilders :)