<a href="https://colab.research.google.com/github/deborahmasibo/Moringa-Core-Module-2-Week-4-IP/blob/main/KNN_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Defining the Question

### a) Specifying the Question

### b) Defining the Metric for Success

### c) Understanding the context 

### d) Recording the Experimental Design

The following list depicts the steps to be undertaken during the project.

1. Data sourcing/loading.
2. Data Understanding
3. Data Relevance
4. External Dataset Validation
5. Data Preparation
6. Univariate Analysis
7. Bivariate Analysis
8. Multivariate Analysis
9. Modeling:

  a) KNN Classification

  b) Naive Bayes Classification

10. Implementing the solution
11. Challenging the solution
12. Conclusion
13. Follow up questions.


### e) Data Relevance

1. The data should have variables that adequately contribute to predicting passenger survival.
2. The dataset should lead to a high model fit (high accuracy) after all possible model optimization procedures have been applied.

## 2. Data Understanding

In [94]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import  r2_score, f1_score, precision_score, recall_score, classification_report, accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import export_graphviz
from six import StringIO  
from IPython.display import Image  
import pydotplus
import os
# Using seaborn style defaults and setting the default figure size
sns.set(rc={'figure.figsize':(30, 5)})
from warnings import filterwarnings
filterwarnings('ignore')
%matplotlib inline

In [95]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Accessing working directory
os.chdir('/content/drive/My Drive/Core/Machine Learning/Moringa Core Module 2 Week 4 IP')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### a) Reading the Data

In [96]:
# Train dataset loading
train = pd.read_csv('train.csv')
# Test dataset loading
test = pd.read_csv('test.csv')

### b) Checking the Data

**Number of Records**

In [97]:
# Number of rows and columns
# Train dataset
print(f'Train dataset: records= {train.shape[0]} and columns = {train.shape[1]}')
# Test dataset
print(f'Test dataset: records= {test.shape[0]} and columns = {test.shape[1]}')

Train dataset: records= 891 and columns = 12
Test dataset: records= 418 and columns = 11


The train dataset has a column that is not present in the test dataset.
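
The extra column can be identified by comparing the two column sets. A minimal sketch, assuming small stand-in frames (`train_demo` and `test_demo` are hypothetical and not the project data):

```python
import pandas as pd

# Hypothetical stand-in frames that mimic the train/test column layout
train_demo = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
test_demo = pd.DataFrame({'PassengerId': [892, 893], 'Pclass': [3, 3]})

# Columns present in the train frame but absent from the test frame
extra_columns = train_demo.columns.difference(test_demo.columns).tolist()
print(extra_columns)
```

Applied to the real `train` and `test` frames, the same call should list whichever column the test set lacks.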

**Top Dataset Preview**

In [98]:
# Train dataset
# First 5 records
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [99]:
# Test dataset
# First 5 records
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The test set does not have the label column, 'Survived'. Therefore, it will be used to make predictions.

**Bottom Dataset Preview**

In [100]:
# Train dataset
# Last 5 records
train.tail()


In [101]:
# Test dataset
# Last 5 records
test.tail()


The Cabin column has missing values.

### c) Checking Datatypes

In [102]:
# Dataset information
# Train dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The datatypes are as required. However, the Age column (714 of 891 non-null) and the Cabin column (204 of 891 non-null) have numerous missing values, and Embarked is missing two values.

In [103]:
# Dataset information
# Test dataset
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


The datatypes are as required. As in the train dataset, the Age column (332 of 418 non-null) and the Cabin column (91 of 418 non-null) have numerous missing values, and Fare is missing one value.

## 3. External Dataset Validation 

## 4. Data Preparation

### a) Validation

The PassengerId and Name columns will be removed as they are not relevant to the study.

In [104]:
# Checking relevance of the ticket column
print(f'Percentage of unique values (train): {(len(train.Ticket.unique()) / train.shape[0]) * 100}%')
print(f'Percentage of unique values (test): {(len(test.Ticket.unique()) / test.shape[0]) * 100}%')

Percentage of unique values (train): 76.43097643097643%
Percentage of unique values (test): 86.8421052631579%


Most Ticket values are unique (about 76% in the train dataset and 87% in the test dataset), which suggests that the column is a ticket reference number. It will therefore be dropped from both datasets.
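
The uniqueness check can also be written with `nunique()`. A small sketch on hypothetical ticket values (the Series below is invented for illustration):

```python
import pandas as pd

# Hypothetical ticket values; one ticket is shared by two passengers
tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', '113803'], name='Ticket')

# Share of distinct values relative to the number of records
unique_share = tickets.nunique() / len(tickets) * 100
print(f'Percentage of unique values: {unique_share:.1f}%')
```

Note that `nunique()` ignores missing values by default, whereas `len(Series.unique())` counts NaN as a value. The two agree only when the column has no missing entries, as is the case for Ticket here.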

In [105]:
# Removing columns that do not add meaning to the project
train.drop(['PassengerId', 'Ticket'], axis = 1, inplace = True)
test.drop(['PassengerId', 'Ticket'], axis = 1, inplace = True)

In [106]:
# Ensuring changes have been made
print(f'Train set columns: {len(train.columns.values)}')
print(f'Test set columns: {len(test.columns.values)}')

Train set columns: 10
Test set columns: 9


### b) Completeness

**Percentage of missing values**

In [107]:
# Function to find the percentage of missing values
def PercentageMissing(data):
  # Percentage of missing values
  for col in data.columns.tolist():
    missing = data[col].isnull().sum()
    if missing > 0:
      print(f'{col} = {(missing/data.shape[0])*100}%')
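
As an alternative to the explicit loop, the same percentages can be computed in one vectorised step. The sketch below is only illustrative: `missing_percentages` and the toy frame `demo` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def missing_percentages(data):
    # isnull().mean() gives the fraction of missing entries per column
    pct = data.isnull().mean() * 100
    # Keep only the columns that actually contain missing values
    return pct[pct > 0].sort_values(ascending=False)

# Hypothetical toy frame
demo = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                     'Fare': [7.25, 71.28, 7.93, 53.10],
                     'Cabin': [np.nan, 'C85', np.nan, np.nan]})
result = missing_percentages(demo)
print(result)
```

Returning a Series instead of printing makes the result easy to reuse, for example when choosing which columns to drop.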

In [108]:
# Checking for missing values
# Train dataset
PercentageMissing(train)

Age = 19.865319865319865%
Cabin = 77.10437710437711%
Embarked = 0.22446689113355783%


In [109]:
# Checking for missing values
# Test dataset
PercentageMissing(test)

Age = 20.574162679425836%
Fare = 0.23923444976076555%
Cabin = 78.22966507177034%


More than 75% of the Cabin values are missing in both datasets; therefore, the column will be dropped.
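
The drop decision can be expressed as a threshold rule. A minimal sketch, assuming a toy frame and a cut-off of 0.75 that mirrors the figure quoted above (both `demo` and `threshold` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; Cabin is missing in 4 of 5 rows
demo = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, 35.0],
                     'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan]})
threshold = 0.75

# Fraction of missing entries per column
missing_share = demo.isnull().mean()
# Names of the columns whose missing share exceeds the threshold
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = demo.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```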

In [110]:
# Dropping the Cabin column
# Train dataset
train.drop('Cabin', axis = 1, inplace = True)
# Test dataset
test.drop('Cabin', axis = 1, inplace = True)

**Imputing missing values**

Missing values in the Age column will be imputed with the mean age of the corresponding passenger class.
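
One common way to do this in pandas is `groupby(...).transform('mean')`, which returns the group mean aligned to every row. The sketch below uses hypothetical toy data and is not necessarily the approach the notebook settles on:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two passenger classes with one missing age each
demo = pd.DataFrame({'Pclass': [1, 1, 1, 3, 3, 3],
                     'Age': [40.0, 50.0, np.nan, 20.0, 30.0, np.nan]})

# Mean age of each row's own passenger class, aligned to the original index
class_mean_age = demo.groupby('Pclass')['Age'].transform('mean')
# Fill each missing age with the mean of its class
demo['Age'] = demo['Age'].fillna(class_mean_age)
print(demo)
```

Because the fill values are aligned by index, each missing age receives the mean of its own class rather than a single global value.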

*Train dataset*

In [111]:
# Unique passenger class values
train.Pclass.unique()

array([3, 1, 2])

In [112]:
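def Fillna(class_values, ref_col, target_col, fill_value, data):
  # Fills missing values in target_col using a statistic computed within each ref_col group
  for val in class_values:
    # Rows that belong to the current group
    in_group = data[ref_col] == val
    group_values = data.loc[in_group, target_col]
    if fill_value == 'mean':
      fill = group_values.mean()
    elif fill_value == 'freq':
      fill = group_values.value_counts().index[0]
    else:
      raise ValueError("fill_value must be 'mean' or 'freq'")
    data.loc[in_group, target_col] = group_values.fillna(fill)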


In [None]:
# First class


In [92]:
# Imputing missing values in the Age column using the mean age per passenger class.
for pclass in train.Pclass.unique():
  # Rows that belong to the current passenger class
  in_class = train['Pclass'] == pclass
  # fillna returns a new Series, so the result is assigned back to the dataframe
  train.loc[in_class, 'Age'] = train.loc[in_class, 'Age'].fillna(train.loc[in_class, 'Age'].mean())

In [93]:
# Confirming that no first-class passengers have a missing age
train[(train['Pclass'] == 1) & (train.Age.isnull())]['Age']

Series([], Name: Age, dtype: float64)

In [50]:

# Age column: any values still missing are filled with the overall column mean
train['Age'] = train['Age'].fillna(train['Age'].mean())
# Imputing the categorical missing values using the most frequent category.
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].value_counts().index[0])
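
To illustrate the most-frequent-category fill in isolation, here is a short sketch on a hypothetical Series of embarkation codes (the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes with one missing entry
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q'], name='Embarked')

# value_counts() sorts by frequency, so the first index label is the most frequent category
most_frequent = embarked.value_counts().index[0]
filled = embarked.fillna(most_frequent)
print(most_frequent, filled.isnull().sum())
```

`embarked.mode()[0]` gives the same category here. When several categories tie for the highest count, `mode()` returns all of them in sorted order, so the two expressions may pick different values.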

In [52]:
# Test dataset
# Imputing numerical missing values using the column mean.
# Age column
test['Age'] = test['Age'].fillna(test['Age'].mean())

In [None]:
# Checking the passenger class of the missing values before filling the Fare column.
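
The check described in the comment above could look like the following sketch. It runs on a hypothetical toy frame. Filling with the class mean is an assumption made for illustration, since the notebook does not state how Fare will be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the test set
demo = pd.DataFrame({'Pclass': [3, 3, 3, 1, 1],
                     'Fare': [7.0, 9.0, np.nan, 50.0, 70.0]})

# Passenger class of the rows whose fare is missing
missing_classes = demo.loc[demo['Fare'].isnull(), 'Pclass'].tolist()
print(missing_classes)

# Assumed fill strategy: the mean fare of the passenger's own class
class_mean_fare = demo.groupby('Pclass')['Fare'].transform('mean')
demo['Fare'] = demo['Fare'].fillna(class_mean_fare)
print(demo)
```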
