<a href="https://colab.research.google.com/github/deborahmasibo/Moringa-Core-Module-2-Week-4-IP/blob/main/KNN_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Defining the Question

### a) Specifying the Question

### b) Defining the Metric for Success

### c) Understanding the context 

### d) Recording the Experimental Design

The following list depicts the steps to be undertaken during the project.

1. Data sourcing/loading.
2. Data Understanding
3. Data Relevance
4. External Dataset Validation
5. Data Preparation
6. Univariate Analysis
7. Bivariate Analysis
8. Multivariate Analysis
9. Modeling:

  a) KNN Classification

  b) Naive Bayes Classification

10. Implementing the solution
11. Challenging the solution
12. Conclusion
13. Follow up questions.


### e) Data Relevance

1. The data should have variables that adequately contribute to predicting passenger survival.
2. The dataset should lead to a high model fit (high accuracy) after all possible model optimization procedures have been applied.

## 2. Data Understanding

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import  r2_score, f1_score, precision_score, recall_score, classification_report, accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import export_graphviz
from six import StringIO  
from IPython.display import Image  
import pydotplus
import os
# Using seaborn style defaults and setting the default figure size
sns.set(rc={'figure.figsize':(30, 5)})
from warnings import filterwarnings
filterwarnings('ignore')
%matplotlib inline

In [2]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Accessing working directory
os.chdir('/content/drive/My Drive/Core/Machine Learning/Moringa Core Module 2 Week 4 IP')

Mounted at /content/drive


### a) Reading the Data

In [3]:
# Train dataset loading
train = pd.read_csv('train.csv')
# Test dataset loading
test = pd.read_csv('test.csv')

### b) Checking the Data

**Number of Records**

In [4]:
# Number of rows and columns
# Train dataset
print(f'Train dataset: records= {train.shape[0]} and columns = {train.shape[1]}')
# Test dataset
print(f'Test dataset: records= {test.shape[0]} and columns = {test.shape[1]}')

Train dataset: records= 891 and columns = 12
Test dataset: records= 418 and columns = 11


The train dataset has a column that is not present in the test dataset.

**Top Dataset Preview**

In [5]:
# Train dataset
# First 5 records
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Test dataset
# First 5 records
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The test set does not have the label column, 'Survived'. Therefore, it will be used to make predictions.
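The column that separates the two sets can also be found programmatically. The sketch below uses two toy frames (`train_demo` and `test_demo`, hypothetical stand-ins with made-up rows) rather than the real files, and compares their column indexes.

```python
import pandas as pd

# Toy frames standing in for the train and test sets (illustrative values only)
train_demo = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
test_demo = pd.DataFrame({'PassengerId': [892, 893], 'Pclass': [3, 2]})

# Columns present in the train frame but absent from the test frame
missing_in_test = train_demo.columns.difference(test_demo.columns).tolist()
print(missing_in_test)
```

On the toy frames this prints `['Survived']`, the label column.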

**Bottom Dataset Preview**

In [42]:
# Train dataset
# Last 5 records
train.tail()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,fare,embarked
886,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0,S
887,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,25.0,1,2,23.45,S
889,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,C
890,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,7.75,Q


In [43]:
# Test dataset
# Last 5 records
test.tail()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,fare,embarked
413,3,"Spector, Mr. Woolf",male,24.0,0,0,8.05,S
414,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,108.9,C
415,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,7.25,S
416,3,"Ware, Mr. Frederick",male,24.0,0,0,8.05,S
417,3,"Peter, Master. Michael J",male,24.0,1,1,22.3583,C


The Cabin column has missing values.

### c) Checking Datatypes

In [9]:
# Dataset information
# Train dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Datatypes are as required. However, the Age and Cabin columns have numerous missing values.

In [10]:
# Dataset information
# Test dataset
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


The datatypes are as required. Similar to the train dataset, the Cabin and Age columns have numerous missing values.

## 3. External Dataset Validation 

## 4. Data Preparation

### a) Validation

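The PassengerId and Name columns will be removed, as they are not relevant to the study. Note that the cell that drops columns below removes only PassengerId and Ticket; Name is still present in the column listings later in this section.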

In [11]:
# Checking relevance of the ticket column
print(f'Percentage of unique values (train): {(len(train.Ticket.unique()) / train.shape[0]) * 100}%')
print(f'Percentage of unique values (test): {(len(test.Ticket.unique()) / test.shape[0]) * 100}%')

Percentage of unique values (train): 76.43097643097643%
Percentage of unique values (test): 86.8421052631579%


The Ticket column has a high proportion of unique values and appears to hold ticket reference numbers, so it will be dropped from both datasets.

In [12]:
# Removing columns that do not add meaning to the project
train.drop(['PassengerId', 'Ticket'], axis = 1, inplace = True)
test.drop(['PassengerId', 'Ticket'], axis = 1, inplace = True)

In [13]:
# Ensuring changes have been made
print(f'Train set columns: {len(train.columns.values)}')
print(f'Test set columns: {len(test.columns.values)}')

Train set columns: 10
Test set columns: 9
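
The uniqueness check that motivated dropping the ticket column can be wrapped in a small helper and applied to any column. This is a minimal sketch on a toy frame: `unique_ratio` and `demo` are hypothetical names, and the helper uses `nunique()`, which ignores missing values, whereas `len(col.unique())` counts `NaN` as a value.

```python
import pandas as pd

def unique_ratio(series):
    """Fraction of rows that hold a distinct non-null value."""
    return series.nunique() / len(series)

# Toy frame with an identifier-like column and a low-cardinality column
demo = pd.DataFrame({
    'Ticket': ['A/5 21171', 'PC 17599', '113803', '373450', '113803'],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
})

ratios = {col: unique_ratio(demo[col]) for col in demo.columns}
print(ratios)
```

A ratio close to 1 suggests an identifier-like column that is unlikely to generalize, while a low ratio suggests a categorical feature.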


### b) Completeness

**Percentage of missing values**

In [14]:
# Function to find the percentage of missing values
def PercentageMissing(data):
  # Percentage of missing values
  for col in data.columns.tolist():
    missing = data[col].isnull().sum()
    if missing > 0:
      print(f'{col} = {(missing/data.shape[0])*100}%')

In [15]:
# Checking for missing values
# Train dataset
PercentageMissing(train)

Age = 19.865319865319865%
Cabin = 77.10437710437711%
Embarked = 0.22446689113355783%


In [16]:
# Checking for missing values
# Test dataset
PercentageMissing(test)

Age = 20.574162679425836%
Fare = 0.23923444976076555%
Cabin = 78.22966507177034%


More than 75% of the Cabin values are missing in both datasets; therefore, the column will be dropped.

In [17]:
# Dropping the Cabin column
# Train dataset
train.drop('Cabin', axis = 1, inplace = True)
# Test dataset
test.drop('Cabin', axis = 1, inplace = True)
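
The same decision can be expressed as a threshold rule instead of naming the column by hand. The sketch below is illustrative only: `drop_sparse_columns` is a hypothetical helper, the 0.75 cut-off simply mirrors the 75% figure quoted above, and the frame holds toy values.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(data, threshold=0.75):
    """Return a copy of data without columns whose missing fraction exceeds threshold."""
    # isnull().mean() gives the fraction of missing values per column
    missing_fraction = data.isnull().mean()
    keep = missing_fraction[missing_fraction <= threshold].index
    return data[keep].copy()

demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, 35.0],          # 20% missing
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],  # 80% missing
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],          # complete
})

reduced = drop_sparse_columns(demo)
print(reduced.columns.tolist())
```

Only `Cabin` exceeds the cut-off in the toy frame, so `Age` and `Fare` are kept.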

**Imputing missing values**

The age column will be imputed based on the mean age per passenger class.

*Train dataset*

In [18]:
# Unique passenger class values
train.Pclass.unique()

array([3, 1, 2])

In [22]:
# Function used to fill in missing values per group of ref_col, using either the
# group mean (truncated to an integer) or the most frequent value in the group.
def Fillna(class_values,  ref_col, target_col, fill_value, data):
  means = []
  freqs = []
  if fill_value == 'mean':
    # Mean of target_col within each group; int() truncates the decimal part
    for val in class_values:
      means.append(int(data[data[ref_col] == val][target_col].mean()))
    # Fill only the missing entries of each group
    for val, fill in zip(class_values, means):
      data.loc[(data[ref_col] == val) & data[target_col].isnull(), target_col] = fill

  elif fill_value == 'freq':
    # value_counts() sorts by frequency, so index[0] is the most frequent value
    for val in class_values:
      freqs.append(data[data[ref_col] == val][target_col].value_counts().index[0])
    for val, fill in zip(class_values, freqs):
      data.loc[(data[ref_col] == val) & data[target_col].isnull(), target_col] = fill


In [23]:
# Imputing missing values in the age column using the mean age per passenger class.
class_values = list(train.Pclass.unique())
Fillna(class_values, 'Pclass', 'Age', 'mean', train)
# Imputing missing values in the embarked column using the most frequent label
# per passenger class.
Fillna(class_values, 'Pclass', 'Embarked', 'freq', train)

In [24]:
# Confirming that changes have been made.
# Train dataset
train.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

*Test dataset*

In [28]:
# Imputing missing values in the age and fare columns using the
# mean of each column per passenger class.
# Passenger class list
class_values = list(test.Pclass.unique())
# Age column
Fillna(class_values, 'Pclass', 'Age', 'mean', test)
# Fare column
Fillna(class_values, 'Pclass', 'Fare', 'mean', test)

In [29]:
# Confirming that changes have been made.
# Test dataset
test.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

All missing values have been dealt with.
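
For comparison, per-class mean imputation can also be written with `groupby(...).transform('mean')`. This is a sketch of an alternative, not the method used above: it runs on a toy frame with made-up ages, and unlike `Fillna` it does not truncate the group mean to an integer.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing age in each passenger class
demo = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 50.0, np.nan, 20.0, 25.0, np.nan],
})

# Mean age per class, broadcast back to every row of that class
class_mean = demo.groupby('Pclass')['Age'].transform('mean')
# Fill only the missing entries; observed ages are left unchanged
demo['Age'] = demo['Age'].fillna(class_mean)
print(demo['Age'].tolist())
```

The missing first-class age becomes 45.0 and the missing third-class age becomes 22.5, where `Fillna` would have used 45 and 22.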

### c) Consistency

Checking for duplicates.

In [31]:
# Train dataset
train.duplicated().any()

False

In [32]:
# Test dataset
test.duplicated().any()

False

There are no duplicates in either dataset.

### d) Uniformity

Checking uniformity of column names.

In [33]:
# Train dataset
train.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

In [34]:
# Test dataset
test.columns

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')

Converting the column names to lower case for convenience when referencing them.

In [35]:
# Converting column names to lower case
# Train dataset
train.columns = train.columns.str.lower()
# Test dataset
test.columns = test.columns.str.lower()

Checking changes.

In [37]:
# New column case
print(f'Train dataset: {train.columns}')
print(f'Test dataset: {test.columns}')

Train dataset: Index(['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked'],
      dtype='object')
Test dataset: Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'], dtype='object')


### e) Outliers

Checking for outliers.

In [38]:
# Outliers function
def outliers(data):
  # Numerical columns
  numerical = data.select_dtypes(include = ['int64', 'float64'])
  # Interquartile range (IQR) per column
  Q1 = numerical.quantile(0.25)
  Q3 = numerical.quantile(0.75)
  IQR = Q3 - Q1
  # Rows with at least one value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
  outlier_rows = numerical[((numerical < (Q1 - 1.5 * IQR)) | (numerical > (Q3 + 1.5 * IQR))).any(axis=1)]
  print(f'Number of outliers = {outlier_rows.shape[0]}')
  print(f'Percentage = {(outlier_rows.shape[0]/data.shape[0])*100}%')

In [39]:
# Train dataset
outliers(train)

Number of outliers = 302
Percentage = 33.89450056116723%


In [40]:
# Test dataset
outliers(test)

Number of outliers = 133
Percentage = 31.818181818181817%


Outliers will be retained, as the test set also contains outliers. A copy of the train dataset without outliers will be created to compare model performance.

In [41]:
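# Dataset without outliers
# Removing outliers (IQR rule applied to the numerical columns only, since
# quantiles are not defined for the text columns)
numerical = train.select_dtypes(include = ['int64', 'float64'])
Q1 = numerical.quantile(0.25)
Q3 = numerical.quantile(0.75)
IQR = Q3 - Q1
train_no = train[~ ((numerical < (Q1 - 1.5 * IQR)) | (numerical > (Q3 + 1.5 * IQR))).any(axis=1)]
train_no.shape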

(589, 9)

The test set outliers will not be dropped, as the test set will be used to make predictions.
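
The IQR rule applied in this section can be traced on a single toy column. The sketch below uses made-up values and the same 1.5 multiplier; the variable names are illustrative.

```python
import pandas as pd

# Toy column with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(f'Fences: [{lower}, {upper}]')
print(values[is_outlier].tolist())
```

Here the quartiles are 3 and 7, so the fences are -3 and 13 and only the value 100 is flagged. In the functions above the same test is applied column by column, and a row counts as an outlier if any of its numerical values falls outside the fences.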