# K-Nearest Neighbor Model Implementation


#### PIPELINE FOR HYPERPARAMETER TUNING - FINDING THE BEST K VALUE FOR THE OPTIMAL FEATURE COMBINATION

Here, we implement the KNN Model in the following steps.

1. Split the data into training and test sets on a 75:25 basis with random_state = 2. The training set is used for 10-fold cross-validation to find the model that gives the lowest misclassification error while using a considerable number of features. The test set is used later to check how well the model predicts unseen data.
2. A separate function, "_get_all_feature_combinations_", generates all non-empty feature combinations (i.e. 2^13 - 1 = 8191 for the 13 features) in an array.
3. We then run _X_train_ and _y_train_ (normalized using StandardScaler) and all feature combinations through the K-NN model iterator, which runs 10-fold cross-validation and gives us the best K value along with its misclassification error for each iteration (8191 iterations in total). The combined results are saved as a CSV output [here](../data/knn_iteration_results_8191_04_12_2022_v1.csv).
4. From the results, we manually choose a K value and a feature combination for our final model.
5. We then train the selected model on the 75% training set from the initial split and run predictions on the unseen 25% test set. The data is normalized using StandardScaler in this step as well.
6. The final output is given in the following format:

Train Misclassification Error: XX%
Train Accuracy: XX%
Test Misclassification Error: XX%
Test Accuracy: XX%
    
![KNN%20-%20Model%20for%20Group%20Assigment.jpg](attachment:KNN%20-%20Model%20for%20Group%20Assigment.jpg)

#### IMPORTS


In [1]:
import numpy as np
import pandas as pd
import sklearn.neighbors as skl_nb
import sklearn.model_selection as skl_ms
from sklearn.model_selection import train_test_split
import sklearn.preprocessing as prep
import matplotlib.pyplot as plt

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

import sys
sys.path.append("..")

from utils.loading_data  import load_to_df_from_csv, get_all_feature_combinations
from utils.knn_functions import find_best_k_with_misclassification_cv, model_iterator_cv, data_normalizer, generate_prediction_results, generate_prediction_results_without_scaling

#### LOADING DATA


In [2]:
# Loading the train.csv as the main dataset
data = load_to_df_from_csv("../data/train.csv")

# Column Transformation to lowercase and underscored spaces
data.columns = data.columns.str.replace(' ', '_')
data.columns = data.columns.str.replace('-', '_')
data.columns = data.columns.str.lower()

X = data.loc[:, data.columns != 'lead']
y = data.loc[:, data.columns == 'lead']

#### SPLITTING DATA (75% : 25%)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
[X_train.shape, X_test.shape, y_train.shape, y_test.shape]

[(779, 13), (260, 13), (779, 1), (260, 1)]
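
The shapes above follow from scikit-learn's default `test_size` of 0.25, which applies when neither `test_size` nor `train_size` is passed. A minimal sketch on placeholder arrays makes the 75:25 split explicit; the `*_demo` names and the zero-filled data are illustrative assumptions, not the project's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the dataset (1039 rows, 13 features).
X_demo = np.zeros((1039, 13))
y_demo = np.zeros(1039)

# Passing test_size=0.25 explicitly is equivalent to the default used above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)

print(X_tr.shape, X_te.shape)  # (779, 13) (260, 13)
```

The test set gets ceil(0.25 * 1039) = 260 rows and the training set gets the remaining 779.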

#### EXPLORING DATA


In [4]:
data.shape

(1039, 14)

In [5]:
data.describe()


| | number_words_female | total_words | number_of_words_lead | difference_in_words_lead_and_co_lead | number_of_male_actors | year | number_of_female_actors | number_words_male | gross | mean_age_male | mean_age_female | age_lead | age_co_lead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 |
| mean | 2334.256015 | 11004.368624 | 4108.256978 | 2525.024062 | 7.767084 | 1999.862368 | 3.507218 | 4561.85563 | 111.149182 | 42.353766 | 35.929588 | 38.716073 | 35.486044 |
| std | 2157.216744 | 6817.397413 | 2981.251156 | 2498.747279 | 3.901439 | 10.406632 | 2.088526 | 3417.855987 | 151.761551 | 7.81711 | 8.957193 | 12.285902 | 12.046696 |
| min | 0.0 | 1351.0 | 318.0 | 1.0 | 1.0 | 1939.0 | 1.0 | 0.0 | 0.0 | 19.0 | 11.0 | 11.0 | 7.0 |
| 25% | 904.0 | 6353.5 | 2077.0 | 814.5 | 5.0 | 1994.0 | 2.0 | 2139.5 | 22.0 | 37.480769 | 29.5 | 30.0 | 28.0 |
| 50% | 1711.0 | 9147.0 | 3297.0 | 1834.0 | 7.0 | 2000.0 | 3.0 | 3824.0 | 60.0 | 42.6 | 35.0 | 38.0 | 34.0 |
| 75% | 3030.5 | 13966.5 | 5227.0 | 3364.0 | 10.0 | 2009.0 | 5.0 | 5887.5 | 143.5 | 47.333333 | 41.5 | 46.0 | 41.0 |
| max | 17658.0 | 67548.0 | 28102.0 | 25822.0 | 29.0 | 2015.0 | 16.0 | 31146.0 | 1798.0 | 71.0 | 81.333333 | 81.0 | 85.0 |


#### GET ALL FEATURE COMBINATIONS

In [4]:
feature_combinations = get_all_feature_combinations(X.columns)
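
The source of `get_all_feature_combinations` is not shown here, so the following is only a sketch of how such a helper could enumerate every non-empty subset of the columns. The function name `all_feature_combinations` and the demo column names are hypothetical:

```python
from itertools import combinations


def all_feature_combinations(columns):
    """Return every non-empty subset of `columns` as a list of lists."""
    columns = list(columns)
    subsets = []
    # Subsets of size 1, 2, ..., len(columns); the empty set is skipped.
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            subsets.append(list(combo))
    return subsets


demo = all_feature_combinations(["a", "b", "c"])
print(len(demo))  # 7, i.e. 2**3 - 1

# With 13 columns the count is 2**13 - 1.
print(len(all_feature_combinations(range(13))))  # 8191
```

Because the count doubles with each added feature, an exhaustive search like this is only practical for a small number of columns.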

#### RUN MODEL ITERATION (10-FOLD CROSS VALIDATION FOR 8191 FEATURE COMBINATIONS)

In [7]:
results = model_iterator_cv(X_train, y_train, feature_combinations, iterations = 10) # set iterations = 8191 to run all feature combinations

4096 OUT OF 8191 ITERATIONS COMPLETED - 50.00610426077402%
4097 OUT OF 8191 ITERATIONS COMPLETED - 50.01831278232206%
4098 OUT OF 8191 ITERATIONS COMPLETED - 50.0305213038701%
4099 OUT OF 8191 ITERATIONS COMPLETED - 50.04272982541814%
4100 OUT OF 8191 ITERATIONS COMPLETED - 50.05493834696618%
4101 OUT OF 8191 ITERATIONS COMPLETED - 50.06714686851422%
4102 OUT OF 8191 ITERATIONS COMPLETED - 50.07935539006226%
4103 OUT OF 8191 ITERATIONS COMPLETED - 50.0915639116103%
4104 OUT OF 8191 ITERATIONS COMPLETED - 50.10377243315835%
4105 OUT OF 8191 ITERATIONS COMPLETED - 50.11598095470639%
4106 OUT OF 8191 ITERATIONS COMPLETED - 50.12818947625443%
4107 OUT OF 8191 ITERATIONS COMPLETED - 50.14039799780247%
4108 OUT OF 8191 ITERATIONS COMPLETED - 50.152606519350506%
4109 OUT OF 8191 ITERATIONS COMPLETED - 50.164815040898546%
4110 OUT OF 8191 ITERATIONS COMPLETED - 50.177023562446585%
4111 OUT OF 8191 ITERATIONS COMPLETED - 50.189232083994625%
4112 OUT OF 8191 ITERATIONS COMPLETED - 50.20144060554
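
Each iteration above evaluates one feature combination. The internals of `model_iterator_cv` and `find_best_k_with_misclassification_cv` are not shown, so the sketch below is an assumption about what a single iteration might look like: 10-fold cross-validation over a range of K values, keeping the K with the lowest mean misclassification error. The function name `best_k_by_cv`, the K range of 1 to 20, and the synthetic data are all hypothetical. Note that this sketch fits the scaler inside each fold through a pipeline, which may differ from how the project's helper normalizes the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def best_k_by_cv(X, y, k_values, n_folds=10, seed=0):
    """Return (best_k, lowest mean misclassification error) over `k_values`."""
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    errors = {}
    for k in k_values:
        # Scaling inside the pipeline means each fold is scaled using
        # statistics from its own training portion only.
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        errors[k] = 1.0 - accuracy.mean()
    best_k = min(errors, key=errors.get)
    return best_k, errors[best_k]


# Synthetic stand-in for X_train[features] and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
best_k, best_error = best_k_by_cv(X_demo, y_demo, k_values=range(1, 21))
print(f"best K = {best_k}, CV misclassification = {best_error:.3f}")
```

Running this once per feature combination and collecting `(features, best_k, best_error)` rows would produce a results table comparable to the CSV described earlier.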


#### GENERATE PREDICTION RESULTS

From the results of the iterations, we choose K = 10 with the following feature combination (10 out of 13 features):

In [12]:
selected_features = [
        'number_words_female',
        'number_of_words_lead',
        'difference_in_words_lead_and_co_lead',
        'number_of_male_actors',
        'number_of_female_actors',
        'number_words_male',
        'gross',
        'mean_age_female',
        'age_lead',
        'age_co_lead'
    ]

generate_prediction_results(X_train[selected_features], y_train, X_test[selected_features], y_test, 10)

Train Misclassification Error: 31.75428159929866%
Train Accuracy: 68.24571840070134%
Test Misclassification Error: 28.71597633136095%
Test Accuracy: 71.28402366863905%
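
For reference, a report in this format could be produced roughly as follows. This is a sketch under stated assumptions, not the actual `generate_prediction_results` helper: the function name `prediction_report` and the synthetic data are hypothetical, and the scaler is assumed to be fitted on the training set only and then applied to both sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def prediction_report(X_train, y_train, X_test, y_test, k):
    """Fit a scaled K-NN model and return train/test metrics in percent."""
    # Fit the scaler on the training data only, then reuse it for the test data.
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
    train_acc = 100.0 * model.score(X_train_scaled, y_train)
    test_acc = 100.0 * model.score(X_test_scaled, y_test)
    return {
        "Train Misclassification Error": 100.0 - train_acc,
        "Train Accuracy": train_acc,
        "Test Misclassification Error": 100.0 - test_acc,
        "Test Accuracy": test_acc,
    }


# Synthetic stand-in for the selected feature columns and the `lead` label.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

report = prediction_report(X_tr, y_tr, X_te, y_te, k=10)
for name, value in report.items():
    print(f"{name}: {value}%")
```

Misclassification error and accuracy are complements here, so each pair sums to 100%.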
