# K-Nearest Neighbor Model Implementation


### STEPS:

#### INITIAL MODEL

---

1. Load the training data (this dataset must itself be split into training and test sets)
2. Assess whether normalization is required
3. Normalize the data IF required
4. Split into train / test datasets
5. Fit the model for K = 1 (written so that K can be iterated over)
6. Predict on the test data --> build the confusion matrix --> calculate accuracy / misclassification error
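
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual helper code: `fit_and_score_knn` is a hypothetical name, and the data is synthetic (generated with `make_classification`), not `train.csv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score_knn(X_train, y_train, X_test, y_test, k=1):
    """Fit a k-NN classifier; return (confusion matrix, misclassification error)."""
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # Misclassification error = share of wrong predictions = 1 - accuracy
    error = float(np.mean(y_pred != y_test))
    return cm, error


# Synthetic stand-in data (an assumption; NOT the notebook's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=123)

cm, error = fit_and_score_knn(X_tr, y_tr, X_te, y_te, k=1)
print(cm)
print(f"misclassification error for K=1: {error:.3f}")
```

Keeping `k` as a parameter is what makes the K = 1 fit "iterable": the same function can be called in a loop over candidate K values.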

#### FEATURE ITERATION

---

1. Write a function to get all possible feature combinations into a 2D list
2. Parameterize INITIAL MODEL to take a set of features and predict
3. Output the results (for each feature set, the K value that gives the lowest misclassification error)

#### FINAL MODEL EVALUATION

---

1. Write a function that takes the feature combinations one at a time and runs the model
2. Save the output of each iteration to a list for later evaluation


#### IMPORTS


In [1]:
import numpy as np
import pandas as pd
import sklearn.neighbors as skl_nb
import sklearn.model_selection as skl_ms
import matplotlib.pyplot as plt

import sys
sys.path.append("..")

from utils.loading_data  import load_to_df_from_csv, get_all_feature_combinations
from utils.knn_functions import find_best_k_with_misclassification, plot_misclassification, model_iterator

### STEP 1: LOADING DATA


In [2]:
# Loading the train.csv as the main dataset
data = load_to_df_from_csv("../data/train.csv")

# Column Transformation to lowercase and underscored spaces
data.columns = data.columns.str.replace(' ', '_')
data.columns = data.columns.str.replace('-', '_')
data.columns = data.columns.str.lower()


### Exploring data


In [3]:
data.shape


(1039, 14)

In [4]:
data.describe()


| | number_words_female | total_words | number_of_words_lead | difference_in_words_lead_and_co_lead | number_of_male_actors | year | number_of_female_actors | number_words_male | gross | mean_age_male | mean_age_female | age_lead | age_co_lead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 | 1039.0 |
| mean | 2334.256015 | 11004.368624 | 4108.256978 | 2525.024062 | 7.767084 | 1999.862368 | 3.507218 | 4561.85563 | 111.149182 | 42.353766 | 35.929588 | 38.716073 | 35.486044 |
| std | 2157.216744 | 6817.397413 | 2981.251156 | 2498.747279 | 3.901439 | 10.406632 | 2.088526 | 3417.855987 | 151.761551 | 7.81711 | 8.957193 | 12.285902 | 12.046696 |
| min | 0.0 | 1351.0 | 318.0 | 1.0 | 1.0 | 1939.0 | 1.0 | 0.0 | 0.0 | 19.0 | 11.0 | 11.0 | 7.0 |
| 25% | 904.0 | 6353.5 | 2077.0 | 814.5 | 5.0 | 1994.0 | 2.0 | 2139.5 | 22.0 | 37.480769 | 29.5 | 30.0 | 28.0 |
| 50% | 1711.0 | 9147.0 | 3297.0 | 1834.0 | 7.0 | 2000.0 | 3.0 | 3824.0 | 60.0 | 42.6 | 35.0 | 38.0 | 34.0 |
| 75% | 3030.5 | 13966.5 | 5227.0 | 3364.0 | 10.0 | 2009.0 | 5.0 | 5887.5 | 143.5 | 47.333333 | 41.5 | 46.0 | 41.0 |
| max | 17658.0 | 67548.0 | 28102.0 | 25822.0 | 29.0 | 2015.0 | 16.0 | 31146.0 | 1798.0 | 71.0 | 81.333333 | 81.0 | 85.0 |
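
The summary statistics show columns on very different scales: `total_words` reaches 67548, while `number_of_female_actors` tops out at 16. Since k-NN relies on distances, the large-range columns would likely dominate unless the features are rescaled. The notebook itself does not scale the data; the following is only a sketch of one common option, standardization with scikit-learn's `StandardScaler`, using a few illustrative toy rows, not the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows with two columns on very different scales (illustrative values only,
# loosely echoing a word-count column and an actor-count column)
X_train = np.array([[6353.5, 2.0], [9147.0, 3.0], [13966.5, 5.0], [67548.0, 16.0]])
X_test = np.array([[1351.0, 1.0], [11004.0, 4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test rows

print(X_train_scaled.round(3))
print(X_test_scaled.round(3))
```

Fitting the scaler on the training rows only is a design choice that keeps test-set statistics from leaking into the model; it implies splitting before normalizing.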


### STEP 2: SPLIT TRAIN / TEST DATASETS


In [5]:
X_train, X_test, y_train, y_test = skl_ms.train_test_split(
    data.iloc[:, 0:data.shape[1] - 1], data.iloc[:, -1], test_size=0.30, random_state=123)
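
The split above takes every column except the last as features and the last column as the target. As a hedged variation, the sketch below performs the same positional split on a hypothetical toy frame and adds `stratify=y`, a `train_test_split` option that keeps class proportions similar in both subsets. The column names and values are made up for illustration, and whether stratification helps on the real data is untested here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: two feature columns followed by a label in the last position
toy = pd.DataFrame({
    "feature_a": range(40),
    "feature_b": range(40, 80),
    "label": ["a"] * 30 + ["b"] * 10,
})

X = toy.iloc[:, :-1]  # same as toy.iloc[:, 0:toy.shape[1] - 1]
y = toy.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

print(y_train.value_counts())
print(y_test.value_counts())
```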


### STEP 3: GET ALL FEATURE COMBINATIONS TO A LIST


In [6]:
feature_combinations = get_all_feature_combinations(X_train.columns)

# feature_combinations[8191]
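
`get_all_feature_combinations` lives in the `utils` package and is not shown here. A minimal sketch of what such a helper might look like, using only `itertools`, is given below; `all_feature_combinations` is a hypothetical stand-in, and whether the real helper includes the empty set or uses the same ordering is not known. With the 13 feature columns of `X_train`, enumerating every non-empty subset gives 2^13 - 1 = 8191 combinations.

```python
from itertools import combinations


def all_feature_combinations(columns):
    """Return every non-empty subset of `columns` as a list of lists."""
    columns = list(columns)
    result = []
    # Subsets of size 1 first, then size 2, ... up to the full set
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            result.append(list(combo))
    return result


combos = all_feature_combinations(["year", "gross", "age_lead"])
print(len(combos))  # 2**3 - 1 = 7
print(combos[-1])   # the full feature set comes last
```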


### STEP 4: RUN THE MODEL FOR EACH FEATURE COMBINATION AND FIND THE BEST K VALUE WITH ITS MISCLASSIFICATION ERROR


In [7]:
# Set iterations=8191 to run all combinations
results = model_iterator(X_train, y_train, X_test, y_test, feature_combinations, iterations=10)

# results.to_csv(r'/Users/dininduseneviratne/Library/CloudStorage/OneDrive-Uppsalauniversitet/Statistical Machine Learning/results_8191.csv')
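
The body of `model_iterator` is not shown in this notebook. The sketch below is one way such a loop could be written, under several assumptions: `iterations` is taken to cap how many feature combinations are tried, K ranges over 1 to 10, and the function and column names (`model_iterator_sketch`, `best_k_for_features`, `best_k`, `misclassification`) are hypothetical. Rows are collected in a plain list and turned into a DataFrame once at the end, because `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. The data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def best_k_for_features(X_train, y_train, X_test, y_test, features, k_values):
    """Return (best K, lowest misclassification error) for one feature subset."""
    best_k, best_error = None, float("inf")
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train[features], y_train)
        y_pred = model.predict(X_test[features])
        error = float((y_pred != y_test.to_numpy()).mean())
        if error < best_error:
            best_k, best_error = k, error
    return best_k, best_error


def model_iterator_sketch(X_train, y_train, X_test, y_test,
                          feature_combinations, iterations, k_values=range(1, 11)):
    """Evaluate the first `iterations` feature subsets; one result row per subset."""
    rows = []
    for features in feature_combinations[:iterations]:
        k, error = best_k_for_features(X_train, y_train, X_test, y_test,
                                       list(features), k_values)
        rows.append({"features": list(features), "best_k": k,
                     "misclassification": error})
    return pd.DataFrame(rows)  # build the frame once instead of appending row by row


# Synthetic stand-in data (an assumption; NOT the notebook's dataset)
X_arr, y_arr = make_classification(n_samples=200, n_features=4, random_state=123)
X_df = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
y_sr = pd.Series(y_arr)
X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_sr, test_size=0.30, random_state=123)

demo_combos = [["f0"], ["f0", "f1"], ["f0", "f1", "f2", "f3"], ["f2"]]
demo_results = model_iterator_sketch(X_tr, y_tr, X_te, y_te, demo_combos, iterations=3)
print(demo_results)
```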


