
# Prove 03: kNN with custom datasets

## Imports and initial setup

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn import preprocessing as pp
from sklearn.preprocessing import StandardScaler

In [0]:
# The datasets are stored in Google Drive; mount the drive for this session

from google.colab import drive
drive.mount("/content/drive")

## Read car.data into a DataFrame for preprocessing

In [5]:
car = pd.read_csv("/content/drive/My Drive/prove03_data/car.data")
car.head(5)

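|   | vhigh | vhigh.1 | 2 | 2.1 | small | low | unacc |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 2 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | high | unacc |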


This data requires some preprocessing before we can train the classifier and regressor. Note that the file has no header row, so `read_csv` consumed the first record as column names.

* Rename the columns for readability
* Assign numeric values to the non-numeric columns
* Confirm that no data is missing **(the website says there are no missing values)**

## Change headers

In [6]:
car = pd.read_csv("/content/drive/My Drive/prove03_data/car.data", header=None)

# The file itself has no header row; these column names come from the website
# hosting the dataset.

car.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "target"]
car.head(5)

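|   | buying | maint | doors | persons | lug_boot | safety | target |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |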


## Assign numeric values to non-numerics
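Use one-hot encoding, as described in the reading, for the mostly numeric columns (`doors` and `persons`, which also contain the non-numeric levels `5more` and `more`), and ordered label encoding for the remaining feature columns. The `target` column is left as strings.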

In [0]:
# One-hot encode the mostly numeric columns
car = pd.get_dummies(car, columns=["doors", "persons"])

# Ordered label encoding for the remaining features: each code is the
# level's position in the list passed to reorder_categories
car["buying"] = car["buying"].astype("category").cat.reorder_categories(["low", "med", "high", "vhigh"]).cat.codes
car["maint"] = car["maint"].astype("category").cat.reorder_categories(["low", "med", "high", "vhigh"]).cat.codes
car["lug_boot"] = car["lug_boot"].astype("category").cat.reorder_categories(["small","med","big"]).cat.codes
car["safety"] = car["safety"].astype("category").cat.reorder_categories(["low","med","high"]).cat.codes
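
To make the two encodings concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data is hypothetical, not taken from car.data). It uses `pd.Categorical` with an explicit category list, which should yield the same codes as the `reorder_categories` chain above.

```python
import pandas as pd

# Toy frame reusing the level names of "buying" and "doors" (not the real data).
toy = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "high"],
    "doors": ["2", "5more", "3", "2"],
})

# Ordered label encoding: each code is the level's position in `order`.
order = ["low", "med", "high", "vhigh"]
toy["buying"] = pd.Categorical(toy["buying"], categories=order, ordered=True).codes

# One-hot encoding: one indicator column per distinct value of "doors".
toy = pd.get_dummies(toy, columns=["doors"])

print(toy)
```

Label encoding keeps the natural ranking of levels such as `low < med < high`, while one-hot encoding avoids implying a distance between values like `4` and `5more`.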

## Ensure presence of all data

In [8]:
car.isna().sum(axis=0)

buying          0
maint           0
lug_boot        0
safety          0
target          0
doors_2         0
doors_3         0
doors_4         0
doors_5more     0
persons_2       0
persons_4       0
persons_more    0
dtype: int64

The website says no data is missing, and the check confirms it: every column has zero missing values.

## Split data into features and targets

In [0]:
features = car.drop('target', axis=1)
features = features.to_numpy()

targets = car.target.to_numpy()

## Use sklearn to split the features and targets into training and test sets

In [0]:
train_features, test_features, train_targets, test_targets = train_test_split(features, targets, test_size=.3)
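
The split above is random, so the accuracy reported later changes from run to run. A sketch of a reproducible, stratified variant follows. It assumes a fixed seed of 42 (an arbitrary choice) and uses toy arrays `X` and `y` in place of the real `features` and `targets`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for `features` and `targets` (not the real car data).
X = np.arange(40).reshape(20, 2)
y = np.array(["unacc"] * 14 + ["acc"] * 6)

# random_state fixes the shuffle; stratify=y keeps the class proportions
# roughly equal in the training and test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Stratifying may matter for this dataset because the first rows shown are all `unacc`, which suggests the classes are not evenly balanced.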

## Create, train, and test an sklearn kNN classifier

In [11]:
knnClassifier = KNeighborsClassifier(n_neighbors=5)
knnClassifier.fit(train_features, train_targets)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [12]:
class_predictions = knnClassifier.predict(test_features)

accuracy_score(test_targets, class_predictions)

0.9229287090558767
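
As a sanity check on what `accuracy_score` reports: it is the fraction of predictions that match the true labels. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's results):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels; three of the four predictions are correct.
y_true = np.array(["unacc", "acc", "unacc", "good"])
y_pred = np.array(["unacc", "unacc", "unacc", "good"])

# Accuracy computed by hand: the mean of the element-wise matches.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```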

## Prepare data for regression tests

In [13]:
car.head(5)

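|   | buying | maint | lug_boot | safety | target | doors_2 | doors_3 | doors_4 | doors_5more | persons_2 | persons_4 | persons_more |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0 | unacc | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 3 | 3 | 0 | 1 | unacc | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 0 | 2 | unacc | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | 3 | 1 | 0 | unacc | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 3 | 3 | 1 | 1 | unacc | 1 | 0 | 0 | 0 | 1 | 0 | 0 |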


## Create, train, and test an sklearn kNN regressor

In [14]:
knnRegressor = KNeighborsRegressor(n_neighbors=5)
knnRegressor.fit(train_features, train_targets)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [15]:
regress_predictions = knnRegressor.predict(test_features)
regress_predictions

TypeError: ignored
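
The regressor predicts by averaging the targets of the nearest neighbors, and `train_targets` still holds string labels such as `unacc`, which cannot be averaged. That is the likely cause of the `TypeError`. One possible fix, sketched here on toy data, maps the labels to integers before fitting. The ordering `unacc < acc < good < vgood` is an assumption about how the classes rank, and `target_order`, `toy_features`, and `toy_targets` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Assumed ordinal ranking of the class labels, worst to best.
target_order = {"unacc": 0, "acc": 1, "good": 2, "vgood": 3}

# Toy stand-ins for the notebook's feature and target arrays.
toy_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]])
toy_targets = np.array(["unacc", "unacc", "acc", "acc", "good", "vgood"])

# Convert the string labels to numbers so neighbors can be averaged.
numeric_targets = np.array([target_order[label] for label in toy_targets])

regressor = KNeighborsRegressor(n_neighbors=2)
regressor.fit(toy_features, numeric_targets)

# The two nearest neighbors of [3, 3] are [3, 3] (vgood) and [2, 2] (good).
prediction = regressor.predict(np.array([[3, 3]]))
print(prediction)
```

In the notebook itself, the same mapping could be applied to `car["target"]` before the features and targets are split, so that the regressor receives numeric targets.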