# IS 470 Lab 5: K Nearest Neighbor

---

This data set contains information about cars purchased at auction.
<br>
We will use this file to predict the quality of buying decisions and visualize decision processes.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
Auction: Auction provider at which the vehicle was purchased<br>
Color: Vehicle Color<br>
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase<br>
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day<br>
Size: The size category of the vehicle (Compact, SUV, etc.)<br>
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers<br>
VehBCost: Acquisition cost paid for the vehicle at time of purchase<br>
VehicleAge: Years elapsed since the vehicle's year of manufacture<br>
VehOdo: The vehicle's odometer reading<br>
WarrantyCost: Warranty price (term = 36 months and mileage = 36K)<br>
WheelType: The vehicle wheel type description (Alloy, Covers)<br>
<br>
Target variable: **IsBadBuy**

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [3]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

## 1. Upload and clean data

In [None]:
# Read data
car_kick = pd.read_csv("/content/drive/MyDrive/IS470_data/car_kick.csv")
car_kick

In [5]:
# Select the desired columns only
desired_columns = ['Auction', 'Color', 'IsBadBuy', 'MMRCurrentAuctionAveragePrice', 'Size', 'TopThreeAmericanName',
                   'VehBCost', 'VehicleAge', 'VehOdo', 'WarrantyCost', 'WheelType']
car_kick_desired = car_kick[desired_columns]

In [6]:
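# Replace 1 with Yes and 0 with No in the target column IsBadBuy
# Work on an explicit copy so the edits below do not act on a slice of car_kick
# (which can trigger pandas' SettingWithCopyWarning)
carAuction = car_kick_desired.copy()
# Assign the whole column so pandas can change its dtype from integer to string labels
carAuction['IsBadBuy'] = carAuction['IsBadBuy'].replace({0: 'No', 1: 'Yes'})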

In [None]:
# Examine variable type
carAuction.dtypes

In [8]:
# Change categorical variables to "category"
carAuction['Auction'] = carAuction['Auction'].astype('category')
carAuction['Color'] = carAuction['Color'].astype('category')
carAuction['IsBadBuy'] = carAuction['IsBadBuy'].astype('category')
carAuction['Size'] = carAuction['Size'].astype('category')
carAuction['TopThreeAmericanName'] = carAuction['TopThreeAmericanName'].astype('category')
carAuction['WheelType'] = carAuction['WheelType'].astype('category')

In [None]:
# Examine variable type
carAuction.dtypes

In [None]:
# Examine top five rows
carAuction.head()

In [None]:
# Create dummy variables
carAuction = pd.get_dummies(carAuction, columns=['Auction','Color','Size','TopThreeAmericanName','WheelType'], drop_first=True)
carAuction
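
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical three-level `WheelType` column (the level `Special` is an assumption for illustration, not taken from `car_kick.csv`). The first level in sorted order is dropped and becomes the baseline, which is encoded as all zeros in the remaining dummy columns.

```python
import pandas as pd

# Toy data with hypothetical levels (not read from car_kick.csv)
toy = pd.DataFrame({"WheelType": ["Alloy", "Covers", "Special", "Alloy"]})

# drop_first=True removes the dummy for the first level ("Alloy"),
# so k levels are encoded with k - 1 columns
dummies = pd.get_dummies(toy, columns=["WheelType"], drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```

Rows whose original value was `Alloy` carry no positive flag in either remaining column, so no information is lost by dropping it.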

In [None]:
# Take the target and examine the proportion of target variable for each class
target = carAuction['IsBadBuy']
print(target.value_counts(normalize=True))

In [13]:
# Drop the target variable and put all the predictors in a new dataframe
predictors = carAuction.drop(['IsBadBuy'],axis=1)

In [None]:
# Apply minmax normalization on predictors
min_max_scaler = preprocessing.MinMaxScaler()
# Keep the original column names and row index so the rows stay aligned with target
predictors_normalized = pd.DataFrame(min_max_scaler.fit_transform(predictors),
                                     columns=predictors.columns, index=predictors.index)
predictors_normalized
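
Min-max normalization maps each column to the range [0, 1] with (x - min) / (max - min). This matters for KNN because distances are otherwise dominated by columns with large units such as odometer readings. The sketch below applies the formula by hand to made-up values (the numbers are assumptions for illustration, not rows from the data set).

```python
import pandas as pd

# Toy data (hypothetical values, not taken from car_kick.csv)
toy = pd.DataFrame({
    "VehOdo": [40000, 70000, 100000],
    "VehicleAge": [2, 5, 8],
})

# Min-max scaling by hand, column by column: (x - min) / (max - min)
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

After scaling, both columns span the same range, so a difference of 30,000 miles no longer outweighs a difference of three years simply because of its units.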

## 2. Partition and balance the data set for K Nearest Neighbor model

In [None]:
# Partition the data
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors_normalized, target, test_size=0.3, random_state=0)
print(predictors_train.shape, predictors_test.shape, target_train.shape, target_test.shape)

In [77]:
# Taking steps to balance the train data
# Combine predictors_train and target_train into a single DataFrame
combined_train_df = pd.concat([predictors_train, target_train], axis=1)

# Separate majority and minority classes
majority_df = combined_train_df[combined_train_df['IsBadBuy'] == 'No']
minority_df = combined_train_df[combined_train_df['IsBadBuy'] == 'Yes']
#print(len(majority_df), len(minority_df))
# Undersample the majority class randomly
undersampled_majority = majority_df.sample(n=int(1*len(minority_df)), random_state=55)

# Combine the undersampled majority class and the minority class
undersampled_data = pd.concat([undersampled_majority, minority_df])

# Shuffle the combined DataFrame to ensure randomness
balanced_data = undersampled_data.sample(frac=1, random_state=1)

# Split the balanced_data into predictors_train and target_train
predictors_train = balanced_data.drop(columns=['IsBadBuy'])
target_train = balanced_data['IsBadBuy']
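
The same undersampling steps can be wrapped in a small helper. This is a minimal sketch on toy data. The function name `undersample_majority` and the toy values are hypothetical, and it assumes the target column holds plain string labels rather than a pandas categorical.

```python
import pandas as pd

def undersample_majority(df, target_col, random_state=0):
    """Return a shuffled frame in which every class has as many rows as the rarest class."""
    n_min = df[target_col].value_counts().min()
    # Draw n_min rows at random from each class
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(target_col)]
    # Concatenate the per-class samples and shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy data: 8 "No" rows and 2 "Yes" rows (hypothetical values)
toy = pd.DataFrame({"x": range(10), "IsBadBuy": ["No"] * 8 + ["Yes"] * 2})
balanced = undersample_majority(toy, "IsBadBuy")

print(balanced["IsBadBuy"].value_counts())
```

Only the training partition is balanced this way. The testing partition keeps its original class proportions so that the evaluation reflects the data the model would actually see.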

In [None]:
# Examine the proportion of target variable for training data set
print(target_train.value_counts(normalize=True))

In [None]:
# Examine the proportion of target variable for testing data set
print(target_test.value_counts(normalize=True))

## 3. K Nearest Neighbor model prediction
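
As background for this section, the idea behind K Nearest Neighbor can be sketched in a few lines of plain Python: measure the distance from a new point to every training point, keep the k closest, and take a majority vote over their labels. The function `knn_predict` and the toy points are hypothetical, and the tie-breaking here may differ from what `KNeighborsClassifier` does internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, already-normalized points (hypothetical values)
train_points = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]
train_labels = ["No", "No", "Yes", "Yes", "Yes"]

print(knn_predict(train_points, train_labels, (0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, (0.75, 0.80), k=3))
```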

### Build a K Nearest Neighbor model with n_neighbors = 1

In [80]:
# Build a K Nearest Neighbor model on training data with n_neighbors = 1 (1 point)
model1 = KNeighborsClassifier(n_neighbors = 1)
model1.fit(predictors_train, target_train)

In [81]:
# Make predictions on training and testing data (1 point)
prediction_on_train = model1.predict(predictors_train)
prediction_on_test = model1.predict(predictors_test)

In [None]:
# Examine the evaluation results on training data: confusion_matrix
cm = confusion_matrix(target_train, prediction_on_train)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model1.classes_).plot()
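
The precision and recall figures in a classification report come directly from the cells of the confusion matrix. The sketch below computes them by hand for the `Yes` class on made-up labels (the label lists are assumptions for illustration, not model output from this lab).

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, not produced by model1)
y_true = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]
y_pred = ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"]

# With labels=["No", "Yes"], rows are actual classes and columns are predicted classes
cm_toy = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = cm_toy.ravel()

# Precision: of the cars predicted "Yes", how many really are bad buys
precision_yes = tp / (tp + fp)
# Recall: of the actual bad buys, how many the model caught
recall_yes = tp / (tp + fn)

print(cm_toy)
print(precision_yes, recall_yes)
```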

In [None]:
# Examine the evaluation results on training data: accuracy, precision, recall, and f1-score (1 point)
print(classification_report(target_train, prediction_on_train))

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model1.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score (1 point)
print(classification_report(target_test, prediction_on_test))

Q1. Compare the performances on training and testing sets by answering the following: <br>

a. Why do we get perfect evaluation results on the training data? (1 point)<br>


b. Does the KNN model with n_neighbors = 1 generalize well to the testing set? Why? (1 point)<br>


### Build a K Nearest Neighbor model with n_neighbors = 4

In [None]:
# Build a K Nearest Neighbor model on training data with n_neighbors = 4 (1 point)
model2 = KNeighborsClassifier(n_neighbors = 4)
model2.fit(predictors_train, target_train)

In [87]:
# Make predictions on training and testing data
prediction_on_train = model2.predict(predictors_train)
prediction_on_test = model2.predict(predictors_test)

In [None]:
# Examine the evaluation results on training data: confusion_matrix
cm = confusion_matrix(target_train, prediction_on_train)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model2.classes_).plot()

In [None]:
# Examine the evaluation results on training data: accuracy, precision, recall, and f1-score
print(classification_report(target_train, prediction_on_train))

In [None]:
# Examine the evaluation results on testing data: confusion_matrix
cm = confusion_matrix(target_test, prediction_on_test)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model2.classes_).plot()

In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score (1 point)
print(classification_report(target_test, prediction_on_test))

Q2. Which KNN model is better at identifying bad-buy cars (n_neighbors = 1 or 4), and why? (1 point)<br>

Q3. How would the KNN model perform if we made n_neighbors very large? (1 point)<br>

In [92]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/IS470_lab05.ipynb"

[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab Notebooks/IS470_lab05.ipynb to html
[NbConvertApp] Writing 674966 bytes to /content/drive/MyDrive/Colab Notebooks/IS470_lab05.html
