## C S 329E HW 6

# KNN 

## Your name here (and your partner's name if you are working in a pair)
Yixing Ma, Daniel Lam, pair 40

For this week's homework we are going explore one new classification technique:

  - k nearest neighbors

We are using a different version of the Melbourne housing data set, to predict the housing type as one of three possible categories:

  - 'h' house
  - 'u' duplex
  - 't' townhouse

At the end of this homework, I expect you to understand how to build and use a kNN model, and practice your data cleaning and data preparation skills. 

In [1]:
# These are the libraries you will use for this assignment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import calendar
%matplotlib inline

# Starting off loading a training set
df_melb = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/81b236aecee57f6cf65e60afd865d2bb/raw/56ddb53aa90c26ab1bdbfd0b8d8229c8d08ce45a/melb_data_train.csv')

## Q1 - Fix a column of data to be numeric
If we inspect our dataframe, `df_melb` using the `dtypes` method, we see that the column "Date" is an object.  However, we think this column might contain useful information so we want to convert it to [seconds since epoch](https://en.wikipedia.org/wiki/Unix_time). Use only the exiting imported libraries to create a new column "unixtime". Be careful, the date strings in the file might have some non-uniform formatting that you have to fix first.  Print out the min and max epoch time to check your work.  Drop the original "Date" column. Please use the python [reference for time](https://docs.python.org/3/library/time.html) to help you do the string to Unix time conversion. 

In [2]:
# normalize date accepts the date string as shown in the df_melb 'Date' column,
# and returns a data in a standarized format
def standardize_date(d):
    # Your code here
    # calendar.timegm
    day, month, year = d.split('/')
    day, month, year = day.strip(), month.strip(), year.strip()
    if len(day) < 2: day = '0' + day
    if len(month) < 2: month = '0' + month
    if len(year) < 4: year = "20" + year
    standard_date = day + '.' + month + '.' + year
    return standard_date
    # Your code here

In [3]:
df_melb['Date'] = df_melb['Date'].apply( lambda x : standardize_date(x)) 
df_melb['unixtime'] = df_melb['Date'].apply(lambda x : int(calendar.timegm(time.strptime(x, "%d.%m.%Y"))))
df_melb = df_melb.drop(columns="Date")

print("The min unixtime is {:d} and the max unixtime is {:d}".format(df_melb['unixtime'].min(),df_melb['unixtime'].max()))

The min unixtime is 1454544000 and the max unixtime is 1506124800


## Q2 Use Imputation to fill in missing values
kNN doesn't work when the attributes are not valid for all of the attribute columns, so fill in all the missing values in `df_melb` with the mean of that column.  Save the mean of each column in a dictionary, `dict_imputation`, whose key is the attribute column name, so we can apply the same imputation to the test set later. Show your `dict_imputation` dictionary and the head of your `df_melb` dataframe.  The target classfication (aka the class label) is stored in the column `'Type'`, so we are going to define a variable target_col so we can reference the target_col using a variable. (hint: during imputation you skip the target column)

In [4]:
target_col = 'Type'

In [5]:
dict_imputation = dict()
for col in df_melb.columns:
    # your code here
    if col != target_col:
        dict_imputation[col] = df_melb[col].dropna().mean()
        df_melb[col].fillna(dict_imputation[col], inplace=True)

In [6]:
dict_imputation

{'Bathroom': 1.44,
 'BuildingArea': 121.7832,
 'Car': 1.503006012024048,
 'Distance': 10.524599999999985,
 'Landsize': 638.91,
 'Postcode': 3113.122,
 'Price': 932558.7,
 'Rooms': 2.71,
 'YearBuilt': 1970.9417475728155,
 'unixtime': 1485178502.4}

In [7]:
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,2,h,399000,8.7,3032,1,1.0,904,53.0,1985.0,1462579200
1,3,h,1241000,13.9,3165,1,1.0,643,121.7832,1970.941748,1472342400
2,2,u,550000,3.0,3067,1,1.0,1521,121.7832,1970.941748,1499472000
3,3,u,691000,8.4,3072,1,1.0,170,121.7832,1970.941748,1498262400
4,2,u,657500,4.6,3122,1,1.0,728,73.0,1965.0,1479513600


## Q3 Normalize all the attributes to be between [0,1]
Normalize all the attribute columns in `df_melb` so they have a value between zero and one (inclusive). Save the (min,max) tuple used to normalize to a dictionary, `dict_normalize`, so we can apply it to the test set later.  The dataframe `df_melb` is now your "model" that you can use to classify new data points. (hint: during normalization you skip the class label column)

In [8]:
dict_normalize = dict()
for col in df_melb.columns:
    # your code here
    dict_normalize[col] = (df_melb[col].min(), df_melb[col].max())
    if col != target_col:
        df_melb[col] = ((df_melb[col]-df_melb[col].min()) / (df_melb[col].max()-df_melb[col].min()))
        

In [9]:
dict_normalize

{'Bathroom': (0, 4),
 'BuildingArea': (0.0, 475.0),
 'Car': (0.0, 4.0),
 'Distance': (0.7, 47.3),
 'Landsize': (0, 41400),
 'Postcode': (3002, 3810),
 'Price': (291000, 5020000),
 'Rooms': (1, 6),
 'Type': ('h', 'u'),
 'YearBuilt': (1890.0, 2015.0),
 'unixtime': (1454544000, 1506124800)}

In [10]:
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,0.2,h,0.022838,0.171674,0.037129,0.25,0.25,0.021836,0.111579,0.76,0.155779
1,0.4,h,0.200888,0.283262,0.201733,0.25,0.25,0.015531,0.256386,0.647534,0.345059
2,0.2,u,0.054768,0.049356,0.080446,0.25,0.25,0.036739,0.256386,0.647534,0.871022
3,0.4,u,0.084584,0.165236,0.086634,0.25,0.25,0.004106,0.256386,0.647534,0.847571
4,0.2,u,0.077501,0.083691,0.148515,0.25,0.25,0.017585,0.153684,0.6,0.484087


## Q4 Load in the Test data and prep it for classification
Everything we did to our "train" set, we need to now do in our "test" set. 

In [11]:
df_test = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/c3d53303cebbd986b166591d19254bac/raw/94eb3b2d500d5f7bbc0441a8419cd855349d5d8e/melb_data_test.csv')

In [12]:
# Your code here to fix date
df_test['Date'] = df_test['Date'].apply( lambda x : standardize_date(x))
df_test['unixtime'] = df_test['Date'].apply(lambda x : int(calendar.timegm(time.strptime(x, "%d.%m.%Y"))))
df_test = df_test.drop(columns="Date")
print("The min unixtime is {:d} and the max unixtime is {:d}".format(df_test['unixtime'].min(),df_test['unixtime'].max()))

The min unixtime is 1454544000 and the max unixtime is 1506124800


In [13]:
# Your code here for imputation - must use dictionary from above!
dict_imputation = dict()
for col in df_test.columns:
    # your code here
    if col != target_col:
        dict_imputation[col] = df_test[col].dropna().mean()
        df_test[col].fillna(dict_imputation[col], inplace=True)
df_test.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,3,h,1116000,17.9,3192,1,2.0,610,150.72913,1967.396226,1498867200
1,3,h,2030000,11.2,3186,2,2.0,366,150.72913,1967.396226,1472342400
2,3,h,1480000,10.7,3187,2,2.0,697,143.0,1925.0,1478476800
3,3,u,1203500,12.3,3166,2,2.0,311,127.0,2000.0,1495843200
4,3,h,540000,14.7,3030,2,2.0,353,135.0,2011.0,1504396800


In [14]:
# Your code here for scaling - must use dictionary from above!

dict_normalize = dict()
for col in df_test.columns:
    # your code here
    dict_normalize[col] = (df_test[col].min(), df_test[col].max())
    if col != target_col:
        df_test[col] = ((df_test[col]-df_test[col].min()) / (df_test[col].max()-df_test[col].min()))
df_test.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,0.333333,h,0.31245,0.447802,0.433014,0.0,0.333333,0.121466,0.219185,0.624163,0.859296
1,0.333333,h,0.674861,0.263736,0.41866,0.5,0.333333,0.072879,0.219185,0.624163,0.345059
2,0.333333,h,0.45678,0.25,0.421053,0.5,0.333333,0.138789,0.202198,0.282258,0.463987
3,0.333333,u,0.347145,0.293956,0.370813,0.5,0.333333,0.061928,0.167033,0.887097,0.80067
4,0.333333,h,0.08406,0.35989,0.045455,0.5,0.333333,0.070291,0.184615,0.975806,0.966499


## Q5 Write the kNN classifier function
Your function `knn_class`, should take five parameters, the training dataframe (that includes the target column), the hyper parameter `k`, the name of the target column, a single observation row (a series or attributes the same length as the attributes in `df_train`) of the test dataframe, and a boolean `use_weighted_vote`.  When `use_weighted_vote` is set to true, use weighted voting, otherwise use majority voting. We are assuming that the parameter `df_train` contains all of the attributes, and the target class in the same dataframe. The function returns the predicted target classification for that observation. To find the distance between the single observation and the training data frame you should use the [L2 norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html)

In [15]:
def knn_class(df_train, k, target_col, observation, use_weighted_vote ):
   # your code here
  observation = observation.drop(labels = [target_col])
  dist_m = []
  dist_w = []
  for row in df_train.iterrows():
    train_row = row[1].drop(labels = [target_col])
    dist_m.append(np.linalg.norm(observation - train_row))
    dist_w.append(1 / (np.linalg.norm(observation - train_row)**2))

  df_train['Distance_from'] = dist_m 
  df_train['Weighted_score'] = dist_w

  if use_weighted_vote:
    dist_weight_sort = df_train.sort_values('Weighted_score', ascending = False)
    knn = dist_weight_sort.head(1)

    return ((knn[target_col].mode()[0]))

  else:
    dist_sort = df_train.sort_values('Distance_from', ascending = False)
    knn = dist_sort.head(k)
    return  ((knn[target_col].mode()[0]))

## Q6 Compute the accuracy using different k values
For each value of $k$ in the set $\{1,3,13,25,50,100\}$ calculate the class prediction for each oberservation in the test set, and the overall accuracy of the classifier.  Plot the accuracy as a function of $k$ when `use_weighted_vote` is `True` and when `use_weighted_vote` is `False`.

Which value of $k$ would you chose, and would you use weighted voting or majority voting?

Note, this took 20 seconds for me on Google Colab. 

In [16]:
poss_k = [1,3,13,25,50,100] # possible k's
acc_k_majority = list(np.zeros(len(poss_k))) # Accuracy for each value of k using majority voting
acc_k_weighted = list(np.zeros(len(poss_k))) # Accuracy for each value of k using weighted voting

# Your code here
for k in range(len(poss_k)):
    for index, obs in df_test.iterrows():
        if (knn_class(df_melb, poss_k[k], target_col, obs, False) == obs["Type"]):
            acc_k_weighted[k] +=1
        acc_k_weighted[k] = acc_k_weighted[k]/len(df_test)
        

print(acc_k_weighted)

KeyboardInterrupt: ignored

In [None]:
# plot code here

plt.plot(poss_k, acc_k_majority, marker="o", color='pink')
plt.plot(poss_k, acc_k_weighted, marker="o", color='blue')
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Accuracy of KNN")
plt.show()

➡️ Answer containing your analysis of the I would choose $k = <value> $ and voting scheme because _reasons_ here ⬅️