## C S 363D HW 5

# KNN and Naive Bayes

## Ana Williams and Fronrich Puno

For this week's homework we are going explore two new classification techniques:

  - k nearest neighbors, and
  - Naive Bayes
  
Along with brushing up on the application of probability theory and Bayes Theorem. 

We are using a different version of the Melbourne housing data set, to predict the housing type as one of three possible categories:
  - 'h' house
  - 'u' duplex
  - 't' townhouse

At the end of this homework, I expect you to understand how to build a model using each of our 2 new techniques, and refresh a few concepts from your course in probability theory

## Section 1 - kNN 

In [1]:
# These are the libraries you will use for this assignment, you may not import anything else
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import calendar
%matplotlib inline

# Starting off loading a training set
df_melb = pd.read_csv('melb_data_train.csv')

## Q1.1 - Fix a column of data to be numeric
If we inspect our dataframe, `df_melb` using the `dtypes` method, we see that the column "Date" is an object.  However, we think this column might contain useful information so we want to convert it to [seconds since epoch](https://en.wikipedia.org/wiki/Unix_time). Use only the existing imported libraries to create a new column "unixtime". Be careful, the date strings in the file have some non-uniform formatting that you have to fix first.  Print out the min and max epoch time to check your work.  Drop the original "Date" column. 

In [2]:
dates = df_melb['Date'].str.split('/', expand=True)
dates[2] = '20' + dates[2].str.slice(start=-2).astype(str)
dates = dates.rename(columns = {0 : 'day', 1 : 'month', 2 : 'year'})
dates['utc'] = pd.to_numeric(pd.to_datetime(dates))

In [3]:
df_melb['Unixtime'] = dates['utc']//10**9
df_melb = df_melb.drop(['Date'], axis = 1)

In [4]:
df_melb['Unixtime'].min()

1454544000

In [5]:
df_melb['Unixtime'].max()

1506124800

## Q1.2 Use Imputation to fill in missing values
kNN doesn't work when all of the attributes are not valid for all of the attribute columns, so fill in all the missing values in `df_melb` with the mean of that column.  Save the mean of each column in a dictionary, `dict_imputation`, whose key is the column name, so we can apply the same imputation to the test set later. Show your `dict_imputation` dictionary and the head of your `df_melb` dataframe

In [6]:
df_melb = df_melb.fillna(df_melb.mean())
dict_imputation = pd.DataFrame(df_melb.mean()).to_dict()[0]

In [7]:
# Print out dict_imputation
dict_imputation

{'Rooms': 2.710769230769231,
 'Price': 941972.2953846154,
 'Distance': 10.206256410256412,
 'Postcode': 3110.873846153846,
 'Bathroom': 1.4543589743589744,
 'Car': 1.4938398357289528,
 'Landsize': 514.2184615384615,
 'BuildingArea': 131.379476861168,
 'YearBuilt': 1971.0204429301366,
 'Unixtime': 1485036288.0}

## Q1.3 Normalize all the attributes to be between [0,1]
Normalize all the attribute columns in `df_melb` so they have a value between zero and one (inclusive). Save the (min,max) tuple used to normalize to a dictionary, `dict_normalize`, so we can apply it to the test set later.  The dataframe `df_melb` is now your "model" that you can use to classify new data points.

In [8]:
df_melb.head()
MinMax = df_melb.drop('Type', axis = 1)
MinMax = pd.Series(index = MinMax.columns, data = list(zip(MinMax.min(),MinMax.max())))
dict_normalize = MinMax.to_dict()

In [9]:
# print out dict_normalize
dict_normalize

{'Rooms': (1.0, 7.0),
 'Price': (210000.0, 5020000.0),
 'Distance': (0.7, 47.3),
 'Postcode': (3000.0, 3810.0),
 'Bathroom': (0.0, 5.0),
 'Car': (0.0, 8.0),
 'Landsize': (0.0, 41400.0),
 'BuildingArea': (0.0, 3558.0),
 'YearBuilt': (1850.0, 2016.0),
 'Unixtime': (1454544000.0, 1506124800.0)}

In [10]:
temp = df_melb['Type']
df_melb = df_melb.drop('Type', axis=1)
for colName, colData in df_melb.iteritems():
    df_melb[colName] = (colData - dict_normalize[colName][0])/(dict_normalize[colName][1] - dict_normalize[colName][0])
df_melb['Type'] = temp
df_melb.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Unixtime,Type
0,0.333333,0.108524,0.10515,0.124691,0.2,0.125,0.021836,0.030916,0.783133,0.289782,t
1,0.333333,0.164449,0.255365,0.024691,0.2,0.625,0.021232,0.036925,0.729039,0.659966,h
2,0.166667,0.082121,0.143777,0.228395,0.2,0.125,0.01744,0.036925,0.722892,0.155779,u
3,0.333333,0.113825,0.388412,0.209877,0.4,0.125,0.003502,0.036925,0.729039,0.835846,h
4,0.5,0.106237,0.369099,0.101235,0.4,0.25,0.014565,0.036925,0.729039,0.988275,h


## Q1.4 Load in the Test data and prep it for classification
Everything we did to our "train" set, we need to now do in our "test" set. 

In [11]:
df_test = pd.read_csv("melb_data_test.csv")

In [12]:
# Add unixtime and column remove 'Date' string column
dates = df_test['Date'].str.split('/', expand=True)
dates[2] = '20' + dates[2].str.slice(start=-2).astype(str)
dates = dates.rename(columns = {0 : 'day', 1 : 'month', 2 : 'year'})
dates['utc'] = pd.to_numeric(pd.to_datetime(dates))
df_test['Unixtime'] = dates['utc']//10**9
df_test = df_test.drop(['Date'], axis = 1)

In [13]:
# Imputation - must use dictionary from above!
df_test = df_test.fillna(df_test.mean())

In [14]:
# Scale - must use dictionary from above!
MinMax = pd.DataFrame({'Min': df_test.min(), 'Max': df_test.max()})
MinMax = MinMax.drop('Type')
dict_normalize = MinMax.to_dict('index')
temp = df_test['Type']
df_test = df_test.drop('Type', axis=1)
for colName, colData in df_test.iteritems():
    df_test[colName] = (colData - dict_normalize[colName]['Min'])/(dict_normalize[colName]['Max'] - dict_normalize[colName]['Min'])
df_test['Type'] = temp
df_test.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Unixtime,Type
0,0.166667,0.183188,0.263736,0.083732,0.5,0.166667,0.041418,0.246521,0.967742,0.835846,t
1,0.333333,0.407216,0.197802,0.145933,0.0,0.333333,0.182397,0.290426,0.624069,0.425461,h
2,0.666667,0.98414,0.129121,0.315789,0.5,0.333333,0.122859,0.290426,0.624069,0.345059,h
3,0.333333,0.206979,0.244505,0.055024,0.0,0.166667,0.11808,0.228628,0.645161,0.355109,h
4,0.333333,0.191118,1.0,0.449761,0.0,0.333333,0.166268,0.228628,0.564516,0.871022,h


## Q1.5 Write the kNN classifier function
Your function `knn_class`, should take four parameters, the training dataframe, the hyper parameter `k`, the name of the target column, and a single observation row (a series generated from iterrows) of the test dataframe.  It should return a single target classification. To find the distance between the single observation and the training data frame you may use the [L2 norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html)

In [15]:
# create function to calculate distance
def calc_distance(point_a, point_b, p=1):
    # store number of dimensions
    dimensions = len(point_a)
    # set inital distance to 0
    distance = 0
    
    
    
    # calc distance
    for dim in range(dimensions):
        # print("point_a[dim]" + str(point_a[dim]) + "\t\t\tpoint_b[dim]" + str(point_b[dim]))
        distance += abs(point_a[dim] - point_b[dim])**p
        
    distance = distance**(1/p)
    return distance

In [16]:
# sniff test distance function
df_melb.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Unixtime,Type
0,0.333333,0.108524,0.10515,0.124691,0.2,0.125,0.021836,0.030916,0.783133,0.289782,t
1,0.333333,0.164449,0.255365,0.024691,0.2,0.625,0.021232,0.036925,0.729039,0.659966,h
2,0.166667,0.082121,0.143777,0.228395,0.2,0.125,0.01744,0.036925,0.722892,0.155779,u
3,0.333333,0.113825,0.388412,0.209877,0.4,0.125,0.003502,0.036925,0.729039,0.835846,h
4,0.5,0.106237,0.369099,0.101235,0.4,0.25,0.014565,0.036925,0.729039,0.988275,h


In [17]:
def knn_class(df_train, k, target_col, observation ):
    # df_train - the training data set
    # k - hyperparameter
    # target_col - the target column
    # observation - series generated from iterrows
    
    # get distances
    test_x = df_train.copy().drop(target_col, axis=1)
    
    test_pt = observation
    
    distances = []
    for i in test_x.index:
        distances.append(calc_distance(test_pt, test_x.iloc[i]))

    # get k nearest neigbors
    df_dists = pd.DataFrame(data=distances, index=test_x.index, columns=['dist'])
    df_nn = df_dists.sort_values(by=['dist'], axis=0)[:k]
    return df_nn

In [18]:
# test
# create observation series
targ = 'Type'
obs = df_melb.copy().drop(targ, axis=1).iloc[0]
knn_test = knn_class(df_train=df_melb, k=5, target_col=targ, observation=obs)
knn_test  

Unnamed: 0,dist
0,0.0
635,0.189898
11,0.222082
476,0.264582
81,0.276023


## Q1.6 Compute the accuracy using different k values
For each value of $k$ in the set $\{1,3,13,25,50,100\}$ calculate the class prediction for each observation in the test set, and the overall accuracy of the classifier.  Plot the accuracy as a function of $k$.

Please note: this can be slow, and took the 13" Macbook I used about 5 mins to complete.  When testing your code, you might want to use a smaller test/train data set until you are sure your code is working, and then run this cell using the entire dataframe/series.  

In [19]:
poss_k = [1,3,13,25,50,100]

In [20]:
# iterate through all k
targ = 'Type'
obs = df_melb.copy().drop(targ, axis=1).iloc[0]
k_res = {}
for x in range(len(poss_k)):
    k_val = poss_k[x]
    k_res[k_val] = knn_test = knn_class(df_train=df_melb, k=k_val, target_col=targ, observation=obs)
k_res

{1:    dist
 0   0.0,
 3:          dist
 0    0.000000
 635  0.189898
 11   0.222082,
 13:          dist
 0    0.000000
 635  0.189898
 11   0.222082
 476  0.264582
 81   0.276023
 568  0.281001
 412  0.323599
 772  0.333176
 351  0.336661
 574  0.340375
 822  0.347040
 733  0.347440
 569  0.352887,
 25:          dist
 0    0.000000
 635  0.189898
 11   0.222082
 476  0.264582
 81   0.276023
 568  0.281001
 412  0.323599
 772  0.333176
 351  0.336661
 574  0.340375
 822  0.347040
 733  0.347440
 569  0.352887
 522  0.353804
 525  0.357934
 72   0.366596
 745  0.373104
 186  0.373455
 185  0.374036
 609  0.380799
 848  0.381669
 183  0.385371
 96   0.394576
 102  0.394993
 447  0.397572,
 50:          dist
 0    0.000000
 635  0.189898
 11   0.222082
 476  0.264582
 81   0.276023
 568  0.281001
 412  0.323599
 772  0.333176
 351  0.336661
 574  0.340375
 822  0.347040
 733  0.347440
 569  0.352887
 522  0.353804
 525  0.357934
 72   0.366596
 745  0.373104
 186  0.373455
 185  0.374036


## Section 2 - Naive Bayes 

### Q2.1 Theoretical exercise on Probability Review - Joint Probability of dependent events
In my neighborhood in Austin, I took a survey and found that 42% of the houses have at least one dog, and 25% of the houses that own a dog also own a cat.  In addition, I found that 31% of families own a cat.

Answer the following by typing out your answers below using markdown cells.  Define all of your terms and show your work. When you define your terms you should write out the meaning of each variable. (Note markdown supports LaTex so, you can make fractions like this: `$\frac{a}{b}$` = $\frac{a}{b}$)

### Q2.1a
the probability that a randomly selected house in my neighborhood owns a cat and a dog

$P(Dog) = .42$
<br>
$P(Cat) = .31$
<br>
$P(Cat|Dog) = .25$
<br>
Want to find(dependent): $P(Dog,Cat) = P(Dog)*P(Cat|Dog) = 0.42*0.25 = 0.105$

### Q2.1b
the conditional probability that a randomly selected family owns a dog given that it owns a cat? 

$P(Dog) = .42$
<br>
$P(Cat) = .31$
<br>
$P(Cat|Dog) = .25$
<br>
Want to find: $P(Dog|Cat) = \frac{P(Cat|Dog) * P(Dog)}{P(Cat)} = \frac{0.25 * 0.42}{0.31} = 0.339$

### Q2.2 Theoretical exercise on Probability Review - Marginals and  Bayes Theorem 
In Austin, 45% of registered voters are Democrats, 37% of registered voters are Republicans, and the remaining 18% are Independents. In the last election 35% of the Democrats, 62% of the Republications, and 58% of the Independents voted. A voter is chosen at random.  

Answer the following by typing out your answers below using markdown cells.  Define all of your terms and show your work. When you define your terms you should write out the meaning of each variable. (Note markdown supports LaTex so, you can make fractions like this: `$\frac{a}{b}$` = $\frac{a}{b}$)

### Q2.2a
What fraction of registered voters voted in the election? 

$P(Dem) = 0.45$
<br>
$P(Rep) = 0.37$
<br>
$P(Ind) = 0.18$
<br>
$P(Vote|Dem) = 0.35$
<br>
$P(Vote|Rep) = 0.62$
<br>
$P(Vote|Ind) = 0.58$
<br>
Want to find:
<br>
$P(Vote) = P(Vote|Dem)*P(Dem) + P(Vote|Rep)*P(Rep) + P(Vote|Ind)*P(Ind) = 0.35*0.45 + 0.62*0.37 + 0.58*0.18 = 0.4913$
<br>

### Q2.2b
What is the probability that someone who voted is a Republican? 

$P(Dem) = 0.45$
<br>
$P(Rep) = 0.37$
<br>
$P(Ind) = 0.18$
<br>
$P(Vote|Dem) = 0.35$
<br>
$P(Vote|Rep) = 0.62$
<br>
$P(Vote|Ind) = 0.58$
<br>
$P(Vote) = 0.49$
<br>
Want to find:
<br>
$P(Rep|Vote) = \frac{P(Vote|Rep)*P(Rep)}{P(Vote)} = \frac{0.62*0.37}{0.49} = .4682$

## Q2.3 Loading in the housing data and calculating the prior probabilities
Let us load in the test set again fresh, since Naive Bays and kNN have much different requirements. Load the prior probabilities for each possible 'Type' in a dictionary, `dict_priors`, where the key is the possible 'Type' values and the value is the prior probabilities. Show the dictionary.

In [21]:
df_melb = pd.read_csv('melb_data_train.csv')

In [22]:
dict_priors = df_melb['Type'].value_counts(normalize = True).to_dict()

In [23]:
# Show the dictionary
dict_priors

{'h': 0.4512820512820513, 'u': 0.4, 't': 0.14871794871794872}

## Q2.4 Create a model for the distribution of all of the continuous attributes
First, let us just drop the 'Date' column.  Now, for each class, and for each attribute calculate the sample mean and sample standard deviation.  You should store the model in a nested dictionary, `dict_nb_model`, such that `dict_nb_model['h']['Rooms']` is a tuple containing the mean and standard deviation for the target Type 'h' and the attribute 'Rooms'.  Show the model for target type 'u'.   

In [24]:
df_melb = df_melb.drop('Date', axis = 1)

In [25]:
# Gathering by type
df_t = df_melb.loc[df_melb['Type'] == 't']
df_t = df_t.drop('Type', axis = 1)
df_h = df_melb.loc[df_melb['Type'] == 'h']
df_h = df_h.drop('Type', axis = 1)
df_u = df_melb.loc[df_melb['Type'] == 'u']
df_u = df_u.drop('Type', axis = 1)

In [26]:
#Getting Mean and standard deviation 
MeanDevt = pd.Series(index = df_t.columns, data = list(zip(df_t.mean(), df_t.std())))
MeanDevh = pd.Series(index = df_h.columns, data = list(zip(df_h.mean(), df_h.std())))
MeanDevu = pd.Series(index = df_u.columns, data = list(zip(df_u.mean(), df_u.std())))

In [27]:
dict_nb_model = {
't' : MeanDevt,
'h' : MeanDevh,
'u' : MeanDevu
}

In [28]:
# Show the model for target type 'u'
dict_nb_model['u']

Rooms            (2.0435897435897434, 0.5961723978138356)
Price             (641847.8205128205, 249107.83420195838)
Distance          (8.584102564102558, 5.1403593434394175)
Postcode            (3119.7179487179487, 74.959543061831)
Bathroom        (1.2025641025641025, 0.42115472276533567)
Car              (1.1615384615384616, 0.5709845304318577)
Landsize          (358.6871794871795, 1119.5762141636783)
BuildingArea       (83.86896551724139, 43.13879135095902)
YearBuilt         (1976.903474903475, 23.807088618695047)
dtype: object

## Q2.5 Write a function that calculates the probability of a Gaussian
Given the mean ($\mu$), standard deviation ($\sigma$), and a observed point, `x`, return the probability.  
Use the formula $p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$ ([wiki](https://en.wikipedia.org/wiki/Normal_distribution)).  You should use [numpy's exp](https://numpy.org/doc/stable/reference/generated/numpy.exp.html) function in your solution. 

In [29]:
def get_p( mu, sigma, x):
    denom = sigma * np.sqrt(2*np.pi)
    expVal = (-1/2)*np.square(((x-mu)/sigma))
    result = (1/denom)*np.exp(expVal)
    return result

## Q2.6 Write the Naive Bayes classifier function
The Naive Bayes classifier function, `nb_class`, should take as a parameter the prior probability dictionary. `dict_priors`, the dictionary containing all of the gaussian distribution information for each attribute, `dict_nb_model`, and a single observation row (a series generated from iterrows) of the test dataframe. It should return a single target classification. For this problem, all of our attributes are represented as Gaussians, so we don't worry about categorical data. Make sure to skip attributes that do not have a value in the observation. 

In [30]:
def nb_class(dict_priors, dict_nb_model, observation):
    dictProbs = dict()
    for types, typeColValues in dict_nb_model.items():
        for i in range(len(typeColValues)):
            mean, stdev = typeColValues[i]
            dictProbs[types] = dict_priors[types] * get_p(mean, stdev, observation[i])
    value = max(dictProbs, key=dictProbs.get)
    return value

## Q2.7 Calculate the accuracy using Naive Bayes classifier function on the test set
Load the test set from file, classify each row using your `nb_class`, and then show the accuracy. 

In [31]:
df_test = pd.read_csv('melb_data_test.csv')
result = df_test['Type']
df_test = df_test.drop('Type', axis=1) 
df_test = df_test.drop('Date', axis=1) 

In [32]:
correct = 0
calc = 0;
for index, row in df_test.iterrows():
    if (row.isnull().values.any() == False) :
        pred = nb_class(dict_priors, dict_nb_model, row)
        calc += 1
        if (pred == result[index]):
            correct += 1

In [34]:
# Show the accuracy 
acc = correct/calc
print(acc)

0.4230769230769231
