# Machine Learning Homework 2

### Submission By:  
<ul>
    <li>Akshay Prakash Tambe (apt321@nyu.edu)</li>
    <li>Snahil Singh (ss11381@nyu.edu)</li>
</ul>

Run the command below in a notebook cell if you need to install any packages while running this notebook:
```
import sys
!{sys.executable} -m pip install <package-name>
```

# Part II: Programming Exercise

### Question 1 (a):   
##### Implement a version of k-Nearest Neighbor to classify the test examples, using the training examples.

In [1]:
import pandas as pd
import numpy as np

def load_data():
    # Load the training file with pandas read_csv (tab delimiter, no header row)
    reviews_train_data = pd.read_csv("reviewstrain.txt", delimiter="\t", header=None)

    # Split each line at the first space into "label" and "reviews_text"
    reviews_train_data = pd.DataFrame(reviews_train_data[0].str.split(' ', n=1).tolist(), columns = ['label','reviews_text'])

    # Load the test file with pandas read_csv (tab delimiter, no header row)
    reviews_test_data = pd.read_csv("reviewstest.txt", delimiter="\t", header=None)

    # Split each line at the first space into "label" and "reviews_text"
    reviews_test_data = pd.DataFrame(reviews_test_data[0].str.split(' ', n=1).tolist(), columns = ['label','reviews_text'])
    return reviews_train_data, reviews_test_data

In [2]:
reviews_train_data, reviews_test_data = load_data()
print(reviews_train_data)

     label                                       reviews_text
0        1  a stirring , funny and finally transporting re...
1        1  a real winner -- smart , funny , subtle , and ...
2        0  a dim-witted and lazy spin-off of the animal p...
3        1  ` anyone with a passion for cinema , and indee...
4        0  a crass and insulting homage to great films li...
5        0                              an infuriating film .
6        0  stealing harvard will dip into your wallet , s...
7        1  a poignant lyricism runs through balzac and th...
8        0  it 's always disappointing when a documentary ...
9        1  there are scenes of cinematic perfection that ...
10       0  what ensues are much blood-splattering , mass ...
11       1                             an impressive hybrid .
12       0  the script 's judgment and sense of weight is ...
13       0  it ends up being neither , and fails at both e...
14       1                    it is definitely worth seeing .
15      

In [3]:
# Check whether the training set is balanced --> roughly balanced (811 positive vs. 689 negative)
print(reviews_train_data['label'].value_counts())

1    811
0    689
Name: label, dtype: int64


In [4]:
print(reviews_test_data)

    label                                       reviews_text
0       1  funny , sexy , devastating and incurably roman...
1       1       cool gadgets and creatures keep this fresh .
2       1  fathers and sons , and the uneasy bonds betwee...
3       0  after that it becomes long and tedious like a ...
4       0  what should have been a cutting hollywood sati...
5       0  nothing about the film -- with the possible ex...
6       1  the overall result is an intelligent , realist...
7       1                 `` red dragon '' is entertaining .
8       0  despite all the closed-door hanky-panky , the ...
9       1  a genuinely funny ensemble comedy that also as...
10      1  the film is moody , oozing , chilling and hear...
11      0  it appears to have been made by people to whom...
12      1  as the movie traces mr. brown 's athletic expl...
13      1  the art direction is often exquisite , and the...
14      1  by turns touching , raucously amusing , uncomf...
15      1  this chicago 

In [5]:
# Distance function: inter(x, y) = 1 / (number of distinct tokens appearing in both x and y).
# Returns NaN when no token is shared; sort_values places NaN last, so such rows count as farthest.
import numpy as np
def inter(x, y):
    common_tokens = set(y).intersection(x)
    if len(common_tokens) == 0:
        distance = np.nan
    else:
        distance = 1/len(common_tokens)
    return distance
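
As a quick illustration of this distance, here is a minimal sketch on made-up token lists (the three example reviews are hypothetical, not rows from the dataset). Two reviews sharing three distinct tokens end up at distance 1/3, and a pair with no shared token gets `nan`.

```python
import numpy as np

def inter(x, y):
    # Same rule as above: 1 / (number of distinct shared tokens), nan if none
    common_tokens = set(y).intersection(x)
    return 1 / len(common_tokens) if common_tokens else np.nan

# Hypothetical token lists, for illustration only
review_a = "a funny and smart film .".split()
review_b = "a smart , funny thriller".split()
review_c = "dull".split()

d_ab = inter(review_a, review_b)   # shared tokens: 'a', 'funny', 'smart'
d_ac = inter(review_a, review_c)   # no shared tokens
print(d_ab, d_ac)
```

Note that common tokens such as `a` or `.` count just as much as sentiment-bearing words under this rule.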

In [6]:
# Get K Nearest Neighbours Function
def get_k_neighbours(k, dataframe):
    # Sort the training examples by distance from the test example, smallest to largest
    dataframe = dataframe.sort_values(by=['distance'])
    
    # If there are other neighbors at the same distance as the k-th one on this list, include them also
    top_k = dataframe.head(k)
    top_kth_distance = dataframe[(k-1):].distance.values[0]
    dataframe = dataframe.drop(dataframe.index[0:k])
    dataframe = dataframe[dataframe['distance']==top_kth_distance]
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    top_k = pd.concat([top_k, dataframe])
    
    return top_k
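
The tie rule can be checked on a small example. The sketch below uses a hypothetical helper name (`k_nearest_with_ties`) and made-up distances. It keeps the first k rows after sorting and then adds any remaining rows whose distance equals the k-th one.

```python
import pandas as pd

def k_nearest_with_ties(k, frame):
    # Sort by distance, keep the first k rows, then add rows tied with the k-th one
    frame = frame.sort_values(by="distance")
    kth_distance = frame["distance"].iloc[k - 1]
    rest = frame.iloc[k:]
    ties = rest[rest["distance"] == kth_distance]
    return pd.concat([frame.head(k), ties])

# Hypothetical distances, for illustration only
toy = pd.DataFrame({
    "distance": [0.5, 0.25, 0.5, 1.0, 0.5],
    "label": ["1", "0", "1", "0", "0"],
})
neighbours = k_nearest_with_ties(2, toy)
print(neighbours)
```

With k = 2, four rows come back: the row at 0.25 plus all three rows at 0.5, since they tie with the second-nearest neighbour.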

In [7]:
# Get Prediction Labels for given test dataset
def get_predictions(k,reviews_test_data, reviews_train_data):
    #i = 0 #Counter for debugging
    predictions = []
    # Reading each test row
    for test_row in reviews_test_data.itertuples():
        test_row_tokens = test_row.reviews_text.split()
        distance = []
        # Comparing each test row with each training row
        for train_row in reviews_train_data.itertuples():
            train_text = train_row.reviews_text
            train_row_tokens = train_row.reviews_text.split()
            # Calculating distance of test row from each training row
            distance.append(inter(train_row_tokens, test_row_tokens))
        # Dataframe with distances stored for each test row
        string_match={
            'k_nearest_train_text': reviews_train_data.reviews_text,
            'distance': distance,
            'label': reviews_train_data.label
        }
        string_match = pd.DataFrame(string_match)
        # Get K Nearest Neighbours for each test row
        string_match = get_k_neighbours(k,string_match)
        # Prediction Label for each test row
        try:
            predicted_num0 = string_match['label'].value_counts()['0']
        except KeyError:
            predicted_num0 = 0
        try:
            predicted_num1 = string_match['label'].value_counts()['1']
        except KeyError:
            predicted_num1 = 0
        if predicted_num0 > predicted_num1:
            predictions.append(0)
        else:
            predictions.append(1)
        # Deleting Dataframe to optimize memory
        del string_match
        #i = i + 1 #Counter for debugging
    
    # Final Prediction Dataframe
    predicted_test_data = {
        'testdata_reviews_text': reviews_test_data.reviews_text,
        'actual_label': reviews_test_data.label,
        'predicted_label': predictions
    }
    predicted_test_data = pd.DataFrame(predicted_test_data)

    return predicted_test_data

In [8]:
# Get Accuracy, TP/TN/FP/FN of the Model
def get_model_metrics(predicted_test_data):
    # Initializing variables
    correct_pred = 0
    total_pred = 0
    true_positive = 0
    true_negative = 0
    false_positive = 0
    false_negative = 0
    
    for row in predicted_test_data.itertuples():
        total_pred = total_pred + 1
        # Correct Predictions
        if int(row.actual_label) == int(row.predicted_label):
            correct_pred = correct_pred + 1
            # True Positive
            if (int(row.actual_label) == 1 and int(row.predicted_label) == 1):
                true_positive = true_positive + 1
            # True Negative
            elif  (int(row.actual_label) == 0 and int(row.predicted_label) == 0):   
                    true_negative = true_negative + 1
        # False Positive
        elif  (int(row.actual_label) == 0 and int(row.predicted_label) == 1):   
                false_positive = false_positive + 1
        # False Negative
        elif  (int(row.actual_label) == 1 and int(row.predicted_label) == 0):   
                false_negative = false_negative + 1
    # Accuracy
    accuracy = (correct_pred/total_pred)*100
    tpr = true_positive/(true_positive + false_negative)
    fpr = false_positive/(false_positive + true_negative)
    
    # Model Metrics Dataframe
    model_metrics = {
        'total_predictions': total_pred, 
        'correct_predictions': correct_pred, 
        'accuracy': accuracy, 
        'true_positive': true_positive, 
        'false_positive': false_positive, 
        'true_negative': true_negative, 
        'false_negative': false_negative,
        'tpr': tpr,
        'fpr': fpr
    }
    
    return model_metrics
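
To make the metric definitions concrete, the sketch below recomputes them on a tiny made-up label list (the helper name `confusion_counts` and the labels are assumptions for illustration). Label 1 is treated as the positive class, as in the function above.

```python
def confusion_counts(actual, predicted):
    # Count TP, FN, FP, TN for binary labels where 1 is the positive class
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy = 100 * (tp + tn) / len(actual)   # percentage of correct predictions
tpr = tp / (tp + fn)                       # true positive rate
fpr = fp / (fp + tn)                       # false positive rate
print(tp, fn, fp, tn, accuracy, tpr, fpr)
```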

## Running KNN Classifier for K=1 for given training and testing dataset

In [9]:
k = 1
predicted_test_data = get_predictions(k, reviews_test_data, reviews_train_data)
model_metrics = get_model_metrics(predicted_test_data)

### Question i) : 
For k = 1, what is the predicted label for the following example in the test set: "It leaves little doubt that Kidman has become one of our best actors ." (This is line 18 of the test file.)

In [10]:
predicted_label = predicted_test_data['predicted_label'][predicted_test_data['testdata_reviews_text'] == \
        "it leaves little doubt that kidman has become one of our best actors ."].values[0]
print("Answer 1.a.i) Predicted label for the following example in the test set is: ", predicted_label)

Answer 1.a.i) Predicted label for the following example in the test set is:  1


### Question ii)
What is the confusion matrix (on the test set) for k = 1?

In [11]:
print("Answer 1.a.ii)")
print("True Positive = " + str(model_metrics['true_positive']) \
      + "\tFalse Negative = " + str(model_metrics['false_negative']))
print("False Positive = " + str(model_metrics['false_positive']) \
      + "\tTrue Negative = " + str(model_metrics['true_negative']))

Answer 1.a.ii)
True Positive = 209	False Negative = 64
False Positive = 134	True Negative = 93


### Question iii)
Report the accuracy, the true positive rate, and the false positive rate, on the test set for k = 1.

In [12]:
print("Answer 1.a.iii)")
accuracy = model_metrics['accuracy']
tpr = model_metrics['tpr']
fpr = model_metrics['fpr']
print("Accuracy = ", accuracy)
print("True Positive Rate = ", tpr)
print("False Positive Rate = ", fpr)

Answer 1.a.iii)
Accuracy =  60.4
True Positive Rate =  0.7655677655677655
False Positive Rate =  0.5903083700440529


## Running KNN Classifier for K=5 for given training and testing dataset

In [13]:
k = 5
predicted_test_data = get_predictions(k, reviews_test_data, reviews_train_data)
model_metrics = get_model_metrics(predicted_test_data)

### Question iv)
For k = 5, what is the predicted label for the following example in the test set: "It leaves little doubt that Kidman has become one of our best actors ." (This is line 18 of the test file.)

In [14]:
predicted_label = predicted_test_data['predicted_label'][predicted_test_data['testdata_reviews_text'] == \
        "it leaves little doubt that kidman has become one of our best actors ."].values[0]
print("Answer 1.a.iv) Predicted label for the following example in the test set is: ", predicted_label)

Answer 1.a.iv) Predicted label for the following example in the test set is:  1


### Question v)
What is the confusion matrix (on the test set) for k = 5?

In [15]:
print("Answer 1.a.v)")
print("True Positive = " + str(model_metrics['true_positive']) \
      + "\tFalse Negative = " + str(model_metrics['false_negative']))
print("False Positive = " + str(model_metrics['false_positive']) \
      + "\tTrue Negative = " + str(model_metrics['true_negative']))

Answer 1.a.v)
True Positive = 212	False Negative = 61
False Positive = 136	True Negative = 91


### Question vi)
Report the accuracy, the true positive rate, and the false positive rate, on the test set for k = 5.

In [16]:
print("Answer 1.a.vi)")
accuracy = model_metrics['accuracy']
tpr = model_metrics['true_positive']/(model_metrics['true_positive'] + model_metrics['false_negative'])
fpr = model_metrics['false_positive']/(model_metrics['false_positive'] + model_metrics['true_negative'])
print("Accuracy = ", accuracy)
print("True Positive Rate = ", tpr)
print("False Positive Rate = ", fpr)

Answer 1.a.vi)
Accuracy =  60.6
True Positive Rate =  0.7765567765567766
False Positive Rate =  0.5991189427312775


### Question vii)
What is the accuracy on the test set for k = 5?

In [17]:
print("Accuracy = ", accuracy)

Accuracy =  60.6


### Question viii)
Suppose we used the very simple Zero-R classifier on this dataset, rather than k-NN. That is, we classify all examples in the test set as belonging to the class that is more common in the training set. What is the resulting confusion matrix (on the test set)?

## Zero-R Classifier

In [18]:
# Zero-R Classifier
def zero_r_classifier(reviews_test_data, reviews_train_data):
    # Majority label in the training set is the Zero-R label.
    # idxmax() on the value counts returns the most frequent label itself.
    zero_r_label = int(reviews_train_data['label'].value_counts().idxmax())
    
    # Adding Zero-R Predicted Label in Test Dataset
    predicted_test_data = {
        'testdata_reviews_text': reviews_test_data.reviews_text,
        'actual_label': reviews_test_data.label,
        'predicted_label': zero_r_label
    }
    predicted_test_data = pd.DataFrame(predicted_test_data)
    return predicted_test_data

In [19]:
# Running Zero R Classifier on given datasets
predicted_test_data = zero_r_classifier(reviews_test_data, reviews_train_data)
model_metrics = get_model_metrics(predicted_test_data)
print("Answer 1.a.viii)")
print("True Positive = " + str(model_metrics['true_positive']) \
      + "\tFalse Negative = " + str(model_metrics['false_negative']))
print("False Positive = " + str(model_metrics['false_positive']) \
      + "\tTrue Negative = " + str(model_metrics['true_negative']))

Answer 1.a.viii)
True Positive = 273	False Negative = 0
False Positive = 227	True Negative = 0
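
As a small sketch of the Zero-R idea, the snippet below picks the majority label from a made-up label column (an assumption for illustration; the labels are strings, as in the `label` column above). It then derives the accuracy implied by the confusion matrix printed above.

```python
import pandas as pd

# Hypothetical training labels, stored as strings like the 'label' column
train_labels = pd.Series(["1", "0", "1", "1", "0"])

# The most frequent label is the Zero-R prediction for every test example
majority_label = int(train_labels.value_counts().idxmax())

# Accuracy implied by the Zero-R confusion matrix reported above
tp, fn, fp, tn = 273, 0, 227, 0
zero_r_accuracy = 100 * (tp + tn) / (tp + fn + fp + tn)
print(majority_label, zero_r_accuracy)
```

Since every test example is predicted positive, the accuracy is simply the share of positive examples in the test set, 273 out of 500.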


## Question 1 (c):
Implement 5-fold cross-validation on the training set to determine which of the following values of k works better in k-NN: 3, 7, 99. (When there are more than k possible nearest neighbors, because of multiple points at the same distance from the test set, handle this analogously to how you handled it in part 1.)

In [20]:
k_list = [3, 7, 99]
k_df = {}
for k in k_list:
    prediction = pd.DataFrame(columns=['testdata_reviews_text','actual_label','predicted_label'])
    for i in range(5):
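        #print('Cross Validation Fold Split: ', i)
        # Fold i (300 consecutive rows) is held out for validation; the rest is used for training
        reviews_test_data1 = reviews_train_data[(i*300):(i*300)+300]
        reviews_train_data1 = reviews_train_data.drop(reviews_train_data.index[(i*300):(i*300)+300])
        predicted_test_data = get_predictions(k,reviews_test_data1,reviews_train_data1)
        # pd.concat is used because DataFrame.append was removed in pandas 2.0
        prediction = pd.concat([prediction, predicted_test_data])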
    correct_pred = 0
    total_pred = 0
    for row in prediction.itertuples():
        total_pred = total_pred + 1
        if int(row.actual_label) == int(row.predicted_label):
            correct_pred =correct_pred + 1
    print("For K = ", k)
    #print('Correct Predictions = ', correct_pred)
    #print('Total Predictions = ', total_pred)
    accuracy = (correct_pred/total_pred)*100
    print('Accuracy = ', accuracy)
    print("----------------------------------------------------------------------")
    k_df.update({k : accuracy})

For K =  3
Accuracy =  66.06666666666666
----------------------------------------------------------------------
For K =  7
Accuracy =  65.86666666666666
----------------------------------------------------------------------
For K =  99
Accuracy =  61.199999999999996
----------------------------------------------------------------------
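
The fold construction used above can be sketched on plain indices. The helper name `contiguous_folds` is hypothetical, and the sketch assumes the number of rows is divisible by the number of folds, as it is for 1500 rows and 5 folds.

```python
def contiguous_folds(n_rows, n_folds):
    # Yield (validation_indices, training_indices) for equally sized contiguous folds
    fold_size = n_rows // n_folds
    for i in range(n_folds):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation = list(range(start, stop))
        training = list(range(0, start)) + list(range(stop, n_rows))
        yield validation, training

folds = list(contiguous_folds(1500, 5))
for validation, training in folds:
    print(validation[0], validation[-1], len(training))
```

Each row is used for validation exactly once, so the cross-validation accuracy is the share of correct predictions over all 1500 training rows.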


### Question i)
For each of the 3 values of k, what is the cross-validation accuracy?

In [21]:
print("Answer 1.b.i)")
print("Cross Validation Accuracies")
print(k_df)

Answer 1.b.i)
Cross Validation Accuracies
{3: 66.06666666666666, 7: 65.86666666666666, 99: 61.199999999999996}


### Question ii)
Take the k that had the highest cross-validation accuracy. Run k-NN on the entire training set for this value of k, and then test on the test set. Give the confusion matrix and the accuracy (for the test set).

In [22]:
# Taking Highest K
max_key = max(k_df, key=lambda k: k_df[k])
print("Highest K = ", max_key)

Highest K =  3


In [23]:
# Run k-NN on the entire training set for this value of highest k, and then test on the test set
k = max_key
predicted_test_data = get_predictions(k, reviews_test_data, reviews_train_data)
model_metrics = get_model_metrics(predicted_test_data)

In [24]:
print("Answer 1.b.ii)")

# Confusion Matrix
print("True Positive = " + str(model_metrics['true_positive']) \
      + "\tFalse Negative = " + str(model_metrics['false_negative']))
print("False Positive = " + str(model_metrics['false_positive']) \
      + "\tTrue Negative = " + str(model_metrics['true_negative']))

# Accuracy
accuracy = model_metrics['accuracy']
print("Accuracy = ", accuracy)

Answer 1.b.ii)
True Positive = 212	False Negative = 61
False Positive = 144	True Negative = 83
Accuracy =  59.0


## Question 1 (d):
Experiment with using a different distance function

## 1. Refined Intersection Distance

In [25]:
from nltk.corpus import stopwords
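from nltk.stem.porter import PorterStemmer
def refine_tokens_inter_distance(dataframe):
    # Remove special characters, numbers and punctuation (regex=True keeps the pattern a regular expression on newer pandas)
    dataframe['reviews_text'] = dataframe['reviews_text'].str.replace("[^a-zA-Z#]", " ", regex=True)
    
    # Remove short words (length 3 or less)
    dataframe['reviews_text'] = dataframe['reviews_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
    
    # Stemming
    # Note: apply() passes each whole cell (a label or a full review string) to the stemmer,
    # so each review is handled as one string rather than stemmed token by token
    stemmer = PorterStemmer()
    dataframe = dataframe.apply(lambda x: [stemmer.stem(i) for i in x]) 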
    
    # Removing of Stop Words
    stop = stopwords.words('english')
    dataframe['reviews_text'] = dataframe['reviews_text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    
    return dataframe

## 2. Jaccard Similarity Distance

In [26]:
def jaccard_similarity_distance(x, y):
    # Jaccard distance between two sets: 1 - (size of intersection) / (size of union)
    z = x.intersection(y)
    return 1 - float(len(z)) / (len(x) + len(y) - len(z))

In [27]:
# Get Prediction Labels for given test dataset using Jaccard Similarity Distance
def get_predictions_jaccard(k,reviews_test_data, reviews_train_data):
    #i = 0 #Counter for debugging
    predictions = []
    # Reading each test row
    for test_row in reviews_test_data.itertuples():
        test_row_tokens = test_row.reviews_text.split()
        distance = []
        # Comparing each test row with each training row
        for train_row in reviews_train_data.itertuples():
            train_text = train_row.reviews_text
            train_row_tokens = train_row.reviews_text.split()
            # Calculating distance of test row from each training row using jaccard
            distance.append(jaccard_similarity_distance(set(train_row_tokens), set(test_row_tokens)))
        # Dataframe with distances stored for each test row
        string_match={
            'k_nearest_train_text': reviews_train_data.reviews_text,
            'distance': distance,
            'label': reviews_train_data.label
        }
        string_match = pd.DataFrame(string_match)
        # Get K Nearest Neighbours for each test row
        string_match = get_k_neighbours(k,string_match)
        # Prediction Label for each test row
        try:
            predicted_num0 = string_match['label'].value_counts()['0']
        except KeyError:
            predicted_num0 = 0
        try:
            predicted_num1 = string_match['label'].value_counts()['1']
        except KeyError:
            predicted_num1 = 0
        if predicted_num0 > predicted_num1:
            predictions.append(0)
        else:
            predictions.append(1)
        # Deleting Dataframe to optimize memory
        del string_match
        #i = i + 1 #Counter for debugging
    
    # Final Prediction Dataframe
    predicted_test_data = {
        'testdata_reviews_text': reviews_test_data.reviews_text,
        'actual_label': reviews_test_data.label,
        'predicted_label': predictions
    }
    predicted_test_data = pd.DataFrame(predicted_test_data)

    return predicted_test_data

In [28]:
def get_predictions_choice(function, k, reviews_test_data, reviews_train_data):
    if function == '2':
        print("Using Refined Intersection Distance...")
        refined_reviews_test_data = refine_tokens_inter_distance(reviews_test_data)
        refined_reviews_train_data = refine_tokens_inter_distance(reviews_train_data)
        predicted_test_data = get_predictions(k, refined_reviews_test_data, refined_reviews_train_data)
    elif function == '3':
        print("Using Jaccard Distance...")
        predicted_test_data = get_predictions_jaccard(k, reviews_test_data, reviews_train_data)
    else:
        print("Using Default Distance...")
        predicted_test_data = get_predictions(k, reviews_test_data, reviews_train_data)
    return predicted_test_data

In [39]:
# Take User Input
import sys
if sys.version_info[0] >= 3:
    raw_input = input
user_input = raw_input("1: Default Distance\n2: Refined Intersection Distance\n3: Jaccard Distance\n")
user_input

1: Default Distance
2: Refined Intersection Distance
3: Jaccard Distance
2


'2'

In [40]:
import pandas as pd
import numpy as np

reviews_train_data, reviews_test_data = load_data()

k = 1
predicted_test_data = get_predictions_choice(user_input, k, reviews_test_data, reviews_train_data)
model_metrics = get_model_metrics(predicted_test_data)
model_metrics

Using Refined Intersection Distance...


{'accuracy': 72.8,
 'correct_predictions': 364,
 'false_negative': 44,
 'false_positive': 92,
 'fpr': 0.4052863436123348,
 'total_predictions': 500,
 'tpr': 0.8388278388278388,
 'true_negative': 135,
 'true_positive': 229}

## Running for all Distances for Comparison

In [41]:
input_list = ['1', '2', '3']
k_list = [1, 5]
for distance_function in input_list:
    for k in k_list:
        print("For User Input = "+str(distance_function))
        print("For K = "+str(k))
        reviews_train_data, reviews_test_data = load_data()
        predicted_test_data = get_predictions_choice(distance_function, k, reviews_test_data, reviews_train_data)
        model_metrics = get_model_metrics(predicted_test_data)
        print(model_metrics)
        print("------------------------------------------------------------------------------------------------------")

For User Input = 1
For K = 1
Using Default Distance...
{'total_predictions': 500, 'correct_predictions': 302, 'accuracy': 60.4, 'true_positive': 209, 'false_positive': 134, 'true_negative': 93, 'false_negative': 64, 'tpr': 0.7655677655677655, 'fpr': 0.5903083700440529}
------------------------------------------------------------------------------------------------------
For User Input = 1
For K = 5
Using Default Distance...
{'total_predictions': 500, 'correct_predictions': 303, 'accuracy': 60.6, 'true_positive': 212, 'false_positive': 136, 'true_negative': 91, 'false_negative': 61, 'tpr': 0.7765567765567766, 'fpr': 0.5991189427312775}
------------------------------------------------------------------------------------------------------
For User Input = 2
For K = 1
Using Refined Intersection Distance...
{'total_predictions': 500, 'correct_predictions': 364, 'accuracy': 72.8, 'true_positive': 229, 'false_positive': 92, 'true_negative': 135, 'false_negative': 44, 'tpr': 0.8388278388278388

### Question i):
Describe your distance function. How is the distance between two comments computed? Include an example in your explanation.  

We tried two distance functions, both of which achieved higher accuracy than our default distance function.  
### 1. Refined Intersection Distance  
This distance function first cleans the review text by removing tokens that are redundant for the comparison, and then calculates the distance with the same formula as before. With this method, some pairs of reviews get different distance scores, which improves our accuracy.
The function applies the following steps before the distance is computed:
<ul>
    <li>**Remove special characters, numbers, punctuations:** Punctuation, numbers and special characters do not help in differentiating between kinds of reviews, so we remove them from the text.</li>
    <li>**Remove Short Words:** Most short words add little value, for example 'pdx', 'his', 'all', 'the', 'her', “hmm” and “oh”. We chose the length threshold with some care and decided to remove all words of length 3 or less.</li>
    <li>**Stemming:** Stemming is a rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word. For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”.</li>
    <li>**Removing of Stop Words:** Removing stop words like 'a' and 'the' helps to calculate similarity distances more effectively.</li>   
</ul>
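
The cleaning steps can be sketched on a single made-up review. This is a simplified illustration, not the notebook's function: the stop-word list is a tiny hypothetical one (the notebook uses NLTK's English list) and stemming is left out so that the sketch needs only the standard library.

```python
import re

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"this", "that", "with", "have", "from"}

def refine_tokens(text):
    # Replace anything that is not a letter or '#' with a space
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Keep only words longer than three characters
    tokens = [w for w in text.split() if len(w) > 3]
    # Drop stop words
    return [w for w in tokens if w not in STOP_WORDS]

# Hypothetical review, for illustration only
tokens = refine_tokens("this film is 100 % funny , with a smart script .")
print(tokens)
```

Only `film`, `funny`, `smart` and `script` survive, so the intersection distance is then computed on these tokens alone.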

### 2. Jaccard Similarity Distance
The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It’s a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two sets.

The formula to find the Index is:

```
Jaccard Index = (the number in both sets) / (the number in either set) * 100
```

A simple example using set notation: How similar are these two sets?
```
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = |A∩B| / |A∪B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 = 0.33.
```

We have used Jaccard distance which is a measure of how dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard Index from 100%.
```
For the above example, the Jaccard distance is 1 – 33.33% = 66.67%.
```

In set notation, subtract from 1 for the Jaccard Distance:
```
D(X,Y) = 1 – J(X,Y)
```
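
The same computation can be sketched in Python, first for the sets A and B above and then for two made-up reviews (the review texts and the function name `jaccard_distance` are assumptions for illustration).

```python
def jaccard_distance(x, y):
    # 1 - (size of intersection) / (size of union) for two non-empty sets
    return 1 - len(x & y) / len(x | y)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
d_sets = jaccard_distance(A, B)        # 1 - 3/9

# Hypothetical reviews, for illustration only
r1 = set("a smart and funny film".split())
r2 = set("a funny but shallow film".split())
d_reviews = jaccard_distance(r1, r2)   # 3 shared tokens, 7 tokens in the union
print(d_sets, d_reviews)
```

Unlike the intersection distance, this measure also penalizes tokens that appear in only one of the two reviews, so long reviews do not look close merely because they contain many words.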




### Question ii)
Why did you think that your distance function would do better than the first one?

In the first distance function, we take the complete set of tokens from each text and then calculate the distance. This means that redundant tokens, which are not necessary for comparing texts in sentiment analysis, are also counted.  

In our distance functions, especially "Refined Intersection Distance", we remove the redundant tokens, so that only relevant tokens are used for comparing reviews and calculating the distance.

### Question iii)
What is the confusion matrix for k = 1?  
(Calculations done under **"Running for all Distances for Comparison"** Section)  
**Using Refined Intersection Distance:**    
True Positive = 229     False Negative = 44  
False Positive = 92	    True Negative = 135  

**Using Jaccard Distance:**  
True Positive = 206     False Negative = 67  
False Positive = 112	True Negative = 115  

### Question iv)
Report the accuracy, the true positive rate, and the false positive rate, on the test set for k = 1?  
(Calculations done under **"Running for all Distances for Comparison"** Section)

**Using Refined Intersection Distance:**
Accuracy = 72.8  
True Positive Rate = 0.8388278388278388  
False Positive Rate = 0.4052863436123348  

**Using Jaccard Distance:**
Accuracy = 64.2  
True Positive Rate = 0.7545787545787546   
False Positive Rate =  0.4933920704845815

### Question v)
What is the confusion matrix for k = 5?

(Calculations done under **"Running for all Distances for Comparison"** Section)  
**Using Refined Intersection Distance:**    
True Positive = 236     False Negative = 37  
False Positive = 93     True Negative = 134  

**Using Jaccard Distance:**  
True Positive = 220     False Negative = 53  
False Positive = 113	True Negative = 114  

### Question vi)
Report the accuracy, the true positive rate, and the false positive rate, on the test set for k = 5?  
(Calculations done under **"Running for all Distances for Comparison"** Section)

**Using Refined Intersection Distance:**
Accuracy = 74.0  
True Positive Rate = 0.8644688644688645  
False Positive Rate = 0.40969162995594716 

**Using Jaccard Distance:**
Accuracy = 66.8   
True Positive Rate = 0.8058608058608059   
False Positive Rate =  0.4977973568281938

### Question vii)
Did your distance function achieve higher accuracy (for k = 1 and k = 5) than the first distance function?  
Yes. As the comparison in the **"Running for all Distances for Comparison"** section shows, both of our distance functions achieve higher accuracy than the first distance function for k = 1 and k = 5:  

**Using Default Distance:**
<ul>
    <li>For K = 1, Accuracy = 60.4</li> 
    <li>For K = 5, Accuracy = 60.6</li>
</ul>

**Using Refined Intersection Distance:**
<ul>
    <li>For K = 1, Accuracy = 72.8</li> 
    <li>For K = 5, Accuracy = 74.0</li>
</ul>

**Using Jaccard Distance:**
<ul>
    <li>For K = 1, Accuracy = 64.2</li> 
    <li>For K = 5, Accuracy = 66.8</li>
</ul>