# Worksheet 12

Name:  Esther Choi
UID: U24585295

### Topics

- Introduction to Classification
- K Nearest Neighbors

### Introduction to Classification

a) For the following examples, say whether they are or aren't an example of classification.

1. Predicting whether a student will be offered a job after graduating given their GPA.
2. Predicting how long it will take (in number of months) for a student to be offered a job after graduating, given their GPA.
3. Predicting the number of stars (1-5) a person will assign in their Yelp review given the description they wrote in the review.
4. Predicting the number of births occurring in a specified minute.

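1. Classification (the outcome is one of two categories: offered a job or not)
2. Not classification (regression: the target is a number of months)
3. Classification (the target is one of five discrete star ratings)
4. Not classification (regression: the target is a count)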

b) Given a dataset, how would you set things up such that you can both learn a model and get an idea of how this model might perform on data it has never seen?

Split the data into a training set and a testing set. Fit the model on the training set only, choosing a combination of features that captures the overall trend of the dataset without underfitting or overfitting.
Then evaluate it on the testing set, which the model has never seen. A model that generalizes well performs comparably on both sets: it is neither so specific that it only fits the training set nor so simple that it misses the trend.
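
A minimal sketch of such a split, assuming a random 75/25 partition of the point indices. The helper name `split_indices`, the 25% test fraction, and the fixed seed are illustrative choices, not part of the worksheet.

```python
import random

def split_indices(n_points, test_fraction=0.25, seed=0):
    """Shuffle the indices 0..n_points-1 and split them into (train, test) lists."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_test = max(1, round(n_points * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(16)
print(len(train_idx), len(test_idx))  # 12 4
```

The model is then fit using only the rows in `train_idx`, and its accuracy is reported on the rows in `test_idx`.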

c) In your own words, briefly explain:

- underfitting
- overfitting 

and what signs to look out for for each.

- underfitting: the model is too simple to capture the pattern in the data, so it over-generalizes. Sign to look for: accuracy is low on both the training set and the testing set.
- overfitting: the model learns details that are too specific to the data we collected (including any noise or sampling bias in it), so it does not generalize. Sign to look for: accuracy is high on the training set but noticeably lower on the testing set.

For example, if the training accuracy keeps improving as the model gets more complex while the testing accuracy starts to drop, then you know that the model is becoming too specific to the training set.
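
One way to see this gap concretely is a sketch with a 1-nearest-neighbor classifier (the method covered in the next section), run on this worksheet's data without the (10, 10) point. The helper `predict_1nn` is a hypothetical name introduced here. Because every training point is its own nearest neighbor, the training accuracy is perfect, while the accuracy on points the model has not seen is lower.

```python
import math

# Worksheet data from the K Nearest Neighbors section, without the (10, 10) point.
points = [(3.5, 4), (0, 1.5), (1, 2), (2.5, 1), (2, 3.5), (1.5, 2.5), (2, 1), (3.5, 0),
          (1, 3), (3, 1.5), (2, 4), (2, 2), (2.5, 2.5), (0.5, 0.5), (0, 2.5)]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1]

def predict_1nn(query, train_points, train_labels):
    """Return the label of the training point closest to query."""
    nearest = min(range(len(train_points)),
                  key=lambda j: math.dist(query, train_points[j]))
    return train_labels[nearest]

# Training accuracy: each point finds itself at distance 0, so the model just memorizes.
train_hits = sum(predict_1nn(p, points, labels) == y for p, y in zip(points, labels))
train_acc = train_hits / len(points)

# Held-out accuracy: classify each point using only the other points.
held_out_hits = 0
for i, (p, y) in enumerate(zip(points, labels)):
    rest_points = points[:i] + points[i + 1:]
    rest_labels = labels[:i] + labels[i + 1:]
    held_out_hits += predict_1nn(p, rest_points, rest_labels) == y
held_out_acc = held_out_hits / len(points)

print(f"training accuracy = {train_acc:.2f}, held-out accuracy = {held_out_acc:.2f}")
```

A training accuracy of 1.0 paired with a lower held-out accuracy is the overfitting signature described above.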

### K Nearest Neighbors

In [None]:
import numpy as np
import matplotlib.pyplot as plt

data = {
    "Attribute A" : [3.5, 0, 1, 2.5, 2, 1.5, 2, 3.5, 1, 3, 2, 2, 2.5, 0.5, 0., 10],
    "Attribute B" : [4, 1.5, 2, 1, 3.5, 2.5, 1, 0, 3, 1.5, 4, 2, 2.5, 0.5, 2.5, 10],
    "Class" : [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
}

a) Plot the data in a 2D plot coloring each scatter point one of two colors depending on its corresponding class.

In [None]:
colors = np.array([x for x in 'bgrcmyk'])
x_coordinates, y_coordinates = data["Attribute A"], data["Attribute B"]
color_coordinates = colors[data["Class"]].tolist()
plt.scatter(x_coordinates, y_coordinates, color=color_coordinates)
plt.show()

Outliers are points that lie far from the rest of the data. They are not necessarily invalid points, however. Imagine sampling from a Normal distribution with mean 10 and variance 1. You would expect most points you sample to be in the range [7, 13], but it's entirely possible to see 20, which, on average, should be very far from the rest of the points in the sample (unless we're VERY (un)lucky). These outliers can inhibit our ability to learn general patterns in the data since they are not representative of likely outcomes. They can still be useful in and of themselves and can be analyzed in great depth depending on the problem at hand.
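
To put rough numbers on the Normal example, here is a small sketch using only the standard library. It computes the probability of landing in [7, 13] and the probability of seeing a value of 20 or more; the variable names are illustrative.

```python
import math
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=1)  # variance 1 means standard deviation 1

# Probability that a single sample lands in [7, 13], i.e. within 3 standard deviations.
p_within = dist.cdf(13) - dist.cdf(7)

# Probability of a sample of 20 or more (10 standard deviations above the mean).
# The complementary error function avoids the result rounding to zero.
p_far = 0.5 * math.erfc((20 - 10) / (1 * math.sqrt(2)))

print(f"P(7 <= X <= 13) = {p_within:.4f}")
print(f"P(X >= 20)      = {p_far:.1e}")
```

About 99.7% of samples fall inside [7, 13], while a value like 20 is possible in principle but extremely rare.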

b) Are there any points in the dataset that could be outliers? If so, please remove them from the dataset.

Yes, the point at (10, 10) lies far from all the other points. The dataset with that point removed is plotted below:

In [None]:
# the outlier (10, 10) is the last entry, so drop the last element of each list
x_no_outlier = data["Attribute A"][:-1]
y_no_outlier = data["Attribute B"][:-1]
color_no_outlier = data["Class"][:-1]
plt.scatter(x_no_outlier, y_no_outlier, color=colors[color_no_outlier].tolist())
plt.show()
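
Instead of spotting the outlier by eye, one could flag it programmatically. The sketch below is one possible approach, not the worksheet's method: it marks a point as an outlier when either attribute is more than 3 sample standard deviations from that attribute's mean. The threshold of 3 and the helper name `z_scores` are assumptions.

```python
from statistics import mean, stdev

attribute_a = [3.5, 0, 1, 2.5, 2, 1.5, 2, 3.5, 1, 3, 2, 2, 2.5, 0.5, 0.0, 10]
attribute_b = [4, 1.5, 2, 1, 3.5, 2.5, 1, 0, 3, 1.5, 4, 2, 2.5, 0.5, 2.5, 10]

def z_scores(values):
    """Return how many sample standard deviations each value lies from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

THRESHOLD = 3  # assumed cutoff; other choices are equally defensible

outlier_indices = [
    i for i, (za, zb) in enumerate(zip(z_scores(attribute_a), z_scores(attribute_b)))
    if abs(za) > THRESHOLD or abs(zb) > THRESHOLD
]
print(outlier_indices)  # [15], the (10, 10) point
```

Note that the outlier itself inflates the mean and standard deviation, so on other datasets a median-based rule may be more robust.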

Noise points are points that could be considered invalid under the general trend in the data. These could be the result of actual errors in the data or randomness that we could attribute to oversimplification (for example, if some information or feature about each point is missing). Considering noise points in our model can often lead to overfitting.

c) Are there any points in the dataset that could be noise points?

Yes, the point at (3.5, 0) could be considered a noise point because it doesn't follow the general linear trend.

For the following point

|  A  |  B  |
|-----|-----|
| 0.5 |  1  |

d) Plot it in a different color along with the rest of the points in the dataset.

In [None]:
x, y = 0.5, 1
specific_point_color = 'red'
plt.scatter(x_no_outlier, y_no_outlier, color=colors[color_no_outlier].tolist())
plt.scatter(x, y, color=specific_point_color, label='Specific Point')
plt.legend() # display the label of the new point
plt.show()

e) Write a function to compute the Euclidean distance from it to all points in the dataset and pick the 3 closest points to it. In a scatter plot, draw a circle centered around the point with radius the distance of the farthest of the three points.

In [None]:
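# gets the closest n points to example using data
def n_closest_to(example, n, data):
    # accept lists or arrays for the two coordinate columns
    dataset_x, dataset_y = np.asarray(data[0]), np.asarray(data[1])
    x, y = example
    # Euclidean distance from example to every point in data
    distances = np.sqrt((dataset_x - x)**2 + (dataset_y - y)**2)
    # indices of the n smallest distances, nearest first
    closest_indices = np.argsort(distances)[:n]
    closest_distances = distances[closest_indices]
    return closest_distances, closest_indices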

# call the function n_closest_to
dataset_x, dataset_y = np.array(x_no_outlier), np.array(y_no_outlier)
closest_distances, closest_indices = n_closest_to((x,y), 3, (dataset_x, dataset_y))

# plot the circle and points in relation to its radius
location = ( x , y )
radius = max(closest_distances)
_, axes = plt.subplots()
axes.scatter(dataset_x, dataset_y, color='blue', label='Data Points')
axes.scatter(dataset_x[closest_indices], dataset_y[closest_indices], color='red', label='Closest Points')
axes.scatter(x, y, color='green', label='New Point') # the point the circle is centered on
cir = plt.Circle(location, radius, fill=False, alpha=0.8)
axes.add_patch(cir)
axes.set_aspect('equal') # necessary so that the circle is not oval
axes.legend()
plt.show()
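
As a cross-check on the NumPy version, the same neighbor search can be sketched with only the standard library, using `math.dist` for the Euclidean distance and `heapq.nsmallest` to pick the 3 closest indices. The variable names are illustrative.

```python
import heapq
import math

# Worksheet points without the (10, 10) outlier, in their original order.
points = [(3.5, 4), (0, 1.5), (1, 2), (2.5, 1), (2, 3.5), (1.5, 2.5), (2, 1), (3.5, 0),
          (1, 3), (3, 1.5), (2, 4), (2, 2), (2.5, 2.5), (0.5, 0.5), (0, 2.5)]
query = (0.5, 1)

# Indices of the 3 points with the smallest Euclidean distance to the query, nearest first.
nearest = heapq.nsmallest(3, range(len(points)),
                          key=lambda i: math.dist(query, points[i]))

# The circle radius is the distance to the farthest of those 3 neighbors.
radius = math.dist(query, points[nearest[-1]])

print(nearest)           # [13, 1, 2]
print(round(radius, 3))  # 1.118
```

The three neighbors are (0.5, 0.5), (0, 1.5), and (1, 2), at distances 0.5, about 0.707, and about 1.118.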

f) Write a function that takes the three points returned by your function in e) and returns the class that the majority of points have (break ties with a deterministic default class of your choosing). Print the class assigned to this new point by your function.

In [None]:
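def majority(points):
    # points holds indices into data["Class"]
    closest_classes = [data["Class"][i] for i in points]
    class_counts = {0: 0, 1: 0}
    for label in closest_classes:
        class_counts[label] += 1
    # on a tie, max returns the first key, so class 0 is the default
    majority_class = max(class_counts, key=class_counts.get)
    return majority_class

# class assigned to the new point (0.5, 1) from its 3 closest points found in e)
print("class assigned to new point:", majority(closest_indices))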
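
A variant worth considering takes the neighbor labels directly rather than indices into the dataset, and makes the tie-break explicit. This is a sketch using `collections.Counter`; the name `majority_label` and the `default` parameter are hypothetical, not part of the worksheet.

```python
from collections import Counter

def majority_label(labels, default=0):
    """Return the most common label; fall back to `default` on a tie or empty input."""
    counts = Counter(labels)
    if not counts:
        return default
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else default

print(majority_label([1, 0, 1]))  # 1 (two votes against one)
print(majority_label([0, 1]))     # 0 (tie, so the default wins)
```

Passing labels instead of indices keeps the vote independent of how the dataset is stored, which matters once points are removed from it.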

g) Re-using the functions from e) and f), you should be able to assign a class to any new point. In this exercise we will implement leave-one-out cross validation in order to evaluate the performance of our model.

For each point in the dataset:

- consider that point as your test set and the rest of the data as your training set
- classify that point using the training set
- keep track of whether you were correct with the use of a counter

Once you've iterated through the entire dataset, divide the counter by the number of points in the dataset to report an overall testing accuracy.

In [None]:
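count = 0

for i in range(len(dataset_x)):
    actual_class = data["Class"][i]
    test_point = (dataset_x[i], dataset_y[i])

    # training set = every point except point i
    training_x = x_no_outlier[:i] + x_no_outlier[i+1:]
    training_y = y_no_outlier[:i] + y_no_outlier[i+1:]
    training_set = (np.array(training_x), np.array(training_y))

    closest_distances, closest_indices = n_closest_to(test_point, 3, training_set)

    # the returned indices refer to the training set, which skips point i;
    # shift indices >= i by one so they line up with data["Class"] again
    original_indices = [j if j < i else j + 1 for j in closest_indices]

    prediction = majority(original_indices)
    if prediction == actual_class:
        count += 1

print("overall accuracy = ", count/len(dataset_x))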