### Joshua Campos
## Lab Course Machine Learning
# Exercise 7

------------

We import all the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We define some general functions to split our data into train and test set, to calculate the accuracy of our classification model and to calculate the RMSE of our regression model. 

In [2]:
def split_train_test(x_data,y_data,train_percentage):
    row_selection = np.random.rand(len(x_data)) < train_percentage
    x_train = x_data[row_selection]
    y_train = pd.DataFrame(y_data[row_selection])
    x_test = x_data[~row_selection]
    y_test = pd.DataFrame(y_data[~row_selection])
    return (x_train,y_train,x_test,y_test)

def calculate_accuracy(y_test,y_predicted):
    correct_predictions = 0
    y_test_arr = np.array(y_test)
    for i in range(len(y_test)):
        if y_predicted[i] == y_test_arr[i]:
            correct_predictions += 1
    accuracy = correct_predictions/len(y_test)
    num_samples = len(y_test)
    wrong_predictions = num_samples-correct_predictions
    return (accuracy,correct_predictions,wrong_predictions,num_samples)

def calculate_rmse(y_test,y_predicted):
    y_test_arr = np.array(y_test)
    return np.sqrt(np.square(np.subtract(y_test_arr,y_predicted)).mean())

### 1. Implement K-Nearest Neighbors (KNN)

**KNN Functions:**

We define the functions we are going to use for our KNN algorithm. These functions are used to calculate the euclidian distance between data points, get the neighbors according to the given 'k', get the neighbors' index and class, get the majority voting for our classification model, and finally the KNN prediction function for classification and regression models.

In [3]:
def calculate_euclidian_distance(X_train,X_test_instance):
    train_samples = X_train.shape[0]
    total_distance = []
    for i in range(train_samples):
        distance = 0
        features = X_train.columns
        for feature in features:
            distance += np.sqrt(np.square(X_train[feature].iloc[i] - X_test_instance[feature]))
        total_distance.append(distance)
    return np.array(total_distance)

def get_neighbors(k,distances):
    dist_arr = np.copy(distances)
    k_neighbors = []
    for i in range(k):
        min_index = np.argmin(dist_arr)
        min_dist = np.min(dist_arr)
        k_neighbors.append(min_dist)
        dist_arr = np.delete(dist_arr,min_index)
    return k_neighbors

def get_neighbors_index(distances,neighbors):
    indices = []
    for value in neighbors:
        index = np.where(distances == value)
        indices.append(index[0][0])
    return indices

def get_neighbors_class(y_train,indices):
    classes = []
    y_classes = np.array(y_train)
    for index in indices:
        classes.append(y_classes[index])
    return classes

def get_majority_voting(classes):
    uniques = np.unique(classes)
    class_count = []
    for element in uniques:
        class_count.append(classes.count(element))
    max_class = np.max(class_count)
    max_index = np.where(class_count == max_class)[0][0]
    pred_class = uniques[max_index]
    return pred_class

def predict_knn_class(X_train,y_train,X_test,y_test,k):
    predictions = []
    for i in range(len(X_test)):
        test_sample = X_test.iloc[i,:]
        total_distance = calculate_euclidian_distance(X_train,test_sample)
        neighbors = get_neighbors(k,total_distance)
        indices = get_neighbors_index(total_distance,neighbors)
        classes = get_neighbors_class(y_train,indices)
        pred_class = get_majority_voting(classes)
        predictions.append(pred_class)
    return predictions

def predict_knn_reg(X_train,y_train,X_test,y_test,k):
    predictions = []
    for i in range(len(X_test)):
        test_sample = X_test.iloc[i,:]
        total_distance = calculate_euclidian_distance(X_train,test_sample)
        neighbors = get_neighbors(k,total_distance)
        indices = get_neighbors_index(total_distance,neighbors)
        classes = get_neighbors_class(y_train,indices)
        class_mean = round(np.mean(classes))
        predictions.append(class_mean)
    return predictions

**Cross Validation Functions:**

We define the functions for our cross validation, which we will be using to determine the optimal value of 'k'. We create a function to create folds/groups from our data to use as training and validations sets, and a function to peform the cross validation across the previously created folds given the 'k' values. 

In [4]:
def create_folds(X_train,y_train,num_folds): 
    folds = [] 
    data = np.hstack((X_train, y_train))
    np.random.shuffle(data) 
    fold_size = data.shape[0] // num_folds
    for i in range(num_folds): 
        fold = data[i * fold_size:(i + 1)*fold_size, :]
        X_fold = fold[:, :-1] 
        y_fold = fold[:, -1].reshape((-1, 1))
        folds.append((X_fold, y_fold))  
    return folds

def k_fold_cross_validation(folds,k_array,problem_type,column_names):
    parameters_df = pd.DataFrame(columns=['K Neighbors','Avg. Measure'])
    for k in k_array:
        fold_measures = []
        for i in range(len(folds)-1):
            x_test_f = pd.DataFrame(folds[i][0],columns=column_names[:-1])
            y_test_f = pd.DataFrame(folds[i][1],columns=[column_names[-1]])
            x_train_f = pd.DataFrame(columns=column_names[:-1])
            y_train_f = pd.DataFrame(columns=[column_names[-1]])
            
            for n in [x for x in range(len(folds)) if x != i]:
                x_train_f = x_train_f.append(pd.DataFrame(folds[n][0],columns=column_names[:-1]))
                y_train_f = y_train_f.append(pd.DataFrame(folds[n][1],columns=[column_names[-1]]))
            if problem_type == 'classification':
                predictions = predict_knn_class(x_train_f,y_train_f,x_test_f,y_test_f,k)
                y_test_arr = np.array(y_test_f)
                accuracy = calculate_accuracy(y_test_arr,predictions)[0]
                fold_measures.append(accuracy)
            elif problem_type == 'regression':
                predictions = predict_knn_reg(x_train_f,y_train_f,x_test_f,y_test_f,k)
                y_test_arr = np.array(y_test_f)
                rmse = calculate_rmse(y_test_arr,predictions)
                fold_measures.append(rmse)
            else:
                raise Exception('Invalid problem type: choose "classification" or "regression"')
        
        measure_mean = np.mean(fold_measures)
        parameters_df = parameters_df.append({'K Neighbors':k,'Avg. Measure':measure_mean},
                                             ignore_index=True)
    
    if problem_type == 'classification':
        optimal_row = parameters_df[parameters_df['Avg. Measure'] == np.max(
            parameters_df['Avg. Measure'])]
    elif problem_type == 'regression':
        optimal_row = parameters_df[parameters_df['Avg. Measure'] == np.min(
            parameters_df['Avg. Measure'])]
    
    optimal_k = optimal_row['K Neighbors'].values[0]
    best_measure = optimal_row['Avg. Measure'].values[0]
    return (parameters_df,optimal_k,best_measure)

### Iris Dataset

We read our iris data file and name the columns accordingly. We then print the head to understand how our data looks like. After this, we can separate our data into independent and dependent variables, which will then be separated into our training and testing sets. 

In [5]:
features = ['sepal length','sepal width','petal length','petal width','class']
iris_data = pd.read_csv('iris.data',names=features)
print(iris_data.head())

X = iris_data.drop('class',axis=1)
y = iris_data['class']

train_percentage = 0.7
X_train,y_train,X_test,y_test = split_train_test(X,y,train_percentage)

   sepal length  sepal width  petal length  petal width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


We run our KNN Classification algorithm to predict the class of our test data, using a 'k' of 5. After this, we print the accuracy of our model, as well as the correct predictions, wrong predictions and total number of samples predicted. We decided to use the accuracy as quality criterion for our model, because it gives us a general idea of how our model is performing. We want to know how many predictions were correct out of the whole test set. 

In [6]:
k = 5
predictions = predict_knn_class(X_train,y_train,X_test,y_test,k)
accuracy,correct_predictions,wrong_predictions,num_samples = calculate_accuracy(y_test,predictions)

print('KNN Classification: Iris Dataset\nK: {}\nAccuracy: {}\nCorrect Predictions: {}\
      \nIncorrect Predictions: {}\nTotal Samples: {}'.format(k,accuracy,correct_predictions,
      wrong_predictions,num_samples))

KNN Classification: Iris Dataset
K: 5
Accuracy: 0.9
Correct Predictions: 36      
Incorrect Predictions: 4
Total Samples: 40


#### Determine the Optimal Value for K in KNN

We split our train data into four folds, so we can perform the cross validation with different values for 'k', which were defined to be all the odd numbers from one to 15. After the cross validation, we print the average accuracy for each 'k' used, and also the optimal 'k' with its accuracy measure. 

In [7]:
folds = create_folds(X_train,y_train,4)
k_list = [1,3,5,7,9,11,13,15]
problem_type = 'classification'
column_names = iris_data.columns

parameters_df,optimal_k,best_measure = k_fold_cross_validation(folds,k_list,problem_type,column_names)

print(parameters_df)
print('\nOptimal K: {}\nBest Measure: {}'.format(optimal_k,best_measure))

   K Neighbors  Avg. Measure
0          1.0      0.925926
1          3.0      0.962963
2          5.0      0.987654
3          7.0      0.987654
4          9.0      0.975309
5         11.0      0.975309
6         13.0      0.962963
7         15.0      0.975309

Optimal K: 5.0
Best Measure: 0.9876543209876543


Now we run again the KNN Classification algorithm, but this time we use the optimal 'k' value given in the cross validation. We then print the model accuracy, correct predictions, wrong predictions and the total test samples. 

In [8]:
k = int(optimal_k)
predictions = predict_knn_class(X_train,y_train,X_test,y_test,k)
accuracy,correct_predictions,wrong_predictions,num_samples = calculate_accuracy(y_test,predictions)

print('KNN Classification: Iris Dataset\nK: {}\nAccuracy: {}\nCorrect Predictions: {}\
      \nIncorrect Predictions: {}\nTotal Samples: {}'.format(k,accuracy,correct_predictions,
      wrong_predictions,num_samples))

KNN Classification: Iris Dataset
K: 5
Accuracy: 0.9
Correct Predictions: 36      
Incorrect Predictions: 4
Total Samples: 40


**Comment:** An observation for this case is that the optimal 'k' value given here is not the definitive best value. The algorithm was run several times and it didn't always give the same optimal 'k'. This value depends a lot on the dataset and how it is divided for the train and test set. It is easier to predict the classes if the values are centered between data points of the same class, but if a datapoint is close to the border between another class, it might miss the prediction. We can also see that after doing the cross-validation, we best performing 'k' value was 5, as we chose previously, but this wasn't always the case when we ran the algorithm several times before. 

#### Wine Dataset

We read the wine data file and split it into our independent and dependent variables, which we then split into our training and testing set. 

In [9]:
wine_data = pd.read_csv('winequality-red.csv',sep=';')

X = wine_data.drop('quality',axis=1)
y = wine_data['quality']

train_percentage = 0.7
X_train,y_train,X_test,y_test = split_train_test(X,y,train_percentage)

We run our KNN Regression algorithm to predict the class of our test data, using a 'k' of 40, which we defined by getting the squared root of the size of the dataset. After this, we print the accuracy of our model, as well as the correct predictions, wrong predictions and total number of samples predicted. We decided to use the RMSE and the accuracy as quality criterion for our model, because the RMSE gives us an idea of the error rate that we had for our model and the accuracy gives us a general idea of how our model is performing, by calculating how many predictions were correct out of the whole test set. Usually, accuracy is not used to calculate quality in regression models, but because the class we are predicting are whole numbers and our model's predictions are also whole numbers, we can use this method as well to see if our model manages to predict correctly each class. 

In [10]:
k = int(round(np.sqrt(len(wine_data))))
predictions = predict_knn_reg(X_train,y_train,X_test,y_test,k)
accuracy,correct_predictions,wrong_predictions,num_samples = calculate_accuracy(y_test,predictions)
rmse = calculate_rmse(y_test,predictions)

print('KNN Regression: Wine Dataset\nK: {}\nRMSE: {}\nAccuracy: {}\nCorrect Predictions: {}\
      \nIncorrect Predictions: {}\nTotal Samples: {}'.format(k,rmse,accuracy,correct_predictions,
      wrong_predictions,num_samples))

KNN Regression: Wine Dataset
K: 40
RMSE: 0.9611399681493781
Accuracy: 0.5221238938053098
Correct Predictions: 236      
Incorrect Predictions: 216
Total Samples: 452


#### Determine the Optimal Value for K in KNN

We split our train data into four folds, so we can perform the cross validation with different values for 'k', which were defined to be every 10 numbers from 5 to 75. For this model we choose to use higher values for 'k', because the dataset is larger, so we want to see how the model performs with higher values. After the cross validation, we print the average accuracy for each 'k' used, and also the optimal 'k' with its accuracy measure. 

In [11]:
folds = create_folds(X_train,y_train,4)
k_list = [5,15,25,35,45,55,65,75]
problem_type = 'regression'
column_names = wine_data.columns

parameters_df,optimal_k,best_measure = k_fold_cross_validation(folds,k_list,problem_type,column_names)

print(parameters_df)
print('\nOptimal K: {}\nBest RMSE: {}'.format(optimal_k,best_measure))

   K Neighbors  Avg. Measure
0          5.0      0.996256
1         15.0      0.961934
2         25.0      0.949158
3         35.0      0.945225
4         45.0      0.942646
5         55.0      0.939697
6         65.0      0.934872
7         75.0      0.931723

Optimal K: 75.0
Best RMSE: 0.9317231087823933


Now we run again the KNN Regression algorithm, but this time we use the optimal 'k' value given in the cross validation. We then print the model accuracy, correct predictions, wrong predictions and the total test samples. 

In [12]:
k = int(optimal_k)
predictions = predict_knn_reg(X_train,y_train,X_test,y_test,k)
accuracy,correct_predictions,wrong_predictions,num_samples = calculate_accuracy(y_test,predictions)
rmse = calculate_rmse(y_test,predictions)

print('KNN Regression: Wine Dataset\nK: {}\nRMSE: {}\nAccuracy: {}\nCorrect Predictions: {}\
      \nIncorrect Predictions: {}\nTotal Samples: {}'.format(k,rmse,accuracy,correct_predictions,
      wrong_predictions,num_samples))

KNN Regression: Wine Dataset
K: 75
RMSE: 0.9514663409598269
Accuracy: 0.5088495575221239
Correct Predictions: 230      
Incorrect Predictions: 222
Total Samples: 452


**Comment:** We can see that something really interesting is happening in this case. When we used the optimal 'k' we got a lower RMSE, which is better for our problem, which is a KNN Regression, but we got a lower accuracy, which is worse, than when we used 40 as fixed 'k' value. If we think about this, we can understand that we missed more exact class predictions with the optimal 'k', but the error for each missed class was lower than with a fixed 'k' value, as we can tell from the lower RMSE. We can also see from the cross-validations, that the higher the 'k', the better the model performed, which might be because we get a more accurate average from its neighbors. In conclusion, we can say that our model performed better, because we are using RMSE as the main quality criteria for our regression. 