
In [None]:
import csv
import pandas as pd
import numpy as np

# Part 1 - Reading and Preprocessing Data
In this part, we define the load_data and split_data functions. load_data loads the data from a csv file, and split_data splits a dataset into training and test sets.

Load Data Function:

In [None]:
#define load_data function
def load_data(filename):
    #load csv as dataframe, then convert to np array
    df = pd.read_csv(filename)
    data = df.to_numpy()

    #the data is a numpy array, but since the values are separated by semicolons
    #rather than commas, the array has only 1 column. Next, we reformat this array
    #so that it has 11 columns (the first 11 fields of each row):


    #create an array that matches the dimensionality of our data
    formatted_data = np.zeros([len(data),11])
    #print(formatted_data.shape)
    i = 0

    #iterate through each row of the data, splitting its string on ';' to get a new array
    #then insert the first 11 values as a row in formatted_data (cast to float on assignment)
    for row in data:
        formatted_row = str(row[0]).split(';')
        formatted_row = np.array(formatted_row)
        formatted_data[i] = formatted_row[0:11]
        i += 1

    #print(formatted_data)


    #return the formatted array!
    return formatted_data
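
As a possible alternative to splitting each row by hand, `pd.read_csv` accepts a `sep` argument, so a semicolon-delimited file can be parsed in one step. The sketch below is only an illustration, not a replacement for the function above: the name `load_data_sep` and the two-row sample text are made up for the example, and it assumes the file has a header row and that the first 11 columns are the numeric features.

```python
import io

import numpy as np
import pandas as pd


def load_data_sep(source, n_features=11):
    # hypothetical helper: parse a semicolon-delimited CSV directly
    df = pd.read_csv(source, sep=';')
    # keep only the first n_features columns and convert to a float array
    return df.iloc[:, :n_features].to_numpy(dtype=float)


# made-up sample with 12 columns (11 features plus one extra column)
header = ';'.join(f'c{j}' for j in range(12))
rows = [';'.join(str(float(j + r)) for j in range(12)) for r in range(2)]
sample = io.StringIO('\n'.join([header] + rows))

demo = load_data_sep(sample)
print(demo.shape)
```

Passing a filename such as 'redwine.csv' instead of the in-memory sample should work the same way, provided the file follows the layout assumed here.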

Split Data Function:

In [None]:
#define split_data function
def split_data(dataset, ratio):

    #get the length of the dataset
    length = len(dataset)

    #the length of the training set
    train_length = int(length * ratio)

    #isolate the training set
    train = dataset[0:train_length]

    #isolate the test set
    test = dataset[train_length:]

    #return the training and test sets as a tuple
    return train, test
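
To make the behavior of the split concrete, here is a small sketch on a made-up 10-row array. It repeats the function in condensed form so the block runs on its own. Note that the split is sequential: the rows are not shuffled, so the test set is always the tail of the array.

```python
import numpy as np


def split_data(dataset, ratio):
    # same sequential split as above: the first `ratio` share of rows is the training set
    train_length = int(len(dataset) * ratio)
    return dataset[0:train_length], dataset[train_length:]


# toy dataset: 10 rows, 2 columns (made up for illustration)
toy = np.arange(20).reshape(10, 2)
toy_train, toy_test = split_data(toy, 0.9)

print(toy_train.shape)
print(toy_test.shape)
```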

Next, we run these functions to check that they work!

In [None]:
#first get the nd array of the data from the csv
redwine_data = load_data('redwine.csv')

#Then get a train/test tuple with the split data function
redwine_dataset = split_data(redwine_data,0.9)

#Check the data
redwine_data

FileNotFoundError: [Errno 2] No such file or directory: 'redwine.csv'

With redwine.csv present in the working directory, the data looks properly formatted! Next we load in the data for the white wine dataset:

In [None]:
whitewine_data = load_data('whitewine.csv')

whitewine_dataset = split_data(whitewine_data,0.9)

Then, we separate each dataset into its respective training and test sets, and print out the shapes to make sure that everything is working properly:

In [None]:
rw_train, rw_test = redwine_dataset[0], redwine_dataset[1]
ww_train, ww_test = whitewine_dataset[0], whitewine_dataset[1]

print(f"Red Wine Training Data Shape: {rw_train.shape}")
print(f"Red Wine Testing Data Shape: {rw_test.shape}")
print()

print(f"White Wine Training Data Shape: {ww_train.shape}")
print(f"White Wine Testing Data Shape: {ww_test.shape}")

The shapes look good! Time to move on to part 2!

# Part 2 - Nearest Centroid Classifier
In this part, we define all the necessary functions to perform Nearest Centroid Classification on the data. The required functions are the Compute Centroid function, the Get Distance function, and the Experiment function.

Compute Centroid Function:

In [None]:
#Modeled off the lecture notes
def compute_centroid(samples):
    #sum the samples column by column and divide by the sample count to get the average point
    return np.sum(samples, axis=0) / samples.shape[0]
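
As a quick check of what the centroid is, the sketch below computes it for three made-up 2-D points. The function is repeated in an equivalent form so the block is self-contained.

```python
import numpy as np


def compute_centroid(samples):
    # column-wise sum divided by the number of samples, i.e. the mean point
    return np.sum(samples, axis=0) / samples.shape[0]


# three made-up points in 2-D
points = np.array([[0.0, 0.0],
                   [2.0, 4.0],
                   [4.0, 2.0]])
centroid = compute_centroid(points)
print(centroid)  # [2. 2.]
```

The same result can be obtained with `points.mean(axis=0)`.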

Get Distance Function:

In [None]:
#Again modeled off of the lecture notes
def get_distance(data,centroid):
    #calculate euclidean distance between a datapoint and the centroid
    return np.linalg.norm(data-centroid)
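
For a worked example of the distance, `np.linalg.norm` applied to a difference of two vectors returns the square root of the summed squared differences. The two points below are made up so that the result is easy to verify by hand (a 3-4-5 triangle).

```python
import numpy as np


def get_distance(data, centroid):
    # Euclidean distance between a datapoint and the centroid
    return np.linalg.norm(data - centroid)


point = np.array([4.0, 6.0])
center = np.array([1.0, 2.0])
dist = get_distance(point, center)
print(dist)  # 5.0
```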

Experiment Function:

In [None]:
def experiment(ww_training, ww_test, rw_training, rw_test):

    #First determine the centroids for each class using the compute_centroid function
    ww_centroid = compute_centroid(ww_training)
    rw_centroid = compute_centroid(rw_training)
    #It is important to note that the centroids are ONLY calculated using the training set



    #Then declare two variables, one keeps track of the number of correct predictions
    #The other keeps track of the total number of predictions
    correct = 0
    predictions = 0

    #Iterate through the rows in the test set for the white wine
    for row in ww_test:

        #Increment predictions
        predictions += 1

        #If the point is closer to the white wine centroid, then the classifier is correct!
        if get_distance(row,ww_centroid) < get_distance(row,rw_centroid):
            label = "white"
            correct += 1
        else:
            #if not then the classifier is wrong
            label = "red"

    #Do the same iterative process on the red wine data
    for row in rw_test:
        predictions += 1
        if get_distance(row,ww_centroid) < get_distance(row,rw_centroid):
            label = "white"
        else:
            label = "red"
            correct += 1

    #Determine accuracy by normalizing correct prediction count by total number of predictions
    accuracy = correct / predictions

    #Print out required stats
    print(f"Total Predictions: {predictions}\nTotal Correct: {correct}\nAccuracy: {accuracy}")

    #Return the accuracy
    return accuracy
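
The classification rule inside the loops can also be written without an explicit loop. The following is a minimal sketch on made-up 2-D data, not the code used for the results in this notebook; `predict_labels` is a hypothetical helper name. It keeps the same tie-breaking as the loops above, where a point equally far from both centroids is labeled "red".

```python
import numpy as np


def predict_labels(samples, white_centroid, red_centroid):
    # hypothetical helper: distance from every row to each centroid
    d_white = np.linalg.norm(samples - white_centroid, axis=1)
    d_red = np.linalg.norm(samples - red_centroid, axis=1)
    # strictly closer to the white centroid -> "white", otherwise "red"
    return np.where(d_white < d_red, "white", "red")


# made-up training points for the two classes
white_train = np.array([[0.0, 0.0], [0.0, 2.0]])
red_train = np.array([[10.0, 10.0], [10.0, 12.0]])
white_centroid = white_train.mean(axis=0)
red_centroid = red_train.mean(axis=0)

# the third point is equally far from both centroids
test_points = np.array([[1.0, 1.0], [9.0, 11.0], [5.0, 6.0]])
labels = predict_labels(test_points, white_centroid, red_centroid)
print(labels)
```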

With the model defined, we can run the experiment to see how well it does!

In [None]:
model_accuracy = experiment(ww_train,ww_test,rw_train,rw_test)

The nearest centroid model appears to do quite well, boasting almost a 91% accuracy rate. In the next step, we perform cross validation to ensure that this high accuracy rate is not just a fluke of this particular test set.

# Part 3 - Cross Validation
In this step, we perform cross validation. To do this, we define the Cross Validation function, which splits each dataset into k folds and uses each fold once as the test set.

In [None]:
def cross_validation(ww_data, rw_data,k):
    #Set the average accuracy initially to 0
    average_accuracy = 0

    #Define a step size for each dataset as a function of k
    #The two datasets have different lengths, so each needs its own fold size
    ww_step = len(ww_data) // k
    rw_step = len(rw_data) // k

    #Iterate k times through the dataset, using the classifying model each repetition
    for i in range(k):
        print(f"Experiment {i+1}:")

        #First split up the dataset - begin by getting the starting index of each test set
        ww_start = i*ww_step
        rw_start = i*rw_step

        #If a dataset's length is not evenly divisible by k, equal-sized folds
        #would leave some rows at the end out of every test set
        #To account for this, the last fold extends to the end of the data
        #This means that the test set for the last loop will be slightly larger
        if i != k-1:
            ww_end = ww_start + ww_step
            rw_end = rw_start + rw_step
        else:
            ww_end = len(ww_data)
            rw_end = len(rw_data)

        #Slice each dataset with the starting and ending indices of its test set
        ww_test = ww_data[ww_start:ww_end]
        rw_test = rw_data[rw_start:rw_end]

        #Then stack the remainder of the data to create the training set
        ww_train = np.vstack((ww_data[0:ww_start],ww_data[ww_end:]))
        rw_train = np.vstack((rw_data[0:rw_start],rw_data[rw_end:]))

        #With the data formatted, run the model
        model_accuracy = experiment(ww_train,ww_test,rw_train,rw_test)

        #Increment the average accuracy
        average_accuracy += model_accuracy

        #Print statement for formatting purposes
        print()

    #Finally, divide the average accuracy by k to make it a real average and not just a sum
    average_accuracy /= k

    #Return the average accuracy
    return average_accuracy
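
The fold boundaries can also be derived with `np.array_split`, which divides an index range into k contiguous pieces. The sketch below is an illustration under assumptions, with `fold_indices` as a hypothetical helper name and a made-up size of 10 rows. Unlike the function above, it spreads any leftover rows across the first folds instead of adding them all to the last one.

```python
import numpy as np


def fold_indices(n_samples, k):
    # hypothetical helper: split row indices 0..n_samples-1 into k contiguous folds
    return np.array_split(np.arange(n_samples), k)


folds = fold_indices(10, 3)
for i, test_idx in enumerate(folds):
    # every index not in the test fold belongs to the training set
    train_idx = np.setdiff1d(np.arange(10), test_idx)
    print(f"Fold {i+1}: test={test_idx.tolist()}, train size={len(train_idx)}")
```

Each row appears in exactly one test fold, which is the property the cross validation loop relies on.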

With the cross validation function complete, we can see exactly how well the model works!

In [None]:
#Using k = 20 to ensure robust results and sufficiently sized training data
avg_accuracy = cross_validation(whitewine_data, redwine_data,20)
print(f"Average Accuracy: {avg_accuracy}")

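The average accuracy is right around 88%, close to the single-split result from Part 2. This is an indicator that the model performs well and that the earlier accuracy was not a fluke of one test set!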

Thanks for a great semester!

-Ming, Joyce, and Robert