# Artificial Intelligence
# 464/664
# Assignment #7

## General Directions for this Assignment

00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Output format should be exactly as requested (it is your responsibility to make sure the notebook looks as expected on Gradescope),
02. Check submission deadline on Gradescope, 
03. Rename the file to Last_First_assignment_7, 
04. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
05. Do not submit any other files.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".

## Neural Networks

For this assignment we will explore Neural Networks; in particular, we are going to explore model complexity. We will use the same dataset from Assignment #6 to classify a mushroom as either edible ('e') or poisonous ('p'). You are free to use PyTorch, TensorFlow, scikit-learn -- to name a few resources. The goal is to explore different model complexities (architectures) before declaring a winner. Either start with a simple network and make it more complex; or start with a complex model and pare it down. Either way, your submission should clearly demonstrate your exploration. 


Your output for each model should look like the output of `cross_validate` from Assignment #6:

```
Fold: 0	Train Error: 15.38%	Validation Error: 0.00%
Fold: 1
...

Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 100.00%(0.00%) Validation Error: 100.00%(0.00%)
```

Notice that "Test Error" has been replaced by "Validation Error." Split your dataset into train, test, and validation sets. 


Start with a simple network. Train using the train set. Observe the model's performance using the validation set. 


Increase the complexity of your network. Train using the train set. Observe the model's performance using the validation set. 


Model complexity in Assignment #6 was depth limit. You can think of it here as the architecture of the network (number of layers and units per layer). Try at least three different network architectures. 


We're trying to find a model complexity that generalizes well. (Recall the high bias vs. high variance discussion in class.) 


Pick the network architecture that you deem best. Use the test set to report your winning model's performance. This is the ONLY time you use the test set.


No other directions for this assignment, other than what's here and in the "General Directions" section. You have a lot of freedom with this assignment. Don't get carried away. Try at least three different models; more importantly, document your process. Graders are not going to run your notebooks. The notebook will be read as a report on how different models were explored: what the results were, how the winning model was determined, what was the winning model's performance on the test data. Clearly highlight these items to receive full credit. Since you'll be using libraries, the emphasis will be on your ability to communicate your findings.

In [1]:
# Implementation and exploration.

In [2]:
import random
import math
import copy
from typing import List, Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

Many of the functions from Assignment #6 will be borrowed to load the data, create the folds, and print the error rates.

<a id="create_folds"></a>
## create_folds


With n-fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. For a data set with 100 observations (or records), setting n to 10 would put 10 observations in each fold.

* **data** List: a list (data_lecture, for instance)
* **n** int: number of folds


**returns** 
folds, which is a list of n items, where each item is a list containing a subgroup of data

In [3]:
def create_folds(data: List, n: int) -> List[List[List]]:
    k, m = divmod(len(data), n)
    return list(data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
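A small usage sketch, assuming a toy list of ten integers rather than the mushroom records, shows how the slicing spreads the remainder over the first folds when the data does not divide evenly. The function body repeats the logic of the cell above so the block runs on its own; `toy` is a made-up stand-in.

```python
from typing import List

def create_folds(data: List, n: int) -> List[List]:
    # Same slicing logic as the cell above: the first `m` folds get one extra item.
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

toy = list(range(10))                    # hypothetical stand-in data
folds = create_folds(toy, 3)
print([len(fold) for fold in folds])     # → [4, 3, 3]
print(folds)                             # → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Because the folds are contiguous slices, any shuffling has to happen before this call, which is what `parse_data` does.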

<a id="create_train_test"></a>
## create_train_test


This function takes the n folds and returns the train and test sets. One of the n folds is used for testing; the others are used for training.

* **folds** List[List[List]]: see `create_folds`
* **index** int: index of the fold that is used for testing


**returns** 
(training, test) Tuple[List[List], List[List]]: the records of all other folds concatenated, and the records of the fold at `index`

In [4]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test
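As a quick illustration, assuming three made-up folds of two items each, selecting index 1 holds out the middle fold and concatenates the rest. The function is repeated from the cell above so the block is self-contained.

```python
from typing import List, Tuple

def create_train_test(folds: List[List], index: int) -> Tuple[List, List]:
    training, test = [], []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold                   # the held-out fold
        else:
            training = training + fold    # every other fold, concatenated in order
    return training, test

toy_folds = [[1, 2], [3, 4], [5, 6]]      # hypothetical folds
training, test = create_train_test(toy_folds, 1)
print(training, test)                     # → [1, 2, 5, 6] [3, 4]
```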

<a id="parse_data"></a>
## parse_data

Opens a file, splits each line on commas, converts every single-character value to its integer code with `ord`, and shuffles the records before returning them as a list of lists. 

* **file_name** str: filename for data


**returns** 
Data as a list of lists.

In [5]:
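def parse_data(file_name: str) -> List[List]:
    data = []
    # The context manager closes the file even if parsing fails.
    with open(file_name, "r") as file:
        for line in file:
            # Encode each single-character category as its integer code point.
            datum = [ord(value) for value in line.rstrip().split(",")]
            data.append(datum)
    random.shuffle(data)
    return data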
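To make the encoding concrete, here is a minimal sketch on one made-up line in the same comma-separated, single-character format (the record itself is hypothetical). It also shows the later step in this notebook that moves the class label from the first position to the last.

```python
line = "p,x,s,n\n"   # hypothetical record: class label first, then three attribute codes

# Same per-line transformation as parse_data: one integer code point per value.
datum = [ord(value) for value in line.rstrip().split(",")]
print(datum)          # → [112, 120, 115, 110]

# Move the label to the end, as done for data_mushroom further down.
rotated = datum[1:] + [datum[0]]
print(rotated)        # → [120, 115, 110, 112]
```

Note that the integers are just character codes, so they impose an ordering and spacing on the categories that the original letters do not have.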

<a id="get_stats"></a>
## get_stats

This function computes the mean and the standard deviation for a given list of observations. 

* **observations** List[float]: A list of observations


**returns** (mean, standard deviation) Tuple[float,float]: tuple consisting of mean and the standard deviation

In [6]:
def get_stats(observations: List[float]) -> Tuple[float,float]:
    mean = sum(observations) / len(observations)
    variance = sum([(elem - mean)**2 for elem in observations]) / len(observations)
    std_dev = math.sqrt(variance)
    return mean, std_dev
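A short worked example, using made-up per-fold error rates, shows the summary format and confirms that the function computes the population standard deviation (dividing by n, not n - 1). The comparison against the standard library's `statistics` module is added here only as a check.

```python
import math
import statistics
from typing import List, Tuple

def get_stats(observations: List[float]) -> Tuple[float, float]:
    mean = sum(observations) / len(observations)
    variance = sum((elem - mean) ** 2 for elem in observations) / len(observations)
    return mean, math.sqrt(variance)

errors = [0.10, 0.20, 0.30, 0.40]   # hypothetical per-fold error rates
mean, std_dev = get_stats(errors)
print(f"{mean*100:.2f}%({std_dev*100:.2f}%)")   # → 25.00%(11.18%)

# Matches the population formula, and is smaller than the sample (n - 1) version.
print(math.isclose(std_dev, statistics.pstdev(errors)))   # → True
print(std_dev < statistics.stdev(errors))                 # → True
```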

In [7]:
data_mushroom = parse_data("agaricus-lepiota.data")

In [8]:
data_mushroom = [record[1:]+[record[0]] for record in data_mushroom]

In [9]:
attribute_names_mushroom = ['cap-shape',
                   'cap-surface',
                   'cap-color',
                   'bruises?',
                   'odor',
                   'gill-attachment',
                   'gill-spacing',
                   'gill-size',
                   'gill-color',
                   'stalk-shape',
                   'stalk-root',
                   'stalk-surface-above-ring',
                   'stalk-surface-below-ring',
                   'stalk-color-above-ring',
                   'stalk-color-below-ring',
                   'veil-type',
                   'veil-color',
                   'ring-number',
                   'ring-type',
                   'spore-print-color',
                   'population',
                   'habitat']

In [10]:
folds_mushroom = create_folds(data=data_mushroom, n=11)
test_fold = copy.deepcopy(folds_mushroom[10])
folds_mushroom = folds_mushroom[:-1]
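The split above creates 11 folds, holds out the last one as the test set, and keeps the other 10 for cross validation. The sketch below works out the resulting fold sizes using the same `divmod` arithmetic as `create_folds`. The record count of 8124 is an assumption about the data file, not something computed in this notebook.

```python
n_records = 8124                       # assumed number of rows in the data file
n_folds = 11

k, m = divmod(n_records, n_folds)      # base fold size and number of folds with one extra row
sizes = [k + 1 if i < m else k for i in range(n_folds)]

cv_sizes, test_size = sizes[:-1], sizes[-1]
print(cv_sizes)                        # → [739, 739, 739, 739, 739, 739, 738, 738, 738, 738]
print(test_size, sum(cv_sizes))        # → 738 7386
```

Under that assumption, roughly 9% of the records are reserved for the final test and each validation fold is about 10% of the remainder.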

I created 3 different models. The models are similar but differ in the number of hidden layers, the number of neurons per layer, and the total number of neurons. I decided to use ReLU, the rectified linear unit, as the activation function because it is a common choice for the hidden layers of classification networks. By varying the layers and neurons as described above, I hope to see which configuration is more effective.

Model 1: This model uses one hidden layer of 6 neurons

In [11]:
class Model_1(nn.Module):
    def __init__(self, input_features=22, layer_1=6, output_features=2):
        super().__init__()
        self.fc1 = nn.Linear(input_features, layer_1)
        self.out = nn.Linear(layer_1, output_features)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.out(x)
        return x

Model 2: This model uses 2 hidden layers of 6 and 3 neurons

In [12]:
class Model_2(nn.Module):
    def __init__(self, input_features=22, layer_1=6, layer_2=3, output_features=2):
        super().__init__()
        self.fc1 = nn.Linear(input_features, layer_1)
        self.fc2 = nn.Linear(layer_1, layer_2)
        self.out = nn.Linear(layer_2, output_features)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.out(x)
        return x

Model 3: This model uses 3 hidden layers of 12, 6, and 3 neurons

In [13]:
class Model_3(nn.Module):
    def __init__(self, input_features=22, layer_1=12, layer_2=6, layer_3=3, output_features=2):
        super().__init__()
        self.fc1 = nn.Linear(input_features, layer_1)
        self.fc2 = nn.Linear(layer_1, layer_2)
        self.fc3 = nn.Linear(layer_2, layer_3)
        self.out = nn.Linear(layer_3, output_features)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.out(x)
        return x

In [14]:
m_one = Model_1()
m_two = Model_2()
m_three = Model_3()
models = [m_one, m_two, m_three]
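One rough way to compare the complexity of the three architectures is to count their trainable parameters. The sketch below does this from the layer sizes alone, assuming fully connected layers with a bias term (the default for `nn.Linear`). The helper `count_parameters` is a hypothetical name introduced for illustration.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Input size, hidden layer sizes, output size, as defined in the model classes above.
architectures = {
    "Model_1": [22, 6, 2],
    "Model_2": [22, 6, 3, 2],
    "Model_3": [22, 12, 6, 3, 2],
}

counts = {name: count_parameters(sizes) for name, sizes in architectures.items()}
print(counts)   # → {'Model_1': 152, 'Model_2': 167, 'Model_3': 383}
```

By this measure Model 2 is only slightly larger than Model 1, while Model 3 has roughly two and a half times as many parameters.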

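This function calculates the error rate: the fraction of records whose predicted class (the index of the largest output score) differs from the actual class. Labels are `ord`-encoded, so 101 ('e', edible) is treated as class 0 and any other value as class 1.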

In [15]:
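def get_error(predicted, actual):
    # Convert the tensors once instead of on every loop iteration.
    predicted_rows = predicted.tolist()
    actual_rows = actual.tolist()
    errors = 0
    for scores, label in zip(predicted_rows, actual_rows):
        pred = scores.index(max(scores))     # index of the highest output score
        act = 0 if label[0] == 101 else 1    # 101 == ord('e'), i.e. edible
        if pred != act:
            errors += 1
    return errors / len(actual_rows)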
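The constant 101 in `get_error` is the character code of 'e'. The following pure-Python sketch spells out that mapping and repeats the comparison on plain lists. The scores and labels are made up, and `to_class_index` and `predicted_class` are hypothetical helper names.

```python
EDIBLE, POISONOUS = ord("e"), ord("p")
print(EDIBLE, POISONOUS)   # → 101 112

def to_class_index(label_code: int) -> int:
    # Mirrors the branch in get_error: 101 -> class 0, anything else -> class 1.
    return 0 if label_code == EDIBLE else 1

def predicted_class(scores):
    # Index of the largest score (the first one if there is a tie).
    return scores.index(max(scores))

scores = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.1]]   # made-up network outputs
labels = [EDIBLE, POISONOUS, POISONOUS]

errors = sum(predicted_class(s) != to_class_index(l) for s, l in zip(scores, labels))
print(f"{errors} of {len(labels)} wrong")        # → 1 of 3 wrong
```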

<a id="cross_validate"></a>
## cross_validate

This function evaluates each model on each fold of data. The models are trained on the training set and evaluated on the validation set. Backpropagation is used in the training process. The error rate is the fraction of incorrect predictions. After all models have been cross-validated, the held-out test fold is used once to report the chosen model's error.

In [16]:
def cross_validate(folds, models):
    for number, model in enumerate(models):
        train_error, test_error  = [], []
        error_list_train, error_list_test = [], []
        for fold_index in range(len(folds)):
            training_data, test_data = create_train_test(folds, fold_index)
            train_X = copy.deepcopy(training_data)
            train_y = copy.deepcopy(training_data)
            train_X = [data[:-1] for data in train_X]
            train_y = [[data[len(data) - 1]] for data in train_y]
            
            tr_X = torch.FloatTensor(train_X)
            tr_y = torch.LongTensor(train_y)
            
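            criterion = nn.CrossEntropyLoss()
            optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
            
            # One forward/backward pass and a single optimizer step per fold.
            train_y_hat = model.forward(tr_X)
            # tr_y has shape (N, 1), so torch.max(tr_y, 1)[1] is the argmax over a
            # single column and evaluates to 0 for every row.
            loss = criterion(train_y_hat, torch.max(tr_y, 1)[1])         
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()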
            
            test_X = copy.deepcopy(test_data)
            test_y = copy.deepcopy(test_data)
            test_X = [data[:-1] for data in test_X]
            test_y = [[data[len(data) - 1]] for data in test_y]
            
            te_X = torch.FloatTensor(test_X)
            te_y = torch.LongTensor(test_y)
            
            with torch.no_grad():
                test_y_hat = model.forward(te_X) 
            
            error_rate_train = get_error(train_y_hat, tr_y)
            error_rate_test = get_error(test_y_hat, te_y)
            error_list_train.append(error_rate_train)
            error_list_test.append(error_rate_test)
            print(f"Fold: {fold_index}\tTrain Error: {error_rate_train*100:.2f}%\tValidation Error: {error_rate_test*100:.2f}%")
        print(f"***")
        print("Model Number: ", number + 1)
        print(f"\nMean(Std. Dev.) over all folds:\n-------------------------------")
        print(f"Train Error: {get_stats(error_list_train)[0]*100:.2f}%({get_stats(error_list_train)[1]*100:.2f}%) Validation Error: {get_stats(error_list_test)[0]*100:.2f}%({get_stats(error_list_test)[1]*100:.2f}%)")
        print("\n")
        
    print("Testing the best model, model 2:")
    test_X = copy.deepcopy(test_fold)
    test_y = copy.deepcopy(test_fold)
    test_X = [data[:-1] for data in test_X]
    test_y = [[data[len(data) - 1]] for data in test_y]
            
    te_X = torch.FloatTensor(test_X)
    te_y = torch.LongTensor(test_y)
            
    with torch.no_grad():
        test_y_hat = m_one.forward(te_X) 
        
    error_rate_test = get_error(test_y_hat, te_y)
    print(f"Error: {error_rate_test*100:.2f}%")

In [17]:
cross_validate(folds_mushroom, models)

Fold: 0	Train Error: 48.70%	Validation Error: 46.96%
Fold: 1	Train Error: 48.31%	Validation Error: 50.47%
Fold: 2	Train Error: 48.77%	Validation Error: 46.28%
Fold: 3	Train Error: 48.32%	Validation Error: 50.34%
Fold: 4	Train Error: 48.77%	Validation Error: 46.28%
Fold: 5	Train Error: 48.35%	Validation Error: 50.07%
Fold: 6	Train Error: 48.57%	Validation Error: 48.10%
Fold: 7	Train Error: 48.42%	Validation Error: 49.46%
Fold: 8	Train Error: 48.27%	Validation Error: 50.81%
Fold: 9	Train Error: 48.75%	Validation Error: 46.48%
***
Model Number:  1

Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 48.52%(0.20%) Validation Error: 48.52%(1.80%)


Fold: 0	Train Error: 48.73%	Validation Error: 46.96%
Fold: 1	Train Error: 48.31%	Validation Error: 50.47%
Fold: 2	Train Error: 48.77%	Validation Error: 46.28%
Fold: 3	Train Error: 48.32%	Validation Error: 50.34%
Fold: 4	Train Error: 48.77%	Validation Error: 46.28%
Fold: 5	Train Error: 48.35%	Validation Error: 50.07%
Fold:

Evaluation and Exploration:

First, I created 3 models that use the rectified linear unit (ReLU) as the activation in a neural network. I decided to add another layer to each successive model, and to have the number of neurons in each layer halve from start to finish, ending in 3. I did some research online and found that decreasing the number of neurons after each layer is a common practice, as it narrows the network towards the output. Thus, both the total number of layers and the total number of neurons varied. I read that the total number of neurons should be close to the number of attributes used for classification, so I predicted that the third model would work the best.

Next, the data was split into folds, and the folds were split into training and validation sets. For each fold, each model was trained and then evaluated on the validation set. Originally, I also used backpropagation during the training, but this caused overfitting, although the effect was small.

Overall, the models did not perform very differently, or very well! All of the models appear to underfit. They did not fit the training data well, guessing right only about half the time, so they did not learn the patterns in the data and also performed poorly on the validation data.

The first model performed the best by a small margin. The extra layers and neurons did not help. Too many neurons in a layer can lead to overfitting, which did not happen here. The extra layers may not have been necessary given the number of classes and outputs. Underfitting is usually associated with high bias and low variance.

How could this be fixed? Perhaps more training time would help. Data size does not seem to be the issue. One candidate is the choice of activation: ReLU is piecewise linear, and a different non-linearity might suit this type of data better. The tree approach worked very well on this data in the last assignment. As trained here, the ReLU networks appear too simplistic to capture the underlying relationships in the data.
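As a hedged illustration of the "more training time" idea, the sketch below trains a small ReLU network for many epochs on synthetic stand-in data, not the mushroom set. It also maps the `ord`-encoded labels 101 and 112 to class indices 0 and 1 before computing the loss. This is worth checking in the notebook, because `torch.max(labels, 1)[1]` returns 0 for every row when the label tensor has shape (N, 1). The helper names `encode_labels` and `train`, the epoch count, and the synthetic data are assumptions made for illustration and are not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the sketch repeatable

def encode_labels(raw: torch.Tensor) -> torch.Tensor:
    """Map ord-encoded labels of shape (N, 1) to class indices of shape (N,)."""
    return (raw.view(-1) != 101).long()   # 101 == ord('e') -> 0, otherwise 1

def train(model: nn.Module, X: torch.Tensor, y: torch.Tensor,
          epochs: int = 200, lr: float = 0.01) -> list:
    """Run `epochs` full-batch Adam steps and return the loss seen at each step."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Synthetic stand-in data: 200 rows, 22 features, label decided by the sign of feature 0.
X = torch.randn(200, 22)
raw_labels = torch.full((200, 1), 101, dtype=torch.long)
raw_labels[X[:, :1] > 0] = 112

# Argmax over a single column: every entry is index 0, regardless of the label value.
always_zero = torch.max(raw_labels, 1)[1]
print(always_zero.unique().tolist())      # → [0]

y = encode_labels(raw_labels)
model = nn.Sequential(nn.Linear(22, 6), nn.ReLU(), nn.Linear(6, 2))
losses = train(model, X, y)
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

In a cross-validation setting it would also be reasonable to create a fresh model for each fold, so that no fold starts from weights already fitted to data that is later used for validation.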

Nevertheless, the first model worked relatively better. I chose it because it had the lowest average validation error rate. Still, this error rate was about 48% on average! There is a typo in the output: it should say "Testing the best model, model 1:". Surprisingly, its performance on the fold saved for the final test was better than on any validation fold, coming in at 44%. The fact that there are only 2 categories and the model is right only about half the time shows the poor fit.

In conclusion, varying the number of layers and the number of neurons does not always make a large impact, especially when the model itself is not adequate for the data. After testing these three models, in addition to many other layer and neuron combinations along the way, I have seen that the key to better fitting can be to find a model that better suits the type of data at hand.
