This assignment uses the [_LED Display Domain Data Set_ from the _UCI Machine Learning Repository_](https://archive.ics.uci.edu/ml/datasets/LED+Display+Domain).

Dataset description:
> This simple domain contains 7 Boolean attributes and 10 concepts, the set of decimal digits. Recall that LED displays contain 7 light-emitting diodes -- hence the reason for 7 attributes. The problem would be easy if not for the introduction of noise. In this case, each attribute value has the 10% probability of having its value inverted. 

The authors provide two C programs for generating the actual data.
I chose _led-creator.c_, which accepts the arguments _numtrain seed outputfile noise_,
where, as stated in the C code:
 - numtrain is the number of training instances requested
 - seed is an integer seed for the random number generator
 - outputfile: output file name for the generated instances
 - noise is the percent probability of noise per attribute (usually set to 10%...and reported by the program)
 
I used arguments _4000 11 nums.csv 10_ and the program reported _Percent Noise: Requested 10, Actual 9.939285_
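
The noise model described above can be sketched in a few lines of NumPy. This is only an illustration of the idea (each LED is flipped independently with the given probability), not a port of _led-creator.c_: the names `CLEAN_PATTERNS` and `make_noisy_leds` are hypothetical, the segment ordering is an assumption inferred from the sample rows shown later in this notebook, digits are assumed to be drawn uniformly, and NumPy's generator will not reproduce the C program's output for the same seed.

```python
import numpy as np

# Clean segment patterns for the digits 0-9, one character per LED.
# ASSUMPTION: this ordering is inferred from the notebook's sample rows,
# not copied from led-creator.c.
CLEAN_PATTERNS = np.array([[int(c) for c in s] for s in (
    "1110111", "0010010", "1011101", "1011011", "0111010",
    "1101011", "1101111", "1010010", "1111111", "1111011")])

def make_noisy_leds(n, noise=0.10, seed=11):
    """Return (X, y, flips): noisy readings, true digits, and the flip mask."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 10, size=n)        # ASSUMPTION: digits drawn uniformly
    flips = rng.random((n, 7)) < noise     # each LED flips independently
    X = np.where(flips, 1 - CLEAN_PATTERNS[y], CLEAN_PATTERNS[y])
    return X, y, flips

X_demo, y_demo, flips = make_noisy_leds(4000)
print(f"Percent noise: requested 10, actual {100 * flips.mean():.6f}")
```

The realised noise level differs slightly from the requested one for the same reason the C program reports an "Actual" value: the flips are random draws, so their observed frequency only approximates 10%.
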
<br><br>
Code inspiration and sources: <br>
Most code was taken from assignments or class materials. <br>
The Bayes implementation example was taken from the [_sklearn documentation_](https://scikit-learn.org/stable/modules/naive_bayes.html)



The first cell declares all imports.

In [24]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

%matplotlib inline

The first thing to do is load the dataset.
Because the file has no header row, the column names are passed to the _read_csv_ function through its _names_ parameter.

In [4]:
df = pd.read_csv('nums.csv', names=['led_1', 'led_2', 'led_3', 'led_4', 'led_5', 'led_6', 'led_7', 'label'])
df.head()

   led_1  led_2  led_3  led_4  led_5  led_6  led_7  label
0      0      0      1      0      1      1      0      1
1      1      1      0      1      0      1      1      9
2      0      0      1      0      0      0      0      1
3      1      1      0      1      1      1      1      6
4      1      1      1      0      1      0      1      0


Let's check whether the columns were parsed as numbers or need casting or other preprocessing.

In [5]:
df.dtypes

led_1    int64
led_2    int64
led_3    int64
led_4    int64
led_5    int64
led_6    int64
led_7    int64
label    int64
dtype: object

Next, split the data into features X and labels y, and then into training (70%) and test (30%) sets.

In [6]:
cols = df.columns
cols = cols[cols != 'label']

X = df[cols].values
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, 
    test_size=0.30, shuffle=True, random_state=11)

print(X_train.shape, X_test.shape)

(2800, 7) (1200, 7)


As this is a classification problem, the first model I decided to try was k-nearest neighbors with several values of k.

In [7]:
def distance(x1, x2):
    return np.linalg.norm(x1-x2)

class k_nearest_neighbor:
    def __init__(self):
        pass
    def fit(self, X, y):
        self.X = X
        self.y = y
    def get_nearest(self, x, n):
        distances = [distance(x, self.X[i]) 
                     for i in range(len(self.X))]
        return np.array(distances).argsort()[:n]
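    def predict(self, x, n) :
        votes = self.y[self.get_nearest(x, n)]
        # Majority vote. np.atleast_1d covers SciPy versions in which
        # stats.mode returns a scalar rather than a length-1 array.
        democracy = stats.mode(votes)
        return np.atleast_1d(democracy.mode)[0]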

In [8]:
def run_k_nearest_neighbors(k, X_train, y_train, X_test, y_test) :
    knn = k_nearest_neighbor()
    knn.fit(X_train, y_train)
    y_pred = np.array([knn.predict(X_test[i],k) for i in range(len(X_test))])
    acc = np.sum(y_pred == y_test)/len(y_test)
    # print(f'The accuracy is {acc:.3f}')
    return acc

In [9]:
for i in range(1,8) :
    acc = run_k_nearest_neighbors(i, X_train, y_train, X_test, y_test)
    print(f'For k equals {i} the accuracy is {acc:.4f}')

For k equals 1 the accuracy is 0.6900
For k equals 2 the accuracy is 0.6900
For k equals 3 the accuracy is 0.7283
For k equals 4 the accuracy is 0.7142
For k equals 5 the accuracy is 0.7225
For k equals 6 the accuracy is 0.7250
For k equals 7 the accuracy is 0.7183
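
Much of the running time of the loop above probably comes from computing distances one training point at a time in Python. A minimal sketch of the same classifier with all distances computed at once in NumPy is shown here. The function name `knn_predict` is hypothetical, and the sketch assumes integer class labels starting at 0. Ties between equally distant neighbors are broken by training-set order, which may not match the class above, so accuracies on this heavily tied binary data could differ slightly.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict a label for every row of X_test by majority vote of k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Squared Euclidean distances, shape (n_test, n_train), using
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((X_test ** 2).sum(axis=1)[:, None]
          + (X_train ** 2).sum(axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    # Indices of the k closest training points for each test point
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    votes = y_train[nearest]
    # Count votes per class; argmax picks the smallest label on a tie
    n_classes = int(y_train.max()) + 1
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=n_classes)
    return counts.argmax(axis=1)

# Tiny stand-in example, not the LED data
X_tr = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [1, 1]])
y_tr = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_tr, y_tr, np.array([[0, 0], [1, 1]]), k=3))
```

With the notebook's variables, the accuracy for a given k would be `np.mean(knn_predict(X_train, y_train, X_test, k) == y_test)`.
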


Plain kNN took a long time to run. What if we tried nearest neighbor with class exemplars instead? With only one reference point per class, prediction should be much faster.

In [10]:
class nearest_neighbor:
    def __init__(self):
        pass
    def fit(self, X, y):
        self.X = X
        self.y = y
    def get_nearest(self, x):
        distances = [distance(x, self.X[i]) 
                     for i in range(len(self.X))]
        return np.argmin(distances)
    def predict(self, x) :
        return self.y[self.get_nearest(x)]

In [11]:
def run_nearest_neighbors(X_train, y_train, X_test, y_test) :
    nn = nearest_neighbor()
    nn.fit(X_train, y_train)
    y_pred = np.array([nn.predict(X_test[i]) for i in range(len(X_test))])
    acc = np.sum(y_pred == y_test)/len(y_test)
    print(f'The accuracy is {acc:.3f}')
    return acc

In [12]:
def class_examplars(X_train, y_train) :
    the_exemplars = []
    # np.unique returns the labels in sorted order, so exemplar i belongs to class i
    for classs in np.unique(y_train):
        x_of_class = X_train[y_train==classs]
        ptdsts = []

        # total Euclidean distance from point i to every point of the same class
        for i in range(len(x_of_class)) :
            alldst = np.linalg.norm(x_of_class - x_of_class[i], axis=1)
            ptdsts.append(sum(alldst))

        # the exemplar is the point with the smallest total distance
        the_exemplars.append(x_of_class[np.argmin(ptdsts)])

    return the_exemplars

In [13]:
X_exemplars = class_examplars(X_train, y_train)
y_exemplars = [i for i in range(10)]

run_nearest_neighbors(X_exemplars, y_exemplars, X_test, y_test)

The accuracy is 0.733


0.7333333333333333

Prediction was definitely faster, and accuracy stayed about the same (0.733, against 0.69 to 0.73 for kNN).
What about class means?

In [14]:
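def class_means(X_train, y_train) :
    the_means = []
    # np.unique returns the labels in sorted order, so mean i belongs to class i
    for classs in np.unique(y_train):
        x_of_class = X_train[y_train==classs]
        # per-LED average over all training points of this class
        the_means.append(x_of_class.mean(axis=0))

    return the_means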

In [15]:
X_means = class_means(X_train, y_train)
y_means = [i for i in range(10)]

run_nearest_neighbors(X_means, y_means, X_test, y_test)

The accuracy is 0.716


0.7158333333333333

As seen above, accuracy actually decreased a bit using class means.

Let's try another model: decision trees.
Here I ran a decision tree model with different maximum depths.

In [16]:

for i in range(1,15) :
    tree = DecisionTreeClassifier(max_depth=i, random_state=11)
    tree.fit(X, y)
    print(f'For max depth {i} the accuracy is {tree.score(X,y):.4f}')


For max depth 1 the accuracy is 0.1993
For max depth 2 the accuracy is 0.3475
For max depth 3 the accuracy is 0.5360
For max depth 4 the accuracy is 0.7080
For max depth 5 the accuracy is 0.7248
For max depth 6 the accuracy is 0.7460
For max depth 7 the accuracy is 0.7538
For max depth 8 the accuracy is 0.7538
For max depth 9 the accuracy is 0.7538
For max depth 10 the accuracy is 0.7538
For max depth 11 the accuracy is 0.7538
For max depth 12 the accuracy is 0.7538
For max depth 13 the accuracy is 0.7538
For max depth 14 the accuracy is 0.7538
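
The loop above fits and scores each tree on the full dataset (_X_, _y_), so the numbers it prints are training accuracies. A sketch of the same sweep with a held-out score next to the training score could look like the following. The helper name `depth_sweep` is hypothetical, and the data generated inside the block is a random stand-in so that the example runs on its own; it is not the LED data, and its output says nothing about the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def depth_sweep(X_train, y_train, X_test, y_test, depths, seed=11):
    """Return (depth, train accuracy, test accuracy) for each max_depth."""
    rows = []
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=seed)
        tree.fit(X_train, y_train)               # fit on training data only
        rows.append((depth,
                     tree.score(X_train, y_train),
                     tree.score(X_test, y_test)))  # score on unseen data
    return rows

# Stand-in data: 7 random binary features and a noisy 3-class label
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(600, 7))
y_demo = (X_demo.sum(axis=1) + (rng.random(600) < 0.2)) % 3

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.30, random_state=11)
for depth, train_acc, test_acc in depth_sweep(Xa, ya, Xb, yb, range(1, 8)):
    print(f"max depth {depth}: train {train_acc:.4f}, test {test_acc:.4f}")
```

With the notebook's own split, the call would be `depth_sweep(X_train, y_train, X_test, y_test, range(1, 15))`, which gives test accuracies that can be compared directly with the other models.
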


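As seen above, the decision tree reaches about 75% accuracy. Note, however, that the loop above calls _tree.fit(X, y)_ and _tree.score(X, y)_ on the full dataset, so this is training accuracy rather than accuracy on the held-out test set, and it is not directly comparable to the test accuracies of the other models. The dataset authors state that
>It's valuable to know the optimal Bayes rate for these databases. In this case, the misclassification rate is 26% (74% classification accuracy).

The accuracy I got here is a little higher than that, which is plausible for a model scored on the same noisy data it was fitted to.
Also, the accuracy plateaus at a max depth of 7. That matches the 7 Boolean attributes: once a path has split on every attribute, no further splits are possible.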
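
The 74% figure quoted by the authors can be checked with a short calculation. For a known noise model, the best possible accuracy is the sum over all 128 possible LED readings x of max over digits d of P(d) P(x | d), where P(x | d) depends only on how many LEDs of x differ from the clean pattern of d. The sketch below evaluates that sum under stated assumptions: the clean patterns are inferred from the sample rows in this notebook rather than taken from _led-creator.c_, all ten digits are assumed equally likely, and the names `PATTERNS` and `bayes_accuracy` are hypothetical.

```python
from itertools import product
import numpy as np

# ASSUMPTION: clean patterns inferred from the notebook's sample rows
PATTERNS = np.array([[int(c) for c in s] for s in (
    "1110111", "0010010", "1011101", "1011011", "0111010",
    "1101011", "1101111", "1010010", "1111111", "1111011")])

def bayes_accuracy(patterns, noise=0.10):
    """Best achievable accuracy with uniform class priors and independent flips."""
    n_leds = patterns.shape[1]
    total = 0.0
    for x in product((0, 1), repeat=n_leds):
        # number of LEDs in which x differs from each digit's clean pattern
        hamming = (patterns != np.array(x)).sum(axis=1)
        # P(x | digit) for every digit
        likelihood = noise ** hamming * (1 - noise) ** (n_leds - hamming)
        # the optimal classifier picks the most likely digit for this reading
        total += likelihood.max()
    return total / len(patterns)   # uniform prior 1/10 per digit

print(f"Optimal accuracy under these assumptions: {bayes_accuracy(PATTERNS):.4f}")
```

The printed value can be compared with the 74% the authors report. Any accuracy above it on a finite sample reflects sampling variation or evaluation on training data, not a model that beats the noise.
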

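Let's also try a naive Bayes model, since the authors mention the Bayes rate. The two are related mainly by name: the optimal Bayes rate is the best accuracy any classifier can achieve under this noise model, while naive Bayes is one particular classifier.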

In [17]:
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
acc = np.sum(y_pred == y_test)/len(y_test)
print(f'The accuracy is {acc}')

The accuracy is 0.7133333333333334
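
_GaussianNB_ models each feature as normally distributed within a class, whereas the LED attributes only take the values 0 and 1. scikit-learn also provides _BernoulliNB_, whose per-feature Bernoulli likelihood is a closer match for binary data. A minimal sketch follows; the helper name `bernoulli_nb_accuracy` is hypothetical, the data in the block is a tiny stand-in, and whether this variant changes the accuracy on the LED data has not been checked here.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def bernoulli_nb_accuracy(X_train, y_train, X_test, y_test):
    """Fit a Bernoulli naive Bayes model and return its accuracy on the test data."""
    model = BernoulliNB()   # default Laplace smoothing (alpha=1.0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Tiny stand-in example with two clearly separated binary patterns
X_toy = np.array([[1, 1, 1, 0, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1, 1]] * 20)
y_toy = np.array([0] * 20 + [1] * 20)
print(bernoulli_nb_accuracy(X_toy, y_toy, X_toy, y_toy))
```

With the notebook's split, the call would be `bernoulli_nb_accuracy(X_train, y_train, X_test, y_test)`.
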


At 71.3%, this is no improvement over kNN (69% to 73% depending on k), and it is lower than the 75% reported for the decision tree.

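In conclusion, the decision tree of depth 7 gave the highest reported accuracy on this dataset, 75%. That figure was measured on the same data the tree was fitted to, so it likely overstates how well the tree predicts future values. Among the models scored on the held-out test set, nearest neighbor with class exemplars did best at 73.3%.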