In [None]:
# Get the datasets with Linux commands
!wget http://huang.eng.unt.edu/CSCE-5218/train.dat
!wget http://huang.eng.unt.edu/CSCE-5218/test.dat

In [4]:
# Get the datasets with Windows commands
!curl.exe --output train.dat http://huang.eng.unt.edu/CSCE-5218/train.dat
!curl.exe --output test.dat http://huang.eng.unt.edu/CSCE-5218/test.dat



# **CSCE 5218 / CSCE 4930 Deep Learning**

# **HW1a The Perceptron** (20 pt)


In [10]:
import math
import itertools
import re


# Corpus reader, all columns but the last one are coordinates;
#   the last column is the label
def read_data(file_name):
    data = []
    # Use a context manager so the file is closed after reading
    with open(file_name, 'r') as f:
        # Discard header line
        f.readline()
        for instance in f:
            # Skip lines without tab-separated fields (e.g. blank lines)
            if '\t' not in instance:
                continue
            instance = list(map(int, instance.strip().split('\t')))
            # Add a dummy input so that w0 becomes the bias
            instance = [-1] + instance
            data.append(instance)
    return data

def dot_product(array1, array2):
    # The dot product multiplies the items that share the same index
    # and then adds up the results. It is useful for expressing an equation in vectorized form.
    # zip stops at the shorter list, so the label at the end of an instance is ignored.
    # You do not need to write code like this, but get used to it
    return sum(w * x for w, x in zip(array1, array2))


def sigmoid(x):
    # The sigmoid function is the activation function that squashes the
    # output of each neuron into the range (0, 1), which is convenient for
    # interpreting it as a probability. It is also continuous and differentiable.
    return 1 / (1 + math.exp(-x))


# Accuracy = percent of correct predictions
def get_accuracy(weights, instances):
    # You do not need to write code like this, but get used to it
    correct = sum([1 if predict(weights, instance) == instance[-1] else 0
                   for instance in instances])
    return correct * 100 / len(instances)


# Predict a new instance; this is the definition of the perceptron
def predict(weights, instance):
    if sigmoid(dot_product(weights, instance)) >= 0.5:
        return 1
    return 0


# Train a perceptron with instances
#   and hyperparameters lr (learning rate) and epochs
# The implementation comes from the definition of the perceptron
# Training consists of fitting the parameters
#   The parameters are the weights; they are the only thing training is responsible for fitting
#     (recall that w0 is the bias, and w1..wn are the weights for each coordinate)
#   Hyperparameters (lr and epochs) are given to the training algorithm
# We are updating weights in the opposite direction of the gradient of the error,
#   so with a "decent" (small enough) lr each update should reduce the error on the current instance.
def train_perceptron(instances, lr, epochs):
    '''
    Inputs:
        instances: training instances, each of the form [-1] + features + [label]
        lr: the learning rate
        epochs: number of passes over the training instances
    '''

    # initialize weights (one per input, including the dummy bias input)
    weights = [0] * (len(instances[0])-1)
    # weights = [0, 0, 0, ...,  0]

    # iterations
    for _ in range(epochs):
        # for every sample in the training dataset (instances)
        for instance in instances:
            in_value = dot_product(weights, instance)  # weighted sum of the inputs
            output = sigmoid(in_value)  # apply the activation function
            error = instance[-1] - output  # difference between the label and the output
            for i in range(len(weights)):  # use the error to update each weight
                # weight change = lr * error * sigmoid'(in_value) * input_i,
                #   where sigmoid'(in_value) = output * (1 - output)
                weights[i] += lr * error * output * (1-output) * instance[i]

    return weights
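
The weight update in `train_perceptron` can be read as one step of stochastic gradient descent. The notebook does not name the cost function, so the squared-error form below is an assumption, chosen because its gradient reproduces the update in the code. Here $y$ stands for the label `instance[-1]`, $o$ for `output`, $x_i$ for `instance[i]`, and $\eta$ for `lr`.

$$
E = \tfrac{1}{2}(y - o)^2, \qquad o = \sigma(z), \qquad z = \sum_i w_i x_i
$$

$$
\frac{\partial E}{\partial w_i} = -(y - o)\,\sigma'(z)\,x_i = -(y - o)\,o\,(1 - o)\,x_i
$$

$$
w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i} = w_i + \eta\,(y - o)\,o\,(1 - o)\,x_i
$$

The last line matches `weights[i] += lr * error * output * (1-output) * instance[i]` term by term.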

In [11]:
instances_tr = read_data("train.dat")
instances_te = read_data("test.dat")
lr = 0.005
epochs = 5
weights = train_perceptron(instances_tr, lr, epochs)
accuracy = get_accuracy(weights, instances_te)
print(f"#tr: {len(instances_tr):3}, epochs: {epochs:3}, learning rate: {lr:.3f}; "
      f"Accuracy (test, {len(instances_te)} instances): {accuracy:.1f}")

#tr: 400, epochs:   5, learning rate: 0.005; Accuracy (test, 100 instances): 68.0


## Questions

Answer the following questions. Include your implementation and the output for each question.



### Question 1

In `train_perceptron(instances, lr, epochs)`, we have the following code:
```
in_value = dot_product(weights, instance)
output = sigmoid(in_value)
error = instance[-1] - output
```

Why don't we have the following code snippet instead?
```
output = predict(weights, instance)
error = instance[-1] - output
```

#### TODO Add your answer here (text only)
Because `predict` applies a threshold and returns either 0 or 1. With a thresholded output the error can only be -1, 0, or 1, so it jumps abruptly as the weights change instead of varying smoothly, which makes convergence harder. In addition, the update rule multiplies the error by `output * (1 - output)`, which is exactly 0 when `output` is 0 or 1, so the weights would never change. That's why we keep the raw output of the activation function (a value between 0 and 1): it lets the weights move smoothly toward the values that minimize the cost function.
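
As a small illustration of the answer above, the sketch below compares the size of one weight update when the continuous sigmoid output is used versus the thresholded output. The single instance (`z`, `x_i`, `label`) and the helper `update_term` are made up for this example; only the form of the update is taken from `train_perceptron`.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def update_term(error, output, x_i, lr=0.05):
    # Same form as the update in train_perceptron:
    #   lr * error * output * (1 - output) * x_i
    return lr * error * output * (1 - output) * x_i

# Hypothetical single instance: weighted sum z, one input value x_i, label 1
z, x_i, label = -0.4, 2.0, 1

# Continuous output, as in the original training loop
soft = sigmoid(z)
soft_step = update_term(label - soft, soft, x_i)

# Thresholded output, as predict() would return it
hard = 1 if soft >= 0.5 else 0
hard_step = update_term(label - hard, hard, x_i)

print(f"sigmoid output:     {soft:.3f} -> weight change {soft_step:.5f}")
print(f"thresholded output: {hard} -> weight change {hard_step:.5f}")
```

The instance is misclassified in both cases, but with the thresholded output the factor `output * (1 - output)` is zero, so the weight does not move at all.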

### Question 2
Train the perceptron with the following hyperparameters and calculate the accuracy with the test dataset.

```
tr_percent = [5, 10, 25, 50, 75, 100] # percent of the training dataset to train with
num_epochs = [5, 10, 20, 50, 100]              # number of epochs
lr = [0.005, 0.01, 0.05]              # learning rate
```

TODO Write your code below and include the output of your code.
The output should look like the following:
```
# tr:  20, epochs:   5, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
# tr:  20, epochs:  10, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
# tr:  20, epochs:  20, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
[and so on for all the combinations]
```
You will get different results with different hyperparameters.

#### TODO Add your answer here (code and output in the format above) 

In [24]:
# with random sampling
import random
tr_percent = [5, 10, 25, 50, 75, 100] # percent of the training dataset to train with
num_epochs = [5, 10, 20, 50, 100]              # number of epochs
lr = [0.005, 0.01, 0.05]              # learning rate

for tr_ in tr_percent: 
    # sample without replacement
    random.seed(tr_)
    trainingBatch = random.sample(instances_tr, k = int(tr_/100*len(instances_tr)))
    for lr_ in lr:
        for epochs_ in num_epochs:
            weights = train_perceptron(trainingBatch, lr_, epochs_)
            accuracy = get_accuracy(weights, instances_te)
            print(f"#tr: {len(trainingBatch):3}, epochs: {epochs_:3}, learning rate: {lr_:.3f}; "
                  f"Accuracy (test, {len(instances_te)} instances): {accuracy:.1f}")

#tr:  20, epochs:   5, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  20, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  50, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs: 100, learning rate: 0.005; Accuracy (test, 100 instances): 69.0
#tr:  20, epochs:   5, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  20, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  50, learning rate: 0.010; Accuracy (test, 100 instances): 69.0
#tr:  20, epochs: 100, learning rate: 0.010; Accuracy (test, 100 instances): 73.0
#tr:  20, epochs:   5, learning rate: 0.050; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.050; Accuracy (test, 100 instances): 69.0
#tr:  20, epochs

In [26]:
# without random sampling

tr_percent = [5, 10, 25, 50, 75, 100] # percent of the training dataset to train with
num_epochs = [5, 10, 20, 50, 100]              # number of epochs
lr = [0.005, 0.01, 0.05]              # learning rate

for tr_ in tr_percent: 
    trainingBatch = instances_tr[0:int(tr_/100*len(instances_tr))]
    for lr_ in lr:
        for epochs_ in num_epochs:
            weights = train_perceptron(trainingBatch, lr_, epochs_)
            accuracy = get_accuracy(weights, instances_te)
            print(f"#tr: {len(trainingBatch):3}, epochs: {epochs_:3}, learning rate: {lr_:.3f}; "
                  f"Accuracy (test, {len(instances_te)} instances): {accuracy:.1f}")

#tr:  20, epochs:   5, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  20, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  50, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs: 100, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:   5, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  20, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  50, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs: 100, learning rate: 0.010; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:   5, learning rate: 0.050; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs:  10, learning rate: 0.050; Accuracy (test, 100 instances): 68.0
#tr:  20, epochs

### Question 3
Write a couple paragraphs interpreting the results with all the combinations of hyperparameters. Drawing a plot will probably help you make a point. In particular, answer the following:
1. Do you need to train with all the training dataset to get the highest accuracy with the test dataset?
2. How do you explain that the second run obtains worse accuracy than the first one (even though the second one uses more training data)?
```
#tr: 100, epochs:  20, learning rate: 0.050; Accuracy (test, 100 instances): 71.0
#tr: 200, epochs:  20, learning rate: 0.005; Accuracy (test, 100 instances): 68.0
```
3. Can you get higher accuracy with additional hyperparameters (higher than `80.0`)?
4. Is it always worth training for more epochs (while keeping all other hyperparameters fixed)?

#### TODO Add your answer here (code and text)
1. No. As can be seen in the run
```
#tr: 300, epochs: 100, learning rate: 0.010; Accuracy (test, 100 instances): 80.0
```
    the whole training dataset was not needed to reach an accuracy of 80.0, so the performance also depends on the values of the hyperparameters (learning rate and epochs). It is also possible that a smaller training set is representative enough of the data for the model to generalize well to unseen data.
2. The learning rate in the second run is probably so small that the model barely moves along the cost function curve: the steps are small, so more epochs are needed to reach the minimum. The learning rate in the first run is ten times higher, which lets the model take bigger steps and get closer to the minimum within the same 20 epochs.
3. The accuracy of 80.0 may be related to the shape of the loss function (where its minimum is located), so trying a different loss function may increase the accuracy. Adding a regularization term to the update equation may also help the model generalize better to unseen data.
4. No, it is not. Looking at the last 3 outcomes, we can see that the model has reached a steady state, so more than 20 epochs were not required in this case. To search for a better result, one option is to take big steps at the beginning (large learning rate) and then smaller steps, so that the model does not keep jumping over a possibly better minimum of the cost function.
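
Answers 3 and 4 mention a regularization term and a learning rate that starts large and then shrinks. The following is a minimal sketch of both ideas, not a tuned implementation: the function name `train_perceptron_variant`, the parameters `decay` and `l2`, the schedule `lr / (1 + decay * t)`, and the toy dataset are all assumptions made for illustration. With `decay=0` and `l2=0` it reduces to the update rule used in `train_perceptron`.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dot_product(a, b):
    return sum(w * x for w, x in zip(a, b))

def train_perceptron_variant(instances, lr, epochs, decay=0.0, l2=0.0):
    """Perceptron training with two optional, hypothetical extras.

    decay: the learning rate used in epoch t is lr / (1 + decay * t)
    l2:    regularization strength; each update also subtracts lr_t * l2 * w_i
           (applied to every weight, including the bias weight w0)
    """
    weights = [0.0] * (len(instances[0]) - 1)
    for t in range(epochs):
        lr_t = lr / (1 + decay * t)  # larger steps early, smaller steps later
        for instance in instances:
            output = sigmoid(dot_product(weights, instance))
            error = instance[-1] - output
            for i in range(len(weights)):
                grad_step = error * output * (1 - output) * instance[i]
                weights[i] += lr_t * (grad_step - l2 * weights[i])
    return weights

# Tiny made-up dataset in the notebook's layout: [-1 (bias input), x1, x2, label]
toy = [[-1, 0, 0, 0], [-1, 0, 1, 0], [-1, 1, 0, 1], [-1, 1, 1, 1]]
plain = train_perceptron_variant(toy, lr=0.5, epochs=50)
tuned = train_perceptron_variant(toy, lr=0.5, epochs=50, decay=0.05, l2=0.01)
print(plain)
print(tuned)
```

Whether either option actually improves the test accuracy on `test.dat` would have to be checked by rerunning the grid from Question 2 with the extra hyperparameters.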



## Pending: write code to plot outcomes
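
One possible starting point for the plot is sketched below. It assumes the results of the Question 2 loops are collected as `(n_train, epochs, lr, accuracy)` tuples instead of only being printed; the helper names `accuracy_series` and `plot_accuracy` are hypothetical, and `matplotlib` is an assumed dependency that the notebook does not import anywhere else.

```python
from collections import defaultdict

def accuracy_series(results, lr_value):
    """Group rows (n_train, epochs, lr, accuracy) by training-set size for one learning rate.

    Returns {n_train: [(epochs, accuracy), ...]} with each list sorted by epochs.
    """
    series = defaultdict(list)
    for n_train, epochs, lr, accuracy in results:
        if lr == lr_value:
            series[n_train].append((epochs, accuracy))
    return {n: sorted(points) for n, points in sorted(series.items())}

def plot_accuracy(results, lr_value):
    # matplotlib is imported here so the grouping helper works without it
    import matplotlib.pyplot as plt
    for n_train, points in accuracy_series(results, lr_value).items():
        epochs, accuracy = zip(*points)
        plt.plot(epochs, accuracy, marker="o", label=f"#tr = {n_train}")
    plt.xlabel("epochs")
    plt.ylabel("test accuracy (%)")
    plt.title(f"learning rate = {lr_value}")
    plt.legend()
    plt.show()

# A few rows copied from the random-sampling output above
rows = [
    (20, 5, 0.005, 68.0),
    (20, 10, 0.005, 68.0),
    (20, 100, 0.005, 69.0),
    (20, 100, 0.010, 73.0),
]
print(accuracy_series(rows, 0.005))
```

Calling `plot_accuracy(rows, 0.005)` would draw one line per training-set size, which makes it easier to see where extra epochs stop helping.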