# Homework 02: Linear Classification

Deadline: Monday March 13, 2023 11:59 PM (without a late penalty)

---

This homework consists of two sections, which cover the Perceptron algorithm and the logistic regression classifiers that we learned. 

If you are running this notebook on your local computer, please make sure it has the following packages installed:

- Numpy
- Sklearn

---

**Requirements**:

Please read the following requirements carefully, to avoid any unnecessary point deduction. 

1. Keep all the required results in your submission; our TAs will grade the homework based on the submitted results. (The TAs will not run the code unless they have questions about the implementation.)
2. Your submission should only be via Canvas.

## 1 The Perceptron Algorithm (5 points)

### 1.1 Implementation (3.5 points)

Let's implement the Perceptron algorithm described in our class lecture. 

First, we need to download a toy dataset from the course webpage and load it. 

In [180]:
import urllib.request
import numpy as np
url = "https://yangfengji.net/uva-ml-undergrad/data/separable.txt"
filename, headers = urllib.request.urlretrieve(url, filename="separable.txt")

data = arr = np.loadtxt("separable.txt", delimiter="\t", dtype=float)
X = data[:,:2] # Input
Y = data[:,-1] # Label

# Attach one column to X for capturing the bias term in classification
X = np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)

print(X)
print(Y)

[[-0.11  0.59  1.  ]
 [ 0.35  0.79  1.  ]
 [ 0.25  0.46  1.  ]
 [ 0.03  0.26  1.  ]
 [-0.37  0.38  1.  ]
 [ 0.25  0.12  1.  ]
 [-0.06 -0.03  1.  ]
 [ 0.2  -0.2   1.  ]
 [ 0.55  0.08  1.  ]
 [ 0.81 -0.11  1.  ]
 [ 0.54 -0.23  1.  ]
 [-0.11 -0.23  1.  ]
 [ 0.14  1.03  1.  ]
 [ 0.59  1.21  1.  ]]
[-1. -1. -1. -1. -1.  1.  1.  1.  1.  1.  1.  1. -1. -1.]


Please implement the Perceptron algorithm in the following code block.

Make sure your implementation has the following components:

- The Perceptron update rule and the condition for when to apply it;
- The convergence criterion: stop training when the classifier correctly predicts all the training examples;
- Print out the classification weights at the end. 

In [187]:
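def predict(X, weights, Y):
    # Signed margin y * (w . x): positive only when the example is classified correctly
    return np.dot(X, weights) * Y

def updateWeights(X, Y, weights):
    # Perceptron update rule: w <- w + y * x
    weights += X * Y
    return weights

def perceptron(X, Y):
    weights = np.zeros(3)
    error = True
    # Keep sweeping over the data until one full pass makes no mistakes
    while error:
        error = False
        for i in range(len(Y)):
            guess = predict(X[i], weights, Y[i])
            # Update only on a mistake (or when the example lies on the boundary)
            if guess <= 0:
                error = True
                weights = updateWeights(X[i], Y[i], weights)
    return weights

print(perceptron(X, Y))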

[ 2.22 -4.17  1.  ]
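
As a sanity check on the convergence criterion, the sketch below re-runs a Perceptron loop and verifies that every training example ends up with a positive margin $y \cdot (w \cdot x)$. It assumes the 14 examples printed above, copied by hand in place of the downloaded file; the names `train_perceptron`, `max_epochs`, and `margins` are introduced here for illustration only.

```python
import numpy as np

# Assumption: the 14 examples printed above, copied by hand
X = np.array([
    [-0.11, 0.59], [0.35, 0.79], [0.25, 0.46], [0.03, 0.26],
    [-0.37, 0.38], [0.25, 0.12], [-0.06, -0.03], [0.20, -0.20],
    [0.55, 0.08], [0.81, -0.11], [0.54, -0.23], [-0.11, -0.23],
    [0.14, 1.03], [0.59, 1.21],
])
Y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1], dtype=float)
X = np.hstack([X, np.ones((X.shape[0], 1))])  # constant column for the bias

def train_perceptron(X, Y, max_epochs=10000):
    """Hypothetical helper: sweep the data until one pass has no mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * np.dot(w, x) <= 0:   # misclassified (or on the boundary)
                w = w + y * x           # Perceptron update
                mistakes += 1
        if mistakes == 0:               # converged: a clean pass over the data
            return w
    raise RuntimeError("no convergence within max_epochs")

w = train_perceptron(X, Y)
margins = Y * (X @ w)  # one signed margin per training example
print("All examples classified correctly:", bool(np.all(margins > 0)))
```

A strictly positive margin for every example is exactly the stopping condition of the training loop, so this check should pass whenever the loop terminates.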


### 1.2 The Difference between Perceptron and Logistic Regression (1.5 points)

Please list at least three major differences between the Perceptron model and the Logistic Regression model, as we discussed in class.

**Answer**

1. The Perceptron outputs a hard label, either -1 or 1, while logistic regression outputs a continuous probability in (0, 1).
2. With SGD, logistic regression updates its weights on every training example, while the Perceptron updates its weights only when its prediction is wrong.
3. The Perceptron is trained with an SGD-style update rule, while logistic regression can use a wide range of optimization algorithms to update its weights.
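
The first two differences can be illustrated with a minimal sketch of a single update step for each model. It assumes labels in $\{-1, +1\}$ and the logistic loss $\log(1 + \exp(-y \, w \cdot x))$; the function names and the learning rate `lr` are hypothetical and not part of the homework code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_step(w, x, y):
    # The Perceptron changes w only when the example is misclassified
    if y * np.dot(w, x) <= 0:
        return w + y * x
    return w

def logistic_sgd_step(w, x, y, lr=1.0):
    # One SGD step on log(1 + exp(-y * w.x)): the weights move on every
    # example, by an amount that shrinks as the margin grows
    margin = y * np.dot(w, x)
    return w + lr * sigmoid(-margin) * y * x

w = np.array([1.0, 0.0])
x = np.array([2.0, 0.0])
y = 1.0  # this example is already classified correctly by w

w_perceptron = perceptron_step(w, x, y)
w_logistic = logistic_sgd_step(w, x, y)
prob_positive = sigmoid(np.dot(w, x))  # logistic regression outputs a probability

print("Perceptron weights:", w_perceptron)
print("Logistic regression weights:", w_logistic)
print("P(y = +1 | x):", prob_positive)
```

On this correctly classified example, the Perceptron leaves the weights untouched, while the logistic regression step still nudges them in the direction of $y \cdot x$.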

## 2 Logistic Regression (9 points)

Logistic regression can handle both linearly separable and non-separable cases. Therefore, we will use a real-world dataset instead of a synthetic one. 

We will use the [Logistic Regression]() model from Sklearn for the questions. In other words, you don't have to implement a logistic regression model by yourself, but you need to understand how to use it and how to interpret its outputs. 

### 2.0 Download the dataset

Run the following code to download the [xxx]() dataset from [OpenML](). Similar to what we explained in class, we will also use the [OrdinalEncoder]() to convert the non-numeric features into numeric features. 

Furthermore, we will run the data split function to divide the whole dataset into a training set and a validation set. 

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, Y = fetch_openml("diabetes", version=1, return_X_y=True)
print("Read {} examples with {} features".format(X.shape[0], X.shape[1]))

#print(X)

trnX, valX, trnY, valY = train_test_split(X, Y, test_size=0.3, random_state=111)
print("The size of training set: {}".format(trnX.shape[0]))
print("The size of validation set: {}".format(valX.shape[0]))

Read 768 examples with 8 features
The size of training set: 537
The size of validation set: 231


### 2.1 Build the Classifier (2 points)

Please follow the instructions and complete the code block for building an LR classifier. 

- Use the [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model from Sklearn with its **default** parameters.
- Train the model and report its prediction accuracy on **both** the training and the validation sets.
- Keep the **printed** (instead of hand-typed) results in your submission for grading. Make sure the printed results make clear which is the training accuracy and which is the validation accuracy. For example, you can use 
```python
print("Training accuracy: {}".format(trn_acc))
```
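
The accuracy requested above is the fraction of predictions that match the true labels. A minimal sketch, assuming a small hand-made label list rather than the homework data (the label strings are illustrative), shows that Sklearn's `accuracy_score` agrees with counting the matches by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Assumption: toy labels and predictions, not the homework data
y_true = np.array(["pos", "neg", "neg", "pos", "neg"])
y_pred = np.array(["pos", "neg", "pos", "pos", "neg"])

manual_acc = np.mean(y_true == y_pred)         # fraction of matching entries
sklearn_acc = accuracy_score(y_true, y_pred)   # same quantity from Sklearn

print("Manual accuracy: {}".format(manual_acc))
print("Sklearn accuracy: {}".format(sklearn_acc))
```

For Sklearn classifiers, the `score(X, y)` method of a fitted model returns the same mean accuracy.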

In [None]:
# TODO: implementation
import sklearn.linear_model

classifier = sklearn.linear_model.LogisticRegression()
classifier.fit(trnX, trnY)

trainpred = classifier.predict(trnX)
valpred = classifier.predict(valX)

# Count the training predictions that match the true labels
trnY1 = trnY.to_list()
traincount = 0
for i in range(len(trainpred)):
    if trainpred[i] == trnY1[i]:
        traincount += 1

# Count the validation predictions that match the true labels
valY1 = valY.to_list()
valcount = 0
for i in range(len(valpred)):
    if valpred[i] == valY1[i]:
        valcount += 1

# Divide by the actual set sizes rather than hard-coded constants
print("Training accuracy: {}".format(traincount / len(trnY1)))
print("Validation accuracy: {}".format(valcount / len(valY1)))

Training accuracy: 0.7839851024208566
Validation accuracy: 0.7575757575757576


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
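
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of the scaling option follows. It assumes synthetic data from `make_classification`, with one deliberately inflated feature, in place of the diabetes dataset. It wraps `StandardScaler` and `LogisticRegression` in a pipeline so that the scaler is fit on the training split only; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0

trn_X, val_X, trn_y, val_y = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=111)

# The pipeline standardizes each feature before the classifier sees it
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(trn_X, trn_y)

trn_acc = model.score(trn_X, trn_y)
val_acc = model.score(val_X, val_y)
n_iter = model.named_steps["logisticregression"].n_iter_[0]

print("Training accuracy: {}".format(trn_acc))
print("Validation accuracy: {}".format(val_acc))
print("Solver iterations used: {}".format(n_iter))
```

With standardized features the solver typically needs far fewer iterations, which is why the warning message points to the pre-processing documentation.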


### 2.2 Feature Pre-processing (2 points)

In our lecture, we talked about three different ways to represent the features before feeding them into the classifier. 

Pick the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) to further process the features, and build the classifier again. 

Please follow the instructions and complete the code block for building an LR classifier with feature pre-processing.

- Pick one pre-processing method to further scale or encode the features.
- Train the model and report its prediction accuracy on **both** the training and the validation sets.
- Keep the **printed** (instead of hand-typed) results in your submission for grading. Make sure the printed results make clear which is the training accuracy and which is the validation accuracy.

Note that you can find many other pre-processing functions in Sklearn via [this link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing). Since we only discussed the three above-mentioned pre-processing methods in class, you are *not* required to do anything beyond the instructions. 

In [140]:
# TODO: implementation
import sklearn.linear_model
import sklearn.preprocessing

X, Y = fetch_openml("diabetes", version=1, return_X_y=True)
trnX, valX, trnY, valY = train_test_split(X, Y, test_size=0.3, random_state=111)

# Fit the encoder on the training set only; a value unseen during fitting
# is encoded as all zeros for that feature at transform time
enc = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
enc.fit(trnX)
trnX = enc.transform(trnX)
valX = enc.transform(valX)

classifier = sklearn.linear_model.LogisticRegression()
classifier.fit(trnX, trnY)

trainpred = classifier.predict(trnX)
valpred = classifier.predict(valX)

# Count the training predictions that match the true labels
trnY1 = trnY.to_list()
traincount = 0
for i in range(len(trainpred)):
    if trainpred[i] == trnY1[i]:
        traincount += 1

# Count the validation predictions that match the true labels
valY1 = valY.to_list()
valcount = 0
for i in range(len(valpred)):
    if valpred[i] == valY1[i]:
        valcount += 1

# Divide by the actual set sizes rather than hard-coded constants
print("Training accuracy: {}".format(traincount / len(trnY1)))
print("Validation accuracy: {}".format(valcount / len(valY1)))

Training accuracy: 0.9739292364990689
Validation accuracy: 0.7142857142857143


### 2.3 Justification of Pre-processing (2 points)

If your implementation is correct, you should see a significant boost on both training and validation accuracies. 

Based on what we discussed in class, please explain why this one-hot encoding is helpful for improving classification performance.

**Answer**

Most machine learning models require numerical input dimensions. One-hot encoding transforms categorical dimensions into numerical ones, which allows the model to take them as input. In addition, it removes the ordinality of the categorical variables, meaning that each category in a categorical dimension becomes its own independent dimension.
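
The effect described in the answer above can be illustrated with a minimal sketch. It assumes a made-up single categorical column of color names (not part of the homework data) and shows that each category becomes its own 0/1 column, and that a category unseen during fitting is encoded as all zeros when `handle_unknown="ignore"` is set.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumption: a toy categorical feature, one column with three distinct values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(colors).toarray()  # sparse matrix -> dense array

print(enc.categories_[0])  # the learned categories, in sorted order
print(onehot)              # one row per example, one column per category

# A category that was not seen during fitting maps to an all-zero row
unseen = enc.transform(np.array([["purple"]])).toarray()
print(unseen)
```

Every row of `onehot` contains exactly one 1, so no ordering between the categories is implied.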

### 2.4 $\ell_2$ Regularization (1.5 points)

With the pre-processed data, please train the classifiers with three different regularization coefficients. In the [Logistic Regression]() model implemented in Sklearn, the associated function argument is $C$. 
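
For reference, the role of $C$ can be sketched with the $\ell_2$-penalized objective below. It assumes labels $y_i \in \{-1, +1\}$, and the symbols $w$ (weights), $b$ (intercept), and $x_i$ (the $i$-th example) are notation introduced here. This is the form we understand the Sklearn documentation to describe, up to a constant rescaling:

$$
\min_{w, b} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (w^\top x_i + b)\right)\right)
$$

Because $C$ multiplies the data-fit term, it acts as the inverse of the regularization strength: a smaller $C$ means stronger regularization.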

Please use the following three code blocks and try three different $C$ values:

- $C=100.0$
- $C=1.0$
- $C=0.01$

and print out 

- the training accuracy, and 
- the validation accuracy 

by following the **same** format as the previous questions.

$C=100.0$

In [141]:
# TODO: LR with C=100.0
classifier = sklearn.linear_model.LogisticRegression(C = 100)
classifier.fit(trnX,trnY)

trainpred = classifier.predict(trnX)
valpred = classifier.predict(valX)


trnY1 = trnY.to_list()
traincount = 0
for i in range(len(trainpred)):
    if trainpred[i] == trnY1[i]:
        traincount += 1

valcount = 0

valY1 = valY.to_list()
for i in range(len(valpred)):
    if valpred[i] == valY1[i]:
        valcount += 1


print("Training accuracy: {}".format(traincount/537))

print("Validation accuracy: {}".format(valcount/231))

Training accuracy: 1.0
Validation accuracy: 0.7229437229437229


$C=1.0$

In [142]:
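# TODO: LR with C=1.0
# NOTE: this cell was run with C=0.1 rather than the requested C=1.0,
# so the accuracies printed below correspond to C=0.1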
classifier = sklearn.linear_model.LogisticRegression(C = 0.1)
classifier.fit(trnX,trnY)

trainpred = classifier.predict(trnX)
valpred = classifier.predict(valX)


trnY1 = trnY.to_list()
traincount = 0
for i in range(len(trainpred)):
    if trainpred[i] == trnY1[i]:
        traincount += 1

valcount = 0

valY1 = valY.to_list()
for i in range(len(valpred)):
    if valpred[i] == valY1[i]:
        valcount += 1


print("Training accuracy: {}".format(traincount/537))

print("Validation accuracy: {}".format(valcount/231))

Training accuracy: 0.7746741154562383
Validation accuracy: 0.7012987012987013


$C=0.01$

In [143]:
# TODO: C=0.01

classifier = sklearn.linear_model.LogisticRegression(C = 0.01)
classifier.fit(trnX,trnY)

trainpred = classifier.predict(trnX)
valpred = classifier.predict(valX)


trnY1 = trnY.to_list()
traincount = 0
for i in range(len(trainpred)):
    if trainpred[i] == trnY1[i]:
        traincount += 1

valcount = 0

valY1 = valY.to_list()
for i in range(len(valpred)):
    if valpred[i] == valY1[i]:
        valcount += 1


print("Training accuracy: {}".format(traincount/537))

print("Validation accuracy: {}".format(valcount/231))

Training accuracy: 0.6443202979515829
Validation accuracy: 0.6666666666666666


### 2.5 Explain the regularization effect (1.5 points)

Based on the classification accuracies above, explain the effect of $\ell_2$ regularization on the classification performance.

- What is the direct effect of $\ell_2$ regularization?
- Does it work all the time, or when might it work?
- How do we know it helps avoid over-fitting?

Make sure your answer covers all three questions above.

**Answer**

1. The direct effect of $\ell_2$ regularization is that it reduces the magnitude of the weights in the logistic regression model, which in turn helps reduce the chance of over-fitting.

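2. $\ell_2$ regularization does not always work. It helps only when the model is over-fitting and the coefficient is chosen well: with $C=0.01$ above, the penalty is strong enough that the model under-fits, and both the training accuracy (0.644) and the validation accuracy (0.667) drop.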

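3. In Sklearn, $C$ is the inverse of the regularization strength, so a large $C$ barely penalizes large weights. With $C=100$ above, the training accuracy is 1.0 while the validation accuracy is only 0.723, and this large gap indicates over-fitting. Regularization is helping when decreasing $C$ narrows this gap without lowering the validation accuracy.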
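
The direct effect named in the first answer can be checked with a minimal sketch. It assumes synthetic data from `make_classification` in place of the one-hot encoded diabetes features, so the accuracies it prints are not the homework results; the dictionary `weight_norms` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, not the homework dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
trn_X, val_X, trn_y, val_y = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=111)

weight_norms = {}
for C in [100.0, 1.0, 0.01]:
    # Smaller C means a stronger l2 penalty on the weights
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(trn_X, trn_y)
    weight_norms[C] = np.linalg.norm(clf.coef_)
    print("C={}: ||w||={:.3f}, training accuracy={:.3f}, validation accuracy={:.3f}".format(
        C, weight_norms[C], clf.score(trn_X, trn_y), clf.score(val_X, val_y)))
```

The weight norm should shrink as $C$ decreases, while the gap between training and validation accuracy indicates whether the extra regularization is actually reducing over-fitting.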