# Project 3: Classification with Logistic Regression and SVM

Before we start, please put your name and CUID in following format

: Firstname LASTNAME, #00000000   //   e.g. Nianyi LI, #12345678

**Your Answer:**   
Your NAME, #XXXXXXXX

# General Rules of the Project Submission

Python 3 and [Matplotlib](https://matplotlib.org/) will be used throughout the semseter, so it is important to be familiar with them. It is strongly suggested to go through [Stanford CS231n](http://cs231n.github.io/python-numpy-tutorial/) and [CS228](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) for more detailed Python and numpy tutorials if you haven't had used Python before. 

In some cells and files you will see code blocks that look like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

You should replace the `pass` statement with your own code and leave the blocks intact, like this:

```python
##############################################################################
#                    TODO: Write the equation for a line                     #
##############################################################################
y = m * x + b
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################
```

When completing the notebook, please adhere to the following rules:
- Do not write or modify any code outside of code blocks
- Follow the instruction of the project description carefully
- Run all cells before submitting. <span style="color:red">**You will only get credit for code that has been run!**.</span>

The last point is extremely important and bears repeating:

### We will not re-run your notebook -- <span style="color:red">you will only get credit for cells that have been run</span>

### File name
Your Python program should be named **yourlastname_yourfirstname_P3.ipynb**, then zip it and upload to Canvas

# Project Description

For this project we will apply both **Logistic Regression** and **SVM** to predict whether capacitors from a fabrication plant pass quality control based (QC) on two different tests. To train your system and determine its reliability you have a set of 118 examples. The plot of these examples is show below where a red x is a capacitor that failed QC and the green circles represent capacitors that passed QC.

<div>
<img src="https://nianyil.people.clemson.edu/CPSC_4430/P3_new.png" width="500"/>
</div>


## Data File

Two text files with the data is available on Canvas: a training set of 85 examples and a test set of 33 examples. Both are formatted as
- First line: **m** and **n**, tab separated
- Each line after that has two real numbers representing the results of the two tests, followed by a *1.0* if the capacitor *passed* QC and a *0.0* if it *failed* QC—tab separated.

You need to write a code to read data from the file. You **can** use packages, such as **panda**, to load the data.


In [150]:
##############################################################################
#         TODO: Write the code for reading data from file                    #
##############################################################################
import numpy as np
from tabulate import tabulate
x1, x2, y = np.loadtxt('P3train.txt', skiprows = 1, unpack = True)
x1_test, x2_test, y_test = np.loadtxt('P3test.txt', skiprows = 1, unpack = True)
#x = np.column_stack([np.ones(len(x1)),x1, np.power(x1, 2), x2, x1*x2, x1*np.power(x2, 2), np.power(x2, 2), np.power(x1, 2)* x2, np.power(x1, 2)*np.power(x2, 2)])
#X = np.matrix(x)
#print(x.shape)
#print(X)
X = np.column_stack([
    np.ones(len(x1)),                   # Bias term (1 for each data point)
    x1,                                 # x1
    np.power(x1, 2),                    # x1^2
    x2,                                 # x2
    x1 * x2,                            # x1 * x2
    x1 * np.power(x2, 2),               # x1 * x2^2
    np.power(x2, 2),                    # x2^2
    np.power(x1, 2) * x2,               # x1^2 * x2
    np.power(x1, 2) * np.power(x2, 2)   # x1^2 * x2^2
])

X = X.T
print(X.shape)
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

(9, 85)


Your assignment is to use what you have learned from the class slides and homework to create (**from scratch in Python**, not by using Logistic Regression library function!) a **Logistic Regression** and **SVM** binary classifier to predict whether each capacitor in the test set will pass QC. 

## Logistic Regression

You are free to use any model variation and any testing or training approach we have discussed for logistic regression. In particular, since this data is not linear, I assume you will want to add new features based on power of the original two features to create a good decision boundary. $w_0 + w_1x_1 + w_2x_2$ is not going to work!
One choice might be
- $\textbf{w}^T \textbf{x} = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 +w_6x_6 + w_7x_7 + w_8x_8$    where the new features are created as follows:

| New Features |From Original Features |
| --- | --- |
|$x_1$	| $x_1$|
|$x_2$	| $x_1^2$|
|$x_3$	| $x_2$||
|$x_4$	| $x_1x_2$|
|$x_5$	| $x_1x_2^2$|
|$x_6$	| $x_2^2$|
|$x_7$	| $x_1^2x_2$|
|$x_8$	| $x_1^2x_2^2$|

Note that it is easy to create a small Python program that reads in your  original features, uses a nested loop to create the new features and then writes them to a file:

```python
thePower = 2
for j in range(thePower+1): 
    for i in range(thePower+1):
        temp = (x1**i)*(x2**j)
        if (temp != 1):
            fout1.write(str(temp)+"\t") fout1.write(str(y)+"\n")
```

With a few additions to the code, you can make a program to create combinations of any powers of $x_1$ and $x_2$!

In [151]:
##############################################################################
#           TODO: Define the Logistic regression models                      #
##############################################################################
def logistic_model(x1, x2, w):
    e_z = np.exp(-(compute_wTx(x1, x2, w)))
    y_pred = 1 / (1 + (e_z))
    return y_pred
def compute_wTx(x1, x2, w):
    wTx = (
        w[0] + 
        w[1] * x1 + 
        w[2] * x1**2 + 
        w[3] * x2 + 
        w[4] * x1 * x2 + 
        w[5] * x1 * x2**2 + 
        w[6] * x2**2 + 
        w[7] * x2 * x1**2 + 
        w[8] * (x1**2) * (x2**2)
    )
    return wTx
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

## Optimization using Gradient Decent

Once you have defined the logistic regression model, you need to find the weights using the Gradient Decent algorithm. You need to implement the Vanilla Gradient Decent from scratch in Python.

You need to specify the hyperparameters of GD, and plot the training loss curve (**J-curve**). The loss function should be the binary cross-entropy loss function that we introduced.

In [158]:
##############################################################################
#           TODO: Implement the Gradient Decent Algorithm                    #
##############################################################################
# Define the hyperparameters:
# Numbers of epoch (epoch_num), learning rate (lr), and the initial weights(w)
epoch_num = 1000
lr = 0.1
w = np.zeros(9)
cost_history = []

# Define the loss:
def cross_entropy_loss(y_pred,y):
    m = len(y)
    J = -(1/m) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    cost_history.append(J)
    return J

# Calculate the gradient function:
def gradient_func(w,X,y,y_pred,x1, x2):
    y_pred = logistic_model(x1, x2, w)
    J = cross_entropy_loss(y_pred,y)
    print(J)
    #J.reshape(-1, 1)
    #gradient_value = w - (lr * J * X)
    #gradient_value = np.dot(X.T, J) / len(y)
    error = y_pred - y
    gradient_value = np.dot(X.T, error) / len(y)
    return gradient_value
    
# Implement the Gradient decent algorithm using for loop
def Vanilla_GD(epoch_num,lr,w,X,y,x1,x2):
    for epoch in range(epoch_num):
        gradient = gradient_func(w,X,y,y_pred,x1, x2)
        w = w - lr * gradient
    return w
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

[0. 0. 0. 0. 0. 0. 0. 0. 0.]


Next, print out the final weights and plot the **J-curve/Loss curve** of training. 

In [157]:
##############################################################################
#                     TODO: Plot the J curve                                 #
##############################################################################
# Print out the final weights
w = Vanilla_GD(epoch_num,lr,w,X,y,x1,x2)
print(w)
print(cost_history)
# Plot the J curve w.r.t. the iteration numbers
import matplotlib.pyplot as plt
plt.plot(range(epoch_num), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('J-Curve / Loss Curve of Training')
plt.show()
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

0.6931471805599454


ValueError: shapes (85,9) and (85,) not aligned: 9 (dim 1) != 85 (dim 0)

Based on your data and plot, you should then briefly discuss how you can ensure that the model is well trained.

**Your Answer:**  

## Model Evaluation

Evaluate the performance on testing set:
- Print out the confusion matrix
- Calculate and print out the *accuracy*, *precision*, *recall*, and *F1* value of your model

**Note that:**
- For **undergrads** *(CPSC 4430)* the final accuracy of both algorithms on your test set should be higher than  <span style="color:red">**70%**</span>
- For **graduate-level** *(CPSC 6430)* the final accuracy of both algorithms on your test set should be higher than  <span style="color:red">**85%**</span>


In [133]:
##############################################################################
#                           TODO: Model Evaluation                           #
##############################################################################
def model_evaluation(y_test, y_t_pred):
    tp = np.sum((y_test == 1) & (y_pred == 1))
    tn = np.sum((y_test == 0) & (y_pred == 0))
    fp = np.sum((y_test == 0) & (y_pred == 1))
    fn = np.sum((y_test == 1) & (y_pred == 0))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision * recall) if (precision + recall) > 0 else 0

    return accuracy, precision, recall, f1_score
    
y_t_pred = logistic_model(x1_test, x2_test, w)

accuracy, precision, recall, f1_score = model_evaluation(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1_score}")
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

Accuracy: nan
Precision: 0
Recall: 0
F1 Score: 0


  accuracy = (tp + tn) / (tp + tn + fp + fn)


## Support Vector Machine (SVM)

In this part, you need to use the previous training and testing data file. 

You are **allowed** to use the svm functions in the **Scikit-learn** library and don’t need to implement the algorithm from scratch.

- You need to try at least **three** different kernel functions of SVM, and pick the **best** model.
- You need to print out the final weights got from your best SVM model.

**Note that:**
- For **undergrads** *(CPSC 4430)* the final accuracy of both algorithms on your test set should be higher than  <span style="color:red">**70%**</span>
- For **graduate-level** *(CPSC 6430)* the final accuracy of both algorithms on your test set should be higher than  <span style="color:red">**85%**</span>

In [None]:
##############################################################################
#                      TODO: Classfication using SVM                         #
##############################################################################
# Pick the best model
pass

# Print out the final weights
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

## Visualize Decision Boundary and Model Comparision

You need to plot the decision boundary of Logistic Regression and SVM that you previously trained separately. 

In [None]:
##############################################################################
#                   TODO: Plot the Decision Boundary                         #
##############################################################################
pass
##############################################################################
#                              END OF YOUR CODE                              #
##############################################################################

Based on your data and plot, you should then briefly discuss which one has better performance and why.

**Your Answer:**  