# Homework 4
# Perceptron, SVM, and PCA

# <p style="text-align: right;"> &#9989; Aidan Klinger</p>
# <p style="text-align: right;"> &#9989; aidank247</p>

# Goal for this homework assignment
We have worked some basics on perceptron, SVM, and PCA in the pre-class and in-class assignments. In this homework assignment, we will:

* Continue to use git as the version control tool
* Work on unfamiliar data
* Use perceptron to classify data 
* Use SVM to classify data
* Use principal component analysis to facilitate classification


**This assignment is due by 11:59 pm on Friday, April 25th. Note that ONLY the copy on GITHUB will be graded.**  **There are 60 standard points possible in this assignment including points for Git commits/pushes. The distribution of points can be found in the section headers**.

---
# Part 1: Git repository (6 points)

You're going to add this assignment to the `cmse202-s25-turnin` repository you previously created. The history of progress on the assignment will be tracked via git commitments. 

**&#9989; Do the following**:

1. Navigate to your `cmse202-s25-turnin` **local** repository and create a new directory called `hw-04`

2. Move this notebook into that **new directory** in your repository. 

5. Double check to make sure your file is at the correct directory.

6. Once you're certain that file and directory are correct, add this notebook to your repository, then make a commit and push it to GitHub. You may need to use `git push origin hw04` to push your file to GitHub.

Finally, &#9989; **Do this**: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below. **Points for this part will be given for correctly setting up branch, etc., above, and for doing git commits/pushes mentioned throughout the assignment.**

<font size=6 color="#009600">&#9998;</font> git clone https://github.com/aidank247/CMSE202-s25-turnin


**Important**: Double check you've added your Professor and your TA as collaborators to your "turnin" repository (you should have done this in the previous homework assignment).

**Also important**: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, **none of your changes will be tracked**!

If everything went as intended, the file should now show up on your GitHub account in the "`cmse202-s25-turnin`" repository inside the `hw-04` directory that you just created.

Periodically, **you'll be asked to commit your changes to the repository and push them to the remote GitHub location**. Of course, you can always commit your changes more often than that, if you wish.  It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the problems for a bit.

---
# Part 2: Deal with unfamiliar data (35 points)

## Warm up with perceptron for binary classification
## 2.1 Load up the dataset

This data is obtained from Kaggle/diabetes. It contains multiple measured values and a label for whether the patient is diagnosed as diabetic. 

* Use commands to dowdload the dataset from `https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/refs/heads/main/HW04/diabetes_prediction_dataset.csv`
* Use Pandas to load in the data and briefly examine it.
* Succeed data load-up gets **2 pt**. 

In [17]:
# put your code here
import pandas as pd

df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()
df.count()

gender                 100000
age                    100000
hypertension           100000
heart_disease          100000
smoking_history        100000
bmi                    100000
HbA1c_level            100000
blood_glucose_level    100000
diabetes               100000
dtype: int64

How many patients are in this dataset? What are features of the patients?

<font size=6 color="#009600">&#9998;</font> 100000 and 9 features

### Use your perceptron class built in Day18 and Day19 assignments to classify whether patients are diabetic.

* You should see that there are some features that are non-numerics.
* The first one is `gender`. Find the types of classes and convert them to numerics in your dataframe.
* The second one is `smoking_history`, convert those string labels to numerics.
* Note that since perceptron is a binary classifier, which only determines which side of the dividing line the data points reside, we should also convert the labels to `+1` and `-1`.
* Completing data conversion gets **5 pt**.

In [18]:
# put your code here
gender_mapping = {'Male': 0, 'Female': 1, 'Other': 2}
df['gender'] = df['gender'].map(gender_mapping)

df['smoking_history'] = df['smoking_history'].astype('category').cat.codes

df['diabetes'] = df['diabetes'].apply(lambda x: 1 if x == 1 else -1)

print(df.head())
print(df.dtypes)

   gender   age  hypertension  heart_disease  smoking_history    bmi  \
0       1  80.0             0              1                4  25.19   
1       1  54.0             0              0                0  27.32   
2       0  28.0             0              0                4  27.32   
3       1  36.0             0              0                1  23.45   
4       0  76.0             1              1                1  20.14   

   HbA1c_level  blood_glucose_level  diabetes  
0          6.6                  140        -1  
1          6.6                   80        -1  
2          5.7                  158        -1  
3          5.0                  155        -1  
4          4.8                  155        -1  
gender                   int64
age                    float64
hypertension             int64
heart_disease            int64
smoking_history           int8
bmi                    float64
HbA1c_level            float64
blood_glucose_level      int64
diabetes                 int64


### Now all feature varilables are numerics.

### &#128721; STOP (1 Point)
**Pause, save and commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.



---

## 2.2 Binary perceptron classifier

Copy your perceptron class to the cell below. 

* DO NOT use the one from statsmodel. We want to test the perceptron you built.
* Note that your predict method should output `+1` or `-1` for positive or negative values, respectively.
* A functional perceptron classifier gets **4 pt**.

In [19]:
# copy your perceptron class to his cell
import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.activation_func = self._unit_step_function
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for i in range(self.n_iters):
            for x_i, y_true in zip(X, y):
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_predicted = self.activation_func(linear_output)

                update = self.lr * (y_true - y_predicted)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        predictions = self.activation_func(linear_output)
        return predictions

    def _unit_step_function(self, x):
        return np.where(x >= 0, 1, -1)


* Split data to 70-30 train-test sets **1 pt**.
* Train your perceptron.
* Show the accuracy of your pereptron **2 pt**.

In [20]:
# put your code here
from sklearn.model_selection import train_test_split

X = df.drop('diabetes', axis=1).values  
y = df['diabetes'].values               

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1)

model = Perceptron(learning_rate=0.001, n_iters=1000)
model.fit(X_train, y_train)

* Use test set to evaulate the accuracy of your perceptron. What is your accuracy? (**2 pt**)

In [21]:
# put your code here
y_pred = model.predict(X_test)

accuracy = np.mean(y_pred == y_test)

accuracy

0.9256428571428571

* There may be some ways to increase the accruacy, such as increasing the number of train iterations or adjust learning rate. Give a try to train a perceptron you can best get. Record the values of parameters and the optimal accuracy. (**3 pt**)


In [None]:
# put your code here
best_accuracy = 0
best_lr = None
best_n_iters = None

for lr in [0.001, 0.01, 0.1]:
    for n_iters in [1000, 3000, 5000]:
        model = Perceptron(learning_rate=lr, n_iters=n_iters)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = np.mean(y_pred == y_test)
        
        print(f"Learning Rate: {lr}, Iterations: {n_iters}, Accuracy: {accuracy:.4f}")
        
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_lr = lr
            best_n_iters = n_iters
        print(best_accuracy)

### &#128721; STOP (1 Point)
**Pause, save and commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.

---

### 2.3 Next we shall test perceptron's capability of multiple-label classification.

* Dowdload the dataset from `https://raw.githubusercontent.com/huichiayu/cmse202-s25-supllemental_data/refs/heads/main/HW04/Telecust1.csv`.
* This is a customer category dataset (Kraggle/Customer Classification). Each cusmtoer has several feature variables.
* There are five categories of customers, which are non-numerics. Thus, let's convert those string labels to numerics.
* Successful data load-up gets **2 pt**.

In [None]:
# Download and load the dataset. Convert non-numerical labels to numerics.
# put your code here

df2 = pd.read_csv('Telecust1.csv')

print(df['category'].unique())

category_mapping = {
    'region': 0,
    'income': 1,
    'age': 2,
    'tenure': 3,
    'martial': 4,
    'address': 5,
    'ed': 6,
    'employ': 7,
    'gender': 8,
    'reside': 9,
    'custcat': 10,
    
}

df['category'] = df['category'].map(category_mapping)

print(df.head())

---
### 2.4 Multi-label perceptron classification

* As we know, perceptron is a binary classifier. For multiple-label classification, we can use One-vs-Rest (OvR) Strategy.
* In this case, let's train five individual perceptrons. 
* For each classifier, it treats the current class as "positive" and all others as "negative."
* When classifying a new sample, each classifier gives a "score," and the class with the highest score is chosen.

Copy your perceptron to the code cell below. We need to add a score method, which outputs dot of weights and features, as opposed to the previous binary predict method. The score method should output a signed floating score value, not `+1` or `-1`. This can be done by removing the binary segmenting, i.e., directly outputing the dot value.

* Functioning score() method gets **2 pt**.

In [None]:
# put your modified perceptron class here
class Perceptron:
    def __init__(self, learning_rate=0.01, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_predicted = self._unit_step_function(linear_output)
                update = self.lr * (y[idx] - y_predicted)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        return self._unit_step_function(linear_output)

    def score(self, X):
        return np.dot(X, self.weights) + self.bias

    def _unit_step_function(self, x):
        return np.where(x >= 0, 1, -1)


* Now let's do a train-test split of the data with a test_size = 0.3.
* Since we are training 5 perceptrons, we should have have 5 class label sets. For instance, in the label set for category A, the label value will be `+1` if it's type A and otherwise `-1`.
* Setting label sets gets **4 pt**.

In [None]:
# put your code here

X_train, X_test, y_train_original, y_test_original = train_test_split(
    X, y, test_size=0.3, random_state=42
)

classes = ['A', 'B', 'C', 'D', 'E']

y_train_sets = {}
for c in classes:
    y_train_sets[c] = np.where(y_train_original == c, 1, -1)

y_test_sets = {}
for c in classes:
    y_test_sets[c] = np.where(y_test_original == c, 1, -1)



* Use training set and the 5 training label sets to train your 5 perceptrons. Report the accuracy of those five training.
* Efficiently train the five perceptrons using nest loop gets **5 pt**.


In [None]:
# put your code here
class Perceptron:
    def __init__(self, learning_rate=0.01, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.activation_func = self._unit_step_function
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                linear_output = np.dot(x_i, self.weights) + self.bias
                y_predicted = self.activation_func(linear_output)
                
                update = self.lr * (y[idx] - y_predicted)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        return self.activation_func(linear_output)

    def _unit_step_function(self, x):
        return np.where(x >= 0, 1, -1)


perceptrons = {}  
train_accuracies = {} 


for label_name in ['A', 'B', 'C', 'D', 'E']:
    print(label_name)
    

    clf = Perceptron(learning_rate=0.01, n_iters=1000)
    

    clf.fit(X_train, y_train_sets[label_name])
    

    perceptrons[label_name] = clf
    

    y_train_pred = clf.predict(X_train)
    
    acc = np.mean(y_train_pred == y_train_sets[label_name])
    train_accuracies[label_name] = acc

for label_name, acc in train_accuracies.items():
    print(f"Class {label_name}: {acc:.4f}")

* Use the test vector to examine the accuracy.
* For each feature set, there should be 5 output scores, each from a perceptron. The predicted label should be the label that corresponds to the highest score.
* Report your accuracy. (**3 pt**)

In [None]:
# put your code here
all_scores = []

for label_name in ['A', 'B', 'C', 'D', 'E']:
    raw_scores = np.dot(X_test, perceptrons[label_name].weights) + perceptrons[label_name].bias
    all_scores.append(raw_scores)

all_scores = np.vstack(all_scores)


predicted_indices = np.argmax(all_scores, axis=0)


index_to_label = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}
y_pred = np.array([index_to_label[idx] for idx in predicted_indices])


test_accuracy = np.mean(y_pred == y_test)

test_accuracy

How good is your multiple-label perceptron classification?

<font size=6 color="#009600">&#9998;</font> it is good but, the multiple classes effect the accuracy


### &#128721; STOP (1 Point)
**Pause, save and commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.

---
## Part 3 SVM classifiers (19 points)

### 3.1 SVM 

Let's re-use the customer category data. There are five caterogies with multiple feature variables.

* Use sklearn library to build a SVM classifier. Since we do not know what the best parametes are, perform a GridSearch for best parameters.
* NOTE: Because the dataset contains a large number of points, it's expected to have a long computer running time for GridSearch. Thus, let's use only the first 200 data points for GridSearch. You can start the grid search parameter like the image below. However, **NOTE** that if the kernal used cannot find a hyperplane to classify data points, the GridSearch function will stall. You need to manually remove that kernal from the parameter set and re-run GridSearch.
  
<img src="https://i.ibb.co/JWrp6c4q/Grid-Search-Param.png" width="650">


* As in the previous section, make a 70-30 train-test split and train your SVM classifier.
* Complete GridSearch to extract best parameters gets **5 pt**.

In [None]:
# put your code here.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split


X_small = X[:200]
y_small = y[:200]


param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['linear', 'rbf']  
}


svc = SVC()
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)


grid_search.fit(X_small, y_small)

best_params = grid_search.best_params_
print("Best parameters from GridSearch:", best_params)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

final_svc = SVC(**best_params)
final_svc.fit(X_train, y_train)

train_accuracy = final_svc.score(X_train, y_train)
test_accuracy = final_svc.score(X_test, y_test)

print(train_accuracy)
print(test_accuracy)

* Examine the accuracy of this SVC and report the accuracy. Draw a confusion matrix. **2 pt**

In [None]:
# put your code here
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay


y_pred = final_svc.predict(X_test)


test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy of the final SVC: {test_accuracy:.4f}")


cm = confusion_matrix(y_test, y_pred, labels=['A', 'B', 'C', 'D', 'E'])


disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['A', 'B', 'C', 'D', 'E'])
disp.plot(cmap='Blues')
plt.title("Confusion Matrix for Final SVC Classifier")
plt.show()


Does SVM classifier work much better than your percetron?

<font size=6 color="#009600">&#9998;</font> I think the SVM is better, personally I like the format more also, it is more consise

### &#128721; STOP (1 Point)
**Pause, save and commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.

---
### 3.2 PCA 

Although we only have 11 feature variables in the dataset, let's examine how much principal component analysis (PCA) can accelerate the classification. We will increase the PCA components from 1 to 11. For each case, we will perform a GridSearch and use test set to examine the accuracy. 

* Write a code to loop over n_components = 1 through 11. **4 pt**
* Record the accuracy of each case and plot the profile of accuracy versus n_components. In the mean time, record the computer run times and plot the profile of time versus n_components. **2 pt**




In [None]:
# put your code here
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import time


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)


param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['linear', 'rbf']  
}

accuracies = []
times = []

for n_components in range(1, 12):
    print(n_components)


    start_time = time.time()


    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    svc = SVC()
    grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=0)
    grid_search.fit(X_train_pca, y_train)

    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_pca)
    acc = accuracy_score(y_test, y_pred)

    elapsed_time = time.time() - start_time
    accuracies.append(acc)
    times.append(elapsed_time)

    print(f"n_components={n_components}: accuracy={acc:.4f}, time={elapsed_time:.2f} sec")

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(range(1, 12), accuracies, marker='o')
plt.title('Accuracy vs Number of PCA Components')
plt.xlabel('Number of PCA Components')
plt.ylabel('Accuracy')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(range(1, 12), times, marker='o')
plt.title('Time vs Number of PCA Components')
plt.xlabel('Number of PCA Components')
plt.ylabel('Time (seconds)')
plt.grid(True)

plt.tight_layout()
plt.show()



Please answer the following questions. 
* How is the overall accuracy of this SVM classifier?  **1 pt**
* If the performance is not good, what do you think the cause is? **2 pt**

<font size=6 color="#009600">&#9998;</font> It is lower then I thought, classes overlap could be a possibility, or the values of our variables

* Describe the curves of time vs n_components and accuracy vs n_components. **1 pt**
* Explain why the curves behave as they are in the figures **2 pt**

<font size=6 color="#009600">&#9998;</font> the curves go high quick and flatten our at the end, as time increases, the accuracy goes up a lot and then there is a point where you have enough components you cannot increase accuracy enough

### &#128721; STOP (1 Point)
**Pause, save and commit your FINAL changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository with a meaningful commit message.



---
## Assignment wrap-up


Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**



In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/r/mB0YjLYvAA" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

## Congratulations, you're done!

&#169; Copyright 2025,  Department of Computational Mathematics, Science and Engineering at Michigan State University