### <p style="text-align: right;"> &#9989; Bella Said.</p>

# CMSE 202 Homework 03






### Assignment instructions

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are **25 points** possible on this assignment. Point values for each part are included in the section headers.

This assignment is due at 11:59 pm on **Friday October 23rd**. It should be uploaded into the "Homework Assignments" submission folder for Homework 3 in your D2L webpage. Submission instructions can be found at the end of the notebook.

**Hint**: It is possible you are asked to do something you are not familiar with. That's why you have internet access. Do some smart searches and see what you can find! 


### Our imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

## Part 1: Setting up a repository for tracking changes (3 points)

For this assignment, you're going to add it to the CMSE202 repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this you need to:

* Navigate to your `/CMSE202/repos` repository and create a new directory called `hw-03`.
* Move this notebook into that new directory in your repository, then add it and commit it to your repository.
* Finally, to test that everything is working, "git push" the file so that it ends up in your GitHub repository.

Important: Make sure you've added your TA as a collaborator to your repository with "Read" access so that we can see your assignment.

* Section 001:  tuethan
* Section 002:  Luis-Polanco
* Section 003:  DavidRimel

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked.


If everything went as intended, the file should now show up on your GitHub account CMSE202 repository under the `hw-03` directory that you just created. Periodically, you'll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

## Part 2: Load, prepare and plot the data (5 points)

In this homework we will work with the yeast dataset and build logistic regression and k-nearest neighbors classifiers. The data file is *yeast.data* and its description is in *yeast.names*. Read the description and get a sense of the meaning of the dataset. In this part, we will load and clean up the data.

**Question 2.1 (1 point)** Load the *yeast.data* as a pandas dataframe and give appropriate names to the columns. Then drop the columns **sequence name**, **pox** and **vac**. What's the size of this dataset now?

In [None]:
### Put your code here ###
# Columns are separated by runs of whitespace; the raw string avoids an invalid-escape warning
yeast = pd.read_csv("yeast.data", sep=r'\s+', names=['Sequence Name', 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', 'class label'])

In [None]:
yeast = yeast.drop(columns=['Sequence Name', 'pox', 'vac'])

In [None]:
yeast.head()

In [None]:
print('The size of the dataset is now', yeast.shape, '(rows, columns), with', yeast.size, 'entries in total')
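
The "size" of a DataFrame can be read in two ways, so here is a minimal sketch on a made-up table (the `toy` frame and its values are hypothetical, not the yeast data). It assumes that `shape`, the pair (rows, columns), is usually what is being asked for, while `size` counts every individual cell.

```python
import pandas as pd

# Hypothetical 3-row, 2-column table used only for illustration
toy = pd.DataFrame({"mcg": [0.58, 0.43, 0.64], "gvh": [0.61, 0.67, 0.62]})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
n_cells = toy.size           # rows * columns

print("shape:", toy.shape)
print("size:", n_cells)
```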

**Question 2.2 (1 point)** Find the number of unique entries in the class label column.

In [None]:
labels = yeast['class label']
print('There are', labels.nunique(), 'unique entries in the class label column')

**Question 2.3 (1 point)** We are only interested in data with label **CYT (cytosolic or cytoskeletal)** and **MIT (mitochondrial)**. Make a new dataframe containing
data with only these two types of labels, and redefine label **CYT** into **0**, and **MIT** into **1**. What's the size of the dataset now?

In [None]:
### Put your code here ###
cyt = yeast[yeast['class label']=='CYT']
mit = yeast[yeast['class label']=='MIT']
data = [cyt, mit]
new_data = pd.concat(data)

In [None]:
# Recode the two remaining labels as integers in one step: CYT -> 0, MIT -> 1
new_data["class label"] = new_data["class label"].map({'CYT': 0, 'MIT': 1})

In [None]:
print('The size of the dataset is now', new_data.shape, '(rows, columns), with', new_data.size, 'entries in total')
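
As an alternative sketch of the same filter-and-recode step, the two labels can be selected with `isin` and recoded with `map`. The `toy` table below is hypothetical and stands in for the real data; only the column name `class label` is taken from this notebook.

```python
import pandas as pd

# Hypothetical mini-table; the real data has many more rows and labels
toy = pd.DataFrame({
    "mcg": [0.58, 0.43, 0.64, 0.50],
    "class label": ["MIT", "NUC", "CYT", "MIT"],
})

# Keep only rows whose label is CYT or MIT (copy so the original is untouched)
subset = toy[toy["class label"].isin(["CYT", "MIT"])].copy()

# Recode the labels as integers in one step
subset["class label"] = subset["class label"].map({"CYT": 0, "MIT": 1})

print(subset)
```

Unlike concatenating two filtered frames, this keeps the rows in their original order.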

**Question 2.4 (2 points)** Make a scatter plot including every sample in the dataset with: the mcg feature on the x-axis, the gvh feature on the y-axis, and different colors for each class label. Make your observation. Are the two classes distinguishable using only those two features?

In [None]:
### Put your code here ###
plt.scatter(new_data['mcg'], new_data['gvh'], c=new_data['class label'])
plt.xlabel('mcg')
plt.ylabel('gvh')
plt.title('gvh vs mcg')

No, the two classes are not distinguishable using only these two features. As the scatter plot shows, the two classes overlap almost completely, so it would be very difficult to separate them based on mcg and gvh alone.

# Logistic Regression

In the next part we will build a logistic regression model for the data classification.

## Part 3: Prepare data and build the logistic regression model (7 points)


**Question 3.1 (2 points)** Apply the "train_test_split" function in the *sklearn* package to split the data into 70% for training and 30% for testing. Using common variable names like x_train, y_train, x_test and y_test might help later.

In [None]:
features = new_data.iloc[:, :-1]

In [None]:
### Put your code here ###
x_train_vectors, x_test_vectors, y_train_labels, y_test_labels = train_test_split(
    features, new_data['class label'], train_size=0.7, test_size=0.30, random_state=0)

**Question 3.2 (2 points)** Perform the logistic regression. 
* Discuss your results. How well does your model fit your data? What evidence are you using to make the determination? 
* Based on the P values under "P > |z|", which two features **in this dataset** are the least significant and can be dropped?

In [None]:
### Put your code here ###
logit_model = sm.Logit(y_train_labels, sm.add_constant(x_train_vectors))
result = logit_model.fit()
print(result.summary())

The two features with the highest p-values are erl and nuc. Both are statistically insignificant at the .05 level: nuc has a p-value of .315 and erl has a p-value of .990, close to 1.

**Question 3.3 (3 points)** Drop the two least important features found in the previous question and perform the logistic regression again. Then use the `sklearn.metrics` we imported at the top and run the `accuracy_score` on the 0/1 predicted label and the test labels, and print the accuracy of this model.

* Discuss your results. How well does your reduced model fit your data? What evidence are you using to make the determination?

In [None]:
### Put your code here ###
train_vectors_reduced = x_train_vectors.drop(columns=['nuc', 'erl'])
reduced_logit_model = sm.Logit(y_train_labels, sm.add_constant(train_vectors_reduced))
result_reduced = reduced_logit_model.fit()
print(result_reduced.summary())

In [None]:
test_vectors_reduced = x_test_vectors.drop(columns=['nuc', 'erl'])

In [None]:
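# Predicted probabilities of class 1 (MIT) on the reduced test set
predicted_vals = result_reduced.predict(sm.add_constant(test_vectors_reduced))
# Threshold at 0.5 to turn the probabilities into 0/1 labels
values = [1 if i > .5 else 0 for i in predicted_vals]
print('The accuracy of the reduced model is', metrics.accuracy_score(y_test_labels, values))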

This model fits the data better than the full model. We can see this from the p-values: all are close to zero and statistically significant except for those of two features, alm and mcg.
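
To make the thresholding and accuracy step concrete, here is a small sketch that does the same calculation by hand with NumPy. The probabilities and labels are made up for illustration and are not output from the model above.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 and the matching true labels
probs = np.array([0.91, 0.20, 0.55, 0.48, 0.05])
y_true = np.array([1, 0, 0, 1, 0])

# Threshold at 0.5 to turn probabilities into 0/1 labels
y_pred = (probs > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the true labels
accuracy = np.mean(y_pred == y_true)

print("predicted labels:", y_pred)
print("accuracy:", accuracy)
```

Here three of the five predictions match, so the accuracy is 0.6. This is the same quantity that `accuracy_score` reports.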

# K-Nearest Neighbors

In the next part we will be building a class that will use the k-nearest neighbors algorithm (kNN) to make predictions on the same dataset. From the previous part (logistic regression), you have selected **4 features** that are important for classification. We will **only** use those 4 features in this part.
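
Before using the scikit-learn classifier, it may help to see the idea behind kNN in a few lines. The sketch below is a simplified illustration, not scikit-learn's implementation: the function `knn_predict` and the tiny training set are hypothetical, and it assumes Euclidean distance and integer class labels.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels (labels assumed to be non-negative integers)
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Hypothetical 2-feature training set with two well-separated classes
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
                    [0.9, 0.8], [0.8, 0.9], [0.85, 0.85]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([0.2, 0.2]))
pred_high = knn_predict(X_train, y_train, np.array([0.7, 0.9]))
print(pred_low, pred_high)
```

With an odd `k` and two classes the vote cannot tie, which is one reason odd values such as 3 are a common starting point.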


## Part 4: KNN classifier, cross-validation and hyperparameter tuning (10 points)

**Question 4.1 (3 points)** Test drive the KNN classifier. Use the same train and test data you created in question 3.3 to build a KNN classifier with K=3. 
- make a `KNeighborsClassifier` with an argument of `n_neighbors=3`. This returns a knn classifier (let's just call it `knn`)
- call `knn.fit` on the training data
- use `knn.predict` on the testing data to generate the predicted values.
- print the confusion matrix.
- print the train and test score using `knn.score`.
- plot the ROC curve with the diagonal (the "chance line") also labeled. Using `sklearn.metrics`, print the `auc` for this model.

In [None]:
### Put your code here ###
knn = KNeighborsClassifier(n_neighbors=3)
classifier = knn.fit(train_vectors_reduced, y_train_labels)
predicted_results = knn.predict(test_vectors_reduced)

In [None]:
metrics.confusion_matrix(y_test_labels, predicted_results)

In [None]:
print('The train score is', knn.score(train_vectors_reduced, y_train_labels))
print('The test score is', knn.score(test_vectors_reduced, y_test_labels))

In [None]:
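# Use the predicted probability of class 1 so the ROC curve sweeps over several thresholds
predicted_probs = knn.predict_proba(test_vectors_reduced)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test_labels, predicted_probs)
roc_auc = metrics.auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], linestyle='dashed', label='chance line')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('ROC Curve')
plt.legend()
print('The AUC for this model is', roc_auc)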
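
One way to read the AUC is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes that directly. The helper `pairwise_auc` and the scores are hypothetical, and it is meant as an illustration of the definition rather than a replacement for `sklearn.metrics`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, scores)
print("AUC:", auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A classifier that scores every sample the same would land on 0.5, the chance line.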

## k-Fold Cross-Validation
Cross-validation is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.
For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so each group would get a chance to be the test set. This can be seen in the graph below.

<img src="https://miro.medium.com/max/1400/1*NyvaFiG_jXcGgOaouumYJQ.jpeg" width=700px>

The train-test-split method we used earlier is called ‘holdout’. Cross-validation is better than using the holdout method because the holdout method score is dependent on how the data is split into train and test sets. Cross-validation gives the model an opportunity to test on multiple splits so we can get a better idea of how the model will perform on unseen data.
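
The splitting described above can be sketched by hand with NumPy. The sample count, the seed, and the variable names are assumptions chosen for readability; the point is only that every sample lands in the test set exactly once.

```python
import numpy as np

n_samples, k = 10, 5            # hypothetical sizes chosen for readability
rng = np.random.default_rng(0)  # fixed seed so the shuffle is repeatable

# Shuffle the sample indices, then cut them into k roughly equal groups
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

for i, test_idx in enumerate(folds):
    # Every fold except the i-th forms the training set for this round
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"round {i}: train={np.sort(train_idx)}, test={np.sort(test_idx)}")
```

Note that scikit-learn's `cross_val_score` with an integer `cv` on a classifier uses stratified folds and does not shuffle by default, so this illustrates the idea rather than that exact routine.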




**Question 4.2 (2 points)** Look up `cross_val_score` in `sklearn.model_selection`. We will still use n_neighbors=3 and a cross-validation value of 5. `cross_val_score` takes in our k-NN model and our data as parameters. It then splits our data into 5 groups and fits and scores our model 5 separate times, recording the accuracy score in an array each time. We will save the accuracy scores in the cv_scores variable. Then find the average of the cv_scores, which will give you a more reliable estimate of the accuracy of the model.

* Discuss your results. How well do your models fit your data? 
* What are you using to judge that fit (i.e., how should we think about the accuracy score as a measure of quality of the model)?
* How does the quality of the KNN model compare to logistic regression?

In [None]:
### Put your code here ###
cv_scores = cross_val_score(knn, test_vectors_reduced, y_test_labels, cv=5)

In [None]:
avg = sum(cv_scores)/len(cv_scores)
print('The average of the cross validation scores is', avg)

The average cross-validation score is .77, which is good, but for this model we would want a higher accuracy. Although accuracy is a useful measure of the model, we would also want to look at its precision. Precision can be high while accuracy is low, for example when the few samples the model labels as positive are correct but it misses many other positives. To get a better idea of whether our model is 'good', we should look at other measures besides accuracy. The KNN model is slightly less accurate than the reduced logistic regression model: logistic regression has an accuracy of .79, while the KNN test accuracy is .78. The average cross-validation score of .77 is lower still.

## Hyperparameter tuning


Almost all machine learning models have hyperparameters. Hyperparameters are settings in the model that the user needs to choose before learning takes place. For example, in k-nearest neighbors, the number of neighbors to consider, n_neighbors, is a hyperparameter. An important task in machine learning is hyperparameter tuning, which means finding the optimal hyperparameter values. We will now explore the optimal choice of this parameter for this dataset.

**Question 4.3 (3 points)** Consider the range of `n_neighbors` from 1 to 100, and fix the cross-validation value to be 5. 
- For each value of n_neighbors, compute the means of the cv_scores. 
- Make a plot with the x-axis being n_neighbors, y-axis being the mean of cv_scores.
- Find the optimal choice of n_neighbors with the largest value of the mean of cv_scores.

Discuss your results
* How does the quality of this model compare to the earlier models that you made with KNN and logistic regression?


In [None]:
### Put your code here ###
scores_list =[]
for i in range(1,101,1):
    Knn = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(Knn, test_vectors_reduced, y_test_labels, cv=5)
    average = sum(scores)/len(scores)
    scores_list.append(average)

In [None]:
plt.plot(np.arange(1,101,1), scores_list)

In [None]:
print(max(scores_list))
scores_list.index(max(scores_list))

Because Python starts counting at 0, we shift the index up by 1, so the maximum average CV score occurs at n_neighbors = 22, which is therefore the optimal value. The quality of this model is higher than that of both the earlier KNN model and the logistic regression model, but not by much. Both of the previous models had accuracy or cross-validation scores in the high seventies, while this model with n_neighbors = 22 has an average cross-validation score of .816, the highest we've seen so far.
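
The index shift can be avoided by looking up the best position in the array of candidate values itself. A minimal sketch with made-up scores (the numbers and the names `k_values` and `mean_scores` are hypothetical):

```python
import numpy as np

# Hypothetical mean cross-validation scores for n_neighbors = 1, 2, 3, 4, 5
k_values = np.arange(1, 6)
mean_scores = np.array([0.70, 0.75, 0.82, 0.80, 0.78])

best_position = int(np.argmax(mean_scores))  # zero-based position of the best score
best_k = int(k_values[best_position])        # the n_neighbors value at that position

print("best position:", best_position)
print("best n_neighbors:", best_k)
```

The best score sits at position 2, which corresponds to n_neighbors = 3.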

Now we will use a more efficient method: `GridSearchCV` in `sklearn.model_selection` to find the optimal n_neighbors.

**Question 4.4 (2 points)** Look up `GridSearchCV` in `sklearn.model_selection`. We will still use a cross-validation value of 5.  Use `best_params_` in `GridSearchCV` to find the optimal n_neighbors. Does it agree with the results from question 4.3?

In [None]:
### Put your code here ###
param_grid = {'n_neighbors': np.arange(1,101,1)}
# make a classifier by searching over a classifier and the parameter grid
clf = GridSearchCV(KNeighborsClassifier(), param_grid, cv =5, n_jobs = -1)
clf = clf.fit(test_vectors_reduced, y_test_labels)

In [None]:
print(clf.best_params_)

`best_params_` in `GridSearchCV` reports n_neighbors = 22 as the optimal value, that is, the one with the maximum average cross-validation score. This agrees with the n_neighbors value that we found by hand in question 4.3.
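
Besides `best_params_`, a fitted `GridSearchCV` also exposes the winning mean score as `best_score_` and the full table of results as `cv_results_`. The sketch below shows this on synthetic data from `make_classification`, which is only a stand-in for the yeast features. The smaller search range of 1 to 20 is also an assumption made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the real features (an assumption)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# Search a small, hypothetical range of n_neighbors with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 21))}, cv=5)
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("best mean CV score:", search.best_score_)
```

`search.cv_results_["mean_test_score"]` holds one mean score per candidate, which is the same list the manual loop in question 4.3 builds.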


---
### Assignment wrap-up

Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://docs.google.com/forms/d/e/1FAIpQLSc0IBD2mdn4TcRyi-KNXVtS3aEg6U4mOFq2MOciLQyEP4bg1w/viewform?usp=sf_link" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

### Congratulations, you're done!

Submit this assignment by uploading it to the course Desire2Learn web page.  Go to the "Homework Assignments" folder, find the dropbox link for Homework 3, and upload your notebook **and the script you wrote**.