# CMSE 202 Homework 4 (Individual)

## Using SVM and PCA to predict the outcome of chess games

### Goals for this homework assignment

By the end of this assignment, you should be able to:

* Use `git` to track your work and turn in your assignment
* Read and impute data to prepare it for modeling
* Build, fit, and evaluate an SVC model of data
* Use PCA to reduce the number of important features
* Build, fit, and evaluate an SVC model of PCA-transformed data
* Systematically investigate the effects of the number of components on an SVC model of data


### Assignment instructions:

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are 25 points possible on this assignment. Point values for each part are included in the section headers.

This assignment is due at 11:59 pm on Friday, November 13th. It should be pushed to your repo (See Part 1). 

In [None]:
## Our imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.decomposition import PCA

---
## 1. Adding notebook to your turn-in repository

As you did for Homework 3, you're going to add this notebook to the CMSE202 repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. To do this, you need to:

* Navigate to your `/CMSE202/repos` repository and create a new directory called `hw-04`.
* Move this notebook into that new directory in your repository, then add it and commit it to your repository.
* Finally, to test that everything is working, `git push` the file so that it ends up in your GitHub repository.

Important: Make sure you've added your TA as a collaborator to your repository with "Read" access so that we can see your assignment. (*If you did this for Homework 3, you do not need to do it again.*)

* Section 001: tuethan
* Section 002: Luis-Polanco
* Section 003: DavidRimel

Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, none of your changes will be tracked.

If everything went as intended, the file should now show up on your GitHub account CMSE202 repository under the hw-04 directory that you just created. Periodically, you'll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

---
## 2. Chess Game Data

The data you will work with are configurations of a chess endgame. The data assume that a pawn is one move away from ["queening"](https://en.wikipedia.org/wiki/Promotion_(chess)) and the other pieces can be moved to perform different offensive or defensive actions. Each of the 36 features (columns) can take one of several values. **The details of the data matter a bit less for our purposes, but we are attempting to predict a win or loss by a given side.** If you really want to know about the data, you can look into a [classic text on Artificial Intelligence by Shapiro](https://www.amazon.com/Encyclopedia-Artificial-Intelligence-Stuart-Shapiro/dp/0471807486).

You will first do this with a full model, then investigate how well the model works after a PCA has been done on the data.

### 2.1 Read in the data

First you need to read in the data from `kr-vs-kp.data`. You can look at `kr-vs-kp.names` to see how the data is structured. But we give you the code for the column naming as there are so many features and they are unlabeled in the `.data` file.

```python
cols = ["bkblk","bknwy","bkon8","bkona","bkspr","bkxbq","bkxcr","bkxwp","blxwp","bxqsq","cntxt","dsopp","dwipd",
        "hdchk","katri","mulch","qxmsq","r2ar8","reskd","reskr","rimmx","rkxwp","rxmsq","simpl","skach","skewr",
        "skrxp","spcop","stlmt","thrsk","wkcti","wkna8","wknck","wkovl","wkpos","wtoeg","won"]
```
 
<font size=8 color="#009600">&#9998;</font> Do this - Read in the data from `kr-vs-kp.data` using the columns listed above. Print the `.head()` of the dataframe.

In [None]:
## your code here
chess_data = pd.read_csv('kr-vs-kp.data', names =["bkblk","bknwy","bkon8","bkona","bkspr","bkxbq","bkxcr","bkxwp","blxwp","bxqsq","cntxt","dsopp","dwipd",
 "hdchk","katri","mulch","qxmsq","r2ar8","reskd","reskr","rimmx","rkxwp","rxmsq","simpl","skach","skewr",
 "skrxp","spcop","stlmt","thrsk","wkcti","wkna8","wknck","wkovl","wkpos","wtoeg","won"])
chess_data.head()

### 2.2 Imputing the data

There are no missing data in this data file, but there are some other issues. 

When you print the head of this data set, you probably noticed that all the features and labels are strings. We need to replace them with numerical values for modeling. For the `won` column, replace winning with 1 and losing with 0. For the other columns, there are seven possible strings. Replace them using the following table:

| raw data | replaced |
| -------- | -------- |
| f | 1 |
| l | 2 |
| n | 3 |
| t | 4 |
| w | 5 |
| b | 6 |
| g | 7 |

**Note:** The choice of mapping really matters: for the models we have learned, it can strongly influence the results. We use this mapping because the model needs numerical input, but we haven't thought critically about which mapping makes the most sense. There are other models (e.g., [tree-based algorithms](https://en.wikipedia.org/wiki/Random_forest)) that can handle these categorical data without this mapping.
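
To make the concern in this note concrete, the sketch below contrasts the integer mapping from the table with one-hot encoding on a tiny made-up frame. The frame `toy` and its values are illustrative assumptions, not rows from `kr-vs-kp.data`, and `pd.get_dummies` is only one possible alternative encoding; it is not required for this assignment.

```python
import pandas as pd

# Hypothetical three-row frame that borrows two of the real column names.
toy = pd.DataFrame({"bkblk": ["f", "t", "f"], "katri": ["n", "w", "b"]})

# Integer mapping from the table: it imposes an order (f < l < n < t < ...)
# that the categories do not necessarily have.
mapping = {"f": 1, "l": 2, "n": 3, "t": 4, "w": 5, "b": 6, "g": 7}
ordinal = toy.apply(lambda col: col.map(mapping))

# One-hot encoding: one indicator column per (feature, value) pair, no order implied.
one_hot = pd.get_dummies(toy)

print(ordinal)
print(one_hot.columns.tolist())
```

The one-hot version avoids the artificial ordering but produces more columns, which is a trade-off to keep in mind when comparing models.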

<font size=8 color="#009600">&#9998;</font> Do this - Replace the entries in the columns as indicated above. Print the `.head()` of the dataframe to show you have successfully done so.

In [None]:
## your code here
# Map the outcome labels to integers.
chess_data['won'] = chess_data['won'].replace({'won': 1, 'nowin': 0})
# Map the seven feature strings to integers using the table above.
chess_data = chess_data.replace({'f': 1, 'l': 2, 'n': 3, 't': 4, 'w': 5, 'b': 6, 'g': 7})

In [None]:
chess_data.head()

### 2.3 Separate features and class labels

As we have seen in our analyses using `sklearn` it is advantageous to separate our dataframes into `features` and `labels` for the analysis we are intending to do.

<font size=8 color="#009600">&#9998;</font> Do this - Separate the data frame into two: a features dataframe and a labels dataframe.

In [None]:
## your code here
features = chess_data.iloc[:,:-1]
class_labels = chess_data['won']

In [None]:
won = chess_data[chess_data['won'] ==1]
nowin = chess_data[chess_data['won'] ==0]
print(len(won))
print(len(nowin))

In [None]:
chess_data.columns

**Question:** How balanced is your outcome variable? Why does it matter for the outcome to be balanced?

<font size=8 color="#009600">&#9998;</font> There are about 100 more win labels than no-win labels. This is a fairly balanced outcome variable, which means our model won't be trained to classify one label better than the other. If we had a significantly higher number of won labels, the model would be much better at classifying the won label, and we might end up with a much lower accuracy due to the resulting misclassifications. This is why it's crucial to have balanced data when training a model.
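
As a rough illustration of how class balance can be quantified, the sketch below uses a made-up label series (an assumption, not the real class counts) and computes the per-class shares along with the accuracy of always guessing the majority class. That baseline is a useful reference point when judging a model's accuracy.

```python
import pandas as pd

# Hypothetical labels: 1 = won, 0 = nowin (not the real class counts).
labels = pd.Series([1, 1, 1, 0, 0, 1, 0, 1])

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = shares.max()
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```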

---
## 3. Building an SVC model

For this classification problem, we will use a support vector machine. As you learned in the midterm review, we could easily replace this with any `sklearn` classifier we choose. We will use a linear kernel.

### 3.1 Splitting the data

<font size=8 color="#009600">&#9998;</font> Do this - Split your data into a training and testing set with a train size representing 75% of your data. Print the lengths to show you have the right number of entries.

In [None]:
## your code here
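train_vectors, test_vectors, train_labels, test_labels = train_test_split(features, class_labels, 
                                                                         train_size = .75, test_size = .25, 
                                                                         random_state =0)
# Print the size of each split to confirm the 75/25 division.
print(len(train_vectors), len(test_vectors), len(train_labels), len(test_labels))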

### 3.2 Modeling the data and evaluating the fit

As you have done this a number of times, we ask you to do most of the analysis in one cell.

<font size=8 color="#009600">&#9998;</font> Do this - Build a linear SVC model (`C=100`), fit it to the training set, use the test features to predict the outcomes. Evaluate the fit using the confusion matrix and classification report.

 **Note:** You should look at the documentation on the confusion matrix because the way `sklearn` outputs false positives and false negatives is different from what most images on the web indicate.
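
The layout mentioned in this note can be checked on a small example. The sketch below uses made-up true and predicted labels (assumptions for illustration only) to show that `confusion_matrix` puts true classes on the rows and predicted classes on the columns, in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = won, 0 = nowin).
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in sorted label order:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Unpacking with `ravel()` is a convenient way to avoid misreading which off-diagonal entry is which.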

In [None]:
## your code here
svm = SVC(kernel ='linear', C=100)
linear = svm.fit(train_vectors, train_labels)
y_pred = linear.predict(test_vectors)
confusion_matrix(test_labels, y_pred)

In [None]:
print(classification_report(test_labels, y_pred))

**Question:** How accurate is your model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

<font size=8 color="#009600">&#9998;</font> The model appears to be quite accurate: the precision and recall scores are both above .90, and the confusion matrix shows a high number of true positives and true negatives. We do have 16 false negatives and 19 false positives, so there are some misclassifications, but overall the model looks good. A recall close to 1 means the model finds most of the positive samples, and a precision close to 1 means most of its positive predictions are correct. Together these support the conclusion that the model is fairly accurate.
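
For reference, precision and recall for the positive class follow directly from the confusion-matrix counts. The helper below is a minimal sketch; the function name `precision_recall` and the counts passed to it are hypothetical and are not taken from this notebook's output.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class from raw counts."""
    precision = tp / (tp + fp)  # of all predicted positives, how many were right
    recall = tp / (tp + fn)     # of all actual positives, how many were found
    return precision, recall

# Hypothetical counts chosen only to illustrate the formulas.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```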

---
## 4. Finding and using the best hyperparameters

We have fit one model and determined its performance, but is it the best model? We can use `GridSearchCV` to find the best model (given our choices of parameters). Once we do that, we will use that best model going forward. **Note:** you would typically rerun this grid search in a production environment to continue to verify the best model, but we are not for the sake of speed.

### 4.1 Grid search

<font size=8 color="#009600">&#9998;</font> Do this - Using the following parameters (`C` = 1, 10, 100, 1000 and `gamma` = 1e-4, 1e-3, 0.01, 0.1) for both a `linear` and `rbf` kernel use `GridSearchCV` with the `SVC()` model to find the best fit parameters. Print the "best estimators".
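
One possible way to set up such a search is sketched below on synthetic data from `make_classification`, which stands in for the chess training split (an assumption made so the example is self-contained). Passing a list of dictionaries lets each kernel have its own grid; `cv=3` is an arbitrary choice to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); the notebook would use its training split.
X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           random_state=0)

# A list of dicts gives each kernel its own grid: `gamma` is ignored by the
# linear kernel, so it is only searched for `rbf`.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000],
     "gamma": [1e-4, 1e-3, 0.01, 0.1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_estimator_)
print(search.best_params_)
```

This evaluates 4 linear candidates and 16 rbf candidates in a single search, so the best estimator is chosen across both kernels at once.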

In [None]:
##GridSearch for linear model (note: gamma is ignored by the linear kernel, so only C matters here)
param_grid = {'C':[1,10,100,1000], 'gamma':[1e-4, 1e-3,0.01,0.1],}
clf = GridSearchCV(SVC(kernel='linear', class_weight='balanced'), param_grid)

clf = clf.fit(train_vectors, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)



In [None]:
##GridSearch for rbf model
param_grid = {'C':[1,10,100,1000], 'gamma':[1e-4, 1e-3,0.01,0.1],}
rbf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)

rbf = rbf.fit(train_vectors, train_labels)
print("Best estimator found by grid search:")
print(rbf.best_estimator_)

### 4.2 Evaluating the best fit model

Now that we have found the "best estimators", let's determine how good the fit is.

<font size=8 color="#009600">&#9998;</font> Do this - Use the test features to predict the outcomes for the best model. Evaluate the fit using the confusion matrix and classification report. 

**Note:** You should look at the documentation on the confusion matrix because the way `sklearn` outputs false positives and false negatives is different from what most images on the web indicate.

In [None]:
## your code here
linear_ypred = clf.predict(test_vectors)
rbf_ypred = rbf.predict(test_vectors)
print('The confusion matrix of the linear model is \n', confusion_matrix(test_labels, linear_ypred))
print('The confusion matrix of the rbf model is \n', confusion_matrix(test_labels, rbf_ypred))
print('The classification report of the linear model is \n', classification_report(test_labels, linear_ypred))
print('The classification report of the rbf model is \n', classification_report(test_labels, rbf_ypred))

**Question:** How accurate is this best model? What evidence are you using to determine that? How many false positives and false negatives does it predict?

<font size=8 color="#009600">&#9998;</font> The rbf model is the most accurate, with a precision of 1 and a recall of .99. The confusion matrix shows only 2 false positives and 4 false negatives, compared with true positive and true negative counts in the 300s. The linear model is also fairly accurate, with recall and precision both around .95. However, its false negatives and false positives are in the double digits, which is not the case for the rbf model. So the best model is the rbf model with the parameters we found using GridSearch.

---
## 5. Using Principal Components

The full model uses 36 features to predict the results, and you likely found that the model is incredibly accurate. But in some cases, we might have even more features (which means much more computational time), and we might not need nearly the level of accuracy we can achieve with the full data set. So, we will see how close we can get with fewer features. Instead of simply removing features, we will use a PCA to find the combinations of features (components) that account for the most variance and use those to build our SVC model.

### 5.1 Building a PCA

We will start with a small number of components (say, 4) to see how well we can predict the outcomes of the games.

<font size=8 color="#009600">&#9998;</font> Do this - Using `PCA()`, fit a pca to your training features with 4 components. Transform both the test and training features using this pca. Plot the `explained_variance_` versus component number.

In [None]:
## your code here
pca = PCA(n_components=4, whiten=True)
pca = pca.fit(train_vectors)
pca_train_vectors = pca.transform(train_vectors)
pca_test_vectors = pca.transform(test_vectors)

In [None]:
plt.plot(pca.explained_variance_ratio_, marker ='o')
plt.ylabel('pca explained variance ratio')
plt.xlabel('component number')
plt.title('explained variance ratio vs component number')

In [None]:
sum(pca.explained_variance_ratio_)

**Question:** What is the total explained variance captured by this PCA (we will use this later, just quote the number)? How well do you think a model with this many features will perform? Why?

<font size=8 color="#009600">&#9998;</font> The total explained variance ratio is .44, which is pretty low. I think that a model with this many features will not have a high accuracy since there are only 4 components. As we saw in earlier parts, the full model with 36 features was very accurate, so for this dataset it makes sense that such a low number of PCA components will not give us the desired high accuracy.
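
The total explained variance ratio is the sum of the per-component ratios, and a running total shows how much each extra component adds. The sketch below illustrates this on random synthetic features (an assumption; the numbers will not match the chess data).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in features (assumption): 200 rows, 10 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA(n_components=4).fit(X)

# Fraction of the total variance captured by each kept component,
# and the running total across components.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print(ratios)
print(f"total explained variance ratio: {cumulative[-1]:.3f}")
```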

### 5.2 Fit and Evaluate an SVC model

Using the pca transformed features, we will train and test an SVC model using the "best estimators".
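
As an alternative to transforming the features by hand, the two steps can be chained with `make_pipeline`, which fits the PCA on the training features only and reuses it for the test features. The sketch below runs on synthetic data (an assumption), and the hyperparameter values are placeholders rather than the result of a grid search.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split 75/25 like the notebook.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)

# The pipeline fits PCA on the training features only, then trains the SVC
# on the transformed training features.
model = make_pipeline(PCA(n_components=4), SVC(kernel="rbf", C=100, gamma=0.01))
model.fit(X_train, y_train)

# score() applies the already-fitted PCA to the test features before predicting.
accuracy = model.score(X_test, y_test)
print(f"test accuracy with 4 components: {accuracy:.3f}")
```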

<font size=8 color="#009600">&#9998;</font> Do this - Using the pca transformed training data, build and train an SVC model. Predict the classes using the pca transformed test data. Evaluate the model using the classification report and the confusion matrix.

In [None]:
## your code here
pca_svm = SVC(kernel ='rbf', C=100, gamma = 0.01)
pca_model = pca_svm.fit(pca_train_vectors, train_labels)
pca_ypred = pca_model.predict(pca_test_vectors)

In [None]:
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred))
print('The classification report is \n', classification_report(test_labels, pca_ypred))

**Question:** How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the full model?

<font size=8 color="#009600">&#9998;</font> The model is not that accurate. There are a large number of false positives and false negatives: 144 and 153, respectively. The precision, at .62, is much lower than for the full model, and the recall is within .1 of that value. The full model has significantly fewer false positives and false negatives, with recall and precision values greater than .90.

### 5.3 Repeat your analysis with more components

You probably found that the model with 4 components didn't work so well. What if we increase the number of components (say to 30, which is still 6 fewer than the full data set)? What happens now?

<font size=8 color="#009600">&#9998;</font> Do this - Repeat your analysis from 5.1 and 5.2 using 30 components instead.

In [None]:
## your code here
pca1 = PCA(n_components=30, whiten=True)
new_pca = pca1.fit(train_vectors)
new_pca_train_vectors = pca1.transform(train_vectors)
new_pca_test_vectors = pca1.transform(test_vectors)

In [None]:
plt.plot(pca1.explained_variance_ratio_, marker ='o')
plt.ylabel('pca explained variance ratio')
plt.xlabel('component number')
plt.title('explained variance ratio vs component number')

In [None]:
sum(pca1.explained_variance_ratio_)

In [None]:
pca_svm1 = SVC(kernel ='rbf', C=100, gamma = 0.01)
pca_model_new = pca_svm1.fit(new_pca_train_vectors, train_labels)
pca_ypred_new = pca_model_new.predict(new_pca_test_vectors)

In [None]:
print('The confusion matrix is \n', confusion_matrix(test_labels, pca_ypred_new))
print('The classification report is \n', classification_report(test_labels, pca_ypred_new))

**Question:** What is the total explained variance captured by this PCA? How accurate is this model? What evidence are you using to determine that? How many false positives and false negatives does it predict? How does it compare to the 4 component model? To the full model?

<font size=8 color="#009600">&#9998;</font> The total explained variance ratio for this PCA is around .99, which is much higher than for the last PCA model we ran. A higher explained variance ratio means the transformed features retain more of the information in the original data, which tends to help the model make better predictions. The confusion matrix shows 4 false negatives and 4 false positives, so compared with the number of true values this model does really well at correctly classifying vectors. This PCA does much better than the PCA with only 4 components: the classification report shows precision and recall scores extremely close to 1, whereas the 4-component model had a precision in the .60s. The PCA with 30 components also does better than the full model that we originally made in Part 3: both have a precision above .90, but the PCA model has a precision of .99 whereas the original full model has a precision of .95.

---
## 6. How well does a PCA work?

Clearly, the number of components we use in our PCA matters. Let's investigate how they matter by systematically building a model for any number of selected components.

### 6.1 Accuracy vs. Components

We will do this by writing a function that creates the PCA, creates the SVC model, fits the training data, predicts the labels using the test data, and returns the accuracy score and the explained variance. So your function will take as input:

* the number of components
* the training features
* the test features
* the training labels
* the test labels

and it will return the accuracy score for an SVC model fit to pca transformed features and the total explained variance.

<font size=8 color="#009600">&#9998;</font> Do this - Create this function, which you will use in the next section.

In [None]:
## your code here
def pca_function(components, train_v=train_vectors, test_v=test_vectors, 
                 train_labels=train_labels, test_labels = test_labels):
    pca = PCA(n_components=components, whiten=True)
    pca = pca.fit(train_v)
    transformed_train_vectors = pca.transform(train_v)
    transformed_test_vectors = pca.transform(test_v)
    svm = SVC(kernel ='rbf', C=100, gamma = 0.01)
    model = svm.fit(transformed_train_vectors, train_labels)
    y_pred = model.predict(transformed_test_vectors)
    total_var = sum(pca.explained_variance_ratio_)
    accuracy = accuracy_score(test_labels, y_pred)
    return total_var, accuracy

### 6.2 Compute accuracies

Now that you have created a function that returns the accuracy for a given number of components, we will use it to plot how the accuracy of your SVC model changes when we increase the number of components used in the PCA.

<font size=8 color="#009600">&#9998;</font> Do this - For 1 to 36 components, use your function above to compute and store (as a list) the accuracy of your models.

In [None]:
## your code here
output_list=[]
# range() already advances i, so no manual counter is needed.
for i in range(1,37,1):
    output = pca_function(components = i, train_v=train_vectors, test_v = test_vectors, 
                          train_labels = train_labels, test_labels = test_labels)
    output_list.append(output)
output_list = pd.DataFrame(output_list)
accuracy_list = output_list.iloc[:, 1]
accuracy_list

### 6.3 Plot accuracy vs number of components

Now that we have those numbers, it makes sense to look at the accuracy vs components.

<font size=8 color="#009600">&#9998;</font> Do this - Plot the accuracy vs components.

In [None]:
## your code here
component_list = list(np.arange(1,37,1))
plt.plot(component_list, accuracy_list)
plt.xlabel('number of components')
plt.ylabel('accuracy score')
plt.title('accuracy vs number of components')

**Question:** Where does it seem like we have diminishing returns, that is, no major increase in accuracy as we add additional components to the PCA?

<font size=8 color="#009600">&#9998;</font> Based on the plot, once we go above 20 components (more specifically, 22), we don't see a large increase in accuracy as we add additional components to the PCA. Accuracy seems to increase with each component until we reach a little above 20 components.

### 6.4 Plot total explained variance vs number of components

<font size=8 color="#009600">&#9998;</font> Do this - Plot the total explained variance vs components. 

In [None]:
## your code here
variance_list = output_list.iloc[:,0]
plt.plot(component_list, variance_list)
plt.xlabel('number of components')
plt.ylabel('total explained variance ratio ')
plt.title('total explained variance ratio vs number of components')

**Question:** Where does it seem like we have diminishing returns, that is, no major increase in explained variance as we add additional components to the PCA? How does that number of components compare to the diminishing returns for accuracy?

<font size=8 color="#009600">&#9998;</font> According to the plot above, we have diminishing returns at around 30 or so components. This is higher than the point of diminishing returns for accuracy: in this plot, the explained variance increases with almost every added component, so it takes longer to reach diminishing returns here than in the accuracy vs. component number plot.
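
Reading the point of diminishing returns off a plot is subjective, so it can help to pick an explicit threshold. The sketch below uses hypothetical per-component ratios (not values from this notebook) and a hypothetical helper `components_for` to find the smallest number of components whose cumulative explained variance ratio reaches a chosen threshold.

```python
import numpy as np

# Hypothetical per-component explained variance ratios (sorted, summing to 1).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
cumulative = np.cumsum(ratios)

def components_for(threshold, cumulative):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    # searchsorted returns the first index where cumulative >= threshold;
    # add 1 to convert the zero-based index into a component count.
    return int(np.searchsorted(cumulative, threshold) + 1)

print(components_for(0.85, cumulative))
```

The same idea applies to the accuracy curve: choose a tolerance relative to the best accuracy and report the first component count that falls within it.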

---
## 7. Assignment wrap-up
Please fill out the form that appears when you run the code below. **You must completely fill this out in order to receive credit for the assignment!**

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://docs.google.com/forms/d/e/1FAIpQLSc0IBD2mdn4TcRyi-KNXVtS3aEg6U4mOFq2MOciLQyEP4bg1w/viewform?usp=sf_link" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

### Congratulations, you're done!
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the "Homework Assignments" folder, find the dropbox link for Homework 4, and upload your notebook.