# Fall Problem Session 9
## Classifying Pumpkin Seeds III

In this notebook you continue to work with the pumpkin seed data from <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).


The problems in this notebook draw on material from our `Classification`, `Dimension Reduction` and `Ensemble Learning` notebooks. In particular, they touch on content from:
- `Classification/Support Vector Machines`,
- `Classification/Decision Trees`,
- `Ensemble Learning/Random Forests` and
- `Dimension Reduction/Principal Components Analysis`.

In [None]:
## Importing packages you will likely use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load and prepare the data

Run the code below in order to:

- Load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `Data` folder,
- Create a column `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik` and
- Make a train test split setting $10\%$ of the data aside as a test set.

In [None]:
## loading the data
seeds = pd.read_excel("../Data/Pumpkin_Seeds_Dataset.xlsx")

## making a target column, y
seeds['y'] = 0

seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

In [None]:
## importing train_test_split
from sklearn.model_selection import train_test_split

In [None]:
## making a stratified train test split
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                              shuffle=True,
                                              random_state=123,
                                              test_size=.1,
                                              stratify=seeds.y.values)
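
A stratified split keeps the class proportions roughly equal in the training and test sets. The sketch below illustrates this on a hypothetical toy data frame (`toy`, with 30% of rows in class 1) rather than the pumpkin seed data; the column names and sizes are assumptions made only for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## hypothetical toy data: 1000 rows, 30% of them in class 1
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=1000),
                    "y": np.repeat([0, 1], [700, 300])})

## same arguments as the split above, applied to the toy data
toy_train, toy_test = train_test_split(toy,
                                       shuffle=True,
                                       random_state=123,
                                       test_size=.1,
                                       stratify=toy.y.values)

## the fraction of class 1 should be about the same in both sets
train_frac = toy_train.y.mean()
test_frac = toy_test.y.mean()
print("Training fraction of y=1:", train_frac)
print("Test fraction of y=1:", test_frac)
```

You can run the same check on `seeds_train.y` and `seeds_test.y` if you want to confirm the split behaved as expected.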

#### 2. Refresh your memory

If you need to refresh your memory on these data and the problem, you may want to look at a small subset of the data, look back on `Fall Problem Session 7` and `Fall Problem Session 8`, and/or browse Figure 5 and Table 1 of the paper, <a href="pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>.

#### 3. A support vector machine classifier

In this problem you will work to build a support vector classifier on these data. Along the way you will get a closer look at iterative versions of the hyperparameter grid search. 

##### a.

Start by importing the support vector classifier from `sklearn`. Note that these data are not close to being linearly separable so we will not want `LinearSVC`.

In [None]:
## import SVC (not LinearSVC) here
from sklearn.

##### b.

You will now perform hyperparameter tuning on the `C` parameter of the support vector classifier. Fill in the missing pieces of the code below to perform 10-fold cross-validation for different values of `C`.

In [None]:
## import the correct kfold class
from sklearn.model_selection import 

## import Pipeline and StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

## import accuracy_score
from sklearn.metrics import accuracy_score

In [None]:
## this will isolate the feature columns for you
features = seeds_train.columns[:-2]

In [None]:
## set the number of CV folds
n_splits = 

## Make the kfold object
kfold = 

## the values of C you will try
Cs = [.01, .1, 1, 10, 25, 50, 75, 100, 125, 150]

## this will hold the CV accuracies
C_accs1 = np.zeros((n_splits, len(Cs)))


## the cross-validation loop
i = 0
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    j = 0
    ## loop through all possible values of C
    for C in Cs:
        ## make the model, fit it and get the prediction 
        
    
        
    
        pred = 
        
        ## record the accuracy of your prediction on the holdout set
        C_accs1[i, j] = accuracy_score(seeds_ho.y, pred)
        
        j = j + 1
    i = i + 1
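
If the structure of the loop above is unclear, here is a minimal sketch of the same pattern on synthetic data. It is one possible approach, not the intended solution: the data from `make_classification`, the choice of `StratifiedKFold`, the five folds, the three `C` values and all `toy_` names are assumptions made for illustration. Scaling happens inside a `Pipeline` so the scaler is fit only on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

## synthetic stand-in for the seed features and target
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=4, n_redundant=0,
                                   class_sep=2.0, random_state=0)

## a stratified kfold object (assumed choice for this sketch)
toy_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

toy_Cs = [.1, 1, 10]
toy_accs = np.zeros((5, len(toy_Cs)))

for i, (train_index, test_index) in enumerate(toy_kfold.split(X_toy, y_toy)):
    for j, C in enumerate(toy_Cs):
        ## scale, then fit a kernel support vector classifier
        pipe = Pipeline([("scale", StandardScaler()),
                         ("svc", SVC(C=C))])
        pipe.fit(X_toy[train_index], y_toy[train_index])

        ## predict on the holdout fold and record the accuracy
        pred = pipe.predict(X_toy[test_index])
        toy_accs[i, j] = accuracy_score(y_toy[test_index], pred)

## average over the folds (rows) to get one accuracy per C
toy_avg = toy_accs.mean(axis=0)
print(dict(zip(toy_Cs, toy_avg.round(3))))
```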

##### c.

Plot the average cross-validation accuracy against the $\log$ of `C`.

In [None]:
## make a figure object
plt.figure(figsize = (8,6))

## plot the log of C on the horizontal axis,
## the avg cv accuracy on the vertical axis
plt.plot()

## adding labels
plt.xlabel(r"$\log(C)$", fontsize=16)
plt.ylabel("Avg. CV Accuracy", fontsize=16)
plt.xticks(np.arange(-2,3,.5),fontsize=14)
plt.yticks(fontsize=14)

plt.show()

##### d.

A common follow-up to a grid search is a second, finer grid search whose endpoints are taken from the best-performing region of the first. The goal is to home in on the optimal value of the hyperparameter.

Create a new grid of `C` values using the plot you made in <i>c.</i> to determine the new endpoints. Then run cross-validation a second time. Plot the average accuracies against `C`.
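
As a rough illustration of building a refined grid, the sketch below takes the coarse grid from <i>b.</i> together with made-up average accuracies (`coarse_avg` is hypothetical, not a real result) and places an evenly spaced grid between the neighbors of the best coarse value. Reading the endpoints off your own plot is just as valid; the 21 grid points are an arbitrary choice.

```python
import numpy as np

## the coarse grid from part b.
coarse_Cs = np.array([.01, .1, 1, 10, 25, 50, 75, 100, 125, 150])

## hypothetical average CV accuracies, for illustration only
coarse_avg = np.array([.600, .800, .860, .880, .885,
                       .884, .883, .882, .881, .880])

## position of the best coarse value
best = int(np.argmax(coarse_avg))

## use the neighbors of the best value as the new endpoints,
## taking care not to step off either end of the grid
lo = coarse_Cs[max(best - 1, 0)]
hi = coarse_Cs[min(best + 1, len(coarse_Cs) - 1)]

## an evenly spaced finer grid between the new endpoints
fine_Cs = np.linspace(lo, hi, 21)
print(fine_Cs)
```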

In [None]:
## the values of C you will try
Cs = 

## this will hold the CV accuracies
C_accs2 = np.zeros((n_splits, len(Cs)))


## the cross-validation loop
i = 0
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    j = 0
    ## looping through your new C values
    for C in Cs:
        ## make the model, fit it and get the prediction
    
        
    
        pred = 

        ## record the accuracy of your prediction on the holdout set
        C_accs2[i, j] = accuracy_score(seeds_ho.y, pred)
        
        j = j + 1
    i = i + 1

In [None]:
plt.figure(figsize = (8,6))

## Plot the C values on the horizontal axis
## plot the avg CV accuracies on the vertical axis
plt.plot()

plt.xlabel("$C$", fontsize=16)
plt.ylabel("Avg. CV Accuracy", fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show()

##### e.

What was the optimal value of `C`, and what was the average cross-validation accuracy for that value?

##### Write your answer here

#### 4. Tuning a random forest with `GridSearchCV`

In this problem you will tune the `max_depth` and `n_estimators` hyperparameters of a random forest model. First you will use a `for` loop for the cross-validation. Then you will see how much easier your life could be with `GridSearchCV`.

##### a. 

Import `sklearn`'s random forest model for classification.

In [None]:
## Import the random forest classification model
from sklearn.

##### b.

Fill in the code below to find the values of `max_depth` and `n_estimators` with the highest average cross-validation accuracy.

In [None]:
## note this will take about 2 minutes to run

## the possible max_depth values you'll consider
max_depths = range(1, 11)

## you'll consider n_estimators = 100 and 500
n_trees = [100, 500]


## This will record the accuracies for each loop
rf_accs = np.zeros((n_splits, len(max_depths), len(n_trees)))


i = 0
## the cross-validation loop
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    j = 0
    ## looping through all possible max_depth values
    for max_depth in max_depths:
        k = 0
        ## looping through all possible n_estimators values
        for n_estimators in n_trees:
            print(i,j,k)
            ## make the random forest model object here
            ## set max_samples = int(.8*len(seeds_tt)) and set a random state
            
            
            ## get the prediction on the holdout set
            pred = 
            
            ## record the accuracy of the prediction
            rf_accs[i,j,k] = accuracy_score(seeds_ho.y,  pred)
            k = k + 1
        j = j + 1
    i = i + 1

In [None]:
## This gives you the values with the best Avg CV Accuracy
max_index = np.unravel_index(np.argmax(np.mean(rf_accs, axis=0), axis=None), 
                                       np.mean(rf_accs, axis=0).shape)


print("Maximum Depth:",max_depths[max_index[0]])
print("Number of trees:",n_trees[max_index[1]])
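
The `np.unravel_index` call above can look opaque. `np.argmax` with `axis=None` returns a position in the flattened array, and `np.unravel_index` converts that position back into a (row, column) pair. Here is a small sketch on a made-up accuracy array (`toy_accs` is hypothetical) with 2 folds, 3 depths and 2 tree counts.

```python
import numpy as np

## hypothetical accuracies: 2 folds x 3 max_depth values x 2 n_estimators values
toy_accs = np.array([[[.70, .72], [.80, .85], [.78, .79]],
                     [[.74, .72], [.82, .87], [.80, .81]]])

## average over the folds, leaving a 3 x 2 array
avg = toy_accs.mean(axis=0)

## position of the maximum in the flattened 3 x 2 array
flat = np.argmax(avg, axis=None)

## convert the flat position back into (row, column)
row, col = np.unravel_index(flat, avg.shape)

print("Flat index:", flat)
print("Row (depth index):", row, "Column (tree-count index):", col)
print("Best average accuracy:", avg[row, col])
```

Indexing the averaged array with the unraveled pair returns the corresponding average accuracy.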

In [None]:
## find the optimal mean CV accuracy here



##### c.

In this problem you will learn about `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, a handy class from `sklearn` that makes hyperparameter tuning through a grid search and cross-validation quicker to code up than writing out a series of `for` loops.


Read through the code chunks below and fill in the missing code to run the same grid search cross-validation you did above with `GridSearchCV`.

In [None]:
## first import GridSearchCV
from sklearn.model_selection import 

In [None]:
## This will also take about two minutes
grid_cv = GridSearchCV(, # first put the model object here
                          param_grid = {'max_depth':, # place the grid values for max_depth and
                                        'n_estimators':}, # and n_estimators here
                          scoring = , # put the metric we are trying to optimize here as a string, "accuracy"
                          cv = ) # put the number of cv splits here

## you fit it just like a model, model.fit(features, target)
## fit grid_cv here


Once a `GridSearchCV` object is fit, you can easily find the best hyperparameter combination and the optimal score, and you can access the best model.

In [None]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

In [None]:
## You can find the best score like so
## .best_score_
## You try


In [None]:
## Calling best_estimator_ returns the model with the 
## best avg cv performance after it has been refit on the
## entire data set
## You try to look at the best model


The `best_estimator_` is a model with the optimal hyperparameters that has been fit on the entire training set. Try to predict the pumpkin seed class on the training set with the `best_estimator_` below.

If you want to look at all of the results, you can do that as well with `.cv_results_`. See all that entails by running the code below.

In [None]:
## You can get all of the results with cv_results_
grid_cv.cv_results_
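
The raw `cv_results_` dictionary is hard to read. One common way to inspect it is to wrap it in a `pandas` data frame and sort by rank. The sketch below does this for a small grid search on synthetic data; the data, the grid and the `toy_` names are assumptions chosen so it runs quickly, not the settings used in this problem.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

## synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

## a deliberately tiny grid search
toy_grid = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=0),
                        param_grid={"max_depth": [1, 3, 5]},
                        scoring="accuracy",
                        cv=3)
toy_grid.fit(X_toy, y_toy)

## one row per grid point, one column per recorded quantity
results = pd.DataFrame(toy_grid.cv_results_)

## keep the most useful columns and put the best grid point first
summary = results[["param_max_depth",
                   "mean_test_score",
                   "std_test_score",
                   "rank_test_score"]].sort_values("rank_test_score")
print(summary)
```

The top row of `summary` should agree with `best_params_` and `best_score_`.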

##### d.

Using either the fitted `best_estimator_` or a model refit according to your results from the `for` loop cross-validation, find the feature importance scores. Try to refer back to your notes from `Fall Problem Session 7`. How do the scores compare to your initial EDA?
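
A fitted random forest stores its importance scores in the `feature_importances_` attribute. As a minimal sketch, the code below fits a forest on synthetic data in which the first two columns are informative and the last two are pure noise, then labels and sorts the scores. The data, the column names and the hyperparameter values are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

## with shuffle=False the informative columns come first
X_toy, y_toy = make_classification(n_samples=300, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   shuffle=False, random_state=0)
toy_names = ["informative_1", "informative_2", "noise_1", "noise_2"]

## fit a forest with arbitrary illustrative hyperparameters
toy_rf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
toy_rf.fit(X_toy, y_toy)

## pair each score with its feature name and sort from high to low
importances = pd.Series(toy_rf.feature_importances_,
                        index=toy_names).sort_values(ascending=False)
print(importances)
```

For the pumpkin seed data you would use `features` as the index in place of the hypothetical `toy_names`.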

In [None]:
## code here



##### Write any notes here




In the next notebook you will build a couple more models on these data and select a final model.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).