In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("classification.ipynb")

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
SEED = 2313

**Important note:** Unless the instructions say otherwise, it's OK if your series or data frame results have additional rows or columns that were not requested, as long as the required information is shown.

The following dataset collects information about Google searches related to major U.S. sports leagues (5 professional leagues, plus college football and college basketball), tabulated by designated market area (DMA). Each value is the share of all such searches in that DMA attributed to a given league. The last column is the share of the vote won by Donald Trump in the 2016 presidential election.

(Note how some extra parameters are used to designate the index column and header row within the CSV file.)


In [None]:
fans = pd.read_csv("NFL_fandom_data-google_trends.csv",index_col=0,header=1)
fans

## Preliminaries

**1.** Overwrite `fans` with a frame in which the percent signs have been stripped out of all the columns, which are then converted to floating-point values.
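As a hint, here is a minimal sketch of this kind of conversion on a made-up frame. The names `toy` and `cleaned`, the column labels, and the values are hypothetical and are not taken from the CSV file.

```python
import pandas as pd

# Hypothetical toy frame; the real column names and values come from the CSV.
toy = pd.DataFrame(
    {"A": ["45%", "30%"], "B": ["55%", "70%"]},
    index=["Region 1", "Region 2"],
)

# Strip the trailing percent sign from every column, then convert to float.
cleaned = toy.apply(lambda col: col.str.rstrip("%").astype(float))
print(cleaned.dtypes)
```

Applying the operation column by column keeps the index and column labels intact.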

In [None]:
fans = ...

In [None]:
grader.check("prelim-convert")

**2.** Create a feature matrix (in the form of a frame) from all the columns of `fans` except the last. 

In [None]:
X = ...

In [None]:
grader.check("prelim-features")

**3.** Make a Boolean label series that is `True` in each row where Trump got more than half of the vote.

In [None]:
y = ...

In [None]:
grader.check("prelim-labels")

**4.** Split the dataset, reserving 15% of it for testing and the rest for training. Make sure the split is shuffled, using random state 3383.
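For reference, a shuffled split with a fixed random state might look like the following sketch. The data and the names `X_toy`, `y_toy`, `Xtr`, `Xte`, `ytr`, and `yte` are made up, and the random state here is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 3 features each.
X_toy = np.arange(60).reshape(20, 3)
y_toy = np.arange(20) % 2 == 0

# test_size is the reserved fraction; shuffle=True is the default,
# and random_state makes the shuffle reproducible.
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.15, shuffle=True, random_state=0
)
print(Xtr.shape, Xte.shape)
```

Calling the function again with the same `random_state` reproduces the same partition.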

In [None]:
X_train, X_test, y_train, y_test = ...

In [None]:
grader.check("prelim-split")

## Decision tree

**1.** Train a decision tree of maximum depth 4 on the full dataset. Create a series of predictions using all the data, and compute the accuracy of the classifier.

**Important!** The decision tree classifier may randomly break ties among equivalent splits. To make your results reproducible, set `random_state` equal to `SEED` (predefined above) when you create the classifier.
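The sketch below illustrates the pattern on synthetic data, with the random state passed at construction time. The dataset and the names `X_toy`, `y_toy`, `tree`, `pred`, and `acc` are hypothetical, and the seed `0` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for a real feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# The random state is fixed when the classifier is created, not when it is fit.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_toy, y_toy)

# Predict on the same data used for training and measure the accuracy.
pred = tree.predict(X_toy)
acc = accuracy_score(y_toy, pred)
print(acc)
```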

In [None]:
dt = ...
yhat = ...
accuracy = ...

In [None]:
grader.check("tree-accuracy")

**2.** Considering "Trump > 50%" to be the "positive" case, how many false positives does the classifier have?

In [None]:
FP = ...

**3.** Considering "Trump > 50%" to be "positive," find the recall and F₁ score on the full dataset.
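As a reminder of how these quantities relate, here is a small sketch on hand-made Boolean labels. The arrays `y_true` and `y_pred` are invented for illustration. For Boolean labels, scikit-learn orders the confusion matrix as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Hypothetical labels and predictions; True is the positive class.
y_true = np.array([True, True, True, False, False, False, False, True])
y_pred = np.array([True, False, True, True, False, False, True, True])

# Rows are true classes and columns are predicted classes, ordered False, True.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

rec = recall_score(y_true, y_pred, pos_label=True)  # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label=True)       # 2 TP / (2 TP + FP + FN)
print(fp, rec, f1)
```

In this toy case there are 3 true positives, 2 false positives, and 1 false negative, so the recall is 3/4 and the F₁ score is 6/9.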

In [None]:
recall = ...
F1 = ...

In [None]:
grader.check("tree-scores")

**4.** Which of the sports leagues is most important to the classifier? What fraction of the overall impurity reduction does it account for?
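A fitted scikit-learn tree exposes normalized impurity reductions through its `feature_importances_` attribute. The sketch below uses synthetic data in which one column fully determines the label. The column names `a`, `b`, `c` and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label depends only on column "b".
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = X_toy["b"] > 0

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_toy, y_toy)

# Importances sum to 1; attaching the column names makes them easy to look up.
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
top_feature = importances.idxmax()
top_fraction = importances.max()
print(top_feature, top_fraction)
```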

In [None]:
league = ...
fraction = ...

In [None]:
grader.check("tree-importance")

## Nearest neighbors

**1.** Each row of the feature matrix `X` is a 7-dimensional vector. Find the 2-norm and 1-norm of the difference of the first two rows.
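For illustration, both norms can be computed with `np.linalg.norm` and its `ord` argument. The vectors below are made up and shorter than the rows of `X`.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for two rows.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -2.0, 3.0])
diff = a - b  # [-3, 4, 0]

norm_2 = np.linalg.norm(diff, ord=2)  # square root of the sum of squares
norm_1 = np.linalg.norm(diff, ord=1)  # sum of absolute values
print(norm_2, norm_1)
```

Here the difference is `[-3, 4, 0]`, so the 2-norm is 5 and the 1-norm is 7.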

In [None]:
two_norm = ...
one_norm = ...

In [None]:
grader.check("knn-norm")

**3.** Train a nearest neighbors classifier with 5 neighbors on the training set, and measure its accuracy on the test set.

In [None]:
accuracy = ...

In [None]:
grader.check("knn-accuracy")

**4.** Which DMA most resembles a region that has 18% share in each of the professional leagues, plus 5% share in each of *CBB* and *CFB*?
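One possible approach to this kind of query, sketched on invented data, is to ask a nearest-neighbors model for the closest stored row and then map the returned position back to an index label. The frame `places`, its labels, and the query point are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical frame whose index holds the names we want to recover.
places = pd.DataFrame(
    {"x": [0.0, 10.0, 5.0], "y": [0.0, 10.0, 1.0]},
    index=["P", "Q", "R"],
)

# The query uses the same column names as the fitted data.
query = pd.DataFrame({"x": [4.0], "y": [2.0]})

nn = NearestNeighbors(n_neighbors=1).fit(places)
distances, positions = nn.kneighbors(query)

# kneighbors returns row positions, not labels, so look up the index label.
closest = places.index[positions[0, 0]]
print(closest, distances[0, 0])
```

A fitted `KNeighborsClassifier` offers the same `kneighbors` method.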

In [None]:
region = ...

In [None]:
grader.check("knn-neighbors")

**5.** Repeat problem 3 above, but using standardized training data (i.e., Z-scores).
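A common way to standardize without leaking test information is to put the scaler and classifier in a pipeline, so that the means and standard deviations are learned from the training rows only. This is a sketch on synthetic data; every name and seed in it is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data and split.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The scaler is fit on the training data only, then reused on the test data.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(Xtr, ytr)
score = pipe.score(Xte, yte)
print(score)
```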

In [None]:
accuracy = ...

## Support vector machine


**1.** Using the full dataset, train a linear kernel SVM classifier with $C=0.1$. Find the distance from the last sample observation to the separating hyperplane.
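For a linear classifier with weight vector $w$ and intercept $b$, the distance from a point $x$ to the hyperplane $w \cdot x + b = 0$ is $|w \cdot x + b| / \|w\|_2$. The sketch below applies that formula to synthetic blobs. All names and the data are hypothetical, and `np.dot` is used here in place of a hand-written dot product.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical two-class data.
X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=0)

svm = SVC(kernel="linear", C=0.1).fit(X_toy, y_toy)
w = svm.coef_[0]        # weight vector (available for linear kernels only)
b = svm.intercept_[0]   # intercept

# Distance from the last sample to the separating hyperplane.
x_last = X_toy[-1]
distance = abs(np.dot(w, x_last) + b) / np.linalg.norm(w)
print(distance)
```

Equivalently, `svm.decision_function` returns $w \cdot x + b$, which can then be divided by the norm of `w`.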

In [None]:
def dot(u,v):
    return sum(u[i]*v[i] for i in range(len(u)))
dist = ...

In [None]:
grader.check("svm-dist")


**2.** Using the train/test split, compute *F*₁ scores (with "Trump > 50%" as "positive") for a linear kernel SVM classifier with $C=1$, and for an RBF kernel classifier with $C=10$.

In [None]:
linear = ...

rbf = ...

In [None]:
grader.check("svm-f1")

**3.** Repeat the previous problem, but use standardized values (Z-scores).

In [None]:
linear_standard = ...

rbf_standard = ...

In [None]:
grader.check("svm-f1_standard")

## Validation and selection


**1.** Create a K-fold cross-validation scheme with 6 folds, starting at random state `SEED`. Apply it to the training data to find a vector of balanced accuracy scores for a decision tree classifier with maximum depth 5.

**Important!** The decision tree classifier may randomly break ties among equivalent splits. To make your results reproducible, set `random_state` equal to `SEED` (predefined above) when you create the classifier.
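As a sketch of the mechanics on synthetic data: `KFold` accepts a `random_state` only when `shuffle=True`, and `cross_validate` returns a dictionary whose `"test_score"` entry holds one score per fold. The data, names, and seed below are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data.
X_toy, y_toy = make_classification(n_samples=240, n_features=6, random_state=0)

# shuffle=True is required for random_state to be accepted.
kf = KFold(n_splits=6, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0)

cv = cross_validate(tree, X_toy, y_toy, cv=kf, scoring="balanced_accuracy")
fold_scores = cv["test_score"]  # one balanced accuracy value per fold
print(fold_scores)
```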

In [None]:
test_score = ...

In [None]:
grader.check("vaild-kfold")

<!-- BEGIN QUESTION -->

**2.** Using the same setup as in the previous problem, plot cross-validated training and test errors as a function of maximum decision tree depth running from 2 through 12.
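One way to organize such a sweep, sketched on synthetic data, is to collect the mean training and test error for each hyperparameter value into a long-format frame. This sketch assumes that "error" means one minus the balanced accuracy. The data, the names, and the short depth range are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data and folds.
X_toy, y_toy = make_classification(n_samples=240, n_features=6, random_state=0)
kf = KFold(n_splits=6, shuffle=True, random_state=0)

rows = []
for depth in range(2, 6):  # a short range, just for illustration
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv = cross_validate(
        tree, X_toy, y_toy, cv=kf,
        scoring="balanced_accuracy", return_train_score=True,
    )
    # Assumed definition: error = 1 - mean score across the folds.
    rows.append({"depth": depth, "kind": "train", "error": 1 - cv["train_score"].mean()})
    rows.append({"depth": depth, "kind": "test", "error": 1 - cv["test_score"].mean()})

results = pd.DataFrame(rows)
print(results)
```

A frame shaped like `results` can be passed to seaborn, for example `sns.lineplot(data=results, x="depth", y="error", hue="kind")`.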

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**3.** Repeat the previous problem using k-nearest neighbor classifiers with the number of neighbors running from 2 through 20, applied to standardized scores.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**4.** Repeat the previous problem using an SVM classifier with RBF kernel and $C$ equal to $m/20$ as $m$ ranges over each integer from 1 through 20. Use standardized scores for the training.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**5.** For each of the three classifier types, select the hyperparameter value that gives the least error. For each classifier type, train on all of the training data at the optimum value, and then compute the confusion matrix on the test data.

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

Select *Kernel/Restart & Run All*, then save, then run this export cell again. Submit by pushing the resulting zip file to your GitHub assignment repo.

In [None]:
grader.export(pdf=False, force_save=True)