# DS 3000 Quiz 3

Due by: Monday Nov 14 @ 11:59 PM EST

Time Limit: You have 2 hours to complete the assignment once started

## Instructions

This quiz has 100 points total.

- You are welcome to post a private note on Piazza, but to keep a consistent testing environment for all students we are unlikely to provide assistance.
- You may not contact other students with information about this quiz
    - even saying "it was easy/hard" in a general sense can introduce a bias in favor of students who take the quiz earlier or later
- Under no circumstances should you share a copy of this quiz with anyone who isn't a member of the course staff.
- Take this quiz with open notes and feel free to access any online resource / documentation you'd like.  

### Submission Instructions
After completing the quiz, please follow the steps below to submit:
1. "Kernel" -> "Restart & Run All"
1. save your quiz file so that the saved copy reflects this latest run
1. upload the `.ipynb` to gradescope **before** clicking submit
1. ensure that you can see your jupyter notebook in the gradescope interface after clicking "submit"

We include the last step above because Gradescope has allowed students to "submit" without uploading a file.  It is your responsibility to ensure that you've actually submitted a file.

### Academic Integrity Pledge

Input your name below to sign the Academic Integrity Pledge before continuing with the quiz. Failure to do so will result in a score of **0**.

In [None]:
name = 'Student Name Here'
print(f'I, {name}, declare that the following work is entirely my own, and that I did not copy or seek help from any students who have currently or previously taken this course, nor from any online source other than private messages between myself and the professor on Piazza/via email.')

In [None]:
# the following modules will be necessary to complete the quiz
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from copy import copy
from collections import Counter
from sklearn import tree
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

## Part 1: k-Nearest Neighbors (30 points; 10 points each)

For this problem you will use the `new_fifa23.csv` file (available in the Quiz module on Canvas). It is a slightly altered version of the FIFA players data set you have used previously. We will use these data to see if we can predict a football player's Position (Backfield, Forward, Goalkeeper, Midfield, Wing) based on all the numeric features in the data.

In [None]:
df_fifa = pd.read_csv('new_fifa23.csv')
df_fifa.head()

**NOTE**: You are not expected to write all the code for this problem. To save time, you may take code from the lecture notes/homework and adapt it appropriately. If you do, please be careful to make sure you are adequately adapting the code to this problem, and **MAKE SURE TO COMMENT**.

## Part 1.1: Fit a Cross Validated k-NN

Use 5-fold cross validation to fit and predict the `PositionRole` of the players in the data set. Use the $k=6$ nearest neighbors (for this data set, a larger $k$ is better, but after $k=6$ the improvement is negligible). Do not forget to scale normalize all of your x-features (there are many features which are on different scales). To help you get started, you can get the feature names with:

```python
x_feat_list = list(df_fifa.columns[2:])
y_feat = 'PositionRole'
```
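
As a rough illustration of the overall workflow (scale, split into stratified folds, fit, predict), here is a minimal sketch on synthetic data rather than `new_fifa23.csv`. The toy DataFrame, its column names (`feat_a`, `feat_b`, `label`), and the choice to scale by dividing each feature by its standard deviation are assumptions for illustration only; adapt them to the approach used in lecture.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in data (hypothetical columns, not the FIFA data)
rng = np.random.default_rng(0)
n_per_class = 50
df_toy = pd.DataFrame({
    'feat_a': np.concatenate([rng.normal(0, 1, n_per_class),
                              rng.normal(5, 1, n_per_class)]),
    'feat_b': np.concatenate([rng.normal(0, 100, n_per_class),
                              rng.normal(500, 100, n_per_class)]),
    'label': ['low'] * n_per_class + ['high'] * n_per_class,
})

x_feat_list = ['feat_a', 'feat_b']
y_feat = 'label'

# scale normalization (assumed here): divide each feature by its std dev
df_scaled = df_toy.copy()
for feat in x_feat_list:
    df_scaled[feat] = df_scaled[feat] / df_scaled[feat].std()

x = df_scaled[x_feat_list].to_numpy()
y_true = df_scaled[y_feat].to_numpy()

# stratified folds keep class proportions similar in every fold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
knn = KNeighborsClassifier(n_neighbors=6)

# each sample is predicted exactly once, by a model that never saw it
y_pred = np.empty_like(y_true)
for train_idx, test_idx in skfold.split(x, y_true):
    knn.fit(x[train_idx], y_true[train_idx])
    y_pred[test_idx] = knn.predict(x[test_idx])
```

Without the scaling step, `feat_b` (whose values are roughly 100 times larger) would dominate the distance computation, which is why scaling matters for k-NN.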

## Part 1.2: Confusion Matrix and Overall Accuracy

Make a confusion matrix using the `y_true` values and the `y_pred` values from the 5-fold cross validation in the previous part. Then, use the `accuracy_score()` function to calculate the overall accuracy of the k-NN classifier.
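
For reference, a small sketch of how `confusion_matrix()` and `accuracy_score()` behave with more than two classes. The labels below are made up for illustration and are not taken from the FIFA data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# hypothetical labels, for illustration only
y_true = np.array(['GK', 'GK', 'FW', 'FW', 'FW', 'MF', 'MF', 'MF'])
y_pred = np.array(['GK', 'GK', 'FW', 'MF', 'FW', 'MF', 'FW', 'MF'])

# fixing the label order makes the rows and columns easy to interpret
labels = ['FW', 'GK', 'MF']

# row i = true class, column j = predicted class
conf_mat = confusion_matrix(y_true, y_pred, labels=labels)

# fraction of samples whose predicted label matches the true label
acc = accuracy_score(y_true, y_pred)
```

Here `conf_mat` is `[[2, 0, 1], [0, 2, 0], [1, 0, 2]]` and `acc` is `0.75`. The matrix can be drawn with `ConfusionMatrixDisplay(conf_mat, display_labels=labels).plot()`.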

## Part 1.3: Discuss

In a markdown cell, provide a discussion (at least one paragraph) of the results of the k-NN classifier. Be sure to discuss:
- What the overall accuracy tells us
- Which positions were predicted well
- Which positions were not predicted well
- Why some were predicted well and others weren't (you do not necessarily have to have an understanding of soccer to do this, but it may help to [read a bit about it](https://en.wikipedia.org/wiki/Association_football_positions): note that some intuition can be gained simply by looking at the table of contents of that wikipedia page)

## Part 2: Decision Trees (30 points; 6 points each part)

Load the `ratemyprof_sample.csv` (available in the Quiz module on Canvas). These data consist of randomly selected ratings from RateMyProfessor.com, and are adapted from the dataset available at [this site](https://data.mendeley.com/datasets/fvtfjyvw7d/2). You will use these data in Parts 2 and 3 to build Decision Trees and a Random Forest.

Our goal is to predict if a student would take a professor again (the `would_take_agains` feature is our output) using the remaining features, which include:
- `student_difficult`: 1 to 5 rating of how difficult the student found the professor (5 being the most difficult)
- `attence`: whether attendance was mandatory for the professor's class
- `gives_good_feedback` to `IsCourseOnline`: True/False features coded as 0 (False) or 1 (True) representing how students felt about the professor/class. Most are easily interpretable; if more information is required about a specific feature, please see the source website above.

In [None]:
df_rmp = pd.read_csv("ratemyprof_sample.csv")
df_rmp.head()

In [None]:
Counter(df_rmp['would_take_agains'])

**NOTE**: You are not expected to write all the code for this problem. To save time, you may take code from the lecture notes/homework and adapt it appropriately. If you do, please be careful to make sure you are adequately adapting the code to this problem, and **MAKE SURE TO COMMENT**.

## Part 2.1: Make a Single Tree

Using `max_depth = 3`, fit and plot a single Decision Tree. Make sure the plot is readable (i.e. that you can easily read the features being used to split the data, the gini, and the counts in each node).
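
One way to get a readable plot is to enlarge the figure and pass explicit feature and class names to `tree.plot_tree()`. The sketch below uses synthetic binary data with hypothetical feature names (`feat_0`, `feat_1`, `feat_2`), not the RateMyProfessor data; the figure size and font size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree

# synthetic binary data (hypothetical feature names, for illustration)
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(200, 3))
y = (x[:, 0] & x[:, 1]).astype(int)  # label depends on the first two features
feat_names = ['feat_0', 'feat_1', 'feat_2']

clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(x, y)

# a large figure and explicit names keep the text in each node legible
plt.figure(figsize=(14, 7))
tree.plot_tree(clf, feature_names=feat_names, class_names=['0', '1'],
               filled=True, fontsize=10)
```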

## Part 2.2: Predict a Professor

Predict (manually or with the `.predict()` function, your choice) if a student would take a professor again or not if:
- The student rated the professor as very difficult: `student_difficult = 5`
- The student rated the professor as not a tough grader: `tough_grader = 0`
- The student rated the professor as having amazing lectures: `amazing_lectures = 1`

**Then** discuss in a sentence or two what the gini value in the terminal node that your prediction came from means.
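
As a reminder of the definition, the Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the fraction of the node's samples in class $k$. The helper below (`gini_impurity` is a hypothetical name, and the counts are made up) is a small sketch of that formula.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError('node has no samples')
    return 1 - sum((count / total) ** 2 for count in class_counts)

# hypothetical node counts, for illustration only
pure_node = gini_impurity([0, 40])     # all samples in one class
mixed_node = gini_impurity([20, 20])   # evenly split between two classes
skewed_node = gini_impurity([10, 30])  # mostly one class
```

A pure node has impurity 0, an even two-class split has the two-class maximum of 0.5, and the 10/30 node falls in between at 0.375.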

## Part 2.3: Cross Validate

Predict whether students would take a professor again using Decision Trees with a 10-fold cross validation and `max_depth = None`.
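
If you prefer not to write the fold loop by hand, scikit-learn's `cross_val_predict()` returns one out-of-fold prediction per sample. The sketch below assumes that shortcut is acceptable in place of the explicit loop from lecture, and it runs on synthetic data rather than `ratemyprof_sample.csv`.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# synthetic binary features (for illustration only)
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=(300, 4))

# label follows feature 0, with about 10% of labels flipped as noise
flip = rng.random(300) < 0.1
y = np.where(flip, 1 - x[:, 0], x[:, 0])

clf = tree.DecisionTreeClassifier(max_depth=None, random_state=0)
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# each entry of y_pred comes from a tree fit on the other nine folds
y_pred = cross_val_predict(clf, x, y, cv=skfold)
```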

## Part 2.4: Confusion Matrix and Accuracy/Sensitivity/Specificity

Make a confusion matrix using the true `y` values and the `y_pred` values from the 10-fold cross validation in the previous part. Then, use the provided `get_acc_sens_spec()` function to calculate the accuracy, sensitivity, and specificity of the cross validated predictions from the previous part (the Decision Tree with `max_depth = None`).

In [None]:
def get_acc_sens_spec(y_true, y_pred, verbose=True):
    """ computes accuracy, sensitivity & specificity (assumes binary inputs)

    Args:
        y_true (np.array): binary ground truth per trial
        y_pred (np.array): binary prediction per trial
        verbose (bool): currently unused

    Returns:
        acc (float): accuracy
        sens (float): sensitivity
        spec (float): specificity
    """
    # line below stolen from sklearn confusion_matrix documentation
    tn, fp, fn, tp = confusion_matrix(y_true,
                                      y_pred).ravel()

    # compute sensitivity
    if tp + fn:
        sens = tp / (tp + fn)
    else:
        sens = np.nan

    # compute specificity
    if tn + fp:
        spec = tn / (tn + fp)
    else:
        spec = np.nan
        
    # compute acc
    acc = (tp + tn) / (tn + fp + fn + tp)

    return acc, sens, spec
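
As a quick sanity check of the definitions used in this function, the sketch below works through a small made-up example (the labels are hypothetical). It repeats the same `confusion_matrix(...).ravel()` unpacking so that it can run on its own.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical binary labels: 4 positives followed by 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# for binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sens = tp / (tp + fn)                  # share of true positives that were found
spec = tn / (tn + fp)                  # share of true negatives that were found
acc = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that were right
```

Here `tp = 3`, `fn = 1`, `tn = 4`, and `fp = 2`, so sensitivity is 3/4, specificity is 4/6, and accuracy is 7/10.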

## Part 2.5: Discuss

In a markdown cell, discuss the results from Part 2.4. In 3-4 sentences, describe (a) the performance of the decision tree and (b) one drawback of the single decision tree approach and how we might address it.

## Part 3: Random Forest (40 points; 10 points each part)

## Part 3.1: Cross Validated Random Forest

Using the same data from Part 2, fit a 10-fold cross validated Random Forest classifier on the RateMyProfessor data. Keep the defaults of `max_depth=None` and `n_estimators=100`.

**NOTE**: You are not expected to write all the code for this problem. To save time, you may take code from the lecture notes/homework and adapt it appropriately. If you do, please be careful to make sure you are adequately adapting the code to this problem, and **MAKE SURE TO COMMENT**.

## Part 3.2: Confusion Matrix and Accuracy/Sensitivity/Specificity

Make a confusion matrix using the true `y` values and the `y_pred` values from the 10-fold cross validation in the previous part. Then, use the `get_acc_sens_spec()` function to calculate the accuracy, sensitivity, and specificity of the cross validated predictions from the previous part (the Random Forest of 100 decision trees with `max_depth = None`).

## Part 3.3: Feature Importance

Use the `plot_feat_import` function provided below to display a plot of the top **5** most important features based on the Random Forest you fit.

In [None]:
def plot_feat_import(feat_list, feat_import, sort=True, limit=None):
    """ plots feature importances in a horizontal bar chart
    
    Args:
        feat_list (list): str names of features
        feat_import (np.array): feature importances (mean gini reduce)
        sort (bool): if True, sorts features in decreasing importance
            from top to bottom of plot
        limit (int): if passed, limits the number of features shown
            to this value    
    """
    
    if sort:
        # np.argsort returns indices in increasing order of importance;
        # plt.barh draws the first entry at the bottom, so the most
        # important feature ends up at the top of the plot
        idx = np.argsort(feat_import)
        feat_list = [feat_list[_idx] for _idx in idx]
        feat_import = feat_import[idx]
        
    if limit is not None:
        # keep the last `limit` entries (the most important when sort=True)
        feat_list = feat_list[-limit:]
        feat_import = feat_import[-limit:]
    
    # plot and label feature importance
    plt.barh(feat_list, feat_import)
    plt.gcf().set_size_inches(5, len(feat_list) / 2)
    plt.xlabel('Feature importance\n(Mean decrease in Gini across all Decision Trees)')
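
The importances this function expects can be read from a fitted forest's `feature_importances_` attribute. The sketch below fits a Random Forest on synthetic data with hypothetical feature names and picks out the five largest importances by hand; it illustrates the inputs and is not the required solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic data (for illustration): only the first two features carry signal
rng = np.random.default_rng(2)
n_samples, n_feat = 400, 8
x = rng.normal(size=(n_samples, n_feat))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
feat_list = [f'feat_{i}' for i in range(n_feat)]

rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)
rf.fit(x, y)

# one non-negative importance per feature; the values sum to 1
feat_import = rf.feature_importances_

# np.argsort sorts ascending, so the last five indices are the most important
top_idx = np.argsort(feat_import)[-5:]
top_feats = [feat_list[i] for i in top_idx]
```

With these arrays in hand, a call such as `plot_feat_import(feat_list, rf.feature_importances_, limit=5)` draws the corresponding bar chart.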

## Part 3.4: Discuss

In a markdown cell, provide a discussion (at least one paragraph) of the results of the Random Forest and their comparison to the results from the single decision tree:
- Compare the confusion matrix, accuracy, sensitivity, and specificity from the past two parts and explain what the differences mean
- Discuss the feature importance graph, what the top five important features represent, and whether it makes sense to you that they would be important in determining if a student would take a class again