## Overview



This notebook focuses on interpretability. You will use a Decision Tree model and a Support Vector Machine (SVM) to identify the features most predictive of diabetes. Here, the features are the words in the text documents.

## Pipeline task overview: Predicting presence of diabetes from text



Recall from your previous courses that a prediction task like this one can typically be broken into the following components:

 1. Data collection - <font color='green'>Done</font>
 2. Data cleaning / transformation - <font color='magenta'>You will do some in assignment 3 c</font>
 3. Dataset splitting - <font color='green'>Done</font>
 4. Model training - <font color='magenta'>You will do</font>
 5. Model evaluation - <font color='magenta'>You will do</font>
 6. Repeat 1-5 to perform model selection - <font color='magenta'>You will do</font>
 7. Presentation of findings (Visualization) - <font color='green'>Not required</font>



In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from sklearn.svm import SVC 
from sklearn.metrics import average_precision_score
from sklearn.feature_selection import RFE
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer

import data_cleaners as dc

## <font color='magenta'>Task One</font>

First, we'll load the raw data. This notebook uses basic sklearn packages rather than PyTorch, so you can write all of your code in the notebook itself.

In the first task, we will use the CountVectorizer object to build the vocabulary and the bag-of-words vector for each document. The shape of the NumPy array bow_vecs will be (1045, 15003). Before you submit, you can check your implementation by confirming that the first document's vector contains 192 ones:
```python
assert bow_vecs[0].sum()==192
```
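
As a rough illustration of how CountVectorizer builds a vocabulary and binary bag-of-words vectors, here is a minimal sketch on a tiny made-up corpus. The names `toy_docs`, `toy_vectorizer`, `toy_bow`, `toy_vocab` and `toy_index_to_word` are hypothetical and are not part of the assignment; the settings shown are one reasonable reading of the task description, not necessarily what the autograder expects.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the assignment data.
toy_docs = [
    "The patient has diabetes and the patient is stable",
    "Gauze was applied to the wound",
]

# lowercase=True is already the default; binary=True records presence/absence
# instead of counts; stop_words="english" drops common English words.
toy_vectorizer = CountVectorizer(lowercase=True, stop_words="english", binary=True)

# fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array.
toy_bow = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each word to its column index in the bag-of-words matrix.
toy_vocab = toy_vectorizer.vocabulary_
toy_index_to_word = {index: word for word, index in toy_vocab.items()}

print(toy_vocab)
print(toy_bow)
```

Note that "patient" occurs twice in the first toy document but still contributes a single 1, because `binary=True` discards counts.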

In [None]:
# This code loads in the raw X and y files. Use sklearn's CountVectorizer to convert the collection of text documents 
# to a matrix of token presence/absence. 
# Ensure that the CountVectorizer object converts all characters to lowercase and removes the stopwords before tokenizing.
# NOTE: we are only interested in binary values (True/False) rather than integer counts of words in a document.

# The CountVectorizer object gives you a vocabulary, which is a dictionary. Each key is a unique word
# and each value is the index of that word in the bag of words vectors.
# For example, the value for the key 'gauze' is 5944. That means, for each document, the value at index 5944
# of that document's vector will be 1 if the document contains the word 'gauze' and 0 otherwise.

X, y = dc.get_raw_data()
bow_vecs = None
vocab = None

# YOUR CODE HERE
raise NotImplementedError()

index_to_word = {v:k for k,v in vocab.items()} # reverse lookup of word from index

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Two</font>


Use a DecisionTreeClassifier (set random_state to 12345 for the autograder) and find the max_depth value in the range [5,10) that maximizes the average precision on the validation set.

Note that the expected format for this answer is a dictionary with max_depth in string format as the key and average precision as the value (sample format shown below).
```
{'5': 0.11111111111111111,
 '6': 0.21111111111111111,
 '7': 0.31111111111111111,
 '8': 0.41111111111111111,
 '9': 0.21212121212121212}
```
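
To make the shape of such a sweep concrete, here is a minimal sketch on synthetic data from `make_classification`. Everything prefixed with `toy_` or suffixed with `_tr`/`_va` is a hypothetical name, and the random seeds are arbitrary. The sketch scores with predicted probabilities of the positive class; the assignment text does not say whether the autograder expects probabilities or hard predictions, so treat that choice as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the assignment data).
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, y_tr = X_toy[:200], y_toy[:200]
X_va, y_va = X_toy[200:], y_toy[200:]

toy_scores = {}
for depth in range(5, 10):  # half-open range [5, 10)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Assumption: score using the predicted probability of the positive class.
    positive_proba = tree.predict_proba(X_va)[:, 1]
    toy_scores[str(depth)] = average_precision_score(y_va, positive_proba)

# The key with the highest score, converted back to an integer depth.
toy_best_depth = int(max(toy_scores, key=toy_scores.get))
print(toy_scores, toy_best_depth)
```

Keeping the depth as a string key matches the sample format above, which is why the best depth is converted back with `int(...)` before reuse.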

In [None]:
def get_best(X_train,y_train,X_val,y_val):
    score_dict = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return score_dict

X_train = bow_vecs[:900]
y_train = y.iloc[:900]
X_val = bow_vecs[900:1000]
y_val = y.iloc[900:1000]

params = get_best(X_train,y_train,X_val,y_val)

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Three A</font>

Train the model with the best value of max_depth, then find the five most predictive features of that model.

Note that the expected format for this answer is a list of tuples, each tuple containing the feature name and the importance (sample format shown below).

```
[('confused', 0.11223344556677889),
 ('cut', 0.09111111111111111),
 ('noninfectious', 0.08222222222222222),
 ('zyloprim', 0.07111111111111111),
 ('bloody', 0.05111111111111111)]
```
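
One common way to rank features of a fitted decision tree is its `feature_importances_` attribute. The sketch below shows the idea on synthetic data; the feature names (`word_0`, `word_1`, ...) and all `toy_` variables are hypothetical stand-ins for the real vocabulary, and the source does not confirm that this is the exact approach the autograder checks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and made-up feature names (not the assignment data).
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
toy_index_to_word = {i: f"word_{i}" for i in range(X_toy.shape[1])}

toy_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_toy, y_toy)

# feature_importances_ has one non-negative value per feature.
toy_importances = toy_tree.feature_importances_

# argsort is ascending, so reverse it and keep the first five indices.
toy_top_idx = np.argsort(toy_importances)[::-1][:5]
toy_top_five = [
    (toy_index_to_word[int(i)], float(toy_importances[i])) for i in toy_top_idx
]
print(toy_top_five)
```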

In [None]:
def get_five_best_features(maxdepth, X_train, y_train):
    # YOUR CODE HERE
    raise NotImplementedError()
    return None

best_maxdepth = None
# YOUR CODE HERE
raise NotImplementedError()
best_features = get_five_best_features(best_maxdepth, X_train, y_train)

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Three B</font>

Enter your thoughts on these features: do they make sense to you in light of the health content from this week? What questions do you have after looking at them?

YOUR ANSWER HERE

Does this Decision Tree model perform better or worse, on average, than the models you explored in notebook 3_a?

Set relative performance to "better" or "worse".

In [None]:
relative_performance_3_a = ""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

Does this Decision Tree model perform better or worse, on average, than the models you explored in notebook 3_b?

Set relative performance to "better" or "worse".

In [None]:
relative_performance_3_b = ""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

Does this Decision Tree model perform better or worse, on average, than the models you explored in notebook 3_c?

Set relative performance to "better" or "worse". 

In [None]:
relative_performance_3_c = ""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Four</font>

Use Recursive Feature Elimination (RFE) with an SVM that has a linear kernel, C=.001 and random_state=1234 to select the most predictive features.

Return the list of top 5 features. Note that the expected format for this answer is a numpy array of feature names (sample format shown below).
```
array(['confused', 'cut', 'hemoglobin', 'noninfectious', 'zyloprim'],
      dtype=object)
```
Set the step to 0.7 or greater but less than 1 to speed up the computation. **Note: do not set step to 1, as it will make the computation extremely slow.** Refer to the <a href='https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html'>documentation</a> for more details.
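
For orientation, here is a minimal sketch of RFE wrapped around a linear-kernel SVC, run on synthetic data. The feature names and all `toy_` variables are hypothetical, and the seed is arbitrary. When `step` is a float in (0, 1), scikit-learn reads it as the fraction of features to drop at each iteration, which is why a large value finishes in only a few rounds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in data and made-up feature names (not the assignment data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)
toy_names = np.array([f"word_{i}" for i in range(X_toy.shape[1])], dtype=object)

# A linear kernel exposes coef_, which RFE uses to rank features.
toy_svm = SVC(kernel="linear", C=0.001, random_state=0)

# step=0.7 removes up to 70% of the original feature count per iteration.
toy_selector = RFE(estimator=toy_svm, n_features_to_select=5, step=0.7)
toy_selector.fit(X_toy, y_toy)

# support_ is a boolean mask over the features; True marks the kept ones.
toy_selected = toy_names[toy_selector.support_]
print(toy_selected)
```

Indexing an array of names with the boolean `support_` mask returns the selected names in column order, which for an alphabetically ordered vocabulary matches the sorted sample output above.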

In [None]:
top_5_features = []
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell