<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Quiz 3 In-Class Review

 _**Authors:** Boom D. (DSI-NYC), Edited by Noelle B. (DSI-DEN)_

---

### Classification Models

**Q1.** Which method should you use when predicting a $Y$ that is discrete?

- _Regression models continuous $Y$_
- _Classification models discrete $Y$_

**Q2.** Suppose $Y$ is a binary variable that takes a value of $Y = 1$ if a person is at risk of heart disease, and $Y = 0$ otherwise. Also let $X_1$ = LDL cholesterol level and $X_2$ = a binary variable for whether or not a person smokes. A scientist decides to construct the following regression model:

$$ \log\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 $$

After fitting the model with data, the scientist finds that the estimate of $\beta_1$ is $5.8$.

What is one appropriate interpretation for the impact of LDL cholesterol level on the chances of being at risk of heart disease?

```python
import numpy as np
np.exp(5.8)  # 330.2995599096486
```

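_For every 1-unit increase in $X_1$, the odds of being at risk of heart disease are multiplied by about 330.3 ($e^{5.8}$), given everything else is held constant. This is a multiplicative change in the odds $\frac{P(Y=1)}{1-P(Y=1)}$, not in the probability itself._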
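
To see why $e^{\beta_1}$ multiplies the odds rather than the probability, here is a minimal sketch that converts a starting probability to odds, applies the odds ratio, and converts back. The baseline probabilities are hypothetical values chosen only for illustration; they are not part of the question.

```python
import math

beta_1 = 5.8                   # coefficient given in the question
odds_ratio = math.exp(beta_1)  # multiplicative change in the odds per unit of X_1

def prob_after_one_unit(p_before, ratio):
    """Probability after the odds implied by p_before are multiplied by ratio."""
    odds_before = p_before / (1 - p_before)
    odds_after = odds_before * ratio
    return odds_after / (1 + odds_after)

# Hypothetical baseline probabilities (assumed for illustration only).
for p in (0.001, 0.01, 0.5):
    print(p, "->", round(prob_after_one_unit(p, odds_ratio), 4))
```

The probability can never exceed 1, so a patient who starts at 0.5 cannot become "330 times as likely"; only the odds scale by that factor.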

**Q3.** $k$-Nearest Neighbors ($k$NN) is a classification algorithm where $k$ is a hyperparameter. Briefly state what $k$ represents and its role in how the $k$NN algorithm works.

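- _$k$ is the hyperparameter that controls how many of the closest neighbors (training observations) we are going to look at. To classify a new observation, $k$NN finds its $k$ nearest neighbors by distance and predicts the class held by the majority of them._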
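
The role of $k$ can be sketched in a few lines of plain Python. This is a toy illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the data points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the majority class among the k training points closest to x_new."""
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    distances.sort(key=lambda pair: pair[0])          # closest first
    k_labels = [label for _, label in distances[:k]]  # labels of the k nearest
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical toy data: three class-0 points and two class-1 points.
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9)]
y_train = [0, 0, 0, 1, 1]
x_new = (3.0, 3.0)

for k in (1, 3, 5):
    print(k, knn_predict(X_train, y_train, x_new, k))
```

With this data the prediction changes as $k$ grows: small values follow the nearest points, while $k = 5$ uses every point and returns the overall majority class.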

### Classification Metrics: Confusion Matrices

It is 2190 and an evil dictator has decided that the justice system should be completely automated by a classification algorithm of his own design that decides whether someone is guilty. Out of 100,000 people sampled, his model has produced the following results:
- 63,000 truly guilty people were predicted to be guilty
- 27,000 truly innocent people were predicted to be guilty
- 3,000 truly guilty people were predicted to be innocent
- 7,000 truly innocent people were predicted to be innocent

**Q4.** Identify the "positive" case in this problem.

_Positive case : Guilty_

**Q5.** State the number of **false positive** and **false negative** predictions.

- _FP : 27,000_
- _FN : 3,000_

**Q6.** Calculate the following metrics:

- Accuracy Rate
- Misclassification Rate
- Sensitivity
- Specificity
- Precision

- _Accuracy : (TP+TN)/ALL = 1-(FP+FN)/ALL = 70,000/100,000 = 0.7_
- _Misclassification : (FP+FN)/ALL = 30,000/100,000 = 0.3_
- _Sensitivity : TP/P = TP/(TP + FN) = 63,000/(63,000+3,000) = 63,000/66,000 ≈ 0.955_
- _Specificity : TN/N = TN/(TN + FP) = 7,000/(7,000+27,000) = 7,000/34,000 ≈ 0.206_
- _Precision : TP/(TP + FP) = 63,000/(63,000+27,000) = 63,000/90,000 = 0.7_
### Classification Metrics: AUC ROC

Recall that:
- Sensitivity : Out of all the Y = 1 observations, what proportion did we correctly identify?
- Specificity : Out of all the Y = 0 observations, what proportion did we correctly identify?

**Q7.** Which of these is also known as the True Positive Rate?

_Sensitivity_

**Q8.** If we predict $Y = 1$ for all observations, what happens to:
- Sensitivity
- Specificity
- False Positive Rate

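- _Sensitivity => Increases to 1 (every true positive is caught, so FN = 0)_
- _Specificity => Decreases to 0 (nothing is predicted negative, so TN = 0)_
- _False Positive Rate => Increases to 1 (FPR = 1 - Specificity)_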
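
A quick numeric check, reusing the class totals from the confusion-matrix problem above (66,000 truly guilty and 34,000 truly innocent) purely as an example: if every observation is predicted positive, the counts collapse as follows.

```python
# Class totals borrowed from the earlier confusion-matrix problem (illustration only).
n_pos = 63_000 + 3_000  # truly positive observations
n_neg = 27_000 + 7_000  # truly negative observations

# Predicting Y = 1 for everyone: all positives are caught, all negatives are flagged.
tp, fn = n_pos, 0
fp, tn = n_neg, 0

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)

print(sensitivity, specificity, false_positive_rate)
```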

**Q9. (True/False)** The Receiver Operating Characteristic (ROC) curve shows the trade-off between the True Positive Rate and False Positive Rate.

_True: the curve plots Sensitivity (TPR) against 1 - Specificity (FPR) as the classification threshold varies._

*Note:* If you need a visual for how the ROC curve is constructed from the data, check this out: http://mlwiki.org/index.php/ROC_Analysis#Example_1
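
For a code view of the same construction, the sketch below sweeps a threshold over predicted scores and records one (FPR, TPR) point per threshold. The labels and scores are made-up toy values, and `roc_points` is a name chosen for this example rather than a library function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical labels and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The curve always starts at (0, 0) and ends at (1, 1); the last point corresponds to predicting $Y = 1$ for every observation.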

**Q10.** What range of values can AUC ROC scores take on?

_0 to 1_

**Q11.** How would you handle a case where $0 \le \text{AUC ROC} < 0.5$?

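_The model ranks observations worse than random guessing. First check for a labeling or coding error in the target. Otherwise, invert your target class variables (0s to 1s and vice versa) or, equivalently, invert the predicted scores; an AUC of $a$ then becomes $1 - a$._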
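
The effect of inverting can be checked with a small pairwise AUC calculation. This is a sketch in plain Python, not the scikit-learn implementation; it uses the fact that AUC equals the share of (positive, negative) pairs in which the positive gets the higher score. The labels and scores are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data where the scores are mostly backwards.
y_true = [1, 1, 0, 1, 0, 0]
bad_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]

flipped_scores = [1 - s for s in bad_scores]     # invert the predictions
flipped_labels = [1 - y for y in y_true]         # or invert the target instead

print(auc(y_true, bad_scores))          # below 0.5
print(auc(y_true, flipped_scores))      # above 0.5
print(auc(flipped_labels, bad_scores))  # same as flipping the scores
```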

### Hyperparameter Tuning

**Q12.** A student is trying to optimize a model by tuning hyperparameters with `GridSearchCV()`. She runs the following code to do so.

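```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

params = {'C':       [0.01, 0.02, 0.1, 1, 10, 50, 100],
          'penalty': ['l1', 'l2']}
# solver='liblinear' supports both 'l1' and 'l2'; the default 'lbfgs' rejects 'l1'
logit = GridSearchCV(LogisticRegression(solver='liblinear'), cv=5, param_grid=params)
logit.fit(X_train, y_train)
C_optimal = logit.best_params_['C']
```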
How many model fits does the above code run behind the scenes during the search before arriving at the optimal hyperparameter combination?

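_$7 \times 2 \times 5 = 70$ model fits (7 values of `C`, 2 penalties, 5 folds each). With the default `refit=True`, `GridSearchCV` then fits the best combination once more on all of the training data, for 71 fits in total._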
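
The count can be reproduced by enumerating the grid by hand. This sketch does not call scikit-learn; it only mirrors the counting logic, and the extra fit assumes the default `refit=True`.

```python
from itertools import product

param_grid = {'C': [0.01, 0.02, 0.1, 1, 10, 50, 100],
              'penalty': ['l1', 'l2']}
cv_folds = 5

# Every combination of hyperparameter values in the grid.
combinations = list(product(*param_grid.values()))

n_cv_fits = len(combinations) * cv_folds  # one fit per combination per fold
n_total_fits = n_cv_fits + 1              # plus the final refit on all training data

print(len(combinations), n_cv_fits, n_total_fits)
```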

### Natural Language Processing (NLP)

**Q13.** Define the following terms:

- Stemming
- Lemmatization
- Tokenization
- Stop Words

- Stemming: Chopping off word endings (suffixes) with simple rules to reduce a word to a root form, which may not be a real word (e.g. the Porter stemmer turns "studies" into "studi")
- Lemmatization: Removing inflectional endings only, to return the base or dictionary form of a word, which is known as the lemma [(source)](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
- Tokenization: The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens [(source)](https://www.quora.com/NLP-What-does-it-mean-Tokenizing)
- Stop Words: Very common words (e.g. "the", "and", "is") that will likely have no significant impact on your NLP model and are often removed
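
The four terms can be contrasted on one sentence. Everything below is a deliberately naive toy: the stop-word set, the suffix rules, and the lemma lookup table are hypothetical stand-ins for what a real NLP library would provide.

```python
import re

STOP_WORDS = {"the", "a", "is", "are", "and", "of", "in", "were"}  # tiny illustrative set
LEMMA_LOOKUP = {"mice": "mouse", "running": "run", "fields": "field"}  # hypothetical dictionary

def tokenize(text):
    """Split lowercase text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip one common suffix by rule; the result may not be a real word."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def lemmatize(token):
    """Look up the dictionary form; fall back to the token itself."""
    return LEMMA_LOOKUP.get(token, token)

tokens = tokenize("The mice were running in the fields")
content = remove_stop_words(tokens)
stems = [naive_stem(t) for t in content]
lemmas = [lemmatize(t) for t in content]

print(tokens)
print(content)
print(stems)
print(lemmas)
```

The stems come out as rule-based truncations ("runn"), while the lemmas are dictionary forms ("run", "mouse"), which is the practical difference between the two.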

**Q14.** Briefly discuss (preferably with an example case) why it may not always be optimal to tokenize a string into single words only, i.e. why `ngram_range = (1,1)` is not always ideal.

_Two-word phrases, such as "not bad", can mean something different from their separate words: with single words only, "not" and "bad" both look negative even though the phrase is mildly positive._
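
A short sketch of what widening the n-gram range adds. The helper `ngrams` is written for this example and only imitates the idea behind `ngram_range`; it is not the vectorizer's own code.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "the movie was not bad".split()

unigrams = ngrams(tokens, 1, 1)  # analogous to ngram_range=(1, 1)
uni_bi = ngrams(tokens, 1, 2)    # analogous to ngram_range=(1, 2)

print(unigrams)
print(uni_bi)
```

Only the second list contains "not bad" as a single feature, so only there can a model learn that the phrase behaves differently from "bad" alone.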

**Q15.** Briefly explain the adjustment `TfidfVectorizer()` makes to `CountVectorizer()` and describe why we may sometimes prefer TF-IDF.

- CountVectorizer : Breaks a document into unique terms and makes a vector of counts from that document.
- TfidfVectorizer : Starts from the same counts but weights each term's frequency in a document by its inverse document frequency, so terms that appear in many documents across the corpus are down-weighted. TF-IDF considers the corpus holistically, and we may prefer it because it emphasizes terms that are distinctive to a document rather than merely common.
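
The weighting can be sketched with the textbook formula, term count times $\ln(N / \text{df})$. This is an illustration on a made-up three-document corpus; scikit-learn's vectorizer uses a smoothed variant and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat ate the fish"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: raw count times ln(N / document frequency)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = tfidf(tokenized[2])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

"the" has the highest raw count in the third document but a weight of zero because it appears in every document, while "fish" and "ate" score highest because they appear nowhere else.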

- *max_df : if a word appears in more than a certain proportion (or number) of documents, exclude the word*
- *min_df : a word must appear in at least a certain number (or proportion) of documents to be included in the vectorizer*