<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating Classification Models on Humor Styles Data

---

In this lab you will be practicing evaluating classification models (Logistic Regression in particular) on a "Humor Styles" survey.

This survey is designed to evaluate what "style" of humor subjects have. Your goal will be to classify gender using the responses on the survey.

## Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

    1. I usually don't laugh or joke with other people.
    2. If I feel depressed, I can cheer myself up with humor.
    3. If someone makes a mistake, I will tease them about it.
    4. I let people laugh at me or make fun of me at my expense more than I should.
    5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
    6. Even when I'm alone, I am often amused by the absurdities of life.
    7. People are never offended or hurt by my sense of humor.
    8. I will often get carried away in putting myself down if it makes family or friends laugh.
    9. I rarely make other people laugh by telling funny stories about myself.
    10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
    11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
    12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
    13. I laugh and joke a lot with my closest friends.
    14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
    15. I do not like it when people use humor as a way of criticizing or putting someone down.
    16. I don't often say funny things to put myself down.
    17. I usually don't like to tell jokes or amuse people.
    18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
    19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
    20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
    21. I enjoy making people laugh.
    22. If I am feeling sad or upset, I usually lose my sense of humor.
    23. I never participate in laughing at others even if all my friends are doing it.
    24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
    25. I don't often joke around with my friends.
    26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
    27. If I don't like someone, I often use humor or teasing to put them down.
    28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
    29. I usually can't think of witty things to say when I'm with other people.
    30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
    31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
    32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes ("like scale") that correspond to different answers. There is also a code that indicates there is no response for that subject.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true
    [-1 == Did not select an answer]
    
---

### Demographics:

    age: entered as as text then parsed to an integer.
    gender: chosen from drop down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: How accurate they thought their answers were on a scale from 0 to 100, answers were entered as text and parsed to an integer. They were instructed to enter a 0 if they did not want to be included in research.	

In [66]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

### 1. Load the data and perform any EDA and cleaning you think is necessary.

It is worth reading over the description of the data columns above for this.

In [74]:
hsq = pd.read_csv('../../../../resource-datasets/humor_styles/hsq_data.csv')

In [75]:
# A:
hsq.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,3,1,5,4,4,4,2,3,3,1,4,4,3,2,1,3,2,4,2,4,2,2,4.0,3.5,3.0,2.3,25,2,100
1,2,3,2,2,4,4,4,3,4,3,4,3,3,4,5,4,2,2,3,2,3,3,4,2,2,5,1,2,4,4,3,1,3.3,3.5,3.3,2.4,44,2,90
2,3,4,3,3,4,4,3,1,2,4,3,2,4,4,3,3,2,4,2,1,4,2,4,3,2,4,3,3,2,5,4,2,3.9,3.9,3.1,2.3,50,1,75
3,3,3,3,4,3,5,4,3,-1,4,2,4,4,5,4,3,3,3,3,3,4,3,2,4,2,4,2,2,4,5,3,3,3.6,4.0,2.9,3.3,30,2,85
4,1,4,2,2,3,5,4,1,4,4,2,2,5,4,4,4,2,3,2,1,5,3,3,1,1,5,2,3,2,5,4,2,4.1,4.1,2.9,2.0,52,1,80


In [85]:
hsq.rename(columns={'agressive':'aggressive'}, inplace=True)

In [89]:
# if we're only using the Q's then need to filter out the non-answered ones:
hsq = hsq.applymap(lambda x: np.nan if x==-1 else x)
hsq.dropna(inplace=True)

In [99]:
# only males and females
hsq = hsq[hsq.gender.isin([1,2])]

### 2. Set up a predictor matrix to predict `gender` (only male vs. female)

Choice of predictors is up to you. Justify which variables you include.

In [98]:
hsq.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'affiliative', 'selfenhancing', 'aggressive', 'selfdefeating',
       'age', 'gender', 'accuracy'],
      dtype='object')

In [103]:
# A:
predictors = [x for x in hsq.columns if 'Q' in x]
predictors = predictors + ['accuracy','gender']
X = hsq[predictors]
y = X.pop('gender')

### 3. Fit a Logistic Regression model and compare your cross-validated accuracy to the baseline.

In [104]:
# A:
baseline = y.value_counts(normalize=True).max()
y.value_counts(normalize=True)

1    0.547959
2    0.452041
Name: gender, dtype: float64

In [174]:
# first let's scale predictor:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Xs = pd.DataFrame(scaler.fit_transform(X),columns=X.columns, index=X.index)

# then fit the model:
logistic = LogisticRegression(C=10**10, solver='lbfgs')
logistic.fit(Xs,y)

# the cross-validated accuracy:
cross_val_score(logistic, Xs, y, cv=5)

array([0.58375635, 0.60406091, 0.55102041, 0.60512821, 0.56410256])

### 4. Create a 50-50 train-test split. Fit the model on the training data and get the predictions and predicted probabilities on the test data.

In [181]:
# A:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.5, random_state=1)

# make sure to re-scale on the training mean,std
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [182]:
logistic = LogisticRegression(C=10**10,solver='lbfgs')
logistic.fit(X_train,y_train)

print('Training Score:',logistic.score(X_train,y_train))
print('Test Score:',logistic.score(X_test,y_test))

Training Score: 0.6448979591836734
Test Score: 0.610204081632653


In [183]:
# now get the predictions for the test data:
predictions = logistic.predict(X_test)
predictions

array([1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
       2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2,
       2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1,
       1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2,
       2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2,
       1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2,
       2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1,
       1, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2,
       1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2,
       1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2,

In [184]:
# now get the predictions for the test data:
logistic.predict_proba(X_test)

array([[0.57424395, 0.42575605],
       [0.46616606, 0.53383394],
       [0.86903045, 0.13096955],
       [0.78161402, 0.21838598],
       [0.8524356 , 0.1475644 ],
       [0.59958417, 0.40041583],
       [0.55269636, 0.44730364],
       [0.45085974, 0.54914026],
       [0.60838725, 0.39161275],
       [0.37424874, 0.62575126],
       [0.47300775, 0.52699225],
       [0.43433224, 0.56566776],
       [0.593925  , 0.406075  ],
       [0.34406128, 0.65593872],
       [0.50421088, 0.49578912],
       [0.62940458, 0.37059542],
       [0.73953527, 0.26046473],
       [0.74431383, 0.25568617],
       [0.81810463, 0.18189537],
       [0.67686984, 0.32313016],
       [0.40396249, 0.59603751],
       [0.74854283, 0.25145717],
       [0.46392977, 0.53607023],
       [0.51570162, 0.48429838],
       [0.33002428, 0.66997572],
       [0.4314907 , 0.5685093 ],
       [0.39002903, 0.60997097],
       [0.75393845, 0.24606155],
       [0.31591299, 0.68408701],
       [0.51968729, 0.48031271],
       [0.

In [185]:
logistic.classes_

array([1, 2])

### 5. Manually calculate the true positives, false positives, true negatives, and false negatives.

In [186]:
# A:
# true positives will always be the most numerous class in the modelled training set,
# so in this case there are more of class 1 so that's +ve.
y_train.value_counts(normalize=True)

1    0.536735
2    0.463265
Name: gender, dtype: float64

In [187]:
tp = np.sum((y_test == 1) & (predictions == 1))
fp = np.sum((y_test == 2) & (predictions == 1))
tn = np.sum((y_test == 2) & (predictions == 2))
fn = np.sum((y_test == 1) & (predictions == 2))
print("tp:", tp)
print("fp:", fp)
print("tn:", tn)
print("fn:", fn)

tp: 187
fp: 104
tn: 112
fn: 87


### 6. Construct the confusion matrix. 

In [188]:
# A:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))

[[187  87]
 [104 112]]


### 7. Print out the false positive count as you change your threshold for predicting label 1.

In [189]:
# A:
# first create a df from predict_proba... MAKE SURE TO ADD ORIGINAL INDEX WHEN CREATING NEW DATAFRAMES,
# AS THESE NEW DF'S WILL HAVE THEIR OWN NEW INDEX!!! MAKING COMPARISON LATER IMPOSSIBLE.
Y_pp = pd.DataFrame(logistic.predict_proba(X_test),columns=['class_1_pp','class_2_pp'], index=X_test.index)
Y_pp.head()

AttributeError: 'numpy.ndarray' object has no attribute 'index'

In [177]:
def predict_at_threshold(x, threshold):
    if x >= threshold:
        return 1
    else:
        return 2

In [178]:
Y_pp['predict_at_threshold'] = Y_pp.class_1_pp.apply(predict_at_threshold, threshold=0.5)
np.sum((y_test == 2) & (Y_pp['predict_at_threshold'] == 1))

100

In [179]:
fp_list = []
for i in range(0, 100):
    Y_pp['predict_at_threshold'] = Y_pp.class_1_pp.apply(predict_at_threshold, threshold=i/100)
    fp_list.append(np.sum((y_test == 2) & (Y_pp['predict_at_threshold'] == 1)))
print("fp:", fp_list)

fp: [216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 216, 215, 215, 214, 214, 213, 213, 213, 213, 213, 213, 211, 206, 205, 204, 203, 197, 197, 195, 191, 186, 179, 176, 173, 169, 165, 157, 150, 146, 143, 137, 133, 126, 121, 118, 113, 106, 103, 100, 97, 92, 87, 82, 79, 74, 66, 58, 56, 51, 47, 46, 41, 35, 31, 29, 28, 21, 19, 14, 11, 9, 6, 6, 6, 4, 3, 3, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


### 8. Plot an ROC curve using your predicted probabilities on the test data.

Calculate the area under the curve.

> *Hint: go back to the lesson to find code for plotting the ROC curve.*

In [162]:
from sklearn.metrics import roc_curve, auc

In [168]:
# A:
# For class 1, find the area under the curve
fpr, tpr, threshold = roc_curve(y_test, Y_pp.class_1_pp)
roc_auc = auc(fpr, tpr)

# Plot of a ROC curve for class 1
plt.figure(figsize=[6, 6])
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('ROC curve', fontsize=18)
plt.legend(loc="lower right")
plt.show()

ValueError: Data is not binary and pos_label is not specified

#### 9. Cross-validate a logistic regression with a Ridge penalty.

Logistic regression can also use the Ridge penalty. Sklearn's [`LogisticRegressionCV`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) class will help you cross-validate an appropriate regularization strength.

**Important `LogisticRegressionCV` arguments:**
- `penalty`: this can be one of `'l1'` or `'l2'`. L1 is the Lasso, and L2 is the Ridge.
- `Cs`: How many different (automatically-selected) regularization strengths should be tested.
- `cv`: How many cross-validation folds should be used to test regularization strength.
- `solver`: When using the lasso penalty, this should be set to `'liblinear'`

> **Note:** The `C` regularization strength is the *inverse* of alpha. That is to say, `C = 1./alpha`

In [12]:
from sklearn.linear_model import LogisticRegressionCV

In [13]:
# A:

#### 9.A Calculate the predicted labels and predicted probabilities on the test set with the Ridge logisitic regression.

In [14]:
# A:

#### 9.B Construct the confusion matrix for the Ridge LR.

In [15]:
# A:

### 10. Plot the ROC curve for the original and Ridge logistic regressions on the same plot.

Which performs better?

In [16]:
# A:

### 11. Cross-validate a Lasso logistic regression.

**Hint:**
- `penalty` must be set to `'l1'`
- `solver` must be set to `'liblinear'`

> **Note:** The lasso penalty can be considerably slower. You may want to try fewer Cs or use fewer cv folds.

In [17]:
# A:

### 12. Make the confusion matrix for the Lasso model.

In [18]:
# A:

### 13. Plot all three logistic regression models on the same ROC plot.

Which is the best (if any)?

In [19]:
# A:

### 14. Look at the coefficients for the Lasso logistic regression model. Which variables are the most important ones?

In [20]:
# A: