<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating Classification Models on Humor Styles Data

_Authors: Kiefer Katovich (SF)_

---

In this lab you will practice evaluating classification models (logistic regression in particular) on a "Humor Styles" survey.

This survey is designed to evaluate what "style" of humor subjects have. Your goal will be to classify gender using the responses on the survey.

## Humor styles questions encoding reference

### 32 questions:

Subjects answered **32** different questions outlined below:

    1. I usually don't laugh or joke with other people.
    2. If I feel depressed, I can cheer myself up with humor.
    3. If someone makes a mistake, I will tease them about it.
    4. I let people laugh at me or make fun of me at my expense more than I should.
    5. I don't have to work very hard to make other people laugh. I am a naturally humorous person.
    6. Even when I'm alone, I am often amused by the absurdities of life.
    7. People are never offended or hurt by my sense of humor.
    8. I will often get carried away in putting myself down if it makes family or friends laugh.
    9. I rarely make other people laugh by telling funny stories about myself.
    10. If I am feeling upset or unhappy I usually try to think of something funny about the situation to make myself feel better.
    11. When telling jokes or saying funny things, I am usually not concerned about how other people are taking it.
    12. I often try to make people like or accept me more by saying something funny about my own weaknesses, blunders, or faults.
    13. I laugh and joke a lot with my closest friends.
    14. My humorous outlook on life keeps me from getting overly upset or depressed about things.
    15. I do not like it when people use humor as a way of criticizing or putting someone down.
    16. I don't often say funny things to put myself down.
    17. I usually don't like to tell jokes or amuse people.
    18. If I'm by myself and I'm feeling unhappy, I make an effort to think of something funny to cheer myself up.
    19. Sometimes I think of something that is so funny that I can't stop myself from saying it, even if it is not appropriate for the situation.
    20. I often go overboard in putting myself down when I am making jokes or trying to be funny.
    21. I enjoy making people laugh.
    22. If I am feeling sad or upset, I usually lose my sense of humor.
    23. I never participate in laughing at others even if all my friends are doing it.
    24. When I am with friends or family, I often seem to be the one that other people make fun of or joke about.
    25. I don't often joke around with my friends.
    26. It is my experience that thinking about some amusing aspect of a situation is often a very effective way of coping with problems.
    27. If I don't like someone, I often use humor or teasing to put them down.
    28. If I am having problems or feeling unhappy, I often cover it up by joking around, so that even my closest friends don't know how I really feel.
    29. I usually can't think of witty things to say when I'm with other people.
    30. I don't need to be with other people to feel amused. I can usually find things to laugh about even when I'm by myself.
    31. Even if something is really funny to me, I will not laugh or joke about it if someone will be offended.
    32. Letting others laugh at me is my way of keeping my friends and family in good spirits.

---

### Response scale:

For each question, there are 5 possible response codes (a Likert scale) that correspond to different answers. There is also a code indicating that the subject gave no response.

    1 == "Never or very rarely true"
    2 == "Rarely true"
    3 == "Sometimes true"
    4 == "Often true"
    5 == "Very often or always true"
    [-1 == Did not select an answer]
    
---

### Demographics:

    age: entered as text, then parsed to an integer.
    gender: chosen from a drop-down list (1=male, 2=female, 3=other, 0=declined)
    accuracy: how accurate subjects thought their answers were, on a scale from 0 to 100. Answers were entered as text and parsed to an integer. Subjects were instructed to enter 0 if they did not want to be included in research.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


### 1. Load the data and perform any EDA and cleaning you think is necessary.

It is worth reading over the description of the data columns above for this.
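
As a rough illustration of how the documented codes might be handled during cleaning, here is a minimal sketch on a tiny made-up frame (the `toy` DataFrame and its values are assumptions, not rows from the survey). It treats `-1` as a missing answer, keeps only gender codes 1 and 2, and drops subjects who entered an accuracy of 0.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that uses the documented codes (not the real survey data)
toy = pd.DataFrame({
    'Q1':       [2, -1, 4, 3, 5],
    'Q2':       [3, 4, -1, 2, -1],
    'gender':   [1, 2, 3, 0, 2],
    'accuracy': [90, 0, 80, 75, 100],
})
question_cols = ['Q1', 'Q2']

cleaned = toy.copy()

# -1 means "did not select an answer": turn it into a missing value
for col in question_cols:
    cleaned[col] = cleaned[col].mask(cleaned[col] == -1)

# Keep only male (1) and female (2) respondents
cleaned = cleaned[cleaned['gender'].isin([1, 2])]

# accuracy == 0 marks subjects who did not want to be included in research
cleaned = cleaned[cleaned['accuracy'] != 0]

print(cleaned)
```

Whether to drop, impute, or keep the missing answers afterwards is a judgment call that depends on how many there are in the real data.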

In [2]:
import pandas as pd
hsq = pd.read_csv('./datasets/hsq_data.csv')

#### Pare gender down to male and female, then recode the values to 0 (male) and 1 (female)

In [41]:
newhsq = hsq[(hsq.gender == 2) | (hsq.gender == 1)].copy()

In [42]:
# Recode gender: 1 (male) -> 0, 2 (female) -> 1
newhsq['gender'] = newhsq.gender.map({2: 1, 1: 0})
newhsq

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,...,4,2,2,4.0,3.5,3.0,2.3,25,1,100
1,2,3,2,2,4,4,4,3,4,3,...,4,3,1,3.3,3.5,3.3,2.4,44,1,90
2,3,4,3,3,4,4,3,1,2,4,...,5,4,2,3.9,3.9,3.1,2.3,50,0,75
3,3,3,3,4,3,5,4,3,-1,4,...,5,3,3,3.6,4.0,2.9,3.3,30,1,85
4,1,4,2,2,3,5,4,1,4,4,...,5,4,2,4.1,4.1,2.9,2.0,52,0,80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1066,3,2,3,3,2,5,3,2,3,4,...,4,4,4,2.5,3.3,2.9,3.0,18,1,95
1067,1,4,5,2,4,4,1,2,2,5,...,4,1,2,4.8,3.9,2.5,2.4,31,0,95
1068,1,4,4,5,4,4,3,5,4,3,...,4,1,5,4.4,3.9,3.0,4.3,15,0,95
1069,3,4,4,3,3,4,3,2,4,3,...,4,3,3,3.1,3.6,2.9,2.8,21,1,87


In [3]:

hsq.gender.unique()

array([2, 1, 3, 0])

In [51]:
newhsq.groupby('Q11')['gender'].mean()

Q11
-1    0.000000
 1    0.600000
 2    0.458580
 3    0.443350
 4    0.403361
 5    0.288660
Name: gender, dtype: float64

In [53]:
newhsq.groupby('Q15')['gender'].mean()

Q15
-1    0.142857
 1    0.262136
 2    0.356098
 3    0.388889
 4    0.504386
 5    0.591973
Name: gender, dtype: float64

In [56]:
newhsq.groupby('Q4')['gender'].mean()

Q4
-1    0.000000
 1    0.518248
 2    0.471572
 3    0.460784
 4    0.435556
 5    0.288889
Name: gender, dtype: float64

In [57]:
newhsq.groupby('Q20')['gender'].mean()

Q20
-1    0.500000
 1    0.496000
 2    0.476190
 3    0.388571
 4    0.375000
 5    0.242424
Name: gender, dtype: float64

In [31]:
hsq.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
0,2,2,3,1,4,5,4,3,4,3,...,4,2,2,4.0,3.5,3.0,2.3,25,2,100
1,2,3,2,2,4,4,4,3,4,3,...,4,3,1,3.3,3.5,3.3,2.4,44,2,90
2,3,4,3,3,4,4,3,1,2,4,...,5,4,2,3.9,3.9,3.1,2.3,50,1,75
3,3,3,3,4,3,5,4,3,-1,4,...,5,3,3,3.6,4.0,2.9,3.3,30,2,85
4,1,4,2,2,3,5,4,1,4,4,...,5,4,2,4.1,4.1,2.9,2.0,52,1,80


In [None]:
newhsq.groupby('accuracy')['gender'].mean()

### 2. Set up a predictor matrix to predict `gender` (only male vs. female)

Choice of predictors is up to you. Justify which variables you include.

In [78]:
# A: predictors: selfenhancing, agressive, accuracy, Q4, Q11, Q15, Q20

hsqfilter = ['selfenhancing','agressive','accuracy','Q4','Q11','Q15','Q20']
X = newhsq[hsqfilter]
y = newhsq.gender

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression()
logreg.fit(X,y)
pred = logreg.predict(X)

### 3. Use cross-validation to evaluate the accuracy of a logistic regression and compare result to the baseline.

In [80]:
# A:
# Store the in-sample predicted probability of label 1 (female)
newhsq['genderpred'] = logreg.predict_proba(X)[:, 1]

# 10-fold cross-validated accuracy
scores = cross_val_score(logreg, X, y, cv=10)
print(scores)
print(np.mean(scores))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[0.5754717  0.63207547 0.54716981 0.56603774 0.54716981 0.64150943
 0.61320755 0.63207547 0.67619048 0.58095238]
0.6011859838274932
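
The convergence warning above suggests scaling the data or raising `max_iter`. One possible way to do both is to put a `StandardScaler` and the model in a pipeline, so the scaler is refit inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, the real predictors are not used), with one column stretched to a 0-100-like range to mimic `accuracy`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey predictors (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0)

# Put one feature on a much larger scale than the others
X_demo[:, 0] = X_demo[:, 0] * 20 + 80

# The scaler is fit on each training fold only, then applied to the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=10)

print(scores_demo)
print(scores_demo.mean())
```

Passing the pipeline (rather than a pre-scaled matrix) to `cross_val_score` keeps information from the held-out fold out of the scaling step.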


In [75]:
y.mean()

0.45085066162570886
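
The mean of `y` is the share of label 1, so the baseline accuracy is the share of the larger class. A small sketch of that arithmetic, using only the two numbers printed above:

```python
# Share of label 1 among the male/female respondents, as printed by y.mean()
p_label_1 = 0.45085066162570886

# Always predicting the majority class is right this often
baseline_accuracy = max(p_label_1, 1 - p_label_1)

# Mean 10-fold cross-validated accuracy printed earlier
cv_mean_accuracy = 0.6011859838274932

print(round(baseline_accuracy, 4))
print(round(cv_mean_accuracy - baseline_accuracy, 4))
```

Here the baseline is about 0.549, so the cross-validated accuracy of about 0.601 beats it by roughly 5 percentage points.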

### 4. Create an 80-20 train-test split. Fit the model on the training data and get the predictions and predicted probabilities on the test data.
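
A minimal sketch of this step, assuming synthetic data in place of the survey predictors (the names `X_demo`, `y_demo`, `y_pred`, and `y_prob` are illustrative only). `test_size=0.2` gives the 80-20 split, and `stratify` is an optional choice that keeps the class balance similar in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0)

# 80% training rows, 20% test rows
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Hard class labels and the predicted probability of label 1 on the test rows
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(X_train.shape, X_test.shape)
```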

### 5. Construct the confusion matrix. 
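
For reference, `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. The sketch below uses small hand-written label lists (made up for illustration) so the four cells can be checked by eye.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true labels (0, 1); columns are predicted labels (0, 1)
cm = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```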

In [8]:
# A:

### 6. Print out the false positive count as you change your threshold for predicting label 1.
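
One way to approach this is to compare the predicted probabilities to each threshold and count the true-0 rows that end up predicted as 1. The probabilities below are made up for illustration; with a fitted model they would come from `predict_proba` on the test data.

```python
import numpy as np

# Made-up true labels and predicted probabilities of label 1
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.1, 0.3, 0.55, 0.7, 0.4, 0.6, 0.8, 0.9])

fp_counts = {}
for threshold in [0.2, 0.5, 0.8]:
    # Predict label 1 whenever the probability reaches the threshold
    y_pred_demo = (y_prob_demo >= threshold).astype(int)
    # False positives: predicted 1 while the true label is 0
    fp_counts[threshold] = int(np.sum((y_pred_demo == 1) & (y_true_demo == 0)))
    print(threshold, fp_counts[threshold])
```

Raising the threshold makes the model more reluctant to predict label 1, so the false positive count can only stay the same or fall.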

In [9]:
# A:

### 7. Plot a ROC curve using your predicted probabilities on the test data.

Calculate the area under the curve.

> *Hint: go back to the lecture to find code for plotting the ROC curve.*
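
As a reminder of the call order, `roc_curve` takes the true labels first and the scores (here, predicted probabilities of label 1) second. The four points below are made up for illustration, small enough to verify the area by hand.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up true labels and predicted probabilities of label 1
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# True labels first, scores second
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)
roc_auc_demo = auc(fpr_demo, tpr_demo)

print(fpr_demo)
print(tpr_demo)
print(roc_auc_demo)
```

An area of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and 1.0 to a perfect ranking.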

In [10]:
from sklearn.metrics import roc_curve, auc

In [11]:
# A:

# pass your true test labels and your predicted probabilities to the roc_curve function
fpr, tpr, _ = roc_curve()
roc_auc = auc(fpr, tpr)

plt.figure(figsize=[8,8])
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic: is female', fontsize=18)
plt.legend(loc="lower right")
plt.show()

### Look at the coefficients for the logistic regression model. Which variables are the most important?
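
Coefficient sizes are only comparable when the predictors share a scale, so a common approach is to standardize first and then rank by absolute value. This is a sketch on synthetic data with hypothetical feature names (`f0` to `f4`), not a statement about which survey variables matter.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']

# Standardize so that coefficient magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)

# coef_ has shape (1, n_features) for a binary problem
coefs = pd.Series(model.coef_[0], index=feature_names)

# Order by absolute size, keeping the sign for interpretation
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

A positive coefficient raises the predicted log-odds of label 1 as the feature increases; a negative one lowers it.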

In [None]:
# A:

# Regularization

### 8. Cross-validate a logistic regression with a Ridge penalty.

Logistic regression can also use the Ridge penalty. Sklearn's `LogisticRegressionCV` class will help you cross-validate an appropriate regularization strength.

**Important `LogisticRegressionCV` arguments:**
- `penalty`: this can be one of `'l1'` or `'l2'`. L1 is the Lasso, and L2 is the Ridge.
- `Cs`: How many different (automatically-selected) regularization strengths should be tested.
- `cv`: How many cross-validation folds should be used to test regularization strength.
- `solver`: When using the Lasso penalty, this should be set to a solver that supports L1, such as `'liblinear'`.

> **Note:** The `C` regularization strength is the *inverse* of alpha. That is to say, `C = 1./alpha`
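
A minimal sketch of a cross-validated Ridge logistic regression, assuming synthetic data in place of the survey predictors. L2 is the default penalty of `LogisticRegressionCV`, so it is not passed explicitly here. The selected `C` is read defensively with `np.ravel` so the example does not depend on the exact shape of `C_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (assumption: not the real survey)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0)

# Try 10 automatically chosen regularization strengths with 5-fold CV
ridge_cv = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
ridge_cv.fit(X_demo, y_demo)

# C is the inverse of alpha, so a small C means strong regularization
best_C = float(np.ravel(ridge_cv.C_)[0])
equivalent_alpha = 1.0 / best_C

print(best_C, equivalent_alpha)
```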

In [12]:
from sklearn.linear_model import LogisticRegressionCV

In [13]:
# A:

**8.B Calculate the predicted labels and predicted probabilities on the test set with the Ridge logistic regression.**

In [14]:
# A:

**8.C Construct the confusion matrix for the Ridge LR.**

In [15]:
# A:

### 9. Plot the ROC curve for the original and Ridge logistic regressions on the same plot.

Which performs better?

In [16]:
# A:

### 10. Cross-validate a Lasso logistic regression.

**Remember:**
- `penalty` must be set to `'l1'`
- `solver` must support the L1 penalty; `'liblinear'` is a common choice

> **Note:** The lasso penalty can be considerably slower. You may want to try fewer Cs or use fewer cv folds.
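
The settings above can be sketched as follows, again on synthetic data rather than the survey. It uses the `penalty` and `solver` arguments as described in this lab, with a small `Cs` grid and few folds to keep the run short. Counting exact zeros in `coef_` shows how many features the L1 penalty dropped, if any.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data with several uninformative features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0, random_state=0)

# L1 (Lasso) penalty with the liblinear solver; fewer Cs and folds for speed
lasso_cv = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=5, cv=3)
lasso_cv.fit(X_demo, y_demo)

# Coefficients shrunk exactly to zero correspond to dropped features
n_zero = int(np.sum(lasso_cv.coef_ == 0))

print(lasso_cv.coef_.shape, n_zero)
```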

In [17]:
# A:

### 11. Make the confusion matrix for the Lasso model.

In [18]:
# A:

### 12. Plot all three logistic regression models on the same ROC plot.

Which is the best? (if any)

In [19]:
# A: