## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

**Answer**:

(1) Do left-handed people identify with stronger emotions for particular events?

(2) Do people who are left-handed lean towards violent personality trait-based responses?

(3) Do left-handed people prefer more artistic activities in general?

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [4]:
# Import dataset
df = pd.read_csv('data.csv', delimiter='\t')
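
To see why a plain `pd.read_csv()` call is not enough for a tab-separated file, here is a minimal sketch. The three-column sample text is made up for illustration and is not the real `data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-column sample that mimics a tab-separated file.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# The default separator is a comma, so each line is read as a single field.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep='\t' splits the fields correctly.
right = pd.read_csv(io.StringIO(sample), sep="\t")

print(wrong.shape, right.shape)
```

A quick look at `df.shape` or `df.columns` right after loading is usually enough to catch a wrong separator.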

**Note**:

This data was collected from an interactive version of the Open Sex Role Inventory in 2014.
The following items were rated on a five point scale, with the labels *1=Disagree, 3=Neutral, 5=Agree*:

|No.|Description|
|:-|:-|
|Q1|I have studied how to win at gambling.|
|Q2|I have thought about dying my hair.|
|Q3|I have thrown knives, axes or other sharp things.|
|Q4|I give people handmade gifts.|
|Q5|I have day dreamed about saving someone from a burning building.|
|Q6|I get embarrassed when people read things I have written.|
|Q7|I have been very interested in historical wars.|
|Q8|I know the birthdays of my friends.|
|Q9|I like guns.|
|Q10|I am happiest when I am in my bed.|
|Q11|I did not work very hard in school.|
|Q12|I use lotion on my hands.|
|Q13|I would prefer a class in mathematics to a class in pottery.|
|Q14|I dance when I am alone.|
|Q15|I have thought it would be exciting to be an outlaw.|
|Q16|When I was a child, I put on fake concerts and plays with my friends.|
|Q17|I have considered joining the military.|
|Q18|I get dizzy when I stand up sharply.|
|Q19|I do not think it is normal to get emotionally upset upon hearing about the deaths of people you did not know.|
|Q20|I sometimes feel like crying when I get angry.|
|Q21|I do not remember birthdays.|
|Q22|I save the letters I get.|
|Q23|I playfully insult my friends.|
|Q24|I oppose medical experimentation with animals.|
|Q25|I could do an impressive amount of push ups.|
|Q26|I jump up and down in excitement sometimes.|
|Q27|I think a natural disaster would be kind of exciting.|
|Q28|I wear a blanket around the house.|
|Q29|I have burned things up with a magnifying glass.|
|Q30|I think horoscopes are fun.|
|Q31|I don't pack much luggage when I travel.|
|Q32|I have thought about becoming a vegetarian.|
|Q33|I hate shopping.|
|Q34|I have kept a personal journal.|
|Q35|I have taken apart machines just to see how they work.|
|Q36|I take lots of pictures of my activities.|
|Q37|I have played a lot of video games.|
|Q38|I leave nice notes for people now and then.|
|Q39|I have set fuels, aerosols or other chemicals on fire, just for fun.|
|Q40|I really like dancing.|
|Q41|I take stairs two at a time.|
|Q42|I bake sweets just for myself sometimes.|
|Q43|I think a natural disaster would be kind of exciting.|
|Q44|I decorate my things (e.g. stickers on laptop).|

On the next page the following questions were administered:

|Var|Qn|Responses|
|:-|:-|:-|
|engnat|"Is English your native language?"|1=Yes, 2=No|
|age|"What is your age?", entered as text|(ages < 13 not recorded)|
|education|"How much education have you completed?"|1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree|
|gender||1=Male, 2=Female, 3=Other|
|orientation||1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other|
|race||1=Mixed race, 2=Asian, 3=Black, 4=Native American, 5=Native Australian, 6=White, 7=Other|
|religion||1=Atheist/Agnostic, 2=Christian, 3=Muslim, 4=Jewish, 5=Hindu, 6=Buddhist, 7=Other|
|hand|"What hand do you use to write with?"|1=Right, 2=Left, 3=Both|

The following technical data was also obtained:

|Var|Responses|
|:-|:-|
|country|where the user's computer was located (using MaxMind GeoIPLite), ISO country code|
|fromgoogle|1=HTTP_referer contained '.google.', 2=it did not|
|introelapse|how many seconds from when the introduction page was loaded until the user started the test|
|testelapse|how many seconds from when the test was started until the page with the test items was submitted|

### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

**Answer**: (1) Unique identifiers should be excluded from the survey, (2) Data should be stored accurately and securely, (3) Participants should not be allowed to discuss their responses with each other.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [5]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [90]:
df['hand'].value_counts(normalize=True)

1    0.846558
2    0.108031
3    0.042782
0    0.002629
Name: hand, dtype: float64

In [106]:
# Q27 and Q43 are the same statement. Repeated items like this are usually a
# validation mechanism to check that respondents are doing the survey properly,
# so keep only the rows where the two answers match, then drop the duplicate column.
# Calling .drop() on the filtered result (instead of inplace=True on a slice)
# avoids pandas' SettingWithCopyWarning.
df2 = df.loc[df['Q27'] == df['Q43'], :].drop(columns='Q43')


In [107]:
# Remove rows with no response for 'hand' (code 0); build the mask from df2 itself
df3 = df2.loc[df2['hand'] != 0]
df3['hand'].unique()

array([3, 2, 1], dtype=int64)

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

**Answer**: You're either left-handed or you're not - That's a binary question. So this would be a classification problem.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

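**Answer**: Variables are usually measured on different scales, so they may not contribute equally when fitting the model: a variable with a large range can dominate distance calculations or regularization penalties. For example, we would standardize before running $k$-nearest neighbors on age (in years) and income (in dollars).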

### 7. Give an example of when we might not standardize our variables.

**Answer**: When the algorithm is not sensitive to the magnitude of the variables (e.g. logistic regression without regularization), or when all variables are already measured on the same scale.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

**Answer**: K-nearest neighbours is a distance-based classifier that classifies new observations based on similarity measures (eg. distance metrics) with labelled observations of the training set. Standardisation is usually required.

In this case, since our features are already on a common scale (every item is rated from 1 to 5), standardizing is not strictly necessary.
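
To make the effect of scale on a distance-based method concrete, here is a minimal sketch. The three respondents and the `age` column are made up for illustration; in this lab the predictors Q1 - Q44 all share the 1 to 5 scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical respondents: a 1-5 survey answer and an age in years.
X_demo = np.array([[1.0, 20.0],
                   [5.0, 22.0],
                   [1.0, 60.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw scale: the age column dominates the distance.
raw_01, raw_02 = dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2])

# After standardization each column has mean 0 and unit variance.
Z = StandardScaler().fit_transform(X_demo)
std_01, std_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:          {raw_01:.2f} vs {raw_02:.2f}")
print(f"standardized: {std_01:.2f} vs {std_02:.2f}")
```

On the raw scale, a 40-year age gap swamps a maximal difference in the survey answer. After standardizing, the two differences count about equally.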

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

**Answer**: The target (y) variable needs to have two classes: `1=left-handed` and `0=not left-handed`

The column currently has four response codes: 0 = no response, 1 = right-handed, 2 = left-handed, 3 = both.

That means we will have to dummify the `hand` column. We removed the rows with no response earlier. *Note*: People who use both hands are also counted as 'left-handed'.

In [108]:
# Dummifying 'hand' column
dummies = pd.get_dummies(data=df3['hand'])

# Renaming columns
dummies.columns = ['right', 'left', 'both']

# Combining left and both columns into one
left_handed = dummies['left'] + dummies['both']

print(type(left_handed)) # Checking whether our target response is a series
left_handed.head()

<class 'pandas.core.series.Series'>


0    1
2    1
3    1
4    1
5    0
dtype: uint8
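
The same target can also be built without dummy columns. The sketch below is an alternative, not the approach used above; the small `hand` series is a made-up sample that mirrors the first rows shown, and `left_handed_alt` is a hypothetical name.

```python
import pandas as pd

# Hypothetical 'hand' codes: 1=Right, 2=Left, 3=Both (code 0 already removed).
hand = pd.Series([3, 2, 2, 3, 1], index=[0, 2, 3, 4, 5], name="hand")

# Same convention as above: 'Left' and 'Both' both count as left-handed.
left_handed_alt = hand.isin([2, 3]).astype(int)

print(left_handed_alt.tolist())
```

This version does not depend on the order of the dummy columns, so it still works if one of the codes happens to be absent from the data.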

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

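**Answer**: With two classes, an even value such as $k = 4$ can produce tied votes (two neighbors in each class), so the prediction depends on how the tie is broken. An odd $k$ avoids this. There is also no reason to assume that 4 is optimal; comparing several values of $k$ on held-out data makes more sense.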
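
One concrete concern with an even $k$ in a two-class problem is a tied vote. The sketch below illustrates this with a hypothetical `vote` helper and made-up neighbor labels; how a particular library breaks such ties is implementation-specific.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return (winning_label, is_tie) for a list of 0/1 neighbor labels."""
    counts = Counter(neighbor_labels).most_common()
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

# Hypothetical neighbor labels for one query point.
print(vote([0, 0, 1, 1]))      # k = 4: two votes for each class
print(vote([0, 0, 1, 1, 1]))   # k = 5: one class must hold a strict majority
```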

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [110]:
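# Keep only the numeric survey items; exclude timing, demographic and target columns
non_features = ['introelapse', 'testelapse', 'fromgoogle', 'engnat',
                'age', 'education', 'gender', 'orientation', 'race',
                'religion', 'hand']
X_var = [col for col in df3.select_dtypes(include='number').columns if col not in non_features]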
X = df3[X_var]
X.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q34,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q44
0,4,1,5,1,5,1,5,1,4,1,...,5,5,1,1,1,5,5,5,1,1
2,1,2,1,1,5,4,3,2,1,4,...,4,2,2,4,2,1,4,2,2,2
3,1,4,1,5,1,4,5,4,3,5,...,5,5,1,3,4,1,2,1,1,3
4,5,1,5,1,5,1,5,1,3,1,...,5,5,1,1,1,5,5,5,1,1
5,5,4,2,2,1,1,3,3,3,1,...,1,5,5,2,2,5,4,5,5,1


In [111]:
y = left_handed
y[:5]

0    1
2    1
3    1
4    1
5    0
dtype: uint8

In [112]:
y.value_counts(normalize=True)

0    0.847038
1    0.152962
dtype: float64

In [119]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3, stratify=y)

In [120]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

In [127]:
# Fit on the training set, then score on both the training and test sets
for i in [3, 5, 15, 25]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    print(f'K{i}: Accuracy (Train): {knn.score(X_train, y_train)} \nK{i}: Accuracy (Test) {knn.score(X_test, y_test)}\n')

K3: Accuracy (Train): 0.8707368421052631 
K3: Accuracy (Test) 0.7986247544204322

K5: Accuracy (Train): 0.8534736842105263 
K5: Accuracy (Test) 0.825147347740668

K15: Accuracy (Train): 0.8471578947368421 
K15: Accuracy (Test) 0.8457760314341847

K25: Accuracy (Train): 0.8471578947368421 
K25: Accuracy (Test) 0.8467583497053045



Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

**Answer**: Yes, there is default regularization. The default is `penalty='l2'` (Ridge) with `C=1.0`.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

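**Answer**: Because scikit-learn's logistic regression is regularized by default, the penalty depends on the size of each coefficient, so features measured on different scales should generally be standardized. Here every feature is on the same 1 to 5 scale, so standardizing is not strictly necessary. Leaving the features unscaled also keeps the coefficients easy to interpret: each one is the change in log-odds for a 1-point increase in that item.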

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [122]:
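# Instantiate logistic regression models: LASSO (l1) and Ridge (l2) penalties, each with C=1 and C=10
# Note: scikit-learn's C is the inverse of regularization strength (C = 1/alpha),
# so C=10.0 corresponds to alpha = 0.1, not alpha = 10
logreg_l1c1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
logreg_l1c10 = LogisticRegression(penalty='l1', C=10.0, solver='liblinear')
logreg_l2c1 = LogisticRegression(penalty='l2', C=1.0)
logreg_l2c10 = LogisticRegression(penalty='l2', C=10.0)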
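
The hint about specifying $\alpha$ refers to the fact that scikit-learn takes `C`, the inverse of regularization strength, rather than $\alpha$ itself. Under that reading, $\alpha = 10$ corresponds to `C = 0.1`. The sketch below builds the four models that way; `alpha_to_C` and `models` are hypothetical names, and the models are left unfitted.

```python
from sklearn.linear_model import LogisticRegression

def alpha_to_C(alpha):
    """Hypothetical helper: convert regularization strength alpha to sklearn's C."""
    return 1.0 / alpha

# One unfitted model per (penalty, alpha) pair from the prompt.
models = {
    (penalty, alpha): LogisticRegression(
        penalty=penalty, C=alpha_to_C(alpha), solver="liblinear"
    )
    for penalty in ("l1", "l2")
    for alpha in (1, 10)
}

for key, model in models.items():
    print(key, "-> C =", model.C)
```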

In [123]:
# Scoring logistic regression with lasso, c=1
logreg_l1c1.fit(X_train, y_train)
print(f'Accuracy (Train): {logreg_l1c1.score(X_train, y_train)}')
print(f'Accuracy (Test): {logreg_l1c1.score(X_test, y_test)}')

Accuracy (Train): 0.8471578947368421
Accuracy (Test): 0.8467583497053045


In [124]:
# Scoring logistic regression with lasso, c=10
logreg_l1c10.fit(X_train, y_train)
print(f'Accuracy (Train): {logreg_l1c10.score(X_train, y_train)}')
print(f'Accuracy (Test): {logreg_l1c10.score(X_test, y_test)}')

Accuracy (Train): 0.8471578947368421
Accuracy (Test): 0.8467583497053045


In [125]:
# Scoring logistic regression with ridge, c=1
logreg_l2c1.fit(X_train, y_train)
print(f'Accuracy (Train): {logreg_l2c1.score(X_train, y_train)}')
print(f'Accuracy (Test): {logreg_l2c1.score(X_test, y_test)}')

Accuracy (Train): 0.8471578947368421
Accuracy (Test): 0.8467583497053045


In [126]:
# Scoring logistic regression with ridge, c=10
logreg_l2c10.fit(X_train, y_train)
print(f'Accuracy (Train): {logreg_l2c10.score(X_train, y_train)}')
print(f'Accuracy (Test): {logreg_l2c10.score(X_test, y_test)}')

Accuracy (Train): 0.8471578947368421
Accuracy (Test): 0.8467583497053045


---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

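**Answer**: Our X variables are self-reported answers to personality questions with no obvious connection to handedness, so they will probably not predict Y well. We should expect scores close to what we would get by always guessing the majority class (about 85% of respondents are not left-handed).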

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

**Answer**: See the table below.

|Type|Description|Accuracy (Train)|Accuracy (Test)|
|:-|:-|:-|:-|
|knn|k=3|0.8707368421052631|0.7986247544204322|
|knn|k=5|0.8534736842105263|0.825147347740668|
|knn|k=15|0.8471578947368421|0.8457760314341847|
|knn|k=25|0.8471578947368421|0.8467583497053045|
|lasso|l1 penalty, c=1|0.8471578947368421|0.8467583497053045|
|lasso|l1 penalty, c=10|0.8471578947368421|0.8467583497053045|
|ridge|l2 penalty, c=1|0.8471578947368421|0.8467583497053045|
|ridge|l2 penalty, c=10|0.8471578947368421|0.8467583497053045|
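
Many of the scores above sit at about 0.847, which matches the share of the majority class in `y` (about 84.7% not left-handed). One way to compute such a reference point is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels; `X_demo`, `y_demo` and the 15.3% rate are assumptions for illustration, not the lab data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class balance seen above (an assumption).
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.153).astype(int)
X_demo = rng.integers(1, 6, size=(1000, 5))  # the dummy model ignores features

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

A model whose accuracy is no better than this baseline has not demonstrably learned anything from the features.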

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

**Answer**: The $k$-NN model with $k = 3$ shows the clearest evidence of overfitting: its training accuracy (0.871) is noticeably higher than its test accuracy (0.799). The $k = 5$ model shows a smaller gap (0.853 vs. 0.825).

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

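**Answer**: As $k$ increases, bias increases and variance decreases: predictions are averaged over more neighbors, so the model becomes smoother. A small $k$ has low bias but high variance and is prone to overfitting.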
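
One way to see how $k$ changes the fit is to compare training and test accuracy across several values of $k$. The sketch below does this on synthetic data from `make_classification`, which stands in for the survey responses; the variable names and the extra value $k = 1$ are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the survey responses (an assumption).
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

scores = {}
for k in [1, 3, 5, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>2}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so training accuracy is perfect. The size of the gap between the training and test columns is the thing to watch as $k$ grows.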

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

**Answer**: (1) Increase the value of $k$, (2) Increase the number of observations, (3) Reduce the number of features, for example by removing collinear ones.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

**Answer**: Training and test accuracy are nearly identical (about 0.847) in all four logistic regression models, so there is little evidence of overfitting in any of them.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

**Answer**: $C$ is the inverse of regularization strength. As $C$ increases, regularization weakens, so bias decreases while variance increases.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

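**Answer**: As we increase $C$, the constraint on the coefficients loosens, so the coefficients can grow in magnitude and the fit to the training data improves or stays the same. As we decrease $C$, the coefficients shrink towards zero. In this problem the accuracy is the same for $C = 1$ and $C = 10$, which suggests the features carry little signal about handedness.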
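
A quick way to explore how $C$ affects the coefficients is to refit an L1-penalized model over a range of values and count the coefficients that remain nonzero. The sketch below uses synthetic data from `make_classification` in place of the survey responses; the grid of `C` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the survey responses (an assumption).
X_syn, y_syn = make_classification(n_samples=500, n_features=20,
                                   n_informative=5, random_state=0)

nonzero = {}
for C in [1e-4, 0.01, 1.0, 100.0]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_syn, y_syn)
    # Count the coefficients the L1 penalty has not shrunk to exactly zero.
    nonzero[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C:<7} nonzero coefficients: {nonzero[C]}")
```

Very small values of `C` push every coefficient to zero, while large values leave the model close to unregularized.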

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

**Answer**: (1) Remove some features, (2) Add regularization (an L1 or L2 penalty), (3) Decrease $C$ to strengthen that regularization.

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

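**Answer**: I would use logistic regression with a LASSO penalty. Unlike $k$-NN, it produces a coefficient for each feature, and the L1 penalty can shrink the coefficients of irrelevant features to zero. I believe many of the features are irrelevant.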

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [128]:
logreg_l1c1.coef_

array([[-0.00035173,  0.02467567,  0.03878952,  0.030816  ,  0.04488419,
         0.02678837,  0.04158006, -0.12212715,  0.02105754,  0.00090294,
         0.01342054,  0.00428094, -0.02052788, -0.07750944,  0.01090812,
         0.02120303,  0.0223356 , -0.06245631, -0.00578357, -0.00639293,
        -0.10097874, -0.08609748, -0.12532075,  0.        ,  0.0774237 ,
         0.06696628, -0.02026698,  0.0442557 ,  0.00750825, -0.01070895,
        -0.0128059 , -0.02998423,  0.06814026, -0.00057736,  0.10382164,
        -0.01360956, -0.04073529,  0.05278688, -0.05134452, -0.06339894,
        -0.06535533, -0.07121681,  0.07379411]])

**Answer**: Holding the other responses constant, for every 1-point increase in Q1 we expect the log-odds of the respondent being left-handed to decrease by about 0.00035. This effect is practically zero.
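
Log-odds are hard to read directly, so it can help to exponentiate a coefficient into an odds ratio. The sketch below does this for the first and eighth coefficients in the output above, assuming they belong to `Q1` and `Q8` (that is, that the columns are in the order shown in `X.head()`).

```python
import math

# Coefficients copied from the fitted model output above.
coef_q1 = -0.00035173
coef_q8 = -0.12212715

# exp(coefficient) is the multiplicative change in the odds per 1-point increase.
odds_ratio_q1 = math.exp(coef_q1)
odds_ratio_q8 = math.exp(coef_q8)

print(f"Q1 odds ratio: {odds_ratio_q1:.5f}")
print(f"Q8 odds ratio: {odds_ratio_q8:.5f}")
```

An odds ratio very close to 1 means the item has almost no association with the outcome in this model.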

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

**Answer**: Since the train and test accuracies of the different models are similar, I'm going to go with the LASSO logistic regression because I prefer a sparser model.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

In [139]:
df4 = pd.concat([X, y], axis=1)
df4.rename(columns={0: 'hand'}, inplace=True)
df4.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q35,Q36,Q37,Q38,Q39,Q40,Q41,Q42,Q44,hand
0,4,1,5,1,5,1,5,1,4,1,...,5,1,1,1,5,5,5,1,1,1
2,1,2,1,1,5,4,3,2,1,4,...,2,2,4,2,1,4,2,2,2,1
3,1,4,1,5,1,4,5,4,3,5,...,5,1,3,4,1,2,1,1,3,1
4,5,1,5,1,5,1,5,1,3,1,...,5,1,1,1,5,5,5,1,1,1
5,5,4,2,2,1,1,3,3,3,1,...,5,5,2,2,5,4,5,5,1,0


In [148]:
df4.corr()[['hand']].sort_values('hand',ascending=False)

Unnamed: 0,hand
hand,1.0
Q35,0.065403
Q3,0.043064
Q25,0.043012
Q17,0.039582
Q1,0.039498
Q5,0.036287
Q29,0.033694
Q7,0.032322
Q33,0.032081


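**Answer**: It is hard to draw any general conclusion here. Even the strongest correlation with `hand` (Q35, about 0.065) is very weak, so the data give no evidence that these personality items are strongly associated with left-handedness.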

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
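
As a starting point for the sensitivity/specificity and unbalanced-classes ideas above, here is a minimal sketch. It scores a model that always predicts the majority class on made-up labels (15 left-handed out of 100); none of these numbers come from the lab data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels: 15 of 100 respondents are left-handed (an assumption).
y_true = np.array([1] * 15 + [0] * 85)
# A model that always predicts the majority class (not left-handed).
y_pred = np.zeros_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of left-handed respondents identified
specificity = tn / (tn + fp)   # share of other respondents identified

print(accuracy, sensitivity, specificity)
```

The accuracy looks respectable even though the model never identifies a single left-handed respondent, which is why accuracy alone can mislead when classes are unbalanced.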