## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer:
1. As the degree to which indivudlas agree with the statement "I would prefer a class in mathematics to a class in pottery" increases, how does the liklihood of that individual being lef-handed change?
2. As the degree to which indivudlas agree with the statement "I have taken apart machines just to see how they work" increases, how does the liklihood of that individual being lef-handed change?
3. As the degree to which indivudlas agree with the statement "I have studied how to win at gambling" increases, how does the liklihood of that individual being lef-handed change?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [1]:
import pandas as pd

In [5]:
# Read in csv

df = pd.read_csv('./data.csv', sep='\t')

### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:
1. All identifying information should be supressed. It should not be possible to use combinations of specific features to identify an individual.
2. The purpose of the study and a clear explanation of how the data should be used should be clearly stated to all participants.
3. If popssible, an Institutional Review Board (IRB) should be consulted to check for other sensitivities in the questions.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [6]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4184 entries, 0 to 4183
Data columns (total 56 columns):
Q1             4184 non-null int64
Q2             4184 non-null int64
Q3             4184 non-null int64
Q4             4184 non-null int64
Q5             4184 non-null int64
Q6             4184 non-null int64
Q7             4184 non-null int64
Q8             4184 non-null int64
Q9             4184 non-null int64
Q10            4184 non-null int64
Q11            4184 non-null int64
Q12            4184 non-null int64
Q13            4184 non-null int64
Q14            4184 non-null int64
Q15            4184 non-null int64
Q16            4184 non-null int64
Q17            4184 non-null int64
Q18            4184 non-null int64
Q19            4184 non-null int64
Q20            4184 non-null int64
Q21            4184 non-null int64
Q22            4184 non-null int64
Q23            4184 non-null int64
Q24            4184 non-null int64
Q25            4184 non-null int64
Q26            418

In [9]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [10]:
df.isnull().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: Classification. All of the questions are ordinal and discrete in nature and not continuous.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer: We want to standardize our variables when they are on different scales. For example, comparing the number of rooms in a house to the number of square inches in the house might benefit from standardization, as their units of measurements are on very different scales. 

### 7. Give an example of when we might not standardize our variables.

Answer: Latitude and Longitude are already in the format we need for using them. That is, they are already standardized

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer: Probably not. All of the variables are already on an ordinal scale of 1-5.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: We should drop 'no-response' observations and dummify the variable for left-handedness.

In [15]:
import numpy as np
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [26]:
# drop no-response answers for left-handedness
df = df.drop(df[(df['hand']==0)].index)
# drop nulls if generated (this problem has presented itself a few times for me)
df.dropna(inplace=True)
# check for decrase of 11 in n
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4173 entries, 0 to 4183
Data columns (total 56 columns):
Q1             4173 non-null int64
Q2             4173 non-null int64
Q3             4173 non-null int64
Q4             4173 non-null int64
Q5             4173 non-null int64
Q6             4173 non-null int64
Q7             4173 non-null int64
Q8             4173 non-null int64
Q9             4173 non-null int64
Q10            4173 non-null int64
Q11            4173 non-null int64
Q12            4173 non-null int64
Q13            4173 non-null int64
Q14            4173 non-null int64
Q15            4173 non-null int64
Q16            4173 non-null int64
Q17            4173 non-null int64
Q18            4173 non-null int64
Q19            4173 non-null int64
Q20            4173 non-null int64
Q21            4173 non-null int64
Q22            4173 non-null int64
Q23            4173 non-null int64
Q24            4173 non-null int64
Q25            4173 non-null int64
Q26            417

In [21]:
df.hand.map({'Left': 1, 'Right': 0, 'Both': 0})

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4179   NaN
4180   NaN
4181   NaN
4182   NaN
4183   NaN
Name: hand, Length: 4173, dtype: float64

In [23]:
df['hand'].value_counts()

1    3542
2     452
3     179
Name: hand, dtype: int64

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: Picking an even number could result in a tie. Additionally, we probably want to choose our K value by running cross validation and determining which K value gives us the best score. 

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [115]:
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [122]:
X = df.drop(['testelapse',     
            'country',        
            'fromgoogle',     
            'engnat',         
            'age',            
            'education',      
            'gender',         
            'orientation',    
            'race',           
            'religion',       
            'hand'],axis = 'columns')
            
y = df['hand']

In [123]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [118]:
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
    
knn = KNeighborsClassifier(n_neighbors = 3)

cross_val_score(knn, X_train_sc, y_train, cv=5).mean()

# Code adapted from GA DSI Lecture 4.03

0.8063243450479233

In [35]:
knn.fit(X_train_sc, y_train);
print('train', knn.score(X_train_sc, y_train))
print('test', knn.score(X_test_sc, y_test))

train 0.8481943112815596
test 0.8362068965517241


In [47]:
# for training data:

for k in [3,5,15,25]:
        model = KNeighborsClassifier(n_neighbors=k)
        score = cross_val_score(model, X_train_sc, y_train, cv=5).mean()
        print(f'K={k}, {score}')
        
# Code adapted from GA DSI Lecture 4.03

K=3, 0.8063243450479233
K=5, 0.8334921405750798
K=15, 0.8485137380191693
K=25, 0.8491532268370607


In [46]:
# for test data:

for k in [3,5,15,25]:
        model = KNeighborsClassifier(n_neighbors=k)
        score = cross_val_score(model, X_test_sc, y_test, cv=5).mean()
        print(f'K={k}, {score}')
        
# Code adapted from GA DSI Lecture 4.03

K=3, 0.8074806772175194
K=5, 0.8371641516378359
K=15, 0.8477042694147958
K=25, 0.8477042694147958


Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Answer: Yep! From the documentation: "Note that regularization is applied by default." The documentation describing parameters states "penalty{‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2."

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer: Logistic regression has built in standardization in the solver argument 

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [129]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Lasso, Ridge

In [130]:
# Dummify
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_train_dummified = ohe.fit_transform(X_train, y_train)
X_test_dummified = ohe.transform(X_test)

In [131]:
# Apply Standard Scaler to dummified X
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train_dummified)
X_test_sc = sc.transform(X_test_dummified)


In [132]:
# Instantiate: # default alpha is 1
ridge = Ridge()        
lasso = Lasso()

In [133]:
# Make Alpha List:
alpha_list = [1, 10]
# Instantiate Mean Scores:
mean_score_list = []
# Function:
for a_value in a_list:
    lasso = Lasso(alpha=a_value)
    #ridge = Ridge(alpha=a_value)
    mean_score = cross_val_score(lasso, X_train_sc, y_train).mean()
    #mean_score = cross_val_score(ridge, X_train_sc, y_train).mean()
    mean_score_list.append(mean_score)

print(mean_score_list)
list(zip(alpha_list, mean_score_list))

# Code adapted from GA DSI Local Lesson on GridSearchCV

[-0.0012918611458810681, -0.0012918611458810681, -0.0012918611458810681, -0.0012918611458810681, -0.0012918611458810681]


[(1, -0.0012918611458810681), (10, -0.0012918611458810681)]

In [144]:
lasso.fit(X_train_sc, y_train)
lasso.score(X_test_sc, y_test)

-0.00012331508175300598

In [114]:
# Well, these seem awfully wrong to me :(

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer:
Probably not! At least, not in the case of the logistic regressions. In retreospet, dummifying each ordinal value may not have been as prudent as creating simple binary encodings for each question rather than each ordinal answer to each question. This apprach will likely result in poor logistic regression results.  

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer: They sure do look weird! Need to return to this and figure out what's going on. I thought with LASSO and Ridge, we need to one-hot-encode and scale when given so many ordinal values? Could use some help.  

|  Model   |     score     | k/alpha
|----------|:-------------:|------:|
|   KNN    |       0.80748 |   3   |
|   KNN    |       0.83716 |   5   |
|   KNN    |       0.84770 |  15   |
|   KNN    |       0.84770 |  25   |
|  LASSO   |       0.00129 |   1   |
|  LASSO   |       0.00129 |  10   |
|  RIDGE   |       0.00452 |   1   |
|  RIDGE   |       0.00451 |  10   |


### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

Answer: Though none of these models display drastic differences between the train and test scores, the KNN model with a k of 25 shows the greatest discrepency between train and test scores, suggesting overfitting. 

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

Answer: As K increases, we generally expect bias (training error) to increase and variance (test error) to decrease.


### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Attempt a different distance metric (Euclidean or Manhattan)
2. Hypertune K: attempt to optimize K by looking for the value of K that provides the best cross-val score.
3. Regularize our model by adding a penalty term to our loss function

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Answer: Need to come back to this questions. One might have expected overfitting to manifest in more significant vairation between the training and test results, but given the model scores, it might be worth reassessing the aproach to one-hot encoding and standard scaler before assessing overfit. Further investigation is required. 


### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer:
Assuming that by C we're talking about Alpha (I don't recall that nuance being covered in class?)
The Alpha hyperparameter controls the size of the coefficient and  amount of regularization in a model. As Alpha increases, we would expect bias to increase, though this might not effect a models' accuracy.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer: Need to return to this, given the curious Ridge and LASSO model results.

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer:
1. Attempt using L1 and L2 models (Ridge and LASSO) to better acount for sparse or multicolinear data.
2. Hypertune Alpha: attempt to optimize alpha by looking for the value that provides the best cross-val score.
3. Regularize our model by adding a penalty term to our loss function

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer: Probably a KNN, based on the model results. Since KNN assign a class (left-handedness) to unclassified_point using "votes" from k_nearest_points, it seems an reasonable tool for this data science questions.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [148]:
lasso.coef_

#feature_cols = X_train_sc.columns

#df_lr = pd.DataFrame([lasso.coef_], columns=feature_cols)

array([ 0., -0., -0.,  0.,  0.,  0., -0.,  0., -0., -0.,  0., -0.,  0.,
       -0., -0., -0., -0.,  0.,  0., -0., -0.,  0.,  0.,  0.,  0., -0.,
        0.,  0., -0.,  0.,  0.,  0.,  0., -0., -0., -0., -0., -0., -0.,
        0., -0.,  0.,  0.,  0.,  0.,  0., -0., -0.,  0., -0., -0.,  0.,
        0.,  0.,  0.,  0.,  0., -0.,  0., -0., -0., -0.,  0., -0., -0.,
        0.,  0.,  0., -0., -0.,  0., -0.,  0., -0., -0., -0.,  0.,  0.,
        0.,  0., -0.,  0.,  0., -0.,  0., -0.,  0.,  0., -0.,  0.,  0.,
       -0., -0.,  0.,  0., -0.,  0., -0.,  0., -0.,  0.,  0.,  0.,  0.,
       -0.,  0.,  0., -0.,  0., -0., -0., -0., -0.,  0.,  0.,  0.,  0.,
       -0., -0., -0.,  0.,  0., -0.,  0., -0., -0.,  0.,  0., -0.,  0.,
       -0., -0., -0., -0., -0.,  0.,  0., -0.,  0.,  0., -0.,  0., -0.,
       -0., -0., -0., -0.,  0.,  0.,  0.,  0., -0.,  0., -0., -0.,  0.,
        0., -0., -0.,  0.,  0.,  0.,  0.,  0., -0., -0.,  0., -0.,  0.,
       -0., -0.,  0., -0.,  0., -0.,  0.,  0.,  0., -0., -0.,  0

Answer:

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

Answer:

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

AttributeError: 'KNeighborsClassifier' object has no attribute 'coef_'

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)