## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> Check out the codebook in the repo for some inspiration.

> You'll be asked to answer one of these questions later on, so make sure your questions are based on the data provided (specifically Q1 - Q44)!

Answer:  
**1)** As the scale increases for 'likes guns', how does the likelihood of being left-handed change?  
**2)** As the scale increases for 'I hate shopping', how does the likelihood of being left-handed change?  
**3)** As the scale increases for 'I have considered joining the military', how does the likelihood of being left-handed change?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

> Notice that the data is separated by *tabs* (not commas, like most .csv files). Check out the parameters to see if there is anything that might help you parse this.

In [170]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso, LassoCV

In [140]:
# Need to add delimiter="\t" to load in tab separated files
data = pd.read_csv('../4.01-lab-classification-model-comparison-master/data.csv', delimiter="\t")
data.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

Answer:  
**1)** One option is to make the survey anonymous, so respondents are not required to give their names or other personal information.  
**2)** We can also encrypt the data so that only people with the access keys can read it.  
**3)** Depending on how much data I collect, I can store it in a service such as AWS, which has built-in firewall protection and security features.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [141]:
# There are no null values
data.isna().sum()

Q1             0
Q2             0
Q3             0
Q4             0
Q5             0
Q6             0
Q7             0
Q8             0
Q9             0
Q10            0
Q11            0
Q12            0
Q13            0
Q14            0
Q15            0
Q16            0
Q17            0
Q18            0
Q19            0
Q20            0
Q21            0
Q22            0
Q23            0
Q24            0
Q25            0
Q26            0
Q27            0
Q28            0
Q29            0
Q30            0
Q31            0
Q32            0
Q33            0
Q34            0
Q35            0
Q36            0
Q37            0
Q38            0
Q39            0
Q40            0
Q41            0
Q42            0
Q43            0
Q44            0
introelapse    0
testelapse     0
country        0
fromgoogle     0
engnat         0
age            0
education      0
gender         0
orientation    0
race           0
religion       0
hand           0
dtype: int64

In [142]:
data.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [143]:
# Dropping the rows where "Is English your native language?" = 0
# data.drop(data[data['engnat'] == 0].index, inplace = True)

In [144]:
# Dropping the rows where age is not possible
data.drop(data[data['age'] >110].index, inplace = True)

In [145]:
# Dropping the rows where education is 0
# data.drop(data[data['education'] == 0].index, inplace=True)

In [146]:
# Dropping the rows where test elapse is under 2 minutes
data.drop(data[data['testelapse'] < 120].index, inplace = True)

In [147]:
data.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,...,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0,4095.0
mean,1.959219,3.833211,2.845665,3.191697,2.866178,3.673504,3.217582,3.185836,2.760928,3.522589,...,488.088156,1.573871,1.244444,24.611477,2.317705,1.652259,1.826129,5.010501,2.404151,1.191209
std,1.359095,1.550571,1.664044,1.47678,1.545331,1.340482,1.490495,1.387851,1.510665,1.238495,...,3175.658084,0.494573,0.443239,10.901128,0.874888,0.641128,1.299187,1.97422,2.19005,0.495037
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,120.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,189.0,1.0,1.0,18.0,2.0,1.0,1.0,4.5,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,245.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,326.0,2.0,2.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,86.0,4.0,3.0,5.0,7.0,7.0,3.0


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

**Answer:** This would be a classification problem because the outcome is a binary label: the person either is or is not left-handed. Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/#:~:text=Fundamentally%2C%20classification%20is%20about%20predicting,is%20about%20predicting%20a%20quantity.&text=That%20classification%20is%20the%20problem,quantity%20output%20for%20an%20example.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

**Answer:** We should standardize our data because variables measured on different scales do not contribute equally to the analysis and can bias the model toward the variables with the largest ranges. This matters most for distance-based methods such as $k$-nearest neighbors. In this dataset, for example, age has a maximum value of 86 while religion has a maximum value of 7. This is a drastic difference in range and can cause problems later in modeling.
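
To make the scale argument concrete, here is a minimal sketch with three hypothetical respondents (the numbers are made up for illustration and are not rows from `data.csv`). It compares Euclidean distances before and after standardizing an age column and a 1-5 rating column; `euclidean` is an illustrative helper.

```python
import numpy as np

# Hypothetical respondents: column 0 is age in years, column 1 is a 1-5 rating.
X = np.array([
    [20.0, 1.0],
    [21.0, 5.0],
    [60.0, 1.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the age column dominates.
raw_near_age = euclidean(X[0], X[1])     # 1 year apart, opposite ratings
raw_same_rating = euclidean(X[0], X[2])  # 40 years apart, identical ratings

# Standardize each column to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_near_age = euclidean(Z[0], Z[1])
std_same_rating = euclidean(Z[0], Z[2])

print(f"raw:          {raw_near_age:.2f} vs {raw_same_rating:.2f}")
print(f"standardized: {std_near_age:.2f} vs {std_same_rating:.2f}")
```

On the raw data the 40-year age gap swamps the rating difference. After standardizing, both kinds of difference contribute on a comparable footing, which is what a distance-based model like $k$-NN needs.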

### 7. Give an example of when we might not standardize our variables.

**Answer:** We might not standardize ordinal features such as excellent, good, fair, and bad. These need a specific 'weight' because each value conveys a level of quality for that column.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case (remember we're only using Q1 - Q44 as predictor variables)? Why or why not?

**Answer:** We do not need to standardize our predictor variables because Q1 - Q44 are all measured on the same small rating scale.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

> Note: Think critically about how to clean the $y$ variable based on your problem statement.
   
   > Be sure to provide some explanation/justification for your choice.

**Answer:** We would need to drop all rows that have a value of 0 and change all of the values of 3 ('both') to 2 ('left'), so that `hand` has only two classes.

In [148]:
# Dropping the rows where hand is 0
data.drop(data[data['hand'] == 0].index, inplace = True)

In [149]:
data.shape

(4085, 56)

In [150]:
# Replacing 'Both' with 'Left'
data['hand'] = data['hand'].replace(3, 2)

In [224]:
# Number of lefts and rights
data['hand'].value_counts()

1    3467
2     618
Name: hand, dtype: int64

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

**Answer:** We do not want to set k = 4 because we have two classes and 4 is even, so the vote among the neighbors can end in a 2-2 tie. With two classes, an odd k guarantees that one class wins the vote. With three or more classes this parity rule no longer holds: ties can occur for both odd and even k (for example, a 1-1-1 split when k = 3 with three classes), so no choice of parity rules them out.
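
The tie problem can be seen with a plain majority vote. This is a minimal sketch, not how sklearn resolves ties internally; `vote` is a hypothetical helper, the labels 1 (right) and 2 (left) follow the recoded `hand` column, and the three-class case is purely illustrative.

```python
from collections import Counter

def vote(neighbor_labels):
    """Return every label that shares the highest vote count."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)

# Two classes: 1 = right-handed, 2 = left-handed.
tie_k4 = vote([1, 1, 2, 2])         # k = 4 can split 2-2
no_tie_k5 = vote([1, 1, 2, 2, 1])   # k = 5 always has a single winner

# Three hypothetical classes: an odd k does not prevent ties.
three_class_k3 = vote([1, 2, 3])

print(tie_k4, no_tie_k5, three_class_k3)
```

More than one label in the returned list means the vote is tied and the classifier needs some tie-breaking rule.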

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [225]:
# Only the features Q1-Q44
data_cols = data.columns[:-12]

In [226]:
X = data[data_cols]
y = data['hand']

# Train, Test, Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### **K = 3**

In [227]:
# Instantiate KNN with k = 3
knn3 = KNeighborsClassifier(n_neighbors=3)

In [228]:
# Fit the KNN model with k = 3
knn3.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [275]:
# Score the KNN model with k = 3
print(f"KNN accuracy train for K = 3 is: {knn3.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 3 is: {knn3.score(X_test, y_test)}")

KNN accuracy train for K = 3 is: 0.8782239634345413
KNN accuracy test for K = 3 is: 0.7710371819960861


### **K = 5**

In [230]:
# Instantiate KNN with k = 5
knn5 = KNeighborsClassifier(n_neighbors=5)

In [231]:
knn5.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [269]:
print(f"KNN accuracy train for K = 5 is: {knn5.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 5 is: {knn5.score(X_test, y_test)}")

KNN accuracy train for K = 5 is: 0.8635324844923278
KNN accuracy test for K = 5 is: 0.7915851272015656


### **K = 15**

In [233]:
knn15 = KNeighborsClassifier(n_neighbors=15)

In [234]:
knn15.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

In [264]:
print(f"KNN accuracy train for K = 15 is: {knn15.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 15 is: {knn15.score(X_test, y_test)}")

KNN accuracy train for K = 15 is: 0.8609206660137121
KNN accuracy test for K = 15 is: 0.8131115459882583


### **K = 25**


In [236]:
knn25 = KNeighborsClassifier(n_neighbors=25)

In [237]:
knn25.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='uniform')

In [262]:
print(f"KNN accuracy train for K = 25 is: {knn25.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 25 is: {knn25.score(X_test, y_test)}")

KNN accuracy train for K = 25 is: 0.860594188703885
KNN accuracy test for K = 25 is: 0.8131115459882583


Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

**Answer:** Yes. By default, `LogisticRegression` in sklearn applies L2 regularization (`penalty='l2'`) with `C=1.0`. The L2 penalty is the same penalty used in ridge regression: it adds the squared magnitude of the coefficients as a penalty term to the loss function.

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

**Answer:** We do not need to standardize our features, since the features we are using are all on the same scale as each other.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate logistic regression models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
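
The hint about $\alpha$ comes down to the fact that sklearn's `C` is the inverse of regularization strength. Below is a minimal sketch of the conversion, assuming the simple mapping $C = 1/\alpha$; `alpha_to_C` is a hypothetical helper and not part of sklearn.

```python
def alpha_to_C(alpha):
    """Convert a regularization strength alpha into sklearn's inverse parameter C.

    Assumes the simple mapping C = 1 / alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / alpha

# The four (penalty, alpha) combinations requested in this question.
settings = {
    (penalty, alpha): alpha_to_C(alpha)
    for penalty in ("l1", "l2")
    for alpha in (1, 10)
}

for (penalty, alpha), C in settings.items():
    print(f"penalty = {penalty}, alpha = {alpha:>2} -> C = {C}")
```

A larger $\alpha$ therefore means a smaller `C` and stronger regularization.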

In [239]:
X = data[data_cols]
y = data['hand']

# Train, Test, Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### **LASSO with 𝛼 = 1**


In [240]:
# Instantiate the Logistic Regression Model for LASSO and 𝛼 = 1
logregl1 = LogisticRegression(penalty='l1', solver='liblinear', C = 1.0)

In [241]:
# Fit the Logistic Regression Model for LASSO and 𝛼 = 1
logregl1.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [259]:
print(f"LASSO accuracy for train with alpha = 1 is: {logregl1.score(X_train, y_train)}")
print(f"LASSO accuracy for test with alpha = 1 is: {logregl1.score(X_test, y_test)}")

LASSO accuracy for train with alpha = 1 is: 0.8602677113940581
LASSO accuracy for test with alpha = 1 is: 0.8131115459882583


### **LASSO with 𝛼 = 10**


In [243]:
# Instantiate the Logistic Regression Model for LASSO and 𝛼 = 10 (C = 1/𝛼 = 0.1)
logregl110 = LogisticRegression(penalty='l1', solver='liblinear', C = 0.1)

In [244]:
logregl110.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [257]:
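# Score logregl110, the alpha = 10 (C = 0.1) model. The output recorded below
# came from scoring logregl1, so re-run this cell to refresh the numbers.
print(f"LASSO accuracy for train with alpha = 10 is: {logregl110.score(X_train, y_train)}")
print(f"LASSO accuracy for test with alpha = 10 is: {logregl110.score(X_test, y_test)}")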

LASSO accuracy for train with alpha = 10 is: 0.8602677113940581
LASSO accuracy for test with alpha = 10 is: 0.8131115459882583


### **Ridge with 𝛼 = 1**


In [246]:
# Instantiate the Logistic Regression Model for Ridge and 𝛼 = 1
logregl21 = LogisticRegression(penalty='l2', C = 1.0)

In [247]:
logregl21.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [253]:
print(f"Ridge accuracy for train with alpha = 1 is: {logregl21.score(X_train, y_train)}")
print(f"Ridge accuracy for test with alpha = 1 is: {logregl21.score(X_test, y_test)}")

Ridge accuracy for train with alpha = 1 is: 0.8602677113940581
Ridge accuracy for test with alpha = 1 is: 0.8131115459882583


### **Ridge with 𝛼 = 10**


In [296]:
logregl210 = LogisticRegression(penalty='l2', C = 0.1)

In [297]:
logregl210.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [298]:
logregl210.coef_

array([[ 0.06861106,  0.00264391,  0.02935526, -0.00200734,  0.04157423,
        -0.07054575,  0.02613326, -0.11282439, -0.0153721 , -0.03055827,
         0.03371327,  0.00795544, -0.04859689, -0.03992647,  0.01383503,
        -0.00947503,  0.0483985 , -0.04252212, -0.06510668,  0.008095  ,
        -0.11642582, -0.11219591, -0.05231265, -0.03285661,  0.05194174,
         0.03597416,  0.15984457,  0.0503515 ,  0.01688062, -0.01752059,
         0.01044849,  0.00467021,  0.03033102, -0.00890223,  0.12690436,
        -0.02626925, -0.04436286,  0.13247143, -0.07768385, -0.0748033 ,
        -0.08917079, -0.01708683, -0.14827578,  0.0590259 ]])

In [284]:
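# Score logregl210, the alpha = 10 (C = 0.1) model. The output recorded below
# came from scoring logregl21, so re-run this cell to refresh the numbers.
print(f"Ridge accuracy for train with alpha = 10 is: {logregl210.score(X_train, y_train)}")
print(f"Ridge accuracy for test with alpha = 10 is: {logregl210.score(X_test, y_test)}")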

Ridge accuracy for train with alpha = 10 is: 0.8602677113940581
Ridge accuracy for test with alpha = 10 is: 0.8131115459882583


---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

> For this question, consider your own thoughts (or research) on the relationship between psychological factors and handedness.

> When evaluating your models later on, consider whether a high score always means the variables are good predictors. Is this always the case?

**Answer:** 

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

Answer:

In [285]:
print(f"KNN accuracy train for K = 3 is: {knn3.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 3 is: {knn3.score(X_test, y_test)}")
print('='*60)
print(f"KNN accuracy train for K = 5 is: {knn5.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 5 is: {knn5.score(X_test, y_test)}")
print('='*60)
print(f"KNN accuracy train for K = 15 is: {knn15.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 15 is: {knn15.score(X_test, y_test)}")
print('='*60)
print(f"KNN accuracy train for K = 25 is: {knn25.score(X_train, y_train)}")
print(f"KNN accuracy test for K = 25 is: {knn25.score(X_test, y_test)}")
print('='*60)
print(f"LASSO accuracy for train with alpha = 1 is: {logregl1.score(X_train, y_train)}")
print(f"LASSO accuracy for test with alpha = 1 is: {logregl1.score(X_test, y_test)}")
print('='*60)
# logregl110 is the alpha = 10 LASSO model; the recorded output below used logregl1.
print(f"LASSO accuracy for train with alpha = 10 is: {logregl110.score(X_train, y_train)}")
print(f"LASSO accuracy for test with alpha = 10 is: {logregl110.score(X_test, y_test)}")
print('='*60)
print(f"Ridge accuracy for train with alpha = 1 is: {logregl21.score(X_train, y_train)}")
print(f"Ridge accuracy for test with alpha = 1 is: {logregl21.score(X_test, y_test)}")
print('='*60)
# logregl210 is the alpha = 10 Ridge model; the recorded output below used logregl21.
print(f"Ridge accuracy for train with alpha = 10 is: {logregl210.score(X_train, y_train)}")
print(f"Ridge accuracy for test with alpha = 10 is: {logregl210.score(X_test, y_test)}")

KNN accuracy train for K = 3 is: 0.8782239634345413
KNN accuracy test for K = 3 is: 0.7710371819960861
KNN accuracy train for K = 5 is: 0.8635324844923278
KNN accuracy test for K = 5 is: 0.7915851272015656
KNN accuracy train for K = 15 is: 0.8609206660137121
KNN accuracy test for K = 15 is: 0.8131115459882583
KNN accuracy train for K = 25 is: 0.860594188703885
KNN accuracy test for K = 25 is: 0.8131115459882583
LASSO accuracy for train with alpha = 1 is: 0.8602677113940581
LASSO accuracy for test with alpha = 1 is: 0.8131115459882583
LASSO accuracy for train with alpha = 10 is: 0.8602677113940581
LASSO accuracy for test with alpha = 10 is: 0.8131115459882583
Ridge accuracy for train with alpha = 1 is: 0.8602677113940581
Ridge accuracy for test with alpha = 1 is: 0.8131115459882583
Ridge accuracy for train with alpha = 10 is: 0.8602677113940581
Ridge accuracy for test with alpha = 10 is: 0.8131115459882583


### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

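**Answer:** The k-NN models with k = 3 and k = 5 show evidence of overfitting. Their testing accuracy is about 11 and 7 percentage points below their training accuracy, respectively, while the gap for k = 15 and k = 25 is under 5 points. A model that scores much better on the training data than on the testing data is overfit.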
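
The size of each gap can be checked directly from the accuracies recorded in question 16. A minimal sketch follows; the dictionary `recorded` simply restates those printed scores, and `gaps` is an illustrative name.

```python
# (train accuracy, test accuracy) for each k, copied from the recorded output.
recorded = {
    3: (0.8782239634345413, 0.7710371819960861),
    5: (0.8635324844923278, 0.7915851272015656),
    15: (0.8609206660137121, 0.8131115459882583),
    25: (0.860594188703885, 0.8131115459882583),
}

# A large positive train-minus-test gap is the usual sign of overfitting.
gaps = {k: train - test for k, (train, test) in recorded.items()}

for k, gap in gaps.items():
    print(f"k = {k:>2}: train - test = {gap:.3f}")
```

The gap shrinks steadily as k grows, which is consistent with smaller values of k fitting the training data more closely.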

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

**Answer:** In the k-nearest neighbors algorithm, increasing k increases the number of neighbors that contribute to each prediction, which smooths the decision boundary. As k increases, bias increases and variance decreases.

https://www.listendata.com/2017/02/bias-variance-tradeoff.html
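
The two extremes of k illustrate the tradeoff. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the survey data): with k = 1 the model memorizes the training set, and with k equal to the number of training rows every prediction collapses to the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced, noisy data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, weights=[0.8, 0.2], random_state=0,
)

# k = 1: each training point is its own nearest neighbor (low bias, high variance).
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_acc_k1 = knn_1.score(X_demo, y_demo)

# k = n: every point votes on every prediction (high bias, low variance).
knn_all = KNeighborsClassifier(n_neighbors=len(X_demo)).fit(X_demo, y_demo)
preds_all = knn_all.predict(X_demo)
train_acc_kall = knn_all.score(X_demo, y_demo)

majority_share = np.bincount(y_demo).max() / len(y_demo)

print(f"k = 1: training accuracy = {train_acc_k1:.3f}")
print(f"k = n: training accuracy = {train_acc_kall:.3f}, "
      f"majority share = {majority_share:.3f}")
```

Practical values of k sit between these extremes, trading some training accuracy for predictions that change less from one sample of data to the next.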

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: If you have a k-NN model that shows evidence of overfitting, you can increase k, reduce the number of features in X, or train with more data.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

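Answer: All four logistic regression models show mild evidence of overfitting. In the recorded scores, each has a training accuracy of about 0.860 and a testing accuracy of about 0.813, a gap of roughly 5 percentage points.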

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

Answer: C is the inverse of regularization strength, so as we increase C the regularization weakens: bias decreases and variance increases.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

Answer: As I vary C, the coefficients change. The smaller I make C (stronger regularization), the closer the coefficients shrink toward 0. As I increase C, the regularization weakens and the coefficients grow in magnitude.
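
One way to see this effect is to refit the same model over a range of `C` values and summarize the coefficients. The sketch below uses synthetic data from `make_classification` rather than the survey responses (an assumption for illustration), with the same L1 penalty and `liblinear` solver used for the LASSO models; `results` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a few informative features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

results = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_demo, y_demo)
    coefs = model.coef_.ravel()
    results[C] = {
        "abs_sum": float(np.abs(coefs).sum()),  # total coefficient magnitude
        "n_zero": int(np.sum(coefs == 0)),      # features dropped by the L1 penalty
    }

for C, summary in results.items():
    print(f"C = {C:>6}: sum|coef| = {summary['abs_sum']:.3f}, "
          f"zero coefficients = {summary['n_zero']}")
```

Smaller `C` gives a smaller total coefficient magnitude, and with the L1 penalty it can also set some coefficients exactly to zero.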

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

Answer: Three things we can do to combat overfitting in a logistic regression model are removing some features, collecting more data, and using stronger regularization.

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

Answer: "Therefore, you can use the KNN algorithm for applications that require high accuracy but that do not require a human-readable model. The quality of the predictions depends on the distance measure." - IBM

https://www.ibm.com/support/knowledgecenter/SSHRBY/com.ibm.swg.im.dashdb.analytics.doc/doc/r_knn_usage.html

We would pick logistic regression because we can get a coefficient for each feature.
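
As a sketch of how those coefficients could be used, the snippet below ranks the 44 values printed earlier for `logregl210.coef_` (Ridge, C = 0.1) by absolute size. Mapping position `i` to question `Q{i+1}` assumes the columns of `X` are in the order Q1 through Q44; the names `ridge_alpha10_coefs`, `ranked`, and `top5` are illustrative.

```python
# Coefficients copied from the logregl210.coef_ output shown earlier.
ridge_alpha10_coefs = [
    0.06861106, 0.00264391, 0.02935526, -0.00200734, 0.04157423,
    -0.07054575, 0.02613326, -0.11282439, -0.0153721, -0.03055827,
    0.03371327, 0.00795544, -0.04859689, -0.03992647, 0.01383503,
    -0.00947503, 0.0483985, -0.04252212, -0.06510668, 0.008095,
    -0.11642582, -0.11219591, -0.05231265, -0.03285661, 0.05194174,
    0.03597416, 0.15984457, 0.0503515, 0.01688062, -0.01752059,
    0.01044849, 0.00467021, 0.03033102, -0.00890223, 0.12690436,
    -0.02626925, -0.04436286, 0.13247143, -0.07768385, -0.0748033,
    -0.08917079, -0.01708683, -0.14827578, 0.0590259,
]

# Assumed mapping: position 0 -> Q1, position 1 -> Q2, and so on.
labels = [f"Q{i}" for i in range(1, len(ridge_alpha10_coefs) + 1)]

# Sort by absolute value so strong negative effects rank alongside positive ones.
ranked = sorted(zip(labels, ridge_alpha10_coefs),
                key=lambda pair: abs(pair[1]), reverse=True)
top5 = ranked[:5]

for label, coef in top5:
    print(f"{label}: {coef:+.4f}")
```

Comparing raw magnitudes like this is only reasonable because every question uses the same rating scale. k-NN offers no comparable per-feature summary.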

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

In [335]:
logregl1.coef_[0][0]

0.06729518329033465

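**Answer:** The coefficient is on the log-odds scale. For every one-point increase in the rating for Q1, the log-odds of being left-handed increase by about 0.067, which multiplies the odds by $e^{0.067} \approx 1.07$ (about a 7% increase in the odds), holding the other responses constant.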
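
Logistic regression coefficients are on the log-odds scale, so one way to read the Q1 coefficient is to exponentiate it into an odds ratio. The sketch below does that and then shows what the odds ratio implies for a probability. The starting probability is an assumption for illustration: it uses the overall share of left-handed respondents from the `value_counts()` output (618 of 4085).

```python
import math

coef_q1 = 0.06729518329033465  # logregl1.coef_[0][0] from the cell above

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratio = math.exp(coef_q1)
pct_change_in_odds = (odds_ratio - 1) * 100

# Illustrative baseline: overall share of left-handed respondents.
baseline_p = 618 / 4085
baseline_odds = baseline_p / (1 - baseline_p)

# Apply the odds ratio for a one-point increase in Q1, then convert back.
new_odds = baseline_odds * odds_ratio
new_p = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio:.4f} ({pct_change_in_odds:.1f}% higher odds)")
print(f"probability: {baseline_p:.4f} -> {new_p:.4f}")
```

The change in probability depends on the starting point, which is why the coefficient cannot be read directly as a percentage-point change in probability.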

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

In [342]:
print(logregl1.coef_.mean())

print(logregl110.coef_.mean())

print(logregl21.coef_.mean())

print(logregl210.coef_.mean())

-0.008816861155945833
-0.011440972593723122
-0.006979886823763138
-0.006719148590277671


**Answer:** If I had to select one model overall to be my best model, I would pick the LASSO logistic regression with alpha = 10. Its accuracy scores were comparable to those of the other logistic regression models, and its coefficients had the largest average magnitude (a mean of about -0.0114). LASSO can also shrink the coefficients of uninformative features to exactly zero, so it can reach similar performance while relying on fewer features.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Answer:

As the scale increases for 'likes guns', how does the likelihood of being left-handed change?

In [345]:
logregl110.coef_[0][10]

0.007575971456760836

In [346]:
logregl110.coef_[0][35]

-0.017706670604718307

In [347]:
logregl110.coef_[0][18]

-0.06350357380759074

If this model is accurate, the trait 'likes guns' appears to have little impact on the likelihood of left-handedness. If someone circles a 5 on the rating scale for 'likes guns', the probability of left-handedness increases by 3.5%, everything else held constant.

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
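
As a possible starting point for the last two bullets, the sketch below works out what a trivial "always predict right-handed" classifier would score on the cleaned data, using only the class counts from the `value_counts()` output (3467 right, 618 left). Treating left-handed as the positive class is an assumption made for illustration.

```python
# Class counts from data['hand'].value_counts() after cleaning.
n_right, n_left = 3467, 618
total = n_right + n_left

# A classifier that always predicts "right" (left-handed = positive class).
true_negatives = n_right   # every right-handed person is labeled correctly
false_negatives = n_left   # every left-handed person is missed
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
```

An accuracy near 0.85 with a sensitivity of 0 shows why accuracy alone can be misleading with unbalanced classes: a model can look strong while never identifying a left-handed respondent.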