## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

Answer: 

1) How likely is someone to be left handed given their answers to a list of questions about their personality?

2) Do left-handed people tend to answer peronality-based questions identically to right-handed people or differently?

3) Do lef-handed people give answers to questions's indicating a violent personlity more, less, or the same as right-handed people? 

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [76]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [77]:
import warnings
warnings.filterwarnings('ignore')

In [78]:
df = pd.read_csv('./data.csv', sep='\t')

In [79]:
df.head()


Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

1) Give respondant the option of answering survey anonymously. 

2) Administer the survey by filling in bubbles, or some other way that doesn't require handwritting, so their answers can't be traced back to them.

3) 


---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [80]:
df.columns


Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [81]:
df.shape

(4184, 56)

In [82]:
df.isnull().sum().sum()

0

In [83]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,testelapse,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
count,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,...,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0,4184.0
mean,1.962715,3.829589,2.846558,3.186902,2.86544,3.672084,3.216539,3.184512,2.761233,3.522945,...,479.994503,1.576243,1.239962,30.370698,2.317878,1.654398,1.833413,5.013623,2.394359,1.190966
std,1.360291,1.551683,1.664804,1.476879,1.545798,1.342238,1.490733,1.387382,1.511805,1.24289,...,3142.178542,0.494212,0.440882,367.201726,0.874264,0.640915,1.303454,1.970996,2.184164,0.495357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,1.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,3.0,1.0,2.0,1.0,3.0,2.0,2.0,1.0,3.0,...,186.0,1.0,1.0,18.0,2.0,1.0,1.0,5.0,1.0,1.0
50%,1.0,5.0,3.0,3.0,3.0,4.0,3.0,3.0,3.0,4.0,...,242.0,2.0,1.0,21.0,2.0,2.0,1.0,6.0,2.0,1.0
75%,3.0,5.0,5.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,...,324.25,2.0,1.0,27.0,3.0,2.0,2.0,6.0,2.0,1.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,119834.0,2.0,2.0,23763.0,4.0,3.0,5.0,7.0,7.0,3.0


In [84]:
df['country'].unique()

array(['US', 'CA', 'NL', 'GR', 'GB', 'KR', 'SE', 'NO', 'DE', 'NZ', 'CH',
       'RO', 'IL', 'IN', 'ZA', 'TR', 'JM', 'AU', 'BE', 'PL', 'CZ', 'RS',
       'TW', 'A2', 'MX', 'PH', 'ES', 'AT', 'JP', 'IT', 'SG', 'MY', 'HK',
       'FR', 'EU', 'DK', 'AE', 'EC', 'TH', 'IE', 'PK', 'BR', 'ID', 'EG',
       'NI', 'FI', 'CN', 'RU', 'SI', 'AR', 'PT', 'LB', 'DO', 'PF', 'LT',
       'BG', 'GE', 'CL', 'SK', 'EE', 'KE', 'UZ', 'LV', 'BB', 'BN', 'PR',
       'HR', 'NP', 'A1', 'PE', 'UA', 'HU', 'VN', 'TZ', 'KH', 'UY', 'VE',
       'IS', 'MP', 'CO', 'JO', 'TN', 'KW', 'CY', 'FJ', 'LK', 'VI', 'ZW',
       'IM', 'ZM', 'QA', 'DZ', 'LY', 'SA'], dtype=object)

In [85]:
df['engnat'].head()

0    1
1    1
2    2
3    1
4    1
Name: engnat, dtype: int64

#### I want to change the 'no' value from 2 to 0 in the 'engnat' and 'fromgoogle' columns

In [86]:
df['engnat'].replace(2, 0, inplace=True)

In [87]:
df['fromgoogle'].replace(2, 0, inplace=True)

In [88]:
df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand'],
      dtype='object')

In [89]:
df['fromgoogle'].head()

0    0
1    0
2    0
3    0
4    0
Name: fromgoogle, dtype: int64

In [90]:
#df['age'].value_counts()

##### Respondants who put down '23763' may be suspect...

In [91]:
df['gender'].value_counts()

2    2212
1    1586
3     304
0      82
Name: gender, dtype: int64

In [92]:
df['education'].value_counts()

2    2055
3    1086
1     546
4     446
0      51
Name: education, dtype: int64

In [93]:
df['orientation'].value_counts()

1    2307
2     833
5     349
3     335
4     237
0     123
Name: orientation, dtype: int64

In [94]:
df['race'].value_counts()

6    2793
1     393
2     383
7     342
3     168
0      66
4      33
5       6
Name: race, dtype: int64

In [95]:
df['hand'].value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [96]:
df = df[df.age < 120]
#If you're 120 or older 

In [97]:
df.hand.value_counts()

1    3541
2     452
3     178
0      10
Name: hand, dtype: int64

---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

Answer: This is a classification problem because we are sorting individual into two classes, which are discrete and unordered.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

Answer: We want to standardize out variables so that they are weighed at the same scale.  This is to ensure that certain variables don't drown out the effect of other variables on the prediction.  One instance is gender and age, where the former can only take on a handful of values wheras the latter can take on a large range of inegers. 

### 7. Give an example of when we might not standardize our variables.

Answer: Any instance where the variables are NOT being weighed on the same scale.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

Answer: No becuase all of the predictors in this case only take on values 1-5.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: There are 10 '0' responses to the questions asking for a persons handedness. I'm assuming that none of the respondents don't have either of their hands, so I will delete these rows as there since there are more than 4000 responses and 10 is a small fraction.

In [98]:
df = df[df.hand != 0]

### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

Answer: The dataset contains more than 4000 responses, and a k = 4 is comparitivley small which raises the risk of overfitting. 

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [99]:
target = df['hand']
col_list = list(df.columns)

In [100]:
features = col_list[0:44] 

X = df[features]
y = df['hand']

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)


In [102]:
knn = KNeighborsClassifier(n_neighbors=3)

In [103]:
knn.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [104]:
knn.score(X_train, y_train)


0.8631713554987213

In [105]:
knn.score(X_test, y_test)


0.825503355704698

In [106]:
cross_val_score(knn, X_train, y_train, cv = 3).mean()


0.8126613276854856

In [107]:
knn_5 = KNeighborsClassifier(n_neighbors=5)


In [108]:
knn_5.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [109]:
knn_5.score(X_train, y_train)


0.8539002557544757

In [110]:
knn_5.score(X_test, y_test)


0.8427612655800575

In [111]:
cross_val_score(knn_5, X_train, y_train, cv = 3).mean()

0.8382365918768063

In [112]:
cross_val_score(knn_5, X_test, y_test, cv = 3).mean()

0.8360578356354963

In [113]:
knn_15 = KNeighborsClassifier(n_neighbors=15)

In [114]:
knn_15.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

In [115]:
knn_15.score(X_train, y_train)


0.8491048593350383

In [116]:
knn_15.score(X_test, y_test)


0.8485139022051774

In [117]:
cross_val_score(knn_15, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [118]:
cross_val_score(knn_15, X_test, y_test, cv = 3).mean()

0.8485154636898651

In [119]:
knn_25 = KNeighborsClassifier(n_neighbors=15)

In [120]:
knn_25.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

In [121]:
knn_25.score(X_train, y_train)


0.8491048593350383

In [122]:
knn_25.score(X_test, y_test)


0.8485139022051774

In [123]:
cross_val_score(knn_25, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [124]:
cross_val_score(knn_25, X_test, y_test, cv = 3).mean()

0.8485154636898651

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

In [125]:
from sklearn.linear_model import LinearRegression, LogisticRegression

Answer: yes Ridge Regression.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

Answer: No, the predictor variables already have the same scale.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.

In [126]:
logregLa = LogisticRegression(penalty = 'l1', C = 1)

In [127]:
logregLa.fit(X_train, y_train)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [128]:
logregLa.score(X_train, y_train)

0.8491048593350383

In [129]:
logregLa.score(X_test, y_test)

0.8485139022051774

In [130]:
cross_val_score(logregLa, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [131]:
cross_val_score(logregLa, X_test, y_test, cv = 3).mean()

0.843717910497201

In [132]:
logregLa_10 = LogisticRegression(penalty = 'l1', C = 0.1)

In [133]:
logregLa_10.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [134]:
logregLa_10.score(X_train, y_train)

0.8494245524296675

In [135]:
logregLa_10.score(X_test, y_test)

0.8475551294343241

In [136]:
cross_val_score(logregLa_10, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [137]:
logregRi = LogisticRegression(penalty = 'l2', C = 0.1) #logistic regression model with ridge regularization and alpha=1

In [138]:
logregRi.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [139]:
logregRi.score(X_train, y_train)

0.8491048593350383

In [140]:
logregRi.score(X_test, y_test)

0.8485139022051774

In [141]:
cross_val_score(logregRi, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [142]:
cross_val_score(logregRi, X_test, y_test, cv = 3).mean()

0.8456336193094615

In [143]:
logregRi_10 = LogisticRegression(penalty = 'l2', C = 0.1) #logistic regression model with ridge regularization and alpha=10


In [144]:
logregRi_10.fit(X_train, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [145]:
logregRi_10.score(X_train, y_train)

0.8491048593350383

In [146]:
logregRi_10.score(X_test, y_test)

0.8485139022051774

In [147]:
cross_val_score(logregRi_10, X_train, y_train, cv = 3).mean()

0.8491051445912787

In [148]:
cross_val_score(logregRi_10, X_test, y_test, cv = 3).mean()

0.8456336193094615

---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

Answer: I'm not a behaviorial scientist of any kind, but I don't believe there is any relationship between personality and handedness.  Therefore I don't think my X variables will do a good job of predicting my Y variable.

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

| Model | Training Accuracy | Testing Accuracy |
| --- | --- | ------- |
| Knn_3| 0.8126613276854856 | 0.8063450263340952 |
| Knn_5| 0.8382365918768063 | 0.8360578356354963 |
| knn_15| 0.8491051445912787 | 0.8485154636898651 |
| knn_25| 0.8491051445912787 | 0.8485139022051774 |
| LogiRegr LASSO $\alpha$= 1| 0.8491051445912787 | 0.843717910497201 |
| LogiRegr LASSO $\alpha$= 10| 0.8491051445912787 | 0.8456336193094615 |
| LogiRegr Ridge $\alpha$= 1 | 0.8485139022051774 | 0.8398837324853422 |
| LogiRegr Rdige $\alpha$= 10 | 0.8491051445912787 | 0.8456336193094615
 |

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

For k=25 there is evidence of overfitting because that has a lower Testing accuracy than the k=15 model but the same training accuracy.

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

When k increases biases goes up and variance goes down.

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

1. Reduce k
2. Use some regularization scheme depending on the features.
3. Get rid of some features.

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

Both ridge regression models and Lasso with $\alpha$ = 1 because accuracy is better for the training set than it is for the test set. 

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

C = 1 / $\alpha$ , so the model becomes less overfit as C increases.  As C gets bigger, bias gets higher and variance gets lower.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

See above answer.  

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

1) Make C smaller.

2) Modify, scale, or eliminiate certain features.

3) Choose a suitable regularization scheme.

---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

I'd rather use logistic regression because it performs better.

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

Answer: For every unit increase in Q1(They've studied how to win at gambling) 1.7 % more likley to be left-handed.

In [150]:
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logregLa.coef_))], axis = 1)
#found this on stack exchange

In [151]:
coefficients.head(n=1)


Unnamed: 0,0,0.1,1,2
0,Q1,-0.016961,-0.024259,0.086216


### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

I'd go with the Lasso Logistic Regression model with $\alpha$ = 10 because it had the highest accuracy. 

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

Given the smallness of the model's coeffcients I'd say there is no one question to 

In [163]:
coefficients.sort_values

<bound method DataFrame.sort_values of       0         0         1         2
0    Q1 -0.016961 -0.024259  0.086216
1    Q2 -0.001145 -0.020356  0.052551
2    Q3 -0.062910  0.022940  0.123866
3    Q4  0.000000 -0.070866  0.188915
4    Q5 -0.027180  0.078443 -0.088210
5    Q6  0.017999  0.007456 -0.069014
6    Q7 -0.040607  0.003611  0.104166
7    Q8  0.129698 -0.156775 -0.041833
8    Q9  0.071173 -0.111535  0.014974
9   Q10  0.009350  0.013212 -0.102761
10  Q11 -0.026849  0.023285  0.018480
11  Q12  0.000000 -0.002477  0.006603
12  Q13 -0.004607 -0.029444  0.084078
13  Q14  0.055250 -0.026023 -0.145634
14  Q15  0.024851 -0.016077 -0.002225
15  Q16 -0.029484  0.081988 -0.099662
16  Q17 -0.067089  0.045370  0.098403
17  Q18  0.043030 -0.028519 -0.060442
18  Q19  0.000000 -0.023280  0.055072
19  Q20  0.014592  0.001368 -0.063868
20  Q21  0.119724 -0.050657 -0.278461
21  Q22  0.081188 -0.092849 -0.045229
22  Q23  0.034690 -0.010521 -0.087946
23  Q24  0.016230 -0.010460 -0.017931
24  Q25 -0.

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)