## Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

One way to define the data science process is as follows:

1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

---
## Step 1: Define the problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

In [2]:
# 1
# Do left-handed people answer the questions more quickly than right-handed people?
# (Computer mice are usually designed for right-handed people.)

# 2
# Are Native American respondents more likely to be left-handed than respondents of other races?

# 3
# Are aggressive people more likely to be left-handed?

---
## Step 2: Obtain the data.

### Read in the file titled "data.csv":
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [4]:
# library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
df = pd.read_csv('data.csv', delimiter = '\t')    # The file is tab-delimited despite its .csv extension

In [6]:
df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3
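
The hint about `pd.read_csv()` can be illustrated on a tiny inline sample. This is a minimal sketch with made-up values, not rows from `data.csv`: with the default comma separator, pandas reads each tab-delimited line as one wide column, while `sep="\t"` splits it into the expected columns.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for the real tab-delimited file.
sample = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

wrong = pd.read_csv(io.StringIO(sample))            # default sep=',' -> one wide column
right = pd.read_csv(io.StringIO(sample), sep="\t")  # tab separator -> three columns

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

`delimiter` is an alias for `sep` in `pd.read_csv`, so either keyword gives the same result.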


---

## Step 3: Explore the data.

### Conduct background research:

Domain knowledge is irreplaceable. Figuring out what information is relevant to a problem, or what data would be useful to gather, is a major part of any end-to-end data science project! For this lab, you'll be using a dataset that someone else has put together, rather than collecting the data yourself.

Do some background research about personality and handedness. What features, if any, are likely to help you make good predictions? How well do you think you'll be able to model this? Write a few bullet points summarizing what you believe, and remember to cite external sources.

You don't have to be exhaustive here. Do enough research to form an opinion, and then move on.

> You'll be using the answers to Q1-Q44 for modeling; you can disregard other features, e.g. country, age, internet browser.

### Are left-handed people better at art than right-handed people?
Based on the journal article we found, there is no difference.

### https://www.tandfonline.com/doi/full/10.1080/1357650X.2024.2315856

The notion of an increased incidence of left handers among architects and visual artists has inspired both scientific theory building and popular discussion. However, a systematic exploration of the available publications provides, at best, modest evidence for this claim. The present preregistered observational study was designed to reinvestigate the postulated association by examining hand preference of visual artists who share their artistic activities as short video clips (“reels”) on the social media platform Instagram. Determining individual hand preference based on five reels for each of N = 468 artists, we identified 42 (8.97%) left handers, suggesting an incidence which is below but statistical comparable to the 10.6% expected for the general population (χ2 = 1.30; p = .25; Cohen’s w = 0.05). Also, we did not find any support for the notion that the art created by left-handed artists is of higher quality than art of right handers, as no difference in public endorsement or interest were observed (reflected by the number of likes per post or account followers). Taken together, **we do not find any support for difference in artistic engagement or quality between left and right handers**.

### Conduct exploratory data analysis on this dataset:

If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

You might use this section to perform data cleaning if you find it to be necessary.

In [10]:
# Extract the rows that have a null value in any column
df[df.isnull().any(axis = 1)]    

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand


In [11]:
df.shape

(4184, 56)

In [12]:
# .describe() shows that the minimum of some columns is 0
# df.filter(regex='^Q').describe().T

In [13]:
# Columns Q1 - Q44 contain 697 zeros in total, about 15 - 16 per column on average
# Dropping every row that contains a 0 would cost between 16 rows (zeros packed into as few rows as possible) and 697 rows (one zero per row)
(df == 0).filter(regex='^Q').sum().sum()

697

In [14]:
# Drop rows that have a 0 in any of the columns Q1 - Q44
# We lose 381 rows, around 9%
df = df[~(df.loc[:, 'Q1':'Q44'] == 0).any(axis=1)]
df.shape

(3803, 56)
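
The gap between 697 zero entries and 381 dropped rows arises because a single row can hold several zeros. A minimal sketch on a made-up frame (not the lab data, and assuming 0 marks a skipped question) shows the two counts side by side:

```python
import pandas as pd

# Toy answers on a 1-5 scale; 0 is assumed to mean the question was skipped.
toy = pd.DataFrame({
    "Q1": [1, 0, 3, 0],
    "Q2": [5, 0, 2, 4],
    "Q3": [2, 0, 4, 1],
})

is_zero = toy == 0
total_zeros = int(is_zero.sum().sum())           # counts every zero cell
rows_with_zero = int(is_zero.any(axis=1).sum())  # counts rows holding at least one zero

# Keep only the rows with no zero in any column.
cleaned = toy[~is_zero.any(axis=1)]
print(total_zeros, rows_with_zero, cleaned.shape)
```

Here four zero cells sit in only two rows, so dropping them removes two rows rather than four.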

In [15]:
# There are 9 rows where the value of column 'hand' is 0
df[df['hand'] == 0]

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
1322,1,5,3,3,4,3,2,5,3,2,...,US,2,1,34,3,0,0,0,1,0
1846,1,1,1,2,1,1,3,3,1,4,...,NO,2,2,34,4,2,1,6,2,0
2409,5,1,5,3,4,3,3,4,5,3,...,US,2,2,18,2,1,0,7,7,0
2471,1,5,1,5,5,5,1,5,1,3,...,CA,1,1,22,3,0,1,6,2,0
2690,2,5,5,1,5,5,5,5,4,2,...,US,2,2,23763,4,1,2,7,7,0
2703,3,5,5,5,1,5,5,2,5,5,...,US,1,1,52,4,2,2,6,7,0
3098,1,4,3,2,4,3,2,1,3,5,...,NL,2,1,21,0,1,0,0,0,0
3105,1,4,4,5,1,3,2,2,5,5,...,US,2,1,57,4,0,2,6,6,0
4015,5,5,1,5,1,3,5,1,1,1,...,US,1,1,54,2,2,1,3,2,0


In [16]:
df = df[df['hand'] != 0]

### Calculate and interpret the baseline accuracy rate:

In [18]:
# Baseline accuracy = 85.27% if we predict every case as 1 (right-handed)
df['hand'].value_counts(normalize = True).mul(100).round(2)

hand
1    85.27
2    10.44
3     4.30
Name: proportion, dtype: float64
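
The baseline can also be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with roughly the class mix shown above; the 85 / 10 / 5 split is an illustrative assumption, not the lab data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the class mix reported above: 85 / 10 / 5.
y = np.array([1] * 85 + [2] * 10 + [3] * 5)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(baseline_acc)  # 0.85
```

Any real model has to beat this number to show that it has learned something beyond the class frequencies.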

### Short answer questions:

In this lab, you'll use K-nearest neighbors and logistic regression to model handedness based on psychological factors. 

Answer the following related questions; your answers may be in bullet points.

#### Describe the difference between regression and classification problems:

In [20]:
# Answer here:
# The target 'y' in a regression problem is continuous,
# while the target 'y' in a classification problem is discrete, usually with a finite set of classes

#### Considering $k$-nearest neighbors, describe the relationship between $k$ and the bias-variance tradeoff:

In [22]:
# Answer here:
# When k is low, the model is very flexible and relies only on the nearest points
# Low bias because it fits the training data closely
# High variance because a small change in the data can lead to a different outcome
# So it is likely to overfit

# When k is high, the model is less flexible because it averages over many neighbors
# High bias because it smooths over local structure and drifts toward the majority class
# Low variance because the predictions are more stable
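
The tradeoff can be seen by comparing training accuracy with cross-validated accuracy across several values of $k$. This is a minimal sketch on synthetic data from `make_classification`; the sample size, feature counts, and values of $k$ are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, unrelated to the survey responses.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

results = {}
for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    train_acc = knn.fit(X, y).score(X, y)             # accuracy on data the model has seen
    cv_acc = cross_val_score(knn, X, y, cv=5).mean()  # estimate on held-out folds
    results[k] = (train_acc, cv_acc)
    print(f"k={k:>3}  train={train_acc:.3f}  cv={cv_acc:.3f}")
```

With $k = 1$ every training point is its own nearest neighbor, so training accuracy is perfect even though held-out accuracy is not; that gap is the signature of high variance.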

#### Why do we often standardize predictor variables when using $k$-nearest neighbors?

In [24]:
# Answer here:
# The scales (units) of the variables may differ greatly, for example a person's height and salary.
# kNN relies on distances, so salary, with its much wider range of values, would dominate the distance calculation
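
A small numeric sketch of the height and salary example follows. The four people are hypothetical; the point is only that the nearest neighbor of person 0 changes once both columns are put on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: height in cm, salary in dollars.
X = np.array([
    [150.0, 50_000.0],
    [190.0, 50_500.0],
    [151.0, 52_000.0],
    [170.0, 90_000.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw scale: salary dominates, so person 1 looks closest to person 0
# despite the 40 cm height gap.
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# Standardized scale: each column has mean 0 and unit variance,
# so height and salary contribute comparably.
Z = StandardScaler().fit_transform(X)
std_01, std_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:          d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"standardized: d(0,1)={std_01:.3f}  d(0,2)={std_02:.3f}")
```

On the raw scale person 1 is the nearest neighbor of person 0; after standardizing, person 2 is.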

#### Do you think we should standardize the explanatory variables for this problem? Why or why not?

In [26]:
# Answer here:
# I think it's unnecessary because the independent variables Q1 - Q44 share the same scale (1-5).

#### How do we settle on $k$ for a $k$-nearest neighbors model?

In [28]:
# Answer here:
# We use cross-validation, trying many values of k and keeping the one with the best validation score.
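
One way to run that search is scikit-learn's `GridSearchCV`. The sketch below uses synthetic data and the four values of $k$ from this lab as the grid; both are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Evaluate each candidate k with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 15, 25]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a model refit on all of the supplied data with the winning $k$.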

#### What is the default type of regularization for logistic regression as implemented in scikit-learn? (You might [check the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)

In [30]:
# Answer here:
# L2 (ridge), which penalizes the squared Euclidean norm of the coefficients

#### Describe the relationship between the scikit-learn `LogisticRegression` argument `C` and regularization strength:

In [32]:
# Answer here:
# C is the inverse of regularization strength.
# Small C means strong regularization: coefficients shrink, which helps prevent overfitting.
# Large C means weak regularization: the model has more room to fit, which may lead to overfitting.
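
The effect of `C` on coefficient size can be checked empirically. This is a minimal sketch on synthetic data with the default L2 penalty; the three values of `C` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; only the strength changes between fits.
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} ||coef|| = {norms[C]:.3f}")
```

The coefficient norm grows as `C` grows, which is the sense in which a larger `C` means weaker regularization.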

#### Describe the relationship between regularization strength and the bias-variance tradeoff:

In [34]:
# Answer here:
# Small C, strong regularization: high bias and low variance; can lead to underfitting
# Large C, weak regularization: low bias and high variance; can lead to overfitting

#### Logistic regression is considered more interpretable than $k$-nearest neighbors. Explain why.

In [36]:
# Answer here:
# Logistic regression is a parametric algorithm. The model has coefficients that can be interpreted.
# kNN is non-parametric, so it is hard to describe the relationship between the features and the target
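
As a sketch of what interpreting coefficients looks like in practice, the example below fits a binary logistic regression on simulated data in which only the first feature drives the label (an assumed data-generating process) and converts the coefficients to odds ratios.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Assumed data-generating process: only the first feature drives the label.
y = (2.0 * X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["x0", "x1", "x2"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

The odds ratio for `x0` is far above 1 while the other two stay close to 1. A fitted kNN model has no comparable per-feature summary.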

---

## Step 4 & 5 Modeling: $k$-nearest neighbors

### Train-test split your data:

Your features should be:

In [38]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [39]:
X = df.loc[:, 'Q1':'Q44']
y = df['hand']

In [40]:
X.shape, y.shape

((3794, 44), (3794,))

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

In [42]:
y_train.value_counts(normalize = True),y_test.value_counts(normalize = True)

(hand
 1    0.852718
 2    0.104448
 3    0.042834
 Name: proportion, dtype: float64,
 hand
 1    0.852437
 2    0.104084
 3    0.043478
 Name: proportion, dtype: float64)
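
The matching proportions above are what `stratify = y` is for. A minimal sketch on made-up imbalanced labels (an 85 / 10 / 5 split chosen for illustration) shows the same behavior:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels (85% / 10% / 5%) with placeholder features.
y = np.array([1] * 850 + [2] * 100 + [3] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class mix the same in both parts of the split.
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_share = float(np.mean(y_tr == 1))
test_share = float(np.mean(y_te == 1))
print(train_share, test_share)
```

Without stratification, a rare class such as `3` could end up noticeably over- or under-represented in the test set by chance.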

In [43]:
# As noted earlier, scaling should be unnecessary here, but we scale anyway to compare
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test) 

#### Create and fit four separate $k$-nearest neighbors models: 
- one with $k = 3$
- one with $k = 5$
- one with $k = 15$
- one with $k = 25$:

In [74]:
knn3 = KNeighborsClassifier(n_neighbors = 3)
knn3.fit(X_train,y_train)
cross_val_score(knn3, X_train, y_train, cv = 10).mean()

0.8418414538822303

In [161]:
knn5 = KNeighborsClassifier(n_neighbors = 5)
knn5.fit(X_train,y_train)
cross_val_score(knn5, X_train, y_train, cv = 10).mean()

0.8378854003821434

In [163]:
knn15 = KNeighborsClassifier(n_neighbors = 15)
knn15.fit(X_train,y_train)
cross_val_score(knn15, X_train, y_train, cv = 10).mean()

0.8527195153725898

In [167]:
knn25 = KNeighborsClassifier(n_neighbors = 25)
knn25.fit(X_train,y_train)
cross_val_score(knn25, X_train, y_train, cv = 10).mean()

0.8527195153725898

In [76]:
knn3sc = KNeighborsClassifier(n_neighbors = 3)
knn3sc.fit(X_train_sc,y_train)
cross_val_score(knn3sc, X_train_sc, y_train, cv = 10).mean()

0.8411846447802676

In [197]:
knn5sc = KNeighborsClassifier(n_neighbors = 5)
knn5sc.fit(X_train_sc,y_train)
cross_val_score(knn5sc, X_train_sc, y_train, cv = 10).mean()

0.8392055323953448

In [201]:
knn15sc = KNeighborsClassifier(n_neighbors = 15)
knn15sc.fit(X_train_sc,y_train)
cross_val_score(knn15sc, X_train_sc, y_train, cv = 10).mean()

0.8523905680041688

In [203]:
knn25sc = KNeighborsClassifier(n_neighbors = 25)
knn25sc.fit(X_train_sc,y_train)
cross_val_score(knn25sc, X_train_sc, y_train, cv = 10).mean()

0.8527195153725898
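
The eight cells above repeat one pattern, so they can be collapsed into a loop. The sketch below also wraps the scaler in a `Pipeline`, so that scaling is refit inside each cross-validation fold rather than once on all of the training data. It runs on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

cv_scores = {}
for k in (3, 5, 15, 25):
    # The scaler is refit on the training part of every fold, so no
    # information from the validation fold leaks into the scaling.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=10).mean()

for k, score in cv_scores.items():
    print(f"k={k:>2}  mean CV accuracy = {score:.4f}")
```

Dropping `StandardScaler()` from the pipeline gives the unscaled variant, so both comparisons fit in the same loop.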

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [173]:
knn3.score(X_train,y_train), knn3.score(X_test,y_test)

(0.8695222405271829, 0.8260869565217391)

In [175]:
knn5.score(X_train,y_train), knn5.score(X_test,y_test)

(0.8530477759472818, 0.8445322793148881)

In [177]:
knn15.score(X_train,y_train), knn15.score(X_test,y_test)

(0.8527182866556837, 0.852437417654809)

In [179]:
knn25.score(X_train,y_train), knn25.score(X_test,y_test)

(0.8527182866556837, 0.852437417654809)

In [185]:
# The model with k = 3 is overfit: its test accuracy (0.826) is well below its training accuracy (0.870)
# No model beats the baseline on the test set; k = 15 and k = 25 only match it at about 85%.
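
A score that equals the baseline exactly suggests a model may be predicting only the majority class. The sketch below shows how that looks in a confusion matrix and in balanced accuracy, using made-up labels and a hypothetical always-predict-1 model rather than the fitted models above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels with an 85 / 10 / 5 class mix.
y_true = np.array([1] * 85 + [2] * 10 + [3] * 5)
y_pred = np.ones_like(y_true)  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

print(acc, bal_acc)
print(cm)  # rows are true classes, columns are predicted classes
```

Accuracy looks respectable at 0.85, but balanced accuracy drops to one third because classes `2` and `3` are never predicted. Running the same check on `knn15.predict(X_test)` would show whether that is what happened here.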

In [207]:
knn3sc.score(X_train_sc,y_train), knn3sc.score(X_test_sc,y_test)

(0.8658978583196046, 0.8274044795783926)

In [209]:
knn5sc.score(X_train_sc,y_train), knn5sc.score(X_test_sc,y_test)

(0.8523887973640857, 0.8445322793148881)

In [211]:
knn15sc.score(X_train_sc,y_train), knn15sc.score(X_test_sc,y_test)

(0.8527182866556837, 0.852437417654809)

In [213]:
knn25sc.score(X_train_sc,y_train), knn25sc.score(X_test_sc,y_test)

(0.8527182866556837, 0.852437417654809)

In [215]:
# Same results with standardized features :(

---

## Step 4 & 5 Modeling: logistic regression

#### Create and fit four separate logistic regression models: one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

Note: You can use the same train and test data as used above with kNN.

In [223]:
# scikit-learn's C is the inverse of alpha, so alpha = 1 -> C = 1 and alpha = 10 -> C = 0.1

# 1. Lasso (L1) with alpha = 1 (C = 1)
lasso_model_c1 = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')
lasso_model_c1.fit(X_train_sc, y_train)

# 2. Lasso (L1) with alpha = 10 (C = 0.1)
lasso_model_c01 = LogisticRegression(penalty = 'l1', C = 0.1, solver='liblinear')
lasso_model_c01.fit(X_train_sc, y_train)

# 3. Ridge (L2) with alpha = 1 (C = 1)
ridge_model_c1 = LogisticRegression(penalty = 'l2', C = 1.0)
ridge_model_c1.fit(X_train_sc, y_train)

# 4. Ridge (L2) with alpha = 10 (C = 0.1)
ridge_model_c01 = LogisticRegression(penalty = 'l2', C = 0.1)
ridge_model_c01.fit(X_train_sc, y_train)
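
The hint about $\alpha$ comes down to `C = 1 / alpha`. A minimal sketch of building the four models from the $\alpha$ values directly follows; the `models` dictionary is a hypothetical helper, and using `solver="liblinear"` for both penalties is a simplification (the cell above leaves the ridge models on the default solver).

```python
from sklearn.linear_model import LogisticRegression

alphas = [1, 10]  # regularization strengths from the lab prompt

# Key each unfitted model by (penalty, alpha); C is the inverse of alpha.
models = {
    (penalty, alpha): LogisticRegression(penalty=penalty, C=1 / alpha, solver="liblinear")
    for penalty in ("l1", "l2")
    for alpha in alphas
}

for (penalty, alpha), model in models.items():
    print(penalty, alpha, model.C)
```

Each model can then be fitted in a loop, for example `model.fit(X_train_sc, y_train)`, which keeps the mapping from $\alpha$ to `C` in one place.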

### Evaluate your models:

Evaluate each of your four models on the training and testing sets, and interpret the four scores. 

Are any of your models overfit or underfit? 

Do any of your models beat the baseline accuracy rate?

In [227]:
lasso_model_c1.score(X_train_sc,y_train),lasso_model_c1.score(X_test_sc,y_test)

(0.8527182866556837, 0.852437417654809)

In [229]:
lasso_model_c01.score(X_train_sc,y_train),lasso_model_c01.score(X_test_sc,y_test)

(0.8527182866556837, 0.852437417654809)

In [231]:
ridge_model_c1.score(X_train_sc,y_train),ridge_model_c1.score(X_test_sc,y_test)

(0.8530477759472818, 0.8511198945981555)

In [233]:
ridge_model_c01.score(X_train_sc,y_train),ridge_model_c01.score(X_test_sc,y_test)

(0.8530477759472818, 0.8511198945981555)

In [235]:
# Every model scores about 85%, which is no better than the baseline

---

## Step 6: Answer the problem.

Are any of your models worth moving forward with? 

What are the "best" models?

In [237]:
# None of them
# So far, logistic regression with L2 (C = 0.1 or C = 1) looks reasonable but does not beat the baseline
# Consider further data cleaning or adding features