## Week 4, Lab 1: Predicting Left-Handedness from Psychological Factors
> Author: Matt Brems

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

We'll walk through a full data science problem in this lab. 

In [48]:
#Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

---
## Step 1: Define The Problem.

You're currently a data scientist working at a university. A professor of psychology is attempting to study the relationship between personalities and left-handedness. They have tasked you with gathering evidence so that they may publish.

Specifically, the professor says "I need to prove that left-handedness is caused by some personality trait. Go find that personality trait and the data to back it up."

As a data scientist, you know that any real data science problem must be **specific** and **conclusively answerable**. For example:
- Bad data science problem: "What is the link between obesity and blood pressure?"
    - This is vague and is not conclusively answerable. That is, two people might look at the conclusion and one may say "Sure, the problem has been answered!" and the other may say "The problem has not yet been answered."
- Good data science problem: "Does an association exist between obesity and blood pressure?"
    - This is more specific and is conclusively answerable. The problem specifically is asking for a "Yes" or "No" answer. Based on that, two independent people should both be able to say either "Yes, the problem has been answered" or "No, the problem has not yet been answered."
- Excellent data science problem: "As obesity increases, how does blood pressure change?"
    - This is very specific and is conclusively answerable. The problem specifically seeks to understand the effect of one variable on the other.

### 1. In the context of the left-handedness and personality example, what are three specific and conclusively answerable problems that you could answer using data science? 

> You might find it helpful to check out the codebook in the repo for some inspiration.

1) Does an association exist between left-handedness and gender?

2) Do left-handers prefer mathematics over poetry?

3) Do left-handers prefer decorations?

---
## Step 2: Obtain the data.

### 2. Read in the file titled "data.csv."
> Hint: Despite being saved as a .csv file, you won't be able to simply `pd.read_csv()` this data!

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data_df = pd.read_csv('data.csv', sep = '\t')
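
To see why the `sep` argument matters, here is a minimal sketch using a made-up, tab-delimited excerpt in place of `data.csv` (the three columns and their values are hypothetical, chosen only for illustration). With the default comma separator every row collapses into one column; with `sep="\t"` the fields split as intended.

```python
import io
import pandas as pd

# Hypothetical tab-delimited excerpt standing in for data.csv.
raw = "Q1\tQ2\thand\n4\t1\t3\n1\t5\t1\n"

# Default separator is a comma, so each line is read as a single field.
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep="\t" splits the fields on tabs.
right = pd.read_csv(io.StringIO(raw), sep="\t")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A quick look at `.shape` or `.head()` right after loading is usually enough to catch a wrong delimiter.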

### 3. Suppose that, instead of us giving you this data in a file, you were actually conducting a survey to gather this data yourself. From an ethics/privacy point of view, what are three things you might consider when attempting to gather this data?
> When working with sensitive data like sexual orientation or gender identity, we need to consider how this data could be used if it fell into the wrong hands!

1) Tell my audience how the data will be used.

2) Give an option to skip questions which make my audience uncomfortable or opt out of the survey itself.

3) Agree to withdraw a participant's data, at any stage of its use, if that participant asks for it to be removed.

---
## Step 3: Explore the data.

### 4. Conduct exploratory data analysis on this dataset.
> If you haven't already, be sure to check out the codebook in the repo, as that will help in your EDA process.

In [29]:
#Checking for null values
data_df.isnull().sum().sum()

0

In [30]:
data_df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,country,fromgoogle,engnat,age,education,gender,orientation,race,religion,hand
0,4,1,5,1,5,1,5,1,4,1,...,US,2,1,22,3,1,1,3,2,3
1,1,5,1,4,2,5,5,4,1,5,...,CA,2,1,14,1,2,2,6,1,1
2,1,2,1,1,5,4,3,2,1,4,...,NL,2,2,30,4,1,1,1,1,2
3,1,4,1,5,1,4,5,4,3,5,...,US,2,1,18,2,2,5,3,2,2
4,5,1,5,1,5,1,5,1,3,1,...,US,2,1,22,3,1,1,3,2,3


---
## Step 4: Model the data.

### 5. Suppose I wanted to use Q1 - Q44 to predict whether or not the person is left-handed. Would this be a classification or regression problem? Why?

This is a classification problem because the target (left-handed or not) is a categorical variable rather than a continuous one.

### 6. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed based on their responses to Q1 - Q44. Before doing that, however, you remember that it is often a good idea to standardize your variables. In general, why would we standardize our variables? Give an example of when we would standardize our variables.

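Standardising the variables is important because different variables can have different units and scales, so they cannot be compared with each other directly. Converting each variable to Z-scores (subtract the mean, divide by the standard deviation) puts them on a common scale. This matters most for distance-based models such as $k$-NN: for example, if one predictor is age in years and another is income in dollars, income would dominate every distance calculation unless both are standardised.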
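
As a rough illustration of that point, the sketch below uses invented age/income values (not the survey data) and checks which row is closest to the first row before and after `StandardScaler`. The helper `nearest_to_first` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age in years, income in dollars].
X_demo = np.array([
    [25.0, 40000.0],
    [65.0, 40500.0],
    [27.0, 45000.0],
    [45.0, 90000.0],
    [55.0, 20000.0],
])

def nearest_to_first(X):
    # Euclidean distance from row 0 to every other row;
    # returns the index of the closest row.
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(d)) + 1

raw_neighbor = nearest_to_first(X_demo)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_demo))

print(raw_neighbor)     # 1: the 65-year-old with a near-identical income
print(scaled_neighbor)  # 2: the 27-year-old with a slightly higher income
```

On the raw values the 40-year age gap is invisible next to the income column; after scaling, both columns contribute to the distance.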

### 7. Give an example of when we might not standardize our variables.

When all predictor variables are of the same unit, there is little or no need for standardisation.

### 8. Based on your answers to 6 and 7, do you think we should standardize our predictor variables in this case? Why or why not?

No, there is not much need for standardisation in this case, as Q1 - Q44 are all measured in the same units.

### 9. We want to use $k$-nearest neighbors to predict whether or not a person is left-handed. What munging/cleaning do we need to do to our $y$ variable in order to explicitly answer this question? Do it.

Answer: 

We could create dummy variables for each category of `hand` (right-handed, left-handed, both). This would give us a dedicated column for left-handedness.

In [9]:
data_df.hand.value_counts()

1    3542
2     452
3     179
0      11
Name: hand, dtype: int64

In [4]:
dummydrop = pd.get_dummies(data_df['hand'], drop_first = True)

In [35]:
dummydrop.columns = ['right_handed', 'left_handed', 'both_hands']

data_df = pd.concat([data_df, dummydrop], axis = 1)

In [36]:
data_df.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,age,education,gender,orientation,race,religion,hand,right_handed,left_handed,both_hands
0,4,1,5,1,5,1,5,1,4,1,...,22,3,1,1,3,2,3,0,0,1
1,1,5,1,4,2,5,5,4,1,5,...,14,1,2,2,6,1,1,1,0,0
2,1,2,1,1,5,4,3,2,1,4,...,30,4,1,1,1,1,2,0,1,0
3,1,4,1,5,1,4,5,4,3,5,...,18,2,2,5,3,2,2,0,1,0
4,5,1,5,1,5,1,5,1,3,1,...,22,3,1,1,3,2,3,0,0,1
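
If only the left-handed indicator is needed, a boolean comparison is a shorter alternative to building all the dummy columns. This is a sketch on a made-up excerpt of the `hand` column, assuming the coding used in this lab (1 = right, 2 = left, 3 = both, 0 = missing); `y_left` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical excerpt of the `hand` column.
hand = pd.Series([3, 1, 2, 2, 3, 0, 1], name="hand")

# 1 where the respondent is left-handed (code 2), 0 otherwise.
y_left = (hand == 2).astype(int)

print(y_left.tolist())  # [0, 0, 1, 1, 0, 0, 0]
```

Note that this treats "both" and "missing" as not left-handed, which is the same choice the `left_handed` dummy column makes.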


### 10. The professor for whom you work suggests that you set $k = 4$. In this specific case, why might this be a bad idea?

It is advisable to use an odd value of $k$ for binary classification to avoid ties, i.e. both class labels receiving the same number of votes. With $k = 4$, a 2-2 split among the neighbours is possible.

Source : https://discuss.analyticsvidhya.com/t/why-to-use-odd-value-of-k-in-knn-algorithm/2704 
credit to Steve
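
The tie problem can be sketched with a plain majority vote. The `vote` helper below is hypothetical and is not how scikit-learn resolves ties internally; it only shows that an even number of neighbours can split evenly between two classes while an odd number cannot.

```python
from collections import Counter

def vote(labels):
    """Return (winning_label, tied) for a list of neighbour labels."""
    counts = Counter(labels).most_common()
    tied = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], tied

# Hypothetical neighbour labels (1 = left-handed, 0 = not left-handed).
print(vote([1, 1, 0, 0]))     # k = 4: (1, True), a 2-2 split
print(vote([1, 1, 0, 0, 0]))  # k = 5: (0, False), no tie possible
```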

### 11. Let's *(finally)* use $k$-nearest neighbors to predict whether or not a person is left-handed!

> Be sure to create a train/test split with your data!

> Create four separate models, one with $k = 3$, one with $k = 5$, one with $k = 15$, and one with $k = 25$.

> Instantiate and fit your models.

In [37]:
data_df.columns

Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10', 'Q11',
       'Q12', 'Q13', 'Q14', 'Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q20', 'Q21',
       'Q22', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q28', 'Q29', 'Q30', 'Q31',
       'Q32', 'Q33', 'Q34', 'Q35', 'Q36', 'Q37', 'Q38', 'Q39', 'Q40', 'Q41',
       'Q42', 'Q43', 'Q44', 'introelapse', 'testelapse', 'country',
       'fromgoogle', 'engnat', 'age', 'education', 'gender', 'orientation',
       'race', 'religion', 'hand', 'right_handed', 'left_handed',
       'both_hands'],
      dtype='object')

In [38]:
features = data_df.columns[:44]
X = data_df[features]
y = data_df['left_handed']
knn = KNeighborsClassifier()

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)

print (len(X_train),
      len(X_test),
      len(y_train),
      len(y_test))

3138 1046 3138 1046


In [65]:
print('Knn neighbors 1, train: ',cross_val_score(KNeighborsClassifier(n_neighbors=1), X_train, y_train, cv = 5).mean())
print('Knn neighbors 1, test: ',cross_val_score(KNeighborsClassifier(n_neighbors=1), X_test, y_test, cv = 5).mean())

Knn neighbors 1, train:  0.8148503133920499
Knn neighbors 1, test:  0.7868380742064952


In [66]:
print('Knn neighbors 3, train: ',cross_val_score(KNeighborsClassifier(n_neighbors = 3), X_train, y_train, cv = 5).mean())
print('Knn neighbors 3, test: ',cross_val_score(KNeighborsClassifier(n_neighbors = 3), X_test, y_test, cv = 5).mean())

Knn neighbors 3, train:  0.8683865134753501
Knn neighbors 3, test:  0.832730515098936


In [61]:
print('Knn neighbors 5, train: ',cross_val_score(KNeighborsClassifier(n_neighbors = 5), X_train, y_train, cv = 5).mean())
print('Knn neighbors 5, test: ',cross_val_score(KNeighborsClassifier(n_neighbors = 5), X_test, y_test, cv = 5).mean())

Knn neighbors 5, train:  0.8881434187669521
Knn neighbors 5, test:  0.8585451828872881


In [67]:
print('Knn neighbors 15, train: ',cross_val_score(KNeighborsClassifier(n_neighbors = 15), X_train, y_train, cv = 5).mean())
print('Knn neighbors 15, test: ',cross_val_score(KNeighborsClassifier(n_neighbors = 15), X_test, y_test, cv = 5).mean())

Knn neighbors 15, train:  0.8967502717418908
Knn neighbors 15, test:  0.8766769633874898


In [70]:
print('Knn neighbors 25, train: ',cross_val_score(KNeighborsClassifier(n_neighbors = 25), X_train, y_train, cv = 5).mean())
print('Knn neighbors 25, test: ',cross_val_score(KNeighborsClassifier(n_neighbors = 25), X_test, y_test, cv = 5).mean())

Knn neighbors 25, train:  0.8970687430794705
Knn neighbors 25, test:  0.8766769633874898


In [None]:
#Picking the model with maximum cross validation score
knn = KNeighborsClassifier(n_neighbors=25)

In [46]:
#Fitting the model
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=25, p=2,
                     weights='uniform')
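
The cells above estimate both the "train" and "test" figures with `cross_val_score`, which refits the model inside each subset. A more conventional alternative is to fit once on the training rows and score on the held-out rows. The sketch below does this on synthetic data from `make_classification` (an assumption made so the example runs on its own; it is not the survey data), with names such as `X_demo` and `scores` introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 44 features, roughly 11% positive class.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=44, n_informative=5,
    weights=[0.89], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

scores = {}
for k in [3, 5, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # First value: accuracy on the rows the model was fit on.
    # Second value: accuracy on rows the model has never seen.
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(k, scores[k])
```

Scored this way, the gap between the two numbers is a direct measure of how much the fitted model relies on memorising the training rows.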

Being good data scientists, we know that we might not run just one type of model. We might run many different models and see which is best.

### 12. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, let's check the [documentation for logistic regression in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is there default regularization? If so, what is it? If not, how do you know?

Yes. By default, `LogisticRegression` applies L2 (ridge) regularization (`penalty='l2'`) with `C=1.0`.

### 13. We want to use logistic regression to predict whether or not a person is left-handed. Before we do that, should we standardize our features?

No, there is not much need for standardisation in this case, as Q1 - Q44 are all measured in the same units.

### 14. Let's use logistic regression to predict whether or not the person is left-handed.


> Be sure to use the same train/test split with your data as with your $k$-NN model above!

> Create four separate models, one with LASSO and $\alpha = 1$, one with LASSO and $\alpha = 10$, one with Ridge and $\alpha = 1$, and one with Ridge and $\alpha = 10$. *(Hint: Be careful with how you specify $\alpha$ in your model!)*

> Instantiate and fit your models.
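
The hint about $\alpha$ refers to the fact that scikit-learn's `C` is the inverse of the regularization strength, so $C = 1/\alpha$. A minimal sketch of the four requested models follows; `make_logreg` is a hypothetical helper, and `solver="liblinear"` is chosen here because it supports both the L1 and L2 penalties.

```python
from sklearn.linear_model import LogisticRegression

def make_logreg(penalty, alpha):
    # C is the inverse of regularization strength: C = 1 / alpha.
    return LogisticRegression(penalty=penalty, C=1.0 / alpha, solver="liblinear")

models = {
    ("lasso", 1): make_logreg("l1", 1),
    ("lasso", 10): make_logreg("l1", 10),
    ("ridge", 1): make_logreg("l2", 1),
    ("ridge", 10): make_logreg("l2", 10),
}

for (name, alpha), model in models.items():
    print(name, "alpha =", alpha, "-> C =", model.C)
```

So $\alpha = 10$ corresponds to `C = 0.1`, not `C = 10`.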

In [95]:
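# solver = 'liblinear' supports the L1 penalty (it was the default solver in older scikit-learn releases)
print ('Lasso alpha 1, train: ' ,cross_val_score(LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear'), X_train, y_train, cv = 5).mean())
print ('Lasso alpha 1, test: ' ,cross_val_score(LogisticRegression(penalty = 'l1', C = 1.0, solver = 'liblinear'), X_test, y_test, cv = 5).mean())

# Note: the coefficients below come from a fit with C = 0.01 (alpha = 100), not the C = 1.0 model scored above
lr = LogisticRegression(penalty = 'l1', C = 0.01, solver = 'liblinear')
lr.fit(X_train, y_train)
lr.coef_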



Lasso alpha 1, train:  0.8970687430794705
Lasso alpha 1, test:  0.8747722014827278


array([[ 0.00000000e+00, -2.61292425e-02,  0.00000000e+00,
        -9.92837367e-03,  0.00000000e+00, -8.99565814e-03,
        -1.61056656e-02, -9.88366648e-02,  0.00000000e+00,
         0.00000000e+00, -8.40781120e-03, -3.88915559e-05,
        -1.70284041e-02,  0.00000000e+00, -8.73886178e-03,
         0.00000000e+00,  0.00000000e+00, -1.43312063e-02,
        -2.25827566e-02, -2.77535660e-02, -1.36910206e-03,
        -1.06724472e-01, -7.22194775e-02,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        -4.26177359e-02,  0.00000000e+00, -2.07106537e-02,
        -1.25173639e-02, -7.11801527e-02,  0.00000000e+00,
        -4.06794961e-02,  0.00000000e+00]])

In [100]:
# alpha = 1/C, so C = 10 here corresponds to alpha = 0.1 (alpha = 10 would be C = 0.1)
print ('Lasso C 10, train: ' ,cross_val_score(LogisticRegression(penalty = 'l1', C = 10, solver = 'liblinear'), X_train, y_train, cv = 5).mean())
print ('Lasso C 10, test: ' ,cross_val_score(LogisticRegression(penalty = 'l1', C = 10, solver = 'liblinear'), X_test, y_test, cv = 5).mean())


lr = LogisticRegression(penalty = 'l1', C = 10, solver = 'liblinear')
lr.fit(X_train, y_train)
lr.coef_



Lasso C 10, train:  0.8970687430794705
Lasso C 10, test:  0.8728583258846416




array([[ 0.        ,  0.00124673,  0.01361231, -0.05832586,  0.05267561,
         0.01378587, -0.02028088, -0.14565867, -0.00897714,  0.03788356,
        -0.02637519, -0.02388296, -0.02938117, -0.00209476, -0.03925933,
         0.05898526,  0.01737615, -0.02453466, -0.0167781 , -0.03285178,
        -0.03095081, -0.11547072, -0.03465949,  0.00739917,  0.04483077,
         0.11718526,  0.11339695, -0.05619017,  0.03763763,  0.01165462,
         0.02959442,  0.00739759,  0.00330838,  0.00040353, -0.00253756,
        -0.02926244, -0.03441661,  0.07197816, -0.0741484 , -0.06754941,
        -0.07461089, -0.01185968, -0.17916452,  0.0197644 ]])

In [73]:
print ('Ridge alpha 1, train: ' ,cross_val_score(LogisticRegression(penalty = 'l2', C = 1.0), X_train, y_train, cv = 5).mean())
print ('Ridge alpha 1, test: ' ,cross_val_score(LogisticRegression(penalty = 'l2', C = 1.0), X_test, y_test, cv = 5).mean())



Ridge alpha 1, train:  0.8970687430794705
Ridge alpha 1, test:  0.8728583258846416




In [94]:
# alpha = 1/C, so C = 0.01 here corresponds to alpha = 100 (alpha = 10 would be C = 0.1)
print ('Ridge C 0.01, train: ' ,cross_val_score(LogisticRegression(penalty = 'l2', C = 0.01), X_train, y_train, cv = 5).mean())
print ('Ridge C 0.01, test: ' ,cross_val_score(LogisticRegression(penalty = 'l2', C = 0.01), X_test, y_test, cv = 5).mean())

Ridge C 0.01, train:  0.8970687430794705
Ridge C 0.01, test:  0.8766769633874898




---
## Step 5: Evaluate the model(s).

### 15. Before calculating any score on your data, take a step back. Think about your $X$ variable and your $Y$ variable. Do you think your $X$ variables will do a good job of predicting your $Y$ variable? Why or why not? What impact do you think this will have on your scores?

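At first glance they appear to do a good job, with accuracy scores near 90%. However, only about 11% of respondents (452 of 4,184) are left-handed, so a model that always predicts "not left-handed" would already score about 89%. Accuracy alone therefore says little about how well Q1 - Q44 predict left-handedness; a more in-depth analysis of the predictions is needed.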
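
A useful check on accuracy figures like these is the majority-class baseline: the accuracy obtained by always predicting the most common outcome. The sketch below computes it from the counts printed by `data_df.hand.value_counts()` earlier in the lab (1 = right, 2 = left, 3 = both, 0 = missing); the variable names are introduced here for illustration.

```python
# Counts copied from the value_counts() output earlier in the lab.
counts = {1: 3542, 2: 452, 3: 179, 0: 11}

total = sum(counts.values())
left_share = counts[2] / total

# Accuracy of a model that always predicts "not left-handed".
baseline_accuracy = 1 - left_share

print(total)                        # 4184
print(round(left_share, 3))         # 0.108
print(round(baseline_accuracy, 3))  # 0.892
```

Any model scoring close to 0.89 on this target is doing roughly as well as predicting "not left-handed" for everyone.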

### 16. Using accuracy as your metric, evaluate all eight of your models on both the training and testing sets. Put your scores below. (If you want to be fancy and generate a table in Markdown, there's a [Markdown table generator site linked here](https://www.tablesgenerator.com/markdown_tables#).)
- Note: Your answers here might look a little weird. You didn't do anything wrong; that's to be expected!

| Model                                                 | Training Accuracy Score | Testing Accuracy Score |
|-------------------------------------------------------|-------------------------|------------------------|
| Knn n_neighbors = 1                                   | 0.8149                  | 0.7868                 |
| Knn n_neighbors = 3                                   | 0.8684                  | 0.8327                 |
| Knn n_neighbors = 5                                   | 0.8881                  | 0.8585                 |
| Knn n_neighbors = 15                                  | 0.8968                  | 0.8767                 |
| Knn n_neighbors = 25                                  | 0.8971                  | 0.8767                 |
| Logistic Regression (lasso, C = 1, i.e. alpha = 1)    | 0.8971                  | 0.8748                 |
| Logistic Regression (lasso, C = 10, i.e. alpha = 0.1) | 0.8971                  | 0.8729                 |
| Logistic Regression (Ridge, C = 1, i.e. alpha = 1)    | 0.8971                  | 0.8729                 |
| Logistic Regression (Ridge, C = 0.01, i.e. alpha = 100) | 0.8971                | 0.8767                 |

### 17. In which of your $k$-NN models is there evidence of overfitting? How do you know?

There is not much evidence of overfitting in any model: training accuracy exceeds testing accuracy by only about 2 to 4 percentage points, which is a small and generally accepted gap. The gap is widest for the smallest values of $k$ (1 and 3), which is where overfitting would be expected to show up first.

### 18. Broadly speaking, how does the value of $k$ in $k$-NN affect the bias-variance tradeoff? (i.e. As $k$ increases, how are bias and variance affected?)

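As $k$ increases, bias increases and variance decreases: predictions are averaged over more neighbours, so the model becomes smoother and less sensitive to individual training points. A small $k$ does the opposite (low bias, high variance).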
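
One way to see the effect of $k$ is to score a model on the same rows it was fit on. The sketch below uses synthetic data from `make_classification` (an assumption, not the survey data). With $k = 1$ each training point is its own nearest neighbour, so the fit reproduces the training labels exactly; larger $k$ averages over more neighbours and no longer does so.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with continuous features (no duplicate rows).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

train_acc = {}
for k in [1, 5, 25, 101]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    # Accuracy on the training rows themselves.
    train_acc[k] = model.score(X_demo, y_demo)
    print(k, round(train_acc[k], 3))
```

A perfect training score at $k = 1$ is a sign of memorisation (high variance), not of a good model.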

### 19. If you have a $k$-NN model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

1) Increase K

2) Drop outliers

3) Different model

### 20. In which of your logistic regression models is there evidence of overfitting? How do you know?

There is not much evidence of overfitting in any model: training accuracy exceeds testing accuracy by only about 2 percentage points, which is a small and generally accepted gap.

### 21. Broadly speaking, how does the value of $C$ in logistic regression affect the bias-variance tradeoff? (i.e. As $C$ increases, how are bias and variance affected?)

In general, a larger $C$ means weaker regularization, which lowers bias and raises variance; a smaller $C$ does the opposite. In this case, though, changing $C$ has very little effect: more regularization should reduce overfitting and lower training accuracy, but the scores here barely move.

### 22. For your logistic regression models, play around with the regularization hyperparameter, $C$. As you vary $C$, what happens to the fit and coefficients in the model? What do you think this means in the context of this specific problem?

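As $C$ decreases (stronger regularization), the coefficients shrink, and with the L1 penalty many become exactly zero: the fit with $C = 0.01$ above zeroes out more than half of the 44 coefficients, while the fit with $C = 10$ leaves all but one nonzero. The accuracy scores barely change either way, which suggests that the survey questions carry little signal about left-handedness.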
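
The relationship between $C$ and the coefficients can be checked by counting nonzero coefficients across a range of values. This is a sketch on synthetic data from `make_classification` (not the survey data); `nonzero` is a name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 20 features, only 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0
)

nonzero = {}
for C in [0.01, 0.1, 1, 10]:
    lr = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X_demo, y_demo)
    # Number of coefficients the L1 penalty has NOT shrunk to exactly zero.
    nonzero[C] = int(np.count_nonzero(lr.coef_))
    print(C, nonzero[C], round(float(np.abs(lr.coef_).sum()), 3))
```

Smaller `C` values should leave fewer nonzero coefficients and a smaller total coefficient magnitude.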

### 23. If you have a logistic regression model that has evidence of overfitting, what are three things you might try to do to combat overfitting?

1) Increase regularization (lower $C$)

2) Drop outliers

3) Try a different model


---
## Step 6: Answer the problem.

### 24. Suppose you want to understand which psychological features are most important in determining left-handedness. Would you rather use $k$-NN or logistic regression? Why?

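I would use logistic regression, as its coefficients indicate which psychological features matter most: features with larger coefficients (in absolute value) have a stronger association with left-handedness. $k$-NN produces no coefficients to interpret.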

### 25. Select your logistic regression model that utilized LASSO regularization with $\alpha = 1$. Interpret the coefficient for `Q1`.

The coefficient for `Q1` is 0, meaning LASSO has dropped it from the model. This indicates that learning to win at gambling has no effect on the predicted odds of being left-handed, holding the other answers fixed.
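
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds for a one-point increase in that answer. The sketch below applies this to the first two values of the first LASSO coefficient array printed earlier (the fit with `C = 0.01`); the dictionary names are introduced here for illustration.

```python
import numpy as np

# First two coefficients copied from the first lr.coef_ output in the lab.
coefs = {"Q1": 0.0, "Q2": -2.61292425e-02}

# exp(coefficient) = odds ratio for a one-point increase in the answer.
odds_ratios = {q: float(np.exp(b)) for q, b in coefs.items()}

for q, ratio in odds_ratios.items():
    print(q, round(ratio, 4))
```

An odds ratio of exactly 1 for `Q1` means the answer leaves the predicted odds unchanged; the value slightly below 1 for `Q2` means higher answers are associated with slightly lower predicted odds of left-handedness.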

### 26. If you have to select one model overall to be your *best* model, which model would you select? Why?
- Usually in the "real world," you'll fit many types of models but ultimately need to pick only one! (For example, a client may not understand what it means to have multiple models, or if you're using an algorithm to make a decision, it's probably pretty challenging to use two or more algorithms simultaneously.) It's not always an easy choice, but you'll have to make it soon enough. Pick a model and defend why you picked this model!

I would pick the Lasso Logistic regression model with alpha 1. This is because it gives relatively good accuracy scores and it would help identify psychological features which are important.

### 27. Circle back to the three specific and conclusively answerable questions you came up with in Q1. Answer one of these for the professor based on the model you selected!

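Does an association exist between left-handedness and gender?

The selected model cannot answer this directly: it was fit on Q1 - Q44 only, so it has no coefficient for `gender`. To answer the question, `gender` would need to be added as a feature (or compared against `hand` on its own).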

### BONUS:
Looking for more to do? Probably not - you're busy! But if you want to, consider exploring the following. (They could make for a blog post!)
- Create a visual plot comparing training and test metrics for various values of $k$ and various regularization schemes in logistic regression.
- Rather than just evaluating models based on accuracy, consider using sensitivity, specificity, etc.
- In the context of predicting left-handedness, why are unbalanced classes concerning? If you were to re-do this process given those concerns, what changes might you make?
- Fit and evaluate a generalized linear model other than logistic regression (e.g. Poisson regression).
- Suppose this data were in a `SQL` database named `data` and a table named `inventory`. What `SQL` query would return the count of people who were right-handed, left-handed, both, or missing with their class labels of 1, 2, 3, and 0, respectively? (You can assume you've already logged into the database.)
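
As a starting point for the sensitivity/specificity and unbalanced-classes bullets, here is a small sketch with invented labels (11 left-handed respondents out of 100) and a hypothetical model that never predicts "left-handed". It shows how accuracy can look respectable while sensitivity is zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = left-handed, 0 = not left-handed.
y_true = np.array([1] * 11 + [0] * 89)
# A "model" that predicts not left-handed for everyone.
y_pred = np.zeros(100, dtype=int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # share of left-handers correctly identified
specificity = tn / (tn + fp)   # share of non-left-handers correctly identified

print(accuracy, sensitivity, specificity)  # 0.89 0.0 1.0
```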