<a href="https://colab.research.google.com/github/tesseract314/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
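As a rough illustration of both hints, the sketch below one-hot encodes one categorical column and min-max scales one numeric column. The tiny frame `toy` is an assumption made for the example (its values are copied from the first rows of the dataset); it is not the notebook's `df`.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Toy frame (hypothetical name) holding two columns from the first four rows
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
})

# One-hot encode the categorical column: one indicator column per category
encoded = pd.get_dummies(toy, columns=["workclass"])

# Rescale the numeric column to the [0, 1] range
encoded["age"] = minmax_scale(encoded["age"].astype(float))

print(encoded)
```

One-hot encoding avoids implying an order among categories, at the cost of one extra column per category.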

In [0]:
# Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression

# Setting options for pandas
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
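One possible way to load and name the columns in a single step is sketched below. It reads from an in-memory string rather than the UCI URL so it runs offline; the second sample row is hypothetical and only exists to show the `?` placeholder. Passing `skipinitialspace=True` strips the space that follows each comma in the raw file, and recoding `?` as `unknown` is one reading of the "another category" suggestion, not the only one.

```python
import io
import pandas as pd

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
           'income']

# Two rows in the raw file's layout; the second row is a made-up example
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, ?, >50K\n"
)

sample_df = pd.read_csv(sample, header=None, names=columns,
                        skipinitialspace=True)

# Treat the '?' placeholder as its own category in the columns that use it
for col in ['workclass', 'occupation', 'native_country']:
    sample_df[col] = sample_df[col].replace('?', 'unknown')

print(sample_df[['workclass', 'occupation', 'native_country', 'income']])
```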

In [41]:
# Importing data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [42]:
# Renaming columns
df = df.rename({0: 'age', 1: 'workclass', 2: 'fnlwgt', 3: 'education', 4: 'education_num', 5: 'marital_status',
                6: 'occupation', 7: 'relationship', 8: 'race', 9: 'sex', 10: 'capital_gain', 11: 'capital_loss',
                12: 'hours_per_week', 13: 'native_country', 14: 'income'}, axis=1)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [47]:
# Looking at data types
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object

In [55]:
# Using value_counts to see what categories are in the object columns
df['income'].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

In [56]:
df['workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

In [57]:
df['education'].value_counts()

 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

In [58]:
df['marital_status'].value_counts()

 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: marital_status, dtype: int64

In [59]:
df['occupation'].value_counts()

 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

In [60]:
df['relationship'].value_counts()

 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: relationship, dtype: int64

In [61]:
df['race'].value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: race, dtype: int64

In [62]:
df['sex'].value_counts()

 Male      21790
 Female    10771
Name: sex, dtype: int64

In [63]:
df['native_country'].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [0]:
# Creating label encoder instance
le = LabelEncoder()

In [68]:
# Encoding categorical features and printing order of encoding

categorical_cols = ['workclass', 'education', 'marital_status', 'occupation',
                    'relationship', 'race', 'sex', 'native_country', 'income']

# Fit the encoder on each column, print the class order, then replace the column
for col in categorical_cols:
    le.fit(df[col])
    print(le.classes_)
    df[col] = le.transform(df[col])

[' ?' ' Federal-gov' ' Local-gov' ' Never-worked' ' Private'
 ' Self-emp-inc' ' Self-emp-not-inc' ' State-gov' ' Without-pay']
[' 10th' ' 11th' ' 12th' ' 1st-4th' ' 5th-6th' ' 7th-8th' ' 9th'
 ' Assoc-acdm' ' Assoc-voc' ' Bachelors' ' Doctorate' ' HS-grad'
 ' Masters' ' Preschool' ' Prof-school' ' Some-college']
[' Divorced' ' Married-AF-spouse' ' Married-civ-spouse'
 ' Married-spouse-absent' ' Never-married' ' Separated' ' Widowed']
[' ?' ' Adm-clerical' ' Armed-Forces' ' Craft-repair' ' Exec-managerial'
 ' Farming-fishing' ' Handlers-cleaners' ' Machine-op-inspct'
 ' Other-service' ' Priv-house-serv' ' Prof-specialty' ' Protective-serv'
 ' Sales' ' Tech-support' ' Transport-moving']
[' Husband' ' Not-in-family' ' Other-relative' ' Own-child' ' Unmarried'
 ' Wife']
[' Amer-Indian-Eskimo' ' Asian-Pac-Islander' ' Black' ' Other' ' White']
[' Female' ' Male']
[' ?' ' Cambodia' ' Canada' ' China' ' Columbia' ' Cuba'
 ' Dominican-Republic' ' Ecuador' ' El-Salvador' ' England' ' France'
 ' 

In [69]:
# Looking at encoded df
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


In [71]:
# Getting an idea of the range of each feature and making sure I have proper counts
df.describe()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,3.868892,189778.4,10.29821,10.080679,2.611836,6.57274,1.446362,3.665858,0.669205,1077.648844,87.30383,40.437456,36.718866,0.24081
std,13.640433,1.45596,105550.0,3.870264,2.57272,1.506222,4.228857,1.606771,0.848806,0.470506,7385.292085,402.960219,12.347429,7.823782,0.427581
min,17.0,0.0,12285.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,28.0,4.0,117827.0,9.0,9.0,2.0,3.0,0.0,4.0,0.0,0.0,0.0,40.0,39.0,0.0
50%,37.0,4.0,178356.0,11.0,10.0,2.0,7.0,1.0,4.0,1.0,0.0,0.0,40.0,39.0,0.0
75%,48.0,4.0,237051.0,12.0,12.0,4.0,10.0,3.0,4.0,1.0,0.0,0.0,45.0,39.0,0.0
max,90.0,8.0,1484705.0,15.0,16.0,6.0,14.0,5.0,4.0,1.0,99999.0,4356.0,99.0,41.0,1.0


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
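A minimal sketch of the workflow described above, run on synthetic data so it stands alone: scale the features, hold out a test set, fit, and label each coefficient with its feature name. The names `demo`, `feature_a`, and `feature_b` are hypothetical, and the data is generated so that only `feature_a` matters. For simplicity the scaling is applied before the split, which a stricter validation setup would avoid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
n = 500

# Synthetic features on very different scales
demo = pd.DataFrame({
    "feature_a": rng.normal(50, 10, n),    # drives the outcome by construction
    "feature_b": rng.normal(0, 1000, n),   # unrelated to the outcome
})
logit = 0.2 * (demo["feature_a"] - 50)
demo_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize, keeping column names so coefficients can be labeled
demo_X = pd.DataFrame(scale(demo), columns=demo.columns)

X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# One coefficient per feature, largest first
coefs = pd.Series(model.coef_[0], index=demo_X.columns).sort_values(ascending=False)
print(coefs)
print("held-out accuracy:", model.score(X_test, y_test))
```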

In [77]:
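# Defining variables, scaling X variables
# (column names are kept so coefficients can be matched to features later)
features = df.drop(columns='income')
X = pd.DataFrame(scale(features), columns=features.columns)
y = df['income']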

  


In [82]:
# Creating logistic regression instance and fitting model
log_reg = LogisticRegression().fit(X, y)

# Getting score: for a classifier this is mean accuracy on (X, y), not a pseudo R^2
log_reg.score(X, y)



0.8250667977027732
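For a classifier, `score` returns mean accuracy, so it helps to compare it with the accuracy of always predicting the majority class. The short calculation below does that using the class counts and the score printed in this notebook; the variable names are chosen for the example.

```python
# Class counts reported by df['income'].value_counts() in this notebook
n_low, n_high = 24720, 7841

# Accuracy of a "model" that always predicts the majority class (<=50K)
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)

# Value returned by log_reg.score(X, y) above
model_accuracy = 0.8250667977027732

print("baseline:", round(baseline_accuracy, 4))
print("improvement over baseline:", round(model_accuracy - baseline_accuracy, 4))
```

The model beats the baseline by roughly 6.6 percentage points, which is a more informative figure than the raw accuracy alone.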

In [83]:
# Looking at coefficients for log_reg model
log_reg.coef_

array([[ 0.4646181 , -0.03154805,  0.05373115,  0.06094938,  0.85222912,
        -0.35307137,  0.04470072, -0.19192929,  0.09619363,  0.42082713,
         2.32323214,  0.27335535,  0.36885636,  0.02689289]])

In [84]:
# Looking at dataset, trying to make sense of coefficients
df.head()

# The initial numerical features have the most intuitive coefficients

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


In [87]:
# Looking at the percent of >50k incomes model predicted
predictions = pd.Series(log_reg.predict(X))
predictions.value_counts(normalize=True)

0    0.846903
1    0.153097
dtype: float64

In [89]:
# Looking at actual percent of >50k incomes in the dataset
df['income'].value_counts(normalize=True)

# The logistic regression under-predicted the number of >50k incomes

0    0.75919
1    0.24081
Name: income, dtype: float64
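One common response to under-predicting the positive class is to lower the decision threshold applied to `predict_proba`, since `predict` effectively uses 0.5. The sketch below demonstrates the effect on synthetic, imbalanced data; the data, the 0.3 threshold, and all names are assumptions for illustration, not tuned values for the Adult dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2000

# Synthetic imbalanced problem: the positive class is the minority
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-(1.5 * x[:, 0] - 1.5)))
labels = (rng.random(n) < p).astype(int)

clf = LogisticRegression().fit(x, labels)
proba = clf.predict_proba(x)[:, 1]   # estimated P(class 1) for each row

# Share of rows predicted positive at two thresholds
share_default = (proba >= 0.5).mean()
share_lowered = (proba >= 0.3).mean()

print("actual positive share:   ", labels.mean())
print("predicted at 0.5 cutoff: ", share_default)
print("predicted at 0.3 cutoff: ", share_lowered)
```

Lowering the threshold raises the predicted positive share, usually trading some precision for recall.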

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.
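Because the features were standardized, each coefficient is the change in log-odds for a one-standard-deviation increase in that feature, and exponentiating it gives an odds ratio. The sketch below pairs the coefficients printed by `log_reg.coef_` with the column order of `df.drop(columns='income')`; it only re-expresses numbers already shown above.

```python
import numpy as np
import pandas as pd

# Feature order matches df.drop(columns='income')
feature_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                 'capital_gain', 'capital_loss', 'hours_per_week',
                 'native_country']

# Values copied from the log_reg.coef_ output above
coef = [0.4646181, -0.03154805, 0.05373115, 0.06094938, 0.85222912,
        -0.35307137, 0.04470072, -0.19192929, 0.09619363, 0.42082713,
        2.32323214, 0.27335535, 0.36885636, 0.02689289]

coefs = pd.Series(coef, index=feature_names)

# Odds ratio per one-standard-deviation increase, largest first
odds_ratios = np.exp(coefs).sort_values(ascending=False)
print(odds_ratios.round(2))
```

An odds ratio above 1 means higher values of the feature go with higher odds of income above 50k; below 1 means lower odds.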

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

**TODO - your answers!**

1. What are 3 features positively correlated with income above 50k?

**The three features with the largest positive coefficients for income above 50k were capital gains (2.32), number of years of education (0.85), and age (0.46). All of these results make sense. If you are making capital gains, you are making money from investments, and if you're making money from investments, you have disposable income beyond what basic necessities require. Also, the more education you receive, the more opportunities you have to work in higher-paying careers. And the longer you are alive (age), the longer you have had to climb the socioeconomic ladder.**

2. What are 3 features negatively correlated with income above 50k?

**The three features with negative coefficients were marital status (-0.35), relationship (-0.19), and workclass (-0.03). These are less intuitive, partly because they are categorical features that `LabelEncoder` mapped to integers in alphabetical order, which imposes an arbitrary ranking on the categories. I could have been more methodical with the encoding. For example, I could have one-hot encoded these features, as the hints suggest, or ordered the workclass codes by average income within each class, from lowest to highest, using published studies on income by workclass. Either approach would have made better use of these and the other categorical features in the dataset.**
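The ordering idea can be sketched as follows. This variant ranks categories by the share of high earners observed in the data itself rather than by outside studies, which is an assumption of the example and can leak target information if it is not done on training data only. The frame `toy` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample: workclass plus a 0/1 income label
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp-inc",
                  "Self-emp-inc", "Without-pay", "Private"],
    "income": [0, 1, 1, 1, 0, 0],
})

# Share of high earners per category, sorted from lowest to highest
rate = toy.groupby("workclass")["income"].mean().sort_values()

# Rank 0 goes to the lowest-earning category, and so on upward
order = {category: rank for rank, category in enumerate(rate.index)}
toy["workclass_ordered"] = toy["workclass"].map(order)

print(order)
print(toy)
```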

3. Overall, how well does the model explain the data and what insights do you derive from it?

**I think the model does a decent job of explaining the target variable (i.e. income above or below 50k). Its accuracy is 82.5%, compared with 75.9% for always predicting the majority class. In the dataset, 24% of the participants made above 50k, but the model predicted that only 15% did, so it under-predicted the number of participants making above 50k. To get better results, I would have to be more methodical with my encoding and do more feature engineering. Overall, though, I think this model is a good starting point for predicting income categories.**

Situation 1: 

**I would use Quantile Regression in this scenario. To evaluate what makes students get bottom tier grades, I would fit the model to a lower quantile of the data (e.g. 0.1 quantile) and see how the coefficients of that model compare to the coefficients of models fit at higher quantiles. For example, we may see that hours of sleep per night has a greater effect on the grades of lower quantile students than average students.**
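To make the idea of "fitting to a quantile" concrete, the sketch below uses the pinball (quantile) loss that quantile regression minimizes. With a constant prediction and no features, the minimizer is simply the chosen quantile of the data. The grade distribution and the helper `pinball_loss` are hypothetical, written for this example.

```python
import numpy as np

def pinball_loss(y, prediction, q):
    """Mean pinball (quantile) loss of a constant prediction at quantile q."""
    residual = y - prediction
    # Under-predictions are weighted by q, over-predictions by (1 - q)
    return np.mean(np.maximum(q * residual, (q - 1) * residual))

rng = np.random.default_rng(0)
grades = rng.normal(70, 10, 1000)   # hypothetical grade distribution

# Search a grid of constant predictions for the loss minimizer
candidates = np.linspace(40, 100, 601)
best_low = candidates[np.argmin([pinball_loss(grades, c, 0.1) for c in candidates])]
best_mid = candidates[np.argmin([pinball_loss(grades, c, 0.5) for c in candidates])]

print("minimizer at q=0.1:", best_low, "empirical 10th percentile:", np.quantile(grades, 0.1))
print("minimizer at q=0.5:", best_mid, "empirical median:", np.median(grades))
```

With features added, the same loss yields coefficients that describe how each feature shifts that quantile rather than the mean.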

Situation 2:

**I would use Survival Analysis in this situation. An ideal dataset might record when rumors of a new product first appear in the media (birth) and when the product is actually launched (death). It would include the time from birth to death for launched products and, for products not yet launched when observation ends, the time from birth to censoring. Then, using Survival Analysis, I could get a better idea of how long it takes for a product to go from rumor to launch.**
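A hand-rolled Kaplan-Meier estimate shows how censored products still contribute information. This is a minimal sketch on made-up durations, not a substitute for a survival-analysis library; the function `kaplan_meier` and the data are assumptions of the example.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # not yet launched just before t
        events = np.sum((durations == t) & observed)   # launches exactly at t
        s *= 1 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical months from first rumor to launch; 0 = still unlaunched (censored)
months = [3, 5, 5, 8, 10, 12]
launched = [1, 1, 0, 1, 0, 1]

times, surv = kaplan_meier(months, launched)
for t, s in zip(times, surv):
    print(f"P(not launched after {t:.0f} months) = {s:.3f}")
```

Censored products stay in the at-risk count until their last observed month, so they lower the estimated launch rate without being treated as launches.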

Situation 3:

**In this situation, I would use Ridge Regression. Because we are only able to evaluate a few dozen plants at a time, we want our predictive ability to transfer to new (test) batches of plants. In other words, we want our model to be generalizable. With a large number of features and a small number of plants, we would be in danger of overfitting. Ridge Regression penalizes large coefficients, which reduces the variance of the model so that it can better predict new batches of plants.**
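The shrinkage effect can be seen with the closed-form ridge solution on synthetic data that has far more features than samples. The sizes, the penalty `alpha = 10.0`, and all names below are assumptions for illustration; no intercept is fitted, to keep the algebra short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_plants, n_features = 30, 200   # few samples, many measurements

X_plants = rng.normal(size=(n_plants, n_features))
true_w = np.zeros(n_features)
true_w[:5] = 1.0                 # only five features matter by construction
y_plants = X_plants @ true_w + rng.normal(scale=0.5, size=n_plants)

# Ordinary least squares: with more features than plants this is the
# minimum-norm solution that fits the training data exactly
w_ols, *_ = np.linalg.lstsq(X_plants, y_plants, rcond=None)

# Ridge closed form: (X^T X + alpha * I)^(-1) X^T y
alpha = 10.0
w_ridge = np.linalg.solve(
    X_plants.T @ X_plants + alpha * np.eye(n_features),
    X_plants.T @ y_plants,
)

print("OLS coefficient norm:  ", np.linalg.norm(w_ols))
print("ridge coefficient norm:", np.linalg.norm(w_ridge))
```

The ridge coefficients have a smaller norm than the least-squares ones, which is the shrinkage that trades a little bias for lower variance.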