<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a method to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
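
As a rough illustration of the two hints above, here is a minimal sketch on a made-up three-row frame (the values and the `toy` name are assumptions for illustration, not rows from the adult dataset). It one-hot encodes the categorical column and rescales the numeric ones to [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Toy frame (made-up values) with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "age": [25, 40, 55],
    "capital-gain": [0, 5000, 10000],
    "sex": ["Male", "Female", "Male"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, prefix_sep="-")

# Rescale each numeric column to the [0, 1] range so coefficients are comparable.
numeric_cols = ["age", "capital-gain"]
encoded[numeric_cols] = minmax_scale(encoded[numeric_cols])

print(encoded)
```

The dummy columns are already 0/1, so only the numeric columns need rescaling.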

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
#import lifelines
#cph = lifelines.CoxPHFitter()
import statsmodels.formula.api as smf
from sklearn.linear_model import Ridge

In [89]:
attributes = ['age', 'workclass', 'fnlwgt','education','education-num',
              'marital-status','occupation','relationship','race',
              'sex', 'capital-gain', 'capital-loss','hours-per-week',
              'native-country', 'target']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                header=None, names=attributes, index_col=False)
df.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [90]:
#took two and a half hours to think of this:
#strip the stray whitespace around every string value
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.target.unique()

array(['<=50K', '>50K'], dtype=object)
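
An alternative to stripping after the fact is to let the parser drop the blank after each comma. The sketch below assumes the raw file separates fields with a comma followed by a space, and uses two hand-typed rows in an in-memory buffer rather than the real download.

```python
import io
import pandas as pd

# Two hand-typed rows in the assumed layout: fields separated by ", ".
raw = "39, State-gov, <=50K\n50, Self-emp-not-inc, >50K\n"

# skipinitialspace=True drops the blank that follows each delimiter while parsing.
sample = pd.read_csv(io.StringIO(raw), header=None,
                     names=["age", "workclass", "target"],
                     skipinitialspace=True)

print(sample["target"].unique())
```

Passing the same argument to the `pd.read_csv` call above should make the `applymap` step unnecessary, though I have not re-run the notebook that way.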

In [91]:
df.shape

(32561, 15)

In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
target            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [93]:
dummys = pd.get_dummies(df, prefix_sep='-')
pd.set_option('display.max_columns', None)
dummys.columns

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass-?', 'workclass-Federal-gov',
       'workclass-Local-gov', 'workclass-Never-worked',
       ...
       'native-country-Scotland', 'native-country-South',
       'native-country-Taiwan', 'native-country-Thailand',
       'native-country-Trinadad&Tobago', 'native-country-United-States',
       'native-country-Vietnam', 'native-country-Yugoslavia', 'target-<=50K',
       'target->50K'],
      dtype='object', length=110)

In [94]:
dummys.filter(regex=r'\?').head(1)

Unnamed: 0,workclass-?,occupation-?,native-country-?
0,0,0,0


In [0]:
#cleaning missing values: drop rows marked '?' in any categorical column,
#then drop the now-empty '?' dummy columns
dummys = dummys.loc[dummys['workclass-?'] == 0]
dummys = dummys.loc[dummys['occupation-?'] == 0]
dummys = dummys.loc[dummys['native-country-?'] == 0]
dummys = dummys.drop(columns=['workclass-?','occupation-?','native-country-?'])

In [96]:
y = dummys['target->50K']
y.unique()

array([0, 1], dtype=uint8)

In [0]:
X = dummys.drop(columns=['target->50K','target-<=50K'])

In [98]:
X.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass-Federal-gov,workclass-Local-gov,workclass-Never-worked,workclass-Private,workclass-Self-emp-inc,workclass-Self-emp-not-inc,workclass-State-gov,workclass-Without-pay,education-10th,education-11th,education-12th,education-1st-4th,education-5th-6th,education-7th-8th,education-9th,education-Assoc-acdm,education-Assoc-voc,education-Bachelors,education-Doctorate,education-HS-grad,education-Masters,education-Preschool,education-Prof-school,education-Some-college,marital-status-Divorced,marital-status-Married-AF-spouse,marital-status-Married-civ-spouse,marital-status-Married-spouse-absent,marital-status-Never-married,marital-status-Separated,marital-status-Widowed,occupation-Adm-clerical,occupation-Armed-Forces,occupation-Craft-repair,occupation-Exec-managerial,occupation-Farming-fishing,occupation-Handlers-cleaners,occupation-Machine-op-inspct,occupation-Other-service,occupation-Priv-house-serv,occupation-Prof-specialty,occupation-Protective-serv,occupation-Sales,occupation-Tech-support,occupation-Transport-moving,relationship-Husband,relationship-Not-in-family,relationship-Other-relative,relationship-Own-child,relationship-Unmarried,relationship-Wife,race-Amer-Indian-Eskimo,race-Asian-Pac-Islander,race-Black,race-Other,race-White,sex-Female,sex-Male,native-country-Cambodia,native-country-Canada,native-country-China,native-country-Columbia,native-country-Cuba,native-country-Dominican-Republic,native-country-Ecuador,native-country-El-Salvador,native-country-England,native-country-France,native-country-Germany,native-country-Greece,native-country-Guatemala,native-country-Haiti,native-country-Holand-Netherlands,native-country-Honduras,native-country-Hong,native-country-Hungary,native-country-India,native-country-Iran,native-country-Ireland,native-country-Italy,native-country-Jamaica,native-country-Japan,native-country-Laos,native-country-Mexico,native-country-Nicaragua,native-country-Outlying-US(Guam-USVI-etc),native-country-Peru,native-country-Philippines,native-country-Poland,native-country-Portugal,native-country-Puerto-Rico,native-country-Scotland,native-country-South,native-country-Taiwan,native-country-Thailand,native-country-Trinadad&Tobago,native-country-United-States,native-country-Vietnam,native-country-Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
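
To show what the scaling hint buys, here is a minimal sketch on synthetic data (the `hours` and `gain` columns and the way the target is generated are assumptions for illustration, not the adult dataset). Both features are put on [0, 1] before fitting, and each coefficient is then paired with its column name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: two numeric features on very different scales,
# plus a binary target driven mostly by the first one.
demo = pd.DataFrame({
    "hours": rng.uniform(10, 60, n),
    "gain": rng.uniform(0, 50000, n),
})
logit = 0.1 * (demo["hours"] - 35) + 0.00002 * (demo["gain"] - 25000)
target = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Put every feature on [0, 1] before fitting so coefficient sizes are comparable.
scaled = pd.DataFrame(minmax_scale(demo), columns=demo.columns)
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(scaled, target)

# Pair each coefficient with its column name and sort by magnitude.
coefs = pd.Series(clf.coef_[0], index=scaled.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

On the raw columns the `gain` coefficient would be tiny simply because its units are large. After scaling, each coefficient describes the change in log-odds across the feature's full observed range.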

In [99]:
model = LogisticRegression(random_state=42, solver = 'lbfgs',
                           multi_class = 'multinomial', max_iter=15000)

log_reg = model.fit(X, y)
log_reg.score(X,y) #mean accuracy on the training data; not bad...

0.7908958291890458

In [100]:
log_reg.coef_.shape

(1, 105)

In [101]:
log_reg.coef_

array([[-3.41351717e-03, -1.78229146e-06, -8.14665720e-04,
         1.68906111e-04,  3.88896054e-04, -4.00002373e-03,
         6.89009193e-06, -2.59900651e-06,  0.00000000e+00,
        -1.78506925e-04,  1.91331509e-05, -1.26122072e-05,
        -5.25768509e-06, -4.43092558e-07, -1.72124850e-05,
        -2.39327395e-05, -7.75845503e-06, -2.88391053e-06,
        -5.81910655e-06, -1.29599564e-05, -9.77581701e-06,
        -4.39768136e-06, -6.78511949e-06,  4.87762990e-05,
         1.39118575e-05, -1.28908770e-04,  3.35942687e-05,
        -1.15177936e-06,  1.69621340e-05, -6.50544126e-05,
        -8.45106598e-05,  4.18759536e-07,  1.88515173e-04,
        -7.64447115e-06, -2.32306738e-04, -1.96168665e-05,
        -1.82508701e-05, -5.87559331e-05, -1.24241676e-07,
        -2.89447295e-05,  5.84566377e-05, -2.12639959e-05,
        -2.86780442e-05, -3.09689047e-05, -7.86778404e-05,
        -3.84005792e-06,  4.40584506e-05,  3.07449353e-06,
        -1.33331689e-05,  6.16063813e-07, -1.50144022e-0

In [102]:
#nice but hard to interpret
#to get an idea of which features are most useful to continue with,
#I'm running one logistic regression per original column
#ex. fitting on just the columns whose names start with 'race'
#and printing their mean coef (x100) and training accuracy
#note: '^education' also matches 'education-num'
df1 = df.drop(columns = ['target'])
for name in df1.columns:
  feature = X.filter(regex='^'+str(name))
  log_reg = model.fit(feature, y)
  score = log_reg.score(feature,y)
  meancoef = np.mean(log_reg.coef_)
  print(name, meancoef *100, score)

age 2.1172457863082226 0.7374510974073337
workclass -0.0002181039411443264 0.7552549565678669
fnlwgt -0.00024235956454449225 0.7510775147536636
education 0.38167893025059046 0.7730919700285127
education-num 18.10160911145732 0.7730919700285127
marital-status 0.0034676631890666132 0.7510775147536636
occupation 0.13587009351047935 0.7510775147536636
relationship -0.01657974220649061 0.7510775147536636
race -0.005488815893939236 0.7510775147536636
sex 0.0019049085562383716 0.7510775147536636
capital-gain 0.016733326642205646 0.791227372190173
capital-loss 0.03533282181370327 0.7610901133877064
hours-per-week 2.3463052350035856 0.7438498773290896
native-country -0.6527937531175002 0.7510775147536636
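
One thing to watch in the per-column loop is that `DataFrame.filter(regex=...)` does a regex search on column names, so a prefix can match more columns than intended. This may be why `education` and `education-num` report identical scores. The sketch below uses a small made-up frame with column names in the same style; the lookahead pattern is one possible fix, not the only one.

```python
import pandas as pd

# A few column names in the same style as the one-hot encoded frame above.
cols = ["age", "education-num", "education-Bachelors", "education-HS-grad",
        "occupation-Exec-managerial"]
frame = pd.DataFrame([[0] * len(cols)], columns=cols)

# '^education' also picks up the numeric 'education-num' column.
with_num = list(frame.filter(regex="^education").columns)
print(with_num)  # ['education-num', 'education-Bachelors', 'education-HS-grad']

# A negative lookahead keeps only the one-hot 'education-*' dummies.
dummies_only = list(frame.filter(regex=r"^education-(?!num$)").columns)
print(dummies_only)  # ['education-Bachelors', 'education-HS-grad']

# An unanchored 'age' matches inside 'occupation-Exec-managerial' as well.
age_like = list(frame.filter(regex="age").columns)
print(age_like)  # ['age', 'occupation-Exec-managerial']
```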


In [0]:
#new X: columns whose printed mean coef is above 1 in absolute value, or with a high score
#my target: 85%
X2 = X.filter(regex='age|education-num|capital-gain|capital-loss|hours-per-week|occupation')

In [104]:
X2.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,occupation-Adm-clerical,occupation-Armed-Forces,occupation-Craft-repair,occupation-Exec-managerial,occupation-Farming-fishing,occupation-Handlers-cleaners,occupation-Machine-op-inspct,occupation-Other-service,occupation-Priv-house-serv,occupation-Prof-specialty,occupation-Protective-serv,occupation-Sales,occupation-Tech-support,occupation-Transport-moving
0,39,13,2174,0,40,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,50,13,0,0,13,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,38,9,0,0,40,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,53,7,0,0,40,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,28,13,0,0,40,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [105]:
X2.shape

(30162, 19)

In [108]:
log_reg = model.fit(X2, y)
log_reg.score(X2,y) #improving

0.8147669252702076

In [109]:
#remove capital loss?
X3 = X.filter(regex='age|education-num|capital-gain|hours-per-week|occupation')
log_reg = model.fit(X3, y)
log_reg.score(X3,y) #doing worse

0.810987335057357

In [113]:
#remove occupation?
X4 = X.filter(regex='^age|^education-num|^capital-gain|^capital-loss|^hours-per-week')
log_reg = model.fit(X4, y)
log_reg.score(X4,y) #doing much worse

0.808931768450368

In [114]:
#remove age?
X5 = X.filter(regex='education-num|capital-gain|capital-loss|hours-per-week|occupation')
log_reg = model.fit(X5, y)
log_reg.score(X5, y) #slightly worse than X2 (0.8121 vs 0.8148)

0.8121145812611896

In [115]:
#add education
X6 = X.filter(regex='education|education-num|capital-gain|capital-loss|hours-per-week|occupation')
log_reg = model.fit(X6, y)
log_reg.score(X6, y) #improved from X5

0.8133412903653604

In [116]:
#add workclass
#lowers from X6
X7 = X.filter(regex='workclass|education|education-num|capital-gain|capital-loss|hours-per-week|occupation')
log_reg = model.fit(X7, y)
log_reg.score(X7, y)

0.8123466613619786

In [117]:
#add native country (to X6)
#doesn't help: marginally below X6 (0.8129 vs 0.8133)
X8 = X.filter(regex='native-country|education|education-num|capital-gain|capital-loss|hours-per-week|occupation')
log_reg = model.fit(X8, y)
log_reg.score(X8, y)

0.8129434387640077

In [120]:
#add race
#helps only marginally (0.81301 vs 0.81294 for X8)
#ten regressions is about as many as I'm willing to run
X9 = X.filter(regex='race|native-country|education|education-num|capital-gain|capital-loss|hours-per-week|occupation')
log_reg = model.fit(X9, y)
log_reg.score(X9, y)

0.8130097473642331

In [123]:
X9.head()

Unnamed: 0,education-num,capital-gain,capital-loss,hours-per-week,education-10th,education-11th,education-12th,education-1st-4th,education-5th-6th,education-7th-8th,education-9th,education-Assoc-acdm,education-Assoc-voc,education-Bachelors,education-Doctorate,education-HS-grad,education-Masters,education-Preschool,education-Prof-school,education-Some-college,occupation-Adm-clerical,occupation-Armed-Forces,occupation-Craft-repair,occupation-Exec-managerial,occupation-Farming-fishing,occupation-Handlers-cleaners,occupation-Machine-op-inspct,occupation-Other-service,occupation-Priv-house-serv,occupation-Prof-specialty,occupation-Protective-serv,occupation-Sales,occupation-Tech-support,occupation-Transport-moving,race-Amer-Indian-Eskimo,race-Asian-Pac-Islander,race-Black,race-Other,race-White,native-country-Cambodia,native-country-Canada,native-country-China,native-country-Columbia,native-country-Cuba,native-country-Dominican-Republic,native-country-Ecuador,native-country-El-Salvador,native-country-England,native-country-France,native-country-Germany,native-country-Greece,native-country-Guatemala,native-country-Haiti,native-country-Holand-Netherlands,native-country-Honduras,native-country-Hong,native-country-Hungary,native-country-India,native-country-Iran,native-country-Ireland,native-country-Italy,native-country-Jamaica,native-country-Japan,native-country-Laos,native-country-Mexico,native-country-Nicaragua,native-country-Outlying-US(Guam-USVI-etc),native-country-Peru,native-country-Philippines,native-country-Poland,native-country-Portugal,native-country-Puerto-Rico,native-country-Scotland,native-country-South,native-country-Taiwan,native-country-Thailand,native-country-Trinadad&Tobago,native-country-United-States,native-country-Vietnam,native-country-Yugoslavia
0,13,2174,0,40,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,13,0,0,13,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,9,0,0,40,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,7,0,0,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,13,0,0,40,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [130]:
#X9's coefs by name of column (see above)
for name, coef in zip(X9.columns, log_reg.coef_[0]):
  print(name, coef)

education-num 0.05524016104652109
capital-gain 0.00016619168948350699
capital-loss 0.0003473243202653044
hours-per-week 0.016148494735552365
education-10th -0.3797947530520055
education-11th -0.5163986917423092
education-12th -0.21152424692096944
education-1st-4th -0.13073011847298344
education-5th-6th -0.2316507418242949
education-7th-8th -0.35403308469915024
education-9th -0.27418423004306763
education-Assoc-acdm -0.13119132200485267
education-Assoc-voc -0.09608201032857697
education-Bachelors 0.08840518251272127
education-Doctorate 0.36250998679852575
education-HS-grad -0.1322632893360731
education-Masters 0.21689839064263944
education-Preschool -0.051318321743917816
education-Prof-school 0.44202544856410986
education-Some-college -0.13590410062561495
occupation-Adm-clerical -0.24588026380900296
occupation-Armed-Forces -0.005753610048399427
occupation-Craft-repair 0.07988599419331012
occupation-Exec-managerial 0.3270139612886128
occupation-Farming-fishing -0.42409092740998744
occupa

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?

2. What are 3 features negatively correlated with income above 50k?

3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

**What are 3 features positively correlated with income above 50k?**

Education level, having an executive-managerial profession and working in security have some of the largest positive coefficients in the X9 coefficient table.

**What are 3 features negatively correlated with income above 50k?**

Having a farming or fishing profession, having a cleaning profession and having Mexico as native country have some of the largest negative coefficients. The largest coefficient in either direction is the negative one for 'Other-service', but it's hard to say what that category means.


**Overall, how well does the model explain the data and what insights do you derive from it?**

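The score reported here is accuracy, the share of rows the model classifies correctly on the data it was trained on, rather than a percent of data explained. With no changes the model scores 79.1%, and the feature subsets I tried reached 81.3% (X9) to 81.5% (X2). That looks high next to earlier models in the unit, where 20-40% counted as good, but the comparison is loose: several single-feature models above already score 75.1%, so the real gain is a few percentage points. Education level, occupation type and capital gains and losses are quite influential on the score, while country of origin has an extremely minor effect and, oddly, 'workclass' lowers the score (it is less precise than occupation). Age helps only a little (dropping it lowered the score from 81.5% to 81.2%), possibly because its effect is not linear: we would expect income to be highest in middle age and age to become a negative indicator after retirement.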
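
Several single-feature models above score 0.7511, which looks like what a model gets by predicting `<=50K` for every row. A minimal sketch of that kind of baseline check follows, using synthetic labels with a 75/25 split (an assumption for illustration, not the real target column) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with a 75/25 class balance (not the real data).
labels = np.array([0] * 75 + [1] * 25)
features = np.zeros((len(labels), 1))  # the baseline ignores the features

# Always predict the most frequent class and report plain accuracy.
baseline = DummyClassifier(strategy="most_frequent").fit(features, labels)
baseline_accuracy = baseline.score(features, labels)
print(baseline_accuracy)  # 0.75
```

Comparing a model's accuracy against this number shows how much of the score comes from the features rather than from the class balance.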

**You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.**
Quantile regression. It models a chosen quantile of the outcome instead of the mean, so fitting a low quantile (such as the 10th percentile of grades) shows how each feature relates to the bottom tier specifically. Features with negative coefficients at that quantile can be read as risk factors.

**You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.**
Survival analysis. We can define a product launch as the 'death' event and model how long a company 'survives' before its next launch, including companies that have not launched yet (censored observations).
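
To make the idea concrete, here is a hand-rolled Kaplan-Meier estimate in NumPy on made-up durations (the numbers and the `kaplan_meier` helper are hypothetical). A library such as `lifelines`, which is commented out in the imports above, would normally do this.

```python
import numpy as np

# Hypothetical data: months between releases for 8 products. observed=0 means
# no launch had happened yet when the data was collected (right-censored).
durations = np.array([3, 5, 5, 8, 10, 12, 12, 15])
observed = np.array([1, 1, 0, 1, 1, 0, 1, 1])

def kaplan_meier(durations, observed):
    """Return event times and the estimated probability of 'no launch yet'."""
    times = np.unique(durations[observed == 1])
    survival, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                     # still waiting at t
        events = np.sum((durations == t) & (observed == 1))  # launches at t
        s *= 1 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

times, survival = kaplan_meier(durations, observed)
for t, s in zip(times, survival):
    print(t, round(s, 3))
```

Censored rows still count in the at-risk totals until their last observed month, which is what separates this from simply averaging the observed launch times.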

**You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.**
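Ridge regression. With very detailed measurements (many features) but only a few dozen plants (few observations), an ordinary regression would overfit; the ridge penalty shrinks the coefficients, and its strength can be tuned to limit that overfitting.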
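
A minimal sketch of that trade-off on synthetic data (30 training rows, 25 features, only three of which matter; all of these numbers and the `alpha=10.0` setting are assumptions for illustration). It fits ordinary least squares and ridge on the same rows and compares them on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in: 30 training plants, 25 measured features each, but only
# the first three features actually influence yield.
n_train, n_test, n_features = 30, 200, 25
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.0, 0.5]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_coef + rng.normal(scale=1.0, size=n_train)
y_test = X_test @ true_coef + rng.normal(scale=1.0, size=n_test)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # alpha sets the L2 penalty strength

# Compare out-of-sample R^2 and coefficient size; the penalty shrinks the coefficients.
print("OLS  ", ols.score(X_test, y_test), np.linalg.norm(ols.coef_))
print("Ridge", ridge.score(X_test, y_test), np.linalg.norm(ridge.coef_))
```

With nearly as many features as rows, the unpenalized fit tends to chase noise. In practice `alpha` would be chosen by cross-validation rather than fixed by hand.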