<a href="https://colab.research.google.com/github/quinn-dougherty/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/2_3_SPRINTCHALLENGE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

It has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a method to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [205]:
'''
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
'''
FEATURE_names = ['age', 'workclass', 'fnlwgt','education', 'education_num', 
                  'marital_status', 'occupation', 'relationship', 'race', 'sex', 
                  'capital_gain', 'capital_loss', 'hours_per_week', 
                  'native_country', 'TARGET']

dependent = FEATURE_names[-1]

train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from numpy.testing import assert_almost_equal
from heapq import nlargest, nsmallest
from functools import reduce

df_ = pd.read_csv(train_url, header=None, names=FEATURE_names, skipinitialspace=True)

assert all([x==0 for x in df_.isna().sum().values])

side_hustle = {'?': 'Side_hustle'}
df_.workclass = df_.workclass.replace(side_hustle)

# edu_replace = {'Preschool': 0, '1st-4th': 1, '5th-6th': 2, '7th-8th': 3, 
#                '9th': 4, '10th': 5, '11th': 6, '12th': 7, 'HS-grad': 8, 
#                'Some-college': 9, 'Assoc-voc': 10, 'Assoc-acdm': 10, 
#                'Prof-school': 11, 'Bachelors': 13, 'Masters': 14, 
#                'Doctorate': 15} # THIS WAS A WASTE OF TIME because education_num does this for me.
# assert len(edu_replace.keys())==len(df_.education.value_counts().index)

df_.occupation = df_.occupation.replace(side_hustle)

df_.native_country = df_.native_country.replace({'?': 'Unknown'})#.apply(lambda s: s + '_ORIGIN')

# for capital_gain and capital_loss, i mostly care about the difference between 0 and 0+. 
# i'll pronounce this. 
df_['log_capital_gain'] = np.log(df_.capital_gain + 1)
df_['log_capital_loss'] = np.log(df_.capital_loss + 1)

df_[dependent] = df_[dependent].replace({'<=50K': 0, '>50K': 1})

to_onehot = ['workclass', 'marital_status', 'occupation', 'relationship', 
             'race', 'sex', 'native_country']
# drop after onehotting. 

to_drop = ['education', 'capital_gain', 'capital_loss'] + to_onehot

df_.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,TARGET,log_capital_gain,log_capital_loss
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0,7.684784,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0,0.0,0.0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0,0.0,0.0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0,0.0,0.0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0,0.0,0.0


In [0]:
df_test_ = pd.read_csv(test_url, header=None, names=FEATURE_names, skiprows=1, skipinitialspace=True)

#df_test_.head()

In [207]:

to_standardize = df_.drop(to_drop+['log_capital_'+x for x in ['gain', 'loss']]+[dependent], axis=1).columns

df = pd.concat([df_.drop(to_drop, axis=1), pd.get_dummies(df_[to_onehot])], axis=1, join='outer')

df[to_standardize] = StandardScaler().fit_transform(df[to_standardize])
# we don't want to standardize log capital gain/loss! 

f = lambda s: s.replace('-', '_')
df.columns = [f(x) for x in df.columns]

assert df.shape[0]==df_.shape[0]
#assert df_.shape[1]+pd.get_dummies(df_[to_onehot]).shape[1]==df.shape[1] # There's an extra `+3` here because of the dropped columns
#df.shape[1], df_.shape[1]+pd.get_dummies(df_[to_onehot]).shape[1]

df.head()

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


Unnamed: 0,age,fnlwgt,education_num,hours_per_week,TARGET,log_capital_gain,log_capital_loss,workclass_Federal_gov,workclass_Local_gov,workclass_Never_worked,...,native_country_Puerto_Rico,native_country_Scotland,native_country_South,native_country_Taiwan,native_country_Thailand,native_country_Trinadad&Tobago,native_country_United_States,native_country_Unknown,native_country_Vietnam,native_country_Yugoslavia
0,0.030671,-1.063611,1.134739,-0.035429,0,7.684784,0.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0.837109,-1.008707,1.134739,-2.222153,0,0.0,0.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,-0.042642,0.245079,-0.42006,-0.035429,0,0.0,0.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,1.057047,0.425801,-1.197459,-0.035429,0,0.0,0.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,-0.775768,1.408176,1.134739,-0.035429,0,0.0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Index(['Private', 'Self-emp-not-inc', 'Local-gov', 'Side_hustle', 'State-gov',
       'Self-emp-inc', 'Federal-gov', 'Without-pay', 'Never-worked'],
      dtype='object')

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretible a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is your exploration to be able to answer the question, not any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [234]:
X = df.drop(dependent, axis=1)
y = df[dependent]

X_train, X_test, y_train, y_test = train_test_split(X, y)

m = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train,y_train)

def REPORT_(logi_regr_model, n=5): 
  
  avg_correct = logi_regr_model.score(X_test, y_test)
  
  report_dict = dict(zip(X.columns, 
                         logi_regr_model.coef_[0]))

  n_highest = dict(zip(nlargest(n, report_dict), 
                          nlargest(n, report_dict.values())))

  n_smallest = dict(zip(nsmallest(n, report_dict), 
                        nsmallest(n, report_dict.values())))
  
  mean_coef = np.divide(sum(report_dict.values()), len(report_dict.values()))
  
  s0 = f'The rate of prediction correctness on our testing data is {avg_correct:.3}\n\n'
  
  s1 = f'The {n} strongest predictors of annual income >50k are {list(n_highest.keys())}. \n'
  s11 = 'Each one of these is from the workclass feature, onehot\'d\n\n' 
  s2 = f'The {n} strongest predictors of annual income <=50k are {list(n_smallest.keys())}. \n'
  
  s21 = 'education_num is education level, ordinally represented. ' +\
        'Age says "every year you get older, you\'re less likely to earn above 50k"'
  
  s3 = '\n\nthe coefficients in general are not very drastic. ' + \
       'This could be because of standardization I performed, continuous ' + \
       'features were centered at origin and scaled to unit variance, besides '+ \
       'capital gains/losses of which I took the logarithm. ' + \
       'ultimately, the avg coefficient was {:.3}. '.format(mean_coef)
  
  return reduce(lambda s, t: s+t, [s0, s1, s11, s2, s21, s3])

print(REPORT_(m, 3))

The rate of prediction correctness on our testing data is 0.844

The 3 strongest predictors of annual income >50k are ['workclass_Without_pay', 'workclass_State_gov', 'workclass_Side_hustle']. 
Each one of these is from the workclass feature, onehot'd

The 3 strongest predictors of annual income <=50k are ['age', 'education_num', 'fnlwgt']. 
education_num is education level, ordinally represented. Age says "every year you get older, you're less likely to earn above 50k"

the coefficients in general are not very drastic. This could be because of standardization I performed, continuous features were centered at origin and scaled to unit variance, besides capital gains/losses of which I took the logarithm. ultimately, the avg coefficient was 0.0203. 


In [225]:
df_.workclass.value_counts().index

Index(['Private', 'Self-emp-not-inc', 'Local-gov', 'Side_hustle', 'State-gov',
       'Self-emp-inc', 'Federal-gov', 'Without-pay', 'Never-worked'],
      dtype='object')

# Three features positively correlated with income above 50k: 
When I gave `workclass` feature to `pd.get_dummies`, it made 9 features: `'Private', 'Self-emp-not-inc', 'Local-gov', 'Side_hustle', 'State-gov', 'Self-emp-inc', 'Federal-gov', 'Without-pay', 'Never-worked'`

the 3 best performing features are all _one-hot encodings_ of the `workclass` feature
- Without pay
- State gov
- "Side hustle", which in the data was originally "?". I made a gut move to consider "?" as stuff like uber, not exactly employment but not exactly unemployment either. This could have been totally wrong, and maybe "?" would have been better. All the docs tell us about this is that they `Convert Unknown to "?"`, meaning they **genuinely don't know**. 

##My choice to say `side hustle` instead of `?` is **not the point**,  the point is rather that _one of the strongest predictors is something the dataset knows nothing about_.

# Three features _negatively_ correlated with income above 50k. In other words, three features _positively_ correlated with income below 50k. 

- Age, which says "every year you get older, you\'re less likely to earn above 50k"
- `fnlwgt`: This is a creation of our fine curators, Ron Kohavi and Barry Becker of Silcon Graphics. In the [docs](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names), they write "_The weights on the files are controlled to independent estimates of the civilian noninstitutional population of the US.  These are prepared monthly for us by Population Division here at the Census Bureau. ...  The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. ... People with similar demographic characteristics should have similar weights. _"
- education level, represented ordinally. 

# Overall, I feel like we've learned something meaningful about income in the united states. 
- Coefficient magnitudes are moderate, meaning nothing has a _suspiciously extreme_ impact on the prediction.
- Some coefficients are negligible, and could probably be factored out in a more parsimonious model. 
- If I had done an ordinal interpretation of the `workclass` feature rather than onehot, the impact would have been _huge_. 



# Part 4: Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

## 1. We need some way to represent `at_risk` in `{0,1}`. Ultimately, we want to do _survival analysis_


## 2. We'd use the _censorship_ idea in preprocessing because we're interested in _timeline_. Ultimately, we want _quantile regression_ because median time-to-launch is more useful than mean time-to-launch. 


## 3. Plant size and yield are both _continuous, quantitative_ dependent variables. We want _ridge regression_ here so that our coefficients are _regularized_ between weird sample sizes and a sub-ideal number of relevant covariates. 