# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
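
As a minimal sketch of both hints, assuming a tiny made-up frame (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Toy frame with one continuous and one categorical column (values are made up).
toy = pd.DataFrame({
    "age": [20, 35, 50],
    "workclass": ["Private", "State-gov", "Private"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["workclass"])

# Rescale the continuous column to the [0, 1] range.
encoded["age"] = minmax_scale(encoded["age"].astype(float))

print(encoded)
```

After this step every column lies between 0 and 1, which keeps a wide-ranging feature from dominating purely because of its units.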

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
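
One possible way to load and validate the file is sketched below. It reads two sample rows from an in-memory string rather than from `adult.data`, the target name `income` is a label chosen here, and `skipinitialspace=True` is an optional choice that strips the blank after each comma (the cells further down keep that leading space instead).

```python
import io
import pandas as pd

# Column names as listed in adult.names; "income" is a label chosen here for the target.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Two rows in the adult.data layout; the second is an illustrative row with "?" values.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# skipinitialspace strips the blank after each comma, so values compare cleanly.
sample_df = pd.read_csv(sample, header=None, names=columns, skipinitialspace=True)

# Basic validation: expected width, no NaNs, and "?" kept as its own category.
assert sample_df.shape[1] == 15
assert sample_df.isna().sum().sum() == 0
print(sample_df["workclass"].value_counts())
```

Replacing `sample` with the path `'adult.data'` should apply the same checks to the full file.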

In [2]:
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
#https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
#https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

--2019-01-25 11:11:00--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu... 128.195.10.249
Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [text/plain]
Saving to: ‘adult.data’


2019-01-25 11:11:04 (1.11 MB/s) - ‘adult.data’ saved [3974305/3974305]



In [3]:
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names'

--2019-01-25 11:11:52--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
Resolving archive.ics.uci.edu... 128.195.10.249
Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5229 (5.1K) [text/plain]
Saving to: ‘adult.names’


2019-01-25 11:11:53 (193 MB/s) - ‘adult.names’ saved [5229/5229]



In [4]:
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

--2019-01-25 11:12:25--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
Resolving archive.ics.uci.edu... 128.195.10.249
Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2003153 (1.9M) [text/plain]
Saving to: ‘adult.test’


2019-01-25 11:12:31 (340 KB/s) - ‘adult.test’ saved [2003153/2003153]



In [1]:
!head adult.data

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >5

In [7]:
!cat adult.names

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a

In [154]:
# names = ['age',#: continuous.
# 'workclass',#: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
# 'fnlwgt',# : continuous.
# 'education',# : Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
# 'education-num',#: continuous.
# 'marital-status',#: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
# 'occupation',#: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
# 'relationship',#: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
# 'race',#: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
# 'sex',#: Female, Male.
# 'capital-gain',#: continuous.
# 'capital-loss',#: continuous.
# 'hours-per-week',#: continuous.
# 'native-country',
# #: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
# 'futureSalaryClass'         
#         ]
# print(names)


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variables) and `y` (target), and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
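
A minimal sketch of fitting and presenting coefficients follows. It uses synthetic stand-in data with hypothetical feature names (`hours`, `gain`), not the census data, and labels each coefficient with its column name so the sorted output is readable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features on very different scales (names are hypothetical).
n = 500
features = pd.DataFrame({
    "hours": rng.normal(40, 10, n),
    "gain": rng.normal(1000, 500, n),
})
# The label depends positively on hours and negatively on gain, plus noise.
signal = (features["hours"] - 40) / 10 - (features["gain"] - 1000) / 500
label = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

# Standardize so coefficients are comparable across features.
X = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, label)

# Label each coefficient with its feature name and sort from most negative to most positive.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)

# exp(coef) is the multiplicative change in odds per one standard deviation of the feature.
print(np.exp(coefs))
```

Reading the exponentiated values as odds ratios is often easier to explain than raw log-odds coefficients.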

In [155]:
# Standard library
import copy
import io
import math
import time
from datetime import datetime
from functools import reduce
# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import regex
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from mpl_toolkits.mplot3d import Axes3D
from numpy import argmax, argmin, array
from scipy import stats
from scipy.stats import mode
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, scale

In [202]:
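# Parse the attribute names from the description file instead of typing them by hand.
with open('adult.names') as f:
    read_data = f.read()
# Skip past the class-label line ('>50K, <=50K.') to the attribute list.
data = read_data[read_data.index('>50K, <=50K.')+13:]
# Each attribute line starts with "name:", so capture the text before the colon.
pattern = regex.compile(r"^([\w-]+(?=:))", regex.MULTILINE)
names = pattern.findall(data)
names.append('futureSalaryClass')
# assert names_ == names, 'mismatching names'
print(names)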

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'futureSalaryClass']


In [203]:
pd.set_option('display.max_columns', None)
df = pd.read_csv('adult.data',header=None, names=names, index_col=False)
df.head(10)
# short_df = train_df.sample(frac = 0.01, random_state=42)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,futureSalaryClass
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [204]:
print(f'|{df.sex.iloc[0]}||{df.sex.iloc[4]}|')
print(f'|{df.futureSalaryClass.iloc[0]}||{df.futureSalaryClass.iloc[7]}|')

| Male|| Female|
| <=50K|| >50K|


In [205]:
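# Note: 1 encodes ' <=50K' and 0 encodes ' >50K', so the model below predicts P(income <= 50K);
# a positive coefficient therefore means lower odds of earning above 50K.
# The raw values keep the leading space left by the ', ' separator in adult.data.
target = df.futureSalaryClass.map({' <=50K': 1, ' >50K': 0})
dff = df.drop(['sex', 'futureSalaryClass'], axis=1)
dff['female'] = df.sex.map({' Female': 1, ' Male': 0})
print(target[0:10], dff.female[0:10])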

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    0
Name: futureSalaryClass, dtype: int64 0    0
1    0
2    0
3    0
4    1
5    1
6    1
7    0
8    1
9    0
Name: female, dtype: int64


In [206]:
df_d = pd.get_dummies(dff)
print(list(df.columns))
df_d.head()

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'futureSalaryClass']


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,female,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,education_ 7th-8th,education_ 9th,education_ Assoc-acdm,education_ Assoc-voc,education_ Bachelors,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college,marital-status_ Divorced,marital-status_ Married-AF-spouse,marital-status_ Married-civ-spouse,marital-status_ Married-spouse-absent,marital-status_ Never-married,marital-status_ Separated,marital-status_ Widowed,occupation_ ?,occupation_ Adm-clerical,occupation_ Armed-Forces,occupation_ Craft-repair,occupation_ Exec-managerial,occupation_ Farming-fishing,occupation_ Handlers-cleaners,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,relationship_ Husband,relationship_ Not-in-family,relationship_ Other-relative,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,native-country_ ?,native-country_ Cambodia,native-country_ Canada,native-country_ China,native-country_ Columbia,native-country_ Cuba,native-country_ Dominican-Republic,native-country_ Ecuador,native-country_ El-Salvador,native-country_ England,native-country_ France,native-country_ Germany,native-country_ Greece,native-country_ Guatemala,native-country_ Haiti,native-country_ Holand-Netherlands,native-country_ Honduras,native-country_ Hong,native-country_ Hungary,native-country_ India,native-country_ Iran,native-country_ Ireland,native-country_ Italy,native-country_ Jamaica,native-country_ Japan,native-country_ Laos,native-country_ Mexico,native-country_ Nicaragua,native-country_ Outlying-US(Guam-USVI-etc),native-country_ Peru,native-country_ Philippines,native-country_ Poland,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [207]:
columns = df_d.columns
df_s = pd.DataFrame(scale(df_d), columns=columns)
df_s.head(10)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,female,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,education_ 7th-8th,education_ 9th,education_ Assoc-acdm,education_ Assoc-voc,education_ Bachelors,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college,marital-status_ Divorced,marital-status_ Married-AF-spouse,marital-status_ Married-civ-spouse,marital-status_ Married-spouse-absent,marital-status_ Never-married,marital-status_ Separated,marital-status_ Widowed,occupation_ ?,occupation_ Adm-clerical,occupation_ Armed-Forces,occupation_ Craft-repair,occupation_ Exec-managerial,occupation_ Farming-fishing,occupation_ Handlers-cleaners,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,relationship_ Husband,relationship_ Not-in-family,relationship_ Other-relative,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,native-country_ ?,native-country_ Cambodia,native-country_ Canada,native-country_ China,native-country_ Columbia,native-country_ Cuba,native-country_ Dominican-Republic,native-country_ Ecuador,native-country_ El-Salvador,native-country_ England,native-country_ France,native-country_ Germany,native-country_ Greece,native-country_ Guatemala,native-country_ Haiti,native-country_ Holand-Netherlands,native-country_ Honduras,native-country_ Hong,native-country_ Hungary,native-country_ India,native-country_ Iran,native-country_ Ireland,native-country_ Italy,native-country_ Jamaica,native-country_ Japan,native-country_ Laos,native-country_ Mexico,native-country_ Nicaragua,native-country_ Outlying-US(Guam-USVI-etc),native-country_ Peru,native-country_ Philippines,native-country_ Poland,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,-0.290936,4.9077,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,2.253993,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,-0.922841,-0.114037,1.431058,-0.180285,-0.177358,-0.244944,2.763489,-0.016628,-0.379495,-0.377746,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,1.708991,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,3.437186,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,2.253993,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,2.647285,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,1.211632,-0.585141,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,-0.443657,-0.113344,1.449397,-0.236374,-0.039607,-0.134196,-0.537144,2.515672,-0.026587,-0.922841,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,-0.377746,-0.17745,4.771494,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,1.708991,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,5.168316,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,-0.443657,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,-0.377746,-0.17745,4.771494,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,1.211632,-0.585141,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,3.069667,-0.091612,-2.421192,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,1.422331,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,2.253993,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,-0.377746,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,2.62011,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,-0.585141,-0.17625,-0.429346,-0.344032,4.445891,-0.098201,-0.181552,3.069667,-0.091612,-2.421192,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,18.48641,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,-2.932948,-0.045408,-0.022173
5,-0.115955,0.898201,1.523438,-0.14592,-0.21666,-0.035429,1.422331,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,-0.443657,-0.113344,-0.689942,4.230585,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,2.647285,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,-0.585141,-0.17625,-0.429346,-0.344032,4.445891,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
6,0.763796,-0.280358,-1.974858,-0.14592,-0.21666,-1.979184,1.422331,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,7.896091,-0.184064,-0.210534,-0.443657,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,-0.922841,8.769101,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,-0.377746,-0.17745,-0.209578,-0.255954,2.980259,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,1.708991,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,3.069667,-0.091612,-2.421192,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,20.024676,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,-2.932948,-0.045408,-0.022173
7,0.983734,0.188195,-0.42006,-0.14592,-0.21666,0.369519,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,3.437186,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,-0.443657,-0.113344,1.449397,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,2.647285,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,1.211632,-0.585141,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
8,-0.55583,-1.364279,1.523438,1.761142,-0.21666,0.774468,1.422331,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,-0.443657,-0.113344,-0.689942,4.230585,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,-0.922841,-0.114037,1.431058,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,-0.377746,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,2.62011,-0.142608,-0.355316,-0.171279,-0.227104,-0.825333,1.708991,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
9,0.250608,-0.28735,1.134739,0.555214,-0.21666,-0.035429,-0.703071,-0.24445,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,-0.184064,-0.210534,2.253993,-0.113344,-0.689942,-0.236374,-0.039607,-0.134196,-0.537144,-0.397508,-0.026587,1.083611,-0.114037,-0.698784,-0.180285,-0.177358,-0.244944,-0.361861,-0.016628,-0.379495,2.647285,-0.17745,-0.209578,-0.255954,-0.335541,-0.067802,-0.381663,-0.142608,-0.355316,-0.171279,-0.227104,1.211632,-0.585141,-0.17625,-0.429346,-0.344032,-0.224927,-0.098201,-0.181552,-0.325768,-0.091612,0.41302,-0.135023,-0.024163,-0.061073,-0.048049,-0.042606,-0.054094,-0.046416,-0.029337,-0.057149,-0.052647,-0.029857,-0.065002,-0.029857,-0.044378,-0.036785,-0.005542,-0.019985,-0.024791,-0.019985,-0.055503,-0.036364,-0.027159,-0.047402,-0.049938,-0.043678,-0.023518,-0.141934,-0.032331,-0.02074,-0.03087,-0.078218,-0.042966,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173


In [208]:
X = df_s
y = target
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, multi_class='multinomial')
log_reg.fit(X, y);

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

In [209]:
log_reg.score(X, y)

0.85332145818617366
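
To put that training accuracy in context, it may help to compare it with a majority-class baseline. `adult.names` reports 76.07% of labels as '<=50K' for the combined file, so always predicting '<=50K' would already score roughly that well. A sketch with synthetic labels in about that proportion, using scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class balance reported in adult.names (about 76% / 24%).
y = np.array([1] * 76 + [0] * 24)
X = np.zeros((len(y), 1))  # features are ignored by the baseline

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))
```

The gap between the baseline and the fitted model's score is a more honest measure of what the features contribute than the raw accuracy alone.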

In [210]:
log_reg.coef_.shape

(1, 107)

In [211]:
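# Pair each coefficient with its column name.
o = []
for x, c in zip(log_reg.coef_[0], columns):
    o.append({'v': x, 'c': c})

# Five largest coefficients (strongest push toward target == 1, i.e. <=50K).
o.sort(key=lambda d: d['v'], reverse=True)
print(o[0:5])
# Five smallest coefficients (strongest push toward target == 0, i.e. >50K).
o.sort(key=lambda d: d['v'], reverse=False)
print('____________________\n\n', o[0:5])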

[{'v': 0.26961684636841893, 'c': 'marital-status_ Never-married'}, {'v': 0.25583225136528415, 'c': 'education_ Preschool'}, {'v': 0.20247592405771053, 'c': 'female'}, {'v': 0.14862240756881262, 'c': 'relationship_ Own-child'}, {'v': 0.13880257261061765, 'c': 'occupation_ Priv-house-serv'}]
____________________

 [{'v': -1.1754533524083617, 'c': 'capital-gain'}, {'v': -0.38032509070701587, 'c': 'marital-status_ Married-civ-spouse'}, {'v': -0.18614694745722893, 'c': 'education-num'}, {'v': -0.18333048233734026, 'c': 'hours-per-week'}, {'v': -0.17397971631717934, 'c': 'age'}]


<s>Positive features: 'capital-gain', 'hours-per-week', 'marital-status' == 'Married-civ-spouse'
Negative features: 'occupation' == 'Tech-support', 'relationship' == 'Own-child', 'education' == '11th'

Having large capital gains and working long hours are good for future income, as is being married.
Working a crummy job, having a child to support, and not graduating high school are bad for future earnings.</s>  
  
  
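With all of the 'adult.data' rows, and possibly some change from using get_dummies:  
  
Because `target` encodes '<=50K' as 1, the model predicts the probability of income at or below 50K, so the coefficient signs printed above are reversed relative to the question: the most negative coefficients mark the features most positively associated with income above 50K.  
  
Positive features are 'capital-gain', 'marital-status' == 'Married-civ-spouse', and 'education-num'.  
Negative features are 'marital-status' == 'Never-married', 'education' == 'Preschool', and 'sex' == 'Female'.

Larger capital gains, being married, and more years of education are associated with higher odds of earning above 50K.  
Never having married, having no education past preschool, and being a woman are associated with lower odds.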
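
The effect of the label encoding on coefficient signs can be checked on a toy problem. This sketch uses synthetic data and default `LogisticRegression` settings, which are assumptions made here for illustration; it fits the same data twice with the 0/1 labels swapped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic data: the positive class becomes more likely as x grows.
x = rng.normal(size=(400, 1))
high = (x[:, 0] + rng.normal(0, 0.5, 400) > 0).astype(int)

# Fit once with "high" coded as 1, and once with the labels swapped.
coef_high = LogisticRegression().fit(x, high).coef_[0, 0]
coef_low = LogisticRegression().fit(x, 1 - high).coef_[0, 0]

print(coef_high, coef_low)  # same magnitude, opposite sign
```

Swapping the mapping in `target` would therefore flip every coefficient's sign without changing the ranking by magnitude.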

In [212]:
df['marital-status'].unique()

array([' Never-married', ' Married-civ-spouse', ' Divorced',
       ' Married-spouse-absent', ' Separated', ' Married-AF-spouse',
       ' Widowed'], dtype=object)

In [213]:
df.relationship.unique()

array([' Not-in-family', ' Husband', ' Wife', ' Own-child', ' Unmarried',
       ' Other-relative'], dtype=object)

In [214]:
# Check that every ' Preschool' row has education-num == 1 (an empty result confirms it).
print(df[(df.education == ' Preschool') & (df['education-num'] != 1)]
      ['education-num'][0:10])

Series([], Name: education-num, dtype: int64)


In [215]:
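# statsmodels does not add an intercept automatically.
X = sm.add_constant(X)

# OLS on a 0/1 target is a linear probability model, used here only as a rough
# check on coefficient significance; sm.Logit would be the logistic counterpart.
model = sm.OLS(list(target), X).fit()
predictions_test = model.predict(X)

print_model_test = model.summary()
print(print_model_test)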

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.365
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                     190.2
Date:                Sun, 27 Jan 2019   Prob (F-statistic):               0.00
Time:                        00:48:37   Log-Likelihood:                -11149.
No. Observations:               32561   AIC:                         2.250e+04
Df Residuals:                   32462   BIC:                         2.333e+04
Df Model:                          98                                         
Covariance Type:            nonrobust                                         
                                                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------

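**TODO - your answers!**
1. Quantile Regression: "at-risk" students are those in the bottom tier of grades, so modeling a low quantile of the grade distribution (rather than the mean) targets that group directly, and the quantile can be tuned to match the tier of interest.
2. Survival Analysis: the question is when an event (a product launch) is likely to happen, which is what time-to-event models are built for.
3. Ridge Regression: very detailed measurements on only a few dozen plants means many variables per observation, which leads to overfitting; the ridge penalty shrinks the coefficients to counter it.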
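
The ridge argument in answer 3 can be illustrated on made-up data. The sketch assumes 30 hypothetical plants with 20 measurements each, of which only the first actually affects yield, and an arbitrary penalty of `alpha=10.0`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)

# Hypothetical lab data: 30 plants, 20 measurements each, only the first matters.
n_plants, n_features = 30, 20
X = rng.normal(size=(n_plants, n_features))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, n_plants)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the coefficient vector relative to ordinary least squares.
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

In practice the penalty strength would be chosen by cross-validation rather than fixed in advance.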