# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
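
As a rough illustration of the two hints above, here is a minimal sketch on a tiny made-up frame (the `toy` data and its three columns are assumptions for demonstration, not the real Adult data). It one-hot encodes one categorical column and rescales the numeric columns to the [0, 1] range.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Hypothetical toy frame, not the real Adult data
toy = pd.DataFrame({
    "age": [25.0, 40.0, 60.0],
    "hours_per_week": [20.0, 40.0, 60.0],
    "workclass": ["Private", "State-gov", "Private"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)

# Rescale each numeric column independently to the [0, 1] range
numeric_cols = ["age", "hours_per_week"]
encoded[numeric_cols] = minmax_scale(encoded[numeric_cols])

print(encoded)
```

Only the numeric columns are scaled here; the dummy columns are already 0/1, so they share the same range afterwards.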

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
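
One possible way to handle the "unknown" idea at load time is sketched below. It assumes the file uses `?` as its missing-value placeholder and a space after each comma; the three sample rows and the reduced column list are made up for illustration.

```python
import io
import pandas as pd

# Three hypothetical rows in a comma-plus-space layout (not real records)
raw = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
    "28, Private, ?, <=50K\n"
)
cols = ["age", "workclass", "education", "target"]

sample = pd.read_csv(
    raw,
    header=None,
    names=cols,
    skipinitialspace=True,  # drop the space after each comma
    na_values="?",          # treat the '?' placeholder as missing
)

# Keep missing categorical values as their own category rather than dropping rows
cat_cols = ["workclass", "education"]
sample[cat_cols] = sample[cat_cols].fillna("unknown")

# Map the label to 0/1
sample["target"] = (sample["target"] == ">50K").astype(int)
print(sample)
```

Treating missing values as a category keeps every row and lets the model estimate a separate coefficient for "unknown", at the cost of one extra dummy column per affected feature.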

In [43]:
# Let's import the tools we might need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Hodgepodge of scikit-learn imports for all occasions
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error, r2_score

In [44]:
#column names are incorrect or messy, let's grab the right ones from the meta-data
col_names = ['age','workclass','fnlwgt','education','education_num','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','target']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                header=None, names=col_names)
df.head()
# we have data, with the correct column names.

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [45]:
# Some general info about the dataset

# Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
# Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)


# Prediction task is to determine whether a person makes over 50K a year.

In [46]:
#Now for some exploration and cleaning
df.shape

(32561, 15)

Looks good given that we're just using the train data
48842 instances, mix of continuous and discrete    ( *_train=32561_* , test=16281)

In [47]:
# Let's see what we have for missing values
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
target            0
dtype: int64

In [48]:
# After some digging it seems that our missing values are a ? with a space in front of it. 
# Let's draw them into the light.
df = df.replace(' ?', np.nan)
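
A possible alternative, sketched on a made-up two-row frame (the `frame` name and its values are assumptions), is to strip the leading whitespace from the text columns first. The placeholder then becomes a plain `?`, and the category labels no longer carry a leading space into later steps such as one-hot encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking values read with a leading space
frame = pd.DataFrame({
    "age": [39, 54],
    "workclass": [" State-gov", " ?"],
    "occupation": [" ?", " Sales"],
})

# Strip surrounding whitespace from the text columns, then mark '?' as missing
text_cols = ["workclass", "occupation"]
frame[text_cols] = frame[text_cols].apply(lambda s: s.str.strip())
frame = frame.replace("?", np.nan)

print(frame.isnull().sum())
```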

In [49]:
#Okay, we have three columns with missing values at this point
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
target               0
dtype: int64

In [50]:
# I'm going to ffill to spread the unknowns across the existing distribution
# I could just group them with Private, but 1800 values isn't trivial
df['workclass'] = df['workclass'].ffill()
print('Work Class Null Values :', df.workclass.isnull().sum())
df.workclass.value_counts()

Work Class Null Values : 0


 Private             24094
 Self-emp-not-inc     2688
 Local-gov            2204
 State-gov            1374
 Self-emp-inc         1177
 Federal-gov          1002
 Without-pay            15
 Never-worked            7
Name: workclass, dtype: int64

In [51]:
# There is already an 'Other-service' category, but adding this many misallocated values would skew its count heavily.
# I'll opt for a ffill again to distribute them across the categories
df['occupation'] = df['occupation'].ffill()
print('Occupation Null Values :', df.occupation.isnull().sum())
df.occupation.value_counts()

Occupation Null Values : 0


 Prof-specialty       4386
 Craft-repair         4364
 Exec-managerial      4317
 Adm-clerical         3982
 Sales                3863
 Other-service        3470
 Machine-op-inspct    2134
 Transport-moving     1703
 Handlers-cleaners    1471
 Farming-fishing      1038
 Tech-support          981
 Protective-serv       683
 Priv-house-serv       159
 Armed-Forces           10
Name: occupation, dtype: int64

In [52]:
df.native_country.value_counts()

 United-States                 29170
 Mexico                          643
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 France                           29
 

In [53]:
df['native_country'] = df['native_country'].ffill()
print('Native Country Null Values :', df.native_country.isnull().sum())
df.native_country.value_counts()

Native Country Null Values : 0


 United-States                 29694
 Mexico                          657
 Philippines                     200
 Germany                         141
 Canada                          124
 Puerto-Rico                     118
 El-Salvador                     109
 India                           101
 Cuba                             97
 England                          93
 Jamaica                          83
 South                            80
 China                            77
 Dominican-Republic               74
 Italy                            73
 Vietnam                          72
 Guatemala                        66
 Japan                            63
 Columbia                         61
 Poland                           60
 Taiwan                           51
 Haiti                            45
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 Greece                           30
 

In [54]:
pd.set_option('display.max_columns', None)
df.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [55]:
# It looks like our data is in pretty good shape
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
target            object
dtype: object

In [59]:
# We have some ints which are fine the way they 
# are but we are going to use one hot-encoding 
# to make our categorical values more digestible.

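# First, let's process our target value.
# Note: this cell is not idempotent. The TypeError below most likely comes from
# re-running it after target had already been converted to int64
# (see the df.dtypes output in the next cell).
df.target.replace({' <=50K': 0, ' >50K': 1}, inplace=True)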

TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'

In [60]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
target             int64
dtype: object

In [66]:
# now let's one-hot encode our categorical values
df2 = pd.get_dummies(df, prefix_sep="__")
df2.head(10)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,target,workclass__ Federal-gov,workclass__ Local-gov,workclass__ Never-worked,workclass__ Private,workclass__ Self-emp-inc,workclass__ Self-emp-not-inc,workclass__ State-gov,workclass__ Without-pay,education__ 10th,education__ 11th,education__ 12th,education__ 1st-4th,education__ 5th-6th,education__ 7th-8th,education__ 9th,education__ Assoc-acdm,education__ Assoc-voc,education__ Bachelors,education__ Doctorate,education__ HS-grad,education__ Masters,education__ Preschool,education__ Prof-school,education__ Some-college,marital_status__ Divorced,marital_status__ Married-AF-spouse,marital_status__ Married-civ-spouse,marital_status__ Married-spouse-absent,marital_status__ Never-married,marital_status__ Separated,marital_status__ Widowed,occupation__ Adm-clerical,occupation__ Armed-Forces,occupation__ Craft-repair,occupation__ Exec-managerial,occupation__ Farming-fishing,occupation__ Handlers-cleaners,occupation__ Machine-op-inspct,occupation__ Other-service,occupation__ Priv-house-serv,occupation__ Prof-specialty,occupation__ Protective-serv,occupation__ Sales,occupation__ Tech-support,occupation__ Transport-moving,relationship__ Husband,relationship__ Not-in-family,relationship__ Other-relative,relationship__ Own-child,relationship__ Unmarried,relationship__ Wife,race__ Amer-Indian-Eskimo,race__ Asian-Pac-Islander,race__ Black,race__ Other,race__ White,sex__ Female,sex__ Male,native_country__ Cambodia,native_country__ Canada,native_country__ China,native_country__ Columbia,native_country__ Cuba,native_country__ Dominican-Republic,native_country__ Ecuador,native_country__ El-Salvador,native_country__ England,native_country__ France,native_country__ Germany,native_country__ Greece,native_country__ Guatemala,native_country__ Haiti,native_country__ Holand-Netherlands,native_country__ Honduras,native_country__ Hong,native_country__ Hungary,native_country__ India,native_country__ Iran,native_country__ Ireland,native_country__ Italy,native_country__ Jamaica,native_country__ Japan,native_country__ Laos,native_country__ Mexico,native_country__ Nicaragua,native_country__ Outlying-US(Guam-USVI-etc),native_country__ Peru,native_country__ Philippines,native_country__ Poland,native_country__ Portugal,native_country__ Puerto-Rico,native_country__ Scotland,native_country__ South,native_country__ Taiwan,native_country__ Thailand,native_country__ Trinadad&Tobago,native_country__ United-States,native_country__ Vietnam,native_country__ Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,37,284582,14,0,0,40,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
6,49,160187,5,0,0,16,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,52,209642,9,0,0,45,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
8,31,45781,14,14084,0,50,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
9,42,159449,13,5178,0,40,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [67]:
df2.dtypes

age                                            int64
fnlwgt                                         int64
education_num                                  int64
capital_gain                                   int64
capital_loss                                   int64
hours_per_week                                 int64
target                                         int64
workclass__ Federal-gov                        uint8
workclass__ Local-gov                          uint8
workclass__ Never-worked                       uint8
workclass__ Private                            uint8
workclass__ Self-emp-inc                       uint8
workclass__ Self-emp-not-inc                   uint8
workclass__ State-gov                          uint8
workclass__ Without-pay                        uint8
education__ 10th                               uint8
education__ 11th                               uint8
education__ 12th                               uint8
education__ 1st-4th                           

In [64]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Columns: 106 entries, age to native_country_ Yugoslavia
dtypes: int64(7), uint8(99)
memory usage: 4.8 MB


In [15]:
# I think we're ready for some analysis

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
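
The workflow described above might look roughly like the following sketch. It runs on synthetic stand-in data (the `demo` frame, its columns, and the way the label is generated are all assumptions for illustration), separates the label before scaling so it stays 0/1, and lists one coefficient per feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

# Synthetic stand-in data (hypothetical, not the Adult dataset)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "education_num": rng.integers(1, 17, size=n),
    "hours_per_week": rng.integers(10, 80, size=n),
    "is_female": rng.integers(0, 2, size=n),
})
# The label is constructed to depend positively on education and hours
signal = 0.4 * (demo["education_num"] - 9) + 0.05 * (demo["hours_per_week"] - 45)
demo["target"] = (signal + rng.normal(0, 1, size=n) > 0).astype(int)

# Separate features from the label BEFORE scaling so the label stays 0/1
y = demo["target"]
X = demo.drop(columns="target").astype(float)
num_cols = ["education_num", "hours_per_week"]
X[num_cols] = scale(X[num_cols])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One coefficient per feature, sorted from most negative to most positive
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
print("Training accuracy:", model.score(X, y))
```

Because the numeric features are standardized, each of their coefficients describes the change in log-odds per one standard deviation, which makes them easier to compare with each other.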

In [16]:
df2.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,target,sex__ Female,sex__ Male,workclass__ Federal-gov,workclass__ Local-gov,workclass__ Never-worked,workclass__ Private,workclass__ Self-emp-inc,workclass__ Self-emp-not-inc,workclass__ State-gov,workclass__ Without-pay,education__ 10th,education__ 11th,education__ 12th,education__ 1st-4th,education__ 5th-6th,education__ 7th-8th,education__ 9th,education__ Assoc-acdm,education__ Assoc-voc,education__ Bachelors,education__ Doctorate,education__ HS-grad,education__ Masters,education__ Preschool,education__ Prof-school,education__ Some-college,marital_status__ Divorced,marital_status__ Married-AF-spouse,marital_status__ Married-civ-spouse,marital_status__ Married-spouse-absent,marital_status__ Never-married,marital_status__ Separated,marital_status__ Widowed,occupation__ Adm-clerical,occupation__ Armed-Forces,occupation__ Craft-repair,occupation__ Exec-managerial,occupation__ Farming-fishing,occupation__ Handlers-cleaners,occupation__ Machine-op-inspct,occupation__ Other-service,occupation__ Priv-house-serv,occupation__ Prof-specialty,occupation__ Protective-serv,occupation__ Sales,occupation__ Tech-support,occupation__ Transport-moving,relationship__ Husband,relationship__ Not-in-family,relationship__ Other-relative,relationship__ Own-child,relationship__ Unmarried,relationship__ Wife,race__ Amer-Indian-Eskimo,race__ Asian-Pac-Islander,race__ Black,race__ Other,race__ White,native_country__ Cambodia,native_country__ Canada,native_country__ China,native_country__ Columbia,native_country__ Cuba,native_country__ Dominican-Republic,native_country__ Ecuador,native_country__ El-Salvador,native_country__ England,native_country__ France,native_country__ Germany,native_country__ Greece,native_country__ Guatemala,native_country__ Haiti,native_country__ Holand-Netherlands,native_country__ Honduras,native_country__ Hong,native_country__ Hungary,native_country__ India,native_country__ Iran,native_country__ Ireland,native_country__ Italy,native_country__ Jamaica,native_country__ Japan,native_country__ Laos,native_country__ Mexico,native_country__ Nicaragua,native_country__ Outlying-US(Guam-USVI-etc),native_country__ Peru,native_country__ Philippines,native_country__ Poland,native_country__ Portugal,native_country__ Puerto-Rico,native_country__ Scotland,native_country__ South,native_country__ Taiwan,native_country__ Thailand,native_country__ Trinadad&Tobago,native_country__ United-States,native_country__ Vietnam,native_country__ Yugoslavia
0,39,77516,13,2174,0,40,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
# For the sake of time I'm going to scale the numerical values off the bat.
# I may come back and compare pre and post scaling if I have enough time

#SKL Scale
#  --> type coercion was breaking this
# def scale_numeric_data(df):
#     for col in df.columns:
#         if df[col].dtype == np.int64:
#             df[col] = scale(df[col])

#     return df

# df3 = scale_numeric_data(df2);
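# Note: df3 is the same object as df2 here (no copy), so df2 is modified as well
df3 = df2
# Just making this change manually
# Caution: int_cols is taken from df, so it includes 'target'; scaling it turns
# the 0/1 labels into floats (the -0.563199 values in the output below)
int_cols = df.select_dtypes('number').columns.tolist()
print(int_cols)
df3[int_cols] = scale(df2[int_cols])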

# looks like we have what we want!
df3.head()

['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week', 'target']


Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,target,sex__ Female,sex__ Male,workclass__ Federal-gov,workclass__ Local-gov,workclass__ Never-worked,workclass__ Private,workclass__ Self-emp-inc,workclass__ Self-emp-not-inc,workclass__ State-gov,workclass__ Without-pay,education__ 10th,education__ 11th,education__ 12th,education__ 1st-4th,education__ 5th-6th,education__ 7th-8th,education__ 9th,education__ Assoc-acdm,education__ Assoc-voc,education__ Bachelors,education__ Doctorate,education__ HS-grad,education__ Masters,education__ Preschool,education__ Prof-school,education__ Some-college,marital_status__ Divorced,marital_status__ Married-AF-spouse,marital_status__ Married-civ-spouse,marital_status__ Married-spouse-absent,marital_status__ Never-married,marital_status__ Separated,marital_status__ Widowed,occupation__ Adm-clerical,occupation__ Armed-Forces,occupation__ Craft-repair,occupation__ Exec-managerial,occupation__ Farming-fishing,occupation__ Handlers-cleaners,occupation__ Machine-op-inspct,occupation__ Other-service,occupation__ Priv-house-serv,occupation__ Prof-specialty,occupation__ Protective-serv,occupation__ Sales,occupation__ Tech-support,occupation__ Transport-moving,relationship__ Husband,relationship__ Not-in-family,relationship__ Other-relative,relationship__ Own-child,relationship__ Unmarried,relationship__ Wife,race__ Amer-Indian-Eskimo,race__ Asian-Pac-Islander,race__ Black,race__ Other,race__ White,native_country__ Cambodia,native_country__ Canada,native_country__ China,native_country__ Columbia,native_country__ Cuba,native_country__ Dominican-Republic,native_country__ Ecuador,native_country__ El-Salvador,native_country__ England,native_country__ France,native_country__ Germany,native_country__ Greece,native_country__ Guatemala,native_country__ Haiti,native_country__ Holand-Netherlands,native_country__ Honduras,native_country__ Hong,native_country__ Hungary,native_country__ India,native_country__ Iran,native_country__ Ireland,native_country__ Italy,native_country__ Jamaica,native_country__ Japan,native_country__ Laos,native_country__ Mexico,native_country__ Nicaragua,native_country__ Outlying-US(Guam-USVI-etc),native_country__ Peru,native_country__ Philippines,native_country__ Poland,native_country__ Portugal,native_country__ Puerto-Rico,native_country__ Scotland,native_country__ South,native_country__ Taiwan,native_country__ Thailand,native_country__ Trinadad&Tobago,native_country__ United-States,native_country__ Vietnam,native_country__ Yugoslavia
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,-0.563199,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,-0.563199,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,-0.563199,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,-0.563199,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,-0.563199,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
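# Separating my independent and dependent variables
# The scaling step above also standardized 'target', which is what raised
# "ValueError: Unknown label type: 'continuous'" when this cell was first run,
# so take the 0/1 labels from df, where target is still int64
y = df.target
X = df3.drop(columns='target')
# Defining my logistic regression
lr = LogisticRegression()

# Fitting logistic regression
lr.fit(X, y)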

In [None]:
lr.score(X, y)

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.
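
To make the "comparable in weight" idea concrete, here is a hedged sketch of ranking coefficients and converting them to odds ratios. The `top_coefficients` helper and the synthetic feature names are hypothetical, introduced only for this example; the data are random draws, not the Adult dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def top_coefficients(model, feature_names, k=3):
    """Return the k most negative and k most positive coefficients (hypothetical helper)."""
    coefs = pd.Series(model.coef_[0], index=feature_names).sort_values()
    return coefs.head(k), coefs.tail(k)[::-1]

# Synthetic features drawn from a standard normal, so roughly standardized
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 4)),
                 columns=["f_pos", "f_neg", "f_weak", "f_noise"])
logit = 2.0 * X["f_pos"] - 2.0 * X["f_neg"] + 0.3 * X["f_weak"]
y = (logit + rng.logistic(size=400) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
negative, positive = top_coefficients(model, X.columns, k=1)
print("Most negative:", negative.index[0])
print("Most positive:", positive.index[0])

# exp(coefficient) is the multiplicative change in the odds for a one-unit
# (here roughly one standard deviation) increase in the feature
print(np.exp(positive))
```

Keep in mind that a large coefficient is not the same as a statistically significant one; this ranking only describes the fitted model.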

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

**TODO - your answers!**