<a href="https://colab.research.google.com/github/extrajp2014/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

It has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a method to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
# Attribute Information:
# Listing of attributes: 
# >50K, <=50K. 

# age: continuous. 
# workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
# fnlwgt: continuous. 
# education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
# education-num: continuous. 
# marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
# occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
# relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
# race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
# sex: Female, Male. 
# capital-gain: continuous. 
# capital-loss: continuous. 
# hours-per-week: continuous. 
# native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, 
# Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, 
# Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, 
# France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, 
# Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [0]:
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import math
import seaborn as sns

from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
#Survival
# import lifelines
# Quantile
import statsmodels.formula.api as smf
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV

pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)

In [68]:
names = ['age', 'workclass', 'finalweight', 'education', 'education_num', 
         'marital_status', 'occupation', 'relationship', 'race','gender',
         'capital_gain', 'capital_loss','hoursperweek','native_country', 'income']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None, names=names, na_values=' ?')
df.head().T

Unnamed: 0,0,1,2,3,4
age,39,50,38,53,28
workclass,State-gov,Self-emp-not-inc,Private,Private,Private
finalweight,77516,83311,215646,234721,338409
education,Bachelors,Bachelors,HS-grad,11th,Bachelors
education_num,13,13,9,7,13
marital_status,Never-married,Married-civ-spouse,Divorced,Married-civ-spouse,Married-civ-spouse
occupation,Adm-clerical,Exec-managerial,Handlers-cleaners,Handlers-cleaners,Prof-specialty
relationship,Not-in-family,Husband,Not-in-family,Husband,Wife
race,White,White,White,Black,Black
gender,Male,Male,Male,Male,Female


In [69]:
# preview data
print("df shape:"), print(df.shape), print("")
print("df columns:"), print(df.columns), print("")
print("df select_dtypes(include=[np.number]).columns.values:"), print(df.select_dtypes(include=[np.number]).columns.values), print("")
print("df select_dtypes(exclude=[np.number]).columns:"), print(df.select_dtypes(exclude=[np.number]).columns), print("")
print("df dtypes.sort_values(ascending=False):"), print(df.dtypes.sort_values(ascending=False)), print("")
print("df head().T:"), print(df.head().T), print("")
print("df isnull().sum().sum():"), print(df.isnull().sum().sum()), print("")
# nan finder
print("columns[df.isna().any()].tolist():"), print(df.columns[df.isna().any()].tolist()), print("")
# stats data
print("df corr().T:"), print(df.corr().T), print("")

df shape:
(32561, 15)

df columns:
Index(['age', 'workclass', 'finalweight', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'capital_gain', 'capital_loss', 'hoursperweek', 'native_country', 'income'], dtype='object')

df select_dtypes(include=[np.number]).columns.values:
['age' 'finalweight' 'education_num' 'capital_gain' 'capital_loss'
 'hoursperweek']

df select_dtypes(exclude=[np.number]).columns:
Index(['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'native_country', 'income'], dtype='object')

df dtypes.sort_values(ascending=False):
income            object
native_country    object
gender            object
race              object
relationship      object
occupation        object
marital_status    object
education         object
workclass         object
hoursperweek       int64
capital_loss       int64
capital_gain       int64
education_num      int64
finalweight        int64
age         

(None, None, None)

In [70]:
df=df.replace(np.nan,' Unknown')
# Convert categoric to numeric
df['income'] = df['income'].replace({' <=50K': 0, ' >50K': 1})
df = pd.get_dummies(df)
print(df.shape)
df.head().T

(32561, 109)


Unnamed: 0,0,1,2,3,4
age,39,50,38,53,28
finalweight,77516,83311,215646,234721,338409
education_num,13,13,9,7,13
capital_gain,2174,0,0,0,0
capital_loss,0,0,0,0,0
hoursperweek,40,13,40,40,40
income,0,0,0,0,0
workclass_ Federal-gov,0,0,0,0,0
workclass_ Local-gov,0,0,0,0,0
workclass_ Never-worked,0,0,0,0,0


In [74]:
# preview data
print("df shape:"), print(df.shape),  print("")
print("df columns:"), print(df.columns), print("")
print("df isnull().sum().sum():"), print(df.isnull().sum().sum()), print("")
print("df head().T:"), print(df.head().T), print("")

df shape:
(32561, 109)

df columns:
Index(['age', 'finalweight', 'education_num', 'capital_gain', 'capital_loss', 'hoursperweek', 'income', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked',
       ...
       'native_country_ Puerto-Rico', 'native_country_ Scotland', 'native_country_ South', 'native_country_ Taiwan', 'native_country_ Thailand', 'native_country_ Trinadad&Tobago', 'native_country_ United-States', 'native_country_ Unknown', 'native_country_ Vietnam', 'native_country_ Yugoslavia'], dtype='object', length=109)

df isnull().sum().sum():
0

df head().T:
                                                0      1       2       3       4
age                                            39     50      38      53      28
finalweight                                 77516  83311  215646  234721  338409
education_num                                  13     13       9       7      13
capital_gain                                 2174      0       0       0       0

(None, None, None)

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretible a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is your exploration to be able to answer the question, not any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [71]:
# Data source
################################ 
X = df.drop(['income'], axis=1)
y = df['income'].values
print(X.shape)
print(y.shape)

# Standardize dataset
for col in df.columns:
  if df[col].dtype == np.number: df[col] = scale(df[col])

(32561, 108)
(32561,)


In [77]:
# Logistic Regression
################################
def logistic_regression_std(X,y,random_state=42,max_iter=1000):
  model = LogisticRegression(random_state=random_state, solver='lbfgs', multi_class='multinomial', max_iter=max_iter)
  model.fit(X,y)
  m_hat = model.coef_[0]
  b_hat = model.intercept_
  R2 = model.score(X, y)
  return m_hat, b_hat,R2
m_hat, b_hat,R2 = logistic_regression_std(X,y)
print(R2)
print(len(m_hat))
print(X.shape)
print("------------")

for i,j in zip(X.columns,m_hat):
  print(f'{i:50} {j:}')

0.7957679432449863
108
(32561, 108)
------------
age                                                -1.3799589403588732e-05
finalweight                                        -3.249995113183428e-06
education_num                                      -3.1524544074676594e-06
capital_gain                                       0.0001605843384469455
capital_loss                                       0.0003502913667220305
hoursperweek                                       -1.544639759091287e-05
workclass_ Federal-gov                             3.4294081039886085e-08
workclass_ Local-gov                               9.181919309518862e-10
workclass_ Never-worked                            -6.874623908858325e-10
workclass_ Private                                 -6.778132383890236e-07
workclass_ Self-emp-inc                            9.304892499746427e-08
workclass_ Self-emp-not-inc                        -3.790220832297871e-08
workclass_ State-gov                               -1.66406218359

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
  * Education - Bachelor degree
  * Education - Master degree
  * Relationship - Married
2. What are 3 features negatively correlated with income above 50k?
 * Education - High school degree
 * Education - Associate / Vocational degree
 * Relationship - Divorced
3. Overall, how well does the model explain the data and what insights do you derive from it?
  * The logistic regression score  is 0.7957679432449863, which is a mean accuracy of predicting income above 50K.  This is a good baseline number for comparison.  From here, we can engineer extra features or use other methods to improve our prediction model.
  * Individuals with college education may have greater opportunities in choosing better jobs so it is reasonable that feature has positive correlation with income.  Likewise, individuals with associate degree or less may not have the qualifications to pursue 50K or better paying jobs with higher education requirement.
  * Also, married individuals may have higher income since they probably have better support system or network. Single individuals hypothetically have less time to pursue higher paying jobs since they are self-reliance and may require more time to perform daily tasks by themselves.  However, this could also be a combined income of both spouses but such information cannot be confirmed. If that was the case, being single or divorced only account for one income so it is understandable that the combined income of a married couple is higher than 50K.  

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
 * Quantile regression is a great choice for this scenerio  since it requires a percentile cutoff to identify "at-risk" students at a specific lower threshold in the dataset.  By regressing towards the median of the lower percentile , this method can measure the association of explanatory variables with a conditional quantile of a dependent variable, which is students with bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
 * Survival analysis, or time-to-event analysis should be use in this case.  We can set the birth event as when the product was developed and the launch of a new product to the public as the final event of interest, which marks the end of our observation period. This allow us to analyze the effect of various risk factors and predict the time of duration between development and launch.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.
 * Ridge Regression is a good choice for this case. Using this method, we can reduce the possible multicollinearity that may arise from correlated features of the few dozen plants that are being evaluated at a time.