
# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!
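As a minimal sketch of one-hot encoding, assuming a toy frame whose rows are made up for illustration (only the column names mirror the dataset):

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# One indicator column is created per distinct category value;
# the original "workclass" column is replaced by those indicators.
encoded = pd.get_dummies(toy, columns=["workclass"])
print(encoded.columns.tolist())
```

Columns not listed in `columns=` (here `age`) pass through unchanged.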

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
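One possible way to load a headerless file of this shape is sketched below. It assumes an in-memory sample (two records copied from the top of `adult.data`) rather than the real URL, and the use of `skipinitialspace=True` is a suggestion, not a requirement of the assignment:

```python
import io
import pandas as pd

# Two lines in the same layout as adult.data: no header row,
# and a space after every comma.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

names = ["age", "workclass", "fnlwgt", "education", "education-num",
         "marital-status", "occupation", "relationship", "race", "sex",
         "capital-gain", "capital-loss", "hours-per-week",
         "native-country", "result"]

# names= stops pandas from treating the first record as a header;
# skipinitialspace=True drops the blank that follows each delimiter.
adult = pd.read_csv(sample, names=names, skipinitialspace=True)
print(adult.shape)
print(adult["workclass"].tolist())
```

Without `skipinitialspace=True`, every string value keeps a leading space (for example `' Male'`), and any later lookup or `replace` has to include that space.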

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.


In [0]:
# TODO - your work!
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')

In [12]:
df.shape

(32560, 15)

In [16]:
df.head()

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [17]:
df.columns

Index(['39', ' State-gov', ' 77516', ' Bachelors', ' 13', ' Never-married',
       ' Adm-clerical', ' Not-in-family', ' White', ' Male', ' 2174', ' 0',
       ' 40', ' United-States', ' <=50K'],
      dtype='object')

In [0]:
col =['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','sex','capital-gain',
      'capital-loss','hours-per-week','native-country','result']

In [0]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=col)

In [30]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,result
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [31]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
result            0
dtype: int64

In [65]:
df['result'].value_counts()

 <=50K    24720
 >50K      7841
Name: result, dtype: int64

In [0]:
df['result']=df['result'].replace({' <=50K':0,' >50K':1})

In [0]:
df['sex']=df['sex'].replace({' Female':0,' Male':1})

In [39]:
df['result'].value_counts()

0    24720
1     7841
Name: result, dtype: int64

In [76]:
df['sex'].value_counts()

1    21790
0    10771
Name: sex, dtype: int64

In [43]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'result'],
      dtype='object')

In [0]:
df = pd.get_dummies(df, columns=['workclass',
       'marital-status', 'occupation', 'relationship', 'race',
       'native-country'])

In [69]:
df.head()

Unnamed: 0,age,fnlwgt,education,education-num,sex,capital-gain,capital-loss,hours-per-week,result,workclass_ ?,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,Bachelors,13,1,2174,0,40,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,Bachelors,13,1,0,0,13,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,HS-grad,9,1,0,0,40,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,11th,7,1,0,0,40,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,Bachelors,13,Female,0,0,40,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
df.columns

Index(['age', 'fnlwgt', 'education', 'education-num', 'sex', 'capital-gain',
       'capital-loss', 'hours-per-week', 'result', 'workclass_ ?',
       'workclass_ Federal-gov', 'workclass_ Local-gov',
       'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay',
       'marital-status_ Divorced', 'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse',
       'marital-status_ Married-spouse-absent',
       'marital-status_ Never-married', 'marital-status_ Separated',
       'marital-status_ Widowed', 'occupation_ ?', 'occupation_ Adm-clerical',
       'occupation_ Armed-Forces', 'occupation_ Craft-repair',
       'occupation_ Exec-managerial', 'occupation_ Farming-fishing',
       'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct',
       'occupation_ Other-service', 'occupation_ Priv-house-serv',
       'occupation_ Prof-specialty',

In [70]:
df.shape

(32561, 93)

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
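If there is time for the optional validation, one way to do it is sketched here. The data is synthetic stand-in data (not the adult dataset), and wrapping the scaler and the model in a pipeline is an assumption about workflow, not something the assignment prescribes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(40, 10, 500),     # age-like
                          rng.normal(0, 5000, 500)])   # gain-like
# The label depends on both features, plus noise.
signal = (X_demo[:, 0] - 40) / 10 + X_demo[:, 1] / 5000
y_demo = (signal + rng.normal(0, 0.5, 500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# The scaler is fit on the training rows only, then applied to the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```

Scoring on held-out rows gives a less optimistic estimate than scoring on the rows used for fitting.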

In [85]:
# TODO - your work!

X = df.drop(['result','education'],axis='columns')
y = df['result']

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.7979791775436872

In [0]:
log_reg.coef_

In [86]:
x_col=X.columns

X = scale(X)  # Very helpful for regularization!

# Put it in a dataframe
X = pd.DataFrame(X, columns=x_col)



In [87]:
X.head()

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.030671,-1.063611,1.134739,0.703071,0.148453,-0.21666,-0.035429,-0.24445,-0.174295,-0.262097,...,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
1,0.837109,-1.008707,1.134739,0.703071,-0.14592,-0.21666,-2.222153,-0.24445,-0.174295,-0.262097,...,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
2,-0.042642,0.245079,-0.42006,0.703071,-0.14592,-0.21666,-0.035429,-0.24445,-0.174295,-0.262097,...,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
3,1.057047,0.425801,-1.197459,0.703071,-0.14592,-0.21666,-0.035429,-0.24445,-0.174295,-0.262097,...,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,0.340954,-0.045408,-0.022173
4,-0.775768,1.408176,1.134739,-1.422331,-0.14592,-0.21666,-0.035429,-0.24445,-0.174295,-0.262097,...,-0.033729,-0.059274,-0.019201,-0.049628,-0.039607,-0.023518,-0.024163,-2.932948,-0.045408,-0.022173


In [88]:
log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.8533521697736556

In [92]:
log_reg.coef_

array([[ 3.58559261e-01,  7.40377659e-02,  7.18464068e-01,
         4.02177846e-01,  2.30775929e+00,  2.61309658e-01,
         3.70861162e-01, -6.62662409e-02,  1.05063493e-01,
        -1.53769756e-02, -6.26166284e-02,  5.73459842e-02,
         5.58661921e-02, -9.39332873e-02, -3.30275902e-02,
        -1.28163891e-01, -2.24080655e-01,  5.18663520e-02,
         7.42469204e-01, -7.44283507e-02, -5.25192358e-01,
        -1.36688047e-01, -9.11095865e-02, -7.01204752e-02,
        -1.08640927e-02, -1.93544594e-02,  1.30724189e-02,
         2.51466588e-01, -1.78629358e-01, -1.40759758e-01,
        -7.42380464e-02, -2.57373711e-01, -2.69771132e-01,
         1.77948917e-01,  7.56154480e-02,  7.86498943e-02,
         1.01806889e-01, -2.97126435e-02, -5.77467024e-02,
         1.67480096e-01, -9.45777911e-02, -2.99090662e-01,
         7.69595631e-02,  2.62801382e-01, -4.65515970e-02,
         3.00077508e-02, -2.85113505e-02, -2.83703911e-02,
         2.89883494e-02, -1.66059789e-02,  3.05507007e-0

In [93]:
# Pair each feature name with its fitted coefficient
for name, coef in zip(x_col, log_reg.coef_[0]):
  print(name, coef)

age 0.3585592609043
fnlwgt 0.07403776594602827
education-num 0.7184640681242446
sex 0.4021778460252258
capital-gain 2.3077592902744883
capital-loss 0.2613096577353619
hours-per-week 0.37086116186269974
workclass_ ? -0.06626624089079877
workclass_ Federal-gov 0.10506349298088728
workclass_ Local-gov -0.01537697555681459
workclass_ Never-worked -0.06261662836300208
workclass_ Private 0.05734598416394402
workclass_ Self-emp-inc 0.05586619212457379
workclass_ Self-emp-not-inc -0.09393328725933028
workclass_ State-gov -0.03302759021452939
workclass_ Without-pay -0.12816389126666972
marital-status_ Divorced -0.22408065484300765
marital-status_ Married-AF-spouse 0.05186635200002795
marital-status_ Married-civ-spouse 0.7424692041440113
marital-status_ Married-spouse-absent -0.07442835069659179
marital-status_ Never-married -0.5251923575052199
marital-status_ Separated -0.13668804704719997
marital-status_ Widowed -0.09110958646709172
occupation_ ? -0.07012047517037151
occupation_ Adm-clerical -

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.
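To compare scaled coefficients at a glance, they can be put in a labeled Series and sorted. The sketch below uses four values rounded from the coefficients printed earlier; with a fitted model, `X.columns` and `log_reg.coef_[0]` would take their place:

```python
import numpy as np
import pandas as pd

# Stand-ins for X.columns and log_reg.coef_[0]; values are rounded
# from the coefficient listing in Part 2.
feature_names = ["age", "education-num", "capital-gain",
                 "marital-status_ Never-married"]
coefs = np.array([0.36, 0.72, 2.31, -0.53])

# Sort from most negative to most positive.
ranked = pd.Series(coefs, index=feature_names).sort_values()

# exp(coef) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the feature (one standard deviation, when the
# feature has been standardized).
odds_ratios = np.exp(ranked)

print(ranked)
print(odds_ratios.round(2))
```

The sign gives the direction of the association and the magnitude gives a rough ranking; neither is a significance test.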

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

**TODO - your answers!**

Ans 3.1

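Features with larger positive coefficients are more strongly associated with income above 50K. In our model, education, hours per week, and working in a professional specialty occupation are positively associated with income above 50K, with the coefficients listed below. (Among the coefficients printed above, capital-gain at about 2.31 and marital-status_ Married-civ-spouse at about 0.74 are larger still.)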


1.   education-num : 0.7184640681242446
2.   hours-per-week : 0.37086116186269974
3.   occupation_ Prof-specialty : 0.17794891736472765




Ans 3.2


Features with negative coefficients are negatively associated with income above 50K. In our model, being self-employed without an incorporated business, working in farming or fishing, and being a native of China are negatively associated with income above 50K, with the coefficients listed below. (Among the coefficients printed above, marital-status_ Never-married at about -0.53 and marital-status_ Divorced at about -0.22 are more strongly negative.)


1.   native-country_ China : -0.02962876140116269
2.   occupation_ Farming-fishing : -0.17862935820229317
3.   workclass_ Self-emp-not-inc : -0.09393328725933028

Ans 3.3

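The model fits the data reasonably well, with about 85% accuracy. That figure is measured on the same data used for fitting, so it may overstate performance on new data. Comparing coefficients shows which types of job are associated with a better chance of earning more. The same applies to native country, type of employment, race, and so on. According to the model, a White person who is a native of the US and works as a professional has a high chance of earning more than 50K.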
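To put the accuracy in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below uses the class counts and the score reported earlier in the notebook:

```python
# Class counts reported by df['result'].value_counts() in Part 1.
n_low, n_high = 24720, 7841

# Accuracy of always predicting the majority class (<=50K).
baseline = n_low / (n_low + n_high)

# Training accuracy reported by log_reg.score(X, y) on the scaled data.
model_accuracy = 0.8533521697736556

print(f"majority-class baseline: {baseline:.4f}")
print(f"improvement over baseline: {model_accuracy - baseline:.4f}")
```

The baseline is roughly 76%, so the model's gain over a trivial predictor is about 9 percentage points.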

Ans 4

We would use Quantile Regression to predict "at-risk" students who are likely to receive the bottom tier of grades. Because we want to focus on "at-risk" students rather than all students, quantile regression lets us fit the low end of the grade distribution, for example the bottom 20%.
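The idea behind fitting a low quantile can be illustrated with the quantile (pinball) loss. This is a sketch with made-up grades and a constant prediction only, not a full regression:

```python
import numpy as np

def pinball_loss(y, prediction, q):
    """Mean quantile (pinball) loss of a prediction at quantile level q."""
    error = y - prediction
    return np.mean(np.maximum(q * error, (q - 1) * error))

# Hypothetical grades for illustration.
grades = np.array([52, 58, 61, 65, 70, 74, 78, 83, 88, 95], dtype=float)
q = 0.2

# Try many constant predictions; the one that minimizes the pinball loss
# at level q sits at a q-th quantile of the data.
candidates = np.linspace(grades.min(), grades.max(), 431)
losses = [pinball_loss(grades, c, q) for c in candidates]
best = candidates[int(np.argmin(losses))]
print("best constant prediction at q=0.2:", best)
```

Quantile regression minimizes this same loss, but lets the prediction depend on the features instead of being a single constant.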

We would use Survival Analysis to predict when a new product is likely to be launched. This is a time-to-event problem, and the key is to calculate the cumulative probability of a launch by a given time, which is what survival analysis does.
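As a rough sketch of the kind of estimate survival analysis produces, here is a hand-written Kaplan-Meier estimator on hypothetical launch data. The durations and censoring flags are invented, and a dedicated survival library would normally be used instead:

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)              # not yet launched just before t
        events = np.sum((durations == t) & observed)  # launches exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

# Hypothetical months until a product launch; False means no launch had
# happened by the end of the observation window (censored).
months = [3, 5, 5, 8, 12, 12]
launched = [True, True, False, True, True, False]

times, surv = kaplan_meier(months, launched)
for t, s in zip(times, surv):
    print(f"P(no launch by month {t:.0f}) = {s:.3f}")
```

One minus the survival probability is the cumulative probability that a launch has happened by that time. Censored companies still count toward the at-risk total until they drop out.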

We would use Ridge Regression to model expected plant size and yield, because the number of observations is small relative to the amount of detail captured per plant. Ordinary linear regression may overfit in that setting. Ridge Regression shrinks the coefficients, which helps the model generalize and improves predictions on new data.
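The shrinkage effect can be seen on synthetic data with few rows and many columns. Everything in this sketch (sample sizes, true coefficients, `alpha=10.0`) is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in: 30 "plants" with 25 measurements each; only the
# first three measurements actually drive the target.
n_train, n_test, n_features = 30, 200, 25
true_coef = np.zeros(n_features)
true_coef[:3] = [2.0, -1.0, 0.5]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_coef + rng.normal(scale=1.0, size=n_train)
y_test = X_test @ true_coef + rng.normal(scale=1.0, size=n_test)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

# Ridge shrinks the coefficient vector toward zero.
print("OLS   coef norm:", np.linalg.norm(ols.coef_))
print("Ridge coef norm:", np.linalg.norm(ridge.coef_))
print("OLS   test R^2:", ols.score(X_test, y_test))
print("Ridge test R^2:", ridge.score(X_test, y_test))
```

In practice `alpha` would be chosen by cross-validation, for example with `RidgeCV`, which the notebook already imports.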
