<a href="https://colab.research.google.com/github/ed-chin-git/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

### IMPORTS

In [0]:
### imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
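
One possible way to treat missing categorical values as their own category, sketched on a hypothetical toy column. The ` ?` placeholder (with its leading space) is what the raw file uses; the label `unknown` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy column; ' ?' stands in for a missing value.
toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov", " ?"]})

# Strip whitespace, then relabel the placeholder as its own category.
toy["workclass"] = toy["workclass"].str.strip().replace("?", "unknown")

# The placeholder now gets its own indicator column.
dummies = pd.get_dummies(toy, columns=["workclass"], dtype=int)
print(dummies.columns.tolist())
```

This keeps every row, at the cost of one extra indicator column per feature that has missing values.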


### Load and name the features
**target**: >50K, <=50K.

**age**: continuous.

**workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

**fnlwgt**: continuous.

**education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

**education-num**: continuous.

**marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

**occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

**relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

**race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

**sex**: Female, Male.

**capital-gain**: continuous.

**capital-loss**: continuous.

**hours-per-week**: continuous.

**native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [65]:
data_url='https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
cols=['age','workclass','finalwgt','education','education-num' ,'marital-status','occupation','relationship' ,'race' ,'sex','capital_gain','capital_loss','hours_per_week','native_country','target' ]
adult_df=pd.read_csv(data_url,header=None, index_col=False,names=cols)
print(adult_df.shape)
adult_df.head()

(32561, 15)


| | age | workclass | finalwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [0]:
adult_orig=adult_df.copy()

In [67]:
adult_df.isna().sum()

age               0
workclass         0
finalwgt          0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
target            0
dtype: int64
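
`isna()` reports zeros here because a placeholder string is not a true null. A small sketch of counting a placeholder directly, using a hypothetical toy frame (the column values are made up; only the ` ?` marker comes from the data):

```python
import pandas as pd

# Hypothetical toy frame with ' ?' standing in for missing values.
toy = pd.DataFrame({
    "occupation": [" Sales", " ?", " Tech-support"],
    "workclass": [" Private", " ?", " ?"],
    "age": [39, 50, 38],
})

# isna() finds nothing, because ' ?' is an ordinary string.
null_count = int(toy.isna().sum().sum())

# Count the placeholder in each non-numeric column instead.
text_cols = toy.select_dtypes(exclude="number")
placeholder_counts = (text_cols == " ?").sum()

print(null_count)
print(placeholder_counts)
```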

### Missing values appear as ' ?' in occupation and workclass; drop those rows for now

In [68]:
adult_df=adult_df[adult_df.occupation!=' ?']
print(adult_df['occupation'].value_counts())

adult_df=adult_df[adult_df.workclass!=' ?']
print('\n',adult_df['workclass'].value_counts())


 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: occupation, dtype: int64

  Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
Name: workclass, dtype: int64


### One-hot encode all categorical features except target

In [69]:
cols_enc=['workclass','education','marital-status','occupation','relationship' ,'race', 'sex', 'native_country']
adult_enc=pd.get_dummies(adult_df, columns=cols_enc, prefix=cols_enc)
adult_enc.head(10)


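| | age | finalwgt | education-num | capital_gain | capital_loss | hours_per_week | target | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | <=50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | <=50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | <=50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | <=50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | <=50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 37 | 284582 | 14 | 0 | 0 | 40 | <=50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 49 | 160187 | 5 | 0 | 0 | 16 | <=50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 52 | 209642 | 9 | 0 | 0 | 45 | >50K | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 31 | 45781 | 14 | 14084 | 0 | 50 | >50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 42 | 159449 | 13 | 5178 | 0 | 40 | >50K | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |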


In [70]:
adult_enc.columns

Index(['age', 'finalwgt', 'education-num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'target', 'workclass_ Federal-gov',
       'workclass_ Local-gov', 'workclass_ Private',
       ...
       'native_country_ Portugal', 'native_country_ Puerto-Rico',
       'native_country_ Scotland', 'native_country_ South',
       'native_country_ Taiwan', 'native_country_ Thailand',
       'native_country_ Trinadad&Tobago', 'native_country_ United-States',
       'native_country_ Vietnam', 'native_country_ Yugoslavia'],
      dtype='object', length=106)

### Label encode the remaining categorical column (target)



In [71]:
from sklearn.preprocessing import LabelEncoder

# Label-encode every column of dtype category or object.
# Note: this modifies the DataFrame in place and also returns it.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    for feature in columnsToEncode:
        try:
            # fit a fresh encoder for each column
            df[feature] = LabelEncoder().fit_transform(df[feature])
        except (TypeError, ValueError):
            print('Error encoding ' + feature)
    return df

dummyEncode(adult_enc)
adult_enc.head()

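| | age | finalwgt | education-num | capital_gain | capital_loss | hours_per_week | target | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |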


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
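
For the optional validation step, a minimal sketch of a held-out split might look like the following. It runs on synthetic data from `make_classification` as a stand-in for the encoded adult features, so the `X_demo`/`y_demo` names and the resulting accuracies are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded adult features (an assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 25% of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only,
# so nothing about the test rows leaks into the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers would suggest overfitting; similar values suggest the training score is a fair estimate.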

### Scale the data and run Logistic Regression

In [72]:
X = adult_enc.drop('target', axis='columns')
y = adult_enc['target']


# standardize features so coefficients are comparable and lbfgs converges faster
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)

# multi_class is omitted: its default already handles this binary target
adult_log1 = LogisticRegression(max_iter=500, solver='lbfgs').fit(X, y)




### Score: mean accuracy on the training data (not a pseudo R²)

In [73]:
adult_log1.score(X, y)

0.8495344749007097
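
An accuracy figure is easier to judge next to the accuracy of always predicting the majority class. The sketch below shows that comparison on synthetic, imbalanced data. The data-generating choices (one informative feature, the `3.0` and `-2.0` constants) are assumptions for illustration and say nothing about the adult dataset's actual baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in for the income data (an assumption):
# one informative feature, two noise features, a minority of positives.
X_demo = rng.normal(size=(2000, 3))
logits = 3.0 * X_demo[:, 0] - 2.0
y_demo = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
model = LogisticRegression().fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
model_acc = model.score(X_demo, y_demo)
print(f"baseline: {baseline_acc:.3f}, logistic regression: {model_acc:.3f}")
```

The useful quantity is the gap between the two scores, not the raw accuracy on its own.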

### Here are the Coefficients and the intercept

In [74]:
adult_log1.coef_

array([[ 3.35153571e-01,  7.61987051e-02,  3.63555760e-01,
         2.40823567e+00,  2.64628832e-01,  3.51866620e-01,
         1.03975652e-01, -2.12314731e-02,  4.32810598e-02,
         5.14777549e-02, -1.07846991e-01, -4.23080861e-02,
        -1.48375086e-01, -8.24941602e-02, -1.02862444e-01,
        -4.04794602e-02, -3.17653018e-02, -3.80589373e-02,
        -9.82753671e-02, -6.94984233e-02, -1.57769865e-02,
         1.09089852e-02,  1.44088545e-01,  1.10339795e-01,
        -7.81277327e-02,  1.35726268e-01, -4.73872209e-01,
         1.32013134e-01,  1.22737751e-02, -2.20851181e-01,
         5.55864542e-02,  7.27462629e-01, -7.35810826e-02,
        -5.23103551e-01, -1.32032458e-01, -7.55821187e-02,
        -1.22603971e-02, -2.07263192e-02,  9.29331402e-03,
         2.53199372e-01, -1.82947334e-01, -1.50475240e-01,
        -8.12717522e-02, -2.68799878e-01, -2.88776542e-01,
         1.62742478e-01,  7.69177114e-02,  7.76415235e-02,
         1.05917025e-01, -3.48550479e-02, -5.29875862e-0

In [75]:
adult_log1.intercept_

array([-1.97985002])
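
Because the features were standardized, each one equals zero at its training mean, so the intercept is the log-odds of `>50K` for a (hypothetical) row with every feature at its mean. A short sketch converting that log-odds to a probability, using the intercept value printed above:

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Intercept value printed by adult_log1.intercept_ above.
intercept = -1.97985002

baseline_prob = sigmoid(intercept)
print(f"P(>50K) at the mean feature vector: {baseline_prob:.3f}")
```

The result is roughly 0.12. No real row has every dummy column at its mean, so treat this as a reference point rather than a prediction for an actual person.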

### Sorted list of feature coefficients

In [76]:
# drop the 'target' column so that coefficients line up with feature names
feature_names = adult_enc.drop(columns='target').columns

# pair each coefficient with its feature name
coefs_df = pd.DataFrame({'Coef': adult_log1.coef_[0], 'Feature': feature_names})
coefs_df = coefs_df.sort_values('Coef', ascending=False)

# display the whole table, largest coefficient first
coefs_df

| | Coef | Feature |
|---|---|---|
| 3 | 2.408236 | capital_gain |
| 31 | 0.727463 | marital-status_ Married-civ-spouse |
| 2 | 0.363556 | education-num |
| 5 | 0.351867 | hours_per_week |
| 0 | 0.335154 | age |
| 4 | 0.264629 | capital_loss |
| 55 | 0.260523 | relationship_ Wife |
| 39 | 0.253199 | occupation_ Exec-managerial |
| 62 | 0.203540 | sex_ Male |
| 45 | 0.162742 | occupation_ Prof-specialty |
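
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which can be easier to read. A small sketch using three coefficients copied from the sorted table above (the `coefs` Series is built by hand for illustration):

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the sorted table above.
coefs = pd.Series({
    "capital_gain": 2.408236,
    "marital-status_ Married-civ-spouse": 0.727463,
    "education-num": 0.363556,
})

# exp(coef) is the multiplicative change in the odds of >50K for a
# one-standard-deviation increase in the standardized feature.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

Since the model was fit on standardized features, each ratio refers to a one-standard-deviation change, not a one-unit change in the original column.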


## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

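1. What are 3 features positively correlated with income above 50k?
   - `hours_per_week`: +0.3519
   - `age`: +0.3352
   - `sex_ Male`: +0.2035
   - `workclass_ Federal-gov`: +0.1040

2. What are 3 features negatively correlated with income above 50k?
   - `marital-status_ Never-married`: -0.5231
   - `education_ Preschool`: -0.4739
   - `sex_ Female`: -0.2035 (the mirror image of `sex_ Male`, since the two dummy columns are complementary)

3. Overall, how well does the model explain the data and what insights do you derive from it?

   Pretty good at first glance: the score is 0.8495. Note that `score` reports mean accuracy on the training data, not a pseudo R², so it says how often the model classifies correctly rather than how much variance it explains, and it says nothing yet about performance on unseen data. The largest positive coefficients belong to `capital_gain` (+2.4082) and `marital-status_ Married-civ-spouse` (+0.7275).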


*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades
   **Answer:** Quantile Regression. You are modeling a specific quantile of the outcome (the bottom tier of grades) rather than its mean.

2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
   **Answer:** Survival Analysis. You are trying to predict the time until a certain event (a product launch) occurs.


3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.
   **Answer:** Ridge Regression. You have few observations and many features, so an unregularized model would overfit.


Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis
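
As an illustration of why ridge regression suits the few-rows, many-features case, the sketch below compares it with ordinary least squares on synthetic data. The sizes (30 rows, 100 features) and `alpha=10.0` are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in: few "plants", many measured features.
n_plants, n_features = 30, 100
X_demo = rng.normal(size=(n_plants, n_features))
true_coefs = np.zeros(n_features)
true_coefs[:5] = 1.0  # only five features actually matter
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=n_plants)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)

# With more features than rows, OLS fits the training data perfectly,
# noise included; the ridge penalty shrinks the coefficients instead.
ols_train_r2 = ols.score(X_demo, y_demo)
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS training R^2: {ols_train_r2:.3f}")
print(f"coefficient norm, OLS: {ols_norm:.2f}, ridge: {ridge_norm:.2f}")
```

A perfect training fit with more features than rows is a symptom of overfitting, which is what the L2 penalty is meant to counteract.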

**Answers are given inline above, under each situation.**