# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules).

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression


In [18]:
col_names = ['age', 'workclass', 'finalweight', 'educationlevel',
           'educationnumber', 'maritalstatus', 'occupation', 'relationship',
           'race','gender','capitalgain', 'capitalloss','hoursperweek',
           'nativecountry', 'incomelevel']

income_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
     header=None, names=col_names)
income_data.sample(15)

| | age | workclass | finalweight | educationlevel | educationnumber | maritalstatus | occupation | relationship | race | gender | capitalgain | capitalloss | hoursperweek | nativecountry | incomelevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28562 | 31 | Private | 227146 | Bachelors | 13 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 55 | United-States | >50K |
| 5126 | 20 | Private | 116791 | HS-grad | 9 | Never-married | Machine-op-inspct | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 18391 | 39 | Private | 97136 | HS-grad | 9 | Never-married | Machine-op-inspct | Unmarried | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 23662 | 37 | Private | 234807 | HS-grad | 9 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 2701 | 34 | Private | 92682 | HS-grad | 9 | Never-married | Adm-clerical | Not-in-family | White | Female | 4865 | 0 | 40 | United-States | <=50K |
| 22193 | 46 | Private | 156926 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
| 8191 | 40 | Local-gov | 105717 | Masters | 14 | Never-married | Adm-clerical | Not-in-family | White | Female | 0 | 1876 | 35 | United-States | <=50K |
| 8219 | 39 | Private | 444219 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Wife | Black | Female | 0 | 0 | 45 | United-States | <=50K |
| 8324 | 25 | Private | 166415 | HS-grad | 9 | Never-married | Transport-moving | Unmarried | White | Male | 0 | 0 | 52 | United-States | <=50K |
| 3663 | 32 | Private | 344129 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |


In [19]:
income_data.shape

(32561, 15)

In [20]:
income_data.dtypes

age                 int64
workclass          object
finalweight         int64
educationlevel     object
educationnumber     int64
maritalstatus      object
occupation         object
relationship       object
race               object
gender             object
capitalgain         int64
capitalloss         int64
hoursperweek        int64
nativecountry      object
incomelevel        object
dtype: object

In [21]:
income_data.isna().sum()

age                0
workclass          0
finalweight        0
educationlevel     0
educationnumber    0
maritalstatus      0
occupation         0
relationship       0
race               0
gender             0
capitalgain        0
capitalloss        0
hoursperweek       0
nativecountry      0
incomelevel        0
dtype: int64

In [22]:
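# Note: this exact match on '?' finds nothing, because the raw file puts a space
# after each comma and the missing-value marker is actually stored as ' ?'.
df = income_data.replace('?', np.nan)
df.isna().sum()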

age                0
workclass          0
finalweight        0
educationlevel     0
educationnumber    0
maritalstatus      0
occupation         0
relationship       0
race               0
gender             0
capitalgain        0
capitalloss        0
hoursperweek       0
nativecountry      0
incomelevel        0
dtype: int64
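
The zero counts above are most likely a whitespace issue, not a sign that the data is complete: `adult.data` separates fields with a comma followed by a space, so the marker is read as `' ?'`. One possible fix, sketched here on two illustrative rows held in a string so that it runs offline, is to let `read_csv` drop the space and flag the marker at load time. The `skipinitialspace` and `na_values` arguments are standard `pandas.read_csv` parameters; the variable names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

col_names = ['age', 'workclass', 'finalweight', 'educationlevel',
             'educationnumber', 'maritalstatus', 'occupation', 'relationship',
             'race', 'gender', 'capitalgain', 'capitalloss', 'hoursperweek',
             'nativecountry', 'incomelevel']

# Two illustrative rows in the "comma + space" layout of adult.data
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

# Default parsing keeps the leading space, so '?' is stored as ' ?'
naive = pd.read_csv(io.StringIO(raw), header=None, names=col_names)
missing_naive = naive.replace('?', np.nan).isna().sum().sum()

# skipinitialspace drops the space after each comma; na_values marks '?' as NaN
clean = pd.read_csv(io.StringIO(raw), header=None, names=col_names,
                    skipinitialspace=True, na_values='?')
missing_clean = clean.isna().sum()

print(repr(naive.loc[1, 'workclass']))
print(missing_naive)
print(missing_clean[missing_clean > 0])
```

The same two keyword arguments should work when passed to the `pd.read_csv` call on the real URL.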

In [0]:
# Replace is not working here: the raw values are ' Female' and ' Male' (leading
# space only), so the keys ' Female ' and ' Male ' never match. Need to strip the
# whitespace first or find another one hot encoding method.
income_data.gender.replace({' Female ' : 0, ' Male ': 1}, inplace=True)

In [26]:
income_data.gender

0           Male
1           Male
2           Male
3           Male
4         Female
5         Female
6         Female
7           Male
8         Female
9           Male
10          Male
11          Male
12        Female
13          Male
14          Male
15          Male
16          Male
17          Male
18          Male
19        Female
20          Male
21        Female
22          Male
23          Male
24        Female
25          Male
26          Male
27          Male
28          Male
29          Male
          ...   
32531     Female
32532       Male
32533       Male
32534     Female
32535       Male
32536     Female
32537       Male
32538     Female
32539       Male
32540     Female
32541     Female
32542       Male
32543     Female
32544     Female
32545     Female
32546     Female
32547       Male
32548       Male
32549     Female
32550       Male
32551       Male
32552       Male
32553       Male
32554       Male
32555       Male
32556     Female
32557       Male
32558     Fema
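
The printed column hides the whitespace, which is why the replace looks like it should have worked. A small sketch on a toy Series (made-up values that mimic the raw labels) shows one way to diagnose and fix this: inspect the labels with `unique()`, strip them, then map them to codes.

```python
import pandas as pd

# Toy column mimicking the raw values: each string starts with one space
gender = pd.Series([' Male', ' Female', ' Male'])

# Keys with a trailing space never match, so the data comes back unchanged
unchanged = gender.replace({' Female ': 0, ' Male ': 1})

# Strip the whitespace, then map each label to its code
coded = gender.str.strip().map({'Female': 0, 'Male': 1})

print(gender.unique())   # the array repr shows the leading spaces
print(coded.tolist())
```

Note that `map` turns any label missing from the dictionary into `NaN`, so it suits columns where every category is listed.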

In [9]:
income_data.dtypes

age                 int64
workclass          object
finalweight         int64
educationlevel     object
educationnumber     int64
maritalstatus      object
occupation         object
relationship       object
race               object
gender             object
capitalgain         int64
capitalloss         int64
hoursperweek        int64
nativecountry      object
incomelevel        object
dtype: object

In [0]:
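# Every string in the raw file starts with a space (' Male', ' Private', ...),
# so strip it before matching on the labels.
for col in df.columns:
    df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

# Assign the result back rather than calling replace(..., inplace=True) on an attribute.
df['gender'] = df['gender'].replace({'Female' : 0, 'Male': 1})
# '?' (unknown workclass) is not in the mapping and is left unchanged.
df['workclass'] = df['workclass'].replace({'Private' : 1, 'Local-gov' : 1, 'State-gov' : 1,
                             'Self-emp-not-inc' : 1, 'Without-pay' : 0,
                             'Self-emp-inc': 1, 'Never-worked' : 0,
                             'Federal-gov' : 1})
df['incomelevel'] = df['incomelevel'].replace({'<=50K' : 0, '>50K': 1})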



In [13]:
# Need to fix variables for one hot encoding before model can be run

X = df[['age', 'workclass', 'educationnumber','gender','capitalgain','capitalloss','hoursperweek']]
y = df.incomelevel

log_reg1 = LogisticRegression().fit(X, y)
log_reg1.score(X, y)



ValueError: ignored

In [0]:
# Strip the leading space that the raw file puts in front of every string value
for col in df.columns:
    df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

df['gender'] = df['gender'].replace({'Female' : 0, 'Male': 1})
# '?' (unknown workclass) is not in the mapping and is left unchanged.
df['workclass'] = df['workclass'].replace({'Private' : 1, 'Local-gov' : 1, 'State-gov' : 1,
                             'Self-emp-not-inc' : 1, 'Without-pay' : 0,
                             'Self-emp-inc': 1, 'Never-worked' : 0,
                             'Federal-gov' : 1})
df['incomelevel'] = df['incomelevel'].replace({'<=50K' : 0, '>50K': 1})
df['maritalstatus'] = df['maritalstatus'].replace({'Married-civ-spouse' : 1, 'Divorced' : 0, 'Separated' : 0,
                             'Widowed' : 0, 'Never-married' : 0,
                             'Married-AF-spouse': 1, 'Married-spouse-absent' : 1})
# The education column is named educationlevel in col_names
df['educationlevel'] = df['educationlevel'].replace({'Bachelors' : 1, 'Some-college' : 0, '11th' : 0, '12th' : 0,
                             'Prof-school' : 0, 'HS-grad' : 0, '7th-8th' : 0,
                             'Assoc-acdm': 0, 'Assoc-voc' : 0, '9th' : 0,
                             'Masters' : 1, '1st-4th' : 0, '10th' : 0,
                             'Doctorate' : 1, '5th-6th' : 0, 'Preschool' : 0})

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [0]:
# Did not get a chance to interpret results
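
For reference, here is a minimal sketch of the workflow Part 2 describes, run on synthetic data so that it is self-contained. The column names echo the notebook, but the values and the target are generated at random, so the output says nothing about real incomes. It one-hot encodes the categorical column, scales the numeric ones, fits `LogisticRegression`, and lists the coefficients by feature name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the cleaned data (not the real Adult dataset)
toy = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'hoursperweek': rng.integers(10, 80, size=n),
    'maritalstatus': rng.choice(['Married', 'Never-married', 'Divorced'], size=n),
})

# Synthetic target: working more hours raises the odds of label 1
logit = 0.1 * (toy['hoursperweek'] - 45)
toy['incomelevel'] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# One-hot encode the categorical feature and scale the numeric ones to [0, 1]
X = pd.get_dummies(toy.drop(columns='incomelevel'),
                   columns=['maritalstatus'], dtype=float)
num_cols = ['age', 'hoursperweek']
X[num_cols] = minmax_scale(X[num_cols])
y = toy['incomelevel']

model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each coefficient with its feature name, sorted from most negative to most positive
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
print('training accuracy:', model.score(X, y))
```

With the features on a common scale, the sorted coefficients give a direct, if rough, way to pick out the strongest positive and negative associations.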

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

1. Student grade performance: I would use survival analysis to model student performance over time until "failure", which in this case means falling into the bottom tier of grades. Students who have not yet fallen below the threshold would be treated as censored observations. A fixed cutoff time would be established so that we can track the academic indicators up to that point.

2. Tech company product launch: I would use ridge regression to model and predict new product launches for tech companies. The timing of development cycles can be very unpredictable, so I would start by investigating how "chaotic" product launches can be. Simply going over historical product launches would not necessarily be indicative, because launches depend on research and development discoveries and on market demand. Larger tech companies with mature products have a more predictable timeline of annual product launches, but novel products from small to medium-sized tech companies would have much larger variance. If we tried to fit a model to every product launch in our universe of tech companies, we could overfit the data, causing large swings that we could correct with regularization.

3. Plant size and yield: I would use quantile regression, which can focus on the few dozen plants that can be studied at a time. By targeting this particular quantile of the plants, for which we have detailed physical data, we would be better able to model the expected plant size and yield of the selected species. OLS, which models the mean, would potentially be less effective because we are unable to look at all the plants in the laboratory. It would depend on the overall goal of the study, but if the goal were to better understand the best yield, we would want to construct a model based on a targeted cutoff of the plants we could most effectively study and measure.
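
The regularization effect mentioned in answer 2 can be sketched on synthetic data. This is an illustration under assumed settings (random features, only one of which matters, and an arbitrary `alpha=10.0`), not a model of any of the three situations. It compares the size of the coefficient vector from ordinary least squares with that from ridge regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n_samples, n_features = 30, 25   # few observations, many features

X = rng.normal(size=(n_samples, n_features))
# Only the first feature truly matters; everything else is noise
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha chosen arbitrarily for the sketch

# The L2 penalty shrinks the coefficient vector relative to OLS
print('OLS   coefficient norm:', np.linalg.norm(ols.coef_))
print('Ridge coefficient norm:', np.linalg.norm(ridge.coef_))
```

In practice `alpha` would be tuned, for example by cross-validation, not fixed by hand.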
