<a href="https://colab.research.google.com/github/chadeowen/DS-Unit-2-Sprint-3-Advanced-Regression/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function that one-hot encodes categorical columns) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [118]:
import pandas as pd

df_adult = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',header=None)
df_adult.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [119]:
df_adult.shape

(32561, 15)

In [0]:
# age: continuous.
# workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
# fnlwgt: continuous.
# education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
# education-num: continuous.
# marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
# occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
# relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
# race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
# sex: Female, Male.
# capital-gain: continuous.
# capital-loss: continuous.
# hours-per-week: continuous.
# native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [124]:
y_values = df_adult[14].unique()
print(sorted(y_values))

[' <=50K', ' >50K']
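
The printed labels carry a leading space because the raw file separates fields with a comma followed by a space. One possible way to handle this is the `skipinitialspace` option of `pd.read_csv`. The sketch below uses a tiny inline sample (the string `raw` is made up for illustration) instead of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same "comma + space" layout
raw = "39, State-gov, <=50K\n50, Self-emp-not-inc, >50K\n"

# Default parsing keeps the space after each comma
as_is = pd.read_csv(io.StringIO(raw), header=None)

# skipinitialspace=True drops whitespace that follows the delimiter
stripped = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(sorted(as_is[2].unique()))
print(sorted(stripped[2].unique()))
```

Without stripping, comparisons such as `df[14] == '>50K'` would silently match nothing, so it is worth checking the raw label values as done above.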


In [125]:
df = df_adult.rename(columns={0:'age',1:'workclass',2:'fnlwgt',3:'education',
                         4:'education_num',5:'marital_status',6:'occupation',
                         7:'relationship',8:'race',9:'sex',10:'capital_gain',
                         11:'capital_loss',12:'hours_per_week',
                         13:'native_country'})
df.head()

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [0]:
df = df.rename(columns={14:'y'})

In [127]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
y                 object
dtype: object

In [0]:
# Label-encode each categorical column as integer codes.
# Note: LabelEncoder assigns one integer per category (in sorted order);
# this is label encoding, not one-hot encoding.
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['workclass', 'education', 'marital_status', 'occupation',
                    'relationship', 'race', 'sex', 'native_country', 'y']

for col in categorical_cols:
    df[col + '_encoded'] = LabelEncoder().fit_transform(df[col])
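
`LabelEncoder` maps each category to an integer, which is not the same thing as one-hot encoding. The sketch below contrasts the two on a made-up `race` column (the series is hypothetical, not the notebook's data), to show why the distinction matters for a linear model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
race = pd.Series(["White", "Black", "Other", "White"], name="race")

# Label encoding: integers follow the sorted order of the category names,
# so the model would treat "White" (2) as "more" than "Black" (0)
label_codes = LabelEncoder().fit_transform(race)
print(label_codes)

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(race, prefix="race")
print(one_hot.columns.tolist())
```

With label codes, logistic regression fits a single slope across an arbitrary ordering; with one-hot columns, each category gets its own coefficient.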

In [129]:
df_encoded = df.drop(columns=['workclass','education','marital_status',
                              'occupation','relationship','race','sex',
                             'native_country','y'])

df_encoded.head()

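|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_encoded | education_encoded | marital_status_encoded | occupation_encoded | relationship_encoded | race_encoded | sex_encoded | native_country_encoded | y_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 7 | 9 | 4 | 1 | 1 | 4 | 1 | 39 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 39 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 4 | 11 | 0 | 6 | 1 | 4 | 1 | 39 | 0 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | 4 | 1 | 2 | 6 | 0 | 2 | 1 | 39 | 0 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | 4 | 9 | 2 | 10 | 5 | 2 | 0 | 5 | 0 |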


In [130]:
# y_encoded: 0 = income <=50K, 1 = income >50K
# unknown values are kept and encoded as their own category instead of being dropped

df_encoded.dtypes

age                       int64
fnlwgt                    int64
education_num             int64
capital_gain              int64
capital_loss              int64
hours_per_week            int64
workclass_encoded         int64
education_encoded         int64
marital_status_encoded    int64
occupation_encoded        int64
relationship_encoded      int64
race_encoded              int64
sex_encoded               int64
native_country_encoded    int64
y_encoded                 int64
dtype: object

In [131]:
df_encoded.shape

(32561, 15)

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [134]:
# standardize the features (zero mean, unit variance); the target is left as-is

from sklearn.preprocessing import scale

scale_x = df_encoded.drop(columns=['y_encoded'])

scale_x = pd.DataFrame(scale(scale_x), columns=scale_x.columns)


In [135]:
scale_x.head()

|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_encoded | education_encoded | marital_status_encoded | occupation_encoded | relationship_encoded | race_encoded | sex_encoded | native_country_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.030671 | -1.063611 | 1.134739 | 0.148453 | -0.21666 | -0.035429 | 2.150579 | -0.335437 | 0.921634 | -1.317809 | -0.277805 | 0.393668 | 0.703071 | 0.291569 |
| 1 | 0.837109 | -1.008707 | 1.134739 | -0.14592 | -0.21666 | -2.222153 | 1.463736 | -0.335437 | -0.406212 | -0.608387 | -0.900181 | 0.393668 | 0.703071 | 0.291569 |
| 2 | -0.042642 | 0.245079 | -0.42006 | -0.14592 | -0.21666 | -0.035429 | 0.09005 | 0.181332 | -1.734058 | -0.135438 | -0.277805 | 0.393668 | 0.703071 | 0.291569 |
| 3 | 1.057047 | 0.425801 | -1.197459 | -0.14592 | -0.21666 | -0.035429 | 0.09005 | -2.402511 | -0.406212 | -0.135438 | -0.900181 | -1.962621 | 0.703071 | 0.291569 |
| 4 | -0.775768 | 1.408176 | 1.134739 | -0.14592 | -0.21666 | -0.035429 | 0.09005 | -0.335437 | -0.406212 | 0.810458 | 2.211698 | -1.962621 | -1.422331 | -4.054223 |
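
For reference, `sklearn.preprocessing.scale` standardizes each column: it subtracts the column mean and divides by the column's (population) standard deviation. A minimal sketch on a made-up `hours` column (hypothetical values) checks this against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical single-column feature
hours = np.array([[13.0], [40.0], [40.0], [60.0]])

# Library standardization
scaled = scale(hours)

# Manual z-score: (x - mean) / std, using the population std (ddof=0)
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

This is standardization of the inputs, which is a different thing from regularization of the model (a penalty on the coefficients).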


In [136]:
# prediction task is to determine whether a person makes over 50k a year

from sklearn.linear_model import LogisticRegression

y = df_encoded['y_encoded']
X = scale_x

df_log1 = LogisticRegression().fit(X, y)
df_log1.score(X, y)



0.8250667977027732

**Training accuracy is about 82.5%. I decided NOT to scale my y values (only the features are standardized), which increased the score by roughly 3 percentage points.**
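
An accuracy figure is easier to judge next to a baseline that always predicts the most common class. The sketch below shows the comparison on synthetic, imbalanced data (the arrays `X_demo` and `y_demo` are made up here and are not the Adult data). On the notebook's data, the matching baseline would be `y.value_counts(normalize=True).max()`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
logits = 2.0 * X_demo[:, 0] - 1.5          # negative intercept: minority positive class
y_demo = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Baseline: always predict the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

# Logistic regression on the same data
model = LogisticRegression().fit(X_demo, y_demo)
model_acc = model.score(X_demo, y_demo)

print(baseline_acc, model_acc)
```

The gap between the two numbers, rather than the raw accuracy, indicates how much the features are contributing.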

In [137]:
# coefficients after applying log regression to df

df_log1.coef_

array([[ 0.4646181 ,  0.05373115,  0.85222912,  2.32323214,  0.27335535,
         0.36885636, -0.03154805,  0.06094938, -0.35307137,  0.04470072,
        -0.19192929,  0.09619363,  0.42082713,  0.02689289]])

In [138]:
X.columns

Index(['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'workclass_encoded', 'education_encoded',
       'marital_status_encoded', 'occupation_encoded', 'relationship_encoded',
       'race_encoded', 'sex_encoded', 'native_country_encoded'],
      dtype='object')

In [141]:
# double check / easier to read: one coefficient per feature

print(df_log1.coef_[0,0],'age')
print(df_log1.coef_[0,1],'fnlwgt')
print(df_log1.coef_[0,2],'education num')
print(df_log1.coef_[0,3],'capital gain')
print(df_log1.coef_[0,4],'capital loss')
print(df_log1.coef_[0,5],'hours per week')
print(df_log1.coef_[0,6],'workclass')
print(df_log1.coef_[0,7],'education')
print(df_log1.coef_[0,8],'marital status')
print(df_log1.coef_[0,9],'occupation')
print(df_log1.coef_[0,10],'relationship')
print(df_log1.coef_[0,11],'race')
print(df_log1.coef_[0,12],'sex')
print(df_log1.coef_[0,13],'native country')

0.46461809781598684 age
0.05373114780758919 fnlwgt
0.8522291154827533 education num
2.323232144749244 capital gain
0.2733553470255125 capital loss
0.36885636249426884 hours per week
-0.03154805058023154 workclass
0.06094937540324338 education
-0.35307137038644554 marital status
0.04470071984275184 occupation
-0.19192928757002667 relationship
0.09619363165733534 race
0.42082712933404326 sex
0.026892885287979597 native country
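
Because the features were standardized, each coefficient is the change in log-odds of `>50K` for a one-standard-deviation increase in that feature. Exponentiating gives an odds ratio, which is often easier to read. A minimal sketch using the coefficient values printed above (the `coefs` series is typed in by hand here):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output of df_log1.coef_ above
coefs = pd.Series({
    "age": 0.4646181,
    "fnlwgt": 0.05373115,
    "education_num": 0.85222912,
    "capital_gain": 2.32323214,
    "capital_loss": 0.27335535,
    "hours_per_week": 0.36885636,
    "workclass_encoded": -0.03154805,
    "education_encoded": 0.06094938,
    "marital_status_encoded": -0.35307137,
    "occupation_encoded": 0.04470072,
    "relationship_encoded": -0.19192929,
    "race_encoded": 0.09619363,
    "sex_encoded": 0.42082713,
    "native_country_encoded": 0.02689289,
})

# exp(coefficient) = multiplicative change in the odds per one standard deviation
odds_ratios = np.exp(coefs).sort_values(ascending=False)
print(odds_ratios.round(3))
```

Values above 1 raise the odds of income above 50K and values below 1 lower them. For the label-encoded columns the direction depends on the arbitrary integer codes.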


## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

# Part 3
**1.**
- Capital Gain (2.32), Education Num (0.85), and Age (0.46) have the largest positive coefficients, so they are the features most positively associated with income above 50k.

**2.**
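- Workclass (-0.03), Marital Status (-0.35), and Relationship (-0.19) are the only features with negative coefficients. Because these are label-encoded categories with no natural order, the sign depends on the arbitrary integer codes, and the Workclass coefficient is close to zero.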

**3.**
- Based on my Logistic Regression, the model predicts whether or not someone's income is above 50k with 82.51% accuracy on the data it was trained on.
- Because that score comes from the training data, I cannot say from the information provided whether the model is overfit. To find out, we could test for multicollinearity, run hypothesis tests on the coefficients, or apply the model to new data and compare its predictions to the actual outcomes.
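
One possible way to check for overfitting is to score the model on rows it did not see during fitting. The sketch below uses synthetic data from `make_classification` as a stand-in (it is not the Adult data), together with `train_test_split` and `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data, assumed for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# 5-fold cross-validation gives a spread of out-of-sample scores
cv_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)

print(train_acc, test_acc, cv_scores.mean())
```

A training score far above the test or cross-validated score would suggest overfitting; similar scores suggest the model generalizes about as well as it fits.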

# Part 4
**1.**
- **Ridge Regression** matches **situation Three**. With only a few dozen observations (plants) but a very large number of features (the detailed physical measurements), traditional OLS is prone to overfitting. Regularization is a common remedy for this kind of situation, and Ridge Regression applies it by adding a penalty term that makes the problem tractable. Because variance is so high when the number of features is similar to or larger than the number of observations, we accept some extra bias to reduce it; in comes Ridge Regression. Like OLS, Ridge Regression minimizes the sum of squared residuals, but it also adds Alpha times the sum of the squared coefficients to the objective. As we tune Alpha, we can find a 'penalty' that matches how overfit the model is (how large its coefficients are) and so handle data with many features and few observations.
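
To make the effect of the penalty concrete, here is a minimal sketch with more features than observations. The data are synthetic and the names `X_plants` and `plant_yield` are hypothetical. It fits `Ridge` at several values of `alpha` and records the size of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in: 30 plants, 100 measured features (more features than rows)
rng = np.random.default_rng(0)
n_plants, n_features = 30, 100
X_plants = rng.normal(size=(n_plants, n_features))

# Only the first five features actually drive the (made-up) yield
true_coef = np.zeros(n_features)
true_coef[:5] = 1.0
plant_yield = X_plants @ true_coef + rng.normal(scale=0.5, size=n_plants)

# Larger alpha means a stronger penalty on squared coefficients
alphas = [0.1, 1.0, 10.0, 100.0]
coef_norms = [
    np.linalg.norm(Ridge(alpha=a).fit(X_plants, plant_yield).coef_)
    for a in alphas
]

for a, norm in zip(alphas, coef_norms):
    print(a, round(norm, 3))
```

The coefficient norm shrinks as `alpha` grows. In practice `alpha` would be chosen by cross-validation rather than by eye.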

**2.**
- **Quantile Regression** matches **situation One**. Quantile Regression lets us model a chosen quantile of the outcome rather than the mean, for example the bottom tier of grades that defines 'at-risk' students. This is useful here because teachers differ on which grade percentiles count as at-risk. For example, a teacher may grade on a bell curve, so that although most students receive marks below 40%, only those in the bottom 25% of the class are at risk of failure. In contrast, many teachers grade on a more standardized scale and treat scores below 65% as at-risk (below a flat D, in my experience). Additionally, some students may be exceeding expectations or failing badly; Quantile Regression is robust to such outliers, so they do not skew the findings. In short, it lets us target a specific cut of the grade distribution instead of its center.
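
Quantile regression works by minimizing the pinball (quantile) loss instead of squared error. The sketch below is a simplified illustration, not a full regression: it fits only a constant to made-up grade data (`grades` is synthetic) and shows that the constant minimizing the pinball loss at `q = 0.25` lands near the 25th percentile.

```python
import numpy as np

def pinball_loss(y_true, prediction, q):
    """Average pinball (quantile) loss of a constant prediction at quantile q."""
    diff = y_true - prediction
    # Under-predictions are weighted by q, over-predictions by (1 - q)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Hypothetical grade percentages
rng = np.random.default_rng(0)
grades = rng.normal(loc=70, scale=10, size=1001)

q = 0.25
candidates = np.linspace(grades.min(), grades.max(), 2001)
losses = [pinball_loss(grades, c, q) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(round(best, 2), round(np.quantile(grades, q), 2))
```

A full quantile regression replaces the constant with a linear function of the features, but the loss being minimized is the same.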

**3.**
- **Survival Analysis** matches **situation Two**. Survival Analysis models 'time until failure', or in this case, 'time until launch.' The independent variables for this model could include capital raised, engineers hired, product launch seminars held, time spent building the product, product type, and many others. The event of interest is the product launch, which plays the role of the 'death' event in the model. The noticeable up-front downside is the apparent assumption that every product in the making will eventually be launched, which we know is not true. Survival Analysis addresses this through censoring: a product that has not launched by the end of the observation period is recorded as censored rather than as a launch, and it still contributes information for the time it was observed. The hazard function (for example a 'bathtub'-shaped one) then describes how the chance of launch changes over time. Think of biopharmaceutical companies that are close to receiving FDA approval, only to be sent back to phase 1 or 2 of clinical trials.
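
As a rough sketch of how censoring enters the calculation, here is a hand-written Kaplan-Meier estimator in NumPy. The function `kaplan_meier` and the `months` and `launched` data are hypothetical and written for this illustration; a dedicated survival library would normally be used instead.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate at each distinct observed event time."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                 # still unlaunched just before t
        events = np.sum((durations == t) & observed)     # launches exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical months until launch; launched=0 means censored (not yet launched)
months = [6, 8, 8, 12, 15, 20, 24, 24]
launched = [1, 1, 0, 1, 1, 0, 1, 0]

times, surv = kaplan_meier(months, launched)
for t, s in zip(times, surv):
    print(t, round(s, 3))
```

Censored products never count as launches, but they stay in the at-risk group until their last observed month, which is how the method avoids assuming that every product eventually launches.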