
# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

It has a variety of features - some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a method to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
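Both hints can be tried out on a tiny frame before touching the real data. The following is a minimal sketch, assuming a made-up three-row table (the values in `toy` are hypothetical and not taken from the dataset); it one-hot encodes one categorical column and rescales the numeric columns to the 0 to 1 range.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    'age': [20, 40, 60],
    'hours_per_week': [10, 40, 50],
    'sex': ['Male', 'Female', 'Male'],
})

# One-hot encode the categorical column; drop_first removes one redundant column.
encoded = pd.get_dummies(toy, columns=['sex'], drop_first=True, dtype=int)

# Rescale each numeric column so its minimum maps to 0 and its maximum to 1.
numeric_cols = ['age', 'hours_per_week']
encoded[numeric_cols] = minmax_scale(encoded[numeric_cols].astype(float))

print(encoded)
```

After scaling, a coefficient on a numeric feature describes the effect of moving from that feature's minimum to its maximum, which puts it on a footing closer to the 0/1 dummy columns.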

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [198]:
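# adult.data ships without a header row, so the column names are supplied here.
# Note: 'serv_arm' holds the dataset's "relationship" field (Husband, Wife, Not-in-family, ...).
columns = ['age', 'workclass', 'fnlwgt', 'education',
           'education_num', 'marital_status', 'occupation',
           'serv_arm', 'race', 'sex', 'capital_gain',
           'capital_loss', 'hours_per_week', 'native_country',
           'class']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", names=columns)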
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,serv_arm,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [199]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
serv_arm          32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
class             32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [200]:
df.shape

(32561, 15)

In [223]:
df['class'].unique()

array([' <=50K', ' >50K'], dtype=object)
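The output above shows that every label carries a leading space, because the file separates fields with a comma followed by a space. One possible way to avoid that is the `skipinitialspace` option of `pd.read_csv`. The sketch below assumes a two-row, three-column excerpt in the same layout (the second row is made up for illustration) and compares the two ways of reading it.

```python
import io
import pandas as pd

# Two rows in the same comma-plus-space layout as adult.data, trimmed to
# three columns. The second row is hypothetical.
raw = "39, State-gov, <=50K\n52, Self-emp-inc, >50K\n"
cols = ['age', 'workclass', 'class']

# Default parsing keeps the space that follows each comma.
as_is = pd.read_csv(io.StringIO(raw), names=cols)

# skipinitialspace=True drops whitespace right after the delimiter.
trimmed = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)

print(as_is['class'].unique())
print(trimmed['class'].unique())
```

If the data were loaded this way, any later lookup keyed on the raw strings (such as a label mapping) would need keys without the leading space.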

In [224]:
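# drop_first=True drops the first category (' ?', the dataset's marker for
# missing values) so that it serves as the baseline.
workclass = pd.get_dummies(df['workclass'], drop_first=True)
workclass.head()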

Unnamed: 0,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Without-pay
0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,0
2,0,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0


In [0]:
education = pd.get_dummies(df['education'], drop_first=True)
marital_status = pd.get_dummies(df['marital_status'], drop_first=True)
occupation = pd.get_dummies(df['occupation'], drop_first=True)
serv_arm = pd.get_dummies(df['serv_arm'], drop_first=True)
race = pd.get_dummies(df['race'], drop_first=True)
sex = pd.get_dummies(df['sex'], drop_first=True)
native_country = pd.get_dummies(df['native_country'], drop_first=True)

In [226]:
# Append all dummy columns to the original frame.
df_train = pd.concat([df, workclass, education, marital_status, occupation, serv_arm, race, sex, native_country], axis=1)
df_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,serv_arm,race,sex,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,...,0,0,0,0,0,0,0,1,0,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,0,0,0,0,0,0,0,1,0,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,0,0,0,0,0,0,0,1,0,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,0,0,0,0,0,0,0,1,0,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,0,0,0,0,0,0,0,0,0,0


In [227]:
# Drop the original categorical columns now that their dummies are in place.
df_train.drop(['workclass', 'education', 'marital_status', 'occupation', 'serv_arm', 'race', 'sex', 'native_country'], axis=1, inplace=True)
df_train.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,class,Federal-gov,Local-gov,Never-worked,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,39,77516,13,2174,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,0,0,0
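The encode, concatenate, and drop steps above can probably be collapsed into a single `pd.get_dummies` call by passing the whole frame and a `columns` list. This is a sketch on a hypothetical three-row frame, not the notebook's actual pipeline; one side effect is that each dummy column is prefixed with its source column name, which makes coefficient names easier to read later.

```python
import pandas as pd

# Hypothetical toy rows using the same leading-space style as the raw data.
toy = pd.DataFrame({
    'age': [39, 50, 28],
    'workclass': [' State-gov', ' Private', ' Private'],
    'sex': [' Male', ' Male', ' Female'],
})

# Encode every listed column at once; the originals are replaced by dummies
# and the untouched columns (here 'age') are kept as they are.
encoded = pd.get_dummies(toy, columns=['workclass', 'sex'],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
```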


In [0]:
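# Encode the target as 0/1. Note the direction: 1 means income <=50K and
# 0 means income >50K, so a positive coefficient later on points toward the
# lower income bracket.
money = {' <=50K': 1, ' >50K': 0}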

In [229]:
df_train['class'] = df_train['class'].map(money); df_train.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,class,Federal-gov,Local-gov,Never-worked,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,39,77516,13,2174,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
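The mapping above assigns 1 to the lower income bracket. An alternative, shown here only as a sketch on a hypothetical three-label series, is to encode "income above 50K" as 1 so that the model predicts the probability named in the assignment and positive coefficients read as "raises the chance of earning more than 50K". Using it in place of the mapping above would flip the sign of every coefficient reported later.

```python
import pandas as pd

# Hypothetical labels in the same raw format as the 'class' column.
labels = pd.Series([' <=50K', ' >50K', ' <=50K'])

# Strip the leading space, then mark income above 50K as the positive class.
high_income = (labels.str.strip() == '>50K').astype(int)

print(high_income.tolist())
```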


In [230]:
scaler = MinMaxScaler(feature_range=(0,1))
scaler.fit(df_train[['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']])

MinMaxScaler(copy=True, feature_range=(0, 1))

In [231]:
# Replace the numeric columns with their min-max scaled values.
numeric_cols = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
df_train[numeric_cols] = scaler.transform(df_train[numeric_cols])

df_train.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,class,Federal-gov,Local-gov,Never-worked,...,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
0,0.30137,0.044302,0.8,0.02174,0.0,0.397959,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0.452055,0.048238,0.8,0.0,0.0,0.122449,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.287671,0.138113,0.533333,0.0,0.0,0.397959,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.493151,0.151068,0.4,0.0,0.0,0.397959,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0.150685,0.221488,0.8,0.0,0.0,0.397959,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [247]:
df_train.isnull().sum().sum()

0

In [248]:
df_train.shape

(32561, 101)

In [249]:
df_train[['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']].describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,0.295639,0.120545,0.605379,0.010777,0.020042,0.402423
std,0.186855,0.071685,0.171515,0.073854,0.092507,0.125994
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.150685,0.071679,0.533333,0.0,0.0,0.397959
50%,0.273973,0.112788,0.6,0.0,0.0,0.397959
75%,0.424658,0.152651,0.733333,0.0,0.0,0.44898
max,1.0,1.0,1.0,1.0,1.0,1.0


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to let you answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [0]:
X = df_train.drop('class', axis=1)
y = df_train['class']

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [0]:
log_model = LogisticRegression()

In [235]:
log_model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [0]:
predictions = log_model.predict(X_test)

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [238]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.74      0.60      0.67      1927
           1       0.88      0.94      0.91      6214

   micro avg       0.86      0.86      0.86      8141
   macro avg       0.81      0.77      0.79      8141
weighted avg       0.85      0.86      0.85      8141



In [239]:
confusion_matrix(y_test, predictions)

array([[1163,  764],
       [ 401, 5813]])

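- scikit-learn puts true labels in rows and predicted labels in columns, in sorted label order. With the encoding used here, label 0 is income above 50K and label 1 is income at or below 50K.
- The first row covers the 1,927 people whose income is actually above 50K: the model classified 1,163 of them correctly and 764 incorrectly.
- The second row covers the 6,214 people whose income is at or below 50K: the model classified 5,813 of them correctly and 401 incorrectly.

- In the classification report, these row-wise fractions are the recall values (1163/1927 ≈ 0.60 and 5813/6214 ≈ 0.94). Precision is the column-wise equivalent, and the support column gives the number of observations in each class.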
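The figures in the classification report can be recomputed directly from the confusion matrix, which is a useful check on how to read it. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted label order (0, then 1), and uses the matrix printed above.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[1163, 764],
               [401, 5813]])

correct = np.diag(cm)                 # correctly classified counts per class
recall = correct / cm.sum(axis=1)     # share of each true class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()   # overall share classified correctly

print(np.round(recall, 2), np.round(precision, 2), round(accuracy, 2))
```

The rounded values agree with the report: recall of 0.60 and 0.94, precision of 0.74 and 0.88, and overall accuracy of 0.86.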



In [240]:
coef = log_model.coef_[0]
names = X.columns
model_coef = list(zip(names, coef)); model_coef

[('age', -1.7652451466924701),
 ('fnlwgt', -0.9382070121999346),
 ('education_num', -1.4586038296839137),
 ('capital_gain', -16.102151816605588),
 ('capital_loss', -2.312657238754684),
 ('hours_per_week', -2.672013727060718),
 (' Federal-gov', -0.9366374223643636),
 (' Local-gov', -0.2257243570646299),
 (' Never-worked', 0.14842610368002668),
 (' Private', -0.39903616099336325),
 (' Self-emp-inc', -0.5522785503970382),
 (' Self-emp-not-inc', 0.08761162022365991),
 (' State-gov', -0.07346649666694827),
 (' Without-pay', 0.7858619483725492),
 (' 11th', 0.26855849971219253),
 (' 12th', -0.026785541942484094),
 (' 1st-4th', 0.2483475788892172),
 (' 5th-6th', 0.6505348652686718),
 (' 7th-8th', 0.436010625514934),
 (' 9th', 0.3650245525836322),
 (' Assoc-acdm', -0.41834588513637316),
 (' Assoc-voc', -0.5072571222358053),
 (' Bachelors', -0.8631348152138728),
 (' Doctorate', -1.4888013459136464),
 (' HS-grad', -0.1311866293901494),
 (' Masters', -1.1136605578970946),
 (' Preschool', 0.7496490
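
A raw list of name and coefficient pairs is hard to scan. One common way to present logistic regression coefficients is to sort them and convert them to odds ratios with `exp(coef)`. The sketch below uses a handful of pairs copied from the output above rather than the fitted model, and the `coef_table` name is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# A few (name, coefficient) pairs copied from the output above.
model_coef = [
    ('age', -1.7652451466924701),
    ('capital_gain', -16.102151816605588),
    ('hours_per_week', -2.672013727060718),
    (' Without-pay', 0.7858619483725492),
    (' Doctorate', -1.4888013459136464),
]

coef_table = pd.DataFrame(model_coef, columns=['feature', 'coef'])
coef_table['feature'] = coef_table['feature'].str.strip()

# exp(coef) is the multiplicative change in the odds of class 1 per unit of the feature.
coef_table['odds_ratio'] = np.exp(coef_table['coef'])

# Most negative first: these pull hardest away from class 1.
coef_table = coef_table.sort_values('coef').reset_index(drop=True)
print(coef_table)
```

Because class 1 in this notebook is income at or below 50K, an odds ratio below 1 means the feature lowers the odds of the lower bracket. For the min-max scaled numeric features, "one unit" is the full span from the feature's minimum to its maximum.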

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

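1. What are three features positively correlated with income above 50K? capital_gain, hours_per_week, and education level. This is based on the coefficients in the previous part. Because the target was encoded with `<=50K` as 1 and `>50K` as 0, a large negative coefficient lowers the predicted probability of `<=50K`, so it is positively associated with income above 50K: capital_gain (-16.10), hours_per_week (-2.67), and education_num (-1.46), along with the Doctorate (-1.49) and Masters (-1.11) education categories.

2. What are 3 features negatively correlated with income above 50K? Under the same encoding, large positive coefficients are associated with income at or below 50K. Among the coefficients shown above, the clearest examples are the Without-pay work class (0.79), Preschool education (about 0.75), and 5th-6th grade education (0.65). Age does not belong on this list: its coefficient is -1.77, so older age goes with higher income and it is being younger that goes with income below 50K.

3. Overall, how well does the model explain the data and what insights do you derive from it? The model fits reasonably well. On the held-out test set it classifies about 86% of people correctly (6,976 of 8,141), compared with roughly 76% from always predicting `<=50K`. It is weaker on the higher-income group, where recall is only 0.60. Capital gains, the number of hours worked, and education are the strongest factors among the coefficients shown, and younger people tend to make less than 50K. Marital status and family relationship also appear to matter, although their coefficients are not among those shown above.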

1. Quantile Regression: The goal is to identify students likely to land in the bottom tier of grades, which is a question about the lower tail of the grade distribution rather than its mean. Quantile regression models a chosen quantile of the response (for example, the 10th percentile of grades) as a function of the predictors, so it shows which factors drive low performance specifically. The same predictors can also be compared across quantiles, such as the median or the top tier.

2. Survival Analysis: The outcome of interest is the time until an event occurs, here the time until a company launches a new product. Survival analysis is designed for time-to-event data and handles censoring, that is, companies that have not yet launched a product by the end of the observation window.

3. Ridge Regression: The lab captures very detailed data (many features) on only a few dozen plants, so the number of observations is small relative to the number of parameters. Ridge regression adds an L2 penalty that shrinks the linear regression coefficients and keeps the fit stable in that setting.
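
To make the ridge regression point concrete, here is a minimal sketch on synthetic data with more features than observations, loosely mirroring the "few dozen plants, very detailed measurements" situation. The data, the sizes, and the `ridge_fit` helper are all hypothetical; the helper uses the closed-form ridge solution with no intercept rather than a library estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setting: 30 observations, 100 features (more features than rows).
n_samples, n_features = 30, 100
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]  # only three features actually matter
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha*I)^-1 X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# A larger penalty shrinks the coefficient vector toward zero.
w_small = ridge_fit(X, y, alpha=0.1)
w_large = ridge_fit(X, y, alpha=100.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

With more features than rows, ordinary least squares has no unique solution because `X'X` is singular. Adding `alpha * I` makes the system solvable, and the size of `alpha` controls how strongly the coefficients are shrunk.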