
# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function that one-hot encodes categorical columns) helpful!
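As a minimal sketch of what one-hot encoding produces, the toy frame below uses made-up rows (not taken from the dataset). The `columns=` argument limits encoding to the listed columns, and `dtype=int` requests 0/1 integers instead of booleans.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from the Adult dataset.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": ["State-gov", "Private", "Private"],
})

# One-hot encode only the categorical column; numeric columns pass through.
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)
print(encoded)
```

Each category becomes its own indicator column, named with the original column as a prefix.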

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
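A short sketch of min-max scaling, using illustrative values rather than real rows: each column is rescaled independently so that its minimum maps to 0 and its maximum to 1.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Toy columns on very different scales (values are illustrative only).
X = np.array([
    [17.0, 0.0],
    [37.0, 2174.0],
    [90.0, 99999.0],
])

# Each column is rescaled independently to the range [0, 1].
X_scaled = minmax_scale(X)
print(X_scaled)
```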

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.
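One possible way to treat missing values as their own category is sketched below on a toy column. It assumes the raw strings carry a leading space and mark missing entries with `?`; the label `"unknown"` is just an illustrative choice.

```python
import pandas as pd

# Toy rows for illustration; assumes missing values appear as '?' and that
# string fields carry a leading space, as in the raw file.
toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"]})

# Strip the whitespace, then relabel '?' as its own category.
toy["workclass"] = toy["workclass"].str.strip().replace("?", "unknown")
print(toy["workclass"].value_counts())
```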

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [0]:
# TODO - your work!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


url_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
url_test = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'


In [156]:
col_names = ('age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
            'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
            'hours-per-week', 'native-country', 'income')
df_data = pd.read_csv(url_data, header=None, names=col_names)
df_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [157]:
df_data.income.value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

In [158]:
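# Assign the result back: inplace=True on a .loc selection may act on a
# temporary copy in newer pandas versions and leave df_data unchanged.
df_data['income'] = df_data['income'].replace(regex=True, to_replace="<=", value="under_")
df_data['income'] = df_data['income'].replace(regex=True, to_replace=">", value="over_")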
df_data.income.value_counts()

 under_50K    24720
 over_50K      7841
Name: income, dtype: int64

In [159]:
df_data.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [160]:
df_data.shape

(32561, 15)

In [161]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [0]:
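# Values carry a leading space (' ?'); the result is displayed, not assigned back.
df_data.replace(r'^\s*\?\s*$', np.nan, regex=True)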

In [0]:
df_data.isna().sum()

In [164]:
df_data.sex.value_counts()

 Male      21790
 Female    10771
Name: sex, dtype: int64

In [0]:
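# Encode sex from categorical to numeric.
# Left commented out: the raw values carry a leading space (' Male'), so the
# original comparison with "Male" never matched; stripping whitespace fixes it.
#df_data['sex'] = [1 if each.strip() == "Male" else 0 for each in df_data['sex']]
#df_data.head()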

In [0]:
# more one hot tries
one_hot_sex = pd.get_dummies(df_data.sex)
one_hot_sex



In [167]:
# Drop column 'sex' as it is now encoded
#df_data = df_data.drop('sex',axis = 1)
# Join the encoded df
df_data = df_data.join(one_hot_sex)
df_data.head() 


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,Female,Male
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,under_50K,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,under_50K,0,1
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,under_50K,0,1
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,under_50K,0,1
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,under_50K,1,0


In [0]:
df_data.head()

In [0]:
df_data.describe()

In [0]:
# more one hot tries
one_hot_income2 = pd.get_dummies(df_data.income)
one_hot_income2

In [189]:
df_data = df_data.join(one_hot_income2)
df_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,Female,Male,over_50K,under_50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,under_50K,0,1,0,1
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,under_50K,0,1,0,1
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,under_50K,0,1,0,1
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,under_50K,0,1,0,1
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,under_50K,1,0,0,1


In [193]:
df_data.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,Female,Male,over_50K,under_50K
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456,0.330795,0.669205,0.24081,0.75919
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429,0.470506,0.470506,0.427581,0.427581
min,17.0,12285.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0,0.0,0.0,0.0,1.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0,0.0,1.0,0.0,1.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0,1.0,1.0,0.0,1.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0,1.0,1.0,1.0,1.0


In [191]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 19 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
income            32561 non-null object
 Female           32561 non-null uint8
 Male             32561 non-null uint8
 over_50K         32561 non-null uint8
 under_50K        32561 non-null uint8
dtypes: int64(6), object(9), uint8(4)
memory usage: 3.9+ MB


## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).
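One way to make coefficients easier to read is to scale the features and then label each coefficient with its feature name. The sketch below uses synthetic data with a made-up target; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import minmax_scale

# Synthetic stand-in data; the column names mirror the dataset, the values do not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(17, 90, size=500),
    "hours-per-week": rng.integers(1, 99, size=500),
})
# Hypothetical target: older people working longer hours are more often positive.
y = ((X["age"] + X["hours-per-week"] + rng.normal(0, 20, size=500)) > 100).astype(int)

# Scale features to [0, 1] so coefficient magnitudes are comparable.
X_scaled = pd.DataFrame(minmax_scale(X), columns=X.columns)
model = LogisticRegression().fit(X_scaled, y)

# Label each coefficient with its feature and sort by size.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values(ascending=False)
print(coefs)
```

After scaling, each coefficient describes the change in log-odds when a feature moves from its minimum to its maximum, which puts features with very different units on a common footing.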

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
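For the optional validation step, a minimal sketch with `train_test_split` might look like the following. The data here is synthetic, and the split size and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Hold out 25% of rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```

Scoring on held-out rows gives a less optimistic estimate than scoring on the same rows the model was fit to.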

In [196]:
# TODO - your work!
from sklearn.linear_model import LinearRegression

X = df_data[['age', 'education-num', 'hours-per-week', ' Female', ' Male']]
y = df_data[' over_50K']

#X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
#X_scaled = X_std * (max - min) + min

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

0.21519900047413665

In [0]:
linear_reg.predict(df_data[['age', 'education-num', 'hours-per-week', ' Female', ' Male']])

In [199]:
linear_reg.coef_

array([ 0.00624225,  0.05088328,  0.00460692, -0.07453019,  0.07453019])
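The ` Female` and ` Male` coefficients above are exact opposites, which is what one would expect when two indicator columns always sum to 1 and therefore carry the same information. A common alternative, sketched here on toy rows, is to keep only one of the two columns with `drop_first=True`.

```python
import pandas as pd

# Toy rows for illustration; the leading spaces mimic the raw values.
toy = pd.DataFrame({"sex": [" Male", " Female", " Male"]})

# With both indicator columns, every row sums to 1, so one column is redundant.
both = pd.get_dummies(toy["sex"], dtype=int)
assert (both.sum(axis=1) == 1).all()

# drop_first=True keeps a single column; the dropped category is the baseline.
single = pd.get_dummies(toy["sex"], drop_first=True, dtype=int)
print(single)
```

With a single indicator, its coefficient reads directly as the difference between that category and the baseline category.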

In [200]:
# 47-year-old male working 40 hours a week, with education-num 16
# (the maximum in this dataset; Bachelors corresponds to 13)
test_case = np.array([[47, 16, 40, 0, 1]])
linear_reg.predict(test_case)

array([0.64184651])

In [201]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.8026473388409446
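To put an accuracy of about 0.80 in context, it can help to compare it with the accuracy of always predicting the majority class. The sketch below reuses the class counts printed by `value_counts()` earlier in the notebook.

```python
# Class counts copied from the income value_counts() output earlier in the notebook.
n_under, n_over = 24720, 7841

# Accuracy of a model that always predicts the majority class (under 50K).
baseline_accuracy = n_under / (n_under + n_over)
print(round(baseline_accuracy, 4))
```

The baseline comes out at roughly 0.76, so the fitted model improves on it by about four percentage points.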

In [0]:
import sys
np.set_printoptions(threshold=sys.maxsize)  # threshold=np.nan is rejected by newer NumPy
log_reg.predict(df_data[['age', 'education-num', 'hours-per-week', ' Female', ' Male']])

In [207]:
log_reg.predict(test_case)[0]

1

In [208]:
log_reg.predict_proba(test_case)[0]

array([0.22058331, 0.77941669])

In [209]:
# What's the math?
log_reg.coef_

array([[ 0.04513222,  0.35197304,  0.03516167, -3.40350045, -2.24673946]])

In [210]:
log_reg.intercept_

array([-5.65023991])

In [0]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

In [212]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.77941669]])
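The same calculation can be written out step by step. This sketch hard-codes the coefficients, intercept, and test case printed above, so it runs without the fitted model; the variable names `z` and `p` are illustrative.

```python
import numpy as np

# Values copied from log_reg.coef_, log_reg.intercept_ and test_case above.
coef = np.array([0.04513222, 0.35197304, 0.03516167, -3.40350045, -2.24673946])
intercept = -5.65023991
test_case = np.array([47, 16, 40, 0, 1])

# Linear predictor (log-odds), then the sigmoid to turn it into a probability.
z = intercept + coef @ test_case
p = 1 / (1 + np.exp(-z))
print(z, p)
```

The log-odds come out at about 1.26, and the sigmoid maps that to the probability of about 0.779 reported by `predict_proba`.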

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

**What are 3 features positively correlated with income above 50k?**

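Features: 'age', 'education-num', 'hours-per-week', ' Female', ' Male'

Linear regression coefficients: 0.00624225, 0.05088328, 0.00460692, -0.07453019, 0.07453019

Based on these coefficients, age, years of education, hours worked per week, and being male are all positively associated with earning an income over 50K. The logistic regression agrees: its coefficients for age, education-num, and hours-per-week are positive (0.045, 0.352, 0.035), and although the ` Male` coefficient is itself negative (-2.25), it is larger than the ` Female` coefficient (-3.40), so being male raises the predicted probability relative to being female.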




**What are 3 features negatively correlated with income above 50k?**
So far I have found only one feature that is negatively associated with earning over 50K: being female.



**Overall, how well does the model explain the data and what insights do you derive from it?**

Given the low linear regression score (R² of 0.21519900047413665), this model does not appear to be very good at explaining the data. The logistic regression reaches an accuracy of 0.803 on the same features. I would need to add more features, and perhaps engineer some additional ones, to predict income from the data more accurately. I used a limited set of features because I was running out of time, but I expect that some of the other features would be more informative about a worker's income. I would have liked to add workclass, education, occupation, and native-country, as I believe these all play a central role in salary earned. I ran out of time before I could get those features munged.





**Situations**:

1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.

Quantile Regression - Quantile regression models a chosen quantile of the outcome rather than its mean, so it can focus directly on the bottom tier of grades, where the at-risk students are.

2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.

Survival Analysis - The question is when an event (a product launch) will occur, and survival analysis models time-to-event data, including companies that have not yet launched anything by the end of the observation period (censored cases).

3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Ridge Regression - With very detailed measurements but only a few dozen plants, there are many features relative to the number of observations. Ridge regression's penalty shrinks the coefficients, which reduces overfitting in that setting.

