
# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function that one-hot encodes categorical columns) helpful!
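As a rough illustration of the one-hot encoding hint, the sketch below applies `pandas.get_dummies` to a tiny made-up frame. The frame `toy` and its values are hypothetical; only the column names are borrowed from the Adult dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the Adult dataset.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["workclass"])
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the model does not read an artificial ordering into the category labels.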

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
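A minimal sketch of the scaling hint, assuming a small made-up array in which the first column plays the role of age and the second of capital-gain. The array `numeric` is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical values on very different scales (age vs. capital-gain).
numeric = np.array([[39.0, 2174.0],
                    [50.0, 0.0],
                    [38.0, 0.0]])

# Each column is rescaled independently to the range [0, 1].
scaled = minmax_scale(numeric)
print(scaled)
```

After scaling, a one-unit change means "from the column minimum to the column maximum" for every feature, which puts the fitted coefficients on a more comparable footing.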

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.
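One possible way to treat missing categorical values as their own category is sketched below. The frame `toy` is hypothetical, and the label `"Unknown"` is an arbitrary choice. The sketch assumes missing entries appear as `?` with a leading space, as they do when the raw file is read with default settings.

```python
import pandas as pd

# Toy rows with hypothetical values; the raw file marks missing entries as ' ?'.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
})

# Strip padding, then treat '?' as its own category instead of dropping rows.
cleaned = toy.apply(lambda col: col.str.strip()).replace("?", "Unknown")
print(cleaned)
```

This keeps every row, at the cost of one extra category per affected column.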

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [52]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num','marital-status',
                'occupation', 'relationship', 'race', 'gender','capital-gain', 'capital-loss',
                'hours-per-week', 'native_country','income']
df_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, names=column_names)
print(df_data.shape)
df_test = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                      header=None, names=column_names, skiprows=1)

pd.set_option('display.max_columns', 24)
pd.set_option('display.width', 222)

print(df_test.shape)
print(df_data.head())
print(df_test.head())


(32561, 15)
(16281, 15)
   age          workclass  fnlwgt   education  educational-num       marital-status          occupation    relationship    race   gender  capital-gain  capital-loss  hours-per-week  native_country  income
0   39          State-gov   77516   Bachelors               13        Never-married        Adm-clerical   Not-in-family   White     Male          2174             0              40   United-States   <=50K
1   50   Self-emp-not-inc   83311   Bachelors               13   Married-civ-spouse     Exec-managerial         Husband   White     Male             0             0              13   United-States   <=50K
2   38            Private  215646     HS-grad                9             Divorced   Handlers-cleaners   Not-in-family   White     Male             0             0              40   United-States   <=50K
3   53            Private  234721        11th                7   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male             0      

In [53]:
# check out income
print(df_test.income.value_counts())
# strip the trailing '.' from the test-set labels so they match the training labels
df_test['income'] = df_test['income'].str.replace('.', '', regex=False)
print(df_test.income.value_counts())

df = pd.concat([df_data, df_test])
print(df.shape)
print(df.income.value_counts())
df = df.reset_index(drop = True)


 <=50K.    12435
 >50K.      3846
Name: income, dtype: int64
 <=50K    12435
 >50K      3846
Name: income, dtype: int64
(48842, 15)
 <=50K    37155
 >50K     11687
Name: income, dtype: int64


In [54]:
# mark the ' ?' placeholders as missing values
df = df.replace(' ?', np.nan)
print(df.isnull().sum())
print(df.info())


age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native_country      857
income                0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          46043 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         46033 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-nu

In [55]:
# drop rows that still have missing values
df.dropna(inplace = True)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45222 entries, 0 to 48841
Data columns (total 15 columns):
age                45222 non-null int64
workclass          45222 non-null object
fnlwgt             45222 non-null int64
education          45222 non-null object
educational-num    45222 non-null int64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
gender             45222 non-null object
capital-gain       45222 non-null int64
capital-loss       45222 non-null int64
hours-per-week     45222 non-null int64
native_country     45222 non-null object
income             45222 non-null object
dtypes: int64(6), object(9)
memory usage: 5.5+ MB
None


In [56]:
# encode income as 0 (<=50K) / 1 (>50K)
df['income'] = df['income'].map({' <=50K': 0, ' >50K': 1})
print(df.info())

print(df.income.value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45222 entries, 0 to 48841
Data columns (total 15 columns):
age                45222 non-null int64
workclass          45222 non-null object
fnlwgt             45222 non-null int64
education          45222 non-null object
educational-num    45222 non-null int64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
gender             45222 non-null object
capital-gain       45222 non-null int64
capital-loss       45222 non-null int64
hours-per-week     45222 non-null int64
native_country     45222 non-null object
income             45222 non-null int64
dtypes: int64(7), object(8)
memory usage: 5.5+ MB
None
0    34014
1    11208
Name: income, dtype: int64


In [57]:
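# label-encode the remaining object (categorical) columns
# note: integer codes impose an arbitrary order on categories; one-hot encoding avoids this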
object_columns = list(df.select_dtypes(include=['object']))
df[object_columns] = df[object_columns].apply(LabelEncoder().fit_transform)
print(df.info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45222 entries, 0 to 48841
Data columns (total 15 columns):
age                45222 non-null int64
workclass          45222 non-null int64
fnlwgt             45222 non-null int64
education          45222 non-null int64
educational-num    45222 non-null int64
marital-status     45222 non-null int64
occupation         45222 non-null int64
relationship       45222 non-null int64
race               45222 non-null int64
gender             45222 non-null int64
capital-gain       45222 non-null int64
capital-loss       45222 non-null int64
hours-per-week     45222 non-null int64
native_country     45222 non-null int64
income             45222 non-null int64
dtypes: int64(15)
memory usage: 5.5 MB
None


###==>> for the record, let me state that this data is a pain!

It took a bit of time to get it figured out.

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).
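To illustrate why scaling helps with interpretation, here is a minimal sketch on synthetic data (not the Adult dataset). The names `toy_X`, `toy_y` and `coef_table` are hypothetical. `StandardScaler` is used here as one possible scaling choice; `minmax_scale` would serve the same purpose.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    "hours": rng.normal(40, 10, n),
    "capital": rng.normal(5000, 2000, n),
})
# Both features contribute equally once measured in standard deviations.
signal = (toy_X["hours"] - 40) / 10 + (toy_X["capital"] - 5000) / 2000
toy_y = (signal + rng.normal(0, 1, n) > 0).astype(int)

# Fit on standardized features so coefficients are per standard deviation.
scaled_X = StandardScaler().fit_transform(toy_X)
toy_model = LogisticRegression().fit(scaled_X, toy_y)

coef_table = pd.DataFrame({
    "feature": toy_X.columns,
    "coef": toy_model.coef_[0],
    "odds_ratio": np.exp(toy_model.coef_[0]),
}).sort_values("coef", ascending=False)
print(coef_table)
```

On scaled features, each coefficient is the change in log-odds per standard deviation of that feature, and `exp(coef)` reads as an odds ratio, so rows of the table can be compared directly.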

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
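If there is time for the optional validation step, a hold-out split might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in; in the notebook the prepared `X` and `y` would take the place of the hypothetical `demo_X` and `demo_y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook this would be the prepared X and y.
demo_X, demo_y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 25% of rows, stratified so both splits keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, stratify=demo_y, random_state=0
)

holdout_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = holdout_model.score(X_train, y_train)
test_acc = holdout_model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two accuracies would suggest overfitting; similar values suggest the training score is a reasonable estimate.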

In [60]:
y = df['income']
X = df.drop(columns='income')

model = LogisticRegression(multi_class='ovr',
                           solver='liblinear',
                           max_iter=1000).fit(X, y)

print("Logistic Regression Model score:", model.score(X, y))

# correlation of each feature with income (label-encoded categories have an arbitrary order)
corr_matrix = df.corr().sort_values('income', ascending=False)
df_corr = pd.DataFrame(corr_matrix.income)
print("\nCorrelation with income")
print(df_corr[1:])  # skip the first row: income's correlation with itself

coeffs = {'col': X.columns.tolist(),
          'coef': model.coef_[0]}
coeffs = pd.DataFrame(coeffs)

print("\nLogistic Regression Coefficients")
print(coeffs.sort_values('coef'))


Logistic Regression Model score: 0.7924019282650038

Correlation with income
                   income
educational-num  0.332800
age              0.237040
hours-per-week   0.227199
capital-gain     0.221034
gender           0.215760
capital-loss     0.148687
education        0.081196
race             0.070844
occupation       0.049787
native_country   0.020103
workclass        0.015659
fnlwgt          -0.007264
marital-status  -0.192711
relationship    -0.253402

Logistic Regression Coefficients
        coef              col
13 -0.043008   native_country
7  -0.012338     relationship
5  -0.011107   marital-status
3  -0.006874        education
8  -0.003890             race
6  -0.003653       occupation
1  -0.003181        workclass
2  -0.000002           fnlwgt
10  0.000325     capital-gain
11  0.000746     capital-loss
9   0.001620           gender
4   0.005537  educational-num
12  0.007589   hours-per-week
0   0.011663              age


## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Answers:
1. Per the immediately preceding Logistic Regression Coefficients table, **age**, **hours-per-week** and **educational-num** are positively associated with income above 50K, with coefficients of 0.011663, 0.007589 and 0.005537 respectively.

---
2. Looking at the same table, **native_country** (-0.043008), **relationship** (-0.012338) and **marital-status** (-0.011107) are negatively associated with income above 50K.

---
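3. The model returned a score of about 0.792, which is its accuracy on the same data it was fit to. That is a modest result rather than a strong one: always predicting the majority class (<=50K) would already score 34014/45222 ≈ 0.752. The features were also left unscaled and the categorical ones label-encoded, so the coefficient magnitudes are not directly comparable. The main insight is the direction of the effects, with age, hours worked and years of education all pointing toward higher income.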

---

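####==>> These largely agree with the correlations with income in the earlier table: the same three features have the strongest positive correlations, and relationship and marital-status the most negative. The exception is native_country, whose correlation is slightly positive (0.020103) even though it has the most negative coefficient.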
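One way to put the model score in context is to compare it with the accuracy of always predicting the majority class. A minimal sketch, using the class counts and score printed earlier in the notebook (the variable names are illustrative):

```python
# Class counts and model score copied from the notebook output above.
n_low, n_high = 34014, 11208
model_accuracy = 0.7924019282650038

# Accuracy of always predicting the majority class (<=50K).
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)
print(f"baseline: {baseline_accuracy:.4f}, model: {model_accuracy:.4f}, "
      f"gain: {model_accuracy - baseline_accuracy:.4f}")
```

The gap between the two numbers is the part of the accuracy that the features actually contribute.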



### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

### Answers:
1. I would use **Quantile Regression** here. Fitting a low conditional quantile, rather than the mean, targets the bottom tier of grades directly and helps identify the poor performers. I could then perhaps build a more effective model focused on this group and its outliers.

---
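2. Here I believe one should employ **Survival Analysis**. A product launch is a time-to-event outcome: the question is how long it takes until the next launch, and companies that have not launched yet are still informative (censored) observations, which survival analysis is designed to handle.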

---
3. This looks like an excellent candidate for **Ridge Regression**. With a small sample size and a large number of features available, there's a great risk of overfitting; ridge's L2 penalty shrinks the coefficients and reduces that risk.
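The ridge answer can be illustrated with a small sketch on synthetic data (not real plant measurements). The shapes (30 rows, 200 features), the names `plants_X` and `plant_yield`, and the value `alpha=10.0` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in for the lab data: 30 plants, 200 measured features.
rng = np.random.default_rng(0)
plants_X = rng.normal(size=(30, 200))
true_coef = np.zeros(200)
true_coef[:5] = 1.0  # only a few features actually matter
plant_yield = plants_X @ true_coef + rng.normal(scale=0.5, size=30)

# With more features than rows, ordinary least squares can fit the training data exactly.
ols = LinearRegression().fit(plants_X, plant_yield)
# Ridge adds an L2 penalty (alpha is an illustrative value, not a tuned one).
ridge = Ridge(alpha=10.0).fit(plants_X, plant_yield)

print("OLS training R^2:      ", ols.score(plants_X, plant_yield))
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

The unpenalized fit reproduces the training data perfectly, which is a sign of overfitting when rows are this scarce. The ridge fit trades some training accuracy for smaller, more stable coefficients.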

