# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model.
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function that one-hot encodes categorical columns) helpful!
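As a minimal sketch of the one-hot encoding hint, the block below runs `pandas.get_dummies` on a toy frame. The rows are made up for illustration; only the column names are borrowed from the Adult dataset.

```python
import pandas as pd

# Toy frame with made-up rows; column names borrowed from the Adult dataset
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

# One-hot encode only the categorical columns; numeric columns pass through
encoded = pd.get_dummies(toy, columns=["sex", "race"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so the model does not have to treat arbitrary integer codes as if they were ordered.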

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
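A small sketch of what `minmax_scale` does, using made-up numbers on two very different scales (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Made-up values: first column like an age in years, second like a dollar amount
raw = np.array([[25.0, 0.0],
                [40.0, 5000.0],
                [55.0, 10000.0]])

# Each column is rescaled independently so its minimum is 0 and its maximum is 1
scaled = minmax_scale(raw)
print(scaled)
```

After scaling, both columns span the same range, which makes the fitted coefficients easier to compare with each other.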

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.
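One way to treat missing categorical values as their own category is sketched below on a toy column. The rows are made up, and the label `"unknown"` is just an illustrative choice; the raw file marks missing values with `?`.

```python
import pandas as pd

# Made-up rows; '?' stands in for a missing categorical value
toy = pd.DataFrame({"workclass": ["Private", "?", "State-gov", "?"]})

# Relabel the marker instead of dropping rows, so it is kept as a category
toy["workclass"] = toy["workclass"].replace("?", "unknown")
print(toy["workclass"].value_counts())
```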

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [9]:
#import all the things!!!!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [15]:
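mms = MinMaxScaler()
lr = LogisticRegression()

def dummyEncode(df):
    """Label-encode every object/category column of df in place and return it.

    Note: this assigns one integer code per category; it does not create one-hot columns.
    """
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df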

In [45]:
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship',
        'race','sex','capital-gain','capital-loss','hours-per-week','native-country','50k']

50k (target): >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [55]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
# The file has no header row; skipinitialspace strips the blank after each comma
df1 = pd.read_csv(url, header=None, names=cols, skipinitialspace=True)
# Check for NaNs
df1.isna().sum()
# No NaNs, but missing values are coded as '?'; count them before encoding
(df1 == '?').sum()
# Keep '?' as its own "unknown" category and label-encode all object columns (in place)
df1 = dummyEncode(df1)
# Scale every column to [0, 1] with a min-max transform, keeping the column headers
df = pd.DataFrame(mms.fit_transform(df1), columns=cols)
# Check progress
df.head()
df.describe()
# Check shape
df.shape


(32561, 15)

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results: display coefficients in as interpretable a way as you can (hint: scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations, but the goal of your exploration is to be able to answer the questions, not to produce any particular plot (i.e. don't worry about polishing it).
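A minimal sketch of one way to present coefficients: pair them with feature names in a sorted `pandas.Series`, and check a prediction by hand with the sigmoid function. The data and the feature names `feat_a` and `feat_b` are synthetic placeholders, not the Adult dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two features already on a [0, 1] scale
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 2)), columns=["feat_a", "feat_b"])
# The label depends positively on feat_a and negatively on feat_b
y = (X["feat_a"] - X["feat_b"] + rng.normal(0, 0.2, 200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Pair each coefficient with its feature name and sort for readability
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# For a binary model, P(class 1) is the sigmoid of intercept + coefficients . x
manual = sigmoid(model.intercept_[0] + X.to_numpy() @ model.coef_[0])
matches = np.allclose(manual, model.predict_proba(X)[:, 1])
print(matches)
```

A positive coefficient raises the log-odds of the positive class as the feature increases, and a negative coefficient lowers it.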

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
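If you do validate, a sketch of the usual pattern is to fit on one split and score on the other, so the accuracy reflects rows the model has not seen. The data below is synthetic and only stands in for your prepared `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Adult dataset)
rng = np.random.default_rng(42)
X = rng.random((400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 400) > 0.75).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)  # fit on the training half only
train_acc = model.score(X_train, y_train)           # accuracy on rows used for fitting
test_acc = model.score(X_test, y_test)              # accuracy on held-out rows
print(train_acc, test_acc)
```

A large gap between the two numbers would suggest overfitting; similar values suggest the model generalizes about as well as it fits.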

In [76]:
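X1 = df.drop('50k', axis=1)
y1 = df['50k']
X_train, X_test, Y_train, Y_test = train_test_split(X1, y1, test_size=.5, random_state=42)
# Fit on the training half
X = X_train
y = Y_train
log_reg = LogisticRegression().fit(X, y)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Note: this is accuracy on the same rows the model was fit on, not held-out accuracy
print(log_reg.score(X, y))
print(df.columns.drop('50k'))
print(log_reg.coef_)
# Predicted [P(<=50K), P(>50K)] for the first row of X
log_reg.predict_proba(X)[0]
# Manual equivalent for P(>50K) of a single row x:
#sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(x)))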



0.8216830466830467
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')
[[ 2.45058752 -0.20568339  0.41509639  0.04329489  4.70335415 -1.38727443
   0.19613341 -0.53379997  0.40951506  0.85253637 14.45474339  2.4449571
   2.68615816  0.06138984]]


array([0.92881513, 0.07118487])

In [77]:
X_train, X_test, Y_train, Y_test = train_test_split(X1, y1, test_size=.5, random_state=42)
# Refit on the other half of the data to see whether the coefficients are stable
X = X_test
y = Y_test
log_reg = LogisticRegression().fit(X, y)
# Again scored on the rows used for fitting (sigmoid is already defined above)
print(log_reg.score(X, y))
print(df.columns.drop('50k'))
print(log_reg.coef_)
# Predicted [P(<=50K), P(>50K)] for the first row of X
log_reg.predict_proba(X)[0]
#sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(x)))



0.8218782630059579
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')
[[ 2.39251316 -0.17235352  0.72761037  0.30738876  4.80333261 -1.39007316
   0.08286293 -0.7406207   0.30559772  0.81779479 14.18355575  2.82862534
   2.77474398  0.06755968]]


array([0.88632774, 0.11367226])

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.
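If you want a rough sense of how stable a coefficient is without a formal hypothesis test, one option is a bootstrap: refit the model on resampled rows and look at the spread of the coefficients. This is a swapped-in technique shown on synthetic data, not something the challenge requires.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X = rng.random((n, 2))
# Only the first feature drives the label; the second is pure noise
y = (X[:, 0] + rng.normal(0, 0.25, n) > 0.5).astype(int)

boot_coefs = []
for _ in range(200):
    idx = rng.integers(0, n, n)  # resample row indices with replacement
    boot_coefs.append(LogisticRegression().fit(X[idx], y[idx]).coef_[0])
boot_coefs = np.array(boot_coefs)

# The 2.5th and 97.5th percentiles give a rough interval for each coefficient
low, high = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
print(low, high)
```

An interval that stays clearly on one side of zero suggests the sign of that coefficient is not an accident of the particular sample.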

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

# Answers:
**1. The three features that are positively correlated with having an income above 50k are as follows:**

Capital Gain: Intuitively, reporting a capital gain goes along with a higher income. It also has by far the largest coefficient in both fits (about 14).

Education Num: I believe this is a numeric value for level of education, so the higher the education, the more likely a person is to make above 50k.

Capital Loss: This is less intuitive than the previous two. I believe that in order to report a capital loss you must have more than 50k to invest capital and subsequently lose it. Lower income households are less likely to invest and therefore less likely to report a capital loss.

**2. The three features that are negatively correlated with income above 50k are as follows:**

Marital Status: Married couples are more likely to have dual incomes which, when combined, are more likely to be above 50k, while single income households, as well as households with no income, are less likely to reach the threshold.

Relationship: I am not quite sure how to interpret this. A child would obviously not be likely to contribute income and would therefore hamper the ability to reach the threshold of 50k or above.

Work Class: Intuitively, the type of work weighs heavily on the amount of income received.

**3. Overall, the model explains the data quite well:** it classifies about 82% of the rows correctly on each half of the data. I think many more insights could be derived from this data if the labels were better explained. Still, some insight can be derived from the model, such as that a dual income household is more likely to exceed the threshold of 50k, and that the class of work an individual does can impact income.
     
     
    

# Situations:

1. The best approach for situation 1 would be Quantile Regression. This method models a chosen quantile of the outcome (here, the bottom tier of grades) rather than the average, which fits the task of flagging "at-risk" students.

2. The best approach for situation 2 would be Survival Analysis. This method models the time until an event happens, here the time until the next product launch.

3. The best approach for situation 3 would be Ridge Regression. This method is best suited for data that has many features but few observations, such as extremely detailed information about each plant but only a few dozen plants. Its penalty on coefficient size limits overfitting in that setting.
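To make the ridge regression answer concrete, here is a small sketch with more features than observations. The data is synthetic, and the names `n_plants` and `n_features` and the two `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic "few plants, many measurements" setup: 30 rows, 100 features
rng = np.random.default_rng(0)
n_plants, n_features = 30, 100
X = rng.normal(size=(n_plants, n_features))
true_w = np.zeros(n_features)
true_w[:5] = 1.0  # only the first five features matter
y = X @ true_w + rng.normal(0, 0.5, n_plants)

# A larger alpha means a stronger L2 penalty on the coefficients
weak = Ridge(alpha=1.0).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```

The stronger penalty shrinks the coefficient vector, which is how ridge regression keeps a model with more features than rows from fitting noise.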
            