# Sprint Challenge

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale, minmax_scale
# personal library
# !pip install --upgrade git+https://github.com/chrisluedtke/clued.git
from clued.get_data import get_uci_data_urls

plt.style.use('seaborn')

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

It has a variety of features - some are continuous, but many are categorical.

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
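As a minimal sketch of what `minmax_scale` does, the toy array below (hypothetical values, not rows from the dataset) holds two columns with very different ranges. Each column is rescaled independently to the interval [0, 1].

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical toy columns: an age in years and a capital gain in dollars
toy = np.array([[17.0, 0.0],
                [40.0, 5000.0],
                [90.0, 99999.0]])

# Each column is scaled on its own: (x - column_min) / (column_max - column_min)
scaled = minmax_scale(toy)
print(scaled)
```

After scaling, both columns share the same range, so the magnitudes of their fitted coefficients are easier to compare.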

## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.
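The two preparation steps described above can be sketched on a tiny frame. The rows here are made up for illustration; only the `'?'` placeholder and the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical three-row frame with one missing categorical value ('?')
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', '?', 'Private'],
})

# Treat the missing marker as its own category instead of dropping the row
toy = toy.replace({'?': 'unknown'})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['workclass'])
print(encoded.columns.tolist())
```

Passing `drop_first=True` to `pd.get_dummies` would drop one level per feature, which avoids perfectly collinear dummy columns.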

In [4]:
data_urls = get_uci_data_urls('https://archive.ics.uci.edu/ml/datasets/adult')

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names


In [5]:
data_dict = {
    'age': 'continuous',
    'workclass': ('Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, '
                  'Local-gov, State-gov, Without-pay, Never-worked'),
    'fnlwgt': 'continuous',
    'education': ('Bachelors, Some-college, 11th, HS-grad, Prof-school, '
                  'Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, '
                  '1st-4th, 10th, Doctorate, 5th-6th, Preschool'),
    'education-num': 'continuous',
    'marital-status': ('Married-civ-spouse, Divorced, Never-married, '
                       'Separated, Widowed, Married-spouse-absent, '
                       'Married-AF-spouse'),
    'occupation': ('Tech-support, Craft-repair, Other-service, Sales, '
                   'Exec-managerial, Prof-specialty, Handlers-cleaners, '
                   'Machine-op-inspct, Adm-clerical, Farming-fishing, '
                   'Transport-moving, Priv-house-serv, Protective-serv, '
                   'Armed-Forces'),
    'relationship': ('Wife, Own-child, Husband, Not-in-family, '
                     'Other-relative, Unmarried'),
    'race': 'White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black',
    'sex': 'Female, Male',
    'capital-gain': 'continuous',
    'capital-loss': 'continuous',
    'hours-per-week': 'continuous',
    'native-country': ('United-States, Cambodia, England, Puerto-Rico, '
                       'Canada, Germany, Outlying-US(Guam-USVI-etc), India, '
                       'Japan, Greece, South, China, Cuba, Iran, Honduras, '
                       'Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, '
                       'Portugal, Ireland, France, Dominican-Republic, Laos, '
                       'Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, '
                       'Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, '
                       'Trinadad&Tobago, Peru, Hong, Holand-Netherlands'),
    'income': '>50K, <=50K',
}

In [6]:
df = pd.read_csv(data_urls[0], header=None, names=data_dict.keys())

In [7]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [8]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [10]:
for col in df.select_dtypes('object'):
    df[col] = df[col].str.strip()

for col in df.select_dtypes('object'):
    print(col, set(df[col].unique()) - set(data_dict[col].split(', ')))

workclass {'?'}
education set()
marital-status set()
occupation {'?'}
relationship set()
race set()
sex set()
native-country {'?'}
income set()


In [11]:
df = df.replace({'?':'unknown'})

In [12]:
# scale each numeric column to [0, 1] so coefficients are more comparable
num_cols = df.select_dtypes('number').columns
df[num_cols] = minmax_scale(df[num_cols])

In [13]:
target = 'income'
df[target] = df[target].map({'<=50K':0, '>50K':1})

In [14]:
df = pd.get_dummies(df, drop_first=True)

## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal is for your exploration to answer the question, not to produce any particular plot (i.e. don't worry about polishing it).
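One common way to make logistic regression coefficients easier to read is to exponentiate them into odds ratios. The sketch below uses synthetic data and hypothetical feature names (`feat_a`, `feat_b`), not the income dataset, so it only illustrates the transformation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two features already on a [0, 1] scale
X_toy = rng.uniform(size=(500, 2))

# The outcome depends positively on the first feature, negatively on the second
logits = 3 * X_toy[:, 0] - 2 * X_toy[:, 1]
y_toy = (rng.uniform(size=500) < 1 / (1 + np.exp(-logits))).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# exp(coefficient) is the multiplicative change in the odds of y == 1
# when the feature moves across its full scaled range (0 to 1)
odds_ratios = pd.Series(np.exp(toy_model.coef_[0]), index=['feat_a', 'feat_b'])
print(odds_ratios)
```

An odds ratio above 1 means the feature raises the odds of the positive class, and one below 1 means it lowers them.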

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.

In [15]:
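# use a list (not a set): it keeps column order stable, and recent pandas
# versions reject a set as a DataFrame indexer
predictors = [col for col in df.columns if col != target]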

X = df[predictors].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.8530616043178857


In [16]:
coef_dict = {attr: coef for attr, coef in zip(predictors, model.coef_[0])}
coefs = pd.Series(coef_dict, name='Coefficients').sort_values()
coefs

native-country_Columbia                      -1.565022
occupation_Priv-house-serv                   -1.291779
workclass_Without-pay                        -1.137950
relationship_Own-child                       -1.126971
native-country_Vietnam                       -1.056270
workclass_Self-emp-not-inc                   -1.010868
native-country_China                         -0.998063
occupation_Farming-fishing                   -0.975103
native-country_Dominican-Republic            -0.853772
native-country_Puerto-Rico                   -0.844399
workclass_State-gov                          -0.837062
native-country_Greece                        -0.804934
native-country_South                         -0.796763
native-country_Outlying-US(Guam-USVI-etc)    -0.793093
occupation_Other-service                     -0.783761
education_5th-6th                            -0.774621
occupation_unknown                           -0.755651
workclass_Local-gov                          -0.734488
...

We should also check the residual plots, but that's beyond the scope of this sprint.

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

**3 features positively correlated with income above 50k?**

In [17]:
coefs[-3:]

capital-loss       2.237756
hours-per-week     2.656385
capital-gain      15.435053
Name: Coefficients, dtype: float64

**3 features negatively correlated with income above 50k?**

In [18]:
coefs[:3]

native-country_Columbia      -1.565022
occupation_Priv-house-serv   -1.291779
workclass_Without-pay        -1.137950
Name: Coefficients, dtype: float64

**How well does the model explain the data and what insights do you derive from it?**

In [19]:
print(model.score(X_test, y_test))

0.8530616043178857


My model is about 85% accurate on the held-out test set. While some coefficients were unsurprising (`capital-gain`), it was interesting to see how the coefficients for different countries compared. Of course, in analyses like this, we should be careful not to draw causal inferences. However, these insights might inform more rigorous scientific study and experimentation.
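An accuracy figure is easier to judge next to a majority-class baseline. The sketch below uses a made-up class balance (75% negative), which is an assumption for illustration and not a measurement of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 75 negatives and 25 positives
y_fake = np.array([0] * 75 + [1] * 25)
X_fake = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_fake, y_fake)
baseline_acc = baseline.score(X_fake, y_fake)
print(baseline_acc)
```

Running the same check on `y_train` and `y_test` would show how much of the model's accuracy comes from class imbalance alone.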

In [20]:
# help(model.score) # this is simply sklearn.metrics.accuracy_score
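The equivalence noted in the comment above can be checked directly. This is a small sketch on synthetic data with hypothetical variable names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

# Synthetic binary problem driven mostly by the first feature
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# For classifiers, .score() returns the mean accuracy of .predict()
score = demo_model.score(X_demo, y_demo)
manual = accuracy_score(y_demo, demo_model.predict(X_demo))
print(score, manual)
```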

### Match the following situation descriptions with the model most appropriate to addressing them

Pair each situation with the regression model most appropriate for it, and briefly explain why.

1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
  * **Quantile Regression**. We want a model fit to a specific quantile of the target (here, the bottom tier of grades) rather than to its mean.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
  * **Survival Analysis**. The question is about time until an event, and the data are censored: for products not yet released, we don't know when the release will happen.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.
  * **Ridge Regression**. This situation likely involves more features than observations. Ridge regression adds an L2 penalty that accepts a little bias in exchange for lower variance, which helps mitigate the chance of overfitting.
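The ridge case can be illustrated with a small sketch on synthetic data. The sizes (30 plants, 200 measurements) and the penalty `alpha=10.0` are arbitrary assumptions chosen only to put the example in the more-features-than-observations regime.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Hypothetical lab setting: few plants, many measured features
n_plants, n_features = 30, 200
X_small = rng.normal(size=(n_plants, n_features))

# Only the first five features truly affect the (synthetic) yield
true_coef = np.zeros(n_features)
true_coef[:5] = 1.0
y_small = X_small @ true_coef + rng.normal(scale=0.5, size=n_plants)

ols = LinearRegression().fit(X_small, y_small)
ridge = Ridge(alpha=10.0).fit(X_small, y_small)

# Ordinary least squares fits the training data essentially perfectly here,
# which is a symptom of overfitting rather than a sign of a good model
print(ols.score(X_small, y_small))

# The L2 penalty shrinks the coefficient vector toward zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

In practice `alpha` would be chosen by cross-validation, for example with `sklearn.linear_model.RidgeCV`.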