# Pre-Modeling: Data Preprocessing and Feature Exploration in Python

## Goals:

* Goal:
    * Pre-modeling typically accounts for about 80% of the work; modeling accounts for about 20%
    * Show how much data preprocessing, feature exploration, and feature engineering affect model performance
    * Go over a few effective pre-modeling steps
* Format:
    * Build a binary classification model that **predicts income** from an edited version of the 'adult' dataset
* Python libraries:
    * NumPy
    * pandas
    * scikit-learn
    * Matplotlib

Source of 'adult' dataset: https://archive.ics.uci.edu/ml/datasets/Adult

# Agenda
1. Modeling Overview
2. Introduce the Data
3. Basic Data Cleaning:
    * A. Dealing with data-types
    * B. Handling missing data
4. More Data Exploration:
    * A. Outlier detection
    * B. Plotting distributions
5. Feature Engineering:
    * A. Interactions between features
    * B. Dimensionality reduction using PCA
6. Feature Selection and Model Building

# Part 1: Modeling Overview
## Review of Predictive Modeling
### Definition:
   * Statistical technique to predict unknown outcomes
### Assessing model performance:
   * Randomly split observations (datapoints) into train/test sets
   * Build model on train set and assess performance on test set
   * Area under the ROC curve (AUC) is a common performance metric; the ROC curve plots the true positive rate against the false positive rate
### Types of models for binary classification:
   * Logistic regression
   * Random Forest
   * Gradient Boosted Trees
   * Support Vector Machines
   * etc.
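
The workflow above (random train/test split, fit on the train set, score on the test set with AUC) can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic, generated with scikit-learn's `make_classification`, and is not the 'adult' dataset; the names `X_demo`, `y_demo`, and `auc_scores` are hypothetical, and the model settings are arbitrary rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the 'adult' dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# Randomly split observations into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# Two of the model types listed above, with illustrative settings
models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

auc_scores = {}
for name, model in models.items():
    # Build the model on the train set only
    model.fit(X_train, y_train)
    # AUC needs a score per observation: use the predicted probability of class 1
    test_scores = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, test_scores)
    print(name, round(auc_scores[name], 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the printed values give a threshold-independent way to compare the two models on the same held-out data.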

# Part 2: Introduce the Data

In [1]:
import numpy as np
import pandas as pd

# Read the edited 'adult' data, treating '#NAME?' entries as missing values
df = pd.read_csv('adult.data', na_values=['#NAME?'])
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capitol_loss',
           'hour_per_week', 'native_country', 'income']
df.columns = columns
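
One thing worth checking when loading a file this way is whether it has a header row. By default `pd.read_csv` uses the first line as column names, so for a headerless file the first record would be consumed as the header and then overwritten by `df.columns = columns`. Whether the edited file used here has a header row is not stated, so the following is only a sketch on made-up in-memory records with a shortened, hypothetical column list (`demo_columns`). It also shows `skipinitialspace=True`, which drops the space that follows each comma in the raw UCI layout.

```python
import io

import pandas as pd

# Two made-up records with no header row and ", " separators (illustrative only)
raw = io.StringIO(
    "39, State-gov, <=50K\n"
    "50, #NAME?, >50K\n"
)

demo_columns = ['age', 'workclass', 'income']  # hypothetical, shortened column list

demo = pd.read_csv(
    raw,
    header=None,            # the first line is data, not column names
    names=demo_columns,     # supply the column names at read time
    na_values=['#NAME?'],   # treat '#NAME?' as missing
    skipinitialspace=True,  # drop the space after each delimiter
)

print(demo)
print(demo.isnull().sum())
```

With `header=None` both records survive as rows, and the `'#NAME?'` entry shows up as a missing value in `workclass`.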

In [2]:
print(df.head(5))

   age          workclass  fnlwgt   education  education_num  \
0   50   Self-emp-not-inc   83311   Bachelors             13   
1   38            Private  215646     HS-grad              9   
2   53            Private  234721        11th              7   
3   28            Private  338409   Bachelors             13   
4   37            Private  284582     Masters             14   

        marital_status          occupation    relationship    race      sex  \
0   Married-civ-spouse     Exec-managerial         Husband   White     Male   
1             Divorced   Handlers-cleaners   Not-in-family   White     Male   
2   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
3   Married-civ-spouse      Prof-specialty            Wife   Black   Female   
4   Married-civ-spouse     Exec-managerial            Wife   White   Female   

   capital_gain  capitol_loss  hour_per_week  native_country  income  
0             0             0             13   United-States   <=50K 

In [3]:
#Take a look at the outcome variable: 'income'
print(df.columns)
print(df['income'].value_counts())

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capitol_loss', 'hour_per_week', 'native_country',
       'income'],
      dtype='object')
 <=50K    24719
 >50K      7841
Name: income, dtype: int64
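
The counts above show that the two classes are imbalanced. A short sketch that turns the printed counts into proportions (the counts are copied from the output above; `counts`, `proportions`, and `majority_baseline` are illustrative names):

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({'<=50K': 24719, '>50K': 7841}, name='income')

# Share of each class among all observations
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()
print(round(majority_baseline, 3))
```

Roughly a quarter of the observations are in the '>50K' class, so a model that always predicts '<=50K' would already be right about three times out of four. This is one reason a ranking metric such as AUC is more informative here than raw accuracy. On the real DataFrame, `df['income'].value_counts(normalize=True)` gives the same proportions directly.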


In [16]:
# Assign outcome as 0 if income <= 50K and as 1 if income > 50K
# (strip whitespace first: the raw labels may carry a leading space, e.g. ' >50K')
df['income'] = [1 if x.strip() == '>50K' else 0 for x in df['income']]

# Assign X as a DataFrame of features and y as a Series of the outcome variable
X = df.drop('income', axis=1)
y = df.income

In [17]:
print(X.head())


   age          workclass  fnlwgt   education  education_num  \
0   50   Self-emp-not-inc   83311   Bachelors             13   
1   38            Private  215646     HS-grad              9   
2   53            Private  234721        11th              7   
3   28            Private  338409   Bachelors             13   
4   37            Private  284582     Masters             14   

        marital_status          occupation    relationship    race      sex  \
0   Married-civ-spouse     Exec-managerial         Husband   White     Male   
1             Divorced   Handlers-cleaners   Not-in-family   White     Male   
2   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
3   Married-civ-spouse      Prof-specialty            Wife   Black   Female   
4   Married-civ-spouse     Exec-managerial            Wife   White   Female   

   capital_gain  capitol_loss  hour_per_week  native_country  
0             0             0             13   United-States  
1             

In [18]:
print(y.head())

0    0
1    0
2    0
3    0
4    0
Name: income, dtype: int64
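
The first five rows are all 0, which cannot confirm that the encoding produced any 1s at all. The `value_counts()` output earlier prints the labels with a leading space, which suggests the raw strings may contain whitespace; if so, an exact comparison against `'>50K'` never matches. The sketch below illustrates the difference on toy labels (an assumption: they mimic raw values such as `' >50K'` and are not read from the file).

```python
import pandas as pd

# Toy labels with a leading space (assumption: mimics the raw values)
labels = pd.Series([' <=50K', ' >50K', ' >50K'], name='income')

# Exact comparison: the leading space prevents any match
exact = [1 if x == '>50K' else 0 for x in labels]

# Comparison after stripping surrounding whitespace
stripped = [1 if x.strip() == '>50K' else 0 for x in labels]

print(exact)     # [0, 0, 0]
print(stripped)  # [0, 1, 1]
```

On the real data, `y.value_counts()` is a quick check that both classes are present after encoding.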
