<a href="https://colab.research.google.com/github/btrentini/Appeatit/blob/master/US_Census.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# US Census

## Task Summary

- Extract and Clean dataset from http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip
- Perform some EDA and Feature Engineering
- Given a resident's profile, predict whether their salary is greater than or equal to $50,000 per year
- Test different models and validate on test set



## Task Info

>The following link lets you download an archive containing an “exercise” US Census dataset: http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip
This US Census dataset contains detailed but anonymized information for approximately 300,000 people.

>The archive contains 3 files:
> * A large training file (csv)
> * Another test file (csv)
> * A metadata file (txt) describing the columns of the two csv files (identical for both)

> **The goal** of this exercise is to model the information contained in the last column (42nd), i.e., whether a person makes more or less than $50,000 per year, from the information contained in the other columns. The exercise here consists of modeling a binary variable.

> Work with Python (or R) to carry out the following steps:
> * Load the train and test files.
> * Perform an exploratory analysis on the data and create some relevant visualisations.
> * Clean, preprocess, and engineer features in the training data, with the aim of building a data set that a model will perform well on.
> * Create a model using these features to predict whether a person earns more or less than $50,000 per year. Here, the idea is for you to test a few different models, and see whether there are any techniques you can apply to improve performance over your first results.
> * Choose the model that appears to have the highest performance based on a comparison between reality (the 42nd variable) and the model’s prediction.
> * Apply your model to the test file and measure its real performance on it (same method as above).

>The goal of this exercise is not to create the best or the purest model, but rather to describe the steps you took to accomplish it.
Explain areas that may have been the most challenging for you.
>Find clear insights on the profiles of the people that make more than $50,000 / year. For example, which variables seem to be the most correlated with this phenomenon?
>Finally, you push your code on GitHub to share it with me, or send it via email.

>Once again, the goal of this exercise is not to solve this problem, but rather to spend a few hours on it and to thoroughly explain your approach.

## Metadata Info

**From the metadata (see below how this was obtained):**


> This data was extracted from the census bureau database found at
>http://www.census.gov/ftp/pub/DES/www/welcome.html

>Donor: Terran Lane and Ronny Kohavi
       Data Mining and Visualization
       Silicon Graphics.
       e-mail: terran@ecn.purdue.edu, ronnyk@sgi.com for questions.


>The data was split into train/test in approximately $2/3$, $1/3$ proportions using MineSet's MIndUtil mineset-to-mlc.

>**Prediction task** is to determine the income level for the person represented by the record.  Incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database.  The goal field of this data, however, was drawn from the "total person income" field rather than the "adjusted gross income" and may, therefore, behave differently than the orginal ADULT goal field.
>More information detailing the meaning of the attributes can be found in http://www.bls.census.gov/cps/cpsmain.htm

# Setup

In [None]:
!pip install wget

In [None]:
# System utils
import os
import zipfile
import wget 

# Some classic data science stuff
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt 
%matplotlib inline 

# Styling
pal = 'YlGnBu'
sns.set(font_scale=2)  # resets seaborn defaults, so call it before set_palette
sns.set_palette(pal)
figsize = (23, 15)

In [None]:
!nvidia-smi

# Build datasets

## Download & Extract

In [None]:
wget.download("http://thomasdata.s3.amazonaws.com/ds/us_census_full.zip")

In [None]:
!ls -1

In [None]:
# Path to the archive; wget.download saved it in the current working directory
local_zip = 'us_census_full.zip'

# Unzip the archive into a temporary path (the context manager closes the file)
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
  zip_ref.extractall('/tmp')

In [None]:
!ls -1 /tmp/us_census_full/

In [None]:
train     = '/tmp/us_census_full/census_income_learn.csv'
test      = '/tmp/us_census_full/census_income_test.csv'
metadata  = '/tmp/us_census_full/census_income_metadata.txt'

In [None]:
# Let's see what's in the metadata
!cat $metadata

In [None]:
# Check if there's a header in the train file
!head -2 $train

In [None]:
# Check if there's a header in the test file
!head -2 $test

In [None]:
# Build dataframes, no headers
dat  = pd.read_csv(train, header=None)
test_dat   = pd.read_csv(test, header=None) 

In [None]:
# Check
dat.head(5)
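
If the raw files separate fields with a comma followed by a space, every string value is loaded with a leading space, which makes later comparisons and filters error-prone. Whether that applies here is an assumption to verify against the `head` output of the raw files. The sketch below uses a made-up two-row sample (the values and the three column names are illustrative only) to show how `skipinitialspace=True` in `pd.read_csv` strips that space at load time.

```python
import io

import pandas as pd

# Made-up sample imitating a ", "-separated file with no header row
sample_csv = "39, Private, low\n50, Self-employed, high\n"

# Default parsing keeps the space that follows each comma
raw = pd.read_csv(io.StringIO(sample_csv), header=None)

# skipinitialspace=True drops whitespace right after the delimiter;
# names= assigns column names while loading (hypothetical names here)
clean = pd.read_csv(
    io.StringIO(sample_csv),
    header=None,
    skipinitialspace=True,
    names=["age", "class_of_worker", "target"],
)

print(repr(raw[1].iloc[0]))
print(repr(clean["class_of_worker"].iloc[0]))
```

The same two arguments could be passed when building `dat` and `test_dat`, as long as both files are loaded identically.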

## A trick from metadata for column names
This will help a lot during EDA. The metadata file contains useful information about the columns, their values, and their properties. I can use it to name the columns, which lets us address the dataframe by column name later on and comes in handy in many cases.

In [None]:
!tail -42 $metadata

In [None]:
'''
From the above we can see that the last 42 rows are the column names.
We can use this info to improve our datasets and help us with EDA.

Besides, the metadata tells us to ignore '|_instance_weight', the record at index 24.
'''

# We will need a list to append to...
cols = []

# Save metadata last 42 rows
column_names = !tail -42 $metadata

# Remove the record to be ignored '|_instance_weight'
column_names.pop(24)        

# Build column helper
for col in column_names:
  record = col.split(":")[0].replace(" ","_")
  cols.append(record)

# Add the target variable's column, which is not listed in the metadata
cols.append("target")

# Insert column names into dataframes
dat.columns = cols
test_dat.columns = cols

# Voila!
dat.head(5)
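
The same parsing can be done without the shell, which may help when the notebook runs outside IPython. This is a minimal sketch, not the notebook's method: `parse_column_names` is a hypothetical helper, the sample lines only imitate the `name: values` layout of the metadata, and the values after each colon are invented.

```python
def parse_column_names(lines, ignore_marker="|"):
    """Turn 'name: values' metadata lines into snake_case column names."""
    names = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines flagged with the ignore marker
        if not line or line.startswith(ignore_marker):
            continue
        # Keep the text before the first colon and replace spaces
        names.append(line.split(":")[0].strip().replace(" ", "_"))
    return names


# Illustrative lines only; the real ones come from the metadata file
sample_lines = [
    "age: continuous.",
    "class of worker: Private, Self-employed.",
    "| instance weight: ignore.",
    "year: 94, 95.",
]

cols_demo = parse_column_names(sample_lines) + ["target"]
print(cols_demo)
```

Before assigning names to a dataframe, it is worth checking that the list length equals `dat.shape[1]`, since pandas raises an error on a mismatch and a silent off-by-one in the names would mislabel every later column.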

In [None]:
dat.dtypes

In [None]:
dat.target.value_counts()

**Note:** The dataset is quite imbalanced, so plain accuracy will be a misleading metric later on.
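
One way to put a number on the imbalance is to compute the class shares and the accuracy of a model that always predicts the majority class. The sketch below runs on toy labels, not on the census data; `class_balance` is a hypothetical helper and the 9-to-1 split is invented for illustration.

```python
import pandas as pd


def class_balance(labels):
    """Return per-class shares and the majority-class baseline accuracy."""
    shares = pd.Series(labels).value_counts(normalize=True)
    # Always predicting the most frequent class scores exactly its share
    return shares, float(shares.max())


# Toy labels with an invented 9-to-1 split
toy_labels = ["low"] * 9 + ["high"] * 1
shares, baseline = class_balance(toy_labels)

print(shares)
print(baseline)
```

Calling `class_balance(dat.target)` would give the baseline any real model has to beat, which is why metrics such as precision, recall, or ROC AUC tend to be more informative here.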

In [None]:
# Treat year as a categorical feature in both sets
dat.year = dat.year.astype(str)
test_dat.year = test_dat.year.astype(str)

dat['encoded_target'] = dat.target.astype('category').cat.codes
test_dat['encoded_target'] = test_dat.target.astype('category').cat.codes

dat.encoded_target.value_counts()
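
Encoding the train and test targets independently with `cat.codes` relies on both sets containing the same labels: codes are assigned from the sorted categories found in each series, so a label missing from one set shifts the codes. A hedged sketch of a safer variant follows, using a shared `CategoricalDtype` built from the training labels. The helper name and the toy labels are assumptions made for illustration.

```python
import pandas as pd


def encode_with_shared_categories(train_labels, test_labels):
    """Encode both series with the category order learned from train."""
    categories = sorted(pd.unique(train_labels))
    dtype = pd.CategoricalDtype(categories=categories)
    train_codes = train_labels.astype(dtype).cat.codes
    test_codes = test_labels.astype(dtype).cat.codes
    return train_codes, test_codes, categories


# Toy labels: the test set happens to contain a single class
train_toy = pd.Series(["low", "high", "low"])
test_toy = pd.Series(["low", "low"])

train_codes, test_codes, categories = encode_with_shared_categories(
    train_toy, test_toy
)

# Independent encoding maps "low" to 0 in test but to 1 in train
independent_test_codes = test_toy.astype("category").cat.codes

print(categories, list(train_codes), list(test_codes))
print(list(independent_test_codes))
```

With a shared dtype, a test label unseen in training is encoded as -1, which makes the mismatch easy to detect.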

In [None]:
dat.describe()        

# Exploratory Data Analysis

## Correlation

The correlation matrix helps us spot collinearity risks and identify features we can scrap.

In [None]:
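# Correlate numeric features only; recent pandas versions raise on string columns
correlation = (dat.drop(['target', 'encoded_target'], axis=1)
                  .select_dtypes(include='number')
                  .corr())
# Boolean mask hiding the upper triangle (diagonal included)
mask = np.zeros_like(correlation, dtype=bool)
mask[np.triu_indices_from(mask)] = True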

with sns.axes_style("white"):
  f, ax = plt.subplots(figsize=figsize)
  ax = sns.heatmap(correlation,
              square=True,
              vmax=1.0,
              vmin=-1.0,
              center=0.0,
              annot_kws={'size': 12},
              linewidths=0.8,
              cmap="YlGnBu",
              linecolor='white',
              mask=mask,
              annot=True, 
              fmt=".2f",
              robust=True)
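
A heatmap gets harder to read as the number of features grows, so it can help to list only the feature pairs whose absolute correlation exceeds a cut-off. The sketch below is one possible way to do that; `correlated_pairs`, the 0.8 threshold, and the toy dataframe are all assumptions chosen for illustration.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """List (col_a, col_b, |corr|) for numeric pairs above the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the strict upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if value > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    # Strongest correlations first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy data: "b" is an exact multiple of "a", "c" is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
    "label": list("xyxyx"),  # non-numeric, ignored
})

high = correlated_pairs(toy, threshold=0.8)
print(high)
```

Running it on `dat` (minus the target columns) would give a short candidate list of redundant features to review before dropping any of them.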

In [None]:
fig, ax = plt.subplots(figsize=figsize)
ax = sns.boxplot(x='age', y='class_of_worker', data=dat,
                 palette=pal, hue='target')