## Objective
Visually inspect the Census data to see whether the distributions are unimodal or multimodal. Hopefully they are unimodal, because I will transform the data so that each row is assigned the mode of its distribution.

For example, the annual income associated with a given row is the highest-percentage bin for that row.
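
A minimal sketch of that transformation for a single row, assuming the bin counts are available as a plain dict. The helper name `mode_bin` and the toy income bins are hypothetical and are not taken from the dataset.

```python
def mode_bin(counts):
    """Return (bin_name, share) for the bin holding the largest share of the total.

    `counts` maps bin name -> count. Ties resolve to whichever bin comes
    first in the dict's iteration order.
    """
    total = float(sum(counts.values()))
    if total == 0:
        return None, 0.0  # empty row: there is no mode to report
    best = max(counts, key=counts.get)
    return best, counts[best] / total


# Toy row with made-up income bins (hypothetical names and counts)
row = {'income_0_24999': 20, 'income_25000_49999': 50, 'income_50000+': 30}
label, share = mode_bin(row)
```

Here `label` is the bin that would be attached to the row, and `share` shows how dominant that bin is.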

In [35]:
from __future__ import division
import pandas as pd
from IPython.display import display
import re

In [3]:
df = pd.read_hdf('../data/data_w_descs_and_census.h5')

In [4]:
df.shape

(905650, 155)

In [9]:
with pd.option_context("display.max_rows", 200):
    display(df.head(1).T.loc['tract_and_block_group':'value_2000000+'])

Unnamed: 0,0
tract_and_block_group,701018
bedroom_total_ppl,249
bedroom_0,0
bedroom_1,43
bedroom_2,167
bedroom_3,39
bedroom_4,0
bedroom_5+,0
school_total,204
school_0_none,0


In [120]:
CATEGORIES = {
    'bedroom': 'bedroom_total_ppl', 
    'school': 'school_total', 
    'rent': 'rent_total', 
    'race': 'race_total', 
    'income': 'income_total', 
    'poverty': None,
    'value': 'value_total'
}

In [10]:
old_df = df

In [12]:
df = old_df.loc[:, 'tract_and_block_group':'value_2000000+']
df.shape

(905650, 117)

In [14]:
df = df.drop_duplicates()
df.shape

(578, 117)

There are 578 unique Census block groups.

In [156]:
top_20_census_block_groups = old_df[['tract_and_block_group', 'bedroom_0']].groupby('tract_and_block_group').count() \
    .sort_values('bedroom_0', ascending=False).head(20)
top_20_census_block_groups

tract_and_block_group,bedroom_0
303003,116153
801001,5098
612001,4522
606001,4328
107021,4024
701018,3708
709001,3584
1102011,3511
510001,3447
1401051,3443


In [16]:
df = df.sort_values('tract_and_block_group')
df.head(2)

Unnamed: 0,tract_and_block_group,bedroom_total_ppl,bedroom_0,bedroom_1,bedroom_2,bedroom_3,bedroom_4,bedroom_5+,school_total,school_0_none,...,value_175000_199999,value_200000_249999,value_250000_299999,value_300000_399999,value_400000_499999,value_500000_749999,value_750000_999999,value_1000000_1499999,value_1500000_1999999,value_2000000+
149,1001,560.0,8.0,120.0,193.0,183.0,56.0,0.0,960.0,23.0,...,13.0,38.0,22.0,5.0,51.0,41.0,0.0,0.0,0.0,0.0
1929,1002,485.0,9.0,30.0,239.0,132.0,36.0,39.0,794.0,27.0,...,0.0,9.0,11.0,55.0,19.0,8.0,0.0,0.0,0.0,0.0


## Implementing sparklines

Courtesy of [this repo](http://iiseymour.github.io/sparkline-nb/).

In [17]:
import base64
import requests
import numpy as np
import pandas as pd
from time import sleep
from itertools import chain
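try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import BytesIO as StringIO  # Python 3: savefig writes bytes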
from datetime import timedelta, date
from IPython.display import display, HTML

%pylab inline



Populating the interactive namespace from numpy and matplotlib


In [18]:
# Turn off the max column width so the HTML
# image tags don't get truncated
# (pandas 1.0+ deprecates -1 here in favour of None)
pd.set_option('display.max_colwidth', -1)

# With the column width unlimited, pandas would print every element
# of our arrays, so limit the number of elements displayed
pd.set_option('display.max_seq_items', 2)

In [104]:
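def sparkline(data, figsize=(4, 0.25), **kwargs):
    """
    Return an HTML image tag containing a base64-encoded sparkline-style plot.
    Extra keyword arguments are passed through to plt.subplots.
    """
    data = list(data)

    fig, ax = plt.subplots(1, 1, figsize=figsize, **kwargs)
    ax.plot(data)
    # Hide the frame and the y ticks; keep the x ticks so bins can be counted
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.tick_params(top=False)
    ax.set_yticks([])

    # Mark the last point in red
    ax.plot(len(data) - 1, data[-1], 'r.')

    # Lightly shade the area between the line and the minimum value
    ax.fill_between(range(len(data)), data, len(data) * [min(data)], alpha=0.1)

    # Render the figure to an in-memory PNG and embed it as a data URI
    img = StringIO()
    fig.savefig(img, format='png')
    img.seek(0)
    plt.close(fig)
    return '<img src="data:image/png;base64,{}"/>'.format(
        base64.b64encode(img.read()).decode('ascii'))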

In [202]:
def is_multimodal(lst, cutoff=0.2):
    """Flag a distribution whose two largest shares differ by less than `cutoff`."""
    # returned too many 'false positives' so I stopped using it
    data = sorted(lst)
    return bool(data[-1] - data[-2] < cutoff)
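
One possible alternative, sketched here and not used anywhere in this notebook: count the local peaks that are at least some fraction as tall as the largest share, so that small distant bumps are ignored. The function name `count_peaks` and the `rel_height` threshold are assumptions made for illustration.

```python
def count_peaks(shares, rel_height=0.5):
    """Count local maxima at least `rel_height` times as tall as the largest value.

    A point counts as a local maximum when it is strictly greater than its
    left neighbour and at least as large as its right neighbour, so a flat
    plateau is counted once. End points are compared against the single
    neighbour they have.
    """
    shares = list(shares)
    if not shares:
        return 0
    threshold = rel_height * max(shares)
    peaks = 0
    for i, value in enumerate(shares):
        left = shares[i - 1] if i > 0 else float('-inf')
        right = shares[i + 1] if i < len(shares) - 1 else float('-inf')
        if value > left and value >= right and value >= threshold:
            peaks += 1
    return peaks
```

Under this sketch a distribution could be treated as multimodal when `count_peaks(shares) > 1`. Whether that produces fewer false positives on this data is untested.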

In [204]:
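def does_col_have_num(col_name, category):
    """Return True if `col_name` is one of the bin columns for `category` (not its total)."""
    if category == 'race':
        # race bins are named, not numbered
        return col_name in ('race_white', 'race_black', 'race_asian', 'race_hispanic', 'race_other')
    # other categories have numbered bins such as 'bedroom_0' or 'value_2000000+'
    return re.search(r'{}_\d'.format(category), col_name) is not None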

def add_sparklines_to_df(df, category, total_col_name, check_multimodal=False):
    """Return an HTML table with one sparkline per row showing each bin's share of the total."""
    if category == 'poverty':
        raise ValueError("too many cols that aren't part of total; look in more detail")

    # Bin columns for this category; the total column is excluded
    bin_cols = [col for col in df.columns if does_col_have_num(col, category)]

    data = df.loc[:, :'tract_and_block_group'].copy()
    data['normalized_data'] = df[bin_cols + [total_col_name]].apply(
        lambda row: [row[col] / row[total_col_name] for col in bin_cols],
        axis=1
    )
    data['sparklines'] = data['normalized_data'].map(sparkline)

    cols_to_show = ['tract_and_block_group', 'sparklines']
    if check_multimodal:
        flag_col = 'is_{}_multimodal'.format(category)
        data[flag_col] = data['normalized_data'].map(is_multimodal)
        cols_to_show.append(flag_col)

    # _repr_html_ escapes HTML so manually handle the rendering
    return HTML(data[cols_to_show].to_html(escape=False))

## My prior beliefs

In [137]:
CATEGORIES.keys()

['school', 'bedroom', 'value', 'race', 'poverty', 'rent', 'income']

I think these will be unimodal:

- school
- bedroom
- value
- rent

And these are less likely to be unimodal:

- race
- income

## Looking at `bedroom` first

Are the distributions mostly unimodal or multimodal for the 20 Census block groups with the most issues?

In [201]:
category = 'bedroom'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines,is_bedroom_multimodal
116,107021,,False
219,202001,,True
1,303003,,False
481,304001,,False
84,406001,,True
786,406002,,False
578,510001,,True
645,511011,,True
303,606001,,True
150,612001,,True


They look pretty unimodal--great.

## School

In [200]:
category = 'school'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines,is_school_multimodal
116,107021,,False
219,202001,,False
1,303003,,True
481,304001,,False
84,406001,,False
786,406002,,True
578,510001,,False
645,511011,,False
303,606001,,False
150,612001,,False


Unimodal enough. It makes sense that there is a big bump for HS and another for college. I'm fine with the larger of the two categories being associated with a given row, even if the percent with a college degree is only slightly larger than the percent with a HS diploma.

## House value

In [162]:
category = 'value'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines
116,107021,
219,202001,
1,303003,
481,304001,
84,406001,
786,406002,
578,510001,
645,511011,
303,606001,
150,612001,


Some of the distributions are somewhat bimodal, but the modes are close to each other, so taking one mode is fine.

## Rent

In [199]:
category = 'rent'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines,is_rent_multimodal
116,107021,,False
219,202001,,True
1,303003,,True
481,304001,,True
84,406001,,True
786,406002,,True
578,510001,,True
645,511011,,False
303,606001,,True
150,612001,,False


Same as with house value.

## Income

In [203]:
category = 'income'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines,is_income_multimodal
116,107021,,True
219,202001,,True
1,303003,,True
481,304001,,True
84,406001,,True
786,406002,,True
578,510001,,True
645,511011,,True
303,606001,,True
150,612001,,True


I was worried about seeing some distinctly bimodal distributions, for example if a Census block group includes public housing. More of the distributions here have modes farther apart than in the previous categories.

For simplicity's sake, I will choose the mode as the point estimate.

I could look at the next 20 or so Census block groups to explore further.
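
If I do explore further, one rough way to put a number on "modes farther from each other" is to measure how many bin positions separate the two largest shares. This is only a sketch: the helper name `top_two_bin_gap` is hypothetical, and it treats the second-largest bin as the second mode, which is a simplification.

```python
def top_two_bin_gap(shares):
    """Return how many bin positions separate the two largest shares.

    A gap of 1 means the two biggest bins are adjacent. Ties go to the
    earlier bin because `sorted` is stable.
    """
    shares = list(shares)
    if len(shares) < 2:
        raise ValueError('need at least two bins')
    # Bin positions ordered from largest share to smallest
    order = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    return abs(order[0] - order[1])
```

A small gap suggests the mode is a reasonable summary of the row, while a large gap points to a row worth inspecting by eye.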

## Race

In [205]:
category = 'race'
add_sparklines_to_df(df[df.tract_and_block_group.isin(top_20_census_block_groups.index)], category, CATEGORIES[category])

Unnamed: 0,tract_and_block_group,sparklines
116,107021,
219,202001,
1,303003,
481,304001,
84,406001,
786,406002,
578,510001,
645,511011,
303,606001,
150,612001,


Race per Census block group has multiple modes, as I'd thought.

Since race is so multimodal, and because the categories have no ordinal meaning, I will keep the race percentages in the dataset instead of choosing the mode. That makes the coefficients harder to interpret: we can't simply increase the percentage of one race by 1%, because the others have to decrease.

My solution for now is to re-normalize the race percentages before giving the X values to the model for prediction.
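
A minimal sketch of that re-normalization for one row, assuming the race shares are held in a dict keyed by column name. The helper name `renormalize` and the example numbers are hypothetical.

```python
def renormalize(shares):
    """Rescale non-negative shares so they sum to 1, keeping their ratios."""
    total = float(sum(shares.values()))
    if total <= 0:
        raise ValueError('shares must sum to a positive number')
    return {name: value / total for name, value in shares.items()}


# Bump one group's share by one percentage point, then rescale the row
row = {'race_white': 0.50, 'race_black': 0.30, 'race_asian': 0.20}
row['race_white'] += 0.01
adjusted = renormalize(row)
```

After rescaling, the shares sum to 1 again and the untouched groups keep their proportions relative to each other, so the "increase" in one group is paid for proportionally by the rest.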