# Computing the Diversity Index using Pandas

This code requires the following Python libraries:
* [`pandas`](http://pandas.pydata.org/)
* [`requests`](http://requests.readthedocs.org/)

You can install both with `pip`.

## Processing the data with Python

Most popular data analysis tools for Python can open Excel files. However, if you're comfortable programming in Python, you may as well take advantage of Census Reporter's API and the utility code provided here. (Note that the Census Reporter API returns at most 3500 geographies in a single call.)
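If you have an explicit list of geoids longer than the 3500-geography limit, one way to stay under it is to split the list into batches and make one call per batch. The helper below is a minimal sketch of that idea; `chunk_geoids` is a hypothetical name, not part of Census Reporter's code, and the geoids in the example are made up.

```python
def chunk_geoids(geoids, size=3500):
    """Yield successive lists of at most `size` geoids (hypothetical helper)."""
    geoids = list(geoids)
    for start in range(0, len(geoids), size):
        yield geoids[start:start + size]

# Made-up geoids, only to show how a long list is split
example = ['14000US{:011d}'.format(i) for i in range(8000)]
batches = list(chunk_geoids(example))
print([len(batch) for batch in batches])  # [3500, 3500, 1000]
```

Each batch could then be passed to `dataframe_from_api` (defined in the next cell) and the resulting frames combined with `pd.concat`. This only helps with explicit lists: a single expansion such as `040|01000US` can't be split this way, and a smaller batch size may be needed if the request URL gets too long.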

The code below offers a general-purpose function that returns a `pandas` `DataFrame` for any combination of tables and geoids, and a second function that takes a `DataFrame` assumed to have the right columns and returns a `Series` containing the diversity index.

In [1]:
import pandas as pd
import requests
import json

def dataframe_from_api(tables=None, geoids=None, include_moe=False):
    """Fetch data from the Census Reporter API and produce a dataframe with
       columns for each Census column, plus 'name'. The index will be the geoid.

       Table IDs and geoids can be passed as either lists or comma-separated strings.
    """
    if tables is None: tables = ['B01001'] # default to sex by age
    if geoids is None: geoids = ['040|01000US'] # default to all states in the US

    # Join lists into comma-separated strings, but leave strings untouched:
    # ','.join() on a string would put a comma between every character.
    if not isinstance(tables, str):
        tables = ','.join(tables)
    if not isinstance(geoids, str):
        geoids = ','.join(geoids)
    url = 'https://api.censusreporter.org/1.0/data/show/latest?table_ids={}&geo_ids={}'.format(tables,geoids)
    resp = requests.get(url)
    resp.raise_for_status() # stop on an HTTP error rather than parsing an error body
    data = resp.json()
    d = { }
    for geoid in data['data']:
        d[geoid] = { 'name': data['geography'][geoid]['name'] }
        for table_id, table_data in data['data'][geoid].items():
            for column, value in table_data['estimate'].items():
                d[geoid][column] = value
                if include_moe:
                    d[geoid]["{}_moe".format(column)] = table_data['error'][column]
    return pd.DataFrame.from_dict(d,orient='index')

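def compute_diversity(df):
    """Compute the diversity index from tables B02001 (Race) and
       B03003 (Hispanic or Latino Origin).
       B02001001 - total population
       B02001002 - White
       B02001003 - Black
       B02001004 - Native American
       B02001005 - Asian
       B02001006 - Pacific Islander
       B03003001 - total population
       B03003002 - Not Hispanic
       B03003003 - Hispanic
    """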
    white_pct = df['B02001002'] / df['B02001001']
    black_pct = df['B02001003'] / df['B02001001']
    amerind_pct = df['B02001004'] / df['B02001001']
    asian_pct = df['B02001005'] / df['B02001001']
    nhpi_pct = df['B02001006'] / df['B02001001']
    nonhisp_pct = df['B03003002'] / df['B03003001']
    hisp_pct = df['B03003003'] / df['B03003001']
    return 1-( 
              (white_pct**2 + black_pct**2 + amerind_pct**2 + asian_pct**2 + nhpi_pct**2 ) * 
              (hisp_pct**2 + nonhisp_pct**2)
           )


In [2]:
df = dataframe_from_api(['B02001', 'B03003'],['040|01000US'])
df['diversity'] = compute_diversity(df)

In [3]:
df[['name','diversity']].sort_values(by='diversity',ascending=False)

| geoid | name | diversity |
|---|---|---|
| 04000US15 | Hawaii | 0.821999 |
| 04000US06 | California | 0.789932 |
| 04000US35 | New Mexico | 0.727072 |
| 04000US32 | Nevada | 0.715009 |
| 04000US48 | Texas | 0.694184 |
| 04000US36 | New York | 0.692731 |
| 04000US11 | District of Columbia | 0.673453 |
| 04000US34 | New Jersey | 0.665889 |
| 04000US24 | Maryland | 0.649323 |
| 04000US04 | Arizona | 0.644045 |
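To get a feel for what these numbers mean, it can help to run the same arithmetic as `compute_diversity` on plain numbers. The sketch below uses made-up population shares, not Census data, and `toy_diversity` is a hypothetical helper written only for illustration. It assumes the non-Hispanic share is simply one minus the Hispanic share.

```python
def toy_diversity(race_shares, hisp_share):
    """Same arithmetic as compute_diversity, on plain floats (hypothetical helper)."""
    race_term = sum(p ** 2 for p in race_shares)
    # Assumption: everyone is either Hispanic or non-Hispanic
    ethnicity_term = hisp_share ** 2 + (1 - hisp_share) ** 2
    return 1 - race_term * ethnicity_term

# Everyone in a single race group, nobody Hispanic: no diversity at all
print(toy_diversity([1.0, 0, 0, 0, 0], 0.0))    # 0.0

# Two race groups of equal size, half of the population Hispanic
print(toy_diversity([0.5, 0.5, 0, 0, 0], 0.5))  # 0.75
```

The more evenly a population is spread across groups, the smaller the sums of squares and the closer the index gets to 1. Note that `compute_diversity` as written uses only five race columns, so in real data the race shares will not necessarily add up to 1.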


## Your own index

Certainly, the people who came up with the USAT Diversity Index are very experienced and knowledgeable. But depending on your circumstances, you may be interested in developing an index which better reflects your understanding of a situation.

As long as you are careful and clear, this is not a terrible idea. If you go down this path, be sure to understand that you can't cross-compare data from different indices, even if they are based on similar premises. Also, if you have the opportunity, find some outside eyes to critique your choices.

When I looked at the data from the original method, I thought I noticed a pattern of relatively high rankings for states in the South and West, where there's a longer tradition of Hispanic settlement. I wondered if treating the Hispanic/non-Hispanic distinction as just as important as the distinctions among races might be part of the cause.

In much of the public conversation about race, "Hispanic" is treated as a race, despite the Census Bureau's methodology. I wrote an alternative equation which operates on a dataframe with columns from table **B03002 - Hispanic or Latino Origin by Race**. It takes the counts for the five "races" of the original equation from the non-Hispanic counts for those races, and treats all Hispanic respondents as a single group, regardless of which Census races those respondents reported.
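Written as formulas, the two computations in this notebook look roughly like this. The notation is introduced here for illustration and does not come from the original index: $p_r$ is the share of the population in race group $r$, $q_r$ is the share that is non-Hispanic and in race group $r$, $p_H$ and $p_{NH}$ are the Hispanic and non-Hispanic shares, and $R$ is the set of five race groups used in the code.

$$
D_1 = 1 - \left(\sum_{r \in R} p_r^2\right)\left(p_H^2 + p_{NH}^2\right)
\qquad\qquad
D_2 = 1 - \left(\sum_{r \in R} q_r^2 + p_H^2\right)
$$

$D_1$ corresponds to `compute_diversity` and $D_2$ to `compute_diversity2`. In $D_2$ the Hispanic share is one more term in the sum rather than a separate multiplier.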

In [4]:
def compute_diversity2(df):
    """An alternative computation, out of curiosity, to see if the original overweights Hispanic diversity
       Depends on table B03002 - Hispanic or Latino Origin by Race
       B03002003 - NH White
       B03002004 - NH Black
       B03002005 - NH Native American
       B03002006 - NH Asian
       B03002007 - NH Pacific Islander
       B03002012 - Hispanic (all races)
    """
    white_pct = df['B03002003'] / df['B03002001']
    black_pct = df['B03002004'] / df['B03002001']
    amerind_pct = df['B03002005'] / df['B03002001']
    asian_pct = df['B03002006'] / df['B03002001']
    nhpi_pct = df['B03002007'] / df['B03002001']
    hisp_pct = df['B03002012'] / df['B03002001']
    return 1-( 
              (white_pct**2 + black_pct**2 + amerind_pct**2 + asian_pct**2 + nhpi_pct**2 + hisp_pct**2)
           )
    
    
df2 = dataframe_from_api(['B03002'],['040|01000US'])
df['diversity2'] = compute_diversity2(df2) # since df2 has the same index we can safely mix it in with df
df['rank'] = df.diversity.rank(ascending=False)
df['rank2'] = df.diversity2.rank(ascending=False)
df[['name','rank','rank2']].sort_values(by='rank')

| geoid | name | rank | rank2 |
|---|---|---|---|
| 04000US15 | Hawaii | 1 | 1 |
| 04000US06 | California | 2 | 2 |
| 04000US35 | New Mexico | 3 | 9 |
| 04000US32 | Nevada | 4 | 4 |
| 04000US48 | Texas | 5 | 3 |
| 04000US36 | New York | 6 | 7 |
| 04000US11 | District of Columbia | 7 | 5 |
| 04000US34 | New Jersey | 8 | 8 |
| 04000US24 | Maryland | 9 | 6 |
| 04000US04 | Arizona | 10 | 13 |


There's no objective answer to whether this index is "better" than the original, but we do see that New Mexico and Arizona go down in the rankings, while the District of Columbia and Maryland go up. We also see Texas and Florida (outside the ten rows shown above) going up. These are states with considerable Hispanic populations, but also sizeable African American populations, especially compared to NM and AZ.
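Rather than scanning the two rank columns by eye, you can compute the change directly. The sketch below rebuilds only the ten rows of output shown above in a small standalone frame; the `change` column and the `movers` name are introduced here for illustration.

```python
import pandas as pd

# Ranks copied from the ten rows of output shown above
ranks = pd.DataFrame({
    'name': ['Hawaii', 'California', 'New Mexico', 'Nevada', 'Texas',
             'New York', 'District of Columbia', 'New Jersey', 'Maryland', 'Arizona'],
    'rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'rank2': [1, 2, 9, 4, 3, 7, 5, 8, 6, 13],
})

# Positive change: the state ranks higher under the alternative index
ranks['change'] = ranks['rank'] - ranks['rank2']

# Keep only the states whose rank moved, biggest drops first
movers = ranks[ranks['change'] != 0].sort_values(by='change')
print(movers[['name', 'rank', 'rank2', 'change']])
```

In the notebook itself, the same subtraction on `df['rank']` and `df['rank2']` would cover every state, not just these ten.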

Of course, you have to be careful not to just shop around for algorithms that satisfy your intuitions or biases, but if, after careful consideration (and perhaps discussion with a data mentor), it makes sense, go ahead and use it!