# Why Is Your Region Hard To Count?

### By Benjamin Livingston, [*NewsCounts*](http://www.newscounts.org)

## Introduction

"Hard to count" is a popular buzzphrase as the 2020 US Census takes shape - but what does it mean for your region?

As a journalist, you probably have a great idea of your area's demographic makeup, and what groups *might* be particularly hard-to-count - but it's difficult to quantify.

*NewsCounts* has built a tool that will allow you to do this - **and we're prepared to tailor it for your newsroom using both 2010 and up-to-date 2020 data.**

By compiling demographic and mail response data from the 2010 and 2020 censuses, we've developed a model for what makes an area hard-to-count.

## What We'll Do Here

We'll start by examining which factors made the nation as a whole hard-to-count in 2010. 

To do this, we will look at basic tract-level data from the 2010 decennial census and five-year American Community Survey estimates (also from 2010), and use them as a means of modeling exactly *why* America is hard-to-count.

We'll then move to a more granular analysis that demonstrates how the story might change in your local area, and how you can tell that story.

## Methodology

To do this, we used three highly interpretable models that allow us to glean which demographic factors correlate most with counting difficulties:
* A basic linear model that simply calculates the linear correlation between each demographic factor and undercounting
* A simple tree model that finds which factors have the biggest impact in determining why areas are hard to count
* A five-tree random forest model, a more stabilized tree model that considers how different combinations of demographic factors might impact undercounting

We'll then present the weights from these models to determine which factors are most important - first the correlation coefficients for the linear model, then the variable importances from the two tree models.
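To make the three kinds of weights concrete, here is a minimal sketch on synthetic data. The values and the relationship between the columns are made up for illustration and are not census data; only the column names and model settings echo the ones used in this article.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for tract-level data (hypothetical values, not census data)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'below_poverty_level': rng.uniform(0, 0.5, n),
    'make_over_75k': rng.uniform(0, 0.6, n),
})
# Made-up relationship: response falls with poverty and rises slightly with income
toy['response'] = (0.8 - 0.6 * toy['below_poverty_level']
                   + 0.1 * toy['make_over_75k']
                   + rng.normal(0, 0.02, n))

X = toy.drop(columns='response')
y = toy['response']

# 1. Linear correlation of each factor with the response (signed)
correlations = toy.corr()['response'].drop('response')

# 2. Single decision tree: importances are non-negative and sum to 1
tree = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X, y)

# 3. Small random forest: importances averaged over five trees
forest = RandomForestRegressor(max_depth=12, n_estimators=5, random_state=0).fit(X, y)

weights = pd.DataFrame({
    'correlations': correlations,
    'tree_importances': pd.Series(tree.feature_importances_, index=X.columns),
    'random_forest_importances': pd.Series(forest.feature_importances_, index=X.columns),
}).round(2)
print(weights)
```

Note that correlations carry a sign, while the tree-based importances only say how much a factor mattered, not in which direction.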

## Results

We now present the factors that correlated most with undercounting in 2010.

In [62]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [63]:
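# Load the tract-level data; the first CSV column becomes the index
data = pd.read_csv('responses.csv', index_col=0)

# Features are the demographic columns; the target is the mail response rate
X_train = data.drop('response', axis=1)
y_train = data.response

# Single decision tree
tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X_train, y_train)

# Random forest of five trees
rf_5 = RandomForestRegressor(max_depth=12, n_estimators=5, random_state=0)
rf_5.fit(X_train, y_train)

# Correlation of each feature with the response; dropping 'response' by label
# keeps the values aligned with the feature columns whatever the column order
correlations = list(data.corr()['response'].drop('response'))
tree_importances = list(tree.feature_importances_)
rf_5_importances = list(rf_5.feature_importances_)

# One row per weight type, one column per feature
results = pd.DataFrame(data=[correlations, tree_importances, rf_5_importances],
                       columns=list(X_train.columns),
                       index=['correlations', 'tree_importances', 'random_forest_importances'])

# Put features on the rows and sort by absolute correlation, largest first
results = results.transpose().round(2)
results = results.iloc[results['correlations'].abs().argsort()].iloc[::-1]
results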

| factor | correlations | tree_importances | random_forest_importances |
|---|---|---|---|
| below_poverty_level | -0.44 | 0.39 | 0.28 |
| make_over_75k | 0.39 | 0.1 | 0.1 |
| white | 0.36 | 0.07 | 0.07 |
| no_high_school | -0.34 | 0.01 | 0.03 |
| black | -0.34 | 0.02 | 0.03 |
| pop | 0.26 | 0.06 | 0.07 |
| other_race | -0.19 | 0.03 | 0.04 |
| hispanic | -0.19 | 0.02 | 0.03 |
| noncitizen | -0.15 | 0.0 | 0.01 |
| spanish_poor_english | -0.15 | 0.0 | 0.01 |


A couple quick definitions:
* **Correlations** indicate whether mail return rates tend to rise (positive) or fall (negative) as the presence of that subgroup increases, and how strong that linear relationship is
* [**Feature importances**](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.feature_importances_) indicate how important a demographic factor is in determining how hard-to-count an area is, relative to other factors
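For readers who want to see what sits behind a correlation figure, here is a minimal sketch that computes a Pearson correlation by hand and checks it against pandas, whose `corr` method uses Pearson by default. The five data points are hypothetical and chosen only to show a negative relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: share of residents below the poverty level
# and the mail return rate for five imaginary tracts
poverty = pd.Series([0.05, 0.10, 0.20, 0.30, 0.40])
response = pd.Series([0.82, 0.80, 0.74, 0.70, 0.65])

# Pearson correlation by hand: covariance divided by the product of spreads
dx = poverty - poverty.mean()
dy = response - response.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same figure from pandas
r_pandas = poverty.corr(response)

print(r_manual, r_pandas)
```

A value near -1 means return rates fall almost in lockstep as the factor rises; a value near 0 means there is little linear relationship either way.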

It is abundantly clear across the board that socioeconomic factors, particularly economic ones, correlated heavily with undercounting.

We find a fairly substantial correlation between higher income and a higher likelihood of filling out a census form, and white- and Asian-heavy areas filled out census forms more frequently than areas with more African-Americans and Hispanics.

More populous, urban areas had higher mail return rates, as did areas with more minors.

It's interesting to note that income seems to be the biggest factor (more so than race), although this could potentially be because income can function as a generalized proxy for more granular socioeconomic differences.
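To illustrate the proxy idea, here is a minimal sketch on synthetic data (not census data). The hidden factor, the coefficients, and the variable names are all assumptions made up for this example: response is driven only by the hidden factor, yet income still correlates strongly with response because it tracks that factor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical hidden factor that actually drives response in this toy setup
hidden = rng.normal(0, 1, n)

# Income tracks the hidden factor but has no direct effect on response here
income = hidden + rng.normal(0, 0.5, n)
response = 0.7 + 0.05 * hidden + rng.normal(0, 0.01, n)

toy = pd.DataFrame({'income': income, 'response': response})
proxy_corr = toy['income'].corr(toy['response'])
print(round(proxy_corr, 2))
```

This is why a strong weight on income should be read as "income, or something that travels with income" rather than as proof of a direct effect.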

Let's neutralize income and some of the other economic, education, population, and age data by dropping those columns, and throw out the white category, to get an idea of which minority groups seemed to struggle most with undercounting.

In [64]:
# Drop the economic, education, population, and age columns, plus 'white'
data = data.drop(columns=['make_over_75k', 'white', 'below_poverty_level',
                          'no_high_school', 'rural', 'pop', 'avgfamsize', 'under18'])

# Refit both tree models on the remaining features
X_train = data.drop('response', axis=1)
y_train = data.response

tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X_train, y_train)

rf_5 = RandomForestRegressor(max_depth=12, n_estimators=5, random_state=0)
rf_5.fit(X_train, y_train)

# Drop 'response' by label so the correlations stay aligned with the feature columns
correlations = list(data.corr()['response'].drop('response'))
tree_importances = list(tree.feature_importances_)
rf_5_importances = list(rf_5.feature_importances_)

results = pd.DataFrame(data=[correlations, tree_importances, rf_5_importances],
                       columns=list(X_train.columns),
                       index=['correlations', 'tree_importances', 'random_forest_importances'])

# Sort by absolute correlation, largest first
results = results.transpose().round(2)
results = results.iloc[results['correlations'].abs().argsort()].iloc[::-1]
results

| factor | correlations | tree_importances | random_forest_importances |
|---|---|---|---|
| black | -0.34 | 0.37 | 0.28 |
| hispanic | -0.19 | 0.05 | 0.07 |
| other_race | -0.19 | 0.1 | 0.1 |
| spanish_poor_english | -0.15 | 0.01 | 0.02 |
| noncitizen | -0.15 | 0.04 | 0.05 |
| AIAN | -0.14 | 0.16 | 0.12 |
| multiracial | -0.11 | 0.04 | 0.08 |
| asian | 0.11 | 0.2 | 0.17 |
| foreign_born | -0.06 | 0.02 | 0.05 |
| pacific_islander | -0.03 | 0.01 | 0.02 |


A higher presence of African-Americans correlated with a lower mail return rate, much more so than for any other race. We can see a similar effect for Hispanics, American Indians & Alaska Natives, multiracial populations, and other minority populations, although to a lesser degree.

It's worth noting, again, that a higher presence of Asians tended to indicate a higher mail return rate, although not to the same degree as a higher presence of whites.



## How Do I Use This In My Newsroom?

We'll show an example here. To do this, we'll pretend we're a newsroom in Kansas. 

**Keep in mind: we can help you run data for your own state or county in a matter of minutes** - see the end of this article for contact details (all we have to do is change one line of code!).
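The "one line" is the filter on tract identifiers. As a minimal sketch, assuming the data is indexed by tract GEO_ID strings that begin with `1400000US` followed by the two-digit state FIPS code, a small helper could look like this. The function name `filter_state` and the three sample rows are hypothetical and exist only for illustration.

```python
import pandas as pd

def filter_state(data, state_fips):
    """Keep tracts whose GEO_ID index begins with the prefix for one state."""
    prefix = '1400000US' + state_fips
    return data[data.index.str.startswith(prefix)]

# Hypothetical three-row frame: two Kansas tracts (FIPS 20), one New York tract (FIPS 36)
toy = pd.DataFrame(
    {'response': [0.80, 0.75, 0.70]},
    index=['1400000US20001952600', '1400000US20173000100', '1400000US36061000100'],
)

kansas = filter_state(toy, '20')
print(kansas)
```

Because a tract identifier continues with a three-digit county code after the state code, passing a longer prefix such as a five-digit state-plus-county code would narrow the same filter to a single county.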

Let's take this same approach for Kansas, and see how the data changes, and what conclusions we might make if we worked at a newsroom in Topeka or Dodge City.

In [65]:
data = pd.read_csv('responses.csv', index_col=0)
# Keep only Kansas tracts: GEO_IDs beginning with '1400000US' plus the state FIPS code 20
data = data[data.index.str.startswith('1400000US20')]

# Features are the demographic columns; the target is the mail response rate
X_train = data.drop('response', axis=1)
y_train = data.response

tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X_train, y_train)

rf_5 = RandomForestRegressor(max_depth=12, n_estimators=5, random_state=0)
rf_5.fit(X_train, y_train)

# Drop 'response' by label so the correlations stay aligned with the feature columns
correlations = list(data.corr()['response'].drop('response'))
tree_importances = list(tree.feature_importances_)
rf_5_importances = list(rf_5.feature_importances_)

results = pd.DataFrame(data=[correlations, tree_importances, rf_5_importances],
                       columns=list(X_train.columns),
                       index=['correlations', 'tree_importances', 'random_forest_importances'])

# Sort by absolute correlation, largest first
results = results.transpose().round(2)
results = results.iloc[results['correlations'].abs().argsort()].iloc[::-1]
results

| factor | correlations | tree_importances | random_forest_importances |
|---|---|---|---|
| make_over_75k | 0.54 | 0.41 | 0.29 |
| below_poverty_level | -0.48 | 0.03 | 0.05 |
| hispanic | -0.46 | 0.01 | 0.02 |
| no_high_school | -0.46 | 0.04 | 0.06 |
| other_race | -0.4 | 0.06 | 0.02 |
| white | 0.39 | 0.14 | 0.19 |
| spanish_poor_english | -0.36 | 0.01 | 0.01 |
| pop | 0.36 | 0.09 | 0.12 |
| noncitizen | -0.34 | 0.02 | 0.02 |
| black | -0.34 | 0.03 | 0.04 |


Here, we see socioeconomic differences having an even bigger impact. There's a massive divide in mail return rates between rich and poor areas, and between majority-heavy and minority-heavy communities.

We also see that, while the issue of African-American undercounting doesn't vary much from the rest of the country, Hispanics and noncitizens appear to be at significantly higher risk in Kansas than in the rest of the country. The negative correlations between those factors and mail response rates more than doubled from the national level to Kansas.
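The "more than doubled" comparison is simple arithmetic on the tables in this article. Here is a short sketch that copies the national and Kansas correlations for three factors and divides one by the other; the variable names are ours, but the figures come straight from the results shown here.

```python
import pandas as pd

# Correlations with mail response, copied from the national and Kansas tables
national = pd.Series({'hispanic': -0.19, 'noncitizen': -0.15, 'black': -0.34})
kansas = pd.Series({'hispanic': -0.46, 'noncitizen': -0.34, 'black': -0.34})

comparison = pd.DataFrame({'national': national, 'kansas': kansas})

# A ratio above 1 means the relationship is stronger in Kansas than nationally
comparison['ratio'] = (comparison['kansas'] / comparison['national']).round(2)
print(comparison)
```

The ratios for `hispanic` and `noncitizen` both come out above 2, while the ratio for `black` is 1, which matches the reading above.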

By neutralizing the same variables as before, we can get an even closer look at these factors.

In [66]:
# Drop the same economic, education, population, and age columns as before, plus 'white'
data = data.drop(columns=['make_over_75k', 'white', 'below_poverty_level',
                          'no_high_school', 'rural', 'pop', 'avgfamsize', 'under18'])

# Refit both tree models on the remaining Kansas features
X_train = data.drop('response', axis=1)
y_train = data.response

tree = DecisionTreeRegressor(max_depth=8)
tree.fit(X_train, y_train)

rf_5 = RandomForestRegressor(max_depth=12, n_estimators=5, random_state=0)
rf_5.fit(X_train, y_train)

# Drop 'response' by label so the correlations stay aligned with the feature columns
correlations = list(data.corr()['response'].drop('response'))
tree_importances = list(tree.feature_importances_)
rf_5_importances = list(rf_5.feature_importances_)

results = pd.DataFrame(data=[correlations, tree_importances, rf_5_importances],
                       columns=list(X_train.columns),
                       index=['correlations', 'tree_importances', 'random_forest_importances'])

# Sort by absolute correlation, largest first
results = results.transpose().round(2)
results = results.iloc[results['correlations'].abs().argsort()].iloc[::-1]
results

| factor | correlations | tree_importances | random_forest_importances |
|---|---|---|---|
| hispanic | -0.46 | 0.34 | 0.23 |
| other_race | -0.4 | 0.08 | 0.08 |
| spanish_poor_english | -0.36 | 0.02 | 0.07 |
| noncitizen | -0.34 | 0.03 | 0.04 |
| black | -0.34 | 0.15 | 0.16 |
| multiracial | -0.31 | 0.09 | 0.11 |
| foreign_born | -0.29 | 0.1 | 0.09 |
| pacific_islander | -0.12 | 0.01 | 0.02 |
| AIAN | -0.09 | 0.05 | 0.08 |
| other_lang_poor_english | -0.04 | 0.01 | 0.02 |


It's abundantly clear that Kansas' Hispanic and [extremely diverse](https://www.thekansan.com/news/20200119/students-flourish-in-garden-city-high-school-esl-classes) minority and undocumented communities were much more frequently undercounted than its more affluent, white communities.

This is a [well-documented](https://www.dodgeglobe.com/news/20200207/we-dont-exist-in-push-for-dodge-city-complete-count-census-fears-threaten-resources) issue in Kansas - but this kind of analysis can give new weight to discussions about why a region is hard-to-count by adding evidence.

Ranking these weights and seeing how they compare is an excellent way to glean an idea of why your region is hard to count, and we can help you interpret them if you'd like us to.

## How We Can Help

*NewsCounts* can run this analysis for your state (or possibly even your county, if it's large enough) and help you determine which factors correlated with undercounting there.

We can do this by looking back at 2010, or go in-the-moment and look at why it might be happening right now in 2020.

We're also *very* open to incorporating new data. There are thousands of data points we can add if you feel another demographic factor might be good for modeling why your region is hard-to-count. Drop us a note if you have an idea, and we'll make an analysis that works for you.

## Contact Info & Other Resources

Feel free to email [benjamin.livingston@columbia.edu](mailto:benjamin.livingston@columbia.edu) or post on the NewsCounts Slack channel (email Benjamin for an invite!) any time if you'd like us to help you get started with this.

The GitHub repository for our response rate monitoring is [here](https://github.com/bwliv/censusresponses) if you'd like to play with this data yourself.

[CUNY's hard-to-count map](https://www.censushardtocountmaps2020.us/) is an excellent resource for a general look at which areas were hard-to-count in 2010.

Please don't hesitate to reach out with any census reporting-related questions. We recognize this spring is a challenging time for journalists, and we're here to make census reporting easier on you.