# Adding rurality measures to InfoGroup

> Merge different definitions of rurality.

## Outside of Urban Area

The Census Bureau's concept of Urban Area includes two urban categories: the more densely
populated Urbanized Area and the Urban Cluster. See the gazetteer for the details of the
definition. Urban Areas are not defined in terms of any other standard spatial unit. The
borders of an urban area are defined by the density of commuting patterns in the orbit of
urban cores of various population size.

InfoGroup does not include the code for the Urbanized Area or Urban Cluster in which an
establishment may be located. The Bureau does distribute a shapefile for Urban Areas. It
would therefore be possible in theory to test whether each establishment's coordinates fall
inside an Urban Area or outside all of them. However, this would be 1) an extremely
CPU-intensive process; and 2) probably irrelevant since we are concerned mostly with
InfoGroup establishments in rural areas.

However, because we do have an establishment's census tract code on the InfoGroup record, 
we can determine with just barely imperfect accuracy whether an establishment is
located in a census tract that is itself centered in an Urban Area or non-urban territory.

We have created a geo-reference file starting with shapefiles for Urban Areas and census
tracts. The centroid of each census tract was computed, and from that point and the
boundary coordinates of each urban area, the 'parental' urban area, if any, of each census
tract was determined. This was itself an extremely machine-intensive process.
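
The centroid test described above can be illustrated with a small pure-Python sketch. It is not the code used to build the geo-reference file, which worked from the Census shapefiles; the function names, the toy square polygons, and the code 'UA001' are all hypothetical and serve only to show the logic of assigning a tract to the urban area that contains its centroid.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def point_in_polygon(point, vertices):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does a horizontal ray running right from the point cross this edge?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def parental_urban_area(tract_vertices, urban_areas):
    """Code of the first urban area containing the tract centroid, else None."""
    centroid = polygon_centroid(tract_vertices)
    for ua_code, ua_vertices in urban_areas.items():
        if point_in_polygon(centroid, ua_vertices):
            return ua_code
    return None


# Hypothetical toy geometry: one square urban area and two rectangular tracts.
urban_areas = {"UA001": [(0, 0), (10, 0), (10, 10), (0, 10)]}
tract_straddling = [(7, 4), (11, 4), (11, 6), (7, 6)]   # partly outside, centroid inside
tract_outside = [(12, 0), (16, 0), (16, 4), (12, 4)]    # entirely outside

print(parental_urban_area(tract_straddling, urban_areas))  # UA001
print(parental_urban_area(tract_outside, urban_areas))     # None
```

The straddling tract shows the source of the small inaccuracy: part of it lies outside the urban area, but every establishment in it inherits the urban area found at the centroid.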

### rural_outside_UA, UA Code, UA Type

The 'rural_outside_UA' variable identifies tracts that are not located within a Census Bureau 
Urban Area. More precisely, if the spatial centroid of the InfoGroup establishment's census 
tract is not located within the polygon of coordinates that defines an Urban Area, the 
establishment is considered 'rural' and coded '1'. In the 2017 file, 3,596,102 establishments,
24.5% of the total, were flagged 'rural' by this measure.

For each 'urban' establishment (coded '0' on 'rural_outside_UA') we also record the Census
code for its 'parental' Urban Area ('UA Code') and the code for that urban area's type
('UA Type'): 'U' = Urbanized Area, 'C' = Urban Cluster.

The accuracy of these three variables is 'just barely imperfect' because a census
tract can overlap multiple urban (or non-urban) areas. It is therefore not necessarily 
true that the urban area pinpointed by the centroid of the census tract is the one in which 
the InfoGroup establishment itself is actually located, although in nearly every case it 
would be.

Our locally processed geo-reference file 
('/InfoGroup/data/rurality/reference/geographical/points-in-polygons/data/all_tracts.csv')
consists of one record per 2010 census tract, with the following variables:
    'STATEFP', 'COUNTYFP', 'TRACTCE', 'GEOID', 'NAME', 'NAMELSAD', 'MTFCC',
    'FUNCSTAT', 'ALAND', 'AWATER', 'INTPTLAT', 'INTPTLON', 'geometry',
    'UA_GEOID10', 'UATYP10', 'rural_tract'
'STATEFP' through 'geometry' are simply taken from the Census Bureau's shapefile.
'GEOID' is the file's 11-digit census tract identifier. 'UA_GEOID10' and 'UATYP10' are the
Urban Area identifier and the Urban Area type code taken from the Urban Area shapefile.
'rural_tract' is the laboriously computed rurality flag for each census tract. It is renamed
to 'rural_outside_UA' in the InfoGroup record to distinguish it from the other indicator
variables to be created in step 3.

Because this file was created at an earlier time, adding its last three variables to the
InfoGroup record is simply a matter of a pandas dataframe merge, where 'df' is the InfoGroup
dataframe and 'tract_df' is the dataframe created from 'all_tracts.csv'.

Urban Areas and census tracts are defined by entirely different criteria. Even though
census tracts are on average much smaller than urban areas, each can overlap several of the
other. The purpose here is to identify 'rural' census tracts, defined as those whose
centroid point does not fall within any urban area. A census tract has a 'parental' urban
area if its centroid point does fall within an urban area, either an urbanized area or a
smaller urban cluster.

The all_tracts.csv file contains one record per census tract and the identifying information 
for the single 'parental' urban area, if there is one. The records for rural tracts are 
coded '1' in the 'rural_tract' variable, which indicates a missing value for the Urban
Area identifier, 'UA_GEOID10'. 'UATYP10' identifies the type of the parental urban area: 
'U' = urbanized area, 'C' = urban cluster.

This file is the source data for the 'rural_outside_UA' variable added in this step to the
basic InfoGroup extract created in step 1. It is also the source for the 'UA Code' and
'UA Type' variables, understood to apply to the 'parental' urban area. Since a census
tract can overlap multiple urban areas, it is not necessarily true that the urban area
identified by the 'UA Code' and 'UA Type' variables is the one in which the InfoGroup
establishment itself is actually located, though in nearly every case it would be.

It would be possible to locate each InfoGroup record in an urban area by computing whether
the establishment's spatial coordinates lie within the polygon of coordinates specified in
the urban area shapefile. However, our focus is on the rural economy and that computation,
for all establishments over two decades, would consume an extraordinary amount of calendar 
time and computational resources.

In [None]:
import pandas as pd
import geopandas as gpd

tract_df = pd.read_csv('/InfoGroup/data/rurality/reference/geographical/points-in-polygons/data/all_tracts.csv',
                 dtype=object)
tract_gdf = gpd.GeoDataFrame(tract_df)

In [None]:
# Keep only the tract identifier and the three urban-area variables, then rename them.
tract_df = pd.DataFrame(tract_gdf[['GEOID','UA_GEOID10','UATYP10','rural_tract']],dtype=object)
tract_df.rename(columns={'UA_GEOID10':'UA Code','UATYP10':'UA Type',
                         'rural_tract':'rural_outside_UA'},inplace=True)

In [None]:
yr = 2017
df = pd.read_csv(f'/InfoGroup/data/rurality/step1_{yr}.csv',dtype=object)

In [None]:
merged = df.merge(tract_df,how='inner',left_on='Full Census Tract',right_on='GEOID',indicator=True)
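
Because the merge above is an inner merge, any establishment whose tract code has no match in the tract file is dropped without notice. The sketch below shows one way to count such records first, using a left merge with `indicator=True`. The two small dataframes and the 'Company' column are hypothetical stand-ins for the real files, and the tract and urban area codes are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the InfoGroup extract and the tract reference file.
df = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Full Census Tract": ["01001020100", "01001020200", "99999999999"],
})
tract_df = pd.DataFrame({
    "GEOID": ["01001020100", "01001020200"],
    "UA Code": ["58600", None],
    "UA Type": ["U", None],
    "rural_outside_UA": ["0", "1"],
})

# A left merge keeps every establishment; '_merge' marks rows with no tract match.
audit = df.merge(tract_df, how="left", left_on="Full Census Tract",
                 right_on="GEOID", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]
print(len(unmatched), "establishment(s) would be dropped by an inner merge")

# The inner merge used in this step keeps only the matched rows.
inner = df.merge(tract_df, how="inner", left_on="Full Census Tract", right_on="GEOID")
print(len(inner), "establishment(s) kept")
```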

## ERS Measures of Urban Spatial Effect

The ERS's three measures of urban influence and spatial effect are the Urban Influence codes,
the Rural-Urban Continuum codes, and the Rural-Urban Commuting Area codes. These measures
are applied to the InfoGroup record by simple pandas merges as described below. The source
files are downloaded as Excel spreadsheets, converted to csv text files, and finally
read into pandas dataframes.

The ERS has created files of all three codes for a variety of years. The unit of analysis
for the Urban Influence codes and the Rural-Urban Continuum codes is the county. For the
Rural-Urban Commuting Area codes the unit of analysis is the census tract.
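
The two units of analysis correspond to two merge keys. An 11-digit tract GEOID is the concatenation of a 2-digit state code, a 3-digit county code, and a 6-digit tract code, so its first five digits form the county-level FIPS code. The sketch below illustrates the relationship; the function name is hypothetical, and it assumes that InfoGroup's 'FIPS Code' is the 5-digit state-plus-county string.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract parts."""
    # Restore leading zeros in case the code was read as a number.
    geoid = str(geoid).zfill(11)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    return {
        "state_fips": geoid[:2],
        "county_fips": geoid[:5],   # state + county: the county-level key
        "tract_code": geoid[5:],
    }


parts = split_tract_geoid("01001020100")
print(parts["county_fips"], parts["tract_code"])  # 01001 020100
```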

### UI_CODE

See https://www.ers.usda.gov/data-products/urban-influence-codes/.
Urban Influence codes "form a classification scheme that distinguishes metropolitan counties
by population size of their metro area, and nonmetropolitan counties by size of the largest 
city or town and proximity to metro and micropolitan areas. The standard Office of Management 
and Budget (OMB) metro and nonmetro categories have been subdivided into two metro and 10 
nonmetro categories, resulting in a 12-part county classification."

There are separate ERS collections of Urban Influence codes for 1974, 1983, 1993, 2001,
and 2013. Once a single year has been chosen as appropriate for the year or years of
InfoGroup data, the command to apply UI_CODE to the InfoGroup record, where 'df' is the
dataframe of InfoGroup data and 'ui_df' is the dataframe of Urban Influence data, is:
    merged = df.merge(ui_df,how='inner',left_on='FIPS Code',right_on='FIPS')

### RUC_CODE

See https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/.
Rural-Urban Continuum codes "form a classification scheme that distinguishes metropolitan 
counties by the population size of their metro area, and nonmetropolitan counties by degree 
of urbanization and adjacency to a metro area. The official Office of Management and Budget 
(OMB) metro and nonmetro categories have been subdivided into three metro and six nonmetro 
categories. Each county in the U.S. is assigned one of the 9 codes."

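There are separate ERS collections of Rural-Urban Continuum codes for 1993, 2003, and 2013.
Once a single year of RUC data has been chosen, the command to apply RUC_CODE to the
InfoGroup record, where 'df' is the dataframe of InfoGroup data and 'ruc_df' is the
dataframe of Rural-Urban Continuum data, is: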
    merged = df.merge(ruc_df,how='inner',left_on='FIPS Code',right_on='FIPS')

### RUCA_CODE

See https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/.
Rural-Urban Commuting Area codes "classify U.S. census tracts using measures of population 
density, urbanization, and daily commuting....The classification contains two levels. Whole 
numbers (1-10) delineate metropolitan, micropolitan, small town, and rural commuting areas 
based on the size and direction of the primary (largest) commuting flows. These 10 codes are 
further subdivided based on secondary commuting flows, providing flexibility in combining 
levels to meet varying definitional needs and preferences."

There are separate ERS collections of Rural-Urban Commuting Area codes for 1990, 2000, 
and 2010. The three years of RUCA codes "are not directly comparable because many census 
tracts are reconfigured during each decade. Also, changes to census methodologies 
significantly affected the RUCA classifications."
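
Since the classification has two levels, an analysis that needs only the ten primary categories can collapse each code to its whole-number part. The sketch below assumes the codes are stored as strings such as '4.1' or '10'; the actual layout of the RUCA_CODE column in ruca.csv is not shown here, and the function name is hypothetical.

```python
def ruca_primary(code):
    """Return the whole-number (primary) part of a RUCA code such as '4.1'."""
    return int(str(code).strip().split(".")[0])


for code in ["1", "4.1", "10.3"]:
    print(code, "->", ruca_primary(code))
```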


To apply any of the three codes, first choose the appropriate year, then match as below:
    
1. Match 'FIPS' in /ers/ui/ui.csv to 'FIPS Code' (county level) in InfoGroup. 
'UI_YEAR' in ui.csv has the values [1974,1983,1993,2001,2013].

2. Match 'FIPS' in /ers/ruc/ruc.csv to 'FIPS Code' in InfoGroup.
'RUC_YEAR' in ruc.csv has the values [1993,2003,2013].

3. Match 'FIPS' in /ers/ruca/ruca.csv to 'Full Census Tract' in InfoGroup.
'YEAR' in ruca.csv has the values [1990,2000,2010].

For example:

In [None]:
import pandas as pd

df = pd.read_csv('/InfoGroup/data/rurality/step2_2017.csv',dtype=object)

In [None]:
# get_ui_df() is a helper defined elsewhere that loads /ers/ui/ui.csv.
ui_df = get_ui_df()
# Chain dropna() on the column subset to avoid modifying a slice in place.
ui_df = ui_df[['UI_YEAR','UI_CODE','FIPS']].dropna()
ui_df = ui_df[ui_df['UI_YEAR'] == 2013]
merged = df.merge(ui_df,how='inner',left_on='FIPS Code',right_on='FIPS',indicator=True)
df = merged.drop(columns=['UI_YEAR','FIPS','_merge'])

In [None]:
# get_ruc_df() is a helper defined elsewhere that loads /ers/ruc/ruc.csv.
ruc_df = get_ruc_df()
# Chain dropna() on the column subset to avoid modifying a slice in place.
ruc_df = ruc_df[['RUC_YEAR','RUC_CODE','FIPS']].dropna()
ruc_df = ruc_df[ruc_df['RUC_YEAR'] == 2013]
merged = df.merge(ruc_df,how='inner',left_on='FIPS Code',right_on='FIPS',indicator=True)
df = merged.drop(columns=['RUC_YEAR','FIPS','_merge'])

In [None]:
# get_ruca_df() is a helper defined elsewhere that loads /ers/ruca/ruca.csv.
ruca_df = get_ruca_df()
# Chain dropna() on the column subset to avoid modifying a slice in place.
ruca_df = ruca_df[['YEAR','RUCA_CODE','FIPS']].dropna()
ruca_df = ruca_df[ruca_df['YEAR'] == 2010]
merged = df.merge(ruca_df,how='inner',left_on='Full Census Tract',right_on='FIPS',indicator=True)
df = merged.drop(columns=['YEAR','FIPS','_merge'])

In [None]:
df.to_csv('/InfoGroup/data/rurality/step2_2017.csv',index=None)
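
The three merge cells above follow the same pattern: select one year, keep the key and the code column, merge, and drop the helper columns. The sketch below folds that pattern into one function. The name `add_ers_code` and the toy data are hypothetical, the code values are made up, and the year column is assumed to hold integers as in the comparisons above.

```python
import pandas as pd


def add_ers_code(df, code_df, year_col, code_col, year, left_key, right_key="FIPS"):
    """Attach one ERS code column for a single year to the InfoGroup frame.

    Assumes left_key and right_key are different column names.
    """
    # Keep only the chosen year, the merge key, and the code itself.
    subset = code_df.loc[code_df[year_col] == year, [right_key, code_col]].dropna()
    merged = df.merge(subset, how="inner", left_on=left_key, right_on=right_key)
    return merged.drop(columns=[right_key])


# Hypothetical toy data standing in for the InfoGroup extract and ui.csv.
df = pd.DataFrame({"FIPS Code": ["01001", "01003"], "Company": ["A", "B"]})
ui_df = pd.DataFrame({
    "FIPS": ["01001", "01003", "01001"],
    "UI_YEAR": [2013, 2013, 2001],
    "UI_CODE": ["2", "4", "3"],
})

out = add_ers_code(df, ui_df, "UI_YEAR", "UI_CODE", 2013, "FIPS Code")
print(out)
```

The same call would serve the tract-level RUCA codes by passing 'Full Census Tract' as `left_key`.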

## rural_HRSA

Like the 'rural_outside_UA' variable created in step 2, this variable is a 1/0 flag
indicating rurality at the census tract level.

HRSA refers to the Health Resources and Services Administration. It is particularly its
sub-unit, the Federal Office of Rural Health Policy (FORHP), that is responsible for this
definition of rurality. For its own administrative purposes it considers a census tract to
be rural if it is contained within a county that is not part of a CBSA. To these, they add
2,302 census tracts from CBSA counties that they have specially defined as rural by applying
the RUCA criteria, of which the FORHP was actually a developer in its early phase.
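
The rule just described reduces to a simple test on each tract: it is rural if its county lies outside any CBSA, or if it appears on the FORHP list of rural tracts within CBSA counties. The sketch below states that rule directly. The function name and the tract codes are hypothetical and chosen only for illustration.

```python
def is_rural_hrsa(tract, county_in_cbsa, forhp_rural_tracts):
    """1 if the tract's county is outside any CBSA or the tract is on the FORHP list."""
    return 1 if (not county_in_cbsa) or (tract in forhp_rural_tracts) else 0


# Hypothetical tract codes, for illustration only.
forhp_rural_tracts = {"06071010300"}
examples = [
    ("30001000100", False),  # county outside any CBSA
    ("06071010300", True),   # CBSA county, but tract is on the FORHP list
    ("06071000100", True),   # CBSA county, tract not on the list
]
flags = [is_rural_hrsa(t, in_cbsa, forhp_rural_tracts) for t, in_cbsa in examples]
print(flags)  # [1, 1, 0]
```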

In the 2017 file, 1,277,342 establishments, 8.7% of the total, were flagged 'rural' by this 
measure, about 1/3 the incidence of rurality measured by the 'rural_outside_UA' variable.

## FAR Level

The USDA writes: “To assist in providing policy-relevant information about conditions in 
sparsely-settled, remote areas of the U.S. to public officials, researchers, and the general 
public, ERS has developed ZIP-code-level frontier and remote area (FAR) codes”.

FAR codes are applied to postal zip codes to identify different degrees and criteria of
remoteness. They do not encode any functional concept of rurality, but there is an obvious
family resemblance between “remote” and “rural” which might find some analytical use.

The ERS created four FAR levels based on proximity (conceived of as travel time) to 
“urban” places of different sizes. Levels 1 through 4 measure increasing remoteness.
The ‘FAR Level’ variable captures the highest numbered positive FAR level for a location.
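
As a compact restatement of that rule, the sketch below takes the four 0/1 level flags for a zip code and returns the highest level that is set, or 0 if none is. The function name is hypothetical, and the flags are assumed to be integers, one per FAR level.

```python
def far_level(far1, far2, far3, far4):
    """Highest-numbered FAR level whose flag is 1, or 0 if no flag is set."""
    flags = (far1, far2, far3, far4)
    positive = [level for level, flag in enumerate(flags, start=1) if flag == 1]
    return max(positive, default=0)


print(far_level(0, 0, 0, 0))  # 0
print(far_level(1, 1, 0, 0))  # 2
print(far_level(1, 1, 1, 1))  # 4
```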

In 2017, 659,070 InfoGroup establishments, 4.56% of the total, were located in zip codes
designated frontier or remote.

In [None]:
import numpy as np
from datetime import datetime

def rurality(df):
    all_rural_tracts = compile_rural_tracts()
    showtime('\trurality (compile_rural_tracts)')
    df['rural_HRSA'] = df['Full Census Tract'].apply(lambda x: 1 if x in all_rural_tracts else 0)
    showtime('\trurality (rural_HRSA)')
    print(df['rural_HRSA'].value_counts(),file=logfile)
    print(df['rural_HRSA'].value_counts(normalize=True) * 100,file=logfile)

    # Merge with FAR data
    # FAR codes apply only to the continental states.
    merged = df.merge(df_zip,how='left',left_on='ZipCode',right_on='ZIP',indicator=True)
    showtime('\trurality (FAR)')
    merged.drop(columns=['ZIP','far1','far2','far3','far4','_merge'],inplace=True)
    return merged

def rural_in_CBSA():
    """ Get 'Full Census Tract' of every IG record with census tract not in a CBSA """
    list_list = df[['CBSA Level','Full Census Tract']].values.tolist()
    rurtracts = [x[1] for x in list_list if x[0] not in ['1','2']]
    return rurtracts

def compile_rural_tracts():
    """ Construct a set of census tracts by adding those from non-CBSA
        InfoGroup establishments to those defined by the FORHP as rural
        tracts within CBSA counties. """
    rurtracts = rural_in_CBSA()
    print('not in CBSA:',len(rurtracts),file=logfile)
    print('hrsa_rural_tracts:',len(hrsa_rural_tracts),file=logfile)
    # The set union removes duplicates within and across the two lists.
    rurtracts = set(rurtracts) | set(hrsa_rural_tracts)
    print('all tracts deduped:',len(rurtracts),file=logfile)
    return rurtracts

def farlevel(row):
    if sum([row['far1'],row['far2'],row['far3'],row['far4']]) == 0:
        return '0'
    elif row['far1'] == 1 and sum([row['far2'],row['far3'],row['far4']]) == 0:
        return '1'
    elif row['far2'] == 1 and sum([row['far3'],row['far4']]) == 0:
        return '2'
    elif row['far3'] == 1 and row['far4'] == 0:
        return '3'
    elif row['far4'] == 1:
        return '4'
    else:
        return np.nan
    
def showtime(label):
    """ Print a progress label followed by the current date and time """
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
    print(str(label),'  ',dt_string)

In [None]:
# FORHP list of 2300+ rural census tracts
hrsa_rural_tracts = []
# This is a pre-processed text version of a former PDF file.
with open('/InfoGroup/data/rurality/tract_data.txt','r') as fin:
    for line in fin:
        if line[0] != chr(32):
            continue
        else:
            line = line.strip()
            try:
                if line[0].isnumeric(): 
                    hrsa_rural_tracts.append(line)
            except IndexError:
                pass

# ERS: Frontier and Remote (FAR) codes by ZIP code
far_file = '/InfoGroup/data/rurality/FARcodesZIPdata2010WithAKandHI.csv'
df_far = pd.read_csv(far_file,dtype=object)
# Restore leading zeros on ZIP codes shorter than five digits.
# (The earlier test `len(x) < 5 == 0` is a chained comparison that is always False.)
df_far['ZIP'] = df_far['ZIP'].str.zfill(5)
df_zip = df_far[['ZIP','far1','far2','far3','far4']].copy()
df_zip[['far1','far2','far3','far4']] = df_zip[['far1','far2','far3','far4']].astype(int)
df_zip['FAR Level'] = df_zip.apply(farlevel,axis=1)
df_zip = df_zip.drop_duplicates()

In [None]:
# open log file
logfile = open('/InfoGroup/data/rurality/logs/step3.log','w')

In [None]:
for yr in range(2017,2018):
    showtime('start')
    df = pd.read_csv(f'/InfoGroup/data/rurality/step2_{yr}.csv',dtype=object)
    showtime('create dataframe')
    df = rurality(df)
    showtime('rurality function')
    df.to_csv(f'/InfoGroup/data/rurality/InfoGroup_{yr}_step3.csv',index=None)
    showtime('finished')

In [None]:
logfile.close()