Scotchester April 2017 update
- Surname list from 2010 Census (see README addendum
  for more information)

- Updated master surname list creation script to also use
  the new 2010 Census surname list
- Updated surname parsing script to account for last names
  that begin with “O” or “D” followed by a space
Latest commit e54f1ed Apr 14, 2017


In conducting fair lending analysis in both supervisory and enforcement contexts, the Bureau’s Office of Research (OR) and Division of Supervision, Enforcement, and Fair Lending (SEFL) rely on a Bayesian Improved Surname Geocoding (BISG) proxy method. The method combines geography- and surname-based information into a single proxy probability for race and ethnicity, and it is used in fair lending analysis conducted for non-mortgage products. This document describes the steps needed to build the BISG proxies.

The methodology described here is an example of a proxy methodology that OR and SEFL use, although we may alter this methodology in particular analyses, depending on the circumstances involved. In addition, the proxy method may be revised as we become aware of enhancements that would increase accuracy and performance. For more details, see “Using Publicly Available Information to Proxy for Unidentified Race and Ethnicity: A Methodology and Assessment”.

Included are a series of Stata scripts and subroutines that prepare the publicly available census geography and surname data and that construct the surname-only, geography-only, and BISG proxies for race and ethnicity. The scripts, subroutines, and data provided here do not contain directly identifiable personal information or other confidential information, such as confidential supervisory information.

Please note that all scripts and subroutines are written for execution in Stata 12 on a Linux platform and may need to be modified for other environments. Users must define a number of parameters, including file paths and arguments for subroutines. The scripts that define the subroutines also identify and describe arguments, as required.

Users must supply their own application- or individual-level data, and any geocoding of those data must occur prior to the execution of the script sequence: this code assumes that the input application- or individual-level data are already geocoded with census block group, census tract, and 5-digit ZIP code.

However, an example is included to guide the user through executing the proxy building code sequence. It relies on a set of fictitious data constructed from the publicly available census surname list and geography data. It is provided to illustrate how the control script is set up to run the proxy building code and does not reflect any particular individual’s or institution’s information.

A control script, /scripts/, is included to step through the process below. The user will need to change paths and define parameters as required.

  1. Geocode the data in a geocoding software package (for example, ArcGIS) to obtain tract and block group identifiers for each record.
  2. Build name and geography proxies from Census files included in /input_files:
    1. Census surname list:
      1. /scripts/: takes a .csv file of census surnames, converts the surnames to all lower case, and imputes any suppressed values. File created:
        1. /input_files/created/census_surnames_lower.dta
      2. Preparing the user-defined datasets for use with the Census surname list requires basic cleaning of surnames using regular expressions and other forms of name standardization. This script exists at: /scripts/. File created in user-defined directory:
        1. `dir'/proxy_name.dta
    2. Census geographies:
      1. /scripts/: uses the base information, for individuals age 18 and older, from the Census flat files for block group, tract, and ZIP code¹ and allocates "Some Other Race"² to each group in proportion. It creates three files (one each for block group, tract, and ZIP code) with geography-based probabilities for use in the proxy:
        1. /input_files/created/blkgrp_attr_over18.dta
        2. /input_files/created/tract_attr_over18.dta
        3. /input_files/created/zip_attr_over18.dta
  3. Calculate the BISG probabilities following the methodology described in “Using Publicly Available Information to Proxy for Unidentified Race and Ethnicity: A Methodology and Assessment”.
    1. /scripts/: this program creates three files (one each for block group, tract, and ZIP code) with BISG probabilities in user-defined directory:
      1. /`maindir'/`inst_name'_proxied_blkgrp.dta
      2. /`maindir'/`inst_name'_proxied_tract.dta
      3. /`maindir'/`inst_name'_proxied_zip.dta
  4. The final step is to merge together the block group, tract, and ZIP code-based BISG proxies and choose the most precise proxy given the precision of geocoding, e.g. block group (if available), then tract (if available), or ZIP code (if block group and tract unavailable) using:
    1. /scripts/. File created in user-defined directory:
      1. /`maindir'/`inst_name'_`file'proxied_final.dta
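
The selection rule in step 4 can be illustrated with a short sketch. This is written in Python for illustration only; the repository's scripts are in Stata, and the record layout, level names, and group labels below are hypothetical rather than taken from those scripts.

```python
from typing import Optional

# Geography levels ordered from most to least precise, following step 4.
PRECISION_ORDER = ("blkgrp", "tract", "zip")


def choose_most_precise(proxies: dict) -> Optional[tuple]:
    """Return (level, probabilities) for the most precise level with a proxy.

    `proxies` maps a geography level to a dict of proxy probabilities,
    or to None when the record could not be matched at that level.
    """
    for level in PRECISION_ORDER:
        probs = proxies.get(level)
        if probs is not None:
            return level, probs
    return None  # no geography level produced a proxy


# Hypothetical record: no block group match, but tract and ZIP proxies exist.
record = {
    "blkgrp": None,
    "tract": {"group_a": 0.70, "group_b": 0.30},
    "zip": {"group_a": 0.60, "group_b": 0.40},
}
level, probs = choose_most_precise(record)
print(level, probs)
```

Because the loop walks the levels in order of precision, a record falls back to the ZIP code proxy only when neither a block group nor a tract proxy is available.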

Please direct all questions, comments, and suggestions to:

1 When referring to ZIP code demographics, we match the institution-based ZIP code information to ZIP Code Tabulation Areas (ZCTAs) as defined by the U.S. Census Bureau.

2 In the 2010 SF1, the U.S. Census Bureau produced tabulations that report counts of Hispanics and non-Hispanics by race. These tabulations include a “Some Other Race” category. We reallocate the “Some Other Race” counts to each of the remaining six race and ethnicity categories using an Iterative Proportional Fitting procedure to make geography based demographic categories consistent with those on the census surname list.
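
As a rough illustration of allocating "Some Other Race" counts "in proportion", the Python sketch below performs a single proportional reallocation within one geographic area. It is a simplification under stated assumptions: it is not the Iterative Proportional Fitting procedure itself, it is not the repository's Stata code, and the category labels and counts are hypothetical.

```python
def reallocate_other(counts, other_key="some_other_race"):
    """Spread the `other_key` count across the remaining categories
    in proportion to each category's existing count."""
    other = counts.get(other_key, 0)
    base = {k: v for k, v in counts.items() if k != other_key}
    total = sum(base.values())
    if total == 0:
        # No proportions to allocate against; return the base counts unchanged.
        return dict(base)
    return {k: v + other * v / total for k, v in base.items()}


# Hypothetical counts for one geographic area.
counts = {"group_a": 60, "group_b": 30, "group_c": 10, "some_other_race": 20}
adjusted = reallocate_other(counts)
print(adjusted)
```

The total population of the area is preserved: the reallocated counts sum to the original total, including the "Some Other Race" count.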

Update to proxy methodology – April 2017

In the summer 2014 edition of Supervisory Highlights,³ the Bureau previously reported that examination teams use a Bayesian Improved Surname Geocoding (BISG) proxy methodology for race and ethnicity in their fair lending analysis of non-mortgage credit products. The BISG methodology relies on the distribution of race and ethnicity based on place-of-residence and surname, which are publicly available information from Census. The method involves constructing a probability of assignment to race and ethnicity based on demographic information associated with surname and then updating this probability using the demographic characteristics of the census block group associated with place of residence. The updating is performed through the application of a Bayesian algorithm, which yields an integrated probability that can be used to proxy for an individual’s race and ethnicity.⁴
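
The Bayesian updating step can be sketched as follows. This Python example assumes the commonly described BISG formulation, in which the surname-based probability for each group is multiplied by the share of that group's population living in the individual's geographic area and the products are then normalized to sum to one. The function name, group labels, and numbers are hypothetical, and the repository's Stata scripts may differ in their details.

```python
def bisg_update(surname_probs, geo_shares):
    """Combine surname and geography information with Bayes' rule.

    surname_probs[r]: probability of group r given the surname.
    geo_shares[r]: share of group r's total population that lives
        in the individual's geographic area.
    """
    weighted = {r: surname_probs[r] * geo_shares[r] for r in surname_probs}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no group has positive probability under both inputs")
    # Normalize so the posterior probabilities sum to one.
    return {r: w / total for r, w in weighted.items()}


# Hypothetical inputs for two groups.
surname_probs = {"group_a": 0.6, "group_b": 0.4}
geo_shares = {"group_a": 0.001, "group_b": 0.003}
posterior = bisg_update(surname_probs, geo_shares)
print(posterior)
```

In this toy case the surname alone favors group_a, but the geography information is three times more favorable to group_b, so the updated probability shifts toward group_b.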

Through March of 2017, examination teams had relied on the surname list derived from the 2000 Decennial Census of the Population in their construction of the BISG proxy for race and ethnicity.⁵ In December 2016, the U.S. Census Bureau released a list of the most frequently occurring surnames based on data derived from the 2010 Decennial Census of the Population. The updated 2010 list generally uses the same definitions and formats as the list based on the 2000 Census but includes updated values for total counts and race and ethnicity shares associated with each surname.⁶ In total, the new surname list provides information on the 162,253 surnames that appear at least 100 times in the 2010 Census, covering approximately 90% of the population.⁷ While 146,516 names appear on both the 2000 and 2010 surname lists, the 2010 list contains 15,737 names that do not appear on the 2000 list, whereas the 2000 list contains 5,155 names that do not appear on the 2010 list.⁸

As of April 2017, examination teams are relying on an updated proxy methodology that reflects the newly available surname data from the Census Bureau. Our updated proxy methodology relies on the race and ethnicity shares for the 162,253 names that appear on the 2010 list and supplements this list with the race and ethnicity shares for the 5,155 names that appear on the 2000 list but not on the 2010 list, resulting in a list of 167,408 surnames in total.⁹
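
The way the two lists are combined can be illustrated with a minimal Python sketch: every name on the 2010 list keeps its 2010 shares, and names found only on the 2000 list are added with their 2000 shares. The surnames and share values below are made up for illustration, and the repository's own list-building script is written in Stata.

```python
def combine_surname_lists(list_2010, list_2000):
    """Start from the 2010 list and add names that appear only on the 2000 list."""
    combined = dict(list_2010)
    for name, shares in list_2000.items():
        # setdefault keeps the 2010 entry when a name is on both lists.
        combined.setdefault(name, shares)
    return combined


# Hypothetical lists: "name_b" is on both, "name_c" is only on the 2000 list.
list_2010 = {"name_a": {"group_a": 0.9}, "name_b": {"group_a": 0.5}}
list_2000 = {"name_b": {"group_a": 0.4}, "name_c": {"group_a": 0.2}}
combined = combine_surname_lists(list_2010, list_2000)
print(sorted(combined))
```

The size of the combined list equals the number of names on the 2010 list plus the number of names that appear only on the 2000 list.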

The updated name list, statistical software code written in Stata, and other publicly available data used to build the BISG proxy are now available in this repository.

Please direct all questions, comments, and suggestions to:

3 See Consumer Financial Protection Bureau, Supervisory Highlights: Summer 2014 (Sept. 17, 2014).

4 For more information on the methodology, see Consumer Financial Protection Bureau, Using publicly available information to proxy for unidentified race and ethnicity (Sept. 2014).

5 See id.

6 For more details on the updated 2010 surname list, including revisions to the 2000 methodology and programming, see Joshua Comenetz, Frequently Occurring Surnames in the 2010 Census (Oct. 2016).

7 The surname data are available on the Census Bureau’s website; see Frequently Occurring Surnames from the 2010 Census (last revised Dec. 27, 2016).

8 Names must appear at least 100 times in the 2010 Decennial Census in order to be included on the surname list.

9 Although these names are not on the 2010 list, and thus likely no longer meet the 100-name threshold, we chose to include them so as to incorporate as much available surname information as possible into the proxy.