# Census workshop
The census is a popular secondary data source in sociology. In this workbook, we will use NHGIS to download census data and clean it in Stata.

<a id="https://www.nhgis.org/">NHGIS</a> is a data hub for census data over time. They have sorted the data and made it easy to download. If you use NHGIS data, you must cite them (the citation information is on their website). You need to create an account to use the site. Once you are logged in, click "Select Data" in the top right-hand corner.

For census data, there are two main datasets:
### Census
The Census reports data every ten years. It covers only a short list of items, mainly race, ethnicity, and whether a home is rented or owned. Note that how the Census defines race and ethnicity HAS CHANGED over time. If the Census has the variables you need, use it: it is a full count, so it is more reliable than the ACS, which is based on a sample.

### American Community Survey
The American Community Survey (ACS) is reported more frequently than the Census. The ACS also reports MORE information than the Census, including income, family structure, educational attainment, etc. The ACS comes in different waves of data, such as the 5-year wave and the 3-year wave. Generally, small areas such as census tracts and block groups are reported ONLY in the 5-year wave. States and more populous counties are also reported in the 3-year wave. Generally, the smaller the geographic area, the less reliable the estimates. For example, census tracts are more reliable than block groups.

There are four green tabs:
#### Geographic Levels
You can choose from a variety of levels of analysis, such as census tract, county, state, etc.
#### Years
You can pick years or waves.
#### Topics
There are many topics you can choose from.
#### Datasets

#### OR AND
You can combine your selections with the "or" and "and" options.

### An Example
I am going to build a county-level dataset from the 3-year wave for 2011-2013. We want race, ethnicity, and income data. Click the "Popularity" tab to see the most popular data. You can click a title under "table name" and it will give you a breakdown of the variable. Click the green plus sign to pick the data, then request the download. It takes time for the file to be prepared.

Once the file is ready, you can download it. You have to unzip the file.

Once the file is unzipped, you will find a .csv file (the data) and a .txt file (the codebook).

You must look at the codebook to figure out what each variable means. The geographic variables together make up a unique ID.

In [5]:
*set up the working directory
cd "C:\Users\acade\Documents\teaching\SOC 211 spring 2023"

*import the NHGIS csv
import delimited "C:\Users\acade\Documents\teaching\SOC 211 spring 2023\census workshop\nhgis0130_csv\nhgis0130_ds200_20133_county.csv"


C:\Users\acade\Documents\teaching\SOC 211 spring 2023

(encoding automatically selected: ISO-8859-1)
(73 vars, 1,910 obs)


## Make a unique identifier
The Census assigns a number to every geographic area. Every state, county, tract, and block has a number. You can put them together to make a unique identifier.

For this example, we are going to make a county FIPS code:
_ _ _ _ _
The first two digits are the state code.
The last three digits are the county code.
For example, Baldwin County, Alabama is state 01 and county 003, so its FIPS code is 01003.

In [9]:
desc

list in 1/2



Contains data
 Observations:         1,910                  
    Variables:            73                  
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
gisjoin         str8    %9s                   GISJOIN
year            str9    %9s                   YEAR
stusab          str2    %9s                   STUSAB
regiona         byte    %8.0g                 REGIONA
divisiona       byte    %8.0g                 DIVISIONA
state           str20   %20s                  STATE
statea          byte    %8.0g                 STATEA
county          str28   %28s                  COUNTY
countya         int     %8.0g                 COUNTYA
cousuba         byte    %8.0g                 COUSUBA
placea          byte    %8.0g                 PLACEA
aianhha         byte    %8.0g          

     |       . |       . |       . |       . |       .  |       .  |       .  |
     |---------+---------+---------+---------+----------+----------+----------|
     | td9e008 | td9e009 | td9e010 | td9e011 | td9e012  | td9e013  | td9e014  |
     |       . |       . |       . |       . |       .  |       .  |       .  |
     |---------+---------+---------+---------+----------+----------+----------|
     | td9e015 | td9e016 | td9e017 | td9e018 | td9e019  | td9e020  | td9e021  |
     |       . |       . |       . |       . |       .  |       .  |       .  |
     |------------------------------------------------------------------------|
     | txbe001  |                  name_m  | td9m001  |  td9m002  |  td9m003  |
     |   49381  | Baldwin County, Alabama  |       .  |        .  |        .  |
     |------------------------------------------------------------------------|
     | td9m004 | td9m005 | td9m006 | td9m007 | td9m008  | td9m009  | td9m010  |
     |       . |       . |       . |    

In [10]:
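*The lines below create a unique fips id
* fips = _ _ 	_ _ _
*		STATE	COUNTY
*pad the state code to two digits with leading zeros
gen state_str= string(statea,"%02.0f")

*pad the county code to three digits with leading zeros
gen county_str= string(countya,"%03.0f")

*join the two strings into one five-character id
egen fips_str= concat(state_str county_str)

*make a numeric copy of the id (leading zeros are dropped)
destring fips_str, generate(fips)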





fips_str: all characters numeric; fips generated as long
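
It is worth checking that the new identifier really is unique. The sketch below is not part of the NHGIS extract: it builds a small toy dataset with `input` (the state and county codes are real, but the three rows are chosen only for illustration), creates the FIPS code two ways, and then uses `isid` to confirm there are no duplicates.

```stata
* Toy data: numeric state and county codes, stored the way they are after import
clear
input byte statea int countya
1 1
1 3
6 37
end

* String version keeps the leading zeros (for example "01003")
gen fips_str = string(statea, "%02.0f") + string(countya, "%03.0f")

* Numeric version by arithmetic: state code * 1000 + county code
gen long fips = statea*1000 + countya

* isid stops with an error if fips does not uniquely identify the rows
isid fips

list statea countya fips_str fips
```

Keeping `fips_str` alongside the numeric `fips` can be handy if you later merge with a file that stores FIPS codes as text with leading zeros.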


### Generally, you want to use percentages

In [11]:
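*percent Latino (td9e001 is used as the total; check the codebook)
gen latper=100*(td9e012/td9e001)

*percent non-Latino Black
gen blk_ntlatper=100*(td9e004/td9e001)

*percent non-Latino white
gen wht_ntlatper=100*(td9e003/td9e001)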


(998 missing values generated)

(998 missing values generated)

(998 missing values generated)
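
A note like "(998 missing values generated)" means the percentage could not be computed for those rows. In Stata this happens when the numerator or denominator is missing, or when the denominator is zero. The sketch below shows this on a toy dataset (the variable names `pop_total` and `pop_latino` and all values are made up for illustration) and adds two simple checks you could run on your own percentages.

```stata
* Toy data: one normal row, one row with a zero total, one row with missing counts
clear
input long pop_total long pop_latino
1000 250
0 0
. .
end

* Same formula as above: a zero or missing denominator gives a missing result
gen latper = 100 * (pop_latino / pop_total)

* Count the rows where the percentage could not be computed
count if missing(latper)

* Every non-missing percentage should fall between 0 and 100
assert inrange(latper, 0, 100) if !missing(latper)

summarize latper
```

If the `assert` line fails on real data, the numerator and denominator probably do not come from the same table, so go back to the codebook.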


In [12]:
rename txbe001 medhhincome
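
After renaming, it can help to attach variable labels and save the cleaned file so you do not have to repeat the steps. The sketch below uses a two-row toy dataset; the second income value and the file name `census_county_clean.dta` are made up for illustration, and the label text assumes that `txbe001` is median household income, as the rename above suggests.

```stata
* Toy data standing in for the imported NHGIS file
clear
input long fips long txbe001
1001 40000
1003 49381
end

* Give the income variable a readable name and label
rename txbe001 medhhincome
label variable medhhincome "Median household income"
label variable fips "County FIPS code"

* Confirm the labels were applied
describe

* Save the cleaned data (hypothetical file name) in the working directory
save "census_county_clean.dta", replace
```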

## Practice making a dataset of your own