In [83]:
library(tidyverse)

### HPI dataset - House Price Index

From https://www.fhfa.gov/DataTools/Downloads/Pages/Public-Use-Databases.aspx

FHFA dataset of house price indexes at the national, census division, state and MSA levels

Notes:
- To combine this with zip-level data, will need to either convert metro area place names to zip codes or find a zip-code-to-MSA mapping
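
One possible way to handle the zip code mapping is a crosswalk table joined onto the MSA rows by MSA code. The following is a minimal sketch only: `zip_cbsa` and its columns `zip` and `cbsa` are hypothetical placeholders, and its rows are illustrative values rather than a real crosswalk. HUD publishes ZIP-to-CBSA crosswalk files that could fill this role, but their actual column names would need checking.

```r
# Toy stand-in for the MSA rows of the HPI table
# (values taken from the Abilene, TX rows of this dataset)
hpi_msa <- data.frame(
  place_name = c("Abilene, TX", "Abilene, TX"),
  place_id   = c("10180", "10180"),
  yr         = c(1986L, 1986L),
  period     = c(2L, 3L),
  index_nsa  = c(108.26, 107.79),
  stringsAsFactors = FALSE
)

# Hypothetical crosswalk: one row per (zip, MSA code) pair; illustrative values only
zip_cbsa <- data.frame(
  zip  = c("79601", "79602"),
  cbsa = c("10180", "10180"),
  stringsAsFactors = FALSE
)

# Attach every zip code in the MSA to each index observation.
# Both keys are kept as character so that codes compare exactly.
hpi_by_zip <- merge(hpi_msa, zip_cbsa, by.x = "place_id", by.y = "cbsa")

nrow(hpi_by_zip)
```

Each index observation is repeated once per zip code in the MSA, so the result grows quickly on the full table; filtering to the years of interest first keeps it manageable.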

In [84]:
tbl = read.csv('../../Data/HPI_master.csv', stringsAsFactors = FALSE, header = TRUE)

In [4]:
head(tbl)
head(tbl %>% filter(level == 'MSA'))

Unnamed: 0_level_0,hpi_type,hpi_flavor,frequency,level,place_name,place_id,yr,period,index_nsa,index_sa
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<dbl>,<dbl>
1,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,1,100.0,100.0
2,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,2,100.93,101.0
3,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,3,101.31,100.92
4,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,4,101.71,101.0
5,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,5,102.32,101.36
6,traditional,purchase-only,monthly,USA or Census Division,East North Central Division,DV_ENC,1991,6,102.79,101.51


Unnamed: 0_level_0,hpi_type,hpi_flavor,frequency,level,place_name,place_id,yr,period,index_nsa,index_sa
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<dbl>,<dbl>
1,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1986,2,108.26,
2,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1986,3,107.79,
3,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1986,4,94.57,
4,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1987,1,101.17,
5,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1987,2,100.4,
6,traditional,all-transactions,quarterly,MSA,"Abilene, TX",10180,1987,3,93.77,


In [85]:
unique((tbl %>% filter(level == 'MSA'))$yr)

In [8]:
dim(tbl)
unique(tbl$hpi_type)
unique(tbl$hpi_flavor)
unique(tbl$level)
unique(tbl$frequency)

In [10]:
length(unique((tbl %>% filter(level == 'MSA'))$place_name))

### NCEI dataset of weather and climate disasters, 1980-2021

Downloaded from: https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.nodc:0209268

Note: this is a small dataset, with only a little over 300 rows

In [11]:
tbl = read.csv('../../Data/NCEI_climate_disasters.csv', stringsAsFactors = FALSE, header = TRUE)

In [12]:
head(tbl)

Unnamed: 0_level_0,Name,Disaster,Begin.Date,End.Date,Total.CPI.Adjusted.Cost..Millions.of.Dollars.,Deaths
Unnamed: 0_level_1,<chr>,<chr>,<int>,<int>,<dbl>,<int>
1,Southern Severe Storms and Flooding (April 1980),Flooding,19800410,19800417,2445.4,7
2,Hurricane Allen (August 1980),Tropical Cyclone,19800807,19800811,2041.4,13
3,Central/Eastern Drought/Heatwave (Summer-Fall 1980),Drought,19800601,19801130,34669.2,1260
4,Florida Freeze (January 1981),Freeze,19810112,19810114,1767.5,0
5,"Severe Storms, Flash Floods, Hail, Tornadoes (May 1981)",Severe Storm,19810505,19810510,1240.3,20
6,"Midwest/Southeast/Northeast Winter Storm, Cold Wave (January 1982)",Winter Storm,19820108,19820116,1887.3,85


In [13]:
unique(tbl$Disaster)
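
`Begin.Date` and `End.Date` are read as integers in YYYYMMDD form (see the `head()` output above), so they need converting before any date arithmetic. A minimal sketch, using the first two rows of that output as toy data; the new column names (`begin`, `end`, `duration_days`, `begin_year`) are placeholders of my own:

```r
# First two rows of the NCEI table, copied from the head() output above
disasters <- data.frame(
  Name       = c("Southern Severe Storms and Flooding (April 1980)",
                 "Hurricane Allen (August 1980)"),
  Begin.Date = c(19800410L, 19800807L),
  End.Date   = c(19800417L, 19800811L),
  stringsAsFactors = FALSE
)

# Convert YYYYMMDD integers to Date objects
disasters$begin <- as.Date(as.character(disasters$Begin.Date), format = "%Y%m%d")
disasters$end   <- as.Date(as.character(disasters$End.Date),   format = "%Y%m%d")

# Event length in days, and the calendar year in which the event began
disasters$duration_days <- as.numeric(difftime(disasters$end, disasters$begin, units = "days"))
disasters$begin_year    <- as.integer(format(disasters$begin, "%Y"))

disasters[, c("Name", "begin", "end", "duration_days", "begin_year")]
```

Having a proper year column would make it easier to line these events up with the yearly or quarterly series in the other datasets.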

### Worldwide election dataset

Downloaded from: http://www.electiondataarchive.org/data-and-documentation.php

This is an extensive dataset of constituency-level election results for lower and upper chambers, along with information about the parties. Could be useful

In [15]:
# two .rdata files for lower chamber and upper chamber
load('../../Data/election_dataset/clea_lc_20201216.rdata')
load('../../Data/election_dataset/clea_uc_20190617.rdata')

In [16]:
ls()

In [17]:
dim(clea_lc_20201216)
dim(clea_uc_20190617)

In [18]:
lc_tbl = clea_lc_20201216
head(lc_tbl)
uc_tbl = clea_uc_20190617
head(uc_tbl)

release,id,rg,ctr_n,ctr,yr,mn,sub,cst_n,cst,...,pev2,vot2,vv2,ivv2,to2,cv2,cvs2,pv2,pvs2,seat
<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,Africa,Botswana,72,1969,10,-990,bobirwa,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
1,1,Africa,Botswana,72,1969,10,-990,bobirwa,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
1,1,Africa,Botswana,72,1969,10,-990,bobirwa,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,1
1,1,Africa,Botswana,72,1969,10,-990,boteli,2,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,1
1,1,Africa,Botswana,72,1969,10,-990,boteli,2,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
1,1,Africa,Botswana,72,1969,10,-990,francistown and tati east,3,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,1


release,id,rg,ctr_n,ctr,yr,mn,sub,cst_n,cst,...,pev2,vot2,vv2,ivv2,to2,cv2,cvs2,pv2,pvs2,seat
<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl+lbl>,<chr>,<chr>,<dbl>,...,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>,<dbl+lbl>
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0
12,1,Latin America,Argentina,32,2001,10,-990,BUENOS AIRES,1,...,-990,-990,-990,-990,-990,-990,-990,-990,-990,0


In [19]:
colnames(lc_tbl)

In [21]:
unlist(uc_tbl[1,])

In [26]:
# also has metadata about parties - this is a pretty extensive dataset
library(readxl)
tbl1 = read_excel('../../Data/election_dataset/clea_lc_enp_20190617/clea_lc_enp_20190617_cst.level.xlsx')
tbl2 = read_excel('../../Data/election_dataset/clea_lc_enp_20190617/clea_lc_enp_20190617_national.level.xlsx')
tbl3 = read_excel('../../Data/election_dataset/clea_lc_enp_20190617/clea_lc_enp_20190617_party.level.xlsx')

In [28]:
head(tbl1)
dim(tbl1)
head(tbl2)
dim(tbl2)
head(tbl3)
dim(tbl3)

id,ctr_n,ctr,yr,mn,cst_n,cst,nvvi,cvvi,ENP_cst,...,inflation2,inflation3,inflation4,inflation5,PSNS,PSNS_s,PSNS_w,PSNS_sw,local_E,cst_tot
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
-999,US,840,1788,-990,new hampshire,7,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293
-999,US,840,1788,-990,pennsylvania,1,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293
-999,US,840,1788,-990,south carolina 1,2,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293
-999,US,840,1788,-990,south carolina 2,3,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293
-999,US,840,1788,-990,south carolina 3,4,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293
-999,US,840,1788,-990,south carolina 4,5,0,0,-990,...,-990,-990,-990,-990,-990,0.7637005,-990,-990,0.6267806,293


id,ctr_n,ctr,yr,mn,cst_tot,nvvi,ENP_nat,ENP_avg,ENP_wght,inflation1,inflation2,inflation3,inflation4,PSNS,PSNS_s,PSNS_w,PSNS_sw,local_E
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
-999,US,840,1788,-990,293,0,-990,-990,-990,-990,-990,-990,-990,-990.0,0.7637005,-990.0,-990.0,0.6267806
-999,US,840,1789,-990,293,0,-990,-990,-990,-990,-990,-990,-990,0.21873155,0.3872084,0.001506056,1.447806e-05,0.23066576
-999,US,840,1790,-990,293,0,-990,-990,-990,-990,-990,-990,-990,-990.0,0.8583587,0.0006037055,1.298869e-14,0.33076881
-999,US,840,1791,-990,293,0,-990,-990,-990,-990,-990,-990,-990,0.16854841,0.4167238,0.001435462,4.952804e-06,0.32921911
-999,US,840,1792,-990,293,0,-990,-990,-990,-990,-990,-990,-990,-990.0,0.3900414,-990.0,-990.0,0.32228524
-999,US,840,1793,-990,293,0,-990,-990,-990,-990,-990,-990,-990,0.07610942,0.188175,0.002176286,0.002364432,0.08333334


id,ctr_n,ctr,yr,mn,cst_tot,pty_n,pty,PNS,PNS_s,PNS_w,PNS_sw
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
-999,US,840,1788,-990,293,independent,6001,-990,1,-990,-990
-999,US,840,1788,-990,293,independent,6002,-990,1,-990,-990
-999,US,840,1788,-990,293,independent,6003,-990,1,-990,-990
-999,US,840,1788,-990,293,independent,6004,-990,1,-990,-990
-999,US,840,1788,-990,293,independent,6005,-990,1,-990,-990
-999,US,840,1788,-990,293,independent,6006,-990,1,-990,-990
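
Many cells in the outputs above hold -990 or -999, which look like missing-data codes rather than real measurements. A minimal sketch of recoding them to `NA` before computing anything, assuming those two values are the only sentinels (this is an assumption; the CLEA codebook should be checked for the full list). The helper `recode_missing` is a hypothetical name of my own, and the toy rows mimic the national-level table:

```r
# Toy rows modelled on the national-level output above
nat <- data.frame(
  id     = c(-999, -999),
  ctr_n  = c("US", "US"),
  yr     = c(1788, 1789),
  PSNS   = c(-990, 0.21873155),
  PSNS_s = c(0.7637005, 0.3872084),
  stringsAsFactors = FALSE
)

# Assumed sentinel codes; confirm against the CLEA codebook before relying on them
missing_codes <- c(-990, -999)

# Replace sentinels with NA in numeric columns only, leaving text columns untouched
recode_missing <- function(df, codes) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) replace(x, x %in% codes, NA))
  df
}

nat_clean <- recode_missing(nat, missing_codes)

# Summaries now ignore the sentinel instead of averaging it in
mean(nat_clean$PSNS, na.rm = TRUE)
```

Without this step, any mean or sum over these columns would be dominated by the large negative codes.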


### Behavioral Risk Factor Surveillance System (BRFSS) dataset

Downloaded from: https://www.cdc.gov/brfss/data_documentation/index.htm

The CDC datasets come in SAS transport (XPT) format. Data needs to be downloaded manually for each year, but coverage goes back to 1985

This site has useful stats about many of the variables:
https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML


In [31]:
library('haven')

"package 'haven' was built under R version 3.6.3"


In [33]:
tbl = read_xpt('../../Data/BRFSS_dataset/LLCP2019.XPT')

In [34]:
dim(tbl)
head(tbl)

_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1,_FLSHOT7,_PNEUMO3,_AIDTST4
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1182019,1,18,2019,1100,2019000001,2019000000.0,1,...,114.0,1,1,1,1,0,0,2,1,2.0
1,1,1132019,1,13,2019,1100,2019000002,2019000000.0,1,...,121.0,1,1,1,1,0,0,1,1,2.0
1,1,1182019,1,18,2019,1100,2019000003,2019000000.0,1,...,164.0,1,1,1,1,0,0,1,2,2.0
1,1,1182019,1,18,2019,1200,2019000004,2019000000.0,1,...,,9,9,1,1,1,1,9,9,
1,1,1042019,1,4,2019,1100,2019000005,2019000000.0,1,...,178.0,1,1,1,1,0,0,2,1,2.0
1,1,1182019,1,18,2019,1200,2019000006,2019000000.0,1,...,,9,9,1,1,1,1,9,9,


### Behavioral Risk Factor Surveillance System (BRFSS) dataset - MSA level data

Downloaded from: https://www.cdc.gov/brfss/smart/Smart_data.htm

BRFSS also publishes data mapped to metropolitan and micropolitan statistical areas (MMSAs), covering 2002-2019.

In [45]:
tbl1 = read_xpt('../../Data/BRFSS_dataset/MMSA2019.xpt')

dim(tbl1)

In [49]:
head(tbl1)

DISPCODE,STATERE1,CELPHONE,LADULT1,COLGSEX,LANDSEX,RESPSLCT,SAFETIME,CADULT1,CELLSEX,...,_VEG23A,_FRUITE1,_VEGETE1,_FLSHOT7,_PNEUMO3,_AIDTST4,_MMSA,_MMSAWT,SEQNO,MMSANAME
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1200,,,,,,,1,1,1,...,1,0,0,,,2,10100,111.21607,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"
1200,,,,,,,1,1,1,...,1,0,0,,,2,10100,147.70381,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"
1200,,,,,,,1,1,1,...,1,0,1,,,1,10100,68.25476,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"
1200,,,,,,,1,1,2,...,1,0,0,,,1,10100,209.27742,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"
1200,,,,,,,1,1,1,...,1,0,0,,,2,10100,81.70571,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"
1200,,,,,,,1,1,1,...,1,0,0,,,2,10100,88.92933,2019000000.0,"Aberdeen, SD, Micropolitan Statistical Area"


In [50]:
colnames(tbl1)
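
Several BRFSS column names start with an underscore (`_MMSA`, `_MMSAWT`, `_AIDTST4`), so they are not syntactic R names and need backticks when used with `$`. A rough sketch of a weighted share per MMSA, using the six rows from the `head()` output above as toy data. It assumes that `_MMSAWT` is the appropriate survey weight and that `_AIDTST4` is coded 1 = yes, 2 = no; both assumptions should be checked against the codebook.

```r
# Toy rows copied from the head() output above
mmsa <- data.frame(
  `_AIDTST4` = c(2, 2, 1, 1, 2, 2),
  `_MMSA`    = rep(10100, 6),
  `_MMSAWT`  = c(111.21607, 147.70381, 68.25476, 209.27742, 81.70571, 88.92933),
  check.names = FALSE   # keep the leading underscores in the column names
)

# Non-syntactic names need backticks with `$`
head(mmsa$`_MMSAWT`)

# Weighted share of respondents coded 1, per MMSA.
# Only codes 1 and 2 are counted (assumed yes/no); other codes and NAs are dropped.
weighted_share <- sapply(split(mmsa, mmsa$`_MMSA`), function(d) {
  ok <- d$`_AIDTST4` %in% c(1, 2)
  sum(d$`_MMSAWT`[ok] * (d$`_AIDTST4`[ok] == 1)) / sum(d$`_MMSAWT`[ok])
})

weighted_share
```

This gives point estimates only; standard errors would need the survey design taken into account, for example with the `survey` package.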

In [51]:
unique(tbl1[, '_MMSA'])
unique(tbl1[, 'MMSANAME'])

_MMSA
<dbl>
10100
10380
10420
10580
10740
11260
12060
12260
12420
12580


MMSANAME
<chr>
"Aberdeen, SD, Micropolitan Statistical Area"
"Aguadilla-Isabela, PR, Metropolitan Statistical Area"
"Akron, OH, Metropolitan Statistical Area"
"Albany-Schenectady-Troy, NY, Metropolitan Statistical Area"
"Albuquerque, NM, Metropolitan Statistical Area"
"Anchorage, AK, Metropolitan Statistical Area"
"Atlanta-Sandy Springs-Alpharetta, GA, Metropolitan Statistical Area"
"Augusta-Richmond County, GA-SC, Metropolitan Statistical Area"
"Austin-Round Rock-Georgetown, TX, Metropolitan Statistical Area"
"Baltimore-Columbia-Towson, MD, Metropolitan Statistical Area"


### MSU dataset - Correlates of State Policy

Downloaded from: http://ippsr.msu.edu/public-policy/correlates-state-policy

MSU has a dataset on correlates of state policy (CSPP)

There are two data files: the full CSV dataset and the Excel file with sub-categories

In [87]:
tbl = read.csv('../../Data/MSU_CSPP/cspp_june_2021.csv', header = TRUE, stringsAsFactors = FALSE)

In [54]:
dim(tbl)
head(tbl)

Unnamed: 0_level_0,year,st,stateno,state,state_fips,state_icpsr,popdensity,popfemale,pctpopfemale,popmale,...,med_spending_own,poptotal,taxes,taxrevcorporate,total_debt_outstanding,total_expenditure,total_revenue,popnohealthins,popprivhealthins,popgovhealthins
Unnamed: 0_level_1,<int>,<chr>,<dbl>,<chr>,<int>,<int>,<dbl>,<int>,<dbl>,<int>,...,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
1,1900,AL,1,Alabama,1,41,,,,,...,,1830000.0,,,,,,,,
2,1900,AK,2,Alaska,2,81,,,,,...,,,,,,,,,,
3,1900,AZ,3,Arizona,4,61,,,,,...,,124000.0,,,,,,,,
4,1900,AR,4,Arkansas,5,42,,,,,...,,1314000.0,,,,,,,,
5,1900,CA,5,California,6,71,,,,,...,,1490000.0,,,,,,,,
6,1900,CO,6,Colorado,8,62,,,,,...,,543000.0,,,,,,,,


In [88]:
unique(tbl$year)

In [57]:
unique(tbl$cwinetex)

In [63]:
codebook = read.csv('../../Data/MSU_CSPP/codebook.csv', header = TRUE, stringsAsFactors = FALSE)

In [60]:
dim(codebook)
tail(codebook)

Unnamed: 0_level_0,variable,years,short_desc,long_desc,sources,category,plaintext_cite,bibtex_cite,plaintext_cite2,bibtex_cite2,plaintext_cite3,bibtex_cite3
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>
2129,percentForeignBorn,decadal years from 1900-2010,the percentage of the population identified as foreign-born by the Census,the percentage of the population identified as foreign-born by the Census,,demographics,"Bullock, John G. ""Education and attitudes toward redistribution in the United States."" British Journal of Political Science (2020): 1-21.","@article{bullock2020education,  title={Education and attitudes toward redistribution in the United States},  author={Bullock, John G},  journal={British Journal of Political Science},  pages={1--21},  year={2020},  publisher={Cambridge University Press} }",,,,
2130,percentUrban,decadal years from 1900-2010,"the percentage of the population identified as living in urban areas by the Census. To maximize consistency of the definition of “urban” across Census years, Census respondents were generally coded as living in urban areas if the Census “metro” variable indicated that they were living in a “metro area.” The “metro” variable is unavailable in 1970 and 1990 Census data; in these cases, I used the closely related “metarea” variable. As with most coding of urban residence in the United States, this coding is generous: for example, counties that contain more than 10,000 people are considered “urban” by this coding, and the entire state of New Jersey is coded as “urban” from 2000 to the present. See http://usa.ipums.org/usa-action/variables/alphabetical?id=M for more information about the definitions of the “metro” and “metarea” variables","the percentage of the population identified as living in urban areas by the Census. To maximize consistency of the definition of “urban” across Census years, Census respondents were generally coded as living in urban areas if the Census “metro” variable indicated that they were living in a “metro area.” The “metro” variable is unavailable in 1970 and 1990 Census data; in these cases, I used the closely related “metarea” variable. As with most coding of urban residence in the United States, this coding is generous: for example, counties that contain more than 10,000 people are considered “urban” by this coding, and the entire state of New Jersey is coded as “urban” from 2000 to the present. See http://usa.ipums.org/usa-action/variables/alphabetical?id=M for more information about the definitions of the “metro” and “metarea” variables",,demographics,"Bullock, John G. ""Education and attitudes toward redistribution in the United States."" British Journal of Political Science (2020): 1-21.","@article{bullock2020education,  title={Education and attitudes toward redistribution in the United States},  author={Bullock, John G},  journal={British Journal of Political Science},  pages={1--21},  year={2020},  publisher={Cambridge University Press} }",,,,
2131,percentWorkInManufacturing,decadal years from 1900-2010,"the percentage of the population identified as living in urban areas by the Census when the respondent was 14, in the state in which he lived when he was 14","the percentage of the population identified as living in urban areas by the Census when the respondent was 14, in the state in which he lived when he was 14",,demographics,"Bullock, John G. ""Education and attitudes toward redistribution in the United States."" British Journal of Political Science (2020): 1-21.","@article{bullock2020education,  title={Education and attitudes toward redistribution in the United States},  author={Bullock, John G},  journal={British Journal of Political Science},  pages={1--21},  year={2020},  publisher={Cambridge University Press} }",,,,
2132,inflation,1978-2017,Annual inflation rate in the state and year,Annual inflation rate in the state and year,,economic-fiscal,"Hazell, Jonathon, Juan Herreño, Emi Nakamura, and Jón Steinsson. The slope of the Phillips Curve: evidence from US states. No. w28005. National Bureau of Economic Research, 2020.","@techreport{hazell2020slope,  title={The slope of the Phillips Curve: evidence from US states},  author={Hazell, Jonathon and Herreno, Juan and Nakamura, Emi and Steinsson, Jon},  year={2020},  institution={National Bureau of Economic Research} }",,,,
2133,inflation_nt,1978-2017,Annual inflation rate in the non-tradeable sector in state and year,Annual inflation rate in the non-tradeable sector in state and year,,economic-fiscal,"Hazell, Jonathon, Juan Herreño, Emi Nakamura, and Jón Steinsson. The slope of the Phillips Curve: evidence from US states. No. w28005. National Bureau of Economic Research, 2020.","@techreport{hazell2020slope,  title={The slope of the Phillips Curve: evidence from US states},  author={Hazell, Jonathon and Herreno, Juan and Nakamura, Emi and Steinsson, Jon},  year={2020},  institution={National Bureau of Economic Research} }",,,,
2134,inflation_t,1978-2017,Annual inflation rate in the tradeable sector in state and year,Annual inflation rate in the tradeable sector in state and year,,economic-fiscal,"Hazell, Jonathon, Juan Herreño, Emi Nakamura, and Jón Steinsson. The slope of the Phillips Curve: evidence from US states. No. w28005. National Bureau of Economic Research, 2020.","@techreport{hazell2020slope,  title={The slope of the Phillips Curve: evidence from US states},  author={Hazell, Jonathon and Herreno, Juan and Nakamura, Emi and Steinsson, Jon},  year={2020},  institution={National Bureau of Economic Research} }",,,,


In [61]:
unique(codebook$category)

In [64]:
codebook[codebook$variable == 'cwinetex',]

Unnamed: 0_level_0,variable,years,short_desc,long_desc,sources,category,plaintext_cite,bibtex_cite,plaintext_cite2,bibtex_cite2,plaintext_cite3,bibtex_cite3
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>
1708,cwinetex,1969-2015,Wine excise tax,"Wine excise tax (dollars per gallon of wine, less than 14% alcohol by volume, off-premise sales; if sales tax is not applied, that amount is deducted)","Sorens, Jason, Fait Muedini, and William P. Ruger. 'State and Local Public Policies in 2006: A New Database.' State Politics & Policy Quarterly 8.3 (2008): 309-26.",drug-alcohol,,,,,,


In [66]:
dim(tbl[, c('year', 'state', 'cwinetex')] %>% drop_na())
head(tbl[, c('year', 'state', 'cwinetex')] %>% drop_na())

Unnamed: 0_level_0,year,state,cwinetex
Unnamed: 0_level_1,<int>,<chr>,<dbl>
3521,1969,Alaska,0.6
3522,1969,Arizona,0.42
3523,1969,Arkansas,0.75
3524,1969,California,0.01
3525,1969,Colorado,0.2
3526,1969,Connecticut,0.25
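
Since the codebook assigns each variable to a category, it can drive column selection instead of looking variables up one at a time. A minimal sketch on toy data: the `cwinetex` rows mirror the output above, while `other_var` and its values are made up for illustration, and `cspp` stands in for the full table.

```r
# Toy codebook and data; 'other_var' is a made-up variable for illustration
codebook <- data.frame(
  variable = c("cwinetex", "other_var"),
  category = c("drug-alcohol", "demographics"),
  stringsAsFactors = FALSE
)
cspp <- data.frame(
  year      = c(1969L, 1969L, 1900L),
  state     = c("Alaska", "Arizona", "Alabama"),
  cwinetex  = c(0.60, 0.42, NA),
  other_var = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Variables in the chosen category that actually exist in the data
vars <- intersect(codebook$variable[codebook$category == "drug-alcohol"],
                  colnames(cspp))

# Keep the identifiers plus those variables,
# dropping rows where every selected variable is NA
subset_tbl <- cspp[, c("year", "state", vars), drop = FALSE]
keep <- rowSums(!is.na(subset_tbl[, vars, drop = FALSE])) > 0
subset_tbl <- subset_tbl[keep, ]

subset_tbl
```

Unlike `drop_na()`, which drops a row if any selected column is missing, this keeps rows that have at least one of the category's variables filled in; which behaviour is preferable depends on the analysis.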


### Tax returns by zip code

Downloaded from: https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi

The IRS publishes annual individual income tax statistics aggregated by zip code, available from 1998 to 2018.
The file for each year needs to be downloaded manually.

In [70]:
taxes = read.csv('../../Data/Tax_by_zip/18zpallnoagi.csv', header = TRUE, stringsAsFactors = FALSE)
taxes_with_agi = read.csv('../../Data/Tax_by_zip/18zpallagi.csv', header = TRUE, stringsAsFactors = FALSE)

In [71]:
dim(taxes)
head(taxes)
dim(taxes_with_agi)
head(taxes_with_agi)

Unnamed: 0_level_0,STATEFIPS,STATE,ZIPCODE,AGI_STUB,N1,MARS1,MARS2,MARS4,ELF,CPREP,...,N85300,A85300,N11901,A11901,N11900,A11900,N11902,A11902,N12000,A12000
Unnamed: 0_level_1,<int>,<chr>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,AL,0,0,2036290,853350,746450,393790,1851240,93480,...,36360,140470,388540,1734772,1602800,5324632,1570180,4701007,36520,559783
2,1,AL,35004,0,5200,2150,2100,820,4730,260,...,20,63,1000,3171,4090,11131,4050,10654,50,433
3,1,AL,35005,0,3190,1410,840,890,2880,160,...,0,0,530,1179,2630,7589,2630,7558,0,0
4,1,AL,35006,0,1240,490,590,140,1120,40,...,0,0,200,498,1010,2834,1010,2793,0,0
5,1,AL,35007,0,12050,4840,5180,1740,10580,840,...,100,307,2780,11619,9020,24590,8900,23889,150,668
6,1,AL,35010,0,7840,3010,2700,2000,7290,290,...,110,498,1300,5018,6370,21697,6210,19500,180,2176


Unnamed: 0_level_0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,ELF,CPREP,...,N85300,A85300,N11901,A11901,N11900,A11900,N11902,A11902,N12000,A12000
Unnamed: 0_level_1,<int>,<chr>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,AL,0,1,768120,466830,90960,198750,696930,37470,...,0,0,59030,50007,669420,1732176,666750,1725286,2730,4220
2,1,AL,0,2,503430,225110,130060,134320,457510,23180,...,0,0,77300,111047,424280,1230668,420960,1220934,4110,9399
3,1,AL,0,3,274590,95560,131770,41020,248630,13210,...,0,0,68920,148870,205830,562490,201770,550675,5790,15182
4,1,AL,0,4,174830,35560,123370,12700,159190,6830,...,0,0,47730,136776,126560,403410,122610,388967,3730,13784
5,1,AL,0,5,245150,25990,207950,6480,224280,10500,...,50,58,98980,446992,145780,598498,137120,521691,9850,65917
6,1,AL,0,6,70170,4300,62340,520,64700,2290,...,36310,140412,36580,841080,30930,797390,20970,293454,10310,451281


In [76]:
head(taxes[, c('ZIPCODE', 'N1', 'A00100', 'A02650', 'A10300', 'A11901', 'A11902')], 20)

Unnamed: 0_level_0,ZIPCODE,N1,A00100,A02650,A10300,A11901,A11902
Unnamed: 0_level_1,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,2036290,125478058,126761741,13938078,1734772,4701007
2,35004,5200,302269,304546,26082,3171,10654
3,35005,3190,130555,131420,8913,1179,7558
4,35006,1240,65411,65797,5295,498,2793
5,35007,12050,743431,750814,71665,11619,23889
6,35010,7840,410413,414405,40869,5018,19500
7,35014,1600,79981,80525,6782,692,3707
8,35016,7210,385033,388852,34720,4284,15492
9,35019,890,44467,44852,3152,190,2275
10,35020,8950,254213,256135,13754,1914,26161


In [77]:
# After zipcode and agi_stub, the columns are: N1 = total returns, A00100 = AGI, A02650 = total income,
# A10300 = total tax liability, A11901 = tax due, A11902 = tax refunds
head(taxes_with_agi[,c('zipcode', 'agi_stub', 'N1', 'A00100', 'A02650', 'A10300', 'A11901', 'A11902')], 20)

Unnamed: 0_level_0,zipcode,agi_stub,N1,A00100,A02650,A10300,A11901,A11902
Unnamed: 0_level_1,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,1,768120,10119915,10261015,272245,50007,1725286
2,0,2,503430,18156451,18322144,834109,111047,1220934
3,0,3,274590,16867358,17015570,1178827,148870,550675
4,0,4,174830,15167919,15290801,1250551,136776,388967
5,0,5,245150,33353413,33654113,3827935,446992,521691
6,0,6,70170,31813002,32218098,6574411,841080,293454
7,35004,1,1460,18893,19181,566,118,2645
8,35004,2,1330,49466,49983,2432,275,2942
9,35004,3,990,60672,61077,4229,322,2020
10,35004,4,610,52969,53266,4105,319,1352
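
Two things stand out in these outputs. First, the zip code 0 rows appear to be state totals: the six `agi_stub` rows for Alabama sum to the `N1` value in the no-AGI file. Second, zip codes are read as integers, so any leading zeros are lost. A minimal sketch of checking the first and repairing the second, using rows copied from the outputs above; `zip5` is a placeholder column name of my own.

```r
# Rows copied from the outputs above (Alabama)
taxes <- data.frame(ZIPCODE = c(0L, 35004L), N1 = c(2036290, 5200))
taxes_with_agi <- data.frame(
  zipcode  = rep(0L, 6),
  agi_stub = 1:6,
  N1       = c(768120, 503430, 274590, 174830, 245150, 70170)
)

# Summing returns over the AGI brackets should reproduce the no-AGI totals
agi_totals <- aggregate(N1 ~ zipcode, data = taxes_with_agi, FUN = sum)
check <- merge(taxes, agi_totals, by.x = "ZIPCODE", by.y = "zipcode",
               suffixes = c("_noagi", "_agi"))
check$match <- check$N1_noagi == check$N1_agi
check

# Restore leading zeros so the codes can be joined to other zip-level sources
taxes$zip5 <- sprintf("%05d", taxes$ZIPCODE)
taxes
```

On the full files, the state-total rows would need to be excluded before aggregating across zip codes, otherwise each state would be counted twice.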


### Annual surveys of sectors from census.gov

Data available at: https://www.census.gov/about/index.html

There are six macroeconomic (aggregated over the whole nation) annual surveys that might be worth looking into:

- Annual Capital Expenditures Survey (ACES)
- Annual Retail Trade Survey
- Annual Service Survey
- Annual Survey of Manufactures (ASM)
- Monthly & Annual Wholesale Trade Survey
- Annual Survey of Entrepreneurs (ASE)

The ASM also has state-level data with sales and payroll info. As an example, I've downloaded the data for Pennsylvania (PA):

In [80]:
tbl = read.csv('../../Data/ASM/2021_example_PA.csv', header = TRUE, stringsAsFactors = FALSE)

In [81]:
dim(tbl)
head(tbl)

Unnamed: 0_level_0,GEO_ID,GEO_ID_F,NAME,NAICS2017,NAICS2017_F,INDLEVEL,NAICS2017_LABEL,SUBSECTOR,SECTOR,INDGROUP,...,PCHADVT_S,PCHPRTE,PCHPRTE_S,PCHTAX,PCHTAX_S,PCHOEXP,PCHOEXP_S,RCPTOT_IMP,PAYANN_IMP,EMP_IMP
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,...,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,id,Geo Footnote,Geographic Area Name,2017 NAICS code,2017 NAICS Footnote,Industry level,Meaning of NAICS code,SUBSECTOR,NAICS economic sector,Industry group,...,Relative standard error for estimate of advertising and promotional services (%),"Purchased professional and technical services ($1,000)",Relative standard error for estimate of purchased professional and technical services (%),"Taxes and license fees ($1,000)",Relative standard error for estimate of taxes and license fees (%),"All other operating expenses ($1,000)",Relative standard error for estimate of all other operating expenses (%),"Range indicating percent of total sales, value of shipments, or revenue imputed",Range indicating percent of total annual payroll imputed,Range indicating percent of total employees imputed
2,0400000US42,,Pennsylvania,31-33,,2,Manufacturing,,31,,...,N,N,N,N,N,N,N,40% to less than 50%,20% to less than 30%,40% to less than 50%
3,0400000US42,,Pennsylvania,31-33,,2,Manufacturing,,31,,...,N,N,N,N,N,N,N,40% to less than 50%,30% to less than 40%,40% to less than 50%
4,0400000US42,,Pennsylvania,311,,3,Food manufacturing,311,31,,...,N,N,N,N,N,N,N,40% to less than 50%,30% to less than 40%,40% to less than 50%
5,0400000US42,,Pennsylvania,311,,3,Food manufacturing,311,31,,...,N,N,N,N,N,N,N,40% to less than 50%,30% to less than 40%,40% to less than 50%
6,0400000US42,,Pennsylvania,3111,,4,Animal food manufacturing,311,31,3111,...,N,N,N,N,N,N,N,40% to less than 50%,60% to less than 70%,70% to less than 80%


In [82]:
unique(tbl$NAICS2017)
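
In the `head()` output above, the first data row holds human-readable column labels rather than data, which is why every column is read as `<chr>`. A minimal sketch of reading the names and labels separately and then the data without the label row. It uses inline toy CSV text with four of the real column names so that it runs on its own; for the actual file, `text = csv_text` would be replaced by the file path.

```r
# Toy CSV text mimicking the layout above: names row, label row, then data
csv_text <- paste(
  "GEO_ID,NAME,NAICS2017,INDLEVEL",
  "id,Geographic Area Name,2017 NAICS code,Industry level",
  "0400000US42,Pennsylvania,31-33,2",
  "0400000US42,Pennsylvania,311,3",
  sep = "\n"
)

# Read the names row plus the label row, and keep the labels as a named lookup
hdr    <- read.csv(text = csv_text, nrows = 1, header = TRUE, stringsAsFactors = FALSE)
labels <- setNames(unlist(hdr[1, ], use.names = FALSE), colnames(hdr))

# Read the data, skipping both header lines, and reapply the column names
asm <- read.csv(text = csv_text, skip = 2, header = FALSE,
                col.names = colnames(hdr), stringsAsFactors = FALSE)

str(asm$INDLEVEL)       # parsed as integer rather than character
labels[["NAICS2017"]]   # label lookup for a given column
```

Columns that use letter flags such as `N` in place of numbers would still be read as character and would need separate handling.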