The role of land in temperate and tropical agriculture
T. Ryan Johnson and Dietrich Vollrath
Replication of Main Results
The Code folder contains the necessary code to replicate all the results in the paper. There are two sets of code: the first generates a series of CSV datasets from raw GIS data (using R), and the second runs regressions using those CSV datasets (using Stata).
There are three control files (all in CSV format) in the "Code" folder.
- crops_control.csv: this contains a list of crops to include in the analysis, along with codes used to denote them in some of the datasets. It also contains an indication of which regions these crops were available in prior to 1500.
- iso_codes.csv: this maps countries to ISO codes (3-digits), as well as region codes we use to group countries
- crops_earthstat_control.csv: contains a list of crops to include when creating actual production data for a selected set of crops using the Earthstat data
Edits to these files will alter the results by changing which crops are used to calculate the productivity index in the paper, and/or changing which countries are included in certain sub-samples in the regressions.
The R scripts are all named "Crops_Data_???.r". "Crops_Data_Master.r" is the single script used to kick off all the other scripts, and it contains all the settings that control how those scripts run.
Within "Crops_Data_Master.r", you should edit the following line
mdir <- "~/dropbox/project/crops"
to point towards the location of the folder that contains the data and code for the project.
The lines following that assignment show you the necessary set of sub-folders that should exist for the code to run correctly.
refdir <- paste0(mdir,"/Work") # working files and end data
codedir <- paste0(mdir,"/Replicate") # code
gadmdir <- paste0(mdir,"/data/GADM") # Administrative polygons
gaezdir <- paste0(mdir,"/data/GAEZ") # Crop suitability data
hydedir <- paste0(mdir,"/data/HYDE") # Population data
csidir <- paste0(mdir,"/data/CropCSI") # Crop caloric suitability
dmspdir <- paste0(mdir,"/data/DMSP/2000") # Night lights data
kgdir <- paste0(mdir,"/data/Koeppen-Geiger-GIS") # KG climate zones
esdir <- paste0(mdir,"/data/Earthstat") # Earthstat production data
grumdir <- paste0(mdir,"/data/GRUMP") # GRUMP population data
datadir <- paste0(mdir,"/Replicate") # Control files
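Before running anything, it can help to confirm that the expected sub-folders are in place. The following is a minimal sketch of such a check, not part of the replication scripts; the helper name `missing_dirs` is hypothetical, and the relative paths are copied from the assignments above.

```r
# Hypothetical pre-flight check; not part of the replication scripts.
mdir <- "~/dropbox/project/crops"  # edit to match your own setup

# Sub-folder paths relative to mdir, copied from the assignments above.
subdirs <- c("Work", "Replicate", "data/GADM", "data/GAEZ", "data/HYDE",
             "data/CropCSI", "data/DMSP/2000", "data/Koeppen-Geiger-GIS",
             "data/Earthstat", "data/GRUMP")

# Return the expected folders that do not exist under a given root.
missing_dirs <- function(root, rel = subdirs) {
  full <- file.path(path.expand(root), rel)
  full[!dir.exists(full)]
}

missing <- missing_dirs(mdir)
if (length(missing) > 0) {
  message("Missing folders:\n", paste(missing, collapse = "\n"))
} else {
  message("All expected folders are present.")
}
```

Any folder reported as missing should be created, and filled with the corresponding data, before the data scripts are called.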
The other options to set in this file refer to the Caloric Suitability Index parameters you want to use to generate the measure of productivity.
water <- "rain_fed" ## alternative is "irrigated"
input <- "lo" ## alternatives are "med" and "hi"
p1500 <- "" ## alternatives are "" for post-1500, "_p1500" for pre-1500
They are set to use the characteristics associated with exogenous variation in suitability. Setting the "p1500" flag to "_p1500" will tell the code to ignore crops in a region that were not present prior to 1500 (e.g., potatoes in Europe).
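Because these settings are plain strings, a typo (say, "rainfed") is easy to make and may not be caught until a file fails to load. Here is a minimal sketch of a sanity check, assuming the only valid values are the ones listed in the comments above; the function `check_csi_settings` is hypothetical and is not part of the replication code.

```r
# Hypothetical sanity check; the allowed values are taken from the
# comments next to each setting in "Crops_Data_Master.r".
water <- "rain_fed"
input <- "lo"
p1500 <- ""

check_csi_settings <- function(water, input, p1500) {
  stopifnot(
    water %in% c("rain_fed", "irrigated"),
    input %in% c("lo", "med", "hi"),
    p1500 %in% c("", "_p1500")
  )
  invisible(TRUE)  # reached only if every setting is valid
}

check_csi_settings(water, input, p1500)
```

The call stops with an error naming the failed condition if any setting falls outside its allowed values.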
With all these parameters set, use the master script to call "Crops_Data_Reference.r" by uncommenting the line that calls it. This script needs to be run only once. It produces a rasterized version of the district boundaries (for use in zonal statistics) and initializes a CSV file with IDs for each district, to which the other scripts append their data.
Once "Crops_Data_Reference.r" has been run, you can comment out this call again. Then uncomment the calls to any or all of the other scripts listed to produce specific CSV files of data. Calling all of them in order will take approximately 1-2 hours depending on your machine.
The final two scripts called by "Crops_Data_Master.r" refer to pre-1500 versions of the data. We do not use these in the paper, but the scripts are available if you want to use them. "Crops_Data_Regions.r" creates separate rasters that define broad regions, and crops are coded as available pre-1500 or not by region (e.g., Europe, South America). "Crops_Data_Pre1500.r" creates new pre-1500 versions of select CSI and GAEZ crop files, setting values for productivity or suitability to zero if the crop was not available in that region prior to 1500.
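The zeroing-out step can be illustrated on a toy grid. This is only a sketch of the idea, using small base R matrices in place of rasters; the values, region codes, and the helper `mask_pre1500` are all hypothetical and do not come from the replication scripts.

```r
# Toy 2x3 grids standing in for rasters (all values are hypothetical).
suitability <- matrix(c(80, 0, 45, 60, 30, 90), nrow = 2)
region      <- matrix(c(1, 1, 2, 2, 3, 3), nrow = 2)  # made-up region codes

# Hypothetical: the crop was available before 1500 only in region 2.
available_regions <- c(2)

# Set cells to zero wherever the crop was not available pre-1500.
mask_pre1500 <- function(values, region, available) {
  values[!(region %in% available)] <- 0
  values
}

pre1500 <- mask_pre1500(suitability, region, available_regions)
print(pre1500)
```

Only the cells in region 2 keep their original values; every other cell is set to zero.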
In addition to the geographic data, the raw IPUMS data must be prepared separately for use in the robustness regressions. The script "Crops_Data_IPUMS.do" takes in the raw IPUMS extracts and collapses them to summary measures of population for each district (denoted by the GEOLEV2 variable) provided by IPUMS. This should be run first. Warning: this script takes hours to run, as it collapses millions of records from each extract.
With that run, the geography scripts in R can be run for the IPUMS districts. These separate scripts are necessary as IPUMS uses a different definition of districts than GADM. The final section of "Crops_Data_Master.r" shows the order for these scripts.
Once those scripts have been run, then "Crops_Reg_IPUMS_Prep.do" can be run, which merges the collapsed population data with the geographic data. This script also runs the regressions for the IPUMS data.
You can see the organization of the folders for the data in the "Crops_Data_Reference.r" script. All the data we use are public and can be downloaded freely from the original sources. To facilitate an exact replication, you can access our files here. Note that the full set of data is around 25GB.
The original sources of the data can be found at the following links:
- CropCSI: From Ozak and Galor (2016). See here, and look for the "Caloric Suitability for Individual Crops" link towards the bottom of the page.
- GADM: From here, and see their download section.
- GAEZ: From the FAO, and click on the "Access Data Portal" button. You need Flash installed to use it. It is also highly frustrating to download this data, as you have to pull down each individual dataset one by one. See below for a description of which files we use.
- HYDE: From here, click on the link to download data, which takes you to an FTP server. Use guest to log in. Data are organized by year, with folders named "YYYYAD_pop".
- DMSP: From here. We use the link for 2000/F15 data.
- Koeppen-Geiger-GIS: From here, where you can find GIS shapefiles for the 1976-2000 observed classification towards the bottom of the page.
- Earthstat: From here, go to data downloads and get the zip file for "major crops" from the "Harvested Area and Yield for 175 Crops" section.
- IPUMS: From here, we downloaded the "spatially harmonized second-level geography" shape files (see the Geography and GIS page). We then created an extract of population data for the 39 countries that have data at this second level (see the Appendix to the paper for the list of 39 countries).
GAEZ data. As noted, this is somewhat annoying to access. We use the following sets of data:
- lr_lco_faocrp00.tif: Percent of a grid-cell that is cultivated
- lr_soi_sq?b_mze.tif: A set of 7 files (? is 1-7), which are measures of agro-climatic constraints (nutrient availability, excess salts, etc.)
- res01_???_crav6190.tif: A set of 7 files (??? are codes identifying the files), which measure more agro-climatic constraints (growing period, reference evapotranspiration, etc.)
- res03_crav6190l_sxlr_???.tif: A set of files (??? denote crop codes; see the crops_control.csv file) that measure suitability for a crop on a 0 to 100 scale
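Since each GAEZ file has to be downloaded individually, it is easy to miss one. The sketch below counts the files matching each of the patterns listed above, assuming the .tif files sit somewhere under the GAEZ data folder; the helper `count_matches` is hypothetical and the folder path is only the default from "Crops_Data_Master.r".

```r
# Hypothetical completeness check for the GAEZ downloads.
gaezdir <- "~/dropbox/project/crops/data/GAEZ"  # edit to match your setup

# File-name patterns from the list above ("?" matches one character).
patterns <- c("lr_lco_faocrp00.tif",
              "lr_soi_sq?b_mze.tif",
              "res01_???_crav6190.tif",
              "res03_crav6190l_sxlr_???.tif")

# Count the files under a folder that match each glob pattern.
count_matches <- function(dir, globs = patterns) {
  vapply(globs, function(g) {
    length(list.files(dir, pattern = glob2rx(g), recursive = TRUE))
  }, integer(1))
}

print(count_matches(path.expand(gaezdir)))
```

Per the list above, the first pattern should match a single file and the second and third should each match 7; the count for the last pattern depends on how many crops are listed in the control file.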
The shell script "Crops_GAEZ_unzip.sh" in the Replicate folder is a utility that will unzip the sets of zip files downloaded from GAEZ.
The Stata scripts are all named "Crops_Reg_????.do". There is a single script, "Crops_Reg_Master.do", that will run everything necessary. That script first calls two scripts to prepare the data:
Crops_Reg_Collapse.do: This will take each CSV file, aggregate the data up to the given level (country, province, district), and then merge them into one DTA file. You need to set (A) the directories for the CSV files and the data file of ISO codes and (B) the level of aggregation you want. In normal use, run this script twice: once with the level of aggregation "gadm2" (for a district-level dataset) and once with the level "gadm1" (for a province-level dataset).
Crops_Reg_Prep.do: This takes the DTA file and produces several new variables (yields, population densities, etc.), winsorizes data, and creates several categorical variables for regions. It also produces summary stats tables and density plots. You need to set (A) the directories for the DTA files and where output (figures, etc.) should go and (B) the level of aggregation you want. In normal use, run this script twice: once with the level of aggregation "gadm2" (for a district-level dataset) and once with the level "gadm1" (for a province-level dataset).
Once those scripts are run, the DTA files necessary for the regressions are ready.
"Crops_Reg_Master.do" then calls two files to load programs into memory; neither produces any output.
- ols_spatial_HAC.ado: this is code from Hsiang to calculate Conley standard errors. You should not have to edit or touch this.
- Crops_Reg_Program.do: these are programs that perform spatial regressions using variables passed to them. You should not have to edit or touch this.
"Crops_Reg_Master.do" then calls do-files to produce various results and tables.
Replication of Table 5, Population Change
We do a validation check using the Acemoglu and Johnson dataset from "Disease and Development". The dataset, named "disease.dta", is available from Acemoglu's website. We do not alter it in any way.
The code "Crops_Reg_Mortality.do" uses this data, after estimating separate elasticities for each country. You will only have to edit the directory locations in that do-file to reproduce Table 5.