The role of land in temperate and tropical agriculture

T. Ryan Johnson and Dietrich Vollrath

Replication of Main Results

The Code folder contains the necessary code to replicate all the results in the paper. There are two sets of code, the first generates a series of CSV datasets from raw GIS data (using R), and the second runs regressions using those CSV datasets (using Stata).

Control files

There are three control files (all in CSV format) in the "Code" folder.

  1. crops_control.csv: this contains a list of crops to include in the analysis, along with codes used to denote them in some of the datasets. It also contains an indication of which regions these crops were available in prior to 1500.
  2. iso_codes.csv: this maps countries to 3-digit ISO codes, as well as to the region codes we use to group countries.
  3. crops_earthstat_control.csv: this contains a list of crops to include when creating actual production data for a selected set of crops using the Earthstat data.

Edits to these files will alter the results by changing which crops are used to calculate the productivity index in the paper, and/or changing which countries are included in certain sub-samples in the regressions.

Generating data

These files are all named "Crops_Data_???.r". "Crops_Data_Master.r" is the single script that kicks off all the other scripts, and it contains all the settings that control how those scripts run.

Within "Crops_Data_Master.r", you should edit the following

mdir <- "~/dropbox/project/crops"

to point towards the location of the folder that contains the data and code for the project.

The lines following that assignment show you the necessary set of sub-folders that should exist for the code to run correctly.

refdir  <- paste0(mdir,"/Work") # working files and end data
codedir <- paste0(mdir,"/Replicate") # code
gadmdir <- paste0(mdir,"/data/GADM") # Administrative polygons
gaezdir <- paste0(mdir,"/data/GAEZ") # Crop suitability data
hydedir <- paste0(mdir,"/data/HYDE") # Population data
csidir  <- paste0(mdir,"/data/CropCSI") # Crop caloric suitability
dmspdir <- paste0(mdir,"/data/DMSP/2000") # Night lights data
kgdir   <- paste0(mdir,"/data/Koeppen-Geiger-GIS") # KG climate zones
esdir   <- paste0(mdir,"/data/Earthstat") # Earthstat production data
grumdir <- paste0(mdir,"/data/GRUMP") # GRUMP population data
datadir <- paste0(mdir,"/Replicate") # Control files
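Before running anything, it can help to confirm that these sub-folders actually exist. The following is a minimal sketch, not part of the replication code: the helper name missing_subdirs is hypothetical, and the folder list is simply copied from the assignments above.

```r
# Sub-folder names copied from the assignments above, relative to mdir.
expected_subdirs <- c(
  "Work", "Replicate",
  "data/GADM", "data/GAEZ", "data/HYDE", "data/CropCSI",
  "data/DMSP/2000", "data/Koeppen-Geiger-GIS",
  "data/Earthstat", "data/GRUMP"
)

# Hypothetical helper (not in the project code): return the expected
# sub-folders that do not exist under mdir.
missing_subdirs <- function(mdir, subdirs = expected_subdirs) {
  paths <- file.path(path.expand(mdir), subdirs)
  subdirs[!dir.exists(paths)]
}

# An empty result means every expected folder was found.
missing_subdirs("~/dropbox/project/crops")
```

Any folder names returned are ones to create or to fix in "Crops_Data_Master.r" before calling the other scripts.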

The other options to set in this file refer to the Caloric Suitability Index parameters you want to use to generate the measure of productivity.

water <- "rain_fed" ## alternative is "irrigated"
input <- "lo" ## alternatives are "med" and "hi"
p1500 <- "" ## alternatives are "" for post-1500, "_p1500" for pre-1500

The values shown (rain-fed water supply, low inputs) are the ones associated with exogenous variation in suitability. Setting the "p1500" flag to "_p1500" tells the code to ignore crops in a region that were not present there prior to 1500 (e.g. potatoes in Europe).
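Since these three settings are plain strings, a typo can go unnoticed until a file fails to load. As a hedged sketch, a small guard like the one below could be run after setting them. The function check_settings is hypothetical and not part of the project code; the allowed values are the ones listed in the comments above.

```r
# Hypothetical guard (not in the project code): stop early if any of the
# three settings is not one of the documented alternatives.
check_settings <- function(water, input, p1500) {
  stopifnot(
    water %in% c("rain_fed", "irrigated"),
    input %in% c("lo", "med", "hi"),
    p1500 %in% c("", "_p1500")
  )
  invisible(TRUE)
}

# The settings shown above pass the check.
check_settings("rain_fed", "lo", "")
```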

With all these parameters set, uncomment the line in "Crops_Data_Master.r" that calls the "Crops_Data_Reference.r" script and run it. This only needs to be run once. It produces a rasterized version of the district boundaries (for use in zonal statistics) and initializes a CSV file with IDs for each district, to which the other scripts append their data.

Once the "Crops_Data_Reference.r" script has been run once, you can comment out this call again. Then uncomment the calls to any or all of the other scripts listed to produce specific CSV files of data. Running all of them in order will take approximately 1-2 hours depending on your machine.
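To illustrate the idea of keying every dataset to the district IDs created by the reference script, here is a minimal sketch using base R. The column names (district_id, suit_mean) and the helper append_by_id are made up for illustration; the actual scripts may name and combine things differently.

```r
# Toy stand-in for the reference file: one row per district ID.
# Column names here are hypothetical.
ref <- data.frame(district_id = c(1, 2, 3))

# Toy stand-in for data computed by one of the other scripts.
new_data <- data.frame(district_id = c(3, 1), suit_mean = c(0.7, 0.2))

# Left-join on the ID so every district in the reference file is kept,
# with NA where a script produced no value for that district.
append_by_id <- function(ref, new_data, id = "district_id") {
  merge(ref, new_data, by = id, all.x = TRUE)
}

out <- append_by_id(ref, new_data)
```

Keeping all rows of the reference file means later merges never silently drop districts.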

The final two scripts called by "Crops_Data_Master.r" refer to pre-1500 versions of the data. We do not use these in the paper, but the scripts are available if you want to use them. "Crops_Data_Regions.r" creates separate rasters that define broad regions (e.g. Europe, South America, etc.), and crops are coded as available pre-1500 or not by region. "Crops_Data_Pre1500.r" creates new pre-1500 versions of select CSI and GAEZ crop files, setting values for productivity or suitability to zero if the crop was not available in that region prior to 1500.
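The zeroing rule can be sketched on toy data. The example below uses plain matrices in place of rasters, and the function name and region codes are hypothetical; it is meant only to show the logic described above, not the code in "Crops_Data_Pre1500.r".

```r
# Hypothetical illustration: values is a grid of productivity or
# suitability for one crop, region is a grid of region codes with the
# same shape, and available_regions lists the codes where the crop was
# present before 1500.
mask_pre1500 <- function(values, region, available_regions) {
  out <- values
  # Cells whose region is not in the available set are set to zero.
  out[!(region %in% available_regions)] <- 0
  out
}

values <- matrix(c(10, 20, 30, 40), nrow = 2)
region <- matrix(c(1, 1, 2, 2), nrow = 2)   # made-up region codes
masked <- mask_pre1500(values, region, available_regions = 1)
```

In this toy case the cells in region 2 become zero while the cells in region 1 are left unchanged.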

IPUMS data

In addition to the geographic data, the raw IPUMS data must be prepared separately for use in the robustness regressions. The script "" takes in the raw IPUMS extracts and collapses them to summary measures of population for each district (denoted by the GEOLEV2 variable) provided by IPUMS. This should be run first. Warning: this script takes hours to run, as it collapses millions of records from each extract.
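As a rough illustration of what collapsing to the district level means, the sketch below sums person weights by GEOLEV2 on a toy data frame. It assumes the extract includes the IPUMS person weight PERWT; the district codes are made up, and the project's own script may compute different or additional summary measures.

```r
# Toy person-level records; GEOLEV2 codes here are made up and PERWT is
# assumed to be present in the extract.
records <- data.frame(
  GEOLEV2 = c(1001, 1001, 1002),
  PERWT   = c(10, 12, 8)
)

# Sum the person weights within each district to get an estimate of
# district population.
collapse_population <- function(records) {
  aggregate(PERWT ~ GEOLEV2, data = records, FUN = sum)
}

pop <- collapse_population(records)
```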

With that run, the geography scripts in R can be run for the IPUMS districts. These separate scripts are necessary as IPUMS uses a different definition of districts than GADM. The final section of "Crops_Data_Master.r" shows the order for these scripts.

Once those scripts have been run, then "" can be run, which merges the collapsed population data with the geographic data. This script also runs the regressions for the IPUMS data.

Data Sources

You can see the organization of the folders for the data in the "Crops_Data_Reference.r" script. All the data we use is public and can be freely downloaded from the original sources. To facilitate an exact replication, you can access our files here. Note that the full set of data is around 25GB.

The original sources of the data can be found at the following links:

  1. CropCSI: From Ozak and Galor (2016). See here, and look for the "Caloric Suitability for Individual Crops" link towards the bottom of the page.
  2. GADM: From here, and see their download section.
  3. GAEZ: From the FAO, and click on the "Access Data Portal" button. You need Flash installed to use it. It is also highly frustrating to download this data, as you have to pull down each individual dataset one by one. See below for a description of which files we use.
  4. HYDE: From here, you click on the link to download data, which takes you to an FTP server. Use guest to login. Data are organized by year, with folders of the name "YYYYAD_pop".
  5. DMSP: From here. We use the link for 2000/F15 data.
  6. Koeppen-Geiger-GIS: From here, where you can find GIS shapefiles for the 1976-2000 observed classification towards the bottom of the page.
  7. Earthstat: From here, go to data downloads and get the zip file for "major crops" from the "Harvested Area and Yield for 175 Crops" section.
  8. IPUMS: From here, we downloaded the "spatially harmonized second-level geography" shape files (see the Geography and GIS page). We then created an extract of population data for the 39 countries that have data at this second level (see the Appendix to the paper for the list of 39 countries).

GAEZ data. As noted, this is somewhat annoying to access. We use the following sets of data

  1. lr_lco_faocrp00.tif: Percent of a grid-cell that is cultivated
  2. lr_soi_sq?b_mze.tif: A set of 7 files (? is 1-7), which are measures of agro-climatic constraints (nutrient availability, excess salts, etc.)
  3. res01_???_crav6190.tif: A set of 7 files (??? are codes identifying the files) which measure more agro-climatic constraints (growing period, reference evapotranspiration, etc.)
  4. res03_crav6190l_sxlr_???.tif: A set of files (??? denote crop codes - see the crops_control.csv file) that measure suitability for a crop on a 0 to 100 scale
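The wildcard patterns above expand mechanically into file names, which is handy when checking that a download is complete. A minimal sketch follows; the helper crop_suit_file is hypothetical, and "xxx" is a placeholder rather than a real crop code.

```r
# The seven soil-constraint files: ? runs from 1 to 7.
soil_files <- sprintf("lr_soi_sq%db_mze.tif", 1:7)

# Hypothetical helper: build the suitability file name for a crop code
# taken from crops_control.csv.
crop_suit_file <- function(crop_code) {
  sprintf("res03_crav6190l_sxlr_%s.tif", crop_code)
}

# "xxx" is a placeholder, not an actual crop code.
example_file <- crop_suit_file("xxx")
```

Passing such a list to file.exists() together with the GAEZ folder path would show which files are still missing.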

The shell script "" in the Replicate folder is a utility that will unzip the sets of zip files downloaded from GAEZ.

Running Regressions

These files are all named "Crops_Reg_????.do". There is a single script, "", that will run everything necessary. That script first calls two scripts to prepare the data:

  1. This will take each CSV file, aggregate the data up to the given level (country, province, district), and then merge them into one DTA file. You need to set (A) the directories for the CSV files and the data file of ISO codes and (B) the level of aggregation you want. In normal use, you'd run this script twice: once with the level of aggregation "gadm2" (for a district-level dataset) and once with the level "gadm1" (for a province-level dataset).

  2. This takes the DTA file and produces several new variables (yields, population densities, etc.), winsorizes data, and creates several categorical variables for regions. It also produces summary stats tables and density plots. You need to set (A) the directories for the DTA files and where output (figures, etc.) should go and (B) the level of aggregation you want. In normal use, you'd run this script twice: once with the level of aggregation "gadm2" (for a district-level dataset) and once with the level "gadm1" (for a province-level dataset).

Once those scripts are run, the DTA files necessary for the regressions are ready.
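For readers unfamiliar with winsorizing, the step mentioned above caps extreme values at chosen percentiles instead of dropping them. The sketch below shows the idea in R; the project does this in Stata, and the 1st/99th percentile cutoffs used as defaults here are an assumption, not the values used in the paper.

```r
# Hypothetical illustration of winsorizing: values below the lower
# quantile or above the upper quantile are replaced by those quantiles.
# The default cutoffs are an assumption, not the paper's settings.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

x <- c(1:99, 1000)                      # one extreme outlier
w <- winsorize(x, probs = c(0.05, 0.95))
```

The outlier is pulled down to the upper cutoff, while values inside the cutoffs are unchanged and the number of observations stays the same.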

"" then calls two do-files to put programs in memory - neither produces any output.

  1. ols_spatial_HAC.ado: this is code from Hsiang to calculate Conley standard errors. You should not have to edit or touch this.
  2. these are programs that perform spatial regressions using variables passed to them. You should not have to edit or touch this.

"" then calls do-files to produce various results and tables.

Replication of Table 5, Population Change

We do a validation check using the Acemoglu and Johnson dataset from "Disease and Development". The dataset, named "disease.dta", is available from Acemoglu's website. We do not alter it in any way.

The code "" uses this data, after estimating separate elasticities for each country. You will only have to edit the directory locations in that do-file to reproduce Table 5.


Code for "How Tight are Malthusian Constraints?"


