# What is the impact of the cornfield expansion rate on the deforestation rate across countries?


By Platypus

## Table of Contents
1. [Building our table](#Getting-the-Data)
    - [Environment setup](#Environment-setup)
    - [Variable X: Cornfield data](#Cornfield-Data)
    - [Variable Y: Forest area](#Forest-Area)
    - [Heterogeneity Variable: Land available for corn expansion other than forests](#Heterogeneity-variable)
    - [Confounders](#Confounders)
        - [Average Temperature](#Average-Temperature)
        - [GDP](#GDP)
        - [Cattle](#Cattle)
    

In [525]:
# CSV files
filepath = 'https://raw.githubusercontent.com/ZeliaDec/DataScience/main/Data/'
csv_cornland = filepath + 'FAOSTAT_data_en_10-2-2024.csv'
csv_landcover = filepath + "FAOSTAT_data_en_11-18-2024.csv"

## Getting the Data

### Environment setup

#### Installing packages

In [526]:
pip install wbdata

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [527]:
pip install pycountry

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


#### Import libraries

In [528]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import wbdata
import numpy as np
import pycountry
import plotly.express as px

#### Functions 
The functions below map a country reference (country name or M49 code) to a standardized ISO3 country code.

##### From Country Name to ISO3

In [529]:
# Function to get ISO-3 country code from country name
def get_iso3(country_name):
    try:
        return pycountry.countries.lookup(country_name).alpha_3
    except LookupError:
        return None

In [530]:
# Lookup the country by ISO3 code
country = pycountry.countries.get(alpha_3='PSE')

# Display the country name
country.name

'Palestine, State of'

##### From M49 to ISO3
M49 is a three-digit numeric code that the United Nations assigns to countries and areas.

In [531]:
def m49_to_iso3(m49_code):
    for country in pycountry.countries:
        if hasattr(country, 'numeric') and int(country.numeric) == m49_code:
            return country.alpha_3
    return None
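
The loop above scans the whole pycountry list for every row it is applied to, which is costly on a large table. A minimal sketch of an alternative, assuming the same `numeric` and `alpha_3` attributes: build the mapping once and look codes up in a dictionary. The names `build_m49_lookup` and `Country` are hypothetical; `Country` is only a stand-in so the example runs without pycountry, and in the notebook one would pass `pycountry.countries` instead.

```python
from collections import namedtuple

# Stand-in for pycountry records (hypothetical, for illustration only)
Country = namedtuple("Country", ["alpha_3", "numeric"])

def build_m49_lookup(countries):
    """Return a dict mapping integer M49 codes to ISO3 codes."""
    lookup = {}
    for country in countries:
        numeric = getattr(country, "numeric", None)
        if numeric is not None:
            lookup[int(numeric)] = country.alpha_3
    return lookup

sample = [Country("AFG", "004"), Country("ALB", "008"), Country("DZA", "012")]
m49_lookup = build_m49_lookup(sample)

print(m49_lookup.get(4))    # AFG
print(m49_lookup.get(999))  # None, as m49_to_iso3 returns for unknown codes
```

On a pandas column, the dictionary can then be applied with `Series.map`, for example `landcover_data['Area Code (M49)'].map(m49_lookup)`.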

# <hr>

### Cornfield Data

<p style="font-family: 'Trebuchet', sans-serif; font-size:16px;text-align:justify;line-height: 1.6;">Import the CSV file that contains the cornfield area (in ha) for each country.</p>

In [532]:
corn_df = pd.read_csv(csv_cornland)
corn_df.head()

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
0,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),1961,1961,ha,500000,A,Official figure,
1,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),1962,1962,ha,500000,A,Official figure,
2,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),1963,1963,ha,500000,A,Official figure,
3,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),1964,1964,ha,505000,A,Official figure,
4,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),1965,1965,ha,500000,A,Official figure,


<p style="font-family: 'Trebuchet', sans-serif; font-size:16px;text-align:justify;line-height: 1.6;"> The heterogeneity variable, which is built a few steps further down, needs the cornfield area in 1992.</p>

In [533]:
# Filter the data for the year 1992
data_1992 = corn_df[corn_df["Year"] == 1992]

# Find countries with missing values in the "Value" column
countries_no_value_1992 = data_1992[data_1992["Value"].isnull()]["Area"].unique()

# Convert to list
countries_no_value_1992_list = countries_no_value_1992.tolist()

# Keep the relevant columns
data_1992 = data_1992[["Area", "Value"]]

# Rename the columns
data_1992.rename(columns={"Area": "country", "Value": "corn1992"}, inplace=True)

# Convert ha to sq km (1 ha = 0.01 sq km)
data_1992["corn1992"] = data_1992["corn1992"]*0.01

# Drop rows with missing values
data_1992 = data_1992.dropna()

print("List of countries with no data in 1992: ", countries_no_value_1992_list)

data_1992.head()

List of countries with no data in 1992:  []


Unnamed: 0,country,corn1992
31,Afghanistan,2000.0
93,Albania,627.36
155,Algeria,2.9
217,Angola,8470.0
274,Antigua and Barbuda,0.3


<p style="font-family: 'Trebuchet', sans-serif; font-size:16px;text-align:justify;line-height: 1.6;"> We keep the data from 2000 to 2021 to construct the X variable.</p>

In [534]:
# Filter the data between 2000 and 2021
corn_field_data = corn_df
corn_field_data = corn_field_data[(corn_field_data['Year'] >= 2000) & (corn_field_data['Year'] <= 2021)]
corn_field_data.head()

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
39,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),2000,2000,ha,96000,A,Official figure,
40,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),2001,2001,ha,80000,A,Official figure,
41,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),2002,2002,ha,100000,A,Official figure,
42,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),2003,2003,ha,250000,A,Official figure,
43,QCL,Crops and livestock products,4,Afghanistan,5312,Area harvested,112,Maize (corn),2004,2004,ha,250000,A,Official figure,


In [535]:
print("Values held in the column 'Note': \n", corn_field_data["Note"].unique())

print("Values held in the column 'Flag Description': \n",corn_field_data["Flag Description"].unique())

# Keep only the rows with official figures
corn_field_data = corn_field_data[(corn_field_data["Flag Description"] == "Official figure") & (corn_field_data["Note"] != "Unofficial figure")]

Values held in the column 'Note': 
 [nan 'Unofficial figure']
Values held in the column 'Flag Description': 
 ['Official figure' 'Estimated value' 'Imputed value'
 'Figure from international organizations'
 'Missing value (data cannot exist, not applicable)']


In [536]:
# Number of countries in the cornfield dataset
print("Number of countries in the cornfield dataset: ", corn_field_data["Area"].nunique())

Number of countries in the cornfield dataset:  162


In [537]:
# Keep only the relevant columns
df_corn = corn_field_data
df_corn = df_corn[["Area","Value","Year"]]

# Countries that report a cornfield area of 0 in at least one year
areas_to_exclude = df_corn[df_corn["Value"] == 0]["Area"].unique()
# Exclude these countries entirely
df_corn = df_corn[~df_corn['Area'].isin(areas_to_exclude)]

# Number of observations per country
value_counts_per_area = df_corn.groupby("Area")["Value"].count()

# List the countries with a number of observations different from 22 (2000-2021)
areas_not_equal_to_22 = value_counts_per_area[value_counts_per_area != 22].index
# Exclude these countries
df_corn = df_corn[~df_corn["Area"].isin(areas_not_equal_to_22)]

df_corn.head()

Unnamed: 0,Area,Value,Year
39,Afghanistan,96000,2000
40,Afghanistan,80000,2001
41,Afghanistan,100000,2002
42,Afghanistan,250000,2003
43,Afghanistan,250000,2004
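
The two filters above keep a balanced panel: only countries with a non-zero value in every one of the 22 years remain. A minimal sketch of the same idea on a toy frame, using `groupby(...).filter`; the helper name `keep_balanced` and the toy data are hypothetical, and the sketch assumes one row per country and year.

```python
import pandas as pd

def keep_balanced(df, n_years, group_col="Area", value_col="Value"):
    """Keep groups with exactly n_years non-missing values and no zero."""
    def is_complete(group):
        values = group[value_col]
        return bool(values.count() == n_years and (values != 0).all())
    return df.groupby(group_col).filter(is_complete)

# Toy data: B reports a zero, C misses one year
toy = pd.DataFrame({
    "Area":  ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Year":  [2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001],
    "Value": [10, 12, 11, 5, 0, 7, 3, 4],
})

balanced = keep_balanced(toy, n_years=3)
print(balanced["Area"].unique())  # only "A" remains
```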


In [538]:
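# Map FAOSTAT country names to the names that pycountry recognises.
# 'China' is renamed to 'China_' so that it gets no ISO3 code and is dropped later, while 'China, mainland' takes the name 'China'.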
country_mapping_corn = {
    'Bolivia (Plurinational State of)':'Bolivia, Plurinational State of',
    'China, Taiwan Province of':'Taiwan, Province of China',
    'China': 'China_',
    'China, mainland':'China',
    'Democratic Republic of the Congo':'Congo, The Democratic Republic of the',
    'Iran (Islamic Republic of)':'Iran, Islamic Republic of',
    'Micronesia (Federated States of)': 'Micronesia, Federated States of',
    'Netherlands (Kingdom of the)':'Netherlands',
    'Republic of Korea': 'Korea, Republic of',
    'Venezuela (Bolivarian Republic of)':'Venezuela, Bolivarian Republic of',
    # You can add more mappings if necessary
}


In [539]:
# Replace the country names, so that they match the ISO-3 codes
df_corn['Area'] = df_corn['Area'].replace(country_mapping_corn)

# Get the ISO-3 code for each country
df_corn['iso3'] = df_corn['Area'].apply(get_iso3)

# Print the countries with missing ISO-3 codes
# Note: China mainland was converted into China and China was dropped
#print(df_corn[df_corn['iso3'].isnull()]["Area"].unique())

# Drop rows with missing values, here missing values for ISO-3 codes
df_corn = df_corn.dropna()

df_corn = df_corn.reset_index()
df_corn.drop('index', axis=1, inplace=True)

# Rename the columns
df_corn.rename(columns={'Area':'country', 'Value':'corn_ha', 'Year': 'year'}, inplace=True)
# Convert ha to sq km (1 ha = 0.01 sq km)
df_corn['corn'] = df_corn['corn_ha']*0.01
# Drop the column corn in hectares
df_corn.drop('corn_ha', axis=1, inplace=True)

df_corn.head()

Unnamed: 0,country,year,iso3,corn
0,Afghanistan,2000,AFG,960.0
1,Afghanistan,2001,AFG,800.0
2,Afghanistan,2002,AFG,1000.0
3,Afghanistan,2003,AFG,2500.0
4,Afghanistan,2004,AFG,2500.0


In [540]:
df_corn = pd.merge(df_corn, data_1992, on='country', how='left')
df_corn = df_corn.dropna()

last_df = df_corn
print("Number of countries in the database", last_df['country'].nunique())

Number of countries in the database 78
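
The left merge followed by `dropna()` silently removes every country whose name has no match in `data_1992`. Because `data_1992` still carries the original FAOSTAT names while `df_corn` uses the replaced ones, renamed countries may fail to match. A minimal sketch, on toy data, of how the `indicator` argument of `pd.merge` can list such countries before they are dropped; the second country's values are made up for illustration.

```python
import pandas as pd

# Panel with names already mapped for pycountry (toy values)
panel = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Korea, Republic of"],
    "year": [2000, 2001, 2000],
    "corn": [960.0, 800.0, 150.0],
})
# 1992 baseline that still uses the original source names (toy values)
baseline = pd.DataFrame({
    "country": ["Afghanistan", "Republic of Korea"],
    "corn1992": [2000.0, 200.0],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = pd.merge(panel, baseline, on="country", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique()
print(unmatched)  # countries that a later dropna() would remove
```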


# <hr>

## Forest Area
 
<p style="font-family: 'Trebuchet', sans-serif; font-size:16px; text-align:justify; line-height: 1.6;">
  For both the variable Y and the heterogeneity variable, we will use a 
  <a href="https://www.fao.org/faostat/en/#data/LC">database about Global Land Cover</a>
  provided by FAO, where we have data for 247 countries, and for 14 classes of land cover:
</p>
<div style="column-count: 2; font-size: 14px; line-height: 1.8; font-family: 'Trebuchet', sans-serif;">
  <ol>
    <li>Artificial surfaces (including urban and associated areas)</li>
    <li>Herbaceous crops</li>
    <li>Woody crops</li>
    <li>Multiple or layered crops (Not mapped)</li>
    <li>Grassland</li>
    <li>Tree-covered areas</li>
    <li>Mangroves</li>
    <li>Shrub-covered areas</li>
    <li>Shrubs and/or herbaceous vegetation, aquatic or regularly flooded</li>
    <li>Sparsely natural vegetated areas (Not mapped)</li>
    <li>Terrestrial barren land</li>
    <li>Permanent snow and glaciers</li>
    <li>Inland water bodies</li>
    <li>Coastal water bodies and intertidal areas</li>
  </ol>
</div>

In [541]:
# Load the landcover data
landcover_data = pd.read_csv(csv_landcover)

# Get the ISO-3 code for each country
# Note: the function m49_to_iso3 runs for a long time
landcover_data['iso3']=landcover_data['Area Code (M49)'].apply(m49_to_iso3)

# Convert 'Value' from 1000 hectares to square km
landcover_data['Value_sq_km'] = landcover_data['Value'] * 10

print("Number of unique countries:", len(landcover_data['iso3'].unique()))

print("List of countries with no ISO3 code:", landcover_data[landcover_data['iso3'].isnull()]['Area'].unique())

Number of unique countries: 237
List of countries with no ISO3 code: ['Belgium-Luxembourg' 'Channel Islands' 'China' 'Czechoslovakia'
 'Ethiopia PDR' 'Johnston Island' 'Midway Island'
 'Netherlands Antilles (former)' 'Serbia and Montenegro' 'Sudan (former)'
 'Wake Island']


In [542]:
# Filter the data to have only the forest area (Tree-covered areas)
# .copy() gives an independent DataFrame, so the rename below does not act on a slice
forests = landcover_data[landcover_data['Item']=='Tree-covered areas'].copy()

# Rename the columns
forests.rename(columns={'Area': 'country', 'Value_sq_km': 'forest', 'Year': 'year'}, inplace=True)

# Keep only the relevant columns
forests = forests[['country','iso3', 'year', 'Element','forest']]
forests = forests.dropna()

# Data from high resolution satellite imagery
forests = forests[forests['Element']=='Area from CCI_LC']
# Keep only the relevant columns
forests = forests[['country','iso3','year', 'forest']]

print("Number of countries with forest data", forests['country'].nunique())

Number of countries with forest data 236


# <hr>

## Heterogeneity variable
### Land available for corn expansion other than forests

<style>
    body {
      font-family: 'Trebuchet', sans-serif;
      line-height: 1.6;
      font-size: 14px;
    }
    p {
      text-align: justify;
    }
    ul {
      margin-left: 20px;
    }
    li {
      margin-bottom: 10px;
    }
  </style>
</head>
<body>
  <h4>Why?</h4>
  <ul>
    <li><strong>Direct Relevance to Corn Expansion:</strong>
      <p>This variable captures the potential for agricultural growth, specifically for corn, by estimating land areas that could feasibly be converted to cornfields. Unlike general measures of land availability, it focuses on lands that are ecologically and practically suitable for corn cultivation.</p>
    </li>
    <li><strong>Variation Across Countries:</strong>
      <p>The variable reflects differences between countries, such as urbanization levels, existing cropland distribution, and natural geographic constraints, which makes it a natural heterogeneity factor. Countries with little non-forest land available for corn expansion may exhibit stronger links between corn expansion and deforestation, while those with more such land can expand onto existing cropland, grassland, or shrubland instead.</p>
    </li>
    <li><strong>Focus on Agricultural Pressure:</strong>
      <p>This variable aligns directly with the agricultural pressures driving deforestation, providing a more targeted perspective than broader variables like general cropland area.</p>
    </li>
  </ul>

  <h4>How is it constructed?</h4>
  <p>To compute the land available for corn expansion, we sum the land cover categories that seem suitable for corn, that is, potentially convertible land other than forest:</p>
  <ul>
    <li><strong>Herbaceous crops:</strong> Represents existing cropland already used for agricultural purposes. These lands are highly suitable for corn expansion and may involve crop rotation or intensification strategies.</li>
    <li><strong>Grassland:</strong> Grasslands are often used as pastures but can be converted into cropland. These areas are considered moderately suitable for corn expansion, especially in regions with high land-use pressure.</li>
    <li><strong>Shrub-covered areas:</strong> Shrublands, while less fertile than grasslands, can still be converted for agricultural use with proper inputs and management. These areas are often targeted in marginal expansions for crops like corn.</li>
  </ul>

  <p>The reasoning behind this approach is that the relationship between corn expansion and deforestation may be stronger in countries with little land available for corn expansion other than forests, especially where regulation is weak and forest cover is high. In these regions, clearing forest can be cheaper, because the sale of timber provides an additional revenue stream that offsets the cost of converting forest to farmland. In highly regulated or land-constrained regions, converting existing cropland rather than forest is more likely, because strong regulations protect forests, infrastructure for existing cropland is already in place, and incentives encourage intensification (e.g., improving yields) over land expansion.</p>

  <p>We will create a dummy variable that splits our dataset into two groups:</p>
  <ul>
    <li><strong>Group 0:</strong> Countries whose land suitable and available for corn expansion is below the median.</li>
    <li><strong>Group 1:</strong> Countries whose land suitable and available for corn expansion is above the median.</li>
  </ul>
</body>
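
The median split described above can be sketched as follows. This is a minimal illustration on made-up numbers, not the notebook's final construction: the column name `high_suitability` is hypothetical, the median is taken over one value per country so that each country counts once, and ties at the median go to group 0 (an assumption, since the text does not say).

```python
import pandas as pd

# One row per country; suitability values are made up for illustration
countries = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC", "DDD"],
    "suitability": [100.0, 2500.0, 40.0, 900.0],
})

median_suitability = countries["suitability"].median()

# Group 1: above the median; group 0: at or below the median (assumption for ties)
countries["high_suitability"] = (
    countries["suitability"] > median_suitability
).astype(int)

print(median_suitability)
print(countries)
```

Because the suitability value is constant within a country, the dummy can be merged back onto the country-year panel with `iso3` as the key.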

In [543]:
# Select the data with the categories 'Herbaceous crops', 'Grassland', 'Shrub-covered areas'
suitable_categories = ['Herbaceous crops', 'Grassland', 'Shrub-covered areas']
suitable_land = landcover_data[landcover_data['Item'].isin(suitable_categories)]

# Pivot the data to have categories as columns
pivoted_suitable_land = suitable_land.pivot_table(index=['Area', 'Year', 'iso3'], columns='Item', values='Value', aggfunc='sum').reset_index()

# Fill NaN values with 0
pivoted_suitable_land = pivoted_suitable_land.fillna(0)

# Sum the suitable categories for each country and year
pivoted_suitable_land['Total Suitable Land'] = pivoted_suitable_land['Herbaceous crops'] + pivoted_suitable_land['Grassland'] + pivoted_suitable_land['Shrub-covered areas']

# Display the updated DataFrame
pivoted_suitable_land.head()

Item,Area,Year,iso3,Grassland,Herbaceous crops,Shrub-covered areas,Total Suitable Land
0,Afghanistan,1992,AFG,24035.39,5763.35,3350.55,33149.29
1,Afghanistan,1993,AFG,24035.41,5769.19,3350.55,33155.15
2,Afghanistan,1994,AFG,24035.44,5767.49,3350.54,33153.47
3,Afghanistan,1995,AFG,24045.96,5771.52,3312.12,33129.6
4,Afghanistan,1996,AFG,24044.98,5785.1,3306.66,33136.74


In [544]:
# Keep only the data from 1992
pivoted_suitable_land = pivoted_suitable_land[pivoted_suitable_land["Year"]==1992]
# keep only the relevant columns
pivoted_suitable_land = pivoted_suitable_land[["Area", "iso3", "Total Suitable Land"]]

pivoted_suitable_land = pivoted_suitable_land.dropna()
pivoted_suitable_land.reset_index(drop = True, inplace = True)
pivoted_suitable_land.rename(columns={'Area': 'country'}, inplace=True)

print("The number of countries that have some data in 1992: ", pivoted_suitable_land["country"].nunique())
# The heterogeneity variable uses data from 1992, the earliest year we have.
# Because 1992 predates the period of interest (2000-2021), the variable is fixed before the corn and forest changes we study and cannot be affected by them.

pivoted_suitable_land.head()

The number of countries that have some data in 1992:  226


Item,country,iso3,Total Suitable Land
0,Afghanistan,AFG,33149.29
1,Albania,ALB,1710.49
2,Algeria,DZA,11258.87
3,American Samoa,ASM,5.14
4,Andorra,AND,11.56


In [545]:
# Merge of the data
df_final = pd.merge(last_df, pivoted_suitable_land, on=['iso3'], how='outer')
df_final1 = pd.merge(df_final,forests,on=['iso3', 'year'], how='outer')

# Subtract the 1992 corn area from the total suitable land to get the area suitable for corn expansion
# Note: 'Total Suitable Land' is summed from 'Value' (1000 ha), while 'corn1992' is in sq km
df_final1['Total Size Land suitable for corn expansion (sq km)'] = (df_final1['Total Suitable Land']-df_final1['corn1992'])
# Replace negative values with 0
df_final1['Total Size Land suitable for corn expansion (sq km)'] = df_final1['Total Size Land suitable for corn expansion (sq km)'].clip(lower=0)

df_final1 = df_final1.drop(['Total Suitable Land',"country_y", "country", "corn1992"], axis=1)
df_final1.rename(columns={'Total Size Land suitable for corn expansion (sq km)': 'suitability', "country_x":"country"}, inplace=True)
df_final1 = df_final1.dropna()

print("Number of countries in the database with forest and land suitability:", df_final1['country'].nunique())

last_df=df_final1
last_df.reset_index(drop = True, inplace = True)
last_df.head()

Number of countries in the database with forest and land suitability: 78


Unnamed: 0,country,year,iso3,corn,forest,suitability
0,Afghanistan,2000.0,AFG,960.0,12281.1,31149.29
1,Afghanistan,2001.0,AFG,800.0,11975.3,31149.29
2,Afghanistan,2002.0,AFG,1000.0,11851.1,31149.29
3,Afghanistan,2003.0,AFG,2500.0,11735.3,31149.29
4,Afghanistan,2004.0,AFG,2500.0,11667.1,31149.29
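
In the cell above, `Total Suitable Land` is summed from `Value`, which the land cover cell treats as 1000 ha, whereas `corn1992` was converted to sq km. A minimal sketch of the same subtraction with both quantities in sq km, using the Afghanistan figures shown earlier; the function name `expansion_land_sq_km` is hypothetical.

```python
def expansion_land_sq_km(suitable_land_1000ha, corn_sq_km):
    """Suitable land minus corn area, both expressed in sq km, floored at 0."""
    suitable_sq_km = suitable_land_1000ha * 10  # 1000 ha = 10 sq km
    return max(suitable_sq_km - corn_sq_km, 0.0)

# Afghanistan, 1992: 33149.29 (1000 ha) of suitable land, 2000.0 sq km of corn
afg = expansion_land_sq_km(33149.29, 2000.0)
print(round(afg, 1))  # 329492.9
```

In the notebook, pivoting on `Value_sq_km` instead of `Value` would have the same effect.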


# <hr>

## Confounders

### Average Temperature
<p style="font-family: 'Trebuchet', sans-serif; font-size:16px; text-align:justify; line-height: 1.6;"> In degrees Celsius (°C) </p>

<p style="font-family: 'Trebuchet', sans-serif; font-size:16px; text-align:justify; line-height: 1.6;"> Source: <a href="https://ourworldindata.org/grapher/monthly-average-surface-temperatures-by-year" >https://ourworldindata.org/grapher/monthly-average-surface-temperatures-by-year</a></p>


In [546]:
avg_temp_df = pd.read_csv(filepath + "monthly-average-surface-temperatures-by-year.csv")
avg_temp_df = avg_temp_df.drop(["Year"], axis = 1)

# Average over the rows of each country, giving one mean temperature per country for every year column
avg_temp_df = (avg_temp_df.groupby(by = ["Entity","Code"]).mean().reset_index())
# Transform the DataFrame so that each row represents a single year's temperature for a country
avg_temp_df = pd.melt(
    avg_temp_df,
    id_vars=["Entity", "Code"],  # Keep these columns fixed
    var_name="Reported_Year",    # Name for the variable column (years)
    value_name="Temperature"     # Rename the values column (temperature)
)

# Convert Reported_Year to integer
avg_temp_df["Reported_Year"] = avg_temp_df["Reported_Year"].astype(int)

# Include only data between 2000 and 2021
avg_temp_df = (avg_temp_df
               .query("2000 <= Reported_Year <= 2021")
               .rename(columns = {"Temperature":"Average_Temperature",
                                  "Code":"iso3",
                                  "Reported_Year" : "year"}))

avg_temp_df = avg_temp_df.drop(["Entity"], axis = 1)

avg_temp_df.head()

Unnamed: 0,iso3,year,Average_Temperature
585,AFG,2021,13.982914
586,ALB,2021,13.125356
587,DZA,2021,25.220117
588,ASM,2021,26.756304
589,AND,2021,5.152789
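
The reshaping above goes from a wide table (one column per year) to a long one (one row per country and year). A minimal sketch on a toy table, assuming the same layout as the CSV, with several rows per country and one column per year; the entities, codes, and values are all made up.

```python
import pandas as pd

# Wide layout: several rows per country, one column per year (toy values)
wide = pd.DataFrame({
    "Entity": ["Aland", "Aland", "Borland", "Borland"],
    "Code":   ["XXA", "XXA", "XXB", "XXB"],
    "2000":   [10.0, 20.0, 0.0, 4.0],
    "2001":   [12.0, 22.0, 2.0, 6.0],
})

# One mean per country for each year column
annual = wide.groupby(["Entity", "Code"]).mean().reset_index()

# Wide to long: one row per country and year
long_df = pd.melt(
    annual,
    id_vars=["Entity", "Code"],
    var_name="year",
    value_name="Average_Temperature",
)
long_df["year"] = long_df["year"].astype(int)

print(long_df)
```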


In [547]:
last_df = pd.merge(last_df,avg_temp_df,on = ["iso3","year"],how = "left")
last_df.head()

Unnamed: 0,country,year,iso3,corn,forest,suitability,Average_Temperature
0,Afghanistan,2000.0,AFG,960.0,12281.1,31149.29,12.586175
1,Afghanistan,2001.0,AFG,800.0,11975.3,31149.29,13.413867
2,Afghanistan,2002.0,AFG,1000.0,11851.1,31149.29,13.051083
3,Afghanistan,2003.0,AFG,2500.0,11735.3,31149.29,12.485457
4,Afghanistan,2004.0,AFG,2500.0,11667.1,31149.29,13.23336


### GDP
<p style="font-family: 'Trebuchet', sans-serif; font-size:16px; text-align:justify; line-height: 1.6;"> In million USD </p>

In [548]:
gdp_data = pd.read_csv(filepath + "FAOSTAT_GDP.csv")

# Filter the data for the years 2000-2021
gdp_data = gdp_data[gdp_data['Year'] >= 2000]
gdp_data = gdp_data[gdp_data['Year'] <= 2021]

gdp_data.drop('Note', axis=1, inplace=True)
gdp_data = gdp_data.dropna()

gdp_data.head()

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
30,MK,Macro Indicators,4,Afghanistan,6110,Value US$,22008,Gross Domestic Product,2000,2000,million USD,3531.869351,X,Figure from international organizations
31,MK,Macro Indicators,4,Afghanistan,6110,Value US$,22008,Gross Domestic Product,2001,2001,million USD,3620.52525,X,Figure from international organizations
32,MK,Macro Indicators,4,Afghanistan,6110,Value US$,22008,Gross Domestic Product,2002,2002,million USD,4285.191376,X,Figure from international organizations
33,MK,Macro Indicators,4,Afghanistan,6110,Value US$,22008,Gross Domestic Product,2003,2003,million USD,4898.791114,X,Figure from international organizations
34,MK,Macro Indicators,4,Afghanistan,6110,Value US$,22008,Gross Domestic Product,2004,2004,million USD,5504.073142,X,Figure from international organizations


In [549]:
# Number of observations per country
value_counts_per_area = gdp_data.groupby("Area")["Value"].count()
# List the countries with a number of observations different from 22 (2000-2021)
area_not_equal_to_22 = value_counts_per_area[value_counts_per_area !=22].index

gdp_data = gdp_data[~gdp_data["Area"].isin(area_not_equal_to_22)]
# Keep only the relevant columns
gdp_data = gdp_data[['Area', 'Year', 'Value']]

gdp_data.head()

Unnamed: 0,Area,Year,Value
30,Afghanistan,2000,3531.869351
31,Afghanistan,2001,3620.52525
32,Afghanistan,2002,4285.191376
33,Afghanistan,2003,4898.791114
34,Afghanistan,2004,5504.073142


In [550]:
country_mapping_temp = {
    'Bolivia (Plurinational State of)':'Bolivia, Plurinational State of',
    'China, Taiwan Province of':'Taiwan, Province of China',
    'China': 'China_',
    'China, mainland':'China',
    'Democratic Republic of the Congo':'Congo, The Democratic Republic of the',
    'Iran (Islamic Republic of)':'Iran, Islamic Republic of',
    'Micronesia (Federated States of)': 'Micronesia, Federated States of',
    'Netherlands (Kingdom of the)':'Netherlands',
    'Republic of Korea': 'Korea, Republic of',
    'Venezuela (Bolivarian Republic of)':'Venezuela, Bolivarian Republic of',
    'China, Hong Kong SAR': 'Hong Kong',
    'Palestine': 'Palestine, State of',
    # You can add more mappings if necessary
}

In [551]:
# Replace the country names, so that they match the ISO-3 codes
gdp_data['Area'] = gdp_data['Area'].replace(country_mapping_temp)
# Get the ISO-3 code for each country
gdp_data['iso3'] = gdp_data['Area'].apply(get_iso3)

#print("Countries without ISO-3 codes: ",gdp_data[gdp_data['iso3'].isnull()]["Area"].unique())

gdp_data = gdp_data.dropna()
gdp_data.reset_index(drop = True, inplace = True)
gdp_data.rename(columns={'Area':'country', 'Year': 'year', 'Value':'gdp'}, inplace=True)

gdp_data.head()

Unnamed: 0,country,year,gdp,iso3
0,Afghanistan,2000,3531.869351,AFG
1,Afghanistan,2001,3620.52525,AFG
2,Afghanistan,2002,4285.191376,AFG
3,Afghanistan,2003,4898.791114,AFG
4,Afghanistan,2004,5504.073142,AFG


In [552]:
# Merge the data
last_df = pd.merge(last_df,gdp_data, on=['iso3','year'], how='outer')

last_df.drop(["country_y"], axis=1, inplace=True)
last_df.rename(columns={'country_x':'country'}, inplace=True)
last_df = last_df.dropna()
last_df.reset_index(drop = True, inplace = True)

last_df.head()

Unnamed: 0,country,year,iso3,corn,forest,suitability,Average_Temperature,gdp
0,Afghanistan,2000.0,AFG,960.0,12281.1,31149.29,12.586175,3531.869351
1,Afghanistan,2001.0,AFG,800.0,11975.3,31149.29,13.413867,3620.52525
2,Afghanistan,2002.0,AFG,1000.0,11851.1,31149.29,13.051083,4285.191376
3,Afghanistan,2003.0,AFG,2500.0,11735.3,31149.29,12.485457,4898.791114
4,Afghanistan,2004.0,AFG,2500.0,11667.1,31149.29,13.23336,5504.073142


In [553]:
print("Number of countries in the database", last_df["country"].nunique())

Number of countries in the database 78


### Cattle

In [554]:
cattle = pd.read_csv(filepath + 'cattle.csv', delimiter=';')
cattle = cattle.dropna()

cattle.head()

Unnamed: 0,country,year,Number of Cattle
0,Afghanistan,2000,700000
1,Afghanistan,2001,600000
2,Afghanistan,2002,833000
3,Afghanistan,2003,761000
4,Afghanistan,2004,829000


In [555]:
# Number of observations per country
value_counts_per_area = cattle.groupby("country")["Number of Cattle"].count()

# List the countries with fewer than 22 observations (2000-2021)
areas_not_equal_to_22 = value_counts_per_area[value_counts_per_area < 22].index
cattle = cattle[~cattle["country"].isin(areas_not_equal_to_22)]

cattle.head()

Unnamed: 0,country,year,Number of Cattle
0,Afghanistan,2000,700000
1,Afghanistan,2001,600000
2,Afghanistan,2002,833000
3,Afghanistan,2003,761000
4,Afghanistan,2004,829000


In [556]:
country_mapping_cattle = {
    'Bolivia (Plurinational State of)':'Bolivia, Plurinational State of',
    'China Hong Kong SAR': 'Hong Kong',
    'China Macao SAR': 'Macao',
    'China': 'China_',
    'China mainland':'China',
    'China Taiwan Province of':'Taiwan, Province of China',
    'Democratic Republic of the Congo':'Congo, The Democratic Republic of the',
    'Iran (Islamic Republic of)':'Iran, Islamic Republic of',
    'Micronesia (Federated States of)': 'Micronesia, Federated States of',
    'Netherlands (Kingdom of the)':'Netherlands',
    'Palestine':'Palestine, State of',
    'Republic of Korea': 'Korea, Republic of',
    'Venezuela (Bolivarian Republic of)':'Venezuela, Bolivarian Republic of',
}

In [557]:
# Replace the country names, so that they match the ISO-3 codes
cattle['country'] = cattle['country'].replace(country_mapping_cattle)
# Get the ISO-3 code for each country
cattle['iso3'] = cattle['country'].apply(get_iso3)

cattle = cattle.dropna()
cattle.reset_index(drop = True, inplace = True)
cattle.rename(columns={'Number of Cattle':'cattle'}, inplace=True)

cattle.head()

Unnamed: 0,country,year,cattle,iso3
0,Afghanistan,2000,700000,AFG
1,Afghanistan,2001,600000,AFG
2,Afghanistan,2002,833000,AFG
3,Afghanistan,2003,761000,AFG
4,Afghanistan,2004,829000,AFG


In [558]:
last_df = pd.merge(last_df,cattle, on=['iso3','year'], how='outer')
last_df.drop(["country_y"], axis=1, inplace=True)
last_df.rename(columns={'country_x':'country'}, inplace=True)
last_df = last_df.dropna()
last_df.head()

Unnamed: 0,country,year,iso3,corn,forest,suitability,Average_Temperature,gdp,cattle
0,Afghanistan,2000.0,AFG,960.0,12281.1,31149.29,12.586175,3531.869351,700000.0
1,Afghanistan,2001.0,AFG,800.0,11975.3,31149.29,13.413867,3620.52525,600000.0
2,Afghanistan,2002.0,AFG,1000.0,11851.1,31149.29,13.051083,4285.191376,833000.0
3,Afghanistan,2003.0,AFG,2500.0,11735.3,31149.29,12.485457,4898.791114,761000.0
4,Afghanistan,2004.0,AFG,2500.0,11667.1,31149.29,13.23336,5504.073142,829000.0


In [559]:
print("Final number of countries in the database: ", last_df['country'].nunique())

Final number of countries in the database:  77
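
Before saving, it can be worth checking that the merged table still has exactly one row per country and year, since repeated merges can introduce duplicates when a key appears more than once in one of the inputs. A minimal sketch on a toy frame; the helper name `check_panel` is hypothetical.

```python
import pandas as pd

def check_panel(df, keys=("iso3", "year")):
    """Return the number of duplicated key pairs and of missing values."""
    n_duplicates = int(df.duplicated(subset=list(keys)).sum())
    n_missing = int(df.isna().sum().sum())
    return n_duplicates, n_missing

# Toy frame in which the pair (AFG, 2001) appears twice
toy = pd.DataFrame({
    "iso3": ["AFG", "AFG", "AFG"],
    "year": [2000, 2001, 2001],
    "corn": [960.0, 800.0, 800.0],
})

print(check_panel(toy))  # (1, 0): one repeated country-year, no missing values
```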


In [None]:
# Uncomment the line below to save the data to a csv file
#last_df.to_csv('./Data/Database.csv', index=False)