# Forest Restoration and Reforestation Analysis

## 🧭 01 – Data Preparation and Cleaning

This notebook is the first major phase of the data science workflow for the project AI for Sustainable Forest Restoration and Reforestation Analysis, developed within the Introduction to Data Science course of the Erasmus Mundus Joint Master's Programme Artificial Intelligence for Sustainable Societies (AISS).

The goal of this stage is to load, explore, clean, and merge multiple open-access datasets from Global Forest Watch (GFW) into a single, well-structured analytical dataset that can later be used for exploratory data analysis (EDA), machine-learning modeling, and interactive visualization through a Dash web dashboard.

## 🎯 Objectives

- Ingest raw data from the GFW Excel workbook, which contains nine sheets (a Read_Me sheet plus country-level and subnational data).
- Filter, normalize, and restructure the relevant country-level sheets:
  - Country tree cover loss
  - Country primary loss
  - Country drivers
  - Country carbon data
- Transform the data into a tidy format by unpivoting year-based columns, harmonizing thresholds, and renaming variables for clarity.
- Integrate the datasets (loss, primary loss, drivers, and carbon emissions) into one comprehensive dataframe.
- Validate data quality through summary statistics and visual inspections.
- Export the cleaned dataset (merged_clean_data.csv) to the data/processed/ directory for use in downstream analysis and the Dash dashboard.
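
The integration objective can be illustrated on toy data. This is a minimal sketch, not the notebook's final merge logic: the frames and values are made up, and joining on country, threshold, and year with a left join is an assumption about how the tidy tables might be combined.

```python
import pandas as pd

# Toy stand-ins for two tidy tables (values are invented for illustration).
loss = pd.DataFrame({
    "country": ["A", "A", "B"],
    "threshold": [30, 30, 30],
    "year": [2002, 2003, 2002],
    "tree_cover_loss_ha": [100, 120, 50],
})
primary = pd.DataFrame({
    "country": ["A", "A"],
    "threshold": [30, 30],
    "year": [2002, 2003],
    "primary_forest_loss_ha": [10, 12],
})

keys = ["country", "threshold", "year"]

# A left join keeps every tree-cover-loss row; countries without primary
# forest data (here "B") get NaN instead of being dropped.
# validate="one_to_one" raises if the keys are duplicated on either side.
merged = loss.merge(primary, on=keys, how="left", validate="one_to_one")
print(merged)
```

A left join is used here because the primary-forest sheet covers fewer countries than the tree-cover-loss sheet; an inner join would silently discard the rest.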

## 🌍 Context

Deforestation and reforestation patterns vary significantly across regions. Integrating the GFW datasets provides a quantitative foundation for evaluating how forest restoration contributes to Sustainable Development Goal 15 (Life on Land).
Preparing and standardizing these data systematically keeps the subsequent analytical phases (EDA, modeling, and dashboard development) reproducible and built on consistent inputs.

## Questions to be answered
1. How much tree cover exists?
2. How much is lost?
3. What causes the loss?
4. How does that loss affect carbon emissions and climate?

### 🧩 Step 0 – Setup & Imports

In [1]:
import pandas as pd

RAW_PATH = "../data/raw/global_forest_watch_raw_data.xlsx"

# Read Excel workbook
global_forest_watch_excel_file = pd.ExcelFile(RAW_PATH)
global_forest_watch_excel_file.sheet_names


['Read_Me',
 'Country tree cover loss',
 'Country primary loss',
 'Country drivers',
 'Country carbon data',
 'Subnational 1 tree cover loss',
 'Subnational 1 primary loss',
 'Subnational 1 drivers',
 'Subnational 1 carbon data']
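
Before parsing, it can help to confirm that the four country-level sheets used in this notebook are present. The helper below is a hypothetical sketch (the name `missing_sheets` and the constant `REQUIRED_SHEETS` are not part of the original notebook); it works on a plain list of names, so it runs without the workbook.

```python
REQUIRED_SHEETS = [
    "Country tree cover loss",
    "Country primary loss",
    "Country drivers",
    "Country carbon data",
]

def missing_sheets(available, required=REQUIRED_SHEETS):
    """Return the required sheet names that are absent from `available`."""
    available_set = set(available)
    return [name for name in required if name not in available_set]

# Sheet names as listed in the workbook output above.
sheet_names = [
    "Read_Me",
    "Country tree cover loss",
    "Country primary loss",
    "Country drivers",
    "Country carbon data",
    "Subnational 1 tree cover loss",
    "Subnational 1 primary loss",
    "Subnational 1 drivers",
    "Subnational 1 carbon data",
]

print(missing_sheets(sheet_names))  # → []
```

In the notebook, `global_forest_watch_excel_file.sheet_names` could be passed in place of the hard-coded list.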

### 🌲 1. Processing the “Country Tree Cover Loss” Sheet

This dataset provides annual tree cover loss (in hectares) per country and canopy threshold from 2001 to 2024.
It represents total forest area lost, regardless of forest type or cause.

Our main goals for this section are to:
- Audit the raw dataset to understand structure, missingness, and data types.
- Clean the data by standardizing column names, fixing country names, and converting data types.
- Reshape wide-format yearly columns (tc_loss_ha_2001, tc_loss_ha_2002, …) into a tidy long format with a single year column.
- Save the processed output to data/processed/country_tree_cover_loss_processed.csv.

🧩 STEP 1 – Raw Data Audit

🧠 What this does:
- Loads the sheet safely (no edits).
- Prints structure, data types, and missing values.
- Checks for duplicates and previews a few rows.
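
Because the same audit is repeated for each sheet, the checks can be wrapped in one read-only helper. This is a sketch under assumptions: `audit_frame` is a hypothetical name, and the toy frame uses invented values.

```python
import pandas as pd

def audit_frame(df, key_cols):
    """Summarize shape, missing values, and duplicate keys without modifying df."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
    }

# Toy frame with one missing value and one duplicated country-threshold key.
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "threshold": [30, 30, 30],
    "tc_loss_ha_2001": [1.0, None, 3.0],
})

report = audit_frame(toy, key_cols=["country", "threshold"])
print(report)
```

Returning a dictionary instead of printing makes the result easy to assert on or to compare across sheets.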

In [2]:
# Quick first look at the sheet (pandas and the workbook handle come from Step 0)
country_tree_cover_loss_sheet = global_forest_watch_excel_file.parse("Country tree cover loss")
country_tree_cover_loss_sheet.info()  # info() prints directly and returns None
print(country_tree_cover_loss_sheet.describe())  # describe must be called; without () only the bound method is printed
print(country_tree_cover_loss_sheet["country"].nunique())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1328 entries, 0 to 1327
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            1328 non-null   object
 1   threshold          1328 non-null   int64 
 2   area_ha            1328 non-null   int64 
 3   extent_2000_ha     1328 non-null   int64 
 4   extent_2010_ha     1328 non-null   int64 
 5   gain_2000-2012_ha  1328 non-null   int64 
 6   tc_loss_ha_2001    1328 non-null   int64 
 7   tc_loss_ha_2002    1328 non-null   int64 
 8   tc_loss_ha_2003    1328 non-null   int64 
 9   tc_loss_ha_2004    1328 non-null   int64 
 10  tc_loss_ha_2005    1328 non-null   int64 
 11  tc_loss_ha_2006    1328 non-null   int64 
 12  tc_loss_ha_2007    1328 non-null   int64 
 13  tc_loss_ha_2008    1328 non-null   int64 
 14  tc_loss_ha_2009    1328 non-null   int64 
 15  tc_loss_ha_2010    1328 non-null   int64 
 16  tc_loss_ha_2011    1328 non-null   int64 


In [38]:
import numpy as np

# --- Load raw sheet (read-only) ---
country_tree_cover_loss_sheet = global_forest_watch_excel_file.parse("Country tree cover loss")

print("✅ Loaded 'Country tree cover loss' sheet")
print("Shape (rows, columns):", country_tree_cover_loss_sheet.shape)

# 1. Basic info: data types and non-null counts
print("\n🔹 Basic Info:")
country_tree_cover_loss_sheet.info()

# 2. Missing values summary (ten columns with the most missing values)
print("\n🔹 Missing values per column:")
display(
    country_tree_cover_loss_sheet.isna()
    .sum()
    .to_frame("missing_count")
    .sort_values("missing_count", ascending=False)
    .head(10)
)

# 3. Numeric summary (helps spot anomalies)
print("\n🔹 Descriptive statistics (numeric columns):")
display(country_tree_cover_loss_sheet.describe(include=[np.number]).T.head(10))

# 4. Duplicate check for country–threshold pairs
dup_count = country_tree_cover_loss_sheet.duplicated(subset=["country", "threshold"]).sum()
print(f"\n🔹 Duplicate country–threshold pairs: {dup_count}")

# 5. Sample rows
print("\n🔹 Sample preview (first 5 rows):")
display(country_tree_cover_loss_sheet.head(5))


✅ Loaded 'Country tree cover loss' sheet
Shape (rows, columns): (1328, 30)

🔹 Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1328 entries, 0 to 1327
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            1328 non-null   object
 1   threshold          1328 non-null   int64 
 2   area_ha            1328 non-null   int64 
 3   extent_2000_ha     1328 non-null   int64 
 4   extent_2010_ha     1328 non-null   int64 
 5   gain_2000-2012_ha  1328 non-null   int64 
 6   tc_loss_ha_2001    1328 non-null   int64 
 7   tc_loss_ha_2002    1328 non-null   int64 
 8   tc_loss_ha_2003    1328 non-null   int64 
 9   tc_loss_ha_2004    1328 non-null   int64 
 10  tc_loss_ha_2005    1328 non-null   int64 
 11  tc_loss_ha_2006    1328 non-null   int64 
 12  tc_loss_ha_2007    1328 non-null   int64 
 13  tc_loss_ha_2008    1328 non-null   int64 
 14  tc_loss_ha_2009    1328 non-null   int64 

Unnamed: 0,missing_count
country,0
threshold,0
tc_loss_ha_2023,0
tc_loss_ha_2022,0
tc_loss_ha_2021,0
tc_loss_ha_2020,0
tc_loss_ha_2019,0
tc_loss_ha_2018,0
tc_loss_ha_2017,0
tc_loss_ha_2016,0



🔹 Descriptive statistics (numeric columns):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
threshold,1328.0,28.125,22.49979,0.0,13.75,22.5,35.0,75.0
area_ha,1328.0,78148050.0,201575200.0,2094.0,5117777.0,20225865.5,62019967.0,1689455000.0
extent_2000_ha,1328.0,30380200.0,105670400.0,0.0,548025.5,3622985.5,18319767.5,1689455000.0
extent_2010_ha,1328.0,29943470.0,104738200.0,0.0,541270.0,3499126.0,18198855.25,1689455000.0
gain_2000-2012_ha,1328.0,786730.8,3417373.0,0.0,13832.0,94359.0,388240.0,37220540.0
tc_loss_ha_2001,1328.0,79065.76,312454.8,0.0,514.5,6940.5,31545.25,2933201.0
tc_loss_ha_2002,1328.0,96442.43,411666.9,0.0,393.25,5172.0,32422.25,3715945.0
tc_loss_ha_2003,1328.0,84788.51,376891.8,0.0,312.0,3940.5,28489.25,3489258.0
tc_loss_ha_2004,1328.0,116282.2,494717.7,0.0,532.75,6147.5,38602.75,4133606.0
tc_loss_ha_2005,1328.0,106803.6,423154.5,0.0,548.25,7207.5,40362.5,3675951.0



🔹 Duplicate country–threshold pairs: 0

🔹 Sample preview (first 5 rows):


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Afghanistan,0,64383655,64383655,64383655,10738,103,214,267,226,...,0,0,0,31,25,46,47,16,133,223
1,Afghanistan,10,64383655,432070,126231,10738,92,190,254,207,...,0,0,0,28,19,40,37,9,32,32
2,Afghanistan,15,64383655,302629,106852,10738,91,186,248,205,...,0,0,0,28,19,39,32,7,23,17
3,Afghanistan,20,64383655,284330,105718,10738,89,181,245,203,...,0,0,0,28,18,39,32,7,22,16
4,Afghanistan,25,64383655,254843,72384,10738,89,180,244,202,...,0,0,0,27,18,38,27,6,21,14


🧩 STEP 2 – Cleaning “Country tree cover loss”

🎯 Goal
- Detect and handle small issues before transforming.
- Ensure the data has consistent country names, numeric types, and no duplicate rows.
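
The cleaning goals above recur for every sheet, so they can be expressed as one function that returns a cleaned copy. This is a sketch, not the notebook's code: `standardize_frame` and its `text_cols` parameter are hypothetical, and the toy values are invented.

```python
import pandas as pd

def standardize_frame(df, text_cols=("country",)):
    """Return a cleaned copy: snake_case column names and tidy text columns."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    for col in text_cols:
        if col in out.columns:
            # Trim whitespace and use consistent capitalization.
            out[col] = out[col].str.strip().str.title()
    return out

toy = pd.DataFrame({" Country ": ["  brazil", "PERU "], "Area Ha": [1, 2]})
clean = standardize_frame(toy)

print(list(clean.columns))       # → ['country', 'area_ha']
print(clean["country"].tolist())  # → ['Brazil', 'Peru']
```

Working on a copy keeps the raw frame untouched, which matches the read-only audit step.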

In [12]:
# Work on a copy to avoid touching raw data
tcl_clean = country_tree_cover_loss_sheet.copy()

# --- 1. Standardize column names ---
tcl_clean.columns = tcl_clean.columns.str.strip().str.lower().str.replace(" ", "_")

# --- 2. Clean country names ---
if "country" in tcl_clean.columns:
    tcl_clean["country"] = tcl_clean["country"].astype(str).str.strip().str.title()

# --- 3. Replace empty strings with NA ---
tcl_clean = tcl_clean.replace(r"^\s*$", pd.NA, regex=True)

# --- 4. Remove duplicate country–threshold rows (if any) ---
before = len(tcl_clean)
tcl_clean = tcl_clean.drop_duplicates(subset=["country", "threshold"], keep="first")
after = len(tcl_clean)
print(f"Removed {before - after} duplicate rows (if any).")

# --- 5. Convert numeric-looking text columns ---
# Only object (text) columns need conversion; genuinely textual columns
# such as country raise ValueError and are left unchanged.
object_cols = tcl_clean.select_dtypes(include="object").columns
for col in object_cols:
    try:
        tcl_clean[col] = pd.to_numeric(tcl_clean[col])
    except (ValueError, TypeError):
        pass  # keep the column as-is

# --- 6. Verify results ---
print("\n✅ After cleaning:")
display(tcl_clean.info())
display(tcl_clean.head(5))


Removed 0 duplicate rows (if any).

✅ After cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1328 entries, 0 to 1327
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   country            1328 non-null   object
 1   threshold          1328 non-null   int64 
 2   area_ha            1328 non-null   int64 
 3   extent_2000_ha     1328 non-null   int64 
 4   extent_2010_ha     1328 non-null   int64 
 5   gain_2000-2012_ha  1328 non-null   int64 
 6   tc_loss_ha_2001    1328 non-null   int64 
 7   tc_loss_ha_2002    1328 non-null   int64 
 8   tc_loss_ha_2003    1328 non-null   int64 
 9   tc_loss_ha_2004    1328 non-null   int64 
 10  tc_loss_ha_2005    1328 non-null   int64 
 11  tc_loss_ha_2006    1328 non-null   int64 
 12  tc_loss_ha_2007    1328 non-null   int64 
 13  tc_loss_ha_2008    1328 non-null   int64 
 14  tc_loss_ha_2009    1328 non-null   int64 
 15  tc_loss_ha_2010    1328 non-null 

None

Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Afghanistan,0,64383655,64383655,64383655,10738,103,214,267,226,...,0,0,0,31,25,46,47,16,133,223
1,Afghanistan,10,64383655,432070,126231,10738,92,190,254,207,...,0,0,0,28,19,40,37,9,32,32
2,Afghanistan,15,64383655,302629,106852,10738,91,186,248,205,...,0,0,0,28,19,39,32,7,23,17
3,Afghanistan,20,64383655,284330,105718,10738,89,181,245,203,...,0,0,0,28,18,39,32,7,22,16
4,Afghanistan,25,64383655,254843,72384,10738,89,180,244,202,...,0,0,0,27,18,38,27,6,21,14


🧩 STEP 3 – Transform “Country tree cover loss” to Long Format

🧠 Why we're doing this

- Makes time-based analysis and visualization straightforward (e.g., plotting tree cover loss over time).
- Converts the 24 yearly columns (2001–2024) into one year column plus one tree_cover_loss_ha column.
- Keeps country, threshold, and other metadata intact.

In [13]:
def melt_yearly_columns(df, prefix, value_name):
    """
    Converts wide year columns (e.g., tc_loss_ha_2001, tc_loss_ha_2002, …)
    into a tidy long format with a year column and a <value_name> column.
    All other columns are kept as identifiers.
    """
    # Detect all columns that start with the prefix
    year_cols = [c for c in df.columns if c.startswith(prefix)]

    if not year_cols:
        print(f"No columns found with prefix '{prefix}'. Check column names.")
        return df

    melted = df.melt(
        id_vars=[c for c in df.columns if c not in year_cols],
        value_vars=year_cols,
        var_name="metric_year",
        value_name=value_name
    )
    # Extract the four-digit year from the column name as an integer
    # (expand=False returns a Series rather than a one-column DataFrame)
    melted["year"] = melted["metric_year"].str.extract(r"(\d{4})", expand=False).astype(int)

    # Drop the temporary column
    melted = melted.drop(columns=["metric_year"])

    return melted


# --- Apply transformation ---
tcl_tidy = melt_yearly_columns(tcl_clean, prefix="tc_loss_ha_", value_name="tree_cover_loss_ha")

print("✅ Transformed shape:", tcl_tidy.shape)
print("Columns:", list(tcl_tidy.columns)[:10])

print("\n📊 Preview of tidy data (AFTER processing):")
display(tcl_tidy.head(10))


✅ Transformed shape: (31872, 8)
Columns: ['country', 'threshold', 'area_ha', 'extent_2000_ha', 'extent_2010_ha', 'gain_2000-2012_ha', 'tree_cover_loss_ha', 'year']

📊 Preview of tidy data (AFTER processing):


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tree_cover_loss_ha,year
0,Afghanistan,0,64383655,64383655,64383655,10738,103,2001
1,Afghanistan,10,64383655,432070,126231,10738,92,2001
2,Afghanistan,15,64383655,302629,106852,10738,91,2001
3,Afghanistan,20,64383655,284330,105718,10738,89,2001
4,Afghanistan,25,64383655,254843,72384,10738,89,2001
5,Afghanistan,30,64383655,205771,71786,10738,88,2001
6,Afghanistan,50,64383655,148417,46235,10738,78,2001
7,Afghanistan,75,64383655,75480,18268,10738,46,2001
8,Albania,0,2872761,2872761,2872761,16468,3907,2001
9,Albania,10,2872761,838601,712542,16468,3815,2001


💾 STEP 4 – Save the Processed Data

In [15]:
import os

# Create folder if not already present
os.makedirs("../data/processed", exist_ok=True)

# Define output path
tcl_out_path = "../data/processed/country_tree_cover_loss_processed.csv"

# Save the processed tidy data
tcl_tidy.to_csv(tcl_out_path, index=False)

print(f"💾 Saved processed dataset to: {tcl_out_path}")
print(f"Rows: {len(tcl_tidy):,} | Columns: {len(tcl_tidy.columns)}")

# Quick verification: reload and confirm structure
verify = pd.read_csv(tcl_out_path)
print("\n✅ Reloaded successfully! Sample below:")
display(verify.head(10))


💾 Saved processed dataset to: ../data/processed/country_tree_cover_loss_processed.csv
Rows: 31,872 | Columns: 8

✅ Reloaded successfully! Sample below:


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tree_cover_loss_ha,year
0,Afghanistan,0,64383655,64383655,64383655,10738,103,2001
1,Afghanistan,10,64383655,432070,126231,10738,92,2001
2,Afghanistan,15,64383655,302629,106852,10738,91,2001
3,Afghanistan,20,64383655,284330,105718,10738,89,2001
4,Afghanistan,25,64383655,254843,72384,10738,89,2001
...,...,...,...,...,...,...,...,...
95,Belize,75,2204340,1575027,1575027,18728,7866,2001
96,Benin,0,11528631,11528631,11528631,181245,18084,2001
97,Benin,10,11528631,8877131,7726250,181245,17778,2001
98,Benin,15,11528631,4559877,4251239,181245,10302,2001
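
Viewing a sample confirms that the file can be read, but not that it matches what was written. A stricter round-trip check might look like the sketch below. It is an illustration on a toy frame and writes to an in-memory buffer rather than to data/processed/, so it has no side effects.

```python
import io
import pandas as pd

tidy = pd.DataFrame({
    "country": ["A", "B"],
    "threshold": [0, 10],
    "tree_cover_loss_ha": [103, 92],
    "year": [2001, 2001],
})

# Write to an in-memory buffer instead of disk.
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, column order, or dtypes differ.
pd.testing.assert_frame_equal(tidy, reloaded)
print("round trip OK")
```

In the notebook, the same comparison could be made between `tcl_tidy` and `verify`.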


### 🌴 2. Processing the “Country Primary Loss” Sheet

This dataset focuses on humid tropical primary forests, providing annual primary forest loss (in hectares) for 2002–2024.
It reflects the most ecologically significant areas of forest change.

Our main goals for this section are to:

- Audit the raw dataset to check column structure and detect anomalies.
- Clean column names (notably fixing area__ha → area_ha) and standardize country names.
- Reshape the yearly loss columns (tc_loss_ha_2002, tc_loss_ha_2003, …) into a tidy long format.
- Save the processed output to data/processed/country_primary_loss_processed.csv.

🌴 STEP 1 – Audit “Country primary loss”

🧠 Why we audit again

Even though the structure is similar, this sheet:
- Focuses specifically on humid tropical primary forests (2002–2024).
- May have slightly different column names (for example area__ha instead of area_ha).
- Starts a year later (2002) and may have its own missingness patterns.
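
Naming differences such as area__ha can be found programmatically instead of by eye. The sketch below is one possible approach: `normalize_name` is a hypothetical helper, and collapsing repeated underscores is an assumption about which differences should count as the same column.

```python
import re

def normalize_name(name):
    """Lowercase, trim, and collapse runs of spaces/underscores into one underscore."""
    return re.sub(r"[\s_]+", "_", name.strip().lower())

# Non-year columns as they are named in the two sheets in this notebook.
loss_cols = ["country", "threshold", "area_ha", "extent_2000_ha"]
primary_cols = ["country", "threshold", "area__ha"]

# Columns whose raw spelling differs from the normalized form.
renames = {c: normalize_name(c) for c in primary_cols if c != normalize_name(c)}
print(renames)  # → {'area__ha': 'area_ha'}

# Columns the two sheets share once names are normalized.
shared = set(map(normalize_name, loss_cols)) & set(map(normalize_name, primary_cols))
print(sorted(shared))  # → ['area_ha', 'country', 'threshold']
```

The `renames` mapping can be passed directly to `DataFrame.rename(columns=...)`.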

In [16]:
# --- Load the sheet ---
pl_raw = global_forest_watch_excel_file.parse("Country primary loss")

print("✅ Loaded 'Country primary loss' sheet")
print("Shape (rows, columns):", pl_raw.shape)

# 1. Basic info
print("\n🔹 Basic Info:")
pl_raw.info()

# 2. Missing values summary
print("\n🔹 Missing values per column:")
display(
    pl_raw.isna()
    .sum()
    .to_frame("missing_count")
    .sort_values("missing_count", ascending=False)
    .head(10)
)

# 3. Numeric summary
print("\n🔹 Descriptive statistics (numeric columns):")
display(pl_raw.describe(include=[np.number]).T.head(10))

# 4. Duplicate check
dup_count = pl_raw.duplicated(subset=["country", "threshold"]).sum()
print(f"\n🔹 Duplicate country–threshold pairs: {dup_count}")

# 5. Sample rows
print("\n🔹 Sample of raw data:")
display(pl_raw.head(5))


✅ Loaded 'Country primary loss' sheet
Shape (rows, columns): (76, 26)

🔹 Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 26 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          76 non-null     object
 1   threshold        76 non-null     int64 
 2   area__ha         76 non-null     int64 
 3   tc_loss_ha_2002  76 non-null     int64 
 4   tc_loss_ha_2003  76 non-null     int64 
 5   tc_loss_ha_2004  76 non-null     int64 
 6   tc_loss_ha_2005  76 non-null     int64 
 7   tc_loss_ha_2006  76 non-null     int64 
 8   tc_loss_ha_2007  76 non-null     int64 
 9   tc_loss_ha_2008  76 non-null     int64 
 10  tc_loss_ha_2009  76 non-null     int64 
 11  tc_loss_ha_2010  76 non-null     int64 
 12  tc_loss_ha_2011  76 non-null     int64 
 13  tc_loss_ha_2012  76 non-null     int64 
 14  tc_loss_ha_2013  76 non-null     int64 
 15  tc_loss_ha_2014  76 non-null     int6

Unnamed: 0,missing_count
country,0
threshold,0
tc_loss_ha_2023,0
tc_loss_ha_2022,0
tc_loss_ha_2021,0
tc_loss_ha_2020,0
tc_loss_ha_2019,0
tc_loss_ha_2018,0
tc_loss_ha_2017,0
tc_loss_ha_2016,0



🔹 Descriptive statistics (numeric columns):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
threshold,76.0,30.0,0.0,30.0,30.0,30.0,30.0,30.0
area__ha,76.0,13497960.0,43001720.0,1653.0,227151.25,1833099.5,7487047.0,343260979.0
tc_loss_ha_2002,76.0,35009.86,188356.5,0.0,132.5,2029.5,10132.5,1621738.0
tc_loss_ha_2003,76.0,32682.53,181678.3,0.0,122.5,1582.5,10143.75,1570540.0
tc_loss_ha_2004,76.0,44629.41,236723.2,0.0,203.5,2277.0,11972.75,2016350.0
tc_loss_ha_2005,76.0,43705.5,215722.9,0.0,217.0,2320.0,12021.0,1824217.0
tc_loss_ha_2006,76.0,36953.67,170600.6,0.0,245.5,2528.5,14184.5,1415536.0
tc_loss_ha_2007,76.0,38147.87,145239.2,0.0,184.5,3063.5,18432.0,1149515.0
tc_loss_ha_2008,76.0,35632.92,135475.6,0.0,222.75,3677.5,15289.25,1075087.0
tc_loss_ha_2009,76.0,36765.18,116080.1,0.0,445.25,3729.5,19996.25,700115.0



🔹 Duplicate country–threshold pairs: 0

🔹 Sample of raw data:


Unnamed: 0,country,threshold,area__ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Angola,30,2458061,3499,2963,2354,3110,1400,8060,2699,...,8998,12040,11166,13507,9995,8895,24326,15576,17627,13660
1,Argentina,30,4418724,9318,14459,28090,31429,24095,18687,47067,...,10547,15247,17202,9496,8983,20847,11921,21388,11473,12103
2,Australia,30,13977,0,0,0,0,25,0,0,...,5,0,0,0,5,0,0,0,0,0
3,Bangladesh,30,101114,619,266,347,306,677,369,240,...,205,345,414,358,387,459,308,307,743,467
4,Belize,30,1165487,5570,2993,2108,3206,1899,4140,3632,...,6606,11511,6616,4781,8772,16087,4560,4033,11667,21137


🧩 STEP 2 – Cleaning “Country primary loss”
🧠 What this does
- Fixes area__ha naming.
- Standardizes and cleans country names.
- Removes duplicates.

In [17]:
# Work on a copy
pl_clean = pl_raw.copy()

# --- 1. Standardize column names ---
pl_clean.columns = pl_clean.columns.str.strip().str.lower().str.replace(" ", "_")

# --- 2. Rename inconsistent columns ---
if "area__ha" in pl_clean.columns:
    pl_clean = pl_clean.rename(columns={"area__ha": "area_ha"})
    print("Renamed column 'area__ha' → 'area_ha'")

# --- 3. Clean country names ---
if "country" in pl_clean.columns:
    pl_clean["country"] = pl_clean["country"].astype(str).str.strip().str.title()

# --- 4. Replace empty strings with NA ---
pl_clean = pl_clean.replace(r"^\s*$", pd.NA, regex=True)

# --- 5. Drop duplicates ---
before = len(pl_clean)
pl_clean = pl_clean.drop_duplicates(subset=["country", "threshold"], keep="first")
after = len(pl_clean)
print(f"Removed {before - after} duplicate rows (if any).")

# --- 6. Convert numeric-like columns ---
for col in pl_clean.columns:
    try:
        pl_clean[col] = pd.to_numeric(pl_clean[col])
    except (ValueError, TypeError):
        pass  # non-numeric text (e.g., country) stays unchanged

# --- 7. Verify results ---
print("\n✅ After cleaning:")
display(pl_clean.info())
display(pl_clean.head(5))


Renamed column 'area__ha' → 'area_ha'
Removed 0 duplicate rows (if any).

✅ After cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 26 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          76 non-null     object
 1   threshold        76 non-null     int64 
 2   area_ha          76 non-null     int64 
 3   tc_loss_ha_2002  76 non-null     int64 
 4   tc_loss_ha_2003  76 non-null     int64 
 5   tc_loss_ha_2004  76 non-null     int64 
 6   tc_loss_ha_2005  76 non-null     int64 
 7   tc_loss_ha_2006  76 non-null     int64 
 8   tc_loss_ha_2007  76 non-null     int64 
 9   tc_loss_ha_2008  76 non-null     int64 
 10  tc_loss_ha_2009  76 non-null     int64 
 11  tc_loss_ha_2010  76 non-null     int64 
 12  tc_loss_ha_2011  76 non-null     int64 
 13  tc_loss_ha_2012  76 non-null     int64 
 14  tc_loss_ha_2013  76 non-null     int64 
 15  tc_loss_ha_2014  76 non-null   

None

Unnamed: 0,country,threshold,area_ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Angola,30,2458061,3499,2963,2354,3110,1400,8060,2699,...,8998,12040,11166,13507,9995,8895,24326,15576,17627,13660
1,Argentina,30,4418724,9318,14459,28090,31429,24095,18687,47067,...,10547,15247,17202,9496,8983,20847,11921,21388,11473,12103
2,Australia,30,13977,0,0,0,0,25,0,0,...,5,0,0,0,5,0,0,0,0,0
3,Bangladesh,30,101114,619,266,347,306,677,369,240,...,205,345,414,358,387,459,308,307,743,467
4,Belize,30,1165487,5570,2993,2108,3206,1899,4140,3632,...,6606,11511,6616,4781,8772,16087,4560,4033,11667,21137


🧩 STEP 3 – Transform “Country primary loss” to Long Format
🧠 Why this step matters

- The cleaned data still has one column per year (tc_loss_ha_2002, tc_loss_ha_2003, etc.).
- We need to convert these into a single “year” column for easier merging, plotting, and analysis later.

In [18]:
def melt_yearly_columns(df, prefix, value_name):
    """
    Converts wide year columns (e.g., tc_loss_ha_2002, tc_loss_ha_2003, …)
    into a tidy long format with columns: country, threshold, year, <value_name>.
    """
    year_cols = [c for c in df.columns if c.startswith(prefix)]
    if not year_cols:
        print(f"No columns found with prefix '{prefix}'. Check column names.")
        return df

    melted = df.melt(
        id_vars=[c for c in df.columns if c not in year_cols],
        value_vars=year_cols,
        var_name="metric_year",
        value_name=value_name
    )

    melted["year"] = melted["metric_year"].str.extract(r"(\d{4})").astype(int)
    melted = melted.drop(columns=["metric_year"])
    return melted

# --- Apply transformation ---
pl_tidy = melt_yearly_columns(pl_clean, prefix="tc_loss_ha_", value_name="primary_forest_loss_ha")

print("✅ Transformed shape:", pl_tidy.shape)
print("Columns:", list(pl_tidy.columns)[:10])

print("\n📊 Preview of tidy data (after transformation):")
display(pl_tidy.head(10))


✅ Transformed shape: (1748, 5)
Columns: ['country', 'threshold', 'area_ha', 'primary_forest_loss_ha', 'year']

📊 Preview of tidy data (after transformation):


Unnamed: 0,country,threshold,area_ha,primary_forest_loss_ha,year
0,Angola,30,2458061,3499,2002
1,Argentina,30,4418724,9318,2002
2,Australia,30,13977,0,2002
3,Bangladesh,30,101114,619,2002
4,Belize,30,1165487,5570,2002
5,Benin,30,1952,0,2002
6,Bhutan,30,1645545,119,2002
7,Bolivia,30,40850721,70494,2002
8,Brazil,30,343260979,1621738,2002
9,Brunei,30,431532,474,2002


💾 STEP 4 – Save Processed “Country primary loss” Data

In [19]:
import os

# Ensure folder exists
os.makedirs("../data/processed", exist_ok=True)

# Define output path
pl_out_path = "../data/processed/country_primary_loss_processed.csv"

# Save tidy dataset
pl_tidy.to_csv(pl_out_path, index=False)

print(f"💾 Saved processed dataset to: {pl_out_path}")
print(f"Rows: {len(pl_tidy):,} | Columns: {len(pl_tidy.columns)}")

# Verify save worked correctly
verify_pl = pd.read_csv(pl_out_path)
print("\n✅ Reloaded successfully! Sample below:")
display(verify_pl.head(5))


💾 Saved processed dataset to: ../data/processed/country_primary_loss_processed.csv
Rows: 1,748 | Columns: 5

✅ Reloaded successfully! Sample below:


Unnamed: 0,country,threshold,area_ha,primary_forest_loss_ha,year
0,Angola,30,2458061,3499,2002
1,Argentina,30,4418724,9318,2002
2,Australia,30,13977,0,2002
3,Bangladesh,30,101114,619,2002
4,Belize,30,1165487,5570,2002


### 🌾 3. Processing the “Country Drivers” Sheet

This dataset links annual tree cover loss to its dominant drivers, such as agriculture, logging, fires, or urbanization.
Unlike the previous sheets, it already contains a driver and year column, so our main tasks are to:

- Audit the data for structure, missing values, and unique driver types.
- Clean column names and handle duplicates or blanks.
- Pivot the driver column into multiple columns (one per driver) to enable country-level comparisons.
- Save the cleaned and pivoted version into data/processed/country_drivers_processed.csv.

🌾 STEP 1 – Audit “Country drivers”

🧠 Why this step matters

This sheet links forest loss to dominant drivers, allowing you to analyze which causes are most impactful globally or regionally.
Unlike the first two, this dataset already has a categorical driver column and a year column, but we'll need to pivot it later to get one column per driver.

In [20]:
# --- Load raw 'Country drivers' sheet ---
drivers_raw = global_forest_watch_excel_file.parse("Country drivers")

print("✅ Loaded 'Country drivers' sheet")
print("Shape (rows, columns):", drivers_raw.shape)

# 1. Basic info
print("\n🔹 Basic Info:")
drivers_raw.info()

# 2. Missing values
print("\n🔹 Missing values per column:")
display(
    drivers_raw.isna()
    .sum()
    .to_frame("missing_count")
    .sort_values("missing_count", ascending=False)
    .head(10)
)

# 3. Numeric summary
print("\n🔹 Descriptive statistics (numeric columns):")
display(drivers_raw.describe(include=[np.number]).T.head(10))

# 4. Unique drivers
print("\n🔹 Unique driver categories:")
if "driver" in drivers_raw.columns:
    print(drivers_raw["driver"].unique())

# 5. Sample
print("\n🔹 Sample of raw data:")
display(drivers_raw.head(5))


✅ Loaded 'Country drivers' sheet
Shape (rows, columns): (21897, 5)

🔹 Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21897 entries, 0 to 21896
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     21897 non-null  object 
 1   threshold   21897 non-null  int64  
 2   driver      21897 non-null  object 
 3   year        21897 non-null  int64  
 4   tc_loss_ha  21897 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 855.5+ KB

🔹 Missing values per column:


Unnamed: 0,missing_count
country,0
threshold,0
driver,0
year,0
tc_loss_ha,0



🔹 Descriptive statistics (numeric columns):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
threshold,21897.0,30.0,0.0,30.0,30.0,30.0,30.0,30.0
year,21897.0,2012.406448,6.930328,2001.0,2006.0,2012.0,2018.0,2024.0
tc_loss_ha,21897.0,23513.404407,169882.759311,0.0,26.0,236.0,2375.0,7789588.0



🔹 Unique driver categories:
['Hard commodities' 'Logging' 'Other natural disturbances'
 'Permanent agriculture' 'Settlements & Infrastructure' 'Wildfire'
 'Shifting cultivation']

🔹 Sample of raw data:


Unnamed: 0,country,threshold,driver,year,tc_loss_ha
0,Afghanistan,30,Hard commodities,2014,0.0
1,Afghanistan,30,Logging,2001,3.0
2,Afghanistan,30,Logging,2002,64.0
3,Afghanistan,30,Logging,2003,73.0
4,Afghanistan,30,Logging,2004,143.0


🧩 STEP 2 – Cleaning “Country drivers”

In [21]:
# Work on a copy to keep raw data safe
drivers_clean = drivers_raw.copy()

# --- 1. Standardize column names ---
drivers_clean.columns = drivers_clean.columns.str.strip().str.lower().str.replace(" ", "_")

# --- 2. Clean country names ---
if "country" in drivers_clean.columns:
    drivers_clean["country"] = drivers_clean["country"].astype(str).str.strip().str.title()

# --- 3. Clean driver names ---
if "driver" in drivers_clean.columns:
    drivers_clean["driver"] = drivers_clean["driver"].astype(str).str.strip().str.title()

# --- 4. Replace empty strings with NA ---
drivers_clean = drivers_clean.replace(r"^\s*$", pd.NA, regex=True)

# --- 5. Drop duplicates ---
before = len(drivers_clean)
drivers_clean = drivers_clean.drop_duplicates(subset=["country", "driver", "year", "threshold"], keep="first")
after = len(drivers_clean)
print(f"Removed {before - after} duplicate rows (if any).")

# --- 6. Convert numeric columns ---
for col in drivers_clean.columns:
    try:
        drivers_clean[col] = pd.to_numeric(drivers_clean[col])
    except (ValueError, TypeError):
        pass  # text columns (country, driver) stay unchanged

# --- 7. Verify results ---
print("\n✅ After cleaning:")
display(drivers_clean.info())
display(drivers_clean.head(10))


Removed 0 duplicate rows (if any).

✅ After cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21897 entries, 0 to 21896
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     21897 non-null  object 
 1   threshold   21897 non-null  int64  
 2   driver      21897 non-null  object 
 3   year        21897 non-null  int64  
 4   tc_loss_ha  21897 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 855.5+ KB


None

Unnamed: 0,country,threshold,driver,year,tc_loss_ha
0,Afghanistan,30,Hard Commodities,2014,0.0
1,Afghanistan,30,Logging,2001,3.0
2,Afghanistan,30,Logging,2002,64.0
3,Afghanistan,30,Logging,2003,73.0
4,Afghanistan,30,Logging,2004,143.0
5,Afghanistan,30,Logging,2005,142.0
6,Afghanistan,30,Logging,2006,102.0
7,Afghanistan,30,Logging,2007,182.0
8,Afghanistan,30,Logging,2008,67.0
9,Afghanistan,30,Logging,2009,33.0


🔄 STEP 3 – Pivot “Country drivers”
🧠 Why pivot?

- Currently, each row represents one (country, threshold, driver, year) combination.
- We want to convert it to a wide format with one column per driver.
- That way, each (country, threshold, year) combination is a single record with hectares of loss per driver.
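
To make the reshaping concrete, here is a minimal sketch on a made-up long-format frame that borrows the column names of drivers_clean (the rows are illustrative, not taken from the GFW data). It also counts how many cells fill_value=0 creates, since a filled zero means "no record for this driver", which is not necessarily the same as "zero hectares lost".

```python
import pandas as pd

# Toy long-format frame with the same columns as drivers_clean (values are made up).
long_df = pd.DataFrame({
    "country":    ["A", "A", "A", "B"],
    "threshold":  [30, 30, 30, 30],
    "driver":     ["Logging", "Wildfire", "Logging", "Logging"],
    "year":       [2001, 2001, 2002, 2001],
    "tc_loss_ha": [3.0, 1.0, 64.0, 10.0],
})

# One row per (country, threshold, year), one column per driver.
wide_df = (
    long_df
    .pivot_table(
        index=["country", "threshold", "year"],
        columns="driver",
        values="tc_loss_ha",
        aggfunc="sum",
        fill_value=0,
    )
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "driver" axis label

# Cells that exist in the wide table but had no row in the long table
# (this count assumes the long table has no duplicate key rows).
n_keys = len(wide_df)
n_drivers = long_df["driver"].nunique()
n_filled = n_keys * n_drivers - len(long_df)

print(wide_df)
print("cells filled with 0:", n_filled)
```

In this toy case two of the six driver cells are filled in, because country A in 2002 and country B in 2001 have no Wildfire row.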

In [22]:
import re

# --- Pivot drivers to wide format ---
drivers_pivot = (
    drivers_clean
    .pivot_table(
        index=["country", "threshold", "year"],
        columns="driver",
        values="tc_loss_ha",
        aggfunc="sum",
        fill_value=0
    )
    .reset_index()
)

# --- Clean column names (make them lowercase, replace spaces/special chars with underscores) ---
drivers_pivot.columns = [
    re.sub(r"[^0-9A-Za-z_]+", "_", str(c)).lower().strip("_")
    for c in drivers_pivot.columns
]

print("✅ Pivoted shape:", drivers_pivot.shape)
print("Columns:", drivers_pivot.columns[:10].tolist())

print("\n📊 Preview of pivoted data:")
display(drivers_pivot.head(10))


✅ Pivoted shape: (3625, 10)
Columns: ['country', 'threshold', 'year', 'hard_commodities', 'logging', 'other_natural_disturbances', 'permanent_agriculture', 'settlements_infrastructure', 'shifting_cultivation', 'wildfire']

📊 Preview of pivoted data:


Unnamed: 0,country,threshold,year,hard_commodities,logging,other_natural_disturbances,permanent_agriculture,settlements_infrastructure,shifting_cultivation,wildfire
0,Afghanistan,30,2001,0.0,3.0,2.0,63.0,0.0,0.0,1.0
1,Afghanistan,30,2002,0.0,64.0,3.0,49.0,0.0,0.0,34.0
2,Afghanistan,30,2003,0.0,73.0,1.0,11.0,0.0,0.0,134.0
3,Afghanistan,30,2004,0.0,143.0,1.0,24.0,0.0,0.0,13.0
4,Afghanistan,30,2005,0.0,142.0,4.0,12.0,0.0,0.0,51.0
5,Afghanistan,30,2006,0.0,102.0,0.0,10.0,0.0,0.0,23.0
6,Afghanistan,30,2007,0.0,182.0,0.0,9.0,0.0,0.0,36.0
7,Afghanistan,30,2008,0.0,67.0,2.0,7.0,0.0,0.0,19.0
8,Afghanistan,30,2009,0.0,33.0,0.0,8.0,0.0,0.0,12.0
9,Afghanistan,30,2010,0.0,67.0,0.0,4.0,0.0,0.0,6.0


💾 STEP 4 – Save Processed “Country drivers” Data

In [23]:
import os

# Ensure processed folder exists
os.makedirs("../data/processed", exist_ok=True)

# Define output path
drivers_out_path = "../data/processed/country_drivers_processed.csv"

# Save the pivoted (wide) dataset
drivers_pivot.to_csv(drivers_out_path, index=False)

print(f"💾 Saved processed dataset to: {drivers_out_path}")
print(f"Rows: {len(drivers_pivot):,} | Columns: {len(drivers_pivot.columns)}")

# Quick verification: reload to confirm structure
verify_drivers = pd.read_csv(drivers_out_path)
print("\n✅ Reloaded successfully! Sample below:")
display(verify_drivers.head(5))

💾 Saved processed dataset to: ../data/processed/country_drivers_processed.csv
Rows: 3,625 | Columns: 10

✅ Reloaded successfully! Sample below:


Unnamed: 0,country,threshold,year,hard_commodities,logging,other_natural_disturbances,permanent_agriculture,settlements_infrastructure,shifting_cultivation,wildfire
0,Afghanistan,30,2001,0.0,3.0,2.0,63.0,0.0,0.0,1.0
1,Afghanistan,30,2002,0.0,64.0,3.0,49.0,0.0,0.0,34.0
2,Afghanistan,30,2003,0.0,73.0,1.0,11.0,0.0,0.0,134.0
3,Afghanistan,30,2004,0.0,143.0,1.0,24.0,0.0,0.0,13.0
4,Afghanistan,30,2005,0.0,142.0,4.0,12.0,0.0,0.0,51.0
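
Looking at the first rows confirms that the file opens, but not that every value survived the round trip. A stricter check is sketched below on a made-up frame and a temporary file (both are stand-ins, not project artifacts): it compares the reloaded data with the original using pandas' assert_frame_equal, which raises an AssertionError on any difference in values, column order, or dtypes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a processed frame such as drivers_pivot.
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "threshold": [30, 30],
    "year": [2001, 2001],
    "logging": [3.0, 12.5],
})

# Write to a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_processed.csv")
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if the reloaded frame differs from the original.
pd.testing.assert_frame_equal(df, reloaded)
roundtrip_ok = True
print("Round trip preserved the frame:", roundtrip_ok)
```

The same comparison could be run on the real frame and its reloaded copy; a failure there would usually point to a dtype change, such as integers read back as floats because of missing values.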


### 🌬️ 4. Processing the “Country carbon data” Sheet
Why this step is important

This dataset contains forest-related carbon fluxes per country and tree-cover threshold: gross emissions for each year, plus summary columns for gross emissions, gross removals, and net flux expressed per year (the columns ending in yr-1).
Analyzing these values allows us to quantify the climate impact of deforestation and restoration activities, directly linking forest dynamics to SDG 15 (Life on Land) and SDG 13 (Climate Action).

Because the raw sheet stores each year as a separate column (for example
gfw_forest_carbon_gross_emissions_2001__Mg_CO2e, ..._2002__Mg_CO2e, and so on),
we must first audit the structure, verify column patterns, and check for issues such as missing values or inconsistent thresholds before reshaping it into a tidy format.
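
One way to verify the column pattern is sketched below, under stated assumptions: the frame raw is a made-up miniature of the sheet, and find_year_columns is a hypothetical helper, not part of the notebook. It collects the yearly columns with a regular expression and reports any gap in the year sequence. The pattern uses the raw capitalization (Mg_CO2e); once column names are lowercased during cleaning, the pattern has to be lowercased as well.

```python
import re

import pandas as pd

# Hypothetical miniature of the raw sheet: two yearly columns plus one summary column.
raw = pd.DataFrame({
    "country": ["A", "B"],
    "umd_tree_cover_density_2000__threshold": [30, 30],
    "gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1": [10, 20],
    "gfw_forest_carbon_gross_emissions_2001__Mg_CO2e": [1.0, 2.0],
    "gfw_forest_carbon_gross_emissions_2002__Mg_CO2e": [3.0, 4.0],
})

YEAR_COL = re.compile(r"^gfw_forest_carbon_gross_emissions_(\d{4})__Mg_CO2e$")


def find_year_columns(df):
    """Return {year: column_name} for columns matching the yearly pattern."""
    found = {}
    for col in df.columns:
        match = YEAR_COL.match(col)
        if match:
            found[int(match.group(1))] = col
    return found


year_cols = find_year_columns(raw)
years = sorted(year_cols)

# Any year between the first and last that has no column of its own.
missing_years = sorted(set(range(years[0], years[-1] + 1)) - set(years))

print("yearly columns:", len(year_cols), "| range:", years[0], "-", years[-1])
print("gaps in the year sequence:", missing_years)
```

The summary column ending in yr-1 is deliberately not matched, because it has no four-digit year between the two underscore groups.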

STEP 1 – Audit raw data

In [24]:
# --- Load the raw sheet ---
carbon_raw = global_forest_watch_excel_file.parse("Country carbon data")

print("✅ Loaded 'Country carbon data' sheet")
print("Shape (rows, columns):", carbon_raw.shape)

# 1️⃣ Basic info
print("\n🔹 Basic Info:")
carbon_raw.info()

# 2️⃣ Missing values
print("\n🔹 Missing values per column:")
display(
    carbon_raw.isna()
    .sum()
    .to_frame("missing_count")
    .sort_values("missing_count", ascending=False)
    .head(10)
)

# 3️⃣ Numeric summary
print("\n🔹 Descriptive statistics (numeric columns):")
display(carbon_raw.describe(include=[np.number]).T.head(10))

# 4️⃣ Check for any threshold or country naming issues
print("\n🔹 Unique threshold values (if available):")
if "umd_tree_cover_density_2000__threshold" in carbon_raw.columns:
    print(carbon_raw["umd_tree_cover_density_2000__threshold"].unique())

print("\n🔹 Sample preview:")
display(carbon_raw.head(5))


✅ Loaded 'Country carbon data' sheet
Shape (rows, columns): (498, 32)

🔹 Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 32 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   country                                            498 non-null    object 
 1   umd_tree_cover_density_2000__threshold             498 non-null    int64  
 2   umd_tree_cover_extent_2000__ha                     498 non-null    int64  
 3   gfw_aboveground_carbon_stocks_2000__Mg_C           498 non-null    int64  
 4   avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1  498 non-null    int64  
 5   gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1    498 non-null    int64  
 6   gfw_forest_carbon_gross_removals__Mg_CO2_yr-1      498 non-null    int64  
 7   gfw_forest_carbon_net_flux__Mg_CO2e_yr-1           498 non-null    int64  
 8   

Unnamed: 0,missing_count
country,0
umd_tree_cover_density_2000__threshold,0
gfw_forest_carbon_gross_emissions_2023__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2022__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2021__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2020__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2019__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2018__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2017__Mg_CO2e,0
gfw_forest_carbon_gross_emissions_2016__Mg_CO2e,0



🔹 Descriptive statistics (numeric columns):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
umd_tree_cover_density_2000__threshold,498.0,51.66667,18.42745,30.0,30.0,50.0,75.0,75.0
umd_tree_cover_extent_2000__ha,498.0,19312690.0,68877660.0,0.0,142635.25,2085215.5,9780736.0,761226500.0
gfw_aboveground_carbon_stocks_2000__Mg_C,498.0,1587066000.0,5357827000.0,0.0,9929741.5,156212670.5,687517700.0,55568870000.0
avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1,498.0,336.757,178.1686,0.0,210.0,320.0,465.5,770.0
gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1,498.0,48395840.0,169628200.0,0.0,193746.25,3567744.0,28028210.0,1589982000.0
gfw_forest_carbon_gross_removals__Mg_CO2_yr-1,498.0,70271930.0,218925300.0,0.0,1035810.5,11793397.5,49826750.0,1998131000.0
gfw_forest_carbon_net_flux__Mg_CO2e_yr-1,498.0,-21876080.0,104447900.0,-1373904000.0,-19619187.75,-3203279.5,-52108.75,399571800.0
gfw_forest_carbon_gross_emissions_2001__Mg_CO2e,498.0,27625620.0,114349400.0,0.0,122655.5,2095027.0,12642390.0,1221139000.0
gfw_forest_carbon_gross_emissions_2002__Mg_CO2e,498.0,34082610.0,153154000.0,0.0,93762.5,1902934.0,12462890.0,1655445000.0
gfw_forest_carbon_gross_emissions_2003__Mg_CO2e,498.0,28245190.0,130635000.0,0.0,101137.25,1336894.5,9653048.0,1514613000.0



🔹 Unique threshold values (if available):
[30 50 75]

🔹 Sample preview:


Unnamed: 0,country,umd_tree_cover_density_2000__threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__Mg_C,avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1,gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1,gfw_forest_carbon_gross_removals__Mg_CO2_yr-1,gfw_forest_carbon_net_flux__Mg_CO2e_yr-1,gfw_forest_carbon_gross_emissions_2001__Mg_CO2e,gfw_forest_carbon_gross_emissions_2002__Mg_CO2e,...,gfw_forest_carbon_gross_emissions_2015__Mg_CO2e,gfw_forest_carbon_gross_emissions_2016__Mg_CO2e,gfw_forest_carbon_gross_emissions_2017__Mg_CO2e,gfw_forest_carbon_gross_emissions_2018__Mg_CO2e,gfw_forest_carbon_gross_emissions_2019__Mg_CO2e,gfw_forest_carbon_gross_emissions_2020__Mg_CO2e,gfw_forest_carbon_gross_emissions_2021__Mg_CO2e,gfw_forest_carbon_gross_emissions_2022__Mg_CO2e,gfw_forest_carbon_gross_emissions_2023__Mg_CO2e,gfw_forest_carbon_gross_emissions_2024__Mg_CO2e
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,41762.0,...,0.0,0.0,0.0,4893.0,3708.0,11409.0,6772.0,1913.0,3435.0,2636.0
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,32691.0,...,0.0,0.0,0.0,3920.0,3343.0,10321.0,6045.0,1664.0,2530.0,2106.0
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,15308.0,...,0.0,0.0,0.0,1962.0,1743.0,6451.0,2477.0,668.0,1857.0,1512.0
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,348556.0,...,120041.0,334094.0,448993.0,724335.0,429556.0,427420.0,506228.0,649874.0,948758.0,308121.0
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,338279.0,...,113553.0,304691.0,403366.0,669011.0,404887.0,391385.0,449937.0,591504.0,895138.0,275104.0


🌬️ STEP 2 – Cleaning the “Country carbon data” Sheet

In [25]:
# Work on a copy to keep raw safe
carbon_clean = carbon_raw.copy()

# --- 1️⃣ Standardize column names ---
carbon_clean.columns = carbon_clean.columns.str.strip().str.lower().str.replace(" ", "_")

# --- 2️⃣ Rename threshold column (for consistency) ---
if "umd_tree_cover_density_2000__threshold" in carbon_clean.columns:
    carbon_clean = carbon_clean.rename(columns={"umd_tree_cover_density_2000__threshold": "threshold"})
    print("Renamed 'umd_tree_cover_density_2000__threshold' → 'threshold'")

# --- 3️⃣ Clean country names ---
if "country" in carbon_clean.columns:
    carbon_clean["country"] = carbon_clean["country"].astype(str).str.strip().str.title()

# --- 4️⃣ Replace empty strings with NaN ---
carbon_clean = carbon_clean.replace(r"^\s*$", pd.NA, regex=True)

# --- 5️⃣ Drop duplicates ---
before = len(carbon_clean)
carbon_clean = carbon_clean.drop_duplicates(subset=["country", "threshold"], keep="first")
after = len(carbon_clean)
print(f"Removed {before - after} duplicate rows (if any).")

# --- 6️⃣ Convert numeric-like columns ---
for col in carbon_clean.columns:
    try:
        carbon_clean[col] = pd.to_numeric(carbon_clean[col])
    except (ValueError, TypeError):
        pass

# --- 7️⃣ Verify results ---
print("\n✅ After cleaning:")
carbon_clean.info()  # info() prints its summary and returns None, so it is not wrapped in display()
display(carbon_clean.head(5))


Renamed 'umd_tree_cover_density_2000__threshold' → 'threshold'
Removed 0 duplicate rows (if any).

✅ After cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 32 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   country                                            498 non-null    object 
 1   threshold                                          498 non-null    int64  
 2   umd_tree_cover_extent_2000__ha                     498 non-null    int64  
 3   gfw_aboveground_carbon_stocks_2000__mg_c           498 non-null    int64  
 4   avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1  498 non-null    int64  
 5   gfw_forest_carbon_gross_emissions__mg_co2e_yr-1    498 non-null    int64  
 6   gfw_forest_carbon_gross_removals__mg_co2_yr-1      498 non-null    int64  
 7   gfw_forest_carbon_net_flux__mg_co2e_yr-1         

Unnamed: 0,country,threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__mg_c,avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1,gfw_forest_carbon_gross_emissions__mg_co2e_yr-1,gfw_forest_carbon_gross_removals__mg_co2_yr-1,gfw_forest_carbon_net_flux__mg_co2e_yr-1,gfw_forest_carbon_gross_emissions_2001__mg_co2e,gfw_forest_carbon_gross_emissions_2002__mg_co2e,...,gfw_forest_carbon_gross_emissions_2015__mg_co2e,gfw_forest_carbon_gross_emissions_2016__mg_co2e,gfw_forest_carbon_gross_emissions_2017__mg_co2e,gfw_forest_carbon_gross_emissions_2018__mg_co2e,gfw_forest_carbon_gross_emissions_2019__mg_co2e,gfw_forest_carbon_gross_emissions_2020__mg_co2e,gfw_forest_carbon_gross_emissions_2021__mg_co2e,gfw_forest_carbon_gross_emissions_2022__mg_co2e,gfw_forest_carbon_gross_emissions_2023__mg_co2e,gfw_forest_carbon_gross_emissions_2024__mg_co2e
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,41762.0,...,0.0,0.0,0.0,4893.0,3708.0,11409.0,6772.0,1913.0,3435.0,2636.0
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,32691.0,...,0.0,0.0,0.0,3920.0,3343.0,10321.0,6045.0,1664.0,2530.0,2106.0
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,15308.0,...,0.0,0.0,0.0,1962.0,1743.0,6451.0,2477.0,668.0,1857.0,1512.0
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,348556.0,...,120041.0,334094.0,448993.0,724335.0,429556.0,427420.0,506228.0,649874.0,948758.0,308121.0
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,338279.0,...,113553.0,304691.0,403366.0,669011.0,404887.0,391385.0,449937.0,591504.0,895138.0,275104.0


🌬️ STEP 3 – Transform “Country carbon data” to Long Format

🧠 Why this matters

Right now, each year’s emission data is a separate column.
By melting these columns into a single year column and a single carbon_gross_emissions_MgCO2e column, we will:

- Align this dataset with all others (same tidy format).
- Make it mergeable and easy to visualize.
- Enable time-series and regression analysis later.
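
The reshaping can be illustrated on a small made-up frame (the emissions_YYYY column names below are hypothetical shorthand, not the sheet's real names). The sketch also checks three invariants that a correct melt should satisfy: the row count equals rows times yearly columns, each key appears exactly once, and the total of all values is unchanged.

```python
import pandas as pd

# Toy wide frame: one row per (country, threshold), one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "threshold": [30, 30],
    "emissions_2001": [1.0, 2.0],
    "emissions_2002": [3.0, 4.0],
    "emissions_2003": [5.0, 6.0],
})
year_cols = [c for c in wide.columns if c.startswith("emissions_")]

# Stack the yearly columns into (metric_year, emissions) pairs.
long_df = wide.melt(
    id_vars=["country", "threshold"],
    value_vars=year_cols,
    var_name="metric_year",
    value_name="emissions",
)

# Pull the four-digit year out of the former column name.
long_df["year"] = long_df["metric_year"].str.extract(r"(\d{4})", expand=False).astype(int)
long_df = long_df.drop(columns="metric_year")

# Invariants of a lossless melt.
assert len(long_df) == len(wide) * len(year_cols)
assert not long_df.duplicated(["country", "threshold", "year"]).any()
assert long_df["emissions"].sum() == wide[year_cols].to_numpy().sum()

print(long_df.sort_values(["country", "year"]))
```

Applied to the real sheet, the first invariant predicts 498 rows × 24 yearly columns (2001 to 2024) = 11,952 rows in the tidy frame.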

In [27]:
import re

def melt_carbon_columns(df, pattern, value_name):
    """
    Melt all columns matching a yearly carbon-emission pattern into a tidy long format.
    Example: gfw_forest_carbon_gross_emissions_2001__mg_co2e → year 2001, one value per row.
    Returns df unchanged if no column matches the pattern.
    """
    # Detect all year columns using the regex pattern
    year_cols = [c for c in df.columns if re.match(pattern, c)]
    if not year_cols:
        print("❌ No matching year columns found. Check column names.")
        return df

    melted = df.melt(
        id_vars=[c for c in df.columns if c not in year_cols],
        value_vars=year_cols,
        var_name="metric_year",
        value_name=value_name
    )

    # Extract the four-digit year from the original column names (expand=False returns a Series)
    melted["year"] = melted["metric_year"].str.extract(r"(\d{4})", expand=False).astype(int)
    melted = melted.drop(columns=["metric_year"])

    return melted


# --- Apply transformation ---
carbon_tidy = melt_carbon_columns(
    carbon_clean,
    pattern=r"^gfw_forest_carbon_gross_emissions_\d{4}__mg_co2e$",
    value_name="carbon_gross_emissions_MgCO2e"
)

print("✅ Transformed shape:", carbon_tidy.shape)
print("Columns:", list(carbon_tidy.columns)[:10])

print("\n📊 Preview of tidy data (after transformation):")
display(carbon_tidy.head(10))


✅ Transformed shape: (11952, 10)
Columns: ['country', 'threshold', 'umd_tree_cover_extent_2000__ha', 'gfw_aboveground_carbon_stocks_2000__mg_c', 'avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1', 'gfw_forest_carbon_gross_emissions__mg_co2e_yr-1', 'gfw_forest_carbon_gross_removals__mg_co2_yr-1', 'gfw_forest_carbon_net_flux__mg_co2e_yr-1', 'carbon_gross_emissions_MgCO2e', 'year']

📊 Preview of tidy data (after transformation):


Unnamed: 0,country,threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__mg_c,avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1,gfw_forest_carbon_gross_emissions__mg_co2e_yr-1,gfw_forest_carbon_gross_removals__mg_co2_yr-1,gfw_forest_carbon_net_flux__mg_co2e_yr-1,carbon_gross_emissions_MgCO2e,year
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,2001
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,2001
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,2001
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,2001
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,2001
5,Albania,75,363706,28761196,298,576299,3001723,-2425424,1137609.0,2001
6,Algeria,30,1223325,64822106,313,1872312,4873094,-3000781,574332.0,2001
7,Algeria,50,895366,50658903,334,1540229,3547408,-2007180,444098.0,2001
8,Algeria,75,496534,31035068,366,952542,1988182,-1035640,250927.0,2001
9,Angola,30,55276135,2879806419,296,62402574,170616018,-108213442,39294740.0,2001


💾 STEP 4 – Save Processed “Country carbon data”

In [28]:
import os

# Ensure processed folder exists
os.makedirs("../data/processed", exist_ok=True)

# Define output path
carbon_out_path = "../data/processed/country_carbon_processed.csv"

# Save tidy dataset
carbon_tidy.to_csv(carbon_out_path, index=False)

print(f"💾 Saved processed dataset to: {carbon_out_path}")
print(f"Rows: {len(carbon_tidy):,} | Columns: {len(carbon_tidy.columns)}")

# Verify save worked correctly
verify_carbon = pd.read_csv(carbon_out_path)
print("\n✅ Reloaded successfully! Sample below:")
display(verify_carbon.head(5))


💾 Saved processed dataset to: ../data/processed/country_carbon_processed.csv
Rows: 11,952 | Columns: 10

✅ Reloaded successfully! Sample below:


Unnamed: 0,country,threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__mg_c,avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1,gfw_forest_carbon_gross_emissions__mg_co2e_yr-1,gfw_forest_carbon_gross_removals__mg_co2_yr-1,gfw_forest_carbon_net_flux__mg_co2e_yr-1,carbon_gross_emissions_MgCO2e,year
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,2001
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,2001
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,2001
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,2001
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,2001


🌍 Integrating All Country-Level Datasets
Why this step is important

Up to this point, we have individually cleaned and reshaped four key country-level datasets from Global Forest Watch:

- Tree Cover Loss – annual loss of forested area
- Primary Forest Loss – loss in humid tropical primary forests
- Drivers of Deforestation – hectares of loss by cause (fire, agriculture, etc.)
- Carbon Data – annual forest-related gross emissions, with summary columns for removals and net flux

Each dataset provides a different perspective on global forest change.
To perform meaningful Exploratory Data Analysis (EDA), predictive modeling, and visualization, we now need a single integrated dataset that combines all relevant variables per country, threshold, and year.

This section merges the four processed datasets on their common identifiers (country, threshold, and year) so that all information aligns in one master table.
The merge is an outer join: a key combination present in any one dataset is kept, and columns from datasets that lack that combination are left as NaN, so no data is lost during integration.
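
The behavior of an outer join on these three keys can be sketched on two made-up frames (the rows and the variable names loss, carbon, and merged_demo are illustrative). Two optional arguments of pandas' merge are useful as safeguards here: validate="one_to_one" raises an error if a key combination repeats on either side, and indicator=True adds a _merge column showing which side each row came from.

```python
import pandas as pd

KEYS = ["country", "threshold", "year"]

# Toy stand-ins for two processed datasets (values are made up).
loss = pd.DataFrame({
    "country": ["A", "A", "B"],
    "threshold": [0, 30, 30],
    "year": [2001, 2001, 2001],
    "tree_cover_loss_ha": [103.0, 50.0, 7.0],
})
carbon = pd.DataFrame({
    "country": ["A", "C"],
    "threshold": [30, 30],
    "year": [2001, 2001],
    "carbon_gross_emissions_MgCO2e": [27986.0, 5.0],
})

merged_demo = loss.merge(
    carbon,
    on=KEYS,
    how="outer",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds _merge: left_only / right_only / both
)

print(merged_demo)
print(merged_demo["_merge"].value_counts())
```

Only the (A, 30, 2001) key exists in both frames, so it is the single row with values from both sides; the other three rows carry NaN in the columns of the dataset that lacks them. This is the same pattern that produces the NaN-heavy threshold-0 rows when one dataset has thresholds that another does not.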

🧩 STEP 1 – Merge All Country-Level Datasets

In [29]:
import pandas as pd
import os

# Load processed files
base_path = "../data/processed"

tcl = pd.read_csv(f"{base_path}/country_tree_cover_loss_processed.csv")
pl = pd.read_csv(f"{base_path}/country_primary_loss_processed.csv")
drv = pd.read_csv(f"{base_path}/country_drivers_processed.csv")
crb = pd.read_csv(f"{base_path}/country_carbon_processed.csv")

print("✅ Loaded all processed datasets:")
for name, df in zip(["Tree Cover Loss", "Primary Loss", "Drivers", "Carbon"], [tcl, pl, drv, crb]):
    print(f"{name:<15}: {df.shape}")

# --- Merge progressively on country + threshold + year (outer joins keep every key combination) ---
merged = (
    tcl
    .merge(pl, on=["country", "threshold", "year"], how="outer")
    .merge(drv, on=["country", "threshold", "year"], how="outer")
    .merge(crb, on=["country", "threshold", "year"], how="outer")
)

print("\n✅ Merged dataset shape:", merged.shape)
print("Columns:", merged.columns[:12].tolist(), "...")
display(merged.head(10))


✅ Loaded all processed datasets:
Tree Cover Loss: (31872, 8)
Primary Loss   : (1748, 5)
Drivers        : (3625, 10)
Carbon         : (11952, 10)

✅ Merged dataset shape: (31873, 24)
Columns: ['country', 'threshold', 'area_ha_x', 'extent_2000_ha', 'extent_2010_ha', 'gain_2000-2012_ha', 'tree_cover_loss_ha', 'year', 'area_ha_y', 'primary_forest_loss_ha', 'hard_commodities', 'logging'] ...


Unnamed: 0,country,threshold,area_ha_x,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tree_cover_loss_ha,year,area_ha_y,primary_forest_loss_ha,...,settlements_infrastructure,shifting_cultivation,wildfire,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__mg_c,avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1,gfw_forest_carbon_gross_emissions__mg_co2e_yr-1,gfw_forest_carbon_gross_removals__mg_co2_yr-1,gfw_forest_carbon_net_flux__mg_co2e_yr-1,carbon_gross_emissions_MgCO2e
0,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,103.0,2001,,,...,,,,,,,,,,
1,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,214.0,2002,,,...,,,,,,,,,,
2,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,267.0,2003,,,...,,,,,,,,,,
3,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,226.0,2004,,,...,,,,,,,,,,
4,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,268.0,2005,,,...,,,,,,,,,,
5,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,172.0,2006,,,...,,,,,,,,,,
6,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,274.0,2007,,,...,,,,,,,,,,
7,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,123.0,2008,,,...,,,,,,,,,,
8,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,92.0,2009,,,...,,,,,,,,,,
9,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,109.0,2010,,,...,,,,,,,,,,


🧩 STEP 2 – Evaluate Missing Values

In [30]:
# --- 1️⃣ Overall missingness ratio per column ---
missing_summary = (
    merged.isna()
    .mean()
    .sort_values(ascending=False)
    .to_frame("missing_ratio")
)
display(missing_summary.head(15))

# --- 2️⃣ Country-level missing summary ---
# Group the boolean missing-value mask by country rather than calling groupby().apply();
# this computes the same per-country average without the pandas warning about grouping columns.
country_missing = merged.isna().groupby(merged["country"]).mean().mean(axis=1)
print("\n🔹 Average missingness by country (first 10):")
display(country_missing.sort_values(ascending=False).head(10))

# --- 3️⃣ Check if certain columns are systematically missing ---
cols_all_missing = [c for c in merged.columns if merged[c].isna().all()]
if cols_all_missing:
    print("\n⚠️ Columns completely empty:", cols_all_missing)
else:
    print("\n✅ No completely empty columns detected.")


Unnamed: 0,missing_ratio
area_ha_y,0.945157
primary_forest_loss_ha,0.945157
other_natural_disturbances,0.886267
shifting_cultivation,0.886267
hard_commodities,0.886267
logging,0.886267
wildfire,0.886267
permanent_agriculture,0.886267
settlements_infrastructure,0.886267
gfw_forest_carbon_net_flux__mg_co2e_yr-1,0.625012
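
These ratios follow almost directly from the dataset sizes printed in STEP 1. As a quick arithmetic check, and assuming that every row of a source dataset lands in exactly one row of the merged table, the expected missing ratio for a dataset's columns is one minus its row count divided by the merged row count:

```python
# Row counts as printed when the processed files were loaded and merged.
merged_rows = 31873
source_rows = {"primary_loss": 1748, "drivers": 3625, "carbon": 11952}

# Share of merged rows that have no matching row in each source dataset.
expected_missing = {
    name: round(1 - rows / merged_rows, 6)
    for name, rows in source_rows.items()
}
print(expected_missing)
# {'primary_loss': 0.945157, 'drivers': 0.886267, 'carbon': 0.625012}
```

The three values agree with the table, which indicates that the high missingness reflects coverage differences between the datasets (fewer countries, thresholds, or years) rather than gaps inside any one of them.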



🔹 Average missingness by country (first 10):


country
S√£O Tom√© And Pr√≠Ncipe    0.583333
Saint-Barthélemy            0.557292
Iceland                     0.557292
Faroe Islands               0.557292
São Tomé And Príncipe       0.557292
Saudi Arabia                0.557292
Oman                        0.557292
Djibouti                    0.557292
United Arab Emirates        0.557292
Yemen                       0.557292
dtype: float64


✅ No completely empty columns detected.
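
The country table above lists São Tomé and Príncipe twice, once under a garbled spelling, which suggests that one source dataset carries a mis-encoded country name that fails to match on the country key. The sketch below shows one possible repair, under an assumption inferred only from the character pattern: that the name is UTF-8 text which was decoded as Mac Roman somewhere upstream. The helper repair_mac_roman_mojibake is hypothetical and would need to be checked against the real data before use.

```python
import pandas as pd


def repair_mac_roman_mojibake(name):
    """Undo one round of 'UTF-8 bytes read as Mac Roman', if that is what happened.

    Names that do not round-trip cleanly are returned unchanged.
    """
    if not isinstance(name, str):
        return name
    try:
        fixed = name.encode("mac_roman").decode("utf-8")
    except UnicodeError:
        # Not representable in Mac Roman, or the bytes are not valid UTF-8:
        # the name was not garbled in this particular way.
        return name
    # Re-apply title casing, mirroring the notebook's country-name cleaning step.
    return fixed.title() if fixed != name else name


countries = pd.Series([
    "S√£O Tom√© And Pr√≠Ncipe",   # garbled spelling
    "São Tomé And Príncipe",      # correct spelling
    "Iceland",
    "Saint-Barthélemy",
])
repaired = countries.map(repair_mac_roman_mojibake)

print(repaired.tolist())
print("distinct names before:", countries.nunique(), "| after:", repaired.nunique())
```

Correctly encoded names are left alone because their accented letters do not form valid UTF-8 when re-encoded as Mac Roman. If this repair were applied before merging, the two spellings would share one key and the stray extra row in the merged table would be expected to disappear.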


💾 STEP 3 – Save Final Merged Dataset

In [34]:
# Ensure processed folder exists
import os
os.makedirs("../data/processed", exist_ok=True)

# Define output path
merged_out_path = "../data/processed/merged_clean_data.csv"

# Save merged dataset
merged.to_csv(merged_out_path, index=False)

print(f"💾 Final merged dataset saved to: {merged_out_path}")
print(f"Rows: {len(merged):,} | Columns: {len(merged.columns)}")

# Quick verification
verify_merged = pd.read_csv(merged_out_path)
print("\n✅ Reloaded successfully! Sample below:")
display(verify_merged.head(10))


💾 Final merged dataset saved to: ../data/processed/merged_clean_data.csv
Rows: 31,873 | Columns: 24

✅ Reloaded successfully! Sample below:


Unnamed: 0,country,threshold,area_ha_x,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tree_cover_loss_ha,year,area_ha_y,primary_forest_loss_ha,...,settlements_infrastructure,shifting_cultivation,wildfire,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__mg_c,avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1,gfw_forest_carbon_gross_emissions__mg_co2e_yr-1,gfw_forest_carbon_gross_removals__mg_co2_yr-1,gfw_forest_carbon_net_flux__mg_co2e_yr-1,carbon_gross_emissions_MgCO2e
0,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,103.0,2001,,,...,,,,,,,,,,
1,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,214.0,2002,,,...,,,,,,,,,,
2,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,267.0,2003,,,...,,,,,,,,,,
3,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,226.0,2004,,,...,,,,,,,,,,
4,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,268.0,2005,,,...,,,,,,,,,,
5,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,172.0,2006,,,...,,,,,,,,,,
6,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,274.0,2007,,,...,,,,,,,,,,
7,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,123.0,2008,,,...,,,,,,,,,,
8,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,92.0,2009,,,...,,,,,,,,,,
9,Afghanistan,0,64383655.0,64383655.0,64383655.0,10738.0,109.0,2010,,,...,,,,,,,,,,
