# Data Download and Documentation
## Global Forest Watch Data - 2001-2024

**Purpose:** Data Wrangling
1. Preprocessing: convert each Excel sheet to a CSV file, because CSV integrates more easily with data science libraries such as pandas, NumPy, scikit-learn, and TensorFlow
2. Check for missing values
3. Check data formats (column data types)
4. Check the shape of each dataset

**Data Source:** Global Forest Watch (globalforestwatch.org)

**Date:** 2025-10-22

In [2]:
# Import libraries
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy.stats import zscore

# Configure visual style
sns.set(style="whitegrid", palette="Set2")
plt.rcParams["figure.figsize"] = (12, 6)


print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [4]:

# Specify the Excel workbook and the output directory
excel_file = Path('../data/raw/global_forest_watch_raw_data.xlsx')
output_dir = Path('../data/processed/')
output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist yet

# Open the workbook once so every sheet is parsed from the same handle
excel_data = pd.ExcelFile(excel_file)

# Convert each sheet
for sheet_name in excel_data.sheet_names:
    # Read sheet
    df = excel_data.parse(sheet_name)

    # Save as CSV
    csv_filename = f"{excel_file.stem}_{sheet_name}.csv"
    csv_path = output_dir / csv_filename
    df.to_csv(csv_path, index=False)

    print(f"✓ Converted: {sheet_name} -> {csv_filename}")

✓ Converted: Read_Me -> global_forest_watch_raw_data_Read_Me.csv
✓ Converted: Country tree cover loss -> global_forest_watch_raw_data_Country tree cover loss.csv
✓ Converted: Country primary loss -> global_forest_watch_raw_data_Country primary loss.csv
✓ Converted: Country drivers -> global_forest_watch_raw_data_Country drivers.csv
✓ Converted: Country carbon data -> global_forest_watch_raw_data_Country carbon data.csv
✓ Converted: Subnational 1 tree cover loss -> global_forest_watch_raw_data_Subnational 1 tree cover loss.csv
✓ Converted: Subnational 1 primary loss -> global_forest_watch_raw_data_Subnational 1 primary loss.csv
✓ Converted: Subnational 1 drivers -> global_forest_watch_raw_data_Subnational 1 drivers.csv
✓ Converted: Subnational 1 carbon data -> global_forest_watch_raw_data_Subnational 1 carbon data.csv
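
The sheet names above contain spaces and mixed case, and both carry over into the CSV filenames. A minimal sketch of one way to normalize them before saving, assuming lowercase snake_case names are wanted. `safe_sheet_name` is a hypothetical helper and is not part of the original notebook:

```python
import re

def safe_sheet_name(sheet_name: str) -> str:
    """Lowercase a sheet name and collapse runs of non-alphanumerics into '_'."""
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", sheet_name.strip().lower())
    return slug.strip("_")

# Sheet names taken from the conversion output above
for name in ["Read_Me", "Country tree cover loss", "Subnational 1 drivers"]:
    print(f"{name!r} -> {safe_sheet_name(name)!r}")
```

In the conversion loop, the helper could replace `sheet_name` when building `csv_filename`, which avoids having to quote paths with spaces later on.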


## 1. Data Overview

### Available Datasets:

**Country Level:**
1. **Country Tree Cover Loss** - Hectares of tree cover loss (2001-2024)
2. **Country Primary Loss** - Humid tropical primary forest loss (2002-2024)
3. **Country Carbon Data** - Biomass stocks, emissions, removals, net flux
4. **Country Drivers** - Tree cover loss by dominant driver (2001-2024)

**Subnational Level (State/Province):**
1. **Subnational Tree Cover Loss** - First level administrative data
2. **Subnational Primary Loss** - State/province level primary forest loss
3. **Subnational Carbon Data** - Regional carbon emissions and removals
4. **Subnational Drivers** - Regional drivers of deforestation

## 3. Load and Inspect Each Dataset

### 3.1 Country Tree Cover Loss

In [5]:
# Load country tree cover loss data
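# NOTE: Update the path to match your actual file. The conversion cell above writes its CSVs
# to ../data/processed/, named <workbook stem>_<sheet name>.csv.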
country_loss_file = "../data/raw/global_forest_watch_country_tree_cover_loss.csv"

df_country_loss = pd.read_csv(country_loss_file)
print("✓ Country Tree Cover Loss Data Loaded")
print(f"Shape: {df_country_loss.shape}")
print(f"\nColumns: {list(df_country_loss.columns)}")

print(f"\nFirst few rows:")
display(df_country_loss.head())

print(f"\nLast few rows:")
display(df_country_loss.tail())
    
print(f"\nData Types:")
print(df_country_loss.dtypes)
    
print(f"\nMissing Values:")
print(df_country_loss.isnull().sum())

✓ Country Tree Cover Loss Data Loaded
Shape: (1328, 30)

Columns: ['country', 'threshold', 'area_ha', 'extent_2000_ha', 'extent_2010_ha', 'gain_2000-2012_ha', 'tc_loss_ha_2001', 'tc_loss_ha_2002', 'tc_loss_ha_2003', 'tc_loss_ha_2004', 'tc_loss_ha_2005', 'tc_loss_ha_2006', 'tc_loss_ha_2007', 'tc_loss_ha_2008', 'tc_loss_ha_2009', 'tc_loss_ha_2010', 'tc_loss_ha_2011', 'tc_loss_ha_2012', 'tc_loss_ha_2013', 'tc_loss_ha_2014', 'tc_loss_ha_2015', 'tc_loss_ha_2016', 'tc_loss_ha_2017', 'tc_loss_ha_2018', 'tc_loss_ha_2019', 'tc_loss_ha_2020', 'tc_loss_ha_2021', 'tc_loss_ha_2022', 'tc_loss_ha_2023', 'tc_loss_ha_2024']

First few rows:


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Afghanistan,0,64383655,64383655,64383655,10738,103,214,267,226,...,0,0,0,31,25,46,47,16,133,223
1,Afghanistan,10,64383655,432070,126231,10738,92,190,254,207,...,0,0,0,28,19,40,37,9,32,32
2,Afghanistan,15,64383655,302629,106852,10738,91,186,248,205,...,0,0,0,28,19,39,32,7,23,17
3,Afghanistan,20,64383655,284330,105718,10738,89,181,245,203,...,0,0,0,28,18,39,32,7,22,16
4,Afghanistan,25,64383655,254843,72384,10738,89,180,244,202,...,0,0,0,27,18,38,27,6,21,14



Last few rows:


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
1323,Åland,20,150613,109897,108508,2583,398,278,221,736,...,568,675,737,621,2364,675,1357,1173,1098,955
1324,Åland,25,150613,108748,104703,2583,397,278,221,736,...,567,674,736,620,2363,675,1356,1170,1096,954
1325,Åland,30,150613,107739,103087,2583,397,277,221,736,...,567,672,735,620,2361,674,1355,1167,1093,951
1326,Åland,50,150613,87773,85300,2583,389,274,216,728,...,547,641,704,604,2331,663,1327,1098,1045,899
1327,Åland,75,150613,60890,60461,2583,358,256,205,694,...,484,551,606,538,2180,608,1193,926,910,756



Data Types:
country              object
threshold             int64
area_ha               int64
extent_2000_ha        int64
extent_2010_ha        int64
gain_2000-2012_ha     int64
tc_loss_ha_2001       int64
tc_loss_ha_2002       int64
tc_loss_ha_2003       int64
tc_loss_ha_2004       int64
tc_loss_ha_2005       int64
tc_loss_ha_2006       int64
tc_loss_ha_2007       int64
tc_loss_ha_2008       int64
tc_loss_ha_2009       int64
tc_loss_ha_2010       int64
tc_loss_ha_2011       int64
tc_loss_ha_2012       int64
tc_loss_ha_2013       int64
tc_loss_ha_2014       int64
tc_loss_ha_2015       int64
tc_loss_ha_2016       int64
tc_loss_ha_2017       int64
tc_loss_ha_2018       int64
tc_loss_ha_2019       int64
tc_loss_ha_2020       int64
tc_loss_ha_2021       int64
tc_loss_ha_2022       int64
tc_loss_ha_2023       int64
tc_loss_ha_2024       int64
dtype: object

Missing Values:
country              0
threshold            0
area_ha              0
extent_2000_ha       0
extent_2010_ha       0
g
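
The table above stores one row per country and tree cover threshold, with one `tc_loss_ha_<year>` column per year. For time series work it is often easier to keep a single threshold and reshape the yearly columns into rows. The sketch below does this on a toy frame built from two of the Åland rows shown above. The choice of threshold 30 is an assumption for illustration, not something the source prescribes:

```python
import pandas as pd

# Toy subset shaped like df_country_loss (values copied from the Åland rows above)
wide = pd.DataFrame({
    "country": ["Åland", "Åland"],
    "threshold": [30, 50],
    "tc_loss_ha_2001": [397, 389],
    "tc_loss_ha_2002": [277, 274],
})

# Keep a single threshold so each country appears once (30 is an assumed choice)
subset = wide[wide["threshold"] == 30]

# Reshape the yearly columns into (country, threshold, year, tc_loss_ha) rows
year_cols = [c for c in subset.columns if c.startswith("tc_loss_ha_")]
long = subset.melt(
    id_vars=["country", "threshold"],
    value_vars=year_cols,
    var_name="year",
    value_name="tc_loss_ha",
)

# Turn "tc_loss_ha_2001" into the integer 2001
long["year"] = long["year"].str.replace("tc_loss_ha_", "", regex=False).astype(int)
print(long)
```

The resulting long layout has the same column pattern as the drivers table (`country`, `threshold`, `year`, `tc_loss_ha`), which makes the two easier to compare.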

### 3.2 Country Primary Loss

In [6]:
# Load country primary loss data
primary_loss_file = "../data/raw/global_forest_watch_country_primary_loss.csv"
df_primary_loss = pd.read_csv(primary_loss_file)
print("✓ Country Primary Loss Data Loaded")
print(f"Shape: {df_primary_loss.shape}")
print(f"\nColumns: {list(df_primary_loss.columns)}")


print(f"\nFirst few rows:")
display(df_primary_loss.head())

print(f"\nLast few rows:")
display(df_primary_loss.tail())

print(f"\nData Types:")
print(df_primary_loss.dtypes)

print(f"\nMissing Values:")
print(df_primary_loss.isnull().sum())


✓ Country Primary Loss Data Loaded
Shape: (76, 26)

Columns: ['country', 'threshold', 'area__ha', 'tc_loss_ha_2002', 'tc_loss_ha_2003', 'tc_loss_ha_2004', 'tc_loss_ha_2005', 'tc_loss_ha_2006', 'tc_loss_ha_2007', 'tc_loss_ha_2008', 'tc_loss_ha_2009', 'tc_loss_ha_2010', 'tc_loss_ha_2011', 'tc_loss_ha_2012', 'tc_loss_ha_2013', 'tc_loss_ha_2014', 'tc_loss_ha_2015', 'tc_loss_ha_2016', 'tc_loss_ha_2017', 'tc_loss_ha_2018', 'tc_loss_ha_2019', 'tc_loss_ha_2020', 'tc_loss_ha_2021', 'tc_loss_ha_2022', 'tc_loss_ha_2023', 'tc_loss_ha_2024']

First few rows:


Unnamed: 0,country,threshold,area__ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Angola,30,2458061,3499,2963,2354,3110,1400,8060,2699,...,8998,12040,11166,13507,9995,8895,24326,15576,17627,13660
1,Argentina,30,4418724,9318,14459,28090,31429,24095,18687,47067,...,10547,15247,17202,9496,8983,20847,11921,21388,11473,12103
2,Australia,30,13977,0,0,0,0,25,0,0,...,5,0,0,0,5,0,0,0,0,0
3,Bangladesh,30,101114,619,266,347,306,677,369,240,...,205,345,414,358,387,459,308,307,743,467
4,Belize,30,1165487,5570,2993,2108,3206,1899,4140,3632,...,6606,11511,6616,4781,8772,16087,4560,4033,11667,21137



Last few rows:


Unnamed: 0,country,threshold,area__ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
71,Venezuela,30,38654064,11320,20761,15891,15561,14243,26102,19852,...,15534,84699,43748,30125,58767,53669,22680,19802,27488,74428
72,Vietnam,30,6792860,11028,9521,21113,27779,20141,22289,34412,...,38580,68841,45060,33410,30898,32629,29028,21676,16455,16924
73,"Virgin Islands, U.S.",30,2762,0,6,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
74,Zambia,30,331125,62,133,431,242,277,538,305,...,2336,2860,3360,2424,2555,2165,2911,2422,2675,3913
75,Zimbabwe,30,7612,14,8,20,27,25,16,8,...,62,50,275,27,50,37,42,55,40,32



Data Types:
country            object
threshold           int64
area__ha            int64
tc_loss_ha_2002     int64
tc_loss_ha_2003     int64
tc_loss_ha_2004     int64
tc_loss_ha_2005     int64
tc_loss_ha_2006     int64
tc_loss_ha_2007     int64
tc_loss_ha_2008     int64
tc_loss_ha_2009     int64
tc_loss_ha_2010     int64
tc_loss_ha_2011     int64
tc_loss_ha_2012     int64
tc_loss_ha_2013     int64
tc_loss_ha_2014     int64
tc_loss_ha_2015     int64
tc_loss_ha_2016     int64
tc_loss_ha_2017     int64
tc_loss_ha_2018     int64
tc_loss_ha_2019     int64
tc_loss_ha_2020     int64
tc_loss_ha_2021     int64
tc_loss_ha_2022     int64
tc_loss_ha_2023     int64
tc_loss_ha_2024     int64
dtype: object

Missing Values:
country            0
threshold          0
area__ha           0
tc_loss_ha_2002    0
tc_loss_ha_2003    0
tc_loss_ha_2004    0
tc_loss_ha_2005    0
tc_loss_ha_2006    0
tc_loss_ha_2007    0
tc_loss_ha_2008    0
tc_loss_ha_2009    0
tc_loss_ha_2010    0
tc_loss_ha_2011    0
tc_loss
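
The primary loss table names its area column `area__ha` (double underscore), while the tree cover loss table uses `area_ha`, and its yearly columns start at 2002 rather than 2001. A small sketch of how the headers could be aligned and compared before combining the two tables. The column lists are abbreviated from the outputs above, and the rename itself is a suggestion rather than a step from the original notebook:

```python
import pandas as pd

# Abbreviated headers from the two "Columns:" outputs above
loss_cols = ["country", "threshold", "area_ha", "tc_loss_ha_2001", "tc_loss_ha_2002"]
primary_cols = ["country", "threshold", "area__ha", "tc_loss_ha_2002"]

# Empty frame standing in for df_primary_loss; only the header matters here
df_primary = pd.DataFrame(columns=primary_cols)
df_primary = df_primary.rename(columns={"area__ha": "area_ha"})

# Compare the two headers after the rename
only_in_loss = sorted(set(loss_cols) - set(df_primary.columns))
only_in_primary = sorted(set(df_primary.columns) - set(loss_cols))
print("Only in tree cover loss:", only_in_loss)
print("Only in primary loss:", only_in_primary)
```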

### 3.3 Country Carbon Data

In [10]:
# Load country carbon data
carbon_file = "../data/raw/global_forest_watch_country_carbon_data.csv"

df_carbon = pd.read_csv(carbon_file)
print("✓ Country Carbon Data Loaded")
print(f"Shape: {df_carbon.shape}")
print(f"\nColumns: {list(df_carbon.columns)}")

print(f"\nFirst few rows:")
display(df_carbon.head())

print(f"\nLast few rows:")
display(df_carbon.tail())

print(f"\nData Types:")
print(df_carbon.dtypes)

print(f"\nMissing Values:")
print(df_carbon.isnull().sum())

✓ Country Carbon Data Loaded
Shape: (498, 32)

Columns: ['country', 'umd_tree_cover_density_2000__threshold', 'umd_tree_cover_extent_2000__ha', 'gfw_aboveground_carbon_stocks_2000__Mg_C', 'avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1', 'gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1', 'gfw_forest_carbon_gross_removals__Mg_CO2_yr-1', 'gfw_forest_carbon_net_flux__Mg_CO2e_yr-1', 'gfw_forest_carbon_gross_emissions_2001__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2002__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2003__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2004__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2005__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2006__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2007__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2008__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2009__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2010__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2011__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2012__Mg_CO2e', 'gf

Unnamed: 0,country,umd_tree_cover_density_2000__threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__Mg_C,avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1,gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1,gfw_forest_carbon_gross_removals__Mg_CO2_yr-1,gfw_forest_carbon_net_flux__Mg_CO2e_yr-1,gfw_forest_carbon_gross_emissions_2001__Mg_CO2e,gfw_forest_carbon_gross_emissions_2002__Mg_CO2e,...,gfw_forest_carbon_gross_emissions_2015__Mg_CO2e,gfw_forest_carbon_gross_emissions_2016__Mg_CO2e,gfw_forest_carbon_gross_emissions_2017__Mg_CO2e,gfw_forest_carbon_gross_emissions_2018__Mg_CO2e,gfw_forest_carbon_gross_emissions_2019__Mg_CO2e,gfw_forest_carbon_gross_emissions_2020__Mg_CO2e,gfw_forest_carbon_gross_emissions_2021__Mg_CO2e,gfw_forest_carbon_gross_emissions_2022__Mg_CO2e,gfw_forest_carbon_gross_emissions_2023__Mg_CO2e,gfw_forest_carbon_gross_emissions_2024__Mg_CO2e
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,41762.0,...,0.0,0.0,0.0,4893.0,3708.0,11409.0,6772.0,1913.0,3435.0,2636.0
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,32691.0,...,0.0,0.0,0.0,3920.0,3343.0,10321.0,6045.0,1664.0,2530.0,2106.0
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,15308.0,...,0.0,0.0,0.0,1962.0,1743.0,6451.0,2477.0,668.0,1857.0,1512.0
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,348556.0,...,120041.0,334094.0,448993.0,724335.0,429556.0,427420.0,506228.0,649874.0,948758.0,308121.0
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,338279.0,...,113553.0,304691.0,403366.0,669011.0,404887.0,391385.0,449937.0,591504.0,895138.0,275104.0



Last few rows:


Unnamed: 0,country,umd_tree_cover_density_2000__threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__Mg_C,avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1,gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1,gfw_forest_carbon_gross_removals__Mg_CO2_yr-1,gfw_forest_carbon_net_flux__Mg_CO2e_yr-1,gfw_forest_carbon_gross_emissions_2001__Mg_CO2e,gfw_forest_carbon_gross_emissions_2002__Mg_CO2e,...,gfw_forest_carbon_gross_emissions_2015__Mg_CO2e,gfw_forest_carbon_gross_emissions_2016__Mg_CO2e,gfw_forest_carbon_gross_emissions_2017__Mg_CO2e,gfw_forest_carbon_gross_emissions_2018__Mg_CO2e,gfw_forest_carbon_gross_emissions_2019__Mg_CO2e,gfw_forest_carbon_gross_emissions_2020__Mg_CO2e,gfw_forest_carbon_gross_emissions_2021__Mg_CO2e,gfw_forest_carbon_gross_emissions_2022__Mg_CO2e,gfw_forest_carbon_gross_emissions_2023__Mg_CO2e,gfw_forest_carbon_gross_emissions_2024__Mg_CO2e
493,Zimbabwe,50,292639,26987292,455,3463393,2010161,1453234,2202552.0,1852703.0,...,5543212.0,4287548.0,11037857.0,2313022.0,4481712.0,2463018.0,2921155.0,3646542.0,2163212.0,2915925.0
494,Zimbabwe,75,79815,9489605,637,1755946,1138565,617383,846377.0,547785.0,...,3282965.0,2294634.0,5274064.0,1197189.0,1989239.0,1311713.0,1691041.0,2068057.0,1219051.0,1827562.0
495,Åland,30,107739,3702622,63,199822,548179,-348357,76145.0,55690.0,...,146921.0,181156.0,204553.0,181186.0,720707.0,206427.0,422422.0,365660.0,354122.0,311241.0
496,Åland,50,87773,3178873,68,196055,461029,-264974,75336.0,55291.0,...,143102.0,174662.0,198103.0,177686.0,714011.0,204098.0,415748.0,348926.0,342552.0,298015.0
497,Åland,75,60890,2416128,76,181482,325900,-144418,71162.0,52714.0,...,129805.0,154716.0,175621.0,161701.0,676792.0,190477.0,381685.0,304561.0,306616.0,258040.0



Data Types:
country                                               object
umd_tree_cover_density_2000__threshold                 int64
umd_tree_cover_extent_2000__ha                         int64
gfw_aboveground_carbon_stocks_2000__Mg_C               int64
avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1      int64
gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1        int64
gfw_forest_carbon_gross_removals__Mg_CO2_yr-1          int64
gfw_forest_carbon_net_flux__Mg_CO2e_yr-1               int64
gfw_forest_carbon_gross_emissions_2001__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2002__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2003__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2004__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2005__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2006__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2007__Mg_CO2e      float64
gfw_forest_carbon_gross_emissions_2008__Mg_CO2e      float64
gfw_forest_
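
The yearly emissions columns embed the year in the middle of a long name, next to annual-average columns (ending in `_yr-1`) that carry no year. A hedged sketch of how the yearly columns could be picked out with a regular expression. The pattern is written against the column names listed above and would need adjusting if the naming scheme differs:

```python
import re

# A few column names copied from the "Columns:" output above
columns = [
    "country",
    "umd_tree_cover_density_2000__threshold",
    "gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1",
    "gfw_forest_carbon_gross_emissions_2001__Mg_CO2e",
    "gfw_forest_carbon_gross_emissions_2002__Mg_CO2e",
]

# Match only the per-year gross emissions columns and capture the four-digit year
pattern = re.compile(r"^gfw_forest_carbon_gross_emissions_(\d{4})__Mg_CO2e$")

year_columns = {}
for col in columns:
    match = pattern.match(col)
    if match:
        year_columns[col] = int(match.group(1))

print(year_columns)
```

The mapping can then drive a rename or a wide-to-long reshape without hard-coding each year.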

### 3.4 Country Drivers Data

In [11]:
# Load country drivers data
drivers_file = "../data/raw/global_forest_watch_data_country_drivers.csv"

df_drivers = pd.read_csv(drivers_file)
print("✓ Country Drivers Data Loaded")
print(f"Shape: {df_drivers.shape}")
print(f"\nColumns: {list(df_drivers.columns)}")
print(f"\nFirst few rows:")
display(df_drivers.head())

print(f"\nLast few rows:")
display(df_drivers.tail())

print(f"\nData Types:")
print(df_drivers.dtypes)

print(f"\nMissing Values:")
print(df_drivers.isnull().sum())

✓ Country Drivers Data Loaded
Shape: (21897, 5)

Columns: ['country', 'threshold', 'driver', 'year', 'tc_loss_ha']

First few rows:


Unnamed: 0,country,threshold,driver,year,tc_loss_ha
0,Afghanistan,30,Hard commodities,2014,0.0
1,Afghanistan,30,Logging,2001,3.0
2,Afghanistan,30,Logging,2002,64.0
3,Afghanistan,30,Logging,2003,73.0
4,Afghanistan,30,Logging,2004,143.0



Last few rows:


Unnamed: 0,country,threshold,driver,year,tc_loss_ha
21892,Zimbabwe,30,Wildfire,2020,128.0
21893,Zimbabwe,30,Wildfire,2021,89.0
21894,Zimbabwe,30,Wildfire,2022,272.0
21895,Zimbabwe,30,Wildfire,2023,143.0
21896,Zimbabwe,30,Wildfire,2024,172.0



Data Types:
country        object
threshold       int64
driver         object
year            int64
tc_loss_ha    float64
dtype: object

Missing Values:
country       0
threshold     0
driver        0
year          0
tc_loss_ha    0
dtype: int64
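
Because the drivers table is already in long format (one row per country, driver, and year), totals per driver are a single `groupby` away. A minimal sketch on a toy frame whose four rows are copied from the head and tail outputs above:

```python
import pandas as pd

# Toy subset shaped like df_drivers (rows copied from the outputs above)
drivers = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "threshold": [30, 30, 30, 30],
    "driver": ["Logging", "Logging", "Wildfire", "Wildfire"],
    "year": [2001, 2002, 2020, 2021],
    "tc_loss_ha": [3.0, 64.0, 128.0, 89.0],
})

# Sum tree cover loss per country and driver, largest totals first
totals = (
    drivers.groupby(["country", "driver"], as_index=False)["tc_loss_ha"]
    .sum()
    .sort_values("tc_loss_ha", ascending=False)
)
print(totals)
```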


## 4. Data Summary Statistics

In [17]:
# Create a summary of all datasets
datasets = {
    'Country Tree Cover Loss': df_country_loss if 'df_country_loss' in locals() else None,
    'Country Primary Loss': df_primary_loss if 'df_primary_loss' in locals() else None,
    'Country Carbon Data': df_carbon if 'df_carbon' in locals() else None,
    'Country Drivers': df_drivers if 'df_drivers' in locals() else None
}

print("\n" + "="*80)
print("DATA SUMMARY")
print("="*80)

for name, df in datasets.items():
    if df is not None:
        print(f"\n{name}:")
        print(f"  Rows: {df.shape[0]:,}")
        print(f"  Columns: {df.shape[1]}")
        print(f"  Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    else:
        print(f"\n{name}: Not loaded")


print("\n" + "="*80)
print("DATA STATISTICS")
print("="*80)

for name, df in datasets.items():
    if df is not None:
        print(f"\n{name}:")
        print(f"\n{df.describe(include='all')}")
    else:
        print(f"\n{name}: Not loaded")


DATA SUMMARY

Country Tree Cover Loss:
  Rows: 1,328
  Columns: 30
  Memory: 0.37 MB

Country Primary Loss:
  Rows: 76
  Columns: 26
  Memory: 0.02 MB

Country Carbon Data:
  Rows: 498
  Columns: 32
  Memory: 0.15 MB

Country Drivers:
  Rows: 21,897
  Columns: 5
  Memory: 3.11 MB

DATA STATISTICS

Country Tree Cover Loss:

            country    threshold       area_ha  extent_2000_ha  \
count          1328  1328.000000  1.328000e+03    1.328000e+03   
unique          166          NaN           NaN             NaN   
top     Afghanistan          NaN           NaN             NaN   
freq              8          NaN           NaN             NaN   
mean            NaN    28.125000  7.814805e+07    3.038020e+07   
std             NaN    22.499791  2.015752e+08    1.056704e+08   
min             NaN     0.000000  2.094000e+03    0.000000e+00   
25%             NaN    13.750000  5.117777e+06    5.480255e+05   
50%             NaN    22.500000  2.022587e+07    3.622986e+06   
75%           

## 5. Quick Data Quality Check

In [17]:
def data_quality_check(df, dataset_name):
    """Perform basic data quality checks"""
    print(f"\n{'='*60}")
    print(f"Data Quality Check: {dataset_name}")
    print(f"{'='*60}")
    
    # Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(f"\n⚠ Missing Values:")
        print(missing[missing > 0])
    else:
        print("\n✓ No missing values")
    
    # Duplicate rows
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        print(f"\n⚠ Duplicate rows: {duplicates}")
    else:
        print("✓ No duplicate rows")
    
    # Data types
    print(f"\nData Types:")
    print(df.dtypes.value_counts())
    
    return True

# Run quality checks on loaded datasets
for name, df in datasets.items():
    if df is not None:
        data_quality_check(df, name)


Data Quality Check: Country Tree Cover Loss

✓ No missing values
✓ No duplicate rows

Data Types:
int64     29
object     1
Name: count, dtype: int64

Data Quality Check: Country Primary Loss

✓ No missing values
✓ No duplicate rows

Data Types:
int64     25
object     1
Name: count, dtype: int64

Data Quality Check: Country Carbon Data

✓ No missing values
✓ No duplicate rows

Data Types:
float64    24
int64       7
object      1
Name: count, dtype: int64

Data Quality Check: Country Drivers

✓ No missing values
✓ No duplicate rows

Data Types:
object     2
int64      2
float64    1
Name: count, dtype: int64
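
`df.duplicated()` only flags rows that are identical in every column, so two rows for the same country and threshold with different values would pass the check above. A sketch of a stricter check on key columns. `duplicate_keys` is a hypothetical helper, the assumption that `(country, threshold)` identifies a row applies to the wide tables only, and the second Angola row is made up to trigger the check:

```python
import pandas as pd

def duplicate_keys(df: pd.DataFrame, key_cols: list) -> pd.DataFrame:
    """Return every row whose combination of key columns occurs more than once."""
    return df[df.duplicated(subset=key_cols, keep=False)]

# Toy frame: the second Angola row is invented to demonstrate a key collision
toy = pd.DataFrame({
    "country": ["Angola", "Angola", "Belize"],
    "threshold": [30, 30, 30],
    "tc_loss_ha_2002": [3499, 3500, 5570],
})

print("Fully identical rows:", toy.duplicated().sum())
dupes = duplicate_keys(toy, ["country", "threshold"])
print("Rows sharing a key:", len(dupes))
```

For the drivers table the key would need to include `driver` and `year` as well.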


## 6. Save Data Inventory

Create a metadata file that documents every dataset loaded above.

In [18]:
import json
from datetime import datetime

# Create data inventory
inventory = {
    'created_at': datetime.now().isoformat(),
    'data_source': 'Global Forest Watch',
    'data_url': 'https://www.globalforestwatch.org/',
    'time_period': '2001-2024',
    'datasets': {}
}

for name, df in datasets.items():
    if df is not None:
        inventory['datasets'][name] = {
            'rows': int(df.shape[0]),
            'columns': int(df.shape[1]),
            'column_names': list(df.columns),
            'memory_mb': float(df.memory_usage(deep=True).sum() / 1024**2)
        }

# Save inventory
inventory_file = "../data/data_inventory.json"
with open(inventory_file, 'w') as f:
    json.dump(inventory, f, indent=2)

print(f"✓ Data inventory saved to: {inventory_file}")
print(f"\nInventory:")
print(json.dumps(inventory, indent=2))

✓ Data inventory saved to: ../data/data_inventory.json

Inventory:
{
  "created_at": "2025-10-22T20:52:05.891608",
  "data_source": "Global Forest Watch",
  "data_url": "https://www.globalforestwatch.org/",
  "time_period": "2001-2024",
  "datasets": {
    "Country Tree Cover Loss": {
      "rows": 1328,
      "columns": 30,
      "column_names": [
        "country",
        "threshold",
        "area_ha",
        "extent_2000_ha",
        "extent_2010_ha",
        "gain_2000-2012_ha",
        "tc_loss_ha_2001",
        "tc_loss_ha_2002",
        "tc_loss_ha_2003",
        "tc_loss_ha_2004",
        "tc_loss_ha_2005",
        "tc_loss_ha_2006",
        "tc_loss_ha_2007",
        "tc_loss_ha_2008",
        "tc_loss_ha_2009",
        "tc_loss_ha_2010",
        "tc_loss_ha_2011",
        "tc_loss_ha_2012",
        "tc_loss_ha_2013",
        "tc_loss_ha_2014",
        "tc_loss_ha_2015",
        "tc_loss_ha_2016",
        "tc_loss_ha_2017",
        "tc_loss_ha_2018",
        "tc_loss_ha_201