In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
filterwarnings('ignore')

RAW_PATH = "../data/raw/global_forest_watch_raw_data.xlsx"

print("="*80)
print("LOADING RAW DATA FROM EXCEL FILE")
print("="*80)

excel_file = pd.ExcelFile(RAW_PATH)
print(f"✅ Excel file loaded: {RAW_PATH}")
print(f"\nAvailable sheets: {len(excel_file.sheet_names)}")
print("\nSheet names:")
for i, sheet in enumerate(excel_file.sheet_names, 1):
    print(f"  {i}. {sheet}")

country_sheets = [s for s in excel_file.sheet_names if s.startswith('Country')]
print(f"\n📊 Country-level sheets to explore: {len(country_sheets)}")
for sheet in country_sheets:
    print(f"  - {sheet}")

LOADING RAW DATA FROM EXCEL FILE
✅ Excel file loaded: ../data/raw/global_forest_watch_raw_data.xlsx

Available sheets: 9

Sheet names:
  1. Read_Me
  2. Country tree cover loss
  3. Country primary loss
  4. Country drivers
  5. Country carbon data
  6. Subnational 1 tree cover loss
  7. Subnational 1 primary loss
  8. Subnational 1 drivers
  9. Subnational 1 carbon data

📊 Country-level sheets to explore: 4
  - Country tree cover loss
  - Country primary loss
  - Country drivers
  - Country carbon data


**Findings:** 

From loading the Excel file, we discovered:
- The file contains **9 total sheets**
- **4 country-level sheets** that we'll explore:
  1. Country tree cover loss
  2. Country primary loss
  3. Country drivers
  4. Country carbon data
- **4 subnational-level sheets** (which we'll skip for now, as we're focusing on country-level analysis)
- **1 Read_Me sheet** (documentation)

This gives us a clear picture of what data is available and helps us focus our exploration on the four country-level datasets that will be used for our analysis. The presence of both country and subnational data suggests the dataset has multiple levels of granularity, but for this project, we'll concentrate on the country-level aggregation.


In [None]:
---

## Part 1: Data Exploration - Raw Data Sheets

### Goal: Understand the raw data structure, quality, and characteristics from each Excel sheet

We will explore each country-level sheet separately to understand:
- Data structure and format
- Column names and types
- Data quality issues
- Relationships between sheets
- Patterns that will inform data preparation
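
The same checks are repeated for every sheet, so they could be gathered into one small helper. The following is a minimal sketch, not part of the original notebook: the function name `summarize_sheet`, its `year_prefix` parameter, and the toy DataFrame are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def summarize_sheet(df: pd.DataFrame, year_prefix: str) -> dict:
    """Collect the basic structure and quality facts for one sheet."""
    # Year columns are identified by an explicit prefix rather than a substring.
    year_cols = [c for c in df.columns if str(c).startswith(year_prefix)]
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "n_missing": int(df.isnull().sum().sum()),
        "n_year_cols": len(year_cols),
    }


# Toy data standing in for a real sheet (values are illustrative only).
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "threshold": [0, 30, 30],
    "extent_2000_ha": [100, 80, 50],
    "tc_loss_ha_2001": [1, 2, 3],
    "tc_loss_ha_2002": [4, 5, None],
})

summary = summarize_sheet(toy, year_prefix="tc_loss_ha_")
print(summary)
```

With the real workbook, the helper would be called once per sheet, for example `summarize_sheet(excel_file.parse("Country tree cover loss"), "tc_loss_ha_")`.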


## Step 2: Exploring Country Tree Cover Loss Data


In [2]:
df_tree_cover_loss = excel_file.parse("Country tree cover loss")

print("="*80)
print("COUNTRY TREE COVER LOSS - RAW DATA")
print("="*80)
print(f"Shape: {df_tree_cover_loss.shape[0]:,} rows × {df_tree_cover_loss.shape[1]} columns")
print(f"\nColumn names ({len(df_tree_cover_loss.columns)}):")
for i, col in enumerate(df_tree_cover_loss.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "="*80)
print("First 5 rows:")
print("="*80)
display(df_tree_cover_loss.head())

print("\n" + "="*80)
print("Data Types:")
print("="*80)
print(df_tree_cover_loss.dtypes)

print("\n" + "="*80)
print("Basic Statistics:")
print("="*80)
display(df_tree_cover_loss.describe())


COUNTRY TREE COVER LOSS - RAW DATA
Shape: 1,328 rows × 30 columns

Column names (30):
   1. country
   2. threshold
   3. area_ha
   4. extent_2000_ha
   5. extent_2010_ha
   6. gain_2000-2012_ha
   7. tc_loss_ha_2001
   8. tc_loss_ha_2002
   9. tc_loss_ha_2003
  10. tc_loss_ha_2004
  11. tc_loss_ha_2005
  12. tc_loss_ha_2006
  13. tc_loss_ha_2007
  14. tc_loss_ha_2008
  15. tc_loss_ha_2009
  16. tc_loss_ha_2010
  17. tc_loss_ha_2011
  18. tc_loss_ha_2012
  19. tc_loss_ha_2013
  20. tc_loss_ha_2014
  21. tc_loss_ha_2015
  22. tc_loss_ha_2016
  23. tc_loss_ha_2017
  24. tc_loss_ha_2018
  25. tc_loss_ha_2019
  26. tc_loss_ha_2020
  27. tc_loss_ha_2021
  28. tc_loss_ha_2022
  29. tc_loss_ha_2023
  30. tc_loss_ha_2024

First 5 rows:


Unnamed: 0,country,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Afghanistan,0,64383655,64383655,64383655,10738,103,214,267,226,...,0,0,0,31,25,46,47,16,133,223
1,Afghanistan,10,64383655,432070,126231,10738,92,190,254,207,...,0,0,0,28,19,40,37,9,32,32
2,Afghanistan,15,64383655,302629,106852,10738,91,186,248,205,...,0,0,0,28,19,39,32,7,23,17
3,Afghanistan,20,64383655,284330,105718,10738,89,181,245,203,...,0,0,0,28,18,39,32,7,22,16
4,Afghanistan,25,64383655,254843,72384,10738,89,180,244,202,...,0,0,0,27,18,38,27,6,21,14



Data Types:
country              object
threshold             int64
area_ha               int64
extent_2000_ha        int64
extent_2010_ha        int64
gain_2000-2012_ha     int64
tc_loss_ha_2001       int64
tc_loss_ha_2002       int64
tc_loss_ha_2003       int64
tc_loss_ha_2004       int64
tc_loss_ha_2005       int64
tc_loss_ha_2006       int64
tc_loss_ha_2007       int64
tc_loss_ha_2008       int64
tc_loss_ha_2009       int64
tc_loss_ha_2010       int64
tc_loss_ha_2011       int64
tc_loss_ha_2012       int64
tc_loss_ha_2013       int64
tc_loss_ha_2014       int64
tc_loss_ha_2015       int64
tc_loss_ha_2016       int64
tc_loss_ha_2017       int64
tc_loss_ha_2018       int64
tc_loss_ha_2019       int64
tc_loss_ha_2020       int64
tc_loss_ha_2021       int64
tc_loss_ha_2022       int64
tc_loss_ha_2023       int64
tc_loss_ha_2024       int64
dtype: object

Basic Statistics:


Unnamed: 0,threshold,area_ha,extent_2000_ha,extent_2010_ha,gain_2000-2012_ha,tc_loss_ha_2001,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
count,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,...,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0
mean,28.125,78148050.0,30380200.0,29943470.0,786730.8,79065.76,96442.43,84788.51,116282.2,106803.6,...,115358.0,175681.5,174534.6,146977.3,143857.2,155330.5,149812.1,135673.1,168680.9,177610.7
std,22.499791,201575200.0,105670400.0,104738200.0,3417373.0,312454.8,411666.9,376891.8,494717.7,423154.5,...,390674.0,645335.3,607919.8,545900.9,458810.6,579976.9,613902.0,490937.7,750234.1,692486.9
min,0.0,2094.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.75,5117777.0,548025.5,541270.0,13832.0,514.5,393.25,312.0,532.75,548.25,...,323.0,618.75,727.25,582.5,537.75,657.5,527.5,386.0,828.75,673.75
50%,22.5,20225870.0,3622986.0,3499126.0,94359.0,6940.5,5172.0,3940.5,6147.5,7207.5,...,7497.5,14212.5,16314.0,11565.0,11082.0,11954.5,9935.0,9190.5,13963.0,10951.0
75%,35.0,62019970.0,18319770.0,18198860.0,388240.0,31545.25,32422.25,28489.25,38602.75,40362.5,...,51400.25,106330.2,97937.25,76692.5,76894.25,80744.0,70723.75,65272.0,76778.0,81784.0
max,75.0,1689455000.0,1689455000.0,1689455000.0,37220540.0,2933201.0,3715945.0,3489258.0,4133606.0,3675951.0,...,2925679.0,6407238.0,6143920.0,6621833.0,4847068.0,8217252.0,8559449.0,5126743.0,10176020.0,7421840.0


**Findings:**

From exploring the Country Tree Cover Loss sheet, we discovered:

1. **Data Structure:**
   - The sheet contains **1,328 rows and 30 columns**
   - The data is in **wide format** with separate columns for each year (`tc_loss_ha_2001` through `tc_loss_ha_2024`)
   - This represents 24 years of data (2001-2024) plus 6 metadata columns
   - The wide format will need to be converted to long format during data preparation

2. **Key Columns Identified:**
   - `country`: Country names (object/string type)
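   - `threshold`: Canopy density threshold values (int64 type). The first rows show 0, 10, 15, 20, and 25, and the summary statistics (min 0, max 75, mean 28.125) are consistent with eight equally frequent values: 0, 10, 15, 20, 25, 30, 50, 75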
   - `area_ha`: Total area in hectares (int64)
   - `extent_2000_ha`: Tree cover extent in year 2000 (int64)
   - `extent_2010_ha`: Tree cover extent in year 2010 (int64)
   - `gain_2000-2012_ha`: Tree cover gain between 2000-2012 (int64)
   - **24 year-based columns**: `tc_loss_ha_2001` through `tc_loss_ha_2024` representing annual tree cover loss

3. **Data Types:**
   - All numeric columns are stored as **int64** (integers), which is appropriate for hectare measurements
   - Country names are stored as **object** (strings)
   - No float types, which simplifies data handling and avoids precision issues
   - All data types are consistent and appropriate for the data content

4. **Statistical Summary:**
   - **Threshold**: Mean of 28.125 (min 0, max 75), indicating that each country appears at several threshold values
   - **Area**: Mean of ~78 million hectares per row; because each country repeats once per threshold, this is a mean over country-threshold rows rather than over countries
   - **Extent 2000**: Mean of ~30.4 million hectares of tree cover in 2000
   - **Extent 2010**: Mean of ~29.9 million hectares of tree cover in 2010 (slight decrease)
   - Standard deviations are several times larger than the means, and medians sit far below the means, so the distributions are strongly right-skewed and dominated by a few very large countries
   - Year columns (2001-2024) contain loss values that will need to be unpivoted
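
The threshold statistics reported by `describe()` can be checked against a candidate set of threshold values. The sketch below assumes the set 0, 10, 15, 20, 25, 30, 50, 75 with every value equally frequent; that set is an assumption inferred from the first rows and the min/max, not something the output states directly.

```python
import numpy as np

# Assumption: candidate threshold values, each appearing equally often.
candidate_thresholds = np.array([0, 10, 15, 20, 25, 30, 50, 75])

mean = candidate_thresholds.mean()
q25, q50, q75 = np.percentile(candidate_thresholds, [25, 50, 75])

# 1,328 rows spread evenly over the candidate thresholds.
rows_per_threshold = 1328 / len(candidate_thresholds)

print(f"mean={mean}, quartiles=({q25}, {q50}, {q75})")
print(f"rows per threshold: {rows_per_threshold}")
```

The mean (28.125) and quartiles (13.75, 22.5, 35.0) match the `describe()` output, and 1,328 rows divide evenly into 166 rows per threshold. On the real data, `df_tree_cover_loss['threshold'].value_counts()` would confirm or refute the assumption directly.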

**Implication for Data Preparation:** 
- We'll need to reshape this data from wide to long format, extracting years from column names (`tc_loss_ha_2001` → year: 2001, value: [loss amount])
- The 24 year columns will become 24 rows per country-threshold combination
- After reshaping, we expect 1,328 × 24 = 31,872 rows (one per country-threshold-year combination)
- All numeric columns are already integers, so no type conversion needed
- The metadata columns (area_ha, extent_2000_ha, extent_2010_ha, gain_2000-2012_ha) will need to be preserved during reshaping
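
The wide-to-long reshape described above could look like the following sketch. It uses `DataFrame.melt` on a two-row excerpt of the values shown in the output; the output column name `tree_cover_loss_ha` and the choice of which metadata columns to keep are assumptions made for illustration.

```python
import pandas as pd

# Two rows and two year columns copied from the head() output above.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "threshold": [0, 10],
    "area_ha": [64383655, 64383655],
    "tc_loss_ha_2001": [103, 92],
    "tc_loss_ha_2002": [214, 190],
})

id_cols = ["country", "threshold", "area_ha"]  # metadata to preserve
year_cols = [c for c in wide.columns if c.startswith("tc_loss_ha_")]

# One row per country-threshold-year combination.
long = wide.melt(
    id_vars=id_cols,
    value_vars=year_cols,
    var_name="year",
    value_name="tree_cover_loss_ha",
)

# Extract the year from the column name: "tc_loss_ha_2001" -> 2001.
long["year"] = long["year"].str.replace("tc_loss_ha_", "", regex=False).astype(int)
long = long.sort_values(["country", "threshold", "year"]).reset_index(drop=True)

print(long)
```

Each input row produces one output row per year column, so the row count is always rows × year columns, regardless of the values in the cells.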


In [None]:
print("="*80)
print("MISSING VALUES ANALYSIS:")
print("="*80)
missing = df_tree_cover_loss.isnull().sum()
missing_pct = (missing / len(df_tree_cover_loss)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Missing Percentage': missing_pct.values
}).sort_values('Missing Count', ascending=False)

missing_df = missing_df[missing_df['Missing Count'] > 0]
if len(missing_df) > 0:
    display(missing_df)
else:
    print("✅ No missing values found!")

print("\n" + "="*80)
print("UNIQUE VALUES:")
print("="*80)
if 'country' in df_tree_cover_loss.columns:
    print(f"Number of countries: {df_tree_cover_loss['country'].nunique()}")
    print(f"Countries: {sorted(df_tree_cover_loss['country'].unique())}")

if 'threshold' in df_tree_cover_loss.columns:
    print(f"\nThreshold values: {sorted(df_tree_cover_loss['threshold'].unique())}")

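# Match only the annual loss columns; a substring test on '200'/'201'/'202'
# would also catch extent_2000_ha, extent_2010_ha, and gain_2000-2012_ha.
year_cols = [col for col in df_tree_cover_loss.columns if col.startswith('tc_loss_ha_')]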
print(f"\nYear columns found: {len(year_cols)}")
if year_cols:
    print(f"Year range: {year_cols[0]} to {year_cols[-1]}")


⚠️ No column found for 'tree_cover_extent_ha'. Check your dataset columns:
['country', 'threshold', 'area_ha_x', 'extent_2000_ha', 'extent_2010_ha', 'gain_2000-2012_ha', 'tree_cover_loss_ha', 'year', 'area_ha_y', 'primary_forest_loss_ha', 'hard_commodities', 'logging', 'other_natural_disturbances', 'permanent_agriculture', 'settlements_infrastructure', 'shifting_cultivation', 'wildfire', 'umd_tree_cover_extent_2000__ha', 'gfw_aboveground_carbon_stocks_2000__mg_c', 'avg_gfw_aboveground_carbon_stocks_2000__mg_c_ha-1', 'gfw_forest_carbon_gross_emissions__mg_co2e_yr-1', 'gfw_forest_carbon_gross_removals__mg_co2_yr-1', 'gfw_forest_carbon_net_flux__mg_co2e_yr-1', 'carbon_gross_emissions_MgCO2e']


## Step 4: Exploring Country Primary Loss Data


In [5]:
df_primary_loss = excel_file.parse("Country primary loss")

print("="*80)
print("COUNTRY PRIMARY LOSS - RAW DATA")
print("="*80)
print(f"Shape: {df_primary_loss.shape[0]:,} rows × {df_primary_loss.shape[1]} columns")
print(f"\nColumn names ({len(df_primary_loss.columns)}):")
for i, col in enumerate(df_primary_loss.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "="*80)
print("First 5 rows:")
print("="*80)
display(df_primary_loss.head())

print("\n" + "="*80)
print("Data Types:")
print("="*80)
print(df_primary_loss.dtypes)

print("\n" + "="*80)
print("Basic Statistics:")
print("="*80)
display(df_primary_loss.describe())

print("\n" + "="*80)
print("MISSING VALUES ANALYSIS:")
print("="*80)
missing = df_primary_loss.isnull().sum()
missing_pct = (missing / len(df_primary_loss)) * 100
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Missing Percentage': missing_pct.values
}).sort_values('Missing Count', ascending=False)

missing_df = missing_df[missing_df['Missing Count'] > 0]
if len(missing_df) > 0:
    display(missing_df)
else:
    print("✅ No missing values found!")

print("\n" + "="*80)
print("UNIQUE VALUES:")
print("="*80)
if 'country' in df_primary_loss.columns:
    print(f"Number of countries: {df_primary_loss['country'].nunique()}")
    print(f"Countries: {sorted(df_primary_loss['country'].unique())}")

if 'threshold' in df_primary_loss.columns:
    print(f"\nThreshold values: {sorted(df_primary_loss['threshold'].unique())}")

year_cols = [col for col in df_primary_loss.columns if '200' in col or '201' in col or '202' in col]
print(f"\nYear columns found: {len(year_cols)}")
if year_cols:
    print(f"Year range: {year_cols[0]} to {year_cols[-1]}")


COUNTRY PRIMARY LOSS - RAW DATA
Shape: 76 rows × 26 columns

Column names (26):
   1. country
   2. threshold
   3. area__ha
   4. tc_loss_ha_2002
   5. tc_loss_ha_2003
   6. tc_loss_ha_2004
   7. tc_loss_ha_2005
   8. tc_loss_ha_2006
   9. tc_loss_ha_2007
  10. tc_loss_ha_2008
  11. tc_loss_ha_2009
  12. tc_loss_ha_2010
  13. tc_loss_ha_2011
  14. tc_loss_ha_2012
  15. tc_loss_ha_2013
  16. tc_loss_ha_2014
  17. tc_loss_ha_2015
  18. tc_loss_ha_2016
  19. tc_loss_ha_2017
  20. tc_loss_ha_2018
  21. tc_loss_ha_2019
  22. tc_loss_ha_2020
  23. tc_loss_ha_2021
  24. tc_loss_ha_2022
  25. tc_loss_ha_2023
  26. tc_loss_ha_2024

First 5 rows:


Unnamed: 0,country,threshold,area__ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
0,Angola,30,2458061,3499,2963,2354,3110,1400,8060,2699,...,8998,12040,11166,13507,9995,8895,24326,15576,17627,13660
1,Argentina,30,4418724,9318,14459,28090,31429,24095,18687,47067,...,10547,15247,17202,9496,8983,20847,11921,21388,11473,12103
2,Australia,30,13977,0,0,0,0,25,0,0,...,5,0,0,0,5,0,0,0,0,0
3,Bangladesh,30,101114,619,266,347,306,677,369,240,...,205,345,414,358,387,459,308,307,743,467
4,Belize,30,1165487,5570,2993,2108,3206,1899,4140,3632,...,6606,11511,6616,4781,8772,16087,4560,4033,11667,21137



Data Types:
country            object
threshold           int64
area__ha            int64
tc_loss_ha_2002     int64
tc_loss_ha_2003     int64
tc_loss_ha_2004     int64
tc_loss_ha_2005     int64
tc_loss_ha_2006     int64
tc_loss_ha_2007     int64
tc_loss_ha_2008     int64
tc_loss_ha_2009     int64
tc_loss_ha_2010     int64
tc_loss_ha_2011     int64
tc_loss_ha_2012     int64
tc_loss_ha_2013     int64
tc_loss_ha_2014     int64
tc_loss_ha_2015     int64
tc_loss_ha_2016     int64
tc_loss_ha_2017     int64
tc_loss_ha_2018     int64
tc_loss_ha_2019     int64
tc_loss_ha_2020     int64
tc_loss_ha_2021     int64
tc_loss_ha_2022     int64
tc_loss_ha_2023     int64
tc_loss_ha_2024     int64
dtype: object

Basic Statistics:


Unnamed: 0,threshold,area__ha,tc_loss_ha_2002,tc_loss_ha_2003,tc_loss_ha_2004,tc_loss_ha_2005,tc_loss_ha_2006,tc_loss_ha_2007,tc_loss_ha_2008,tc_loss_ha_2009,...,tc_loss_ha_2015,tc_loss_ha_2016,tc_loss_ha_2017,tc_loss_ha_2018,tc_loss_ha_2019,tc_loss_ha_2020,tc_loss_ha_2021,tc_loss_ha_2022,tc_loss_ha_2023,tc_loss_ha_2024
count,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0,...,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0,76.0
mean,30.0,13497960.0,35009.86,32682.53,44629.41,43705.5,36953.67,38147.87,35632.92,36765.184211,...,38528.013158,80509.46,65324.87,47926.33,49310.18,55317.2,49265.11,54079.01,49111.96,88496.16
std,0.0,43001720.0,188356.5,181678.3,236723.2,215722.9,170600.6,145239.2,135475.6,116080.074035,...,126231.483498,343014.0,253695.7,168091.2,170268.9,205885.2,188559.1,214962.4,154958.8,367170.4
min,30.0,1653.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,227151.2,132.5,122.5,203.5,217.0,245.5,184.5,222.75,445.25,...,253.75,501.5,514.5,341.75,324.75,306.5,301.25,292.75,274.0,366.0
50%,30.0,1833100.0,2029.5,1582.5,2277.0,2320.0,2528.5,3063.5,3677.5,3729.5,...,3392.5,4904.0,7123.0,4061.5,5047.5,4094.0,4050.0,3952.0,4538.5,4598.0
75%,30.0,7487047.0,10132.5,10143.75,11972.75,12021.0,14184.5,18432.0,15289.25,19996.25,...,16938.25,45902.25,31105.25,27098.5,30709.75,34334.0,24449.5,23500.0,24598.25,40371.75
max,30.0,343261000.0,1621738.0,1570540.0,2016350.0,1824217.0,1415536.0,1149515.0,1075087.0,700115.0,...,828839.0,2830943.0,2134474.0,1347176.0,1361053.0,1703491.0,1546964.0,1772214.0,1136250.0,2823646.0



MISSING VALUES ANALYSIS:
✅ No missing values found!

UNIQUE VALUES:
Number of countries: 76
Countries: ['Angola', 'Argentina', 'Australia', 'Bangladesh', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Brazil', 'Brunei', 'Burundi', 'Cambodia', 'Cameroon', 'Central African Republic', 'China', 'Colombia', 'Costa Rica', 'Cuba', "Côte d'Ivoire", 'Democratic Republic of the Congo', 'Dominican Republic', 'Ecuador', 'El Salvador', 'Equatorial Guinea', 'Ethiopia', 'Fiji', 'French Guiana', 'Gabon', 'Ghana', 'Guadeloupe', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'India', 'Indonesia', 'Kenya', 'Laos', 'Liberia', 'Madagascar', 'Malawi', 'Malaysia', 'Martinique', 'Mozambique', 'Myanmar', 'México', 'Nepal', 'Nicaragua', 'Nigeria', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Republic of the Congo', 'Rwanda', 'Senegal', 'Sierra Leone', 'Solomon Islands', 'South Africa', 'South Sudan', 'Sri Lanka', 'Suriname', 'Tanzania', 'Thailand', 'Togo', 'Uganda', 'Unit

**Findings:**

From exploring the Country Primary Loss sheet, we discovered:

1. **Data Structure:**
   - The sheet contains **76 rows and 26 columns**
   - The data is in **wide format** with separate columns for each year (`tc_loss_ha_2002` through `tc_loss_ha_2024`)
   - **Important difference**: Year columns start at 2002 (not 2001), so we have 23 years of data (2002-2024) plus 3 metadata columns
   - The sheet is much smaller than tree cover loss (76 rows vs. 1,328 rows), partly because it has one row per country (a single threshold) and partly because only countries with primary forest data are included
   - The wide format will need to be converted to long format during data preparation

2. **Key Columns Identified:**
   - `country`: Country names (object/string type)
   - `threshold`: Canopy density threshold values (int64 type) - **Only threshold 30 is present** (unlike tree cover loss which has multiple thresholds)
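   - `area__ha`: Area in hectares (int64). The column name uses a double underscore (tree cover loss uses `area_ha`), and the values are far smaller than total country areas (e.g., 13,977 for Australia), so it likely holds primary forest area rather than total area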
   - **23 year-based columns**: `tc_loss_ha_2002` through `tc_loss_ha_2024` representing annual primary forest loss
   - **Missing 2001 data**: Unlike tree cover loss, primary loss data starts from 2002

3. **Data Types:**
   - All numeric columns are stored as **int64** (integers), which is appropriate for hectare measurements
   - Country names are stored as **object** (strings)
   - Data types are consistent with tree cover loss sheet, which will facilitate merging
   - No float types, simplifying data handling

4. **Statistical Summary:**
   - **Threshold**: All values are 30 (mean: 30.0, std: 0.0) - only one threshold value in this dataset
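   - **Area**: Mean `area__ha` of ~13.5 million hectares per row, well below the ~78 million mean of `area_ha` in tree cover loss. Individual values (e.g., 13,977 ha for Australia) are far smaller than the countries themselves, so this column likely measures primary forest area rather than total country area; the Read_Me sheet should confirm this
   - **Primary Loss Values**: Among the year columns displayed, mean annual loss ranges from ~33,000 hectares (2003) to ~88,000 hectares (2024)
   - Mean primary loss per row is lower than mean tree cover loss per row in the same years (e.g., ~35,000 vs. ~96,000 hectares in 2002), as expected, although the two means cover different sets of countries and thresholds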
   - The large standard deviations indicate significant variation across countries

5. **Missing Values:**
   - **No missing values found** in the dataset (excellent data quality!)
   - However, the dataset only includes 76 countries (vs. many more in tree cover loss), suggesting countries without primary forests are simply not included
   - Zero values in year columns likely represent "no primary forest loss" rather than missing data

6. **Geographic and Temporal Coverage:**
   - **76 unique countries** represented (significantly fewer than tree cover loss's 1,328 rows, which includes multiple thresholds per country)
   - Countries include major primary forest nations: Brazil, Indonesia, Democratic Republic of the Congo, Peru, Colombia, etc.
   - **Threshold values**: Only threshold 30 (single value, unlike tree cover loss)
   - **Year range**: 2002 to 2024 (23 years of data, missing 2001)
   - **Year columns**: 23 columns found

7. **Relationship to Tree Cover Loss:**
   - This dataset uses the same country identifiers as tree cover loss, but **only includes threshold 30**
   - Can be merged with tree cover loss data on country, threshold (30), and year
   - **Important validation**: Primary loss values should be ≤ tree cover loss values for the same country/year/threshold 30
   - Primary forests are a subset of total forests, so primary loss should never exceed total loss
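   - **Column name difference**: `area__ha` (double underscore) vs. `area_ha` in tree cover loss. The name needs standardization, but if the two columns measure different areas they should keep distinct names so they do not collide on merge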

**Implication for Data Preparation:**
- This sheet will need the same wide-to-long format transformation as tree cover loss
- **Important**: Year extraction must account for starting year 2002 (not 2001)
- Can be merged with tree cover loss after reshaping using country + threshold (30) + year as keys
- Should implement validation to ensure primary loss ≤ total loss during data preparation
- Column name `area__ha` needs to be standardized; if it turns out to measure primary forest area rather than total area, a distinct name (e.g., `primary_area_ha`) avoids a clash with `area_ha` from tree cover loss
- After reshaping, we expect 76 × 23 = 1,748 rows (one per country-year combination)
- The smaller dataset size (76 countries vs. many more in tree cover loss) means we'll need to decide on merge strategy (inner vs. outer join)
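
The merge and the primary ≤ total validation could be sketched as follows. The two toy frames stand in for the reshaped (long-format) sheets; the country names, the values, and the column names `tree_cover_loss_ha` and `primary_forest_loss_ha` are assumptions for illustration. A left join is used here only to show how unmatched countries surface; it is not necessarily the right strategy for the project.

```python
import pandas as pd

# Toy long-format frames (illustrative values, not real data).
total_long = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "threshold": [30, 30, 30],
    "year": [2002, 2003, 2002],
    "tree_cover_loss_ha": [500, 400, 50],
})
primary_long = pd.DataFrame({
    "country": ["X", "X"],
    "threshold": [30, 30],
    "year": [2002, 2003],
    "primary_forest_loss_ha": [120, 450],  # 450 > 400 is a deliberate violation
})

keys = ["country", "threshold", "year"]

# indicator=True adds a "_merge" column showing where each row came from.
merged = total_long.merge(primary_long, on=keys, how="left", indicator=True)

# Rows where primary loss exceeds total loss (NaN comparisons are False).
violations = merged[merged["primary_forest_loss_ha"] > merged["tree_cover_loss_ha"]]

# Rows with no primary forest record for that key.
unmatched = merged[merged["_merge"] == "left_only"]

print(violations[keys])
print(unmatched[keys])
```

Because the primary loss sheet only contains threshold 30, the total loss frame would normally be filtered to `threshold == 30` before merging.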


### 1.3 Sheet 3: Country Drivers

## Step 3: Exploring Country Drivers Data


In [11]:
df_drivers = excel_file.parse("Country drivers")

print("="*80)
print("COUNTRY DRIVERS - RAW DATA")
print("="*80)
print(f"Shape: {df_drivers.shape[0]:,} rows × {df_drivers.shape[1]} columns")
print(f"\nColumn names ({len(df_drivers.columns)}):")
for i, col in enumerate(df_drivers.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "="*80)
print("First 5 rows:")
print("="*80)
display(df_drivers.head())

driver_type_cols = [c for c in df_drivers.columns if c not in ['country', 'threshold'] and not any(str(year) in c for year in range(2001, 2025))]
print(f"\nDriver type columns: {driver_type_cols}")

year_cols = [col for col in df_drivers.columns if any(str(year) in col for year in range(2001, 2025))]
print(f"Year-based columns: {len(year_cols)}")

print("\n" + "="*80)
print("Missing Values:")
print("="*80)
missing = df_drivers.isnull().sum()
if missing.sum() > 0:
    display(pd.DataFrame({'Missing Count': missing[missing > 0]}))
else:
    print("✅ No missing values!")

COUNTRY DRIVERS - RAW DATA
Shape: 21,897 rows × 5 columns

Column names (5):
   1. country
   2. threshold
   3. driver
   4. year
   5. tc_loss_ha

First 5 rows:


Unnamed: 0,country,threshold,driver,year,tc_loss_ha
0,Afghanistan,30,Hard commodities,2014,0.0
1,Afghanistan,30,Logging,2001,3.0
2,Afghanistan,30,Logging,2002,64.0
3,Afghanistan,30,Logging,2003,73.0
4,Afghanistan,30,Logging,2004,143.0



Driver type columns: ['driver', 'year', 'tc_loss_ha']
Year-based columns: 0

Missing Values:
✅ No missing values!


**Findings:**

From exploring the Country Drivers sheet, we discovered:

1. **Data Structure:**
   - The sheet contains **21,897 rows and 5 columns**
   - **Critical difference**: The data is already in **LONG format** (unlike tree cover loss and primary loss which are in wide format)
   - This is a significant advantage as it doesn't require reshaping during data preparation
   - The structure is: one row per country-threshold-driver-year combination

2. **Key Columns Identified:**
   - `country`: Country names (object/string type)
   - `threshold`: Canopy density threshold values (int64 type)
   - `driver`: Driver category names (object/string type) - represents the cause of deforestation
   - `year`: Year values (int64 type) - already extracted as a column (not in column names)
   - `tc_loss_ha`: Tree cover loss in hectares attributed to that specific driver (float64 type)

3. **Driver Categories:**
   - The `driver` column contains categorical values representing different causes of deforestation
   - Examples from the data: "Hard commodities", "Logging"
   - These represent the main causes of deforestation tracked by Global Forest Watch
   - Each row represents loss attributed to a specific driver for a specific country-threshold-year combination

4. **Data Types:**
   - Country and driver names are stored as **object** (strings)
   - Threshold and year are stored as **int64** (integers)
   - Loss values are stored as **float64** (floats), which allows for precise measurements
   - All data types are appropriate for the data content

5. **Data Completeness:**
   - **No missing values found** in the dataset (excellent data quality!)
   - **21,897 rows** represent multiple driver types per country-threshold-year combination
   - The large number of rows (compared to other sheets) reflects the long format structure

6. **Important Structural Difference:**
   - **Unlike tree cover loss and primary loss sheets**, this data is already in long format
   - No year-based columns (0 year-based columns found)
   - Year is already a separate column, making this dataset ready for merging without reshaping
   - This suggests the data was pre-processed differently than the loss datasets

7. **Relationship to Other Sheets:**
   - Can be joined to tree cover loss data on country + threshold + year; because this sheet has one row per driver, a direct merge repeats each loss row once per driver
   - After merging, we can analyze which drivers contribute most to forest loss
   - Driver values represent a breakdown of total loss by cause, so summing drivers for a country-threshold-year should approximate (but may not exactly equal) total loss
   - Some loss may be unclassified or attributed to multiple causes

**Implication for Data Preparation:**
- **No reshaping needed** - this sheet is already in the desired long format
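- Can be merged with reshaped loss data using country + threshold + year as merge keys; merging the long format directly yields one row per driver, so total-loss values will repeat across those rows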
- May want to pivot driver column to create separate columns for each driver type (e.g., `hard_commodities_ha`, `logging_ha`) for easier analysis, or keep in long format depending on analysis needs
- The float64 data type for `tc_loss_ha` is appropriate for precise measurements
- This dataset will be easier to integrate than the loss datasets since it doesn't require wide-to-long transformation
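
Pivoting the `driver` column into one column per driver, as suggested above, could be sketched like this. The toy rows reuse the driver names seen in the output, but the country, the values, and the `drivers_total_ha` column are hypothetical.

```python
import pandas as pd

# Toy long-format driver data (illustrative values only).
drivers_long = pd.DataFrame({
    "country": ["X", "X", "X", "X"],
    "threshold": [30, 30, 30, 30],
    "driver": ["Logging", "Hard commodities", "Logging", "Hard commodities"],
    "year": [2001, 2001, 2002, 2002],
    "tc_loss_ha": [3.0, 1.0, 64.0, 6.0],
})

keys = ["country", "threshold", "year"]

# One row per country-threshold-year, one column per driver.
drivers_wide = drivers_long.pivot_table(
    index=keys,
    columns="driver",
    values="tc_loss_ha",
    aggfunc="sum",
    fill_value=0.0,  # a driver absent for a key is treated as zero loss
).reset_index()
drivers_wide.columns.name = None

# Sum across drivers, to compare later against total tree cover loss.
driver_cols = [c for c in drivers_wide.columns if c not in keys]
drivers_wide["drivers_total_ha"] = drivers_wide[driver_cols].sum(axis=1)

print(drivers_wide)
```

The wide result has one row per merge key, so joining it to the reshaped loss data is one-to-one, and `drivers_total_ha` can be compared with total loss to see how much loss is left unattributed.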


### 1.4 Sheet 4: Country Carbon Data

## Step 4: Exploring Country Carbon Data


In [7]:
df_carbon = excel_file.parse("Country carbon data")

print("="*80)
print("COUNTRY CARBON DATA - RAW DATA")
print("="*80)
print(f"Shape: {df_carbon.shape[0]:,} rows × {df_carbon.shape[1]} columns")
print(f"\nColumn names ({len(df_carbon.columns)}):")
for i, col in enumerate(df_carbon.columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "="*80)
print("First 5 rows:")
print("="*80)
display(df_carbon.head())

print("\n" + "="*80)
print("Missing Values:")
print("="*80)
missing = df_carbon.isnull().sum()
if missing.sum() > 0:
    display(pd.DataFrame({'Missing Count': missing[missing > 0]}))
else:
    print("✅ No missing values!")

carbon_metric_cols = [c for c in df_carbon.columns if 'carbon' in c.lower() or 'emission' in c.lower()]
print(f"\nCarbon-related columns: {carbon_metric_cols}")


COUNTRY CARBON DATA - RAW DATA
Shape: 498 rows × 32 columns

Column names (32):
   1. country
   2. umd_tree_cover_density_2000__threshold
   3. umd_tree_cover_extent_2000__ha
   4. gfw_aboveground_carbon_stocks_2000__Mg_C
   5. avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1
   6. gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1
   7. gfw_forest_carbon_gross_removals__Mg_CO2_yr-1
   8. gfw_forest_carbon_net_flux__Mg_CO2e_yr-1
   9. gfw_forest_carbon_gross_emissions_2001__Mg_CO2e
  10. gfw_forest_carbon_gross_emissions_2002__Mg_CO2e
  11. gfw_forest_carbon_gross_emissions_2003__Mg_CO2e
  12. gfw_forest_carbon_gross_emissions_2004__Mg_CO2e
  13. gfw_forest_carbon_gross_emissions_2005__Mg_CO2e
  14. gfw_forest_carbon_gross_emissions_2006__Mg_CO2e
  15. gfw_forest_carbon_gross_emissions_2007__Mg_CO2e
  16. gfw_forest_carbon_gross_emissions_2008__Mg_CO2e
  17. gfw_forest_carbon_gross_emissions_2009__Mg_CO2e
  18. gfw_forest_carbon_gross_emissions_2010__Mg_CO2e
  19. gfw_forest_carbon_gros

Unnamed: 0,country,umd_tree_cover_density_2000__threshold,umd_tree_cover_extent_2000__ha,gfw_aboveground_carbon_stocks_2000__Mg_C,avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1,gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1,gfw_forest_carbon_gross_removals__Mg_CO2_yr-1,gfw_forest_carbon_net_flux__Mg_CO2e_yr-1,gfw_forest_carbon_gross_emissions_2001__Mg_CO2e,gfw_forest_carbon_gross_emissions_2002__Mg_CO2e,...,gfw_forest_carbon_gross_emissions_2015__Mg_CO2e,gfw_forest_carbon_gross_emissions_2016__Mg_CO2e,gfw_forest_carbon_gross_emissions_2017__Mg_CO2e,gfw_forest_carbon_gross_emissions_2018__Mg_CO2e,gfw_forest_carbon_gross_emissions_2019__Mg_CO2e,gfw_forest_carbon_gross_emissions_2020__Mg_CO2e,gfw_forest_carbon_gross_emissions_2021__Mg_CO2e,gfw_forest_carbon_gross_emissions_2022__Mg_CO2e,gfw_forest_carbon_gross_emissions_2023__Mg_CO2e,gfw_forest_carbon_gross_emissions_2024__Mg_CO2e
0,Afghanistan,30,205771,12409398,123,15339,376800,-361461,27986.0,41762.0,...,0.0,0.0,0.0,4893.0,3708.0,11409.0,6772.0,1913.0,3435.0,2636.0
1,Afghanistan,50,148417,9765465,134,12657,275855,-263199,25603.0,32691.0,...,0.0,0.0,0.0,3920.0,3343.0,10321.0,6045.0,1664.0,2530.0,2106.0
2,Afghanistan,75,75480,5571655,150,6147,151074,-144926,15780.0,15308.0,...,0.0,0.0,0.0,1962.0,1743.0,6451.0,2477.0,668.0,1857.0,1512.0
3,Albania,30,648459,40958831,238,721806,5103589,-4381783,1417747.0,348556.0,...,120041.0,334094.0,448993.0,724335.0,429556.0,427420.0,506228.0,649874.0,948758.0,308121.0
4,Albania,50,534671,37239867,263,682919,4294627,-3611709,1358272.0,338279.0,...,113553.0,304691.0,403366.0,669011.0,404887.0,391385.0,449937.0,591504.0,895138.0,275104.0



Missing Values:
✅ No missing values!

Carbon-related columns: ['gfw_aboveground_carbon_stocks_2000__Mg_C', 'avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1', 'gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1', 'gfw_forest_carbon_gross_removals__Mg_CO2_yr-1', 'gfw_forest_carbon_net_flux__Mg_CO2e_yr-1', 'gfw_forest_carbon_gross_emissions_2001__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2002__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2003__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2004__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2005__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2006__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2007__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2008__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2009__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2010__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2011__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2012__Mg_CO2e', 'gfw_forest_carbon_gross_emissions_2013__Mg_CO2e', 'gfw_forest_carbon_gross_emissio

**Findings:**

From exploring the Country Carbon Data sheet, we discovered:

1. **Data Structure:**
   - The sheet contains **498 rows and 32 columns**
   - The data is in **wide format** with separate columns for each year (`gfw_forest_carbon_gross_emissions_2001__Mg_CO2e` through `gfw_forest_carbon_gross_emissions_2024__Mg_CO2e`)
   - This represents 24 years of annual emissions data (2001-2024) plus 8 identifier and summary columns (24 + 8 = 32)
   - The wide format will need to be converted to long format during data preparation
   - Similar structure to tree cover loss and primary loss sheets

2. **Key Columns Identified:**
   - `country`: Country names (object/string type)
   - `umd_tree_cover_density_2000__threshold`: Canopy density threshold values (int64 type) - **Note**: Column name differs from other sheets (uses `umd_tree_cover_density_2000__threshold` instead of `threshold`)
   - `umd_tree_cover_extent_2000__ha`: Tree cover extent in year 2000 (int64)
   - `gfw_aboveground_carbon_stocks_2000__Mg_C`: Total carbon stored in forests in 2000 (int64)
   - `avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1`: Average carbon density per hectare in 2000 (int64)
   - `gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1`: Annual gross carbon emissions (int64)
   - `gfw_forest_carbon_gross_removals__Mg_CO2_yr-1`: Annual gross carbon removals (int64)
   - `gfw_forest_carbon_net_flux__Mg_CO2e_yr-1`: Net carbon flux (emissions - removals) (int64)
   - **24 year-based columns**: `gfw_forest_carbon_gross_emissions_2001__Mg_CO2e` through `gfw_forest_carbon_gross_emissions_2024__Mg_CO2e` representing annual carbon emissions

3. **Carbon Metrics Available:**
   - **Carbon Stocks (2000 baseline)**: `gfw_aboveground_carbon_stocks_2000__Mg_C` - Total carbon stored in forests
   - **Average Carbon Density**: `avg_gfw_aboveground_carbon_stocks_2000__Mg_C_ha-1` - Carbon per hectare
   - **Annual Gross Emissions**: `gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1` - Total annual emissions
   - **Annual Gross Removals**: `gfw_forest_carbon_gross_removals__Mg_CO2_yr-1` - Total annual removals
   - **Net Carbon Flux**: `gfw_forest_carbon_net_flux__Mg_CO2e_yr-1` - Net change (emissions - removals)
   - **Yearly Emissions (2001-2024)**: 24 columns with annual emissions data

4. **Data Types:**
   - Most numeric columns are stored as **int64** (integers)
   - Year-based emission columns are stored as **float64** (floats), although the values in the preview rows are all whole numbers
   - Country names are stored as **object** (strings)
   - Data types are appropriate for the data content

5. **Units:**
   - Carbon stocks: **Mg C** (Megagrams of Carbon)
   - Carbon emissions: **Mg CO2e** (Megagrams of CO2 equivalent)
   - Carbon removals: **Mg CO2** (Megagrams of CO2)
   - Tree cover extent: **ha** (hectares)
   - **Important**: Units must be preserved and documented during merging

6. **Data Quality:**
   - **No missing values found** in the dataset, so no imputation is needed for this sheet
   - **498 rows** represent country-threshold combinations (similar to tree cover loss structure)
   - The dataset includes three threshold values per country (30, 50, and 75 in the preview rows), consistent with 166 countries × 3 thresholds = 498 rows

7. **Relationship to Other Sheets:**
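   - Can be merged with tree cover loss data using country + threshold + year as keys, once both sheets are reshaped to long format (the `year` key does not exist as a column in the wide layout)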
   - **Important**: The threshold column name differs (`umd_tree_cover_density_2000__threshold` vs. `threshold`) - will need standardization during data preparation
   - Carbon emissions can be linked to forest loss to quantify climate impact
   - The year-based emission columns (2001-2024) align with the loss data time period

8. **Temporal Coverage:**
   - **Year range**: 2001 to 2024 (24 years of data)
   - Baseline year: 2000 (for carbon stocks and tree cover extent)
   - Annual emissions data available for each year in the range
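
The structural claims above can be spot-checked in a few lines. The following is a minimal sketch, assuming a hypothetical DataFrame `carbon_sample` that reproduces only the first five preview rows and a subset of the non-year columns; on the real sheet the same checks would run against all 498 rows. Because the published values are rounded, the net flux matches emissions minus removals only to within 1 Mg in these rows, so the check uses a tolerance rather than exact equality.

```python
import numpy as np
import pandas as pd

THRESHOLD_COL = "umd_tree_cover_density_2000__threshold"
EMISSIONS_COL = "gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1"
REMOVALS_COL = "gfw_forest_carbon_gross_removals__Mg_CO2_yr-1"
NET_FLUX_COL = "gfw_forest_carbon_net_flux__Mg_CO2e_yr-1"

# Hypothetical sample: the five preview rows, restricted to the columns checked here.
carbon_sample = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Albania"],
    THRESHOLD_COL: [30, 50, 75, 30, 50],
    EMISSIONS_COL: [15339, 12657, 6147, 721806, 682919],
    REMOVALS_COL: [376800, 275855, 151074, 5103589, 4294627],
    NET_FLUX_COL: [-361461, -263199, -144926, -4381783, -3611709],
})

# Each (country, threshold) pair should appear exactly once.
has_duplicate_keys = carbon_sample.duplicated(subset=["country", THRESHOLD_COL]).any()

# Net flux should equal emissions minus removals, allowing 1 Mg of rounding error.
expected_flux = carbon_sample[EMISSIONS_COL] - carbon_sample[REMOVALS_COL]
flux_matches = np.isclose(carbon_sample[NET_FLUX_COL], expected_flux, rtol=0, atol=1)

# Number of distinct thresholds available for each country in the sample.
thresholds_per_country = carbon_sample.groupby("country")[THRESHOLD_COL].nunique()

print(f"Duplicate keys: {has_duplicate_keys}")
print(f"Net flux consistent in all rows: {flux_matches.all()}")
print(thresholds_per_country)
```

Albania shows only two thresholds here because the preview stops after five rows, not because the sheet lacks the third.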

**Implication for Data Preparation:**
- This sheet will need the same wide-to-long format transformation as tree cover loss and primary loss
- Year extraction must account for starting year 2001 (same as tree cover loss)
- **Column name standardization needed**: `umd_tree_cover_density_2000__threshold` should be renamed to `threshold` for consistency with other sheets
- Can be merged with loss data after reshaping using country + threshold + year as keys
- After reshaping, we expect 498 × 24 = 11,952 rows (if all years have data)
- Units must be preserved and clearly documented in the final merged dataset
- The carbon data enables climate impact analysis by linking deforestation to carbon emissions
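
The reshaping steps listed above could look roughly like the sketch below. It is not the project's actual preparation code: `wide` is a hypothetical two-row excerpt with only three of the 24 year columns, the assignment of preview values to specific years is assumed for illustration, and the output column name `gross_emissions_Mg_CO2e` is a placeholder chosen to keep the unit visible.

```python
import re

import pandas as pd

# Matches only the year-specific columns; the `yr-1` summary column has no 4-digit year.
YEAR_PATTERN = r"^gfw_forest_carbon_gross_emissions_(\d{4})__Mg_CO2e$"

# Hypothetical excerpt in the sheet's wide layout (values are illustrative).
wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "umd_tree_cover_density_2000__threshold": [30, 30],
    "gfw_forest_carbon_gross_emissions__Mg_CO2e_yr-1": [15339, 721806],
    "gfw_forest_carbon_gross_emissions_2001__Mg_CO2e": [27986.0, 1417747.0],
    "gfw_forest_carbon_gross_emissions_2002__Mg_CO2e": [41762.0, 348556.0],
    "gfw_forest_carbon_gross_emissions_2024__Mg_CO2e": [2636.0, 308121.0],
})

year_cols = [c for c in wide.columns if re.match(YEAR_PATTERN, c)]

carbon_long = (
    wide
    # Standardize the threshold column name to match the other sheets.
    .rename(columns={"umd_tree_cover_density_2000__threshold": "threshold"})
    # One output row per (country, threshold, year column).
    .melt(
        id_vars=["country", "threshold"],
        value_vars=year_cols,
        var_name="source_column",
        value_name="gross_emissions_Mg_CO2e",
    )
)

# Pull the 4-digit year out of the original column name.
carbon_long["year"] = (
    carbon_long["source_column"].str.extract(YEAR_PATTERN, expand=False).astype(int)
)
carbon_long = (
    carbon_long.drop(columns="source_column")
    .sort_values(["country", "threshold", "year"])
    .reset_index(drop=True)
)

print(carbon_long)
```

Anchoring the pattern on the full column name keeps the `yr-1` summary column out of the year columns, so the long table has exactly one row per country, threshold, and year.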


**Final Summary: CRISP-DM Data Understanding Phase Completion**

This notebook represents the completion of the **Data Understanding** phase of the CRISP-DM methodology. Through systematic exploration of the raw Global Forest Watch data, we have achieved the key objectives of this phase:

1. **Data Collection and Initial Assessment:**
   - Successfully loaded and cataloged all available data sources (9 sheets total)
   - Identified and focused on the four country-level datasets relevant to our analysis
   - Documented the data source structure and organization

2. **Data Description:**
   - Characterized the structure of each dataset (wide format with year-based columns for loss and carbon data; long format for drivers data)
   - Documented all column names, data types, and basic statistics for each sheet
   - Identified 1,328 rows in tree cover loss, 76 rows in primary loss, 21,897 rows in drivers, and 498 rows in carbon data

3. **Data Quality Assessment:**
   - Evaluated missing values across all sheets (found no missing values in any dataset - excellent quality!)
   - Identified data inconsistencies (e.g., different threshold column names, different starting years)
   - Assessed data completeness and coverage (166 countries in tree cover loss, 76 in primary loss, 158 in drivers, 166 in carbon data)

4. **Data Exploration:**
   - Examined relationships between datasets (common identifiers: country, threshold, year)
   - Identified 76 countries present in all sheets (critical for merging strategy)
   - Discovered structural differences (drivers data already in long format; others need reshaping)

5. **Documentation of Findings:**
   - Created comprehensive documentation of data characteristics
   - Identified specific data preparation requirements (reshaping, column standardization, merge strategy)
   - Established clear expectations for the next phase

**Transition to Data Preparation Phase (CRISP-DM):**

The findings from this Data Understanding phase directly inform the **Data Preparation** phase (02_data_preparation.ipynb), where we will:

- **Data Selection:** Focus on the four country-level datasets identified
- **Data Cleaning:** Standardize column names (e.g., `umd_tree_cover_density_2000__threshold` → `threshold`, `area__ha` → `area_ha`)
- **Data Construction:** Reshape wide format to long format for tree cover loss, primary loss, and carbon data
- **Data Integration:** Merge all sheets using country + threshold + year as keys
- **Data Formatting:** Ensure consistent data types and units across the merged dataset
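
As a rough illustration of the integration step, the sketch below merges two long-format frames on country + threshold + year. Both frames are hypothetical: `tree_cover_loss_ha` and `gross_emissions_Mg_CO2e` are assumed column names, and the loss values are placeholders rather than real data. The `validate` and `indicator` options of `DataFrame.merge` are used so that duplicate keys raise an error and unmatched rows stay visible instead of silently disappearing.

```python
import pandas as pd

MERGE_KEYS = ["country", "threshold", "year"]

# Hypothetical long-format loss data (placeholder values).
loss_long = pd.DataFrame({
    "country": ["Albania", "Albania", "Afghanistan"],
    "threshold": [30, 30, 30],
    "year": [2001, 2002, 2001],
    "tree_cover_loss_ha": [10.0, 20.0, 5.0],
})

# Hypothetical long-format carbon data; Afghanistan is left out on purpose.
carbon_long = pd.DataFrame({
    "country": ["Albania", "Albania"],
    "threshold": [30, 30],
    "year": [2001, 2002],
    "gross_emissions_Mg_CO2e": [1417747.0, 348556.0],
})

merged = loss_long.merge(
    carbon_long,
    on=MERGE_KEYS,
    how="left",              # keep every loss row, matched or not
    validate="one_to_one",   # raise if a key combination repeats on either side
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows present in the loss data but missing from the carbon data.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(f"Unmatched loss rows: {len(unmatched)}")
```

A left join is only one option. Since 76 countries appear in all sheets while others appear in only some, the choice between `how="left"` and `how="inner"` determines whether partially covered countries remain in the merged dataset.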

This systematic Data Understanding phase ensures that our Data Preparation phase will be:
- **Informed:** Based on actual data characteristics, not assumptions
- **Efficient:** Addressing only real issues identified through exploration
- **Comprehensive:** Covering all identified data quality and structural concerns
- **Aligned with CRISP-DM:** Following the methodology's structured approach to data science projects

The completion of this phase provides a solid foundation for the subsequent phases of CRISP-DM: Data Preparation, Modeling, Evaluation, and Deployment.
