# Forest Data Annual Explore
**Goal:** We have NASA JPL forest data covering roughly the past 20 years, with one file per year from 2001 to 2019. This notebook combines those files into a single cleaned, tidy dataset that we can use to plot trends over time.

In [104]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import statsmodels
import numpy as np
import geopandas as gpd
from keplergl import KeplerGl
import os

pd.options.mode.chained_assignment = None  # default='warn'

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pio.templates.default = "none"
%config InlineBackend.figure_format ='retina'

## Data Read-in and Cleaning
The data lives in this Google Drive folder: https://drive.google.com/drive/folders/1wvgpM55g77-bHRfHJAE_HPfHZLIkII1f

The datasets come at two granularities: country-level and province-level forest emissions. We read each into its own dataframe.
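
As a rough sketch of how a file name maps to a granularity and a year: the `gadm00`/`gadm01` prefixes and the `carbon_` marker followed by a four-digit year are taken from the cells in this notebook, while the example file names and the helper `parse_forest_filename` are hypothetical and only illustrate the idea.

```python
def parse_forest_filename(name):
    """Return (level, year) for a data file name, or None if it does not match."""
    prefixes = {"gadm00": "country", "gadm01": "province"}
    level = prefixes.get(name[:6])
    if level is None or "carbon_" not in name:
        return None
    # The four characters after "carbon_" are assumed to be the year
    year = name.split("carbon_", 1)[1][:4]
    if not year.isdigit():
        return None
    return level, year

# Hypothetical file names, for illustration only
examples = ["gadm00_carbon_2001.xlsx", "gadm01_carbon_2019.xlsx", "README.txt"]
parsed = {name: parse_forest_filename(name) for name in examples}
```

Returning `None` for anything that does not match makes it easy to skip stray files in the data directory instead of failing partway through the read loop.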

In [13]:
# List all files in the local data directory (downloaded into the repo to avoid dealing with the Google Drive API)
directory = 'raw_data'  # same folder the read_excel calls below load from
full_list = os.listdir(directory)
country_list = [file for file in full_list if file.startswith('gadm00')]
province_list = [file for file in full_list if file.startswith('gadm01')]

In [69]:
# Create Country Dataframe
country_df_list = []
for f in country_list:
    year = f.split("carbon_",1)[1][:4]
    t_df = pd.read_excel('./raw_data/' + f,usecols=lambda x: 'Unnamed' not in x) # Get rid of unnamed index column
    # Next we want to clean up these column names so that the year is taken out and moved to a separate column
    t_df.insert(1,'Year',year) #Create new year column
    t_df.columns = [col.replace(' Yr-'+year,'') for col in t_df.columns]
    country_df_list.append(t_df)
len(country_df_list)
country_df = pd.concat(country_df_list).sort_values(by = ['Country','Year'])
country_df.to_csv('./cleaned_data/country_level_annual_forest_data.csv')
country_df.head()

19

Unnamed: 0,Country,Year,Forest Area (ha),Nonforest Area (ha),Forest loss (ha),Forest C (TgC),Nonforest C (TgC),Total C (TgC),Deforestation Emission (TgC),Degradation Emission (TgC),Fire Emission (TgC),Removal (tons C)
0,Afghanistan,2001,215448.071289,378638.40332,103.316418,2.177294,340.720184,342.897491,0.000603,0.006141,0.002173,-536413.6
0,Afghanistan,2002,215448.071289,378638.40332,203.680723,2.10271,332.489868,334.59259,0.005333,0.006073,0.009176,8284320.0
0,Afghanistan,2003,215448.071289,378638.40332,243.034849,2.211627,328.742676,330.954285,0.009189,0.00628,0.010983,3611854.0
0,Afghanistan,2004,215448.071289,378638.40332,206.475227,2.149607,321.279999,323.429596,0.003678,0.005521,0.001391,7514099.0
0,Afghanistan,2005,215448.071289,378638.40332,267.715857,2.23558,323.401276,325.636871,0.008252,0.006086,0.00639,-2228003.0
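
The column clean-up in the loop above can be tried on a one-row toy frame. This is a minimal sketch, assuming the raw columns carry a ` Yr-<year>` suffix (as the `replace` call implies); the helper name `tidy_year_columns` and the rounded values are illustrative only.

```python
import pandas as pd

def tidy_year_columns(df, year, position=1):
    """Add a Year column and strip the ' Yr-<year>' suffix from column names."""
    out = df.copy()
    out.insert(position, "Year", year)
    out.columns = [col.replace(" Yr-" + year, "") for col in out.columns]
    return out

# Toy stand-in for one raw yearly file (assumed column naming)
raw = pd.DataFrame({
    "Country": ["Afghanistan"],
    "Forest Area (ha) Yr-2001": [215448.07],
    "Forest loss (ha) Yr-2001": [103.32],
})
tidy = tidy_year_columns(raw, "2001")
```

The same helper would cover the province files by passing `position=2`, so that `Year` lands after `Province`.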


In [74]:
# Create Province Dataframe
province_df_list = []
for f in province_list:
    year = f.split("carbon_",1)[1][:4]
    t_df = pd.read_excel('./raw_data/' + f,usecols=lambda x: 'Unnamed' not in x) # Get rid of unnamed index column
    # Next we want to clean up these column names so that the year is taken out and moved to a separate column
    t_df.insert(2,'Year',year) #Create new year column
    t_df.columns = [col.replace(' Yr-'+year,'') for col in t_df.columns]
    province_df_list.append(t_df)
len(province_df_list)
province_df = pd.concat(province_df_list).sort_values(by = ['Country','Province','Year'])
province_df.to_csv('./cleaned_data/province_level_annual_forest_data.csv')
province_df.head()

19

Unnamed: 0,Country,Province,Year,Forest Area (ha),Nonforest Area (ha),Forest loss (ha),Forest C (GgC),Nonforest C (GgC),Total C (GgC),Deforestation Emission (GgC),Degradation Emission (GgC),Fire Emission (GgC),Removal (tons C)
0,Afghanistan,Khost,2001,5781.328964,26786.935425,0.747203,0.0,4669.428349,4669.428349,0.011443,0.0,0.0,117764.472961
0,Afghanistan,Khost,2002,5781.328964,26786.935425,27.872674,0.0,4480.788708,4480.788708,0.419173,0.0,0.0,188220.471144
0,Afghanistan,Khost,2003,5781.328964,26786.935425,24.737055,0.269373,4632.602692,4632.872105,0.405558,0.0,2.983986,-155472.949147
0,Afghanistan,Khost,2004,5781.328964,26786.935425,107.123736,0.22072,4431.820869,4432.041645,1.559557,0.0,0.675997,198594.912887
0,Afghanistan,Khost,2005,5781.328964,26786.935425,67.987326,0.293602,4826.453686,4826.746941,1.28194,0.0,4.264727,-400251.954794
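
Note that the province table reports carbon in GgC while the country table uses TgC. If the two levels are ever compared, the units need to be aligned first (1 TgC = 1000 GgC). A minimal sketch on a single row copied from the output above; the constant name `GG_PER_TG` is just illustrative.

```python
import pandas as pd

GG_PER_TG = 1000.0  # 1 teragram = 1000 gigagrams

province_row = pd.DataFrame({
    "Country": ["Afghanistan"],
    "Province": ["Khost"],
    "Year": ["2001"],
    "Total C (GgC)": [4669.428349],
})
# Express the province value in the same unit as the country table
province_row["Total C (TgC)"] = province_row["Total C (GgC)"] / GG_PER_TG
```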


## Country Data Explore
Let's take a look at some of this country data and see how these values change over time. Some initial visualizations to look at here:
* Let's pick 5 similarly sized countries and take a look at how key variables plot over time
* Let's plot totals with a country breakdown

In [73]:
country_select = ['United States','Brazil','India','China','Russia']
variable_list = country_df.columns.drop(['Country','Year'])
for var in variable_list:
    px.line(country_df[country_df.Country.isin(country_select)],
            x='Year',y=var,color = 'Country',title = var + ' (2001-2019)')

In [85]:
# For a handful of variables, let's plot cumulative charts for these two decades
country_select = ['United States','Brazil','India','China','Russia']
variable_list = ['Deforestation Emission (TgC)',
       'Degradation Emission (TgC)', 'Fire Emission (TgC)',
       'Removal (tons C)']
for var in variable_list:
    t_df = country_df[country_df.Country.isin(country_select)][['Country','Year',var]]
    # Select the variable before cumsum so the string Year column is not accumulated as well
    t_df['Cumulative ' + var] = t_df.groupby('Country')[var].cumsum()
    px.line(t_df,x='Year',y='Cumulative ' + var,color = 'Country',title = 'Cumulative ' + var + ' [2001-2019]')
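
The per-country running total can be sanity-checked on a toy frame. This is a small sketch with made-up countries and values; it only shows that the cumulative sum restarts for each country when rows are ordered by year.

```python
import pandas as pd

var = "Fire Emission (TgC)"
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": ["2001", "2002", "2003"] * 2,
    var: [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],  # made-up values
})
# Rows must be ordered by year within each country for the running total to make sense
toy = toy.sort_values(["Country", "Year"])
toy["Cumulative " + var] = toy.groupby("Country")[var].cumsum()
```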

In [121]:
# Next let's plot how each country contributes to the global totals for 2019
year_df = country_df[country_df.Year == '2019']  # Year is stored as a string
variable_list = country_df.columns.drop(['Country','Year'])
for var in variable_list:
    global_total = year_df[var].sum()
    t_df = year_df[['Country',var]].copy()  # copy so we don't modify a slice of year_df
    t_df.loc[-1] = ['Global Total',global_total]  # add the total as an extra bar for reference
    t_df['% of Global Total'] = t_df[var] / global_total
    t_df = t_df.sort_values(by = var)
    fig = px.bar(t_df.tail(10),x = var,y='Country',text = '% of Global Total',orientation = 'h',
                 title = 'Top Countries: ' + var + ' [2019 Data]',color_discrete_sequence = ['lightgreen'])
    fig = fig.update_traces(texttemplate='%{text:.2p}', textposition='outside')
    fig
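
The share-of-total calculation behind these bar charts can be checked without plotting. A minimal sketch on made-up numbers: each value is divided by the column total, and the largest rows are taken from the end of an ascending sort, which is the order a horizontal bar chart needs to show the biggest bar on top.

```python
import pandas as pd

var = "Forest loss (ha)"
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    var: [50.0, 30.0, 20.0],  # made-up values
})
toy["% of Global Total"] = toy[var] / toy[var].sum()
# Ascending sort + tail keeps the largest rows, with the biggest one last
top = toy.sort_values(var).tail(2)
```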