# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My datasets:
- Conflict in Mexico: https://acleddata.com/data-export-tool/ (Armed Conflict Location & Event Data Project, January 1, 2018 - December 31, 2023, accessed October 18, 2024)
- Forest cover loss: https://www.globalforestwatch.org/dashboards/country/MEX/ (Global Forest Watch, 2001-2023, accessed October 11, 2024)

Import the necessary libraries and create your dataframe(s).

In [2]:
import pandas as pd

# Load the GFW tree cover loss by driver CSV file into a dataframe:
driver_df = pd.read_csv('/Users/audreythill/Desktop/LaunchCode/Python/Final Project/eda-checkpoint/GFW_Mexico_treecover_loss__ha.csv')

# Rename several columns to make them easier to reference in EDA and to match the conflict_df (e.g., 'umd_tree_cover_loss__year' becomes 'year'):
forest_renamed = driver_df.rename(columns={'umd_tree_cover_loss__year': 'year', 'adm1': 'admin1', 'tsc_tree_cover_loss_drivers__driver': 'driver', 'umd_tree_cover_loss__ha': 'forest_loss_by_driver'})

forest_renamed.head()

                           driver  year  forest_loss_by_driver  gfw_gross_emissions_co2e_all_gases__Mg
0                         Unknown  2012            2988.479457                                535026.1
1                         Unknown  2019             795.091102                                199184.8
2  Commodity driven deforestation  2021           43255.446201                              19047170.0
3                    Urbanization  2021             707.054035                                371255.2
4                         Unknown  2011            2072.176484                                284452.0


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
# Identify how many nulls there are in the df:
print(forest_renamed.isnull().sum())


driver                                    0
year                                      0
forest_loss_by_driver                     0
gfw_gross_emissions_co2e_all_gases__Mg    0
dtype: int64
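
Every column reports zero nulls, so nothing needed handling here. For reference, the following is a minimal sketch of how the same check could be extended to show the share of missing values per column and the affected rows. The `sample` frame and its single NaN are invented purely to show the check firing; they are not taken from the GFW file.

```python
import numpy as np
import pandas as pd

# Illustrative frame only: the NaN below is invented for demonstration.
sample = pd.DataFrame({
    "driver": ["Unknown", "Urbanization", "Unknown"],
    "year": [2012, 2021, 2011],
    "forest_loss_by_driver": [2988.48, np.nan, 2072.18],
})

# Count of nulls and share of nulls per column.
null_counts = sample.isnull().sum()
null_share = sample.isnull().mean().round(2)
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))

# Rows with any missing value, for manual inspection before dropping or filling.
rows_with_nulls = sample[sample.isnull().any(axis=1)]
print(rows_with_nulls)
```

Looking at the affected rows before deciding helps separate values that are missing by mistake from values that are legitimately not applicable.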


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [4]:
# Find the max and minimum values for forest loss by driver (and round 2 decimal places):
max_forest_loss_by_driver = forest_renamed['forest_loss_by_driver'].max().round(2)
print(f"max forest loss by driver: {max_forest_loss_by_driver} hectares")

min_forest_loss_by_driver =forest_renamed['forest_loss_by_driver'].min().round(2)
print(f"min forest loss by driver: {min_forest_loss_by_driver} hectares")

max forest loss by driver: 244467.94 hectares
min forest loss by driver: 183.03 hectares


In [5]:
# Now I want to look at the 1st and 3rd quartiles and find the interquartile range (IQR) to get a better sense of the data spread.
# For forest loss by driver:
Q1_driver = forest_renamed['forest_loss_by_driver'].quantile(0.25).round(2)
Q3_driver = forest_renamed['forest_loss_by_driver'].quantile(0.75).round(2)
IQR_driver = Q3_driver - Q1_driver   # calculate the inter-quartile range
print(Q1_driver)  
print(Q3_driver)
print(IQR_driver)


# The IQR for forest loss by driver is not the same as the IQR for forest loss by admin. This is to be expected since they represent different units of analysis (admin vs. driver).
# The larger IQR for forest loss by driver suggests that there is more variability attributable to the different drivers of deforestation.
# For example, Q3 is very high (43034.61 ha) relative to Q1, which suggests the data skews to the right: while a significant portion of the data has relatively low forest loss
# (up to Q1, which is 1420.49 ha), there are some drivers that lead to extremely high forest loss.

1420.49
43034.61
41614.12
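
The right-skew reading above can be cross-checked by comparing the mean with the median and by computing a skewness statistic. The following is a rough sketch using only a handful of values copied from the outputs in this notebook, so the printed numbers will not match the full column; the variable names are illustrative.

```python
import pandas as pd

# A few forest-loss values (hectares) copied from the outputs in this notebook.
# This is NOT the full column, so results here differ from the real quartiles.
loss = pd.Series(
    [2988.479457, 795.091102, 43255.446201, 707.054035,
     2072.176484, 244467.937745, 134235.339909],
    name="forest_loss_by_driver",
)

# Quartiles and the interquartile range.
q1, median, q3 = loss.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

print(loss.describe().round(2))
print(f"IQR: {iqr:.2f} ha")
# A mean well above the median and a positive skewness both point to a right skew.
print(f"mean > median: {loss.mean() > median}")
print(f"skewness: {loss.skew():.2f}")
```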


In [6]:
# Now find the outliers by establishing boundaries based on adding or subtracting 1.5 times the interquartile range (IQR)
# by driver:
lower_bound = Q1_driver - 1.5 * IQR_driver
upper_bound = Q3_driver + 1.5 * IQR_driver
outliers = forest_renamed[(forest_renamed['forest_loss_by_driver'] < lower_bound) | (forest_renamed['forest_loss_by_driver'] > upper_bound)]
print(outliers)
# I see the outliers for forest loss are heavily represented by ‘commodity driven deforestation’ and ‘shifting agriculture’ in the ‘driver’ column.
# Based on my visualizations in EDA, I know these are important outliers to include.

                   driver  year  forest_loss_by_driver  \
70   Shifting agriculture  2019          244467.937745   
74   Shifting agriculture  2011          134235.339909   
75   Shifting agriculture  2005          152518.165832   
76   Shifting agriculture  2002          119783.012726   
78   Shifting agriculture  2003          107709.943029   
87   Shifting agriculture  2007          161251.748375   
90   Shifting agriculture  2023          140197.844185   
91   Shifting agriculture  2016          202491.648488   
94   Shifting agriculture  2018          192642.494869   
97   Shifting agriculture  2006          119823.198893   
99   Shifting agriculture  2008          131157.298720   
101  Shifting agriculture  2015          138919.102405   
104  Shifting agriculture  2021          126386.193787   
106  Shifting agriculture  2004          125829.313461   
107  Shifting agriculture  2020          214619.365181   
108  Shifting agriculture  2009          192429.334132   
117  Shifting 
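
Rather than reading the printed rows, the outliers can be tallied per driver. The sketch below assumes the quartile values printed earlier (1420.49 and 43034.61) and uses a few rows copied from the outputs in this notebook, not the full dataset; `sample`, `is_outlier`, and `outlier_counts` are illustrative names.

```python
import pandas as pd

# Rows copied from the outputs shown in this notebook (not the full dataset).
sample = pd.DataFrame({
    "driver": ["Unknown", "Commodity driven deforestation", "Urbanization",
               "Shifting agriculture", "Shifting agriculture"],
    "year": [2012, 2021, 2021, 2019, 2011],
    "forest_loss_by_driver": [2988.479457, 43255.446201, 707.054035,
                              244467.937745, 134235.339909],
})

# Quartiles as printed earlier for the full column.
q1, q3 = 1420.49, 43034.61
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# A row is an outlier when it falls strictly outside [lower_bound, upper_bound].
is_outlier = ~sample["forest_loss_by_driver"].between(lower_bound, upper_bound)
outlier_counts = sample.loc[is_outlier, "driver"].value_counts()

print(f"bounds: {lower_bound:.2f} to {upper_bound:.2f} ha")
print(outlier_counts)
```

With these quartiles the lower bound is negative, and forest loss cannot be negative, so only high-side outliers are possible.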

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [7]:
# There is quite a bit of unnecessary data. I want to drop some columns, so first I will print out the current column names.
print(forest_renamed.columns)

# The following columns are not relevant to my research scope, so I will drop them:
#     - ACLED: 'inter1', 'inter2', 'interaction', 'civilian_targeting', 'iso_y', 'disorder_type', 'event_date', 'region', 'admin2', 'admin3', 'source', 'geo_precision', 'source_scale', 'country', 'location', 'notes', 'tags', 'timestamp', 'time_precision'
#     - GFW: 'gfw_gross_emissions_co2e_all_gases__Mg_driver', 'iso_x', 'gfw_gross_emissions_co2e_all_gases__Mg_admin'
# Only the GFW emissions column exists in this dataframe, so that is the one dropped here.

merged_df = forest_renamed.drop(columns=['gfw_gross_emissions_co2e_all_gases__Mg'])
print(merged_df.columns)
# I did leave in a few columns (from the ACLED conflict dataset) that are not necessary to my immediate project but that might be
# interesting to look at later, e.g., actor names and the types of violence.




Index(['driver', 'year', 'forest_loss_by_driver',
       'gfw_gross_emissions_co2e_all_gases__Mg'],
      dtype='object')
Index(['driver', 'year', 'forest_loss_by_driver'], dtype='object')


In [8]:
# I need to drop forest data prior to 2015 because that is when GFW changed their methodology. Comparing data from before and after 2015
# could result in inaccurate analysis.

after_2015_merged_df = merged_df[merged_df['year'] >= 2015]


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [9]:
# I will rename some columns to be more descriptive to better facilitate my analysis in Tableau.
# Only 'forest_loss_by_driver' exists in this dataframe; rename() skips the other keys by default.
after_2015_merged_df = after_2015_merged_df.rename(columns={
       'forest_loss_by_admin': 'forest loss by admin',
       'forest_loss_by_driver': 'forest loss by driver',
       'event_id_cnty': 'conflict event id',
       'event_type': 'event type',
       'sub_event_type': 'subevent type'
       })
print(after_2015_merged_df.columns)
print(after_2015_merged_df)

Index(['driver', 'year', 'forest loss by driver'], dtype='object')
                             driver  year  forest loss by driver
1                           Unknown  2019             795.091102
2    Commodity driven deforestation  2021           43255.446201
3                      Urbanization  2021             707.054035
7                      Urbanization  2015             862.337747
8    Commodity driven deforestation  2017           76080.396553
10                     Urbanization  2018            1014.596636
12   Commodity driven deforestation  2020           72552.914783
13                          Unknown  2018            1504.600186
18   Commodity driven deforestation  2023           72558.881596
19                          Unknown  2022            1070.804437
20                     Urbanization  2023            1961.959394
21                          Unknown  2017             699.846113
22                     Urbanization  2017             563.918378
23   Commodity driven d

In [None]:
# Add a column for 'football field equivalents' so I can make a bar chart with icons in Tableau.
import numpy as np
# First calculate the number of football field equivalents. np.floor ensures that the numbers round down so that I
# don't end up with half of an icon.
# The column was renamed to 'forest loss by driver' (lowercase) in the previous cell, so use that exact name.
after_2015_merged_df['Football Field Equivalents'] = np.floor((after_2015_merged_df['forest loss by driver'] * 1.4) / 1000).astype(int)

# Repeat rows using NumPy so that each football field equivalent has its own row.
expanded_rows = np.repeat(after_2015_merged_df.values, after_2015_merged_df['Football Field Equivalents'], axis=0)

In [11]:
after_2015_merged_df.to_csv('df_forest_loss_driver.csv', index=False) 
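
The `np.repeat` call in the previous cell returns a bare NumPy array, which loses the column names. One possible alternative, sketched below on two rows copied from this notebook's outputs, repeats the index instead so the result stays a dataframe that can be exported for Tableau. It assumes the same conversion used above (hectares times 1.4, divided by 1000); `df` and `expanded` are illustrative names.

```python
import numpy as np
import pandas as pd

# Two rows copied from the outputs in this notebook (not the full dataset).
df = pd.DataFrame({
    "driver": ["Urbanization", "Commodity driven deforestation"],
    "year": [2021, 2021],
    "forest loss by driver": [707.054035, 43255.446201],
})

# Same conversion as in the notebook: floor so there are no partial icons.
df["Football Field Equivalents"] = np.floor(
    df["forest loss by driver"] * 1.4 / 1000
).astype(int)

# Repeat each index label once per icon, then select those labels.
# Rows with a count of 0 drop out of the expanded frame entirely.
expanded = df.loc[df.index.repeat(df["Football Field Equivalents"])]
expanded = expanded.reset_index(drop=True)

print(df[["driver", "Football Field Equivalents"]])
print(len(expanded))
```

Because small values floor to zero, drivers with low forest loss would show no icons at all, which may be worth noting on the chart.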

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?

Yes, but these were not due to errors on the part of the people who created the datasets. They mostly stemmed from combining two separate datasets that used different spelling conventions for Mexican states. Additionally, there were null values, but these were not missing due to error; rather, there was no relevant data for those particular rows/columns (e.g., if a conflict event did not involve an 'associated actor 2'). I also dropped columns that are not relevant to my analysis and renamed columns so they are more descriptive. As for irregular data, I identified outliers, but based on what I saw, these are exactly what I will be looking for in my analysis.

At first I thought I would drop the forest loss data that precedes January 2018 (when the conflict data starts), but then I realized I still may want to analyze general trends from those years. However, I did drop data prior to 2015 because this is when the GFW dataset creators changed their methodology.
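
The state-spelling mismatch mentioned above is the kind of inconsistency that can be handled with a normalization step before merging. The following is a minimal sketch, assuming the differences come down to accents, casing, and stray whitespace; `normalize_state` is a hypothetical helper and the sample spellings are illustrative, not taken from either dataset.

```python
import unicodedata


def normalize_state(name: str) -> str:
    """Return a lowercase, accent-free, whitespace-trimmed version of a state name."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks (the accents) and keep the base letters.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse repeated or surrounding whitespace.
    return " ".join(no_accents.lower().split())


# Illustrative spellings only.
print(normalize_state("Michoacán"))
print(normalize_state("  NUEVO   León "))
```

Applying a helper like this to the state column of both dataframes and merging on the normalized value would keep the original spellings available for display.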

2. Did the process of cleaning your data give you new insights into your dataset?

The process of identifying outliers in my forest loss data gave me some insights. For example, the IQR for 'forest loss by driver' was larger than the IQR for 'forest loss by admin'. For 'driver', the 3rd quartile was quite high, meaning the data skewed significantly to the right. This leads me to think that most drivers have little relationship to forest loss whereas a few drivers have a significant impact. On the other hand, the lower IQR for 'admin'-related forest loss suggests that the specific admin is less of a factor in the forest loss.

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

As described above, I will definitely need to drill into the relationship between admin and forest loss. Based on my calculations of IQR, it appears to be less of a factor compared to other 'drivers' of forest loss, but this does not mean there is not a statistical relationship. I will try to do a regression analysis in addition to other charts/graphs.

I will also be interested in using the latitude and longitude (provided by the ACLED conflict dataset) to make maps. Hopefully this works to create shading on individual states.
