<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/03_EDA-Visualization/03_EDA_Practice.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# PACKAGES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.express as px

# EDA Practice

<img src='https://miro.medium.com/v2/resize:fit:750/format:webp/1*bNSd-pm4XjkOV7uSTNAfnA.jpeg' width="450">

Source: [EDA, Data Preprocessing, Feature Engineering: We are different!](https://medium.com/@ndleah/eda-data-preprocessing-feature-engineering-we-are-different-d2a5fa09f527), Leah Nguyen

## Content

>  More than anything, EDA is a state of mind ([Grolemund & Wickham, 2016](https://r4ds.had.co.nz/exploratory-data-analysis.html))

Exploratory Data Analysis (EDA) is the very first step before developing a statistical or ML model to answer business problems. Our goal is to get acquainted with the important traits of our data set by using descriptive statistics and visualization tools. To learn more about descriptive statistics, graphs and graph libraries in Python, you should read the notebook "03_EDA_Data-visualization". 

In this notebook, we will explore some indicators from [The Quality of Government Environmental Indicators Dataset](https://www.gu.se/en/quality-government/qog-data/data-downloads/environmental-indicators-dataset). Instead of importing the raw data, we will pick up where we left off in our previous practice on data cleaning and import the dataset we created in that notebook.

*Reference: Povitkina, Marina, Natalia Alvarado Pachon & Cem Mert Dalli. 2021. The Quality of Government Environmental Indicators Dataset, version Sep21. University of Gothenburg: The Quality of Government Institute, https://www.gu.se/en/quality-government*

We will apply the [recipe to empirically answer any question quickly](https://medium.com/towards-data-science/a-recipe-to-empirically-answer-any-question-quickly-22e48c867dd5) by Quentin Gallea, exploring how climate change and consumption patterns affect biodiversity, and more precisely, fishing biocapacity. We will follow these steps:

- [Select our ingredients](#select-ingredients)
- [Pick the right quantity of each ingredient](#pick-ingredients)
- [Tasting and preparing the ingredients (univariate analysis)](#taste-ingredients)
- [Cooking the ingredients together (bivariate analysis)](#cooking-ingredients)
- [Tasting the new recipe](#tasting-ingredients)



**We imported packages at the top of this notebook, do not forget to run the top cell to import the necessary packages!** Feel free to import other libraries if needed.


## Select our ingredients <a name="select-ingredients"></a>

<img src='https://i.imgflip.com/815iqf.jpg' width="500">


Let's import the CSV file "df_qog_polity", obtained from our last practice.

In [None]:
url = 'https://raw.githubusercontent.com/thurmboris/Data-Science_4_Sustainability/main/data/df_qog_polity.csv'
df = pd.read_csv(url)

Our first step is to select our ingredients, i.e., the variables we are interested in. 

In this notebook we will explore the following question:

*How do sea surface temperature (SST) anomalies affect fishing ground biocapacity?*

We could think that SST anomalies reduce the biosphere's ability to produce seafood. Indeed, since most marine animals live in the upper layers of the water body, excess temperature could diminish the livability of these layers, forcing fish to migrate, for example, to deeper zones, in turn affecting their ability to reproduce because:
- eggs may develop properly only at a given temperature,
- changes in natural characteristics (e.g., terrain) may make breeding impossible,
- changes in the local food chain could make breeding more difficult.

In addition, a high fish consumption could exert additional pressure on the biosphere, while more stringent environmental regulation could lower this pressure. Hence we should also consider these factors in our analysis. 

Is our assumption correct? Let's explore! 

Here are the variables that we selected to answer our research question:
- **outcome variable**: Fishing ground biocapacity per capita (ef_fg_bc), which is the ability of a biosphere to produce seafood (the amount of fishing grounds available, weighted by the productivity of fishing grounds) per capita;
- **main explanatory variable**: SST anomalies (ohi_csst), which is the number of positive temperature deviations (anomalies) that exceed the natural range of variation for a given location, i.e., the frequency with which a location experiences unnaturally warm temperature;
- **additional explanatory variable**: Fish footprint of consumption (gha per person) (ef_fg), calculated based on estimates of the maximum sustainable catch for a variety of fish species;
- **additional explanatory variable**: Number of climate change laws and policies (ccl_nlp), which is the cumulative sum of laws (legislative acts) and policies (executive provisions) related to climate change.

You can learn more about these variables in the [QoG codebook](https://www.qogdata.pol.gu.se/data/codebook_ei_sept21_august2023.pdf). 

Ok, let's extract these variables:

- Extract in a new dataframe the columns 'iso3', 'year', 'cname_qog', 'ef_fg_bc', 'ohi_csst', 'ef_fg', and 'ccl_nlp'.
- Rename the column 'cname_qog' as 'country'
- Display your dataframe
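
If you are unsure how to start, here is a minimal sketch of column selection and renaming on a made-up toy dataframe (`df_toy`, `df_sel`, and all values are invented for illustration; the real exercise uses `df` and the full list of columns above):

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
df_toy = pd.DataFrame({
    "iso3": ["CHE", "FRA"],
    "year": [2015, 2015],
    "cname_qog": ["Switzerland", "France"],
    "ef_fg_bc": [0.01, 0.15],
    "other_column": [1, 2],
})

# Keep only the columns of interest (.copy() gives us an independent dataframe)
df_sel = df_toy[["iso3", "year", "cname_qog", "ef_fg_bc"]].copy()

# rename() returns a new dataframe, so we assign the result back
df_sel = df_sel.rename(columns={"cname_qog": "country"})

print(df_sel)
```

Working on a copy keeps the original dataframe untouched, which is handy if you want to come back and select other variables later.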

In [None]:
# Your code here...


## Pick the right quantity of each ingredient <a name="pick-ingredients"></a>

<img src='https://i.imgflip.com/81615v.jpg' width="400">

Our objective is to explore the data availability of our variables and select a clean sample for the analysis.

- First, use the `.describe()` method ([Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)) to generate descriptive statistics of our variables
- What do you notice regarding the number of observations?

In [None]:
# Your code here...


Although we have 11'722 observations overall, we only have 148 non-missing SST anomaly values and 6'201 non-missing ecological footprint values. Let's try to better understand where those missing values come from.

- Create a new dataframe grouping the observations by year and counting the number of non-missing values for the variables 'ef_fg_bc', 'ohi_csst', and 'ef_fg'. You can use the methods `.groupby()` ([Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)) and `.count()` ([Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html))
- In this new dataframe, remove the rows that have zero non-missing observations for all 3 variables
- Display your new dataframe, what do you observe?
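
As a hint, the pattern could look like the sketch below. It runs on a made-up toy dataframe with only two variables (`df_toy`, `counts`, and the values are invented for illustration), so you will need to adapt the column names.

```python
import numpy as np
import pandas as pd

# Toy data: no values at all in 1990, partial coverage afterwards
df_toy = pd.DataFrame({
    "year": [1990, 1990, 2014, 2014, 2015, 2015],
    "ef_fg_bc": [np.nan, np.nan, 0.2, 0.3, 0.1, np.nan],
    "ohi_csst": [np.nan, np.nan, np.nan, np.nan, 0.5, 0.7],
})

cols = ["ef_fg_bc", "ohi_csst"]

# .count() ignores missing values, so this gives non-missing counts per year
counts = df_toy.groupby("year")[cols].count()

# Keep only the years where at least one variable has a non-zero count
counts = counts[(counts != 0).any(axis=1)]

print(counts)
```

The key point is that `.count()` skips `NaN`, unlike `.size()`, which counts all rows in each group.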

In [None]:
# Your code here...


Unfortunately, we only have SST anomalies for the year 2015... This significantly limits our analysis, as it prevents us from exploiting the time dimension. For the ecological footprint variables, the number of observations increases over time. To keep a relatively constant sample size, we will restrict our sample to the years 1993 to 2015. We will also only keep countries for which we have observations for our outcome variable, namely fishing ground biocapacity per capita:

- Create a new dataframe keeping only the observations between 1993 and 2015 for which the variable 'ef_fg_bc' is not missing
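
One possible way to combine both conditions is a boolean mask. Here is a sketch on a made-up toy dataframe (`df_toy`, `df_sample`, and the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with years inside and outside the range, and one missing value
df_toy = pd.DataFrame({
    "year": [1990, 1993, 2004, 2015, 2016],
    "ef_fg_bc": [0.1, np.nan, 0.3, 0.2, 0.4],
})

# .between() includes both bounds by default; & combines the two conditions
mask = df_toy["year"].between(1993, 2015) & df_toy["ef_fg_bc"].notna()
df_sample = df_toy[mask].copy()

print(df_sample)
```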

In [None]:
# Your code here...


Let's visualize our geographical coverage!

- Create a new dataframe counting the number of observations for each iso 3 code
- Create an interactive map displaying the number of observations for each country. Do not forget to add a title to your graph. You can also use a different [color theme](https://plotly.com/python/builtin-colorscales/). What do you think?

In [None]:
# Your code here...


We have nice geographical coverage, spanning all continents, and most countries have more than 20 observations. Cool!

## Tasting and preparing the ingredients (univariate analysis) <a name="taste-ingredients"></a>

<img src='https://i.imgflip.com/81630v.jpg' width="500">

Our objectives are threefold:
- Prepare the data: by studying the distribution of the variables, we will decide if we should transform the data (e.g., log-transform, define a categorical variable, deal with outliers, etc.)
- Choose the right statistical tools: knowing the nature of each variable (continuous, categorical, binary, etc.) will allow us to choose the right statistical tools (correlation, bar/line graphs, scatter plot, etc).
- Get an idea of the underlying variation: looking at how the variable varies over time (line graph) and space (map) will help us better understand the data and potentially spot some anomalies or interesting shocks to exploit for a natural experiment.

Ok, before digging more carefully into each variable, let's have an overview of how our variables are distributed:

- Display descriptive statistics for our selected sample

In [None]:
# Your code here...


- Create box plots of the variables 'ef_fg_bc', 'ohi_csst', and 'ef_fg'. What do you observe?

*Note: you can create several plots next to each other by using `fig, ax = plt.subplots(1, 3, figsize=(10, 5))`. In `plt.subplots(1,3,figsize=(10,5))`, 1 refers to the number of rows, and 3 to the number of columns: here we would have 3 plots next to each other. You specify the location of each plot thanks to `ax`: here `ax[0]` would be the left plot, `ax[1]` the middle one, and `ax[2]` the right one.*
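
To make the note above concrete, here is a small sketch of three box plots placed side by side. The data are randomly generated, and the names (`data`, `rng`) are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated toy data with three different shapes
rng = np.random.default_rng(0)
data = {
    "normal": rng.normal(size=200),
    "lognormal": rng.lognormal(size=200),
    "uniform": rng.uniform(size=200),
}

# 1 row, 3 columns: ax is an array of three Axes objects
fig, ax = plt.subplots(1, 3, figsize=(10, 5))

for i, (name, values) in enumerate(data.items()):
    ax[i].boxplot(values)   # ax[0] is the left plot, ax[2] the right one
    ax[i].set_title(name)

fig.tight_layout()  # avoid overlapping labels between the subplots
```

In a notebook, the figure is displayed automatically at the end of the cell.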

In [None]:
# Your code here...


It seems that our ecological footprint variables have a lot of outliers. In other words, they are highly skewed. Let's continue the exploration.

### Fishing ground biocapacity per capita

Let's explore our first variable, namely the fishing ground biocapacity per capita.

- Plot a histogram of the column 'ef_fg_bc'

In [None]:
# Your code here


The histogram confirms that our variable is highly skewed.

- Print the skewness of the variable 'ef_fg_bc'

In [None]:
# Your code here...


The skewness is very high. As a rule of thumb, when the skewness is larger than 3, it is a good idea to do a log transformation. However, we have another issue here: it seems that many observations may be equal to zero. To better understand what is happening, let's check for which countries we have values equal to 0.

- Print the countries that have at least one observation equal to zero for the column 'ef_fg_bc'
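
One way to approach this is to filter the rows with a condition and then list the distinct country names. Here is a sketch on a made-up toy dataframe (the country labels "A", "B", "C" and all values are invented):

```python
import pandas as pd

# Toy data: countries A and C each have one observation equal to zero
df_toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "ef_fg_bc": [0.0, 0.2, 0.1, 0.3, 0.0],
})

# Select the 'country' column for rows where the value is zero,
# then keep each country only once
zero_countries = df_toy.loc[df_toy["ef_fg_bc"] == 0, "country"].unique()

print(zero_countries)
```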

In [None]:
# Your code here...


We could have assumed that the fishing ground biocapacity was zero for countries without access to the sea. However, this is not the case (probably because our variable also includes freshwater fishing grounds). Hence, the observations equal to zero may simply be missing values. We will remove them, which will simplify our analysis.

- For our "selected sample" dataframe, remove the observations for which 'ef_fg_bc' is equal to zero.
- Display the new descriptive statistics

In [None]:
# Your code here...


Alright, now we can transform our data.

- Log transform the column 'ef_fg_bc'. You can for instance use the `log` function of the `numpy` library ([Documentation](https://numpy.org/doc/stable/reference/generated/numpy.log.html)), and create a new column 'ef_fg_bc_log' in our dataframe.
- Plot a histogram of your log transformed data
- Check the skewness of your log transformed data
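
The sketch below illustrates the effect of a log transformation on randomly generated, right-skewed toy data (the column names `x` and `x_log` are invented for illustration). Note that `np.log` is not defined at zero (it returns `-inf` with a warning), which is another reason why we removed the zeros first.

```python
import numpy as np
import pandas as pd

# Randomly generated, strictly positive and right-skewed toy data
rng = np.random.default_rng(42)
df_toy = pd.DataFrame({"x": rng.lognormal(mean=0, sigma=1.5, size=1000)})

# Natural log transformation stored in a new column
df_toy["x_log"] = np.log(df_toy["x"])

# .skew() returns the sample skewness of a column
print("Skewness before:", df_toy["x"].skew())
print("Skewness after: ", df_toy["x_log"].skew())
```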

In [None]:
# Your code here...


Alright, much better! Let's finalize our exploration of the fishing ground biocapacity per capita by checking the geographical patterns and the evolution of the variable through time.

- Plot a map of the average 'ef_fg_bc_log' for each country (average of all years). *Note: you can group rows by iso 3 codes, and then use the `.mean()` method*.
- Do you observe any geographical patterns?

In [None]:
# Your code here...


- Plot the evolution of the average 'ef_fg_bc_log' (average of all countries). Do not forget labels and title! What do you observe?
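
As a hint, averaging over countries for each year is again a `groupby` operation. Here is a sketch on a made-up toy dataframe (`df_toy`, `yearly_mean`, and the values are invented for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: two observations per year (values are made up)
df_toy = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2015, 2015],
    "value": [1.0, 3.0, 2.0, 2.0, 1.0, 1.0],
})

# Average over all rows of each year; the result is indexed by year
yearly_mean = df_toy.groupby("year")["value"].mean()

# Line plot with labels and title
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(yearly_mean.index, yearly_mean.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Average value")
ax.set_title("Evolution of the average value (toy data)")
```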

In [None]:
# Your code here...


### Sea surface temperature (SST) anomalies

Let's now explore our second variable, 'ohi_csst'.

- Plot a histogram of 'ohi_csst' and check the skewness. Do we need to transform our variable? If yes, transform accordingly, else, continue.
- Plot a map of the SST anomalies in 2015. What do you think? 

*Note: to avoid repeating the same lines of code, it is always a good idea to define functions! We could have done so already in the previous section, but feel free to practice defining functions here, for instance a function to create histograms and another one to create a map.*
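
For instance, a histogram helper could look like the sketch below. The function name `plot_hist`, its parameters, and the toy data are all hypothetical; feel free to design your own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_hist(df, column, bins=30):
    """Plot a histogram of df[column] and return the Axes and the skewness."""
    skew = df[column].skew()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.hist(df[column].dropna(), bins=bins)  # drop missing values before plotting
    ax.set_xlabel(column)
    ax.set_ylabel("Count")
    ax.set_title(f"{column} (skewness: {skew:.2f})")
    return ax, skew


# Usage on randomly generated toy data
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({"x": rng.normal(size=500)})
ax, skew = plot_hist(df_toy, "x")
```

Once the function is defined, each new histogram takes a single line, and any styling change only has to be made in one place.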


In [None]:
# Your code here...


### Fish footprint of consumption

Let's now dive into the fish footprint of consumption.

- Plot a histogram of 'ef_fg' and check the skewness. Do we need to transform our variable? If yes, transform accordingly, else, continue.
- Plot a map of the average fish footprint of consumption per country. What do you observe? 
- Plot the evolution of the average fish footprint of consumption in the world. What do you observe?

In [None]:
# Your code here...


### Number of climate change laws and policies

You may have noticed that we have not discussed the number of climate change regulations much until now. Indeed, we should save it in the fridge to keep it fresh and use it at the right moment! Joking aside, it is true that we will manipulate and transform this variable later - data science is not a linear process after all. For now, it is important to note that this variable is discrete, not continuous like the other variables.

- Plot a histogram of 'ccl_nlp'.
- Plot a map of the number of climate change regulations in 2015

In [None]:
# Your code here...


## Cooking the ingredients together (bivariate analysis) <a name="cooking-ingredients"></a>

<img src='https://i.imgflip.com/1e0h17.jpg' width="400">

Let's now put our ingredients together! Our goal is to detect potential associations between our variables, which will later guide our statistical analysis.

First, remember that for SST anomalies, we only had values for 2015. It is not ideal, and if we were to actually perform a statistical analysis, it would be wise to look for more data (maybe from another database?) or to use another variable (maybe temperature anomalies instead of SST anomalies?). But for now, let's proceed with the data at hand.

- Extract observations for which the variables 'ohi_csst' and 'ccl_nlp' are not missing
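
One option here is `.dropna()` with the `subset` argument, which only looks at the columns you list. Here is a sketch on a made-up toy dataframe (columns `a` and `b` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 each have one missing value
df_toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# Keep only the rows where both 'a' and 'b' are present
df_complete = df_toy.dropna(subset=["a", "b"])

print(df_complete)
```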

In [None]:
# Your code here...


We'll first check the correlations between our variables. Even though correlation does not imply causation, it's never a bad idea to check how our variables correlate.

- Display the correlation between the variables 'ef_fg_bc_log' (log transformed), 'ohi_csst', and 'ef_fg_log' (log transformed). *Note: you can use the `pandas` method `.corr()` ([Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html))*
- For a more visual representation, plot a heat map of your correlation matrix. *Note: you can for instance use the `seaborn` function `heatmap()` ([Documentation](https://seaborn.pydata.org/generated/seaborn.heatmap.html))*
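
Here is a sketch of both steps on randomly generated toy data (the columns `x`, `y`, `z` and the "coolwarm" color map are choices made for illustration, not part of the exercise):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data: y depends strongly on x, z is independent noise
rng = np.random.default_rng(1)
x = rng.normal(size=100)
df_toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),
    "z": rng.normal(size=100),
})

# Pairwise Pearson correlations (the default method of .corr())
corr = df_toy.corr()
print(corr)

# Heat map of the correlation matrix, with the values written in each cell
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix (toy data)")
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale comparable across different correlation matrices.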

In [None]:
# Your code here...


Note that the absence of (Pearson) correlation does not indicate an absence of causation. Indeed, there could be a non-linear association between our variables! A scatter plot provides a visual tool to check the relation between our variables.
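
To see why, consider the small made-up example below: `y` is entirely determined by `x`, yet the Pearson correlation is essentially zero because the relation is symmetric rather than linear.

```python
import numpy as np

# x is symmetric around zero, and y is a deterministic function of x
x = np.linspace(-1, 1, 201)
y = x ** 2

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```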

- Plot two scatter plots:
    - 'ef_fg_bc' (log transformed) vs 'ohi_csst'
    - 'ef_fg_bc' vs 'ef_fg' (both log transformed)
- What do you observe?    

In [None]:
# Your code here...



The relation between the fishing ground biocapacity and SST anomalies is not obvious... As for fishing ground biocapacity and fish footprint of consumption, we do have a trend! But wait, it seems that the biocapacity is increasing with the consumption?! When we checked the evolution through time of our variables, they were evolving in the opposite direction, which seemed more in line with our intuition, but here we observe a positive correlation. How is that possible? 

Well it may be an instance of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox). This paradox describes phenomena in which a trend appears in several groups of data but disappears or reverses when the groups are combined. Here, it may be a good idea to group our countries by their GDP level and see if there is a relation between our variables within groups of countries. Indeed, richer countries may have both a greater fish consumption and stricter environmental norms protecting fishing biocapacity (what we observe in our scatter plot). However, for the same level of development, a higher fish footprint of consumption could lead to a lower fishing ground biocapacity (what we observe in our previous line plot). 
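
The made-up numbers below illustrate the paradox: within each of two hypothetical groups of countries, `y` decreases with `x`, but once the groups are pooled the correlation becomes positive. The group names and values are invented purely for illustration.

```python
import numpy as np

# Two hypothetical groups: within each one, y decreases as x increases
x_low, y_low = np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])
x_high, y_high = np.array([6.0, 7.0, 8.0]), np.array([8.0, 7.0, 6.0])

# Correlation within each group
r_low = np.corrcoef(x_low, y_low)[0, 1]
r_high = np.corrcoef(x_high, y_high)[0, 1]

# Correlation after pooling both groups together
x_all = np.concatenate([x_low, x_high])
y_all = np.concatenate([y_low, y_high])
r_pooled = np.corrcoef(x_all, y_all)[0, 1]

print(f"Within groups: {r_low:.2f} and {r_high:.2f}")
print(f"Pooled:        {r_pooled:.2f}")
```

The pooled correlation is driven by the difference between the groups, not by what happens inside each group.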

We did not select GDP as a variable, what a shame! However, we could still do a heterogeneity analysis using... the number of climate change regulations! Yes, it is time to take it out of the fridge! Remember that the number of climate change regulations is a discrete variable. In such a case, instead of directly using the variable, it may be interesting to create clusters, e.g., countries with a value above the median vs countries with a value below the median.

- Create a new column 'cluster' with values 0 for countries with 'ccl_nlp' below the median, and 1 for countries with 'ccl_nlp' above the median
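
A comparison with the median returns booleans, which can be converted to 0/1. The sketch below uses a made-up toy dataframe and assumes that values exactly equal to the median go to cluster 0, a choice the instruction above leaves open.

```python
import pandas as pd

# Toy data (values are made up for illustration)
df_toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "ccl_nlp": [0, 2, 5, 9, 20],
})

median = df_toy["ccl_nlp"].median()

# True/False becomes 1/0; values equal to the median fall in cluster 0 (assumption)
df_toy["cluster"] = (df_toy["ccl_nlp"] > median).astype(int)

print(df_toy)
```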

In [None]:
# Your code here...


Ok, let's see how grouping our countries into clusters depending on their number of policies affects the relationship between our variables:

- Create a scatter plot of 'ef_fg_bc_log' vs 'ef_fg_log', colored by the country 'cluster'. What do you observe?

In [None]:
# Your code here...


Well... It does not seem like we uncovered some secret relationships... That's a pity, but it does not mean we should give up now! As previously discussed, we may need more data, e.g., GDP, temperature anomalies instead of SST anomalies, to take into consideration the time dimension... Hence, now would be a good time to reflect on what we accomplished so far, check again the documentation to make sure we properly understood our variables, and question if our choices were appropriate.

## Tasting the new recipe <a name="tasting-ingredients"></a>

<img src='https://i.postimg.cc/13xSdFjc/naruto-ramen.jpg' width="250">

It is time to reflect on our cooking experience and evaluate if our research question is worth further investigation:
- What did we learn?
- Did we identify some interesting patterns in our data (geographic clusters, sudden change in values)?
- Did we identify some interesting relationships between our variables?
- Do we have all the data we need to pursue it?
- What would be problematic for a proper causal analysis? We don't need a fancy statistical model yet! Remember Quentin's *First Aid Statistics* to question the causal link between variables: 
    - Is there something else? 
    - Is it the reverse? 
    - Can we extrapolate?

If you completed this notebook, you deserve to enjoy a bowl of ramen while watching Quentin Gallea's TED Talk on [How to question numbers and prevent manipulation](https://www.youtube.com/watch?v=bD1Jq6YwPk8)...