# Basic Operations and Plotting

This tutorial shows basic examples of how to load, handle, and plot EM-DAT data using the [`pandas`](https://pandas.pydata.org/) Python data analysis package and the [`matplotlib`](https://matplotlib.org/) charting library.

**Note**: This tutorial is also available on the [EM-DAT Documentation Website](https://doc.emdat.be/docs/additional-resources-and-tutorials/tutorials/python_tutorial_1/).

## Import Modules

Let us import the necessary modules and print their versions. This tutorial was written with `pandas` v2.1.1 and `matplotlib` v3.8.3. If your package versions differ, you may need to adapt the code by checking the corresponding package documentation.

In [None]:
import pandas as pd  # data analysis package
import matplotlib as mpl
import matplotlib.pyplot as plt  # plotting library

# Print the name and version of each package
for module in [pd, mpl]:
    print(module.__name__, module.__version__)

## Load EM-DAT

To load EM-DAT:
* Download the EM-DAT data at https://public.emdat.be/;
* Use the [`pd.read_excel`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function to load and parse the data into a `pd.DataFrame` object;
* Check that the data has been successfully parsed with the [`pd.DataFrame.info`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method.

**Notes**:
1. You may need to install the `openpyxl` package, or another engine supported by `pd.read_excel`, to read `.xlsx` files.
2. Alternatively, export the `.xlsx` file to `.csv` and use the [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.
3. If the file is not in the same folder as the Python code, replace the filename with its relative or full path, e.g., `E:/MyDATa/public_emdat_2024-01-08.xlsx`.

In [None]:
#!pip install openpyxl
df = pd.read_excel('public_emdat_2024-01-08.xlsx') # <-- modify file name or path
df.info()
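
The CSV route mentioned in the notes can be sketched as follows. This is a minimal illustration, not part of the original workflow: the toy table, its values, and the file name `emdat_toy.csv` are hypothetical stand-ins for a real EM-DAT export.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for an EM-DAT export (hypothetical values, not real data)
toy = pd.DataFrame({
    "Disaster Type": ["Earthquake", "Flood"],
    "ISO": ["JPN", "BEL"],
    "Start Year": [2011, 2021],
    "Total Deaths": [100, None],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "emdat_toy.csv"  # hypothetical file name
    toy.to_csv(csv_path, index=False)       # stands in for the manual export step
    df_csv = pd.read_csv(csv_path)          # load the CSV into a DataFrame

# Same check as for the Excel file: column names, dtypes, non-null counts
df_csv.info()
```

With a real export, only the `pd.read_csv` call and the `info` check would be needed.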

## Example 1: Japan Earthquake Data

### Filtering 

Let us focus on the EM-DAT earthquakes in Japan for the 2000-2023 period and build an appropriate filter using the EM-DAT columns `Disaster Type`, `ISO` and `Start Year`. 

For simplicity, let us keep only the columns `Start Year`, `Magnitude`, `Total Deaths`, and `Total Affected`, and show the first five rows with the [`pd.DataFrame.head`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method.

**Note**:
For more information about the columns, see the EM-DAT Documentation page [EM-DAT Public Table](https://doc.emdat.be/docs/data-structure-and-content/emdat-public-table/).

In [None]:
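# Keep Japanese earthquakes that started between 2000 and 2023
eq_jpn = df[
    (df['Disaster Type'] == 'Earthquake') &
    (df['ISO'] == 'JPN') &
    (df['Start Year'] >= 2000) &  # lower bound, in case the file holds older records
    (df['Start Year'] < 2024)
][['Start Year', 'Magnitude', 'Total Deaths', 'Total Affected']]
eq_jpn.head(5)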
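
The same kind of filter can be tried on a small table before applying it to the full dataset. The sketch below uses a hypothetical toy DataFrame (the values are made up) and, as one possible variation, expresses the year range with `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical toy records, not real EM-DAT data
toy = pd.DataFrame({
    "Disaster Type": ["Earthquake", "Earthquake", "Flood", "Earthquake"],
    "ISO": ["JPN", "JPN", "JPN", "CHL"],
    "Start Year": [1995, 2011, 2018, 2010],
    "Total Deaths": [10, 20, 30, 40],
})

# Combine the three conditions with `&`; each one is a boolean Series
mask = (
    (toy["Disaster Type"] == "Earthquake")
    & (toy["ISO"] == "JPN")
    & toy["Start Year"].between(2000, 2023)  # inclusive on both ends
)

# Select matching rows and the columns of interest in one step
toy_eq_jpn = toy.loc[mask, ["Start Year", "Total Deaths"]]
print(toy_eq_jpn)
```

Only the 2011 record passes all three conditions: the 1995 event is outside the period, the 2018 event is a flood, and the 2010 event is not in Japan.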

### Grouping

Let us group the data to calculate the number of earthquake events per year and plot the results.
* Use the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group based on one or more columns in a DataFrame, e.g., `Start Year`;
* Use the [`size`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html) method as the aggregation method (or [`count`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html));
* Plot the results easily using the [`pd.DataFrame.plot`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method.

**Note**: The `count` method provides the total number of non-missing values, while `size` gives the total number of elements (including missing values). Since the field `Start Year` is always defined, both methods should return the same results.


In [None]:
eq_jpn.groupby(['Start Year']).size().plot(kind='bar')
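
The difference between `size` and `count` only shows up when a column contains missing values. A minimal sketch on a hypothetical toy table (made-up values), where one `Magnitude` entry is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records: the second event has no reported magnitude
toy = pd.DataFrame({
    "Start Year": [2000, 2000, 2001],
    "Magnitude": [6.1, np.nan, 7.0],
})

grouped = toy.groupby("Start Year")
n_rows = grouped.size()                       # rows per year, missing values included
n_magnitudes = grouped["Magnitude"].count()   # non-missing magnitudes per year

print(n_rows)
print(n_magnitudes)
```

Here `size` reports two events for 2000, while `count` on `Magnitude` reports one, because the missing magnitude is skipped.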

### Customize Chart

The `pandas` library relies on the `matplotlib` package to draw charts. To have more flexibility on the rendered chart, let us create the figure using the imported `plt` submodule.

In [None]:
# Group earthquake data by 'Start Year' and count occurrences
eq_cnt = eq_jpn.groupby(['Start Year']).size()

# Initialize plot with specified figure size
fig, ax = plt.subplots(figsize=(7, 2))

# Plot number of earthquakes per year
ax.bar(eq_cnt.index, eq_cnt)

# Set axis labels and title
ax.set_xlabel('Year')
ax.set_ylabel('Number of Earthquakes')
ax.set_yticks([0, 1, 2, 3])  # Define y-axis tick marks
ax.set_title('Number of EM-DAT Earthquakes in Japan (2000-2023)')
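
Hard-coding the y-axis ticks works when the maximum count is known in advance. As a possible alternative, not used in the original tutorial, `matplotlib.ticker.MaxNLocator` with `integer=True` asks matplotlib to choose integer tick positions itself. The counts below are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MaxNLocator

# Hypothetical yearly counts, standing in for a grouped EM-DAT result
toy_cnt = pd.Series([1, 3, 2], index=[2004, 2011, 2016])

fig, ax = plt.subplots(figsize=(7, 2))
ax.bar(toy_cnt.index, toy_cnt)

# Let matplotlib pick the ticks, restricted to integer values
ax.yaxis.set_major_locator(MaxNLocator(integer=True))

ax.set_xlabel("Year")
ax.set_ylabel("Number of earthquakes")
```

This keeps the axis readable if the data are updated and the maximum count changes.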

## Example 2: Comparing Regions 

Let us compare earthquake death tolls by continent. As before, we filter the original dataframe `df` according to our needs, this time also keeping the `Region` column.

In [None]:
eq_all = df[
    (df['Disaster Type'] == 'Earthquake') &
    (df['Start Year'] < 2024)
][['Start Year', 'Magnitude', 'Region', 'Total Deaths', 'Total Affected']]
eq_all.head(5)

In this case,
* Use the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group based on the `Region` column;
* Use the [`sum`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html) method on the `Total Deaths` field as the aggregation method;
* Plot the results easily using the [`pd.DataFrame.plot`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method.

In [None]:
eq_sum = eq_all.groupby(['Region'])['Total Deaths'].sum()
eq_sum
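
When summing a field that may contain missing values, it helps to know how `sum` treats them. By default, grouped sums skip missing values, so a group with no reported value yields `0`. The sketch below, on hypothetical toy data, contrasts this with `min_count=1`, which returns a missing value for such groups instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records: Europe has no reported death toll at all
toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe"],
    "Total Deaths": [10.0, np.nan, np.nan],
})

by_region = toy.groupby("Region")["Total Deaths"]
default_sum = by_region.sum()             # missing values skipped; empty groups give 0
strict_sum = by_region.sum(min_count=1)   # require at least one valid value per group

print(default_sum)
print(strict_sum)
```

Which behavior is appropriate depends on whether a missing value should be read as "zero deaths" or "not reported"; that choice is left to the analyst.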

Finally, let us make a horizontal bar chart of the result using `matplotlib`. In particular,

* use the [`ax.ticklabel_format`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.ticklabel_format.html) method to display the x-axis tick labels in scientific notation (in thousands of deaths);
* use the [`ax.invert_yaxis`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.invert_yaxis.html) method to display the regions in alphabetical order from top to bottom.

In [None]:
fig, ax = plt.subplots(figsize=(4, 3))
ax.barh(eq_sum.index, eq_sum)
ax.set_xlabel('Total Earthquake Deaths')
ax.ticklabel_format(style='sci', scilimits=(3, 3), axis='x')
ax.invert_yaxis()
ax.set_title('EM-DAT Earthquake Deaths by Regions')
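
Alphabetical order is one option; ranking the regions by value is another. The sketch below is a variation on the chart above, assuming hypothetical totals: it sorts the series in descending order first, so that after `invert_yaxis` the largest bar sits at the top.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals, not real EM-DAT figures
toy_sum = pd.Series(
    {"Africa": 2_000, "Americas": 45_000, "Asia": 120_000},
    name="Total Deaths",
)

# Sort from largest to smallest before plotting
ranked = toy_sum.sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(4, 3))
ax.barh(ranked.index, ranked)
ax.ticklabel_format(style="sci", scilimits=(3, 3), axis="x")  # x ticks in thousands
ax.invert_yaxis()  # first entry (largest value) at the top
ax.set_xlabel("Total Earthquake Deaths")
```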

## Example 3: Multiple Grouping

Finally, let us report the earthquake time series by continent. To avoid creating a `['Region', 'Start Year']` MultiIndex, which would complicate later processing, we set the `as_index` argument to `False`. As a result, `Region` and `Start Year` remain columns.

In [None]:
eq_reg_ts = eq_all.groupby(
    ['Region', 'Start Year'], as_index=False
)['Total Deaths'].sum()
eq_reg_ts
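
To make the effect of `as_index` concrete, the sketch below runs the same aggregation both ways on a hypothetical toy table (made-up values).

```python
import pandas as pd

# Hypothetical toy records, not real EM-DAT data
toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe"],
    "Start Year": [2000, 2000, 2001],
    "Total Deaths": [5, 7, 1],
})
keys = ["Region", "Start Year"]

# Default: the grouping keys become a two-level MultiIndex on a Series
with_index = toy.groupby(keys)["Total Deaths"].sum()

# as_index=False: the grouping keys stay as regular columns of a DataFrame
flat = toy.groupby(keys, as_index=False)["Total Deaths"].sum()

print(with_index)
print(flat)
```

The flat form can be passed directly to methods that expect column names, such as `pivot`.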

Next, we apply the [`pivot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html) method to restructure the table so that it can be plotted easily.

In [None]:
eq_pivot_ts = eq_reg_ts.pivot(
    index='Start Year', columns='Region', values='Total Deaths'
)
eq_pivot_ts.head()
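
A small example helps show what `pivot` does to the table shape. In the sketch below (hypothetical toy values), each region becomes a column and each year a row; combinations with no record appear as missing values. Filling them with `0` is shown as an optional step and is an assumption about how gaps should be interpreted, not something the original tutorial does.

```python
import pandas as pd

# Hypothetical long-format totals: no Europe entry for 2000
toy_ts = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe"],
    "Start Year": [2000, 2001, 2001],
    "Total Deaths": [12, 3, 1],
})

# Long to wide: one row per year, one column per region
toy_pivot = toy_ts.pivot(
    index="Start Year", columns="Region", values="Total Deaths"
)

# Optional: treat absent year/region pairs as zero (an interpretation choice)
toy_filled = toy_pivot.fillna(0)

print(toy_pivot)
print(toy_filled)
```

Note that `pivot` requires each index/column pair to be unique, which the preceding `groupby` step guarantees.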

In [None]:
ax = eq_pivot_ts.plot(kind='bar', width=1, figsize=(6,3))
ax.set_ylabel('Total Deaths')
ax.set_title('EM-DAT Earthquake Deaths by Regions')

To visualize the data in more detail, let us draw one subplot per region instead by setting the `subplots` argument to `True` within the `plot` method.

In [None]:
ax = eq_pivot_ts.plot(kind='bar', subplots=True, legend=False, figsize=(6,6))
plt.tight_layout() # <-- adjust plot layout
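
With `subplots=True`, the `plot` method returns an array with one `Axes` per column, which can then be customized individually. The sketch below uses hypothetical values and, as an optional variation, adds `sharey=True` so that all panels use the same y-axis scale and regions can be compared directly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide-format table: one column per region
toy_pivot = pd.DataFrame(
    {"Asia": [12, 3, 40], "Europe": [0, 1, 2]},
    index=pd.Index([2000, 2001, 2002], name="Start Year"),
)

# One panel per column; sharey=True forces a common y-axis scale
axes = toy_pivot.plot(
    kind="bar", subplots=True, sharey=True, legend=False, figsize=(6, 4)
)

# Label each panel individually
for ax in axes:
    ax.set_ylabel("Total Deaths")

plt.tight_layout()
```

A shared scale makes magnitudes comparable across regions, while independent scales (the default) make small values easier to see; which is preferable depends on the question being asked.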

We have just covered the most common manipulations applied to a [`pandas`](https://pandas.pydata.org/pandas-docs/stable/) `DataFrame` containing the EM-DAT data. To take your analyses further, we encourage you to keep learning [pandas](https://pandas.pydata.org/pandas-docs/stable/) and [matplotlib](https://matplotlib.org/stable/contents.html) with the many resources available online, starting with the official documentation.

If you are interested in learning the basics of making maps based on EM-DAT data, you can also follow the [EM-DAT Python Tutorial 2: Making Maps](./python_tutorial_2_making_maps.ipynb).