<a href="https://colab.research.google.com/github/edoardochiarotti/class_datascience/blob/main/2024/02_Data-Cleaning/02_Data-Cleaning-Panda-Introduction.ipynb" target="_blank" rel="noopener"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning: Introduction to Pandas

<img src='https://miro.medium.com/v2/resize:fit:720/format:webp/0*hHVINI5TGJB6jPKN.jpg' width="600">

Source: Roman Orac [Pandas Web API - Towards Data Science](https://towardsdatascience.com/pandas-analytics-server-d64d20ef01be)

## Content

Throughout your career, you will undoubtedly need to handle data, possibly lots of data. Data comes in many formats, and you will spend much of your time manipulating and cleaning it to get it into a usable form for analysis. In this notebook, we will discover one of the most common packages for manipulating data, namely Pandas.

- [Pandas](#Pandas)
   - [Importing data](#Importing-data)
   - [Discovering your data frame](#Discovering-your-data-frame)
      - [Dimensions of data frame](#Dimensions-of-data-frame)
      - [Data frame indexing](#Data-frame-indexing)
      - [Scope: data types](#Scope:-data-types)
      - [Scope: Extract unique values in a column](#Scope:-Extract-unique-values-in-a-column)
   - [Cleaning your data frame](#Cleaning-your-data-frame)
      - [Identifying NaN](#Identifying-NaN)
      - [Dropping NaN](#Droping-NaN)
      - [Dealing with errors and other missing values](#Dealing-with-errors-and-other-missing-values)
   - [Merging data frames](#Merging-data-frames)
   - [Manipulating your data](#Manipulating-your-data)
      - [Operating on columns](#Operating-on-columns)
      - [Functions and data frame](#Functions-and-data-frame)
      - [Calculating GrDP](#Calculating-GrDP)
   - [Exporting data frame](#Exporting-data-frame)

## Pandas <a name="Pandas"></a>

Pandas (derived from "panel data") is the go-to package for data analysis and manipulation. Its primary object, the `DataFrame`, is extremely useful for wrangling data. We will explore some of that functionality here, and will put it to use throughout this course.

You can read more about Pandas in the [documentation](https://pandas.pydata.org/docs/index.html), and you can refine your knowledge with this [online course and tutorial](https://realpython.com/learning-paths/pandas-data-science/). Since Pandas is one of the most used packages, you can also find a ton of material online, answering any question you might have. As always, there is no need to reinvent the wheel, and you should rely on the years of experience and knowledge of programmers who have already faced issues similar to the ones you might encounter.

Without further ado, let's import Pandas. We generally `import pandas as pd`:

In [None]:
import pandas as pd

### Importing data <a name="Importing-data"></a>

Now the fun begins! We will discover the functionalities offered by Pandas using "real" data. More precisely, we will work through an application to the Green Domestic Product (GrDP).

GrDP is a novel indicator developed by E4S to remedy some of the shortcomings of GDP. GrDP extends the scope of GDP to integrate the depletion of natural, social, and human capital. In its current version, GrDP considers the impacts of the emissions of three groups of pollutants: greenhouse gases, air pollutants, and heavy metals. These impacts include climate change, health issues, decreases in crop yields and biomass production, building degradation, and damage to ecosystems due to eutrophication.

To learn more, you can read the recent E4S white paper applying GrDP to Switzerland and the methodological report [here](https://e4s.center/en/resources/reports/green-domestic-product/). The IbyIMD magazine also published an [article](https://iby.imd.org/sustainability/lets-replace-gdp-introducing-the-green-domestic-product/) to introduce GrDP and its application for policymakers and businesses. Finally, you can explore the results of the project in our [online webapp](https://green-dp.streamlit.app/), designed using [streamlit](https://streamlit.io/cloud).

First, we need to import our data. The datasets were uploaded to our GitHub repository as two CSV files: "GrDP_Panel-Data.csv" and "GrDP_Cost.csv". We can import a CSV file directly from its URL using the `pd.read_csv()` function:

In [None]:
url = "https://raw.githubusercontent.com/thurmboris/Data-Science_4_Sustainability/main/data/GrDP_Panel-Data.csv"
url_cost = "https://raw.githubusercontent.com/thurmboris/Data-Science_4_Sustainability/main/data/GrDP_Cost.csv"

df = pd.read_csv(url)
df_cost = pd.read_csv(url_cost)

type(df)

Note that to access the data url in GitHub, you need to click on your data file, and then click on "Raw".

You can import data from a variety of sources (e.g., Excel files, TSV, HTML, JSON), select specific columns to import, and even rename them! Learn more about [how to import data into Pandas dataframes](https://practicaldatascience.co.uk/data-science/how-to-import-data-into-pandas-dataframes).
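As a minimal sketch of selecting and renaming columns at import time, the example below reads a tiny in-memory CSV instead of a real file. The CSV content, the numbers in it, and the new column name `Pop` are made up for illustration; `usecols` and `.rename()` are standard pandas features.

```python
import io
import pandas as pd

# Hypothetical CSV content with made-up numbers, standing in for a file or URL
csv_text = """Country,Year,GDP,Population
Belgium,2019,100.0,1000
Switzerland,2019,200.0,2000
"""

# usecols keeps only the listed columns; rename() relabels them afterwards
df_small = (
    pd.read_csv(io.StringIO(csv_text), usecols=["Country", "Year", "Population"])
    .rename(columns={"Population": "Pop"})
)

print(df_small)
```

The same arguments work when the first argument of `pd.read_csv()` is a URL or a local path.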

### Discovering your data frame <a name="Discovering-your-data-frame"></a>

The first thing you want to do is explore your data frame to understand its structure and what it contains. In a notebook, we can directly look at it in a nice, visual way:

In [None]:
df

This is a nice representation of the data, but we really do not need to display that many rows of the data frame in order to understand its structure. Instead, we can use the `.head()` method of data frames to look at the first few rows:

In [None]:
df.head()

Nice! We can see that the data includes the GDP, population, and emissions of various pollutants for several countries and years. Let's continue our exploration!

#### Dimensions of data frame <a name="Dimensions-of-data-frame"></a>

When we displayed the whole data frame, its dimensions were printed below the table (900 rows x 17 columns). We can also directly extract the number of observations (rows) using the `len()` function:

In [None]:
len(df)

Similarly, combining the `len()` function with the `.columns` attribute gives us the number of variables (columns):

In [None]:
len(df.columns)

We can also directly return the number of observations and variables as a tuple using the `.shape` attribute:

In [None]:
df.shape

Going on with our exploration, we can print the names of the columns. As is often the case, there are several ways depending on what we want to achieve. Here are some options:
- `list(df.columns)` returns a list of our columns,
- `sorted(df)` returns a sorted list of our columns,
- `df.keys()` returns a pandas Index object

In [None]:
df.keys()

#### Data frame indexing <a name="Data-frame-indexing"></a>

Once we have understood the structure of our data frame, it is a good idea to look at some observations. 

We **index data frames by columns**. For instance, let's have a look at countries' population:  

In [None]:
df['Population']

What if we want to extract a specific value? Well, did you notice the nameless column on the left of our data frame? Nameless but not useless: this column holds the row labels (the index).

We can use the label to extract a given value. For example, let's say we want to extract the population of Belgium in 1990 (first row). One way is to use the syntax `df['Population'][0]`. However, the preferred way is to use the `.loc` method, short for location:

In [None]:
df.loc[0, 'Population']

What if we want several values, say population and GDP? Easy! We can use lists specifying the row and column labels to extract a subset of our data frame.

In [None]:
df.loc[[3,87], ['Country', 'Year','GDP [million Euro]','Population']]

We can even extract an entire row with `.loc`:

In [None]:
df.loc[0]

Remember when we did list slicing? We can do similar operations here:

In [None]:
df.loc[1:10:2, 'Country':'Population']

Now imagine that we want to use numerical indices for columns, instead of labels. Well, there is a way! We can use the `.iloc` method to index data frames based on the integer location of rows and columns. As always in Python, indexing starts at 0. Let's extract the same data subset as before, this time using `.iloc`:

In [None]:
df.iloc[1:10:2, 0:4]

As an aside, note that in our case the row labels are integers - this is the default when importing data. Hence, extracting rows with `.loc` and `.iloc` uses the same syntax... Almost:
- `.loc` gets rows and columns with particular **labels** (in our case, integers)
- `.iloc` gets rows and columns at integer **locations**

How does it affect us? Well, one difference is slicing. While `.iloc` works the same as slicing lists, `.loc` is a little bit different. For example, if we pass as argument `[0:1]`, `.iloc` would only return the first row (as with lists, slicing excludes the last element), but `.loc` would return the two rows with labels 0 and 1:

In [None]:
df.loc[0:1]

In [None]:
df.iloc[0:1]

Similarly, `df.iloc[-1]` would allow us to access the last line, but `df.loc[-1]` would yield an error since there is no row labeled '-1'. For a more extensive discussion of the differences between `.loc` and `.iloc`, see this [Stack Overflow post](https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different#:~:text=The%20main%20distinction%20between%20the,or%20columns%29%20at%20integer%20locations).
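The difference is easier to see when the row labels do not coincide with the positions. The sketch below uses a toy data frame with made-up values and labels starting at 10; the names `toy`, `by_label`, `by_position`, and `last_row` are ours.

```python
import pandas as pd

# Toy frame with made-up values and non-default row labels
toy = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Population": [10, 20, 30, 40]},
    index=[10, 11, 12, 13],
)

by_label = toy.loc[10:11]     # rows labeled 10 and 11: the end label is included
by_position = toy.iloc[0:1]   # row at position 0 only: the end position is excluded
last_row = toy.iloc[-1]       # last row, selected by position

print(len(by_label), len(by_position), last_row["Country"])
```

Here `toy.loc[-1]` would raise a `KeyError`, since no row carries the label -1.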

Both methods, `.loc` and `.iloc`, are very helpful, but you need to know the label or location of your data. Now what if we want to access data based on a given column value? For instance, what if we want to extract data only for Switzerland? We know neither the labels nor the locations of the rows corresponding to Switzerland... Such an operation seems super common - for example you could imagine having company/customer data with some ID numbers or names - so there must be a way. Indeed, there is! We can use a condition:

In [None]:
df[df['Country'] == 'Switzerland']

Note the syntax: 
- `df['Country']` extracts our column labeled 'Country' ,
- `df['Country'] == 'Switzerland'` returns a column of `True` if the country is Switzerland, and `False` otherwise,
- When feeding this condition to our data frame `df`, it will only keep the rows for which the condition is `True`

Let's try another one, this time extracting data for Switzerland in 2019. We can use `&` to include a second condition. We did not cover this **bitwise operator** before, but the syntax is self-explanatory in the example below. Note that each condition must be wrapped in parentheses, because `&` has higher precedence than comparison operators such as `==`.

In pandas, the bitwise operators combine Boolean columns element-wise. That's what we want to do here: find **for each row** if the country is Switzerland and if the year is 2019.

In [None]:
df[(df['Country'] == 'Switzerland') & (df['Year'] == 2019)]

We can also select some columns, but in this case, we also need to use the `.loc` method:

In [None]:
df.loc[(df['Country'] == 'Switzerland') & (df['Year'] == 2019), ['GDP [million Euro]','Emissions_GHG [thousand tonnes CO2eq]']]

Or a slice of columns, for instance all the emissions of air pollutants:

In [None]:
df.loc[(df['Country'] == 'Switzerland') & (df['Year'] == 2019), 'Emissions_NOx [tonne]':'Emissions_NH3 [tonne]']
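To make the role of the condition more concrete, the sketch below builds the Boolean columns step by step on a toy data frame. The values and the names `toy`, `is_ch`, `is_2019`, and `mask` are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Country": ["Switzerland", "Switzerland", "Belgium"],
    "Year": [2018, 2019, 2019],
    "Value": [1.0, 2.0, 3.0],
})

is_ch = toy["Country"] == "Switzerland"   # Boolean Series, one entry per row
is_2019 = toy["Year"] == 2019             # another Boolean Series
mask = is_ch & is_2019                    # element-wise AND of the two

print(mask.tolist())
print(toy.loc[mask, ["Year", "Value"]])   # rows where mask is True, two columns
```

Storing each condition in its own variable also sidesteps the precedence issue: without parentheses, Python would try to evaluate `'Switzerland' & toy['Year']` first.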

#### Scope: data types <a name="Scope:-data-types"></a>

Let's keep on exploring our data frame. We can identify the types of our variables using the `.dtypes` attribute:

In [None]:
df.dtypes

As you can see, Pandas uses its own names for data types. Here is a description:

|Pandas type|Native Python type|Description|
|:-------|:-------|:----------|
|`object` | `str` | The most general dtype. Will be assigned to your column if it contains strings or mixed types (numbers and strings). |
|`int64` | `int` | Integers. 64 refers to the number of bits used to store each value. |
|`float64` | `float` | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because `NaN` is itself a float.|

Our data frame contains `object` (e.g., strings like countries), `int64` and `float64`. You may wonder about the emissions of arsenic (As), nickel (Ni), and chromium (Cr), which are apparently `object` type. We will discover why later.
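A small experiment can illustrate how these types are inferred at import time. The CSV content below is made up; the column names `ints`, `with_missing`, and `mixed` are ours. In the pandas versions this notebook targets, the mixed column is reported as `object`; other versions may show a dedicated string type instead, so treat the exact label as an assumption.

```python
import io
import pandas as pd

# Made-up CSV: one clean integer column, one with a gap, one with a stray string
csv_text = """ints,with_missing,mixed
1,1,1
2,,Not Available
3,3,3
"""

toy = pd.read_csv(io.StringIO(csv_text))

# The gap forces a float column; the stray string prevents a numeric type
print(toy.dtypes)
```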

#### Scope: Extract unique values in a column <a name="Scope:-Extract-unique-values-in-a-column"></a>

From the column labels of our data frame, we know that we have data for various countries and years. For example, we have seen before that we have data for Switzerland between 1990 and 2019. Which other country-years are included in our data frame? We can use the `.unique()` method to return the unique values of a column:

In [None]:
df['Country'].unique()

Ok! So we have data for European countries. What about years? We can use the same syntax, but let's discover a new trick! Like everything in Python, `df['Country']` is an object, and we apply the `.unique()` method to this object. Instead of `df['Country']`, we can use the equivalent statement `df.Country` (this shortcut only works when the column name is a valid Python identifier, so not for names containing spaces such as `'GDP [million Euro]'`). Let's try with the 'Year' column:

In [None]:
df.Year.unique()

Nice! It seems that we have data from 1990 to 2019. Why "seems"? Well, we may have missing values...

### Cleaning your data frame <a name="Cleaning-your-data-frame"></a>

Ideally, you would have perfect data and directly perform some statistical or machine learning analysis to uncover the mysterious relationships behind your data. Well, unfortunately, we don't live in a perfect world and your data will rarely be clean, unless someone has already performed this tedious task. In such a case, be grateful: data scientists spend most of their time collecting and cleaning data.

We will explore in this section how to clean our data frame. Unfortunately, there is no universal method: each dataset is different, and you will need to design the appropriate techniques depending on your objective and on what your data looks like.

#### Identifying NaN <a name="Identifying-NaN"></a>

The first step is to identify missing values. You have probably noticed something called `NaN` in our data frame. For example, for the GDP of Belgium in 1990:

In [None]:
df.loc[0, 'GDP [million Euro]']

`NaN` stands for **Not a Number**, and it can be interpreted as a value that is undefined or unrepresentable. When you import csv data, the following values will be interpreted as `NaN`: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

We can check whether each value is a `NaN` or not using the `.isna()` method. `.isna()` will return `True` if the values are `NaN`, and `False` otherwise:

In [None]:
df.isna()

Ok, that's somewhat useful, but we are only human: we cannot check our 900x17=15300 values by eye to identify where we have `NaN`. Instead, let's use the power of pandas to pinpoint the location of our `NaN`. We will use the `.sum()` method to sum over each column. Remember, `True` is associated with the value `1` and `False` with the value `0`, so when summing over a column of `True`/`False`, we are actually counting the number of `True` values. In our case, we are counting the `NaN` values:

In [None]:
df.isna().sum() 

Alright, we are lucky, it seems the problem is coming from the GDP data (and one population observation).
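The counting trick can be checked by hand on a toy data frame with made-up values; the names `toy`, `per_column`, and `total` are ours.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: two gaps in GDP, one in Population
toy = pd.DataFrame({
    "GDP": [np.nan, 2.0, np.nan],
    "Population": [1.0, np.nan, 3.0],
    "Year": [1990, 1991, 1992],
})

per_column = toy.isna().sum()   # number of NaN in each column
total = per_column.sum()        # number of NaN in the whole frame

print(per_column)
print(total)
```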

We can extract the observations (rows) for which the GDP or the population is actually `NaN`. Remember what we did when we extracted data for Switzerland? We used a condition as indexing. We do the same here, this time our condition is checking the `NaN` values with `.isna()` but only for the GDP or the Population columns. The bitwise (element-wise) operator corresponding to or is `|`:

In [None]:
df[(df['GDP [million Euro]'].isna()) | (df['Population'].isna())]

Ok we are making progress! We have now identified the rows containing `NaN`. If we explore a bit further, it seems that the missing observations are only for certain years. Let's confirm it by returning the years for which we have `NaN`. We just use the syntax `.Year` to extract the column (equivalent to `["Year"]`), and then apply the `.unique()` method so that the given years only appear once:

In [None]:
df[(df['GDP [million Euro]'].isna()) | (df['Population'].isna())].Year.unique()

Great! Now we know that our missing values are only for the early years of the sample, up to 1994.

We can also find which rows contain `NaN` by applying the `.index` attribute to the filtered data frame above. Alternatively, let's create a list of the labels using a list comprehension. We can use the method `.iterrows()` to iterate over pairs of row label and row values (as a Series). In the code below, we also make use of the `.any()` method, which returns `True` if at least one item is `True`, and `False` otherwise.

In [None]:
rows_with_nan = [index for index, row in df.iterrows() if row.isna().any()]

print(rows_with_nan)
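For comparison, here is a sketch of the `.index` route mentioned above, which avoids the explicit loop. It runs on a toy data frame with made-up values, and the names `has_nan`, `labels_vectorized`, and `labels_loop` are ours.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: rows 0 and 2 contain a NaN
toy = pd.DataFrame({
    "GDP": [np.nan, 2.0, 3.0, 4.0],
    "Population": [1.0, 2.0, np.nan, 4.0],
})

has_nan = toy.isna().any(axis=1)                 # True for rows with at least one NaN
labels_vectorized = toy.index[has_nan].tolist()  # labels of those rows

# Same result with the list comprehension used in this notebook
labels_loop = [index for index, row in toy.iterrows() if row.isna().any()]

print(labels_vectorized, labels_loop)
```

On large data frames the first version is usually much faster, since `.iterrows()` builds a Series for every row.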

#### Dropping NaN <a name="Droping-NaN"></a>

We have previously identified the rows and columns that contain `NaN`. What do we do with these observations? Well, as often, it depends on what you want to achieve. In some contexts, `NaN` might not be an issue. Often, though, you will want to delete the observations containing `NaN`. How do we do that? Easy! We can use the `.dropna()` method. As the name indicates, it will drop the rows containing `NaN`.

Note that until now, when we operated on our data frame, we never modified its values. For instance, when we extracted a subset of our data frame - say the data for Switzerland - it did not modify our original data frame. We just returned a new object, and we did not store that object. It is the same with `.dropna()`. If we do not store the result, it will not affect our original data frame. Since we wish to work on the cleaned version of our data frame when we analyze the data, we should store our result in a new data frame variable.

As a good practice, you should <span style="color:dodgerblue">store the modifications you make into a new data frame variable </span>, instead of just replacing the data frame you imported. Indeed, cleaning is an iterative process, and sometimes you might need to go back to a previous cleaning step, especially when you make a mistake...

In [None]:
df_clean = df.dropna()

df_clean

It worked! Instead of 900 rows, we now have 786 rows and no more nasty `NaN`. We can confirm it by checking if we have `NaN` values with `.isna()` and then summing twice (first over each column, then over the resulting column totals):

In [None]:
df_clean.isna().sum().sum()

Great! Now we are ready for data analysis! Wait a minute, not so fast! We dropped `NaN` values but by doing so created some imbalances in our data frame: some countries have more years than others. We can check by grouping our observations by country - we will perform such grouping operations a bit later.

Although a slight imbalance might not be a big deal in some applications, let's suppose it is for us. What can we do to solve this issue? We have previously seen that we had `NaN` values only for years from 1990 to 1994. Easy then, let's drop all observations for years (strictly) before 1995. We can use the `.drop()` method. `.drop()` removes the rows corresponding to the specified labels. Hence, we need to use the `.index` attribute to get the labels of the rows we want to remove:
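One possible way to run this check is sketched below on a toy data frame with made-up values; the name `obs_per_country` is ours. Grouping by country and counting the rows in each group shows at a glance whether the panel is balanced.

```python
import pandas as pd

# Toy panel with made-up values: country A has three years, country B only two
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [1993, 1994, 1995, 1994, 1995],
})

obs_per_country = toy.groupby("Country").size()   # number of rows per country

print(obs_per_country)
```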

In [None]:
df_clean = df_clean.drop(df_clean[df_clean.Year < 1995].index)

df_clean
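Dropping rows by label is one option; keeping the rows that satisfy the opposite condition gives the same result. The sketch below compares both on a toy data frame with made-up values (the names `dropped` and `filtered` are ours).

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Year": [1993, 1994, 1995, 1996],
    "GDP": [1.0, 2.0, 3.0, 4.0],
})

dropped = toy.drop(toy[toy.Year < 1995].index)   # remove rows by their labels
filtered = toy[toy.Year >= 1995]                 # keep rows meeting the condition

print(dropped.equals(filtered))
```

Which form to prefer is mostly a matter of readability.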

Youhouuu! We now have a beautiful data frame without `NaN`. Now let's perform some data analysis! Wait a minute... We're not done with cleaning...

#### Dealing with errors and other missing values <a name="Dealing-with-errors-and-other-missing-values"></a>

Missing values can take different forms depending on the data you import. You might also have errors or "absurd" values. Dealing with those is another part of cleaning...

Do you remember that when we checked the data types of our columns, we got an `object` type for the emissions of arsenic (As), nickel (Ni), and chromium (Cr)? Weird, right? Shouldn't we have numerical data like for the other pollutants? Well, yes, we should. So what is happening here? It turns out that we have some strings in these columns instead of nice numbers. More precisely, the culprit strings are `'Not Available'`:

In [None]:
df_clean.loc[238, 'Emissions_As [tonne]']

As we did for `NaN`, let's check for which variables we have such strings. We cannot rely on the `.isna()` method, since we are not dealing with `NaN`. Instead, we can use a simple condition, and then sum over our rows:

In [None]:
(df_clean == 'Not Available').sum()

As suspected, the `'Not Available'` strings only appear for the emissions of arsenic (As), nickel (Ni), and chromium (Cr), thus explaining why these columns have the `object` type.

Let's continue our investigation by identifying the rows containing `'Not Available'`. To start with, we will only check for the emissions of arsenic column (`'Emissions_As [tonne]'`). Indeed, we can see above that we have 10 observations containing `'Not Available'` for the emissions of arsenic, nickel, and chromium. It feels likely that the missing data for these pollutants are for the same country-years...

In [None]:
df_clean[df_clean['Emissions_As [tonne]'] == 'Not Available']

Bingo! We got 10 rows, and if we look at the arsenic, nickel, and chromium columns, all the values are not available. 

Ok, so what do we do now? We could drop the country-years with missing values as we did before with `NaN`, or all the observations for Switzerland and Luxembourg, or the columns with the emissions of As, Ni, and Cr. However, in our context, it feels a bit extreme. Remember, what you do depends on your ultimate goal. And **data science is not only about good programming techniques, it is also about domain expertise**. I know, from previous studies and ex-post analysis, that the emissions of arsenic, nickel, and chromium are very small and thus represent a small part of the external costs of pollution, at least in other countries. There is no reason to suspect a different pattern for Switzerland and Luxembourg, especially since these countries emit relatively fewer pollutants than others (for the pollutants for which we have data). Thus, instead of deleting observations, we will assume that the emissions are equal to zero.

The influence of such assumptions on our result should be negligible. Still, you should always <span style="color:dodgerblue">document your assumptions </span> and you can even do ex-post <span style="color:dodgerblue">sensitivity analysis </span>, by trying different values and seeing how your results are affected.

So we decided to assume that the emissions were zero. How do we implement that? It's actually not difficult since we can use the `.replace()` method. In `.replace()`, the first argument is the value we wish to replace, the second one is the value we replace with:

In [None]:
df_clean = df_clean.replace('Not Available', 0)

df_clean

See, we still have 750 rows, so we did not drop any observation. Let's check that the `'Not Available'` strings have disappeared:

In [None]:
(df_clean == 'Not Available').sum().sum()

Yihaaa! A job well done! Let's perform one last cleaning operation. We will modify the type of our columns. Indeed, the `object` type is inconvenient when we perform numerical operations. We will use the `pd.to_numeric` function and the `.apply()` method, which allows us to apply a function along one axis of our data frame, in this case to each column.

In [None]:
list_hm_obj = ['Emissions_As [tonne]', 'Emissions_Ni [tonne]', 'Emissions_Cr [tonne]']

df_clean[list_hm_obj] = df_clean[list_hm_obj].apply(pd.to_numeric)

df_clean.dtypes
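As an alternative sketch (not the route taken in this notebook), `pd.to_numeric` accepts `errors="coerce"`, which turns any value it cannot parse into `NaN`. Combined with `.fillna(0)`, this reproduces the "assume zero" choice in a single pass. The Series below contains made-up values, and the names `as_numbers` and `filled` are ours.

```python
import pandas as pd

# Made-up column mixing numeric strings and a placeholder string
s = pd.Series(["1.5", "Not Available", "3"])

as_numbers = pd.to_numeric(s, errors="coerce")   # unparsable entries become NaN
filled = as_numbers.fillna(0)                    # then apply the zero assumption

print(filled.tolist(), filled.dtype)
```

Be aware that `errors="coerce"` silently converts every unparsable string, not only `'Not Available'`, so it is worth counting the resulting `NaN` values before filling them.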

This time, we are actually done with cleaning. Not so painful, was it? Huge props to pandas, which makes cleaning an actually bearable experience.

You might wonder, how do I know we are done with cleaning? Well, here, it's because I know our data set, since I prepared it! In practice, you need to further explore your data, for instance using Exploratory Data Analysis, which we will see in a future lesson. It might also happen that you only realize something is wrong with the data once you started to analyze it. Although it might be very annoying, remember that **data science projects are an iterative process**, and you always go back and forth.

Before moving on to some data manipulation, we will reset the labels of our rows - as you can see above, the row labels are still the ones from our original data frame. We can simply do that using the `.reset_index()` method. The argument `drop=True` drops the previous labels, while `drop=False` (default value) will insert the previous labels in a new column.

In [None]:
df_clean = df_clean.reset_index(drop=True)
df_clean
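The effect of the `drop` argument can be seen on a toy data frame with made-up values and labels; the names `dropped` and `kept` are ours.

```python
import pandas as pd

# Toy frame whose labels are left over from an earlier filtering step
toy = pd.DataFrame({"Year": [1995, 1996]}, index=[150, 151])

dropped = toy.reset_index(drop=True)   # old labels are discarded
kept = toy.reset_index()               # old labels move to a column named 'index'

print(dropped.index.tolist())
print(list(kept.columns))
```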

### Merging data frames <a name="Merging-data-frames"></a>

The first step when we discovered Pandas was to import data. We have imported the dataset we are currently using, which contains the GDP, population, and emissions of pollutants for European countries since 1990. But do you remember we also imported a second dataset? Well, we did not import that other dataset just for fun, we will actually need it for the GrDP analysis! No worries though, the dataset is already clean...

Let's first have a look at what it contains:

In [None]:
df_cost

We have, for the same pollutants as before, their unit damage cost. The damage costs are the costs and expenses that are or would be incurred to prevent, control or abate the environmental harm caused by pollution. We notice that the costs vary per country for air pollutants (NO$_x$, PM$_{2.5}$, PM$_{10}$, SO$_x$, NMVOC, NH$_3$). Indeed, these pollutants induce local pollution: they are transported in the air over relatively short distances before being inhaled by humans and contaminating soil, water, and ecosystems (at which point they could be ingested by humans). By contrast, greenhouse gases (GHG) are responsible for climate change, which is a global form of pollution: it does not matter where the emissions occur, all GHG emitted contribute to global warming. As for heavy metals (Pb, Cd, Hg, As, Ni, Cr), they induce local pollution, but the available data was only for an average European damage cost.

We note that the costs are independent of years. Indeed, all the monetary data (e.g., GDP) are expressed in 2019 euros.

GrDP is defined as GDP minus the external costs resulting from human activities. In its current version, the indicator only includes the external costs for which the impacts are known, measured, and priced; namely GHG, air pollutants, and heavy metals.

To compute GrDP, we will need to multiply the emissions data - contained in our data frame `df_clean` - by our unit cost data - included in the data frame `df_cost`. It will be more convenient to operate on a single data frame. Thus, we need to merge our two data frames.
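As a compact summary of the computation described above, the definition can be written as follows. The notation is ours and is not taken from the E4S report: $E_{p,c,t}$ stands for the emissions of pollutant $p$ in country $c$ and year $t$, and $u_{p,c}$ for the corresponding unit damage cost, which does not vary over time in this dataset.

$$
\mathrm{GrDP}_{c,t} = \mathrm{GDP}_{c,t} - \sum_{p} E_{p,c,t} \, u_{p,c}
$$

Each product $E_{p,c,t} \, u_{p,c}$ has to be converted to million euros before it is subtracted from GDP.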

This seems complex though: not only do the columns differ, but also the rows since the emissions vary for each year and the costs do not. No worries, Pandas is here to come to the rescue! There are several methods and functions to combine two data frames:

- `pd.concat()` combines data frames along an axis, either across rows or columns.
- `.join()` combines data on a key column or an index.
- `.merge()` combines data on common columns or indices.

To better understand how these methods work, I strongly recommend reading the associated [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). You can also check this [Real Python tutorial](https://realpython.com/pandas-merge-join-and-concat/#pandas-concat-combining-data-across-rows-or-columns).

In our case, we want to combine our two data frames such that, for each country-year observation, we have the variables GDP, population, emissions of each pollutant, and costs of each pollutant. Our two data frames have one column in common, the country. Moreover, we want to keep all the observations from the `df_clean` data frame while the costs for a given country should be the same for each year, i.e., we need to 'copy' the information of the `df_cost` data frame. We can achieve this using the `.merge()` function.

The values of the column(s) on which we merge our data frames are called the keys. If you dive a bit into the `.merge()` documentation, you will notice that you have different options to deal with cases where the keys differ (e.g., suppose that we have different countries in our two data frames):
- `inner` only keeps the rows with common keys, i.e., the intersection;
- `outer` keeps all rows, i.e., the union;
- `left` keeps all rows from the left (first) data frame, and disregards the ones from the right (second) data frame when the keys do not exist in the left data frame;
- `right` keeps all rows from the right data frame, and disregards the ones from the left data frame when the keys do not exist in the right data frame;
- `cross` creates the cartesian product of rows of both frames, i.e., it creates rows with all possible combinations of keys.

You can specify the method with `how`.

Fortunately, in our case, we do not need to worry about the method since the countries are the same.
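When the keys do differ, the options give different results. The sketch below counts the rows produced by each option on two toy data frames with made-up values, where country B is missing from the right frame and country C from the left one. The names `left`, `right`, and `row_counts` are ours.

```python
import pandas as pd

# Toy frames with made-up values and partly different keys
left = pd.DataFrame({"Country": ["A", "A", "B"], "Year": [2018, 2019, 2019]})
right = pd.DataFrame({"Country": ["A", "C"], "Cost": [10.0, 30.0]})

# Number of rows in the merged frame for each value of `how`
row_counts = {
    how: len(pd.merge(left, right, how=how, on="Country"))
    for how in ("inner", "outer", "left", "right")
}

print(row_counts)
```

With `left`, the row for country B is kept and its `Cost` is `NaN`, which is a quick way to spot keys that did not match.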

Combining data frames requires some practice, so try it by yourself, always check the result, and try again if you did not obtain what you expected! 

In [None]:
df_grdp = pd.merge(df_clean, df_cost, how="left", on="Country")

Note that in the code above, we did not strictly need to specify `on` which columns we should merge, since by default the function automatically selects all the common columns. Similarly, we did not strictly need to specify `how` to merge: since our keys are the same in both data frames, every option other than `cross` gives the same rows.

In [None]:
df_grdp
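One way to check a merge like this one is sketched below on toy data frames with made-up values. The `validate="many_to_one"` argument of `pd.merge` raises an error if a key appears more than once in the right data frame, and comparing the row counts confirms that no observation was lost or duplicated. The names `emissions`, `costs`, and `merged` are ours.

```python
import pandas as pd

# Toy frames with made-up values: several years per country, one cost per country
emissions = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2018, 2019, 2019],
    "Emissions": [1.0, 2.0, 3.0],
})
costs = pd.DataFrame({"Country": ["A", "B"], "Cost": [10.0, 20.0]})

# validate checks that each country appears at most once in `costs`
merged = pd.merge(emissions, costs, how="left", on="Country", validate="many_to_one")

print(len(merged) == len(emissions))   # same number of rows as the left frame
print(merged["Cost"].tolist())         # the cost is repeated for each year
```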

### Manipulating your data <a name="Manipulating-your-data"></a>

#### Operating on columns <a name="Operating-on-columns"></a>

Let's start with some simple operations. We have data for GDP and for population. To compare countries, it is best to use per capita indicators. Hence, let's compute the GDP per capita. We want to divide the elements of the GDP column by the elements of the population column. Should we use some kind of comprehension technique as we did with lists? No! In pandas, when we want to divide one column by another element-by-element, we can simply divide the two columns. Also, similar to what we did with dictionaries, we can easily create a new column in our data frame by indexing on a new name:

In [None]:
df_grdp["GDP per capita"] = df_grdp["GDP [million Euro]"]/df_grdp["Population"]

Note that the same syntax can be used for any element-by-element operations including addition, subtraction, or multiplication. Very convenient isn't it?

Let's visualize the result:

In [None]:
df_grdp.head()

Looks like it worked. However, because our GDP data was in million euros, our GDP per capita is in million euros per capita, so the values are very small... Let's convert the unit to euros per capita by multiplying the column by 1'000'000. As before, we do not need some kind of comprehension, we can directly multiply the column by our scalar!

In [None]:
df_grdp['GDP per capita'] = df_grdp['GDP per capita']*1000000

# Here is an alternative:
# df_grdp.loc[:, 'GDP per capita'] *= 1000000

df_grdp.head()

Much better!

#### Functions and data frame <a name="Functions-and-data-frame"></a>

We will now calculate the Green Domestic Product. First let's compute the external costs of each pollutant. We need to multiply each emission column by the associated unit cost column, while making sure that the units match. For instance, for greenhouse gases, we can calculate the external costs in million euros with the following:

In [None]:
df_grdp["External cost GHG"] = (df_grdp["Emissions_GHG [thousand tonnes CO2eq]"]
                                *df_grdp["Cost_GHG [Euro per tonnes CO2eq]"]/1000)

We could proceed similarly for all pollutants... But that would be redundant. Surely we can do better? Yes we can. We will define some functions to repeat the operation!

Let's proceed by group of pollutants, first with the air pollutants.

In [None]:
def external_cost(pol):
    """This function computes the external cost of air pollutants, 
    and stores the results in a new column of our data frame.
    It takes one argument: pol refers to a pollutant and should be a string"""
    
    df_grdp['External cost {p}'.format(p=pol)] = (df_grdp['Emissions_{p} [tonne]'.format(p=pol)]
                                             *df_grdp['Cost_{p} [Euro per tonne]'.format(p=pol)]
                                             /1000000)

Can you understand this function? We made use of the repeated patterns in our variable names and of the string `.format` method. Recall that `.format` allows us to insert values into a string. Here, the idea is to insert the name of our pollutant. The rest of the code performs the same kind of operation as we did above with GHG: it multiplies the emission of a given pollutant by the associated unit cost and stores the results in a new column. We divide by 1'000'000 to express the result in million euros.

Ok, now we can apply our function to our air pollutants. One way is to loop over a tuple of air pollutants:
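To see the string construction in isolation, here is a short sketch that builds one column name in two equivalent ways: with `.format`, as in the function above, and with an f-string. The variable names `name_format` and `name_fstring` are ours.

```python
pol = "NOx"

# Same pattern as in the function above
name_format = 'Emissions_{p} [tonne]'.format(p=pol)

# Equivalent f-string, available since Python 3.6
name_fstring = f'Emissions_{pol} [tonne]'

print(name_format)
print(name_format == name_fstring)
```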

In [None]:
air_pollutants = ('NOx', 'PM2.5', 'PM10', 'SOx', 'NMVOC', 'NH3') # tuple of air pollutants

for pol in air_pollutants:
    external_cost(pol)

Alright, let's do the same with the heavy metals:

In [None]:
def external_cost_hm(pol):
    """This function computes the external cost of heavy metals, 
    and stores the results in a new column of our data frame.
    It takes one argument: pol refers to a pollutant and should be a string"""
    
    df_grdp['External cost {p}'.format(p=pol)] = (df_grdp['Emissions_{p} [tonne]'.format(p=pol)]
                                             *df_grdp['Cost_{p} [Euro per kg]'.format(p=pol)]
                                             /1000)

In [None]:
heavy_metals = ('Pb', 'Cd', 'Hg', 'As', 'Ni', 'Cr') # tuple of heavy metals

for pol in heavy_metals:
    external_cost_hm(pol)
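The two functions above differ only in the unit of the cost column and in the conversion factor, and both modify the global `df_grdp`. As a possible refactoring, sketched here under our own naming (`add_external_cost`, `cost_unit`, and `divisor` are hypothetical and not part of the original notebook), a single function can take the data frame and these two details as arguments and return a new data frame. The toy values are made up.

```python
import pandas as pd

def add_external_cost(df, pol, cost_unit, divisor):
    """Return a copy of df with a new 'External cost <pol>' column in million euros.

    Hypothetical helper: cost_unit is the unit text used in the cost column name,
    and divisor converts emissions * unit cost into million euros.
    """
    out = df.copy()
    out[f"External cost {pol}"] = (
        out[f"Emissions_{pol} [tonne]"] * out[f"Cost_{pol} [{cost_unit}]"] / divisor
    )
    return out

# Toy frame with made-up values, using the column naming pattern of this notebook
toy = pd.DataFrame({
    "Emissions_NOx [tonne]": [1000.0],
    "Cost_NOx [Euro per tonne]": [5000.0],
    "Emissions_Pb [tonne]": [2.0],
    "Cost_Pb [Euro per kg]": [1000.0],
})

toy = add_external_cost(toy, "NOx", "Euro per tonne", 1_000_000)  # tonne * Euro/tonne
toy = add_external_cost(toy, "Pb", "Euro per kg", 1_000)          # tonne * Euro/kg

print(toy[["External cost NOx", "External cost Pb"]])
```

Passing the data frame explicitly makes the function easier to reuse and to test, at the cost of a slightly longer call.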

As always, let's visualize the result:

In [None]:
df_grdp.head()

And we can check that we did not miss one pollutant by returning a list of the variables:

In [None]:
list(df_grdp)

Looks good to me! We're almost done!

Let's compute the total external cost by summing each pollutant's external cost. We use the `.sum()` method, where `axis=1` means that, for each row, we add up the values across the selected columns (with `axis=0`, we would instead add up each column over all rows):

In [None]:
df_grdp['Total external cost'] = df_grdp.loc[:,'External cost GHG':'External cost Cr'].sum(axis = 1)
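The role of `axis` can be checked on a toy data frame with made-up values; the names `costs`, `row_totals`, and `column_totals` are ours. Note that a label slice such as `'External cost GHG':'External cost NOx'` relies on the order of the columns, so it is worth checking that no unrelated column sits between the two ends.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "External cost GHG": [1.0, 2.0],
    "External cost NOx": [0.5, 0.5],
    "GDP": [100.0, 200.0],
})

costs = toy.loc[:, "External cost GHG":"External cost NOx"]   # both ends included

row_totals = costs.sum(axis=1)      # one total per row (per country-year)
column_totals = costs.sum(axis=0)   # one total per column (per pollutant)

print(row_totals.tolist())
print(column_totals.tolist())
```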

#### Calculating GrDP <a name="Calculating-GrDP"></a>

Finally, we can compute the Green Domestic Product. We also compute a few additional indicators, such as the GrDP per capita, and the share of external cost with respect to GDP:

In [None]:
# GrDP = GDP - total external cost
df_grdp['GrDP [million Euro]'] = df_grdp['GDP [million Euro]'] - df_grdp['Total external cost']

# GrDP per capita = GrDP / population
df_grdp['GrDP per capita [Euro]'] = df_grdp['GrDP [million Euro]']*1000000/df_grdp['Population']

# Share of external cost w.r.t. GDP
df_grdp['Share external cost [%]'] = df_grdp['Total external cost']*100/df_grdp['GDP [million Euro]']

Let's visualize the result:

In [None]:
df_grdp.head()

Here you go! We have computed the GrDP and related indicators for European countries between 1995 and 2019! 

### Exporting data frame <a name="Exporting-data-frame"></a>

We have created this super cool data frame. We will now save our data into a CSV file to be able to reuse it. We use the `.to_csv()` method to export to CSV. We specify a name for our file, while the keyword argument `index=False` asks Pandas not to write the row labels to the file.

In [None]:
df_grdp.to_csv('GrDP_1995-2019.csv', index=False)

Our file is saved in the active directory. In Colab, click on the folder icon (left of the screen), and you can then download your file.
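To see what `index=False` changes, the sketch below writes a toy data frame with made-up values to an in-memory buffer instead of a file, then reads it back. The names `buffer`, `text`, and `reloaded` are ours.

```python
import io
import pandas as pd

# Toy frame with made-up values and arbitrary row labels
toy = pd.DataFrame({"Country": ["A", "B"], "GrDP": [1.5, 2.5]}, index=[10, 11])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # the row labels 10 and 11 are not written
text = buffer.getvalue()

buffer.seek(0)                    # rewind before reading the content back
reloaded = pd.read_csv(buffer)    # fresh default labels 0 and 1

print(text)
print(reloaded)
```

Without `index=False`, the labels would be written as an extra first column, which then shows up as `Unnamed: 0` when the file is read back with default settings.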