## Imports

In [None]:
# Import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# seaborn is a data visualization library built on matplotlib
import seaborn as sns 

# set the plotting style 
sns.set_style("whitegrid")

# Plot missing values
import missingno as msno

## Lab introduction

Use the greenhouse gas emissions data set `owid-co2-data.csv` from Our World in Data to describe how the emission levels of the current top 10 CO$_2$ emitters have changed over the last 50 years (1971 to 2020).



## Import and set up the data set

##### $\rightarrow$ Use Pandas to load the file `owid-co2-data.csv` from https://github.com/owid/co2-data as a `DataFrame`. Name the `DataFrame` `df`.

##### Solution

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv')

Consult the [codebook](https://github.com/owid/co2-data/blob/master/owid-co2-codebook.csv) to see the description of each column.



##### $\rightarrow$ Select the rows corresponding to individual countries 

In [None]:
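# Keep only the columns needed for this lab; .copy() avoids chained-assignment warnings later
df_population = df[['country', 'year', 'population', 'co2']].copy()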

In [None]:
df_population.head()

|   | country | year | population | co2 |
|---|---|---|---|---|
| 0 | Afghanistan | 1850 | 3752993.0 | NaN |
| 1 | Afghanistan | 1851 | 3769828.0 | NaN |
| 2 | Afghanistan | 1852 | 3787706.0 | NaN |
| 3 | Afghanistan | 1853 | 3806634.0 | NaN |
| 4 | Afghanistan | 1854 | 3825655.0 | NaN |


The `country` column of the data set contains some values that are groups of countries. We will remove these observations from the data set.

In [None]:
non_countries = ['Africa', 'Africa (GCP)', 'Asia', 'Asia (GCP)', 'Asia (excl. China and India)', 'Central America (GCP)',
                  'EU-27', 'Europe', 'Europe (excl. EU-27)', 'European Union (27) (GCP)', 'Europe (GCP)',
                  'Europe (excl. EU-28)', 'European Union (27)', 
                  'European Union (28)', 'French Equatorial Africa', 
                  'French Guiana', 'French Polynesia', 'French West Africa',
                  'High-income countries', 'International transport', 
                  'Low-income countries', 'Lower-middle-income countries', 'Mayotte', 'Middle East (GCP)',
                  'Non-OECD (GCP)',
                  'North America',  'North America (excl. USA)', 'North America (GCP)',
                  'Oceania (GCP)', 'OECD (GCP)', 
                  'Panama Canal Zone','South America', 'South America (GCP)', 'Upper-middle-income countries', 
                  'World']

Remove the rows corresponding to the non-countries.

In [None]:
df_population = df_population.loc[~df_population['country'].isin(non_countries)]
df_population

|   | country | year | population | co2 |
|---|---|---|---|---|
| 0 | Afghanistan | 1850 | 3752993.0 | NaN |
| 1 | Afghanistan | 1851 | 3769828.0 | NaN |
| 2 | Afghanistan | 1852 | 3787706.0 | NaN |
| 3 | Afghanistan | 1853 | 3806634.0 | NaN |
| 4 | Afghanistan | 1854 | 3825655.0 | NaN |
| ... | ... | ... | ... | ... |
| 46518 | Zimbabwe | 2017 | 14751101.0 | 9.596 |
| 46519 | Zimbabwe | 2018 | 15052191.0 | 11.795 |
| 46520 | Zimbabwe | 2019 | 15354606.0 | 11.115 |
| 46521 | Zimbabwe | 2020 | 15669663.0 | 10.608 |


## Explore the data set

##### $\rightarrow$ Display the head of the data frame

In [None]:
df_population.head()

|   | country | year | population | co2 |
|---|---|---|---|---|
| 0 | Afghanistan | 1850 | 3752993.0 | NaN |
| 1 | Afghanistan | 1851 | 3769828.0 | NaN |
| 2 | Afghanistan | 1852 | 3787706.0 | NaN |
| 3 | Afghanistan | 1853 | 3806634.0 | NaN |
| 4 | Afghanistan | 1854 | 3825655.0 | NaN |


##### Solution

##### $\rightarrow$ Use the `info` method to further explore the data.
1.  Are there any columns where the data type is obviously incorrect? For example, is there a variable that should be a number, but is coded as a string?
2.  Do any of the columns have missing (null) values?

In [None]:
df_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46523 entries, 0 to 46522
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     46523 non-null  object 
 1   year        46523 non-null  int64  
 2   population  38574 non-null  float64
 3   co2         31349 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.4+ MB


##### Solution
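
In the `info` output above, the dtypes look reasonable (`country` is an object column and the other three are numeric), while the non-null counts for `population` and `co2` fall short of the total number of entries, so both columns contain missing values. One way to tabulate this directly is sketched below. The helper `missing_summary` and the `toy` frame are hypothetical and made up for illustration; in the lab you would pass `df_population` instead.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and fraction of missing values in each column."""
    n_missing = frame.isna().sum()
    return pd.DataFrame({"n_missing": n_missing,
                         "fraction": n_missing / len(frame)})


# Made-up toy data standing in for df_population
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "population": [1.0e6, 1.1e6, np.nan, 2.0e6],
    "co2": [np.nan, 5.0, np.nan, 7.5],
})

print(toy.dtypes)            # check that each column has a sensible type
summary = missing_summary(toy)
print(summary)
```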

##### $\rightarrow$ What years are present in the data set?

##### Solution
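
A possible approach is to look at the unique values of the `year` column and then check whether any year between the first and last is absent. The sketch below uses a made-up `toy` frame; with the real data you would replace `toy` with `df_population`.

```python
import pandas as pd

# Made-up toy data standing in for df_population
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [1971, 1972, 1974, 1971, 1972],
})

# Sorted list of the distinct years that appear anywhere in the data
years = sorted(toy["year"].unique().tolist())
first, last = years[0], years[-1]

# Years inside the overall range that never appear
gaps = sorted(set(range(first, last + 1)) - set(years))

print(f"Years run from {first} to {last}; {len(years)} distinct years")
print("Years with no rows at all:", gaps)
```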

## Analysis of top emissions in 2020

##### $\rightarrow$ Find the top 10 emitters of total CO$_2$ in 2020.



##### Solution
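
One way to find the largest emitters is to filter to the year of interest, drop rows without a `co2` value, and call `nlargest`. The helper `top_emitters` and the `toy` frame below are hypothetical; the sketch also assumes the non-country rows have already been removed, otherwise aggregates such as `World` would dominate the ranking.

```python
import numpy as np
import pandas as pd


def top_emitters(frame: pd.DataFrame, year: int, n: int = 10) -> pd.DataFrame:
    """Return the n rows with the largest co2 value in the given year."""
    in_year = frame.loc[frame["year"] == year, ["country", "co2"]]
    in_year = in_year.dropna(subset=["co2"])
    return in_year.nlargest(n, "co2").reset_index(drop=True)


# Made-up toy data standing in for df_population
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "year": [2019, 2020] * 4,
    "co2": [40.0, 50.0, 6.0, 5.0, 80.0, 90.0, 1.0, np.nan],
})

top = top_emitters(toy, year=2020, n=2)
top_countries = top["country"].tolist()
print(top)
```

In the lab, `top_emitters(df_population, 2020)` would give the top 10, and the resulting list of country names can be reused in the later questions.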

##### $\rightarrow$ Make a histogram of total CO$_2$ emissions in 2020. Make the plot on a density scale.

##### Solution

##### $\rightarrow$ Make a boxplot of total CO$_2$ emissions in 2020. Add a strip plot on top of the boxplot.

##### Solution
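
One possible way to layer the two plots is to draw both on the same Matplotlib axes. In this sketch the boxplot's own outlier markers are hidden with `showfliers=False` so that each point is drawn only once, by the strip plot. The `toy_2020` frame is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up 2020 emissions, one row per country
toy_2020 = pd.DataFrame({
    "country": list("ABCDEFGH"),
    "co2": [0.5, 1.2, 3.0, 4.5, 8.0, 20.0, 150.0, 900.0],
})

fig, ax = plt.subplots(figsize=(8, 3))
# Boxplot first, without flier markers
sns.boxplot(data=toy_2020, x="co2", color="lightgray", showfliers=False, ax=ax)
# Strip plot on top: one jittered point per country
sns.stripplot(data=toy_2020, x="co2", color="black", alpha=0.6, size=4, ax=ax)
ax.set_xlabel("CO2 emissions in 2020")
```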

##### $\rightarrow$ Are the CO$_2$ emissions of the top 10 emitters in 2020 outliers in the distribution?

##### Solution
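
"Outlier" has no single definition. A common convention, and the one boxplot whiskers use by default, flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule; the helper `iqr_outlier_mask` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up 2020 emissions, one value per country
toy_2020 = pd.DataFrame({
    "country": list("ABCDEFGHIJ"),
    "co2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 500.0],
})

toy_2020["is_outlier"] = iqr_outlier_mask(toy_2020["co2"])
outlier_countries = toy_2020.loc[toy_2020["is_outlier"], "country"].tolist()
print(outlier_countries)
```

With the real data, comparing the flagged countries against the top 10 list answers the question under this particular definition.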

## Emission trend over time

##### $\rightarrow$ Is the data set missing any CO$_2$ emission values for the top 10 emitters in 2020 over the years 1971 to 2020?

##### Solution
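
A value can be missing in two ways: the row exists but `co2` is `NaN`, or the row for that country and year is absent altogether. One way to catch both is to pivot to a year-by-country table and reindex it to the full year range before counting. The helper `missing_by_country` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_by_country(frame: pd.DataFrame, countries: list,
                       start: int, end: int) -> pd.Series:
    """Count missing co2 values per country over start..end (inclusive)."""
    keep = frame["country"].isin(countries) & frame["year"].between(start, end)
    wide = frame.loc[keep].pivot(index="year", columns="country", values="co2")
    # Reindexing inserts NaN for any year or country with no row at all
    wide = wide.reindex(index=range(start, end + 1), columns=countries)
    return wide.isna().sum()


# Made-up toy data: A has a NaN in 1972, B has no 1972 row
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [1971, 1972, 1973, 1971, 1973],
    "co2": [1.0, np.nan, 3.0, 2.0, 4.0],
})

missing = missing_by_country(toy, ["A", "B"], 1971, 1973)
print(missing)
```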

##### $\rightarrow$ Plot the time plot of the total CO$_2$ emissions from 1971 to 2020 for the top 10 emitters in 2020.

##### Solution
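
One possible approach is to pivot the filtered data so that each country becomes a column, then let pandas draw one line per column. The helper `plot_emissions` and the `toy` frame are hypothetical; the axis label assumes the units given in the codebook (million tonnes).

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_emissions(frame: pd.DataFrame, countries: list,
                   start: int, end: int, ax=None):
    """Draw one co2-versus-year line per country and return the axes."""
    keep = frame["country"].isin(countries) & frame["year"].between(start, end)
    wide = frame.loc[keep].pivot(index="year", columns="country", values="co2")
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))
    wide.plot(ax=ax)
    ax.set_xlabel("Year")
    ax.set_ylabel("CO2 emissions (million tonnes)")  # units per the codebook
    return ax


# Made-up toy data standing in for df_population
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3,
    "year": [1971, 1972, 1973] * 2,
    "co2": [10.0, 12.0, 15.0, 100.0, 110.0, 130.0],
})

ax = plot_emissions(toy, ["A", "B"], 1971, 1973)
n_lines = len(ax.get_lines())
```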

##### $\rightarrow$ Again, plot the time plot of the total CO$_2$ emissions from 1971 to 2020 for the top 10 emitters in 2020, but now also include a plot of the mean total CO$_2$ emissions over all countries on the same plot.

##### Solution
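
The yearly mean over all countries can be computed with a `groupby` on `year` (which skips `NaN` values by default) and drawn on the same axes as the country lines. This is a sketch on made-up data; the names `toy` and `top` are placeholders for `df_population` and the top 10 list, and it assumes the non-country rows have already been removed so that aggregates do not distort the mean.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy data: A and B play the role of the top emitters
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [1971, 1972, 1973] * 3,
    "co2": [10.0, 12.0, 15.0, 100.0, 110.0, 130.0, 1.0, 2.0, 2.0],
})
top = ["A", "B"]

# One column per top emitter
wide = toy[toy["country"].isin(top)].pivot(
    index="year", columns="country", values="co2")

# Mean over ALL countries in each year, not only the top emitters
mean_by_year = toy.groupby("year")["co2"].mean()

fig, ax = plt.subplots(figsize=(10, 6))
wide.plot(ax=ax)
ax.plot(mean_by_year.index, mean_by_year.values,
        color="black", linestyle="--", label="Mean (all countries)")
ax.set_xlabel("Year")
ax.set_ylabel("CO2 emissions")
ax.legend()
```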

##### $\rightarrow$ Given the large difference between the smallest and largest values, it can help to plot the results on a log scale. Produce the plot of the top 10 emitters and the mean with CO$_2$ emissions on a log scale.

##### Solution
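
A compact way to get a logarithmic y axis is the `logy=True` option of the pandas `plot` method (calling `ax.set_yscale("log")` on an existing axes is equivalent). The sketch below adds the mean as an extra column before plotting; `toy` and `top` are made-up placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy data: A and B play the role of the top emitters
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [1971, 1972, 1973] * 3,
    "co2": [10.0, 12.0, 15.0, 100.0, 110.0, 130.0, 1.0, 2.0, 2.0],
})
top = ["A", "B"]

# Year-by-country table for every country, then keep the top emitters
wide_all = toy.pivot(index="year", columns="country", values="co2")
wide = wide_all[top].copy()
# Row-wise mean over all countries, added as one more line to plot
wide["Mean (all countries)"] = wide_all.mean(axis=1)

ax = wide.plot(logy=True, figsize=(10, 6))
ax.set_xlabel("Year")
ax.set_ylabel("CO2 emissions (log scale)")
```

Zero values cannot be shown on a log axis, so any country-year with zero emissions would simply not appear in the plot.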

##### $\rightarrow$ Comment on the trend in CO$_2$ emissions from these countries over the last 50 years.

##### Solution