Load data

In [1]:
import pandas as pd

co2 = pd.read_csv("gapminder.csv")
co2 = pd.melt(co2, id_vars=["country"], var_name="year", value_name="co2_per_capita")
co2.head()

Unnamed: 0,country,year,co2_per_capita
0,Afghanistan,1800,
1,Albania,1800,
2,Algeria,1800,
3,Andorra,1800,
4,Angola,1800,
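To make the reshaping step concrete, here is a minimal sketch of what `pd.melt` does, using a made-up two-country table (the names `wide` and `long` and all values are invented for illustration). It also shows that the `year` labels start out as strings, because they come from column names.

```python
import pandas as pd

# Toy wide table (made-up values): one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1800": [1.0, 2.0],
    "1801": [3.0, 4.0],
})

# melt keeps `country` as the identifier and stacks the year columns into rows.
long = pd.melt(wide, id_vars=["country"], var_name="year", value_name="value")

# The year labels were column names, so they are strings until converted.
long["year"] = long["year"].astype(int)
print(long)
```

The conversion to `int` is optional. If it is used, it probably needs to be applied to every data frame before merging, so that the `year` keys have the same type everywhere.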


Here I melt the data frame to put it in the same long format as the previous one.

In [2]:
gdp = pd.read_csv("gapminder-gdp.csv")
gdp = pd.melt(gdp, id_vars=["geo.name"], var_name="year", value_name="gdp_per_capita")
gdp = gdp.rename(columns={"geo.name":"country"})
gdp.head()

Unnamed: 0,country,year,gdp_per_capita
0,Abkhazia,1800,
1,Afghanistan,1800,603.0
2,Akrotiri and Dhekelia,1800,
3,Albania,1800,667.0
4,Algeria,1800,715.0


The same here.

In [3]:
population = pd.read_csv("population.csv")
population = pd.melt(population, id_vars=["country"], var_name="year", value_name="population")
population.head()

Unnamed: 0,country,year,population
0,Afghanistan,1800,3280000
1,Albania,1800,400000
2,Algeria,1800,2500000
3,Andorra,1800,2650
4,Angola,1800,1570000


Here it wasn't necessary.

In [4]:
continents = pd.read_csv("countries-continents.csv")
continents.head()

Unnamed: 0,country,iso_alpha3_code,m_49_code,region_1,region_2,continent
0,Afghanistan,AFG,4,Southern Asia,,Asia
1,Åland Islands,ALA,248,Northern Europe,,Europe
2,Albania,ALB,8,Southern Europe,,Europe
3,Algeria,DZA,12,Northern Africa,,Africa
4,American Samoa,ASM,16,Polynesia,,Oceania


Keep `country`, `iso_alpha3_code` and `continent`.

In [5]:
continents = continents[["country", "iso_alpha3_code", "continent"]]

And merge all the data frames together.

In [7]:
df = pd.merge(co2,gdp,on=['country','year'],how='left')

In [8]:
df = pd.merge(df, population, on=['country', 'year'], how='left')

In [9]:
df = pd.merge(df, continents, on='country', how='left')

In [10]:
df.head()

Unnamed: 0,country,year,co2_per_capita,gdp_per_capita,population,iso_alpha3_code,continent
0,Afghanistan,1800,,603.0,3280000,AFG,Asia
1,Albania,1800,,667.0,400000,ALB,Europe
2,Algeria,1800,,715.0,2500000,DZA,Africa
3,Andorra,1800,,1197.0,2650,AND,Europe
4,Angola,1800,,618.0,1570000,AGO,Africa
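A left merge silently leaves `NaN` where a key has no match, which is easy to miss. As a rough sketch (toy frames with invented values, not the real files), the `indicator` and `validate` options of `pd.merge` can surface unmatched country names right away:

```python
import pandas as pd

# Toy frames with made-up values; "Atlantis" has no entry in the lookup table.
left = pd.DataFrame({"country": ["Spain", "Atlantis"], "value": [1, 2]})
lookup = pd.DataFrame({"country": ["Spain"], "continent": ["Europe"]})

# indicator=True adds a `_merge` column telling where each row came from;
# validate="many_to_one" raises if a country appears twice in the lookup.
merged = pd.merge(left, lookup, on="country", how="left",
                  indicator=True, validate="many_to_one")

# Rows flagged "left_only" found no partner in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

Country names spelled differently across sources would show up in `unmatched` here.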


Let's count the NAs in each column, year by year.

In [11]:
df.drop(columns='year').isna().groupby(df.year, sort=False).sum().reset_index()

Unnamed: 0,year,country,co2_per_capita,gdp_per_capita,population,iso_alpha3_code,continent
0,1800,0.0,187.0,2.0,0.0,27.0,27.0
1,1801,0.0,187.0,2.0,0.0,27.0,27.0
2,1802,0.0,185.0,2.0,0.0,27.0,27.0
3,1803,0.0,187.0,2.0,0.0,27.0,27.0
4,1804,0.0,186.0,2.0,0.0,27.0,27.0
...,...,...,...,...,...,...,...
210,2010,0.0,1.0,2.0,0.0,27.0,27.0
211,2011,0.0,1.0,2.0,0.0,27.0,27.0
212,2012,0.0,0.0,2.0,0.0,27.0,27.0
213,2013,0.0,0.0,2.0,0.0,27.0,27.0
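The one-liner above is fairly dense, so here is the same idea on a tiny invented table (`toy` and its values are assumptions for illustration): flag missing cells with `isna()`, group the flags by year, and sum them.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with a few missing cells.
toy = pd.DataFrame({
    "year": ["1800", "1800", "1801"],
    "co2": [np.nan, 1.0, np.nan],
    "gdp": [5.0, np.nan, 6.0],
})

# isna() yields True/False per cell; summing booleans per group counts the NAs.
# The grouping key is the original `year` column, aligned by row index.
na_per_year = toy.drop(columns="year").isna().groupby(toy["year"], sort=False).sum()
print(na_per_year)
```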


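It seems that the number of missing values is constant across years in the `gdp_per_capita` and `continent` attributes (`iso_alpha3_code` comes from the same table as `continent`, so it shows the same count).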

In [12]:
df.loc[df.gdp_per_capita.isnull(), "country"].value_counts()

Liechtenstein      215
North Macedonia    215
Name: country, dtype: int64

Countries with no GDP data are removed, as they won't be plotted.

In [13]:
# Keep only rows with GDP data; .copy() avoids SettingWithCopyWarning on later assignments.
df = df.loc[df.gdp_per_capita.notnull(), :].copy()

Now, let's see what happens with the continent.

In [14]:
df.continent.value_counts()

Africa           10320
Asia              8385
Europe            7310
North America     4085
Oceania           2795
South America     2150
Name: continent, dtype: int64

As far as I'm concerned, there is only one America, so let's combine North America and South America.

In [15]:
# The pattern is a regular expression, so regex=True is required in pandas >= 2.0.
df["continent"] = df["continent"].str.replace("South |North ", "", regex=True)
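The pattern `"South |North "` only works as an alternation when it is treated as a regular expression. A small sketch on an invented Series (the name `labels` is hypothetical) shows the effect, and that missing values stay missing:

```python
import pandas as pd

# Toy continent labels, including a missing one.
labels = pd.Series(["North America", "South America", "Europe", None])

# With regex=True, "South |North " matches either prefix and removes it.
combined = labels.str.replace("South |North ", "", regex=True)
print(combined)
```

Without `regex=True`, recent pandas versions look for the literal text `South |North `, which never occurs, so nothing would be replaced.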

Which countries do not have a corresponding continent?

In [16]:
df.loc[df.continent.isnull(), "country"].value_counts()

Congo, Rep.                       215
Cape Verde                        215
Micronesia, Fed. Sts.             215
Moldova                           215
Syria                             215
United States                     215
St. Kitts and Nevis               215
Czech Republic                    215
Russia                            215
Slovak Republic                   215
South Korea                       215
Brunei                            215
St. Vincent and the Grenadines    215
Congo, Dem. Rep.                  215
North Korea                       215
Palestine                         215
United Kingdom                    215
Cote d'Ivoire                     215
Tanzania                          215
Bolivia                           215
Vietnam                           215
Venezuela                         215
Lao                               215
St. Lucia                         215
Iran                              215
Kyrgyz Republic                   215
Swaziland   

It's time to refresh (and learn) some geography!

In [17]:
df.loc[df.country=="Congo, Rep.", "continent"] = "Africa"
df.loc[df.country=="St. Lucia", "continent"] = "America"
df.loc[df.country=="Micronesia, Fed. Sts.", "continent"] = "Oceania"
df.loc[df.country=="Syria", "continent"] = "Asia"
df.loc[df.country=="Swaziland", "continent"] = "Africa"
df.loc[df.country=="St. Kitts and Nevis", "continent"] = "America"
df.loc[df.country=="Moldova", "continent"] = "Europe"
df.loc[df.country=="Tanzania", "continent"] = "Africa"
df.loc[df.country=="Iran", "continent"] = "Asia"
df.loc[df.country=="Brunei", "continent"] = "Asia"
df.loc[df.country=="South Korea", "continent"] = "Asia"
df.loc[df.country=="United Kingdom", "continent"] = "Europe"
df.loc[df.country=="Slovak Republic", "continent"] = "Europe"
df.loc[df.country=="Cote d'Ivoire", "continent"] = "Africa"
df.loc[df.country=="Venezuela", "continent"] = "America"
df.loc[df.country=="Vietnam", "continent"] = "Asia"
df.loc[df.country=="North Korea", "continent"] = "Asia"
df.loc[df.country=="Russia", "continent"] = "Europe"
df.loc[df.country=="Czech Republic", "continent"] = "Europe"
df.loc[df.country=="Lao", "continent"] = "Asia"
df.loc[df.country=="Bolivia", "continent"] = "America"
df.loc[df.country=="Palestine", "continent"] = "Asia"
df.loc[df.country=="Kyrgyz Republic", "continent"] = "Asia"
df.loc[df.country=="St. Vincent and the Grenadines", "continent"] = "America"
df.loc[df.country=="Cape Verde", "continent"] = "Africa"
df.loc[df.country=="Congo, Dem. Rep.", "continent"] = "Africa"
df.loc[df.country=="United States", "continent"] = "America"
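The long list of assignments could also be written as a lookup table. This is only a sketch on toy data: the dict `missing_continents` is a hypothetical name holding a three-entry excerpt of the assignments above, and `toy` stands in for `df`.

```python
import numpy as np
import pandas as pd

# Excerpt of the manual assignments, expressed as a dict (hypothetical name).
missing_continents = {"Syria": "Asia", "Tanzania": "Africa", "Bolivia": "America"}

# Toy stand-in for df: three countries without a continent, one with.
toy = pd.DataFrame({
    "country": ["Syria", "Tanzania", "Bolivia", "Spain"],
    "continent": [np.nan, np.nan, np.nan, "Europe"],
})

# map() looks each country up in the dict (NaN if absent);
# fillna() then fills only the rows whose continent is still missing.
toy["continent"] = toy["continent"].fillna(toy["country"].map(missing_continents))
print(toy)
```

A dict keeps each country name in one place, which makes typos such as a trailing space easier to spot.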

Finally, missing values in `co2_per_capita` are replaced by `0` so the data can be plotted. Otherwise, there would be problems with the outcome (e.g. missing continents in the legend).

In [18]:
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df.
df["co2_per_capita"] = df["co2_per_capita"].fillna(0)

Save the data to work with it later.

In [19]:
df.to_csv("prepared.csv")
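One thing to keep in mind when loading the file later: `to_csv` writes the row index as an unnamed first column by default. A small sketch with an in-memory buffer (toy data, no real file involved) shows the round trip:

```python
import io
import pandas as pd

# Toy frame standing in for the prepared data.
toy = pd.DataFrame({"country": ["A", "B"], "year": ["1800", "1801"]})

# Write to an in-memory buffer instead of a file; the index is included by default.
buffer = io.StringIO()
toy.to_csv(buffer)

# The header line starts with an empty name for the index column.
buffer.seek(0)
header = buffer.readline().strip()
print(header)

# Reading with index_col=0 restores the index instead of creating "Unnamed: 0".
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

Note that `read_csv` infers types again, so `year` will likely come back as an integer column.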