In [None]:
import warnings
warnings.filterwarnings('ignore')

# Lab 2 - Exercise 1

Here we load a subset of the World Development Indicators data collected by the World Bank ([link to dataset](https://datacatalog.worldbank.org/dataset/world-development-indicators)). The code used to subset the data is in the 'Wrangling World Indicators' notebook file.

The dataset contains data from the UK, France, Germany and Italy. The variables we look at are:

* GDP per capita (current US Dollars)
* Imports of goods and services (current US Dollars)
* Land area (sq. km)
* Life expectancy at birth, total (years)
* Population in largest city
* Population growth (annual %)
* Population, total
* Primary education, duration (years) 
* Progression to secondary school (%)
* Rural population (% of total population)

and the data, though incomplete, runs from 1960 to 2004.

Load in pandas and the dataset.

In [None]:
import pandas as pd
df = pd.read_csv('data/world_indicators_pandas.csv', encoding='UTF-8')

How does the data look?

In [None]:
df.i???()

Change year to datetime instead of an integer.

In [None]:
df['year'] =  pd.to_date????(df['year'], format='%Y')

# look at the data types
df.dty???

Check if we have some missing data.

In [None]:
?.shape[0] - df.?()

In [None]:
df.isnull().sum()
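Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (the column names and values are hypothetical, not the lab data) to show both side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the lab data
toy = pd.DataFrame({
    "Country_Name": ["France", "France", "Italy", "Italy"],
    "Land_area": [np.nan, 547557.0, 294110.0, 294110.0],
    "Progression": [np.nan, np.nan, 94.0, 99.0],
})

# Number of missing values per column
missing_count = toy.isnull().sum()

# Fraction of rows that are missing per column (mean of True/False)
missing_share = toy.isnull().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

The same two lines should work on `df` once it is loaded.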

What do the top 5 rows look like?

In [None]:
df.?()

We can look at the top 2 if we like.

In [None]:
df.iloc[0:2,?]

Or all the rows for a particular variable.

In [None]:
df.iloc[:,?]

We can pick out data for a particular country.

In [None]:
df[df.Country_Name == ?]

Or a particular country and variable.

In [None]:
df[df.Country_Name == 'Germany'][?]

In [None]:
df[df.Country_Name == 'France']['Land_area_(sq._km)']

The first land area value is `NaN`. We could fill the first value with the lowest for that country.

Below, we use the `.` notation to grab a column, which can be more convenient than square brackets, together with `df.loc` (see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)). For each country we set every Land Area value to the lowest value recorded for that country, which overwrites the `NaN`. This is reasonable here only because land area barely changes from year to year.

In [None]:
for country in df.Country_Name.unique():
    df.loc[df.Country_Name == country, 'Land_area_(sq._km)'] = df.loc[df.Country_Name == country, 'Land_area_(sq._km)'].min()
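The loop above can also be written without an explicit `for`, using `groupby` with `transform`. This is a minimal sketch on a made-up frame (the values are illustrative only), assuming, as above, that a single land area value per country is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing land area value per country
toy = pd.DataFrame({
    "Country_Name": ["France", "France", "France", "Italy", "Italy"],
    "Land_area_(sq._km)": [np.nan, 547557.0, 547557.0, np.nan, 294110.0],
})

# transform('min') computes the minimum within each country (skipping NaN)
# and broadcasts it back to every row of that country
toy["Land_area_(sq._km)"] = (
    toy.groupby("Country_Name")["Land_area_(sq._km)"].transform("min")
)

print(toy)
```

Like the loop, this overwrites every value in the group, not just the missing one, so it is only appropriate for a variable that is effectively constant within each country.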

We can also scale GDP per capita and total population so that each lies between 0 and 1.

In [None]:
df['GDP_scaled'] = (df['GDP_per_capita_(current_US$)'] - df['GDP_per_capita_(current_US$)'].min()) / (df['GDP_per_capita_(current_US$)'].max() - df['GDP_per_capita_(current_US$)'].min())
df['pop_scaled'] = (df[?] - df['Population,_total'].?()) / (df['Population,_total'].max() - df['Population,_total'].min())
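Since the same formula is applied to both columns, it could be wrapped in a small helper. The function name `min_max_scale` and the toy values below are our own for illustration; this is a sketch, not part of the lab code.

```python
import numpy as np
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric Series so its minimum maps to 0 and its maximum to 1."""
    lo, hi = series.min(), series.max()  # both skip NaN by default
    return (series - lo) / (hi - lo)


# Hypothetical toy values, including a missing one
toy = pd.DataFrame({"gdp": [100.0, 300.0, 500.0, np.nan]})
toy["gdp_scaled"] = min_max_scale(toy["gdp"])

print(toy)
```

Missing values stay missing after scaling. If every value in a column were identical, `hi - lo` would be zero and the result would be `NaN`, so that case would need handling separately.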

Make sure it looks ok.

In [None]:
df[['Country_Name', 'GDP_scaled', 'pop_scaled']]

Hmm, looks good.

In [None]:
df.head()

We should get an overview of our variables.

In [None]:
df.?()

A few observations:

* The countries we are looking at have had 100% access to electricity. That variable is not useful in this subset but may be for developing countries.
* Population growth looks interesting. It was negative for at least one year.
* There seems to be some difference in progression to secondary school.

We should describe the data for each country so we can see the differences.

In [None]:
df.groupby(?).describe()

It is too wide. We can select specific variables.

In [None]:
df[['Country_Name', 'Progression_to_secondary_school_(%)']].groupby('Country_Name').describe()

Italy is the country with a minimum of 94%, whereas the UK has no data.
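When `describe()` returns more statistics than we need, `agg` lets us pick just a few. The sketch below uses made-up numbers (only the 94% minimum echoes the text above; the rest is hypothetical) and shows how a country with no data appears: a count of 0 and `NaN` for everything else.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the UK rows are entirely missing
toy = pd.DataFrame({
    "Country_Name": ["Italy", "Italy", "Italy", "UK", "UK"],
    "Progression_to_secondary_school_(%)": [94.0, 99.0, 100.0, np.nan, np.nan],
})

# Ask only for the statistics we care about
summary = (
    toy.groupby("Country_Name")["Progression_to_secondary_school_(%)"]
    .agg(["count", "min", "max"])
)

print(summary)
```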

Narrowing in on GDP per capita.

In [None]:
df['GDP_per_capita_(current_US$)'].?()

By country.

In [None]:
df[['Country_Name','GDP_per_capita_(current_US$)']].groupby('Country_Name').describe()

Germany has the highest GDP per capita, but we also have fewer data points for it. Why is this?
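One way to investigate a question like this is to check how many non-missing values each country has and in which year its data begins. The frame below is entirely made up to illustrate the approach (including the years and values); it says nothing about the real figures.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one country starts reporting later than the other
toy = pd.DataFrame({
    "Country_Name": ["Germany"] * 4 + ["Italy"] * 4,
    "year": [1960, 1961, 1962, 1963] * 2,
    "gdp": [np.nan, np.nan, 10.0, 12.0, 5.0, 6.0, 7.0, 8.0],
})

# Number of non-missing GDP values per country
counts = toy.groupby("Country_Name")["gdp"].count()

# First year with a non-missing GDP value per country
first_year = toy.dropna(subset=["gdp"]).groupby("Country_Name")["year"].min()

print(counts)
print(first_year)
```

If one country's series starts later, its summary statistics are computed over later years only, which is worth keeping in mind when comparing means across countries.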

Plotting GDP by year will give us a sense of our GDP data.

In [None]:
df.plot(x='year', y="GDP_per_capita_(current_US$)")

Split by country.

In [None]:
df.groupby('Country_Name').plot(?='year', y="GDP_per_capita_(current_US$)")

How about life expectancy?

In [None]:
df.groupby('Country_Name').plot(?='year', ?="Life_expectancy_at_birth,_total_(years)")

Both increase over time. What if we ignore year and plot one against the other?

In [None]:
df.groupby('Country_Name').plot(x="GDP_per_capita_(current_US$)", y="Life_expectancy_at_birth,_total_(years)", kind='scatter')

We should plot our scaled variables against one another in seaborn.

First, let us remind ourselves what our data looks like.

In [None]:
df.head()

Then we can melt our data.

In [None]:
df_long=pd.?(df, id_var?=['Country_Name', 'year'], value_vars=['GDP_scaled', 'pop_scaled'])
df_long

Then we can plot each by year.

In [None]:
import seaborn as sns
sns.relplot(x='year', y='value', hue=?, data=df_long)

Ah, there are different countries in the data. We should include that in our plot.

In [None]:
sns.relplot(x='year', y=?, style='variable', hue='Country_Name', data=df_long)

Do play around with these plots by changing the data variable names. Are there any interesting relationships between the different data variables?

In Exercise 2 we look at what else seaborn can do for us.