# Transforming and Combining Data

In the previous module you worked on a dataset that combined two different World Health
Organization datasets: population and the number of deaths due to tuberculosis.
They could be combined because they share a common attribute: the countries. In this
module you will learn the techniques behind the creation of such a combined dataset.

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd

## Life expectancy project
In this module we will see (literally, via a chart) whether life expectancy tends to be
longer in richer countries.

Richer countries can afford to spend more on healthcare and on road safety, for example,
to reduce mortality. On the other hand, richer countries may have less healthy lifestyles.
The World Bank provides loans and grants to governments of middle- and low-income
countries to help reduce poverty. As part of their work, the World Bank has put together
hundreds of datasets on a range of issues, such as health, education, economy, energy
and the effectiveness of aid in different countries. We will use two of their datasets, which
you can see online by following the links below. You do not need to download the
datasets.

One dataset lists the gross domestic product (GDP) for each country, in United States
dollars and cents; the other lists the life expectancy, in years, for each country. The latest
life expectancy data I can access is for 2013, so I will take the GDP for that year too.
The disadvantage of using the GDP and the life expectancy values for the same year is
that they do not account for the time it takes for a country’s wealth to have an effect on
lifestyle, healthcare and other factors influencing life expectancy.

While it is useful to have all GDPs in a common currency to compare different countries, it
doesn’t make much sense to report the GDP of a whole country to a supposed precision
of a US cent. I noted that the value for the USA is a round number, but it is not for other
countries. This is likely due in part to the conversion of local currencies to US dollars. It
makes more sense to report the GDP values in a larger unit, e.g. millions of dollars.
Moreover, for those who don’t live in a country using the US dollar as the official currency,
it’s probably easier to understand GDP values in their own local currency.
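
As a minimal sketch of the change of unit suggested above, the snippet below expresses one GDP value in millions of US dollars, rounded to the nearest million. The figure is the 2013 UK value from the World Bank data used in this module; the variable names are my own choice for illustration.

```python
# 2013 GDP of the UK in US dollars (World Bank figure used in this module).
gdp_usd = 2678454886796.7

# Divide by one million and round to the nearest whole million.
gdp_millions = round(gdp_usd / 1e6)

print(gdp_millions)
```

Rounding to the nearest million drops the spurious cents while keeping the values comparable across countries.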

To sum up, the project for this module will transform currency values and combine GDP and life
expectancy data.

Note that the combination is made simple by the common country names in the two
datasets, but in general care has to be taken that the common attribute really means the
same thing. For example, if you were combining two datasets on a common
unemployment attribute, you must be sure that it was obtained in the same way as there
are various ways of measuring unemployment.

I’m aware that the GDP is a crude way of comparing wealth across nations. For example,
it doesn’t take population or the cost of living into account. Some of this module’s tasks
will ask you to add the population data. Think of other ways to improve the analysis
method, of other conversions that might be needed, and of other ways to investigate life
expectancy factors.

**Links:**

- [GDP in current US dollars](http://data.worldbank.org/indicator/NY.GDP.MKTP.CD)

- [Life expectancy at birth](http://data.worldbank.org/indicator/SP.DYN.LE00.IN)

Note that the dataset can be found in the folder

## Creating the data

We won’t work with the full data yet. Instead I will create small tables, to better illustrate this
module’s concepts and techniques.
Small tables make it easier to see what is going on and to create specific data
combination and transformation scenarios that test the code.

There are many ways of creating tables in pandas. One of the simplest is to define the
rows as a `list`, with the first element of the list being the first row, the second element being
the second row, etc.

Each row of a table has multiple cells, one for each column. The obvious way is to
represent each row as a list too, the first element of the list being the cell in the first
column, the second element corresponding to the second column, etc. To sum up, the
table is represented as a list of lists.

Here is a table of the 2013 GDP of some countries, in US dollars:

In [None]:
table = [
  ['UK', 2678454886796.7],    # 1st row
  ['USA', 16768100000000.0],  # 2nd row
  ['China', 9240270452047.0], # and so on...
  ['Brazil', 2245673032353.8],
  ['South Africa', 366057913367.1]
]

In [None]:
headings = ['Country', 'GDP (US$)']
gdp = pd.DataFrame(columns=headings, data=table)
gdp

To create a dataframe, I use a pandas function appropriately called `DataFrame()`. I have
to give it two arguments: the names of the columns and the data itself. The column names
are given as a list of strings, the first string being the first column name, etc.
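
To check how the two arguments map onto the resulting dataframe, here is a small sketch with a made-up two-row table. The names `rows`, `column_names` and `df`, and the values themselves, are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical two-row table, for illustration only.
rows = [
    ['A', 1.5],  # 1st row
    ['B', 2.5],  # 2nd row
]
column_names = ['Label', 'Value']

df = pd.DataFrame(columns=column_names, data=rows)

print(df.shape)              # (number of rows, number of columns)
print(list(df.columns))      # column names, in the order given
print(df.loc[0, 'Value'])    # cell in the first row of the 'Value' column
```

Each inner list becomes one row, and the position of a value within its inner list decides which column it lands in.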

Note that pandas shows large numbers in scientific notation, where, for example, 3e+12
means 3×10^12, i.e. a 3 followed by 12 zeros.
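
The notation can be checked directly in Python, as in the short sketch below. The last part is optional and rests on an assumption about preference: if you would rather see the numbers written out in full, the pandas display option `display.float_format` can be set temporarily. The variable names are hypothetical.

```python
import pandas as pd

# Python reads 3e+12 as the floating-point number 3 × 10^12.
value = 3e+12
is_same = (value == 3 * 10**12)   # compare with a 3 followed by 12 zeros
written_out = f'{value:,.0f}'     # the same number with thousands separators

print(is_same)
print(written_out)

# Optional: temporarily display floats without scientific notation.
with pd.option_context('display.float_format', '{:,.1f}'.format):
    print(pd.Series([2678454886796.7, 16768100000000.0]))
```

The option only changes how the numbers are displayed, not the values that are stored.
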

I define a similar table for the life expectancy of those born in 2013, based on the World Bank data.

In [None]:
headings = ['Country name', 'Life expectancy (years)']
table = [
  ['China', 75],
  ['Russia', 71],
  ['United States', 79],
  ['India', 66],
  ['United Kingdom', 81]
]
life = pd.DataFrame(columns=headings, data=table)
life
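
Before combining tables on a common attribute, it may help to check how far the values of that attribute actually match. The sketch below compares the country names of the two small tables above using plain Python sets. This is just one possible check, and the variable names are my own.

```python
# Country names copied from the two small tables above.
gdp_countries = ['UK', 'USA', 'China', 'Brazil', 'South Africa']
life_countries = ['China', 'Russia', 'United States', 'India', 'United Kingdom']

# Names spelled identically in both tables.
common = sorted(set(gdp_countries) & set(life_countries))

# Names in the GDP table with no identically spelled match in the other table.
unmatched = sorted(set(gdp_countries) - set(life_countries))

print(common)
print(unmatched)
```

Note that 'UK' and 'United Kingdom' refer to the same country but are different strings, so a comparison based on exact spelling treats them as different values.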

### Task

Create a dataframe with all five BRICS countries and their population, in thousands of inhabitants, in 2013. The values (given in the first module notebook) are: Brazil 200362, Russian Federation 142834, India 1252140, China 1393337, South Africa 52776.