# Transforming and Combining Data

In the previous module you worked on a dataset that combined two different World Health
Organization datasets: population and the number of deaths due to tuberculosis.
They could be combined because they share a common attribute: the countries. This
week you will learn the techniques behind the creation of such a combined dataset.

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd

In [None]:
table = [
  ['UK', 2678454886796.7],    # 1st row
  ['USA', 16768100000000.0],  # 2nd row
  ['China', 9240270452047.0], # and so on...
  ['Brazil', 2245673032353.8],
  ['South Africa', 366057913367.1]
]

In [None]:
headings = ['Country', 'GDP (US$)']
gdp = pd.DataFrame(columns=headings, data=table)

In [None]:
headings = ['Country name', 'Life expectancy (years)']
table = [
  ['China', 75],
  ['Russia', 71],  
  ['United States', 79],
  ['India', 66],
  ['United Kingdom', 81]
]
life = pd.DataFrame(columns=headings, data=table)

In [None]:
def roundToMillions(value):
    """Return the value expressed in millions, rounded to the nearest integer."""
    return round(value / 1000000)

In [None]:
def usdToGBP(usd):
    """Convert an amount in US dollars to British pounds."""
    return usd / 1.564768  # average rate during 2013

In [None]:
def expandCountry(name):
    """Return the full country name for 'UK' and 'USA'; other names are unchanged."""
    if name == 'UK':
        return 'United Kingdom'
    elif name == 'USA':
        return 'United States'
    else:
        return name

## Applying functions

Having coded the three data conversion functions, we can now apply them to the GDP table.
We first select the relevant column:

In [None]:
column = gdp['Country']
column

Next, we can use the column method `apply()`, which applies a given function to each cell in
the column and returns a new column in which each cell is the conversion of the
corresponding original cell:

In [None]:
column.apply(expandCountry)
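
As a rough illustration of what `apply()` does, the sketch below compares it with calling the function on each cell by hand. It rebuilds a shortened column and repeats the function so that it can run on its own; the names `countries`, `applied` and `by_hand` are chosen for this illustration only.

```python
import pandas as pd

def expandCountry(name):
    if name == 'UK':
        return 'United Kingdom'
    elif name == 'USA':
        return 'United States'
    else:
        return name

# A shortened stand-in for the 'Country' column (illustrative only).
countries = pd.Series(['UK', 'USA', 'China'])

# apply() calls the function once per cell and collects the results in a new column.
applied = countries.apply(expandCountry)

# The same conversion written out by hand, one cell at a time.
by_hand = [expandCountry(name) for name in countries]

print(applied.tolist())    # ['United Kingdom', 'United States', 'China']
print(countries.tolist())  # ['UK', 'USA', 'China'] : the original column is unchanged
```

Note that `apply()` does not modify the original column; the converted values only exist in the column it returns.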

Finally, we can add that new column to the dataframe, using a new column heading.
In summary, a one-argument function can be applied to each cell in a column, in order to obtain a new column with the converted values.

In [None]:
gdp['Country name'] = gdp['Country'].apply(expandCountry)
gdp

In a similar way, we can convert the US dollars to British pounds, then round to the nearest
million, and store the result in a new column. We could apply the conversion and rounding
functions in two separate statements, but using method chaining, we can apply both
functions in a single line of code. This is possible because the column returned by the first
call of `apply()` is the context for the second call of `apply()`.

Given that `apply()` is a column method that returns a column, it can be **chained**, to apply several conversions in one go.

In [None]:
gdp['GDP (£m)'] = gdp['GDP (US$)'].apply(usdToGBP).apply(roundToMillions)
gdp
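
To see that chaining is only a shorthand, the following sketch compares the chained form with the two-statement form on a shortened column. It repeats the two functions so that it is self-contained; the names `usd`, `gbp`, `two_steps` and `chained` are introduced here for illustration.

```python
import pandas as pd

def usdToGBP(usd):
    return usd / 1.564768  # average rate during 2013

def roundToMillions(value):
    return round(value / 1000000)

# A shortened stand-in for the 'GDP (US$)' column (first two rows only).
usd = pd.Series([2678454886796.7, 16768100000000.0])

# Two separate statements, keeping the intermediate column in a variable.
gbp = usd.apply(usdToGBP)
two_steps = gbp.apply(roundToMillions)

# One chained statement: the column returned by the first apply()
# is the context of the second apply().
chained = usd.apply(usdToGBP).apply(roundToMillions)

print(chained.tolist() == two_steps.tolist())  # True
```

The chained form avoids the intermediate variable, which is convenient when the intermediate column is not needed afterwards.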

Applying the conversion functions in a different order can lead to a different result, because the rounding then happens at a different stage.

In [None]:
gdp['GDP (US$)'].apply(roundToMillions).apply(usdToGBP).apply(round)
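
The effect of the order is easiest to see on a small amount. The sketch below uses a made-up value of 2.4 million dollars (not taken from the GDP table) and repeats the two functions so that it runs on its own.

```python
import pandas as pd

def usdToGBP(usd):
    return usd / 1.564768  # average rate during 2013

def roundToMillions(value):
    return round(value / 1000000)

# A hypothetical amount, chosen only to make the difference visible.
usd = pd.Series([2400000.0])

# Convert first, round last: about 1.53 million pounds rounds to 2.
convert_then_round = usd.apply(usdToGBP).apply(roundToMillions)

# Round first: 2.4 million dollars becomes 2, which converts to
# about 1.28 and finally rounds to 1.
round_then_convert = usd.apply(roundToMillions).apply(usdToGBP).apply(round)

print(convert_then_round.tolist())  # [2]
print(round_then_convert.tolist())  # [1]
```

Rounding early discards information that the later conversion cannot recover, so it is usually safer to round as the last step.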

Now it’s just a matter of selecting the two new columns, as the original ones are no longer
needed.

In [None]:
headings = ['Country name', 'GDP (£m)']
gdp = gdp[headings]
gdp

Note that method chaining only works if the methods chained return the same type of
value as their context, in the same way that you can chain multiple arithmetic operators
(e.g. `3+4-5`) because each one takes two numbers and returns a number that is used by
the next operator in the chain. In this course, methods only have two possible contexts,
columns and dataframes, so you can either chain column methods that return a single
column (that is, a `Series`), like `apply()`, or dataframe methods that return dataframes.
For example, `gdp.head(4).tail(2)` is a dataframe just with China and Brazil, i.e. the
last two of the first four rows of the dataframe shown above. You’ll see further examples of
chaining (and an easier way to select multiple rows) in later modules.
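
The `head()` and `tail()` example can be tried on its own with the sketch below. As a simplification, it assumes a dataframe that has only the 'Country name' column, with the rows in the same order as above; `subset` is a name chosen for this illustration.

```python
import pandas as pd

# Simplified stand-in for the gdp dataframe: country names only.
gdp = pd.DataFrame({
    'Country name': ['United Kingdom', 'United States', 'China',
                     'Brazil', 'South Africa']
})

# head(4) returns a dataframe with the first four rows;
# tail(2) then returns a dataframe with the last two of those.
subset = gdp.head(4).tail(2)

print(type(subset).__name__)            # DataFrame
print(subset['Country name'].tolist())  # ['China', 'Brazil']
```

Both methods take a dataframe as context and return a dataframe, which is why they can be chained in either order.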

### Task

Take the dataframe you created earlier, and apply the rounding function you wrote to its population column.