# Transforming and Combining Data

In the previous module you worked on a dataset that combined two different World Health
Organization datasets: population and the number of deaths due to tuberculosis.
They could be combined because they share a common attribute: the countries. This
week you will learn the techniques behind the creation of such a combined dataset.

In [None]:
import pandas as pd

In [None]:
table = [
  ['UK', 2678454886796.7],    # 1st row
  ['USA', 16768100000000.0],  # 2nd row
  ['China', 9240270452047.0], # and so on...
  ['Brazil', 2245673032353.8],
  ['South Africa', 366057913367.1]
]

In [None]:
headings = ['Country', 'GDP (US$)']
gdp = pd.DataFrame(columns=headings, data=table)

In [None]:
headings = ['Country name', 'Life expectancy (years)']
table = [
  ['China', 75],
  ['Russia', 71],  
  ['United States', 79],
  ['India', 66],
  ['United Kingdom', 81]
]
life = pd.DataFrame(columns=headings, data=table)

In [None]:
def roundToMillions (value):
    return round(value / 1000000)

In [None]:
def usdToGBP (usd):
    return usd / 1.564768 # average rate during 2013 

In [None]:
def expandCountry (name):
    if name == 'UK':
        return 'United Kingdom'
    elif name == 'USA':
        return 'United States'
    else:
        return name

In [None]:
def expandCountry (name):
    if name == 'UK':
        name = 'United Kingdom'
    if name == 'USA':
        name = 'United States'
    return name

In [None]:
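# Add a column of full country names so the table can later be joined with `life`.
gdp['Country name'] = gdp['Country'].apply(expandCountry)
# Convert to pounds first, then round to millions.
gdp['GDP (£m)'] = gdp['GDP (US$)'].apply(usdToGBP).apply(roundToMillions)
# Alternative order (round first, then convert); shown for comparison, result not stored.
gdp['GDP (US$)'].apply(roundToMillions).apply(usdToGBP).apply(round)
# Keep only the two columns needed for the join.
headings = ['Country name', 'GDP (£m)']
gdp = gdp[headings]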

## Joining left, right and centre

At this point, both tables have a common column, 'Country name', with fully expanded country names.

Let’s take stock for a moment. There’s the original, unchanged table (with full country
names) about the life expectancy:

In [None]:
life

… and a table with the GDP in millions of pounds and also full country names.

In [None]:
gdp

Both tables have a common column with a common name (‘Country name’). We can **join** the
two tables on that common column, using the **merge()** function. Merging puts all columns
of the two tables together, without duplicating the common column, and joins
any rows that have the same value in the common column.

There are four possible ways of joining, depending on which rows we want to include in the
resulting table. If we want to include only those countries appearing in the GDP table, we
use a **left join**.

A **left join** takes the rows of the left table and adds the columns of the right table. 

In [None]:
pd.merge(gdp, life, on='Country name', how='left')

The first two arguments are the tables to be merged, with the first table being called the
‘left’ table and the second being the ‘right’ table. The **on** argument is the name of the
common column, i.e. both tables must have a column with that name. The **how** argument
states we want a **left join**, i.e. the resulting rows are dictated by the left (GDP) table. You
can easily see that India and Russia, which appear only in the right (expectancy) table,
don’t show up in the result. You can also see that Brazil and South Africa, which appear
only in the left table, have an undefined life expectancy. (Remember that ‘NaN’ stands for
‘not a number’.)
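To check which table each row of a join came from, `merge()` accepts an `indicator=True` argument, which adds a `_merge` column to the result. The following is a minimal sketch: `gdp_demo`, `life_demo` and `left_demo` are illustrative names, not the notebook's own variables. The tables reuse the country names and values given earlier, and keep the GDP in US dollars for simplicity.

```python
import pandas as pd

# Illustrative stand-ins for the two tables (values copied from the tables above).
gdp_demo = pd.DataFrame({
    'Country name': ['United Kingdom', 'United States', 'China',
                     'Brazil', 'South Africa'],
    'GDP (US$)': [2678454886796.7, 16768100000000.0, 9240270452047.0,
                  2245673032353.8, 366057913367.1],
})
life_demo = pd.DataFrame({
    'Country name': ['China', 'Russia', 'United States', 'India', 'United Kingdom'],
    'Life expectancy (years)': [75, 71, 79, 66, 81],
})

# indicator=True adds a '_merge' column; in a left join its values are
# 'both' (matched in the right table) or 'left_only' (no match).
left_demo = pd.merge(gdp_demo, life_demo, on='Country name',
                     how='left', indicator=True)

# Rows marked 'left_only' are the ones whose life expectancy is NaN.
unmatched = left_demo[left_demo['_merge'] == 'left_only']['Country name'].tolist()
print(unmatched)
```

The unmatched countries should be the ones shown with an undefined life expectancy in the left join above.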

A **right join** will instead take the rows from the right table, and add the columns of the left
table. Therefore, countries not appearing in the left table will have undefined values for the
left table’s columns.

A **right join** takes the rows from the right table, and adds the columns of the left table.

In [None]:
pd.merge(gdp, life, on='Country name', how='right')

The third possibility is an **outer join** which takes all countries, i.e. whether they are in the
left or right table. The result has all the rows of the left and right joins.

An **outer join** takes the union of the rows, i.e. it has all the rows of the left and right joins.

In [None]:
pd.merge(gdp, life, on='Country name', how='outer')

The last possibility is an **inner join** which takes only those countries common to both
tables, i.e. for which we know both the GDP and the life expectancy. That’s the join we want, to
avoid any undefined values:

An **inner join** takes the intersection of the rows (i.e. the common rows) of the left and right joins.

In [None]:
gdpVsLife = pd.merge(gdp, life, on='Country name', how='inner')
gdpVsLife
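One quick way to compare the four kinds of join is to count the rows each one produces. The sketch below assumes two small stand-in tables (`gdp_demo` and `life_demo`, illustrative names only) that reuse the country names and values from the earlier tables.

```python
import pandas as pd

# Illustrative stand-ins for the two tables (values copied from the tables above).
gdp_demo = pd.DataFrame({
    'Country name': ['United Kingdom', 'United States', 'China',
                     'Brazil', 'South Africa'],
    'GDP (US$)': [2678454886796.7, 16768100000000.0, 9240270452047.0,
                  2245673032353.8, 366057913367.1],
})
life_demo = pd.DataFrame({
    'Country name': ['China', 'Russia', 'United States', 'India', 'United Kingdom'],
    'Life expectancy (years)': [75, 71, 79, 66, 81],
})

# Perform each kind of join and record how many rows it produces.
row_counts = {}
for how in ['left', 'right', 'outer', 'inner']:
    joined = pd.merge(gdp_demo, life_demo, on='Country name', how=how)
    row_counts[how] = len(joined)

print(row_counts)

# Only the inner join is free of undefined (NaN) values.
inner_demo = pd.merge(gdp_demo, life_demo, on='Country name', how='inner')
print(inner_demo.isna().any().any())
```

The outer join has the most rows (every country from either table) and the inner join the fewest (only the countries present in both).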

### Task

Join your population dataframe from the previous task with `gdpVsLife` in four different ways, and note the differences.

In [None]:
POP

In [None]:
gdpVsLife

In [None]:
pd.merge(POP, gdpVsLife, on='Country name', how='left')

In [None]:
pd.merge(POP, gdpVsLife, on='Country name', how='right')

In [None]:
pd.merge(POP, gdpVsLife, on='Country name', how='outer')

In [None]:
POPVsgdpVsLife = pd.merge(POP, gdpVsLife, on='Country name', how='inner')
POPVsgdpVsLife

## Constant variables

You may have noticed that the same column names appear over and over in the code.

If, someday, we decide one of the new columns should be called `'GDP (million GBP)'`
instead of `'GDP (£m)'` to make clear which currency is meant (because various countries
use the pound symbol), we need to change the string in every line of code where it occurs.

Laziness is the mother of invention. If we assign the string to a variable and then use the
variable everywhere instead of the string, whenever we wish to change the string, we only
have to edit one line of code, where it’s assigned to the variable. A second advantage of
using names instead of values is that we can use the name completion facility of Jupyter
notebooks by pressing **‘TAB’**. Writing code becomes much faster…

gdpInGbp = 'GDP (million GBP)'
gdpInUsd = 'GDP (US$)'
country = 'Country name'
gdp[gdpInGbp] = gdp[gdpInUsd].apply(usdToGBP)
headings = [country, gdpInGbp]
gdp = gdp[headings]

Such variables are meant to be assigned once. They are called **constants**, because their
value never changes. However, if someone else takes our code and wishes to adapt and
extend it, they may not realise those variables are supposed to remain constant. Even we
may forget it and try to assign a new value further down in the code! To help prevent such
slip-ups the Python convention is to write names of constants in uppercase letters, with
words separated by underscores. Thus, any further assignment to a variable in uppercase
will ring an alarm bell (in your head; the computer remains silent).
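The uppercase naming is only a convention: Python itself does not stop a constant from being reassigned. A minimal sketch, using the column name from the example above:

```python
# Define a constant, following the uppercase naming convention.
GDP_GBP = 'GDP (£m)'

# Python does not enforce the convention: this second assignment
# runs without any error or warning.
GDP_GBP = 'GDP (million GBP)'

print(GDP_GBP)
```

It is therefore up to the reader of the code to treat a second assignment to an uppercase name as a likely mistake.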

Constants are used to represent fixed values (e.g. strings and numbers) that occur frequently in a program. Constant names are conventionally written in uppercase, with underscores to separate multiple words.

In [None]:
GDP_USD = 'GDP (US$)'
GDP_GBP = 'GDP (£m)'
GDP_USD

In [None]:
COUNTRY = 'Country name'
gdp[GDP_GBP] = gdp[GDP_USD].apply(usdToGBP)
headings = [COUNTRY, GDP_GBP]
gdp = gdp[headings]

Using constants is not just a matter of laziness. There are various advantages. First,
constants stand out in the code.
Second, when making changes to the repeated values throughout the code, it’s easy to
miss an occurrence. Using constants means the code is always consistent throughout.
Third, the name of the constant can help clarify what the value means. For example,
instead of using the number 1995 throughout the code, define a constant that makes clear
whether it’s a year, the cubic centimetres of a car engine or something else.

To sum up, using constants makes the code clearer, easier to change, and less prone to
silly (but hard to find) mistakes due to inconsistent values.
Any value can be defined as a constant, whether it’s a string, a number or even a
dataframe. For example, you could store the data you have loaded from the file into a
constant, as a reminder to not change the original data. In the rest of the module, we will use
constants mainly for the column names.
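As a sketch of how column-name constants fit together with `merge()`, the example below joins two small stand-in tables. `COUNTRY` and `GDP_USD` follow the names used above; `LIFE`, `gdp_demo`, `life_demo` and `joined` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Column names defined once, as constants.
COUNTRY = 'Country name'
GDP_USD = 'GDP (US$)'
LIFE = 'Life expectancy (years)'  # hypothetical constant name

# Illustrative stand-ins for the two tables (values copied from the tables above).
gdp_demo = pd.DataFrame({
    COUNTRY: ['United Kingdom', 'United States', 'China',
              'Brazil', 'South Africa'],
    GDP_USD: [2678454886796.7, 16768100000000.0, 9240270452047.0,
              2245673032353.8, 366057913367.1],
})
life_demo = pd.DataFrame({
    COUNTRY: ['China', 'Russia', 'United States', 'India', 'United Kingdom'],
    LIFE: [75, 71, 79, 66, 81],
})

# The constants are used wherever a column name is needed, so renaming
# a column later means editing a single line.
joined = pd.merge(gdp_demo, life_demo, on=COUNTRY, how='inner')
print(joined[[COUNTRY, LIFE]])
```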

### Task

Look through the code you wrote so far, and rewrite it using constants, when appropriate.