In [1]:
import pandas as pd

# Augmenting Datasets

This week we will be covering augmenting datasets. 

![](pandas_cartoon.jpeg)

## What does that mean?

One dataset is good. Two datasets is better. One superpower that Pandas gives you is the ability to combine datasets together. 

There are several ways in which two tables can be joined. These are most easily visualised using Venn diagrams.

![](venn.png)

For example, if you have a dataset of inpatient stays and a dataset of referrals, you can combine the two to find the referral source of every inpatient stay in your data.

| Patient_id   | Referral Source | Referral Consultant |
|--------------|-----------------|--------------------|
| 1            | Cardio          | Geoff              |
| 2            | GP              | Jeff               |
| 5            | GP              | Goff               |

<br>

| Patient_id   | Inpatient Start  | Inpatient End | Length of Stay |
|--------------|------------------|---------------|----------------|
| 1            | 2021-10-15       | 2021-10-19    | 4              |
| 2            | 2021-01-15       | 2021-02-15    | 31             |
| 3            | 2021-01-15       | 2021-03-15    | 62             |
| 4            | 2021-01-15       | 2021-02-12    | 28             |

These two datasets can be combined by joining on the columns they have in common. In this case, that is the patient ID.

### Inner Join

An inner join keeps only the patients that appear in both tables.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |

### Left Join 

A left join keeps every row of the left table (here the referrals) and fills in NA where the right table has no match.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 5            | GP              | Goff               | NA              | NA            | NA             |


### Right Join 

A right join keeps every row of the right table (here the inpatient stays) and fills in NA where the left table has no match.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                 | 2021-01-15      | 2021-03-15    | 62             |
| 4            | NA              | NA                 | 2021-01-15      | 2021-02-12    | 28             |


### Outer Join 

An outer join keeps every row from both tables and fills in NA wherever either side has no match.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                 | 2021-01-15      | 2021-03-15    | 62             |
| 4            | NA              | NA                 | 2021-01-15      | 2021-02-12    | 28             |
| 5            | GP              | Goff               | NA              | NA            | NA             |




Let's see how we code that!

In [2]:
df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})

df_patient

|    | id | name  | ethnicity |
|----|----|-------|-----------|
| 0  | 1  | Tom   | A         |
| 1  | 2  | Jenny | H         |
| 2  | 3  | James | M         |
| 3  | 4  | Dan   | A         |


In [3]:
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

df_info

|    | id | age | sex |
|----|----|-----|-----|
| 0  | 2  | 31  | F   |
| 1  | 3  | 20  | M   |
| 2  | 4  | 40  | M   |
| 3  | 5  | 70  | F   |


The syntax for a join is:

pd.merge(table_1, table_2, on=joining_columns, how=join_type)

The default join type is an inner join (`how='inner'`). If you don't specify the joining columns, Pandas joins on all the columns that the two tables have in common.
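
As a minimal sketch of that syntax, the block below joins the `df_patient` and `df_info` tables defined above using each join type. The result names (`inner`, `left`, `right`, `outer`) are chosen here purely for illustration.

```python
import pandas as pd

# Same toy tables as defined earlier in this lesson
df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# Defaults: inner join on the shared column(s), which here is just 'id'
inner = pd.merge(df_patient, df_info)

# Spelling out the joining column and the join type
left = pd.merge(df_patient, df_info, on='id', how='left')    # keeps ids 1-4
right = pd.merge(df_patient, df_info, on='id', how='right')  # keeps ids 2-5
outer = pd.merge(df_patient, df_info, on='id', how='outer')  # keeps ids 1-5

print(inner)
print(outer)
```

Rows without a match on the other side get `NaN` in the columns that come from the other table.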

## What if my columns don't have the same name?

In [4]:
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

What would we expect this to look like with each method?
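
One way to handle differently named key columns is the `left_on` and `right_on` arguments of `pd.merge`. The sketch below assumes we want to match `id` in `df_patient` against `patient_id` in `df_info_2`; the variable name `merged` is only illustrative.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# Name the key column on each side explicitly
merged = pd.merge(
    df_patient,
    df_info_2,
    left_on='id',
    right_on='patient_id',
    how='inner',
)

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns='patient_id')
print(merged)
```

An alternative is to rename the column first, for example with `df_info_2.rename(columns={'patient_id': 'id'})`, and then merge on `id` as before.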

## What if I don't want to lose data which does not have info?

In [5]:
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'length_of_stay': [31, 21, 20, 40],
    'date': pd.date_range('2019-02-24', periods=4, freq='D')
})

df_stay

|    | id | length_of_stay | date       |
|----|----|----------------|------------|
| 0  | 2  | 31             | 2019-02-24 |
| 1  | 2  | 21             | 2019-02-25 |
| 2  | 4  | 20             | 2019-02-26 |
| 3  | 4  | 40             | 2019-02-27 |
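
A left join is one way to keep every patient even when they have no stay recorded. The sketch below assumes `df_patient` is the table we want to keep in full; the `indicator=True` argument adds a `_merge` column showing where each row came from. The name `stays` is illustrative only.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'length_of_stay': [31, 21, 20, 40],
    'date': pd.date_range('2019-02-24', periods=4, freq='D'),
})

# how='left' keeps every patient; patients with several stays appear once per stay
stays = pd.merge(df_patient, df_stay, on='id', how='left', indicator=True)
print(stays)

# Patients with no matching stay are flagged as 'left_only'
no_stay = stays[stays['_merge'] == 'left_only']
print(no_stay[['id', 'name']])
```

Note that `length_of_stay` becomes a float column in the result, because the missing values are stored as `NaN`.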


## Joining on the index
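
As a hedged sketch of what joining on the index can look like, the block below first moves `id` into the index of each table (an assumption made for this example) and then joins on those row labels, either with `pd.merge` or with `DataFrame.join`.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# Use 'id' as the row labels instead of as a regular column
patients = df_patient.set_index('id')
info = df_info.set_index('id')

# pd.merge can match on the index of each table rather than on columns
by_index = pd.merge(patients, info, left_index=True, right_index=True, how='inner')

# DataFrame.join matches on the index by default and does a left join by default
joined = patients.join(info, how='left')

print(by_index)
print(joined)
```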

## Exercise

Can you use the referrals data that we used last week and the CSV of CCG data to add the CCG names to the referrals data?

In [6]:
referrals = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/data/referrals_oct19_dec20.csv')
ccgs = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/lessons/augmenting_datasets/ccg_2019.csv')

## Joining dataframes and aggregating data by a new column

Using the referrals data, we want to aggregate the referral numbers by ICS name.

Bring in the CCG to ICS mapping:

In [8]:
ccg_ics_mapping = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/lessons/augmenting_datasets/ccg_ics_mapping.csv')
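
The general pattern is a join followed by a group-by. The sketch below uses small made-up tables rather than the real files, and the column names `ccg_code`, `referrals` and `ics_name` are assumptions; check the actual column names in the CSVs before adapting it.

```python
import pandas as pd

# Toy stand-ins for the real data; column names are hypothetical
toy_referrals = pd.DataFrame({
    'ccg_code': ['A1', 'A1', 'B2', 'C3'],
    'referrals': [10, 5, 7, 3],
})
toy_mapping = pd.DataFrame({
    'ccg_code': ['A1', 'B2', 'C3'],
    'ics_name': ['ICS North', 'ICS North', 'ICS South'],
})

# Step 1: attach the ICS name to every referral row
with_ics = pd.merge(toy_referrals, toy_mapping, on='ccg_code', how='left')

# Step 2: sum the referrals within each ICS
by_ics = with_ics.groupby('ics_name', as_index=False)['referrals'].sum()
print(by_ics)
```

A left join is used here so that referral rows without a mapping are not silently dropped before aggregating.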


## Maps

Maps are used for a similar purpose but often for a single column. They are a way of writing a translation dictionary for a coded column.

For example, just as the join above translated CCG codes into names, you could translate the ethnicity codes by making a dict like this:

```python
ethnicity_map = {
    'A': 'White British',
    'M': 'Black or Black British - Caribbean',
    'H': 'Asian or Asian British - Indian',
}
```

You can then apply this map by using the `.map()` method.

```python
df['mapped_column'] = df['to_be_mapped_column'].map(map_dictionary)
```
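
Putting the two pieces together, here is a minimal sketch that applies the ethnicity dict to the `df_patient` table from earlier. The new column name `ethnicity_description` is chosen here for illustration.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H', 'M', 'A'],
})

ethnicity_map = {
    'A': 'White British',
    'M': 'Black or Black British - Caribbean',
    'H': 'Asian or Asian British - Indian',
}

# Look up each code in the dict and store the result in a new column
df_patient['ethnicity_description'] = df_patient['ethnicity'].map(ethnicity_map)
print(df_patient)

# Codes that are missing from the dict come back as NaN
unknown = pd.Series(['A', 'Z']).map(ethnicity_map)
print(unknown)
```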


## Exercise

Can you map the priority types to the following target waiting times (weeks):
 - 2 Weeks Wait: 2
 - Urgent: 4
 - Routine: 18
 
(Please note that these are not real numbers; do not use them in your analysis.)

## Apply

You can also do something similar by defining your own functions and applying them to each element or row in a dataframe.

Applying a function to a column returns a new column, which you can assign to the dataframe using the following syntax:

```python
df['new_column'] = df['old_column'].apply(function_name)
```

Note! You only have to pass the callable (i.e. the function name) rather than actually calling the function.

The callable will take the column values one by one and use them as arguments for the function.

You don't need any brackets after the function name!

A very silly example:
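
The sketch below is one possible silly example. The `shout` and `describe` functions and the toy dataframe are invented here for illustration and are not taken from the original lesson.

```python
import pandas as pd

# Toy data, made up for this example
df = pd.DataFrame({
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'age': [25, 31, 20, 40],
})

def shout(name):
    """Return the name in capitals with an exclamation mark."""
    return name.upper() + '!'

# Element-wise: pass the function itself, with no brackets after its name
df['shouted_name'] = df['name'].apply(shout)

def describe(row):
    """Build a sentence from several columns of one row."""
    return f"{row['name']} is {row['age']}"

# Row-wise: axis=1 passes each row to the function in turn
df['description'] = df.apply(describe, axis=1)

print(df)
```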

## Exercise

1) Use apply to do small number suppression. If there are fewer than 7 referrals for a category in a given week, change the number to zero.

2) Use the CCGs data and an inner join to get only the referrals for Cumbria and North East.