In [1]:
import pandas as pd

# Augmenting Datasets

This week we will be covering augmenting datasets. 

![](pandas_cartoon.jpeg)

## What does that mean?

One dataset is good. Two datasets is better. One superpower that Pandas gives you is the ability to combine datasets together. 

There are several ways in which two tables can be joined. These are most easily visualised using Venn diagrams.

![](venn.png)

For example, if we have a dataset of inpatient stays and a dataset of referrals, we can combine the two to find the referral source of every inpatient stay in our data.

| Patient_id   | Referral Source | Referral Consultant |
|--------------|-----------------|--------------------|
| 1            | Cardio          | Geoff              |
| 2            | GP              | Jeff               |
| 5            | GP              | Goff               |

<br>

| Patient_id   | Inpatient Start  | Inpatient End | Length of Stay |
|--------------|------------------|---------------|----------------|
| 1            | 2021-10-15       | 2021-10-19    | 4              |
| 2            | 2021-01-15       | 2021-02-15    | 31             |
| 3            | 2021-01-15       | 2021-03-15    | 62             |
| 4            | 2021-01-15       | 2021-02-12    | 28             |

These two datasets can be combined by joining on the column they have in common, in this case `Patient_id`. The four join types below differ only in which unmatched rows they keep.

### Inner Join

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |

### Left Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 5            | GP              | Goff               | NA              | NA            | NA             |


### Right Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                 | 2021-01-15      | 2021-03-15    | 62             |
| 4            | NA              | NA                 | 2021-01-15      | 2021-02-12    | 28             |


### Outer Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|--------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff              | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff               | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                 | 2021-01-15      | 2021-03-15    | 62             |
| 4            | NA              | NA                 | 2021-01-15      | 2021-02-12    | 28             |
| 5            | GP              | Goff               | NA              | NA            | NA             |




Let's see how we code that!
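As a first sketch, the four result tables above can be reproduced with `pd.merge` and its `how` argument. The DataFrame and column names used here (`df_referrals`, `df_inpatient`, `patient_id`, `referral_source`, `length_of_stay`) are illustrative choices rather than part of the lesson data, and only a couple of columns are included to keep it short.

```python
import pandas as pd

# Cut-down versions of the two tables above (names are illustrative assumptions)
df_referrals = pd.DataFrame({
    'patient_id': [1, 2, 5],
    'referral_source': ['Cardio', 'GP', 'GP'],
})
df_inpatient = pd.DataFrame({
    'patient_id': [1, 2, 3, 4],
    'length_of_stay': [4, 31, 62, 28],
})

# One merge per join type; referrals is the "left" table, inpatient the "right"
joins = {
    how: pd.merge(df_referrals, df_inpatient, on='patient_id', how=how)
    for how in ['inner', 'left', 'right', 'outer']
}

# Show which patient ids survive each type of join
for how, result in joins.items():
    print(how, sorted(result['patient_id']))
```

The rest of this lesson works through the same ideas with a slightly different pair of tables.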

In [2]:
df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
    'ethnicity': ['A', 'H' ,'M', 'A'],
})

df_patient

Unnamed: 0,id,name,ethnicity
0,1,Tom,A
1,2,Jenny,H
2,3,James,M
3,4,Dan,A


In [3]:
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

df_info

Unnamed: 0,id,age,sex
0,2,31,F
1,3,20,M
2,4,40,M
3,5,70,F


The general pattern for a join is:

`pd.merge(table_1, table_2, on=joining_columns, how=join_type)`

If you leave them out, `how` defaults to `'inner'` and the join uses every column name that the two tables have in common.
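To illustrate those defaults, the sketch below rebuilds cut-down versions of the two small tables and checks that leaving out the join condition and join type gives the same result as spelling out an inner join on `id`, which is the only column the two tables share. The variable names `implicit` and `explicit` are just for illustration.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
})

# No `on` and no `how`: pandas joins on the shared column ('id')
# and keeps only the ids present in both tables (inner join)
implicit = pd.merge(df_patient, df_info)

# The same join written out in full
explicit = pd.merge(df_patient, df_info, on='id', how='inner')

print(implicit.equals(explicit))
```

Writing `on=` and `how=` explicitly is usually worth the extra typing, because it makes the intent obvious to the next reader.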

In [4]:
df_new = pd.merge(df_patient, df_info, on='id') # Inner join
df_new

Unnamed: 0,id,name,ethnicity,age,sex
0,2,Jenny,H,31,F
1,3,James,M,20,M
2,4,Dan,A,40,M


## What if my columns don't have the same name?

In [5]:
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

In [6]:
pd.merge(
  df_patient, 
  df_info_2, 
  left_on='id', 
  right_on='patient_id'
)

Unnamed: 0,id,name,ethnicity,patient_id,age,sex
0,2,Jenny,H,2,31,F
1,3,James,M,3,20,M
2,4,Dan,A,4,40,M
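Notice that the result above keeps both key columns, `id` and `patient_id`, even though they hold the same values. Two possible ways to tidy that up are sketched below on cut-down copies of the tables: drop the duplicate column after the merge, or rename it before the merge so a plain `on='id'` works. The names `tidy` and `renamed` are illustrative.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
})

# Option 1: merge on differently named keys, then drop the redundant one
merged = pd.merge(df_patient, df_info_2, left_on='id', right_on='patient_id')
tidy = merged.drop(columns='patient_id')

# Option 2: rename the key first so both tables share the column name
renamed = pd.merge(
    df_patient,
    df_info_2.rename(columns={'patient_id': 'id'}),
    on='id',
)

print(tidy)
```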


What would we expect the result to look like with each join type? For example, an outer join:

In [7]:
pd.merge(df_patient, df_info, on='id', how='outer')

Unnamed: 0,id,name,ethnicity,age,sex
0,1,Tom,A,,
1,2,Jenny,H,31.0,F
2,3,James,M,20.0,M
3,4,Dan,A,40.0,M
4,5,,,70.0,F
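When an outer join produces missing values like these, it can help to see which table each row came from. One option, sketched here on cut-down copies of the tables, is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column containing `left_only`, `right_only` or `both`.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
})

# indicator=True adds a '_merge' column recording the source of each row
merged = pd.merge(df_patient, df_info, on='id', how='outer', indicator=True)

# Convert the categorical '_merge' column to plain strings for easy lookup
source_by_id = dict(zip(merged['id'], merged['_merge'].astype(str)))
print(source_by_id)
```

The missing values are also why `age` is displayed as `31.0` rather than `31`: a default integer column cannot hold `NaN`, so pandas converts it to floats.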


## What if I don't want to lose rows that have no match?

In [8]:
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'length_of_stay': [31, 21, 20,40],
    'date': pd.date_range('2019-02-24', periods=4, freq='D')
})

df_stay

Unnamed: 0,id,length_of_stay,date
0,2,31,2019-02-24
1,2,21,2019-02-25
2,4,20,2019-02-26
3,4,40,2019-02-27


In [9]:
pd.merge(df_patient, df_stay, how='left', on='id')

Unnamed: 0,id,name,ethnicity,length_of_stay,date
0,1,Tom,A,,NaT
1,2,Jenny,H,31.0,2019-02-24
2,2,Jenny,H,21.0,2019-02-25
3,3,James,M,,NaT
4,4,Dan,A,20.0,2019-02-26
5,4,Dan,A,40.0,2019-02-27
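The left join above returns six rows from four patients, because Jenny and Dan each have two stays. If that kind of row multiplication would be a surprise in your own data, one safeguard is the `validate` argument of `pd.merge`, sketched below on cut-down copies of the tables. With `validate='one_to_many'` pandas raises an error if the keys in the left table are not unique.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'length_of_stay': [31, 21, 20, 40],
})

# validate checks that each id appears at most once in the left table
merged = pd.merge(
    df_patient, df_stay, how='left', on='id', validate='one_to_many'
)
print(len(df_patient), len(merged))

# count() ignores NaN, so patients with no stays get a count of 0
stays_per_patient = merged.groupby('name')['length_of_stay'].count()
print(stays_per_patient)
```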


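## Joining on the index

You can also join on the index instead of a column. Note that rows are then matched by index label rather than by `id`, which is why the result below has separate `id_x` and `id_y` columns that do not agree with each other.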

In [10]:
pd.merge(df_patient, df_stay, how='left', left_index=True, right_index=True)

Unnamed: 0,id_x,name,ethnicity,id_y,length_of_stay,date
0,1,Tom,A,2,31,2019-02-24
1,2,Jenny,H,2,21,2019-02-25
2,3,James,M,4,20,2019-02-26
3,4,Dan,A,4,40,2019-02-27
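In the output above the rows were paired purely by their index labels (0 to 3), so Tom (`id` 1) ends up next to a stay belonging to `id` 2. An index join is more useful when the index itself holds the key. The sketch below, which assumes a cut-down `df_info` table and illustrative names such as `patient_by_id`, first moves `id` into the index with `set_index` and then joins on it. `DataFrame.join` is shown as a shorter spelling of an index-on-index left join.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
})

# Move the key into the index so the index labels are meaningful
patient_by_id = df_patient.set_index('id')
info_by_id = df_info.set_index('id')

# Index-on-index left join with pd.merge
by_index = pd.merge(
    patient_by_id, info_by_id, how='left', left_index=True, right_index=True
)

# DataFrame.join does an index-on-index left join by default
joined = patient_by_id.join(info_by_id)

print(joined)
```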


## Exercise

Can you use the referrals data that we used last week and the CSV of CCG data to add the CCG names to the referrals data?

In [19]:
referrals = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/data/referrals_oct19_dec20.csv')
ccgs = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/lessons/augmenting_datasets/ccg_2019.csv')

In [20]:
referrals.head()

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84


In [21]:
ccgs.head()

Unnamed: 0,FID,CCG19CD,CCG19CDH,CCG19NM,STP19CD,STP19NM
0,1,E38000001,02N,"NHS Airedale, Wharfedale and Craven CCG",E54000005,West Yorkshire and Harrogate (Health and Care ...
1,2,E38000018,02W,NHS Bradford City CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
2,3,E38000019,02R,NHS Bradford Districts CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
3,4,E38000025,02T,NHS Calderdale CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
4,5,E38000064,03A,NHS Greater Huddersfield CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...


In [23]:
ccg_names = pd.merge(
    referrals, ccgs, how='left', left_on='ccg_code', right_on='CCG19CDH'
)
ccg_names

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,FID,CCG19CD,CCG19CDH,CCG19NM,STP19CD,STP19NM
0,2019-10-07,00L,(blank),Routine,13,187.0,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
1,2019-10-07,00L,(blank),Urgent,1,187.0,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
2,2019-10-07,00L,2WW,2 Week Wait,349,187.0,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
3,2019-10-07,00L,Allergy,Routine,3,187.0,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
4,2019-10-07,00L,Cardiology,Routine,84,187.0,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
...,...,...,...,...,...,...,...,...,...,...,...
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2,147.0,E38000118,99M,NHS North East Hampshire and Farnham CCG,E54000034,Frimley Health
592680,2020-12-21,99M,Surgery - Vascular,Routine,2,147.0,E38000118,99M,NHS North East Hampshire and Farnham CCG,E54000034,Frimley Health
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2,147.0,E38000118,99M,NHS North East Hampshire and Farnham CCG,E54000034,Frimley Health
592682,2020-12-21,99M,Urology,Routine,25,147.0,E38000118,99M,NHS North East Hampshire and Farnham CCG,E54000034,Frimley Health
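Because this is a left join, every referral row is kept, and any `ccg_code` that is missing from the lookup table simply gets `NaN` in the new columns. A quick check for such codes might look like the sketch below. It uses two tiny made-up tables (`referrals_demo`, `ccgs_demo`) instead of the real files, and `'ZZZ'` is a hypothetical code included only to show what an unmatched row looks like.

```python
import pandas as pd

# Tiny stand-ins for the real tables; 'ZZZ' is a made-up, unmatched code
referrals_demo = pd.DataFrame({
    'ccg_code': ['00L', '00L', 'ZZZ'],
    'referrals': [13, 1, 5],
})
ccgs_demo = pd.DataFrame({
    'CCG19CDH': ['00L', '99M'],
    'CCG19NM': [
        'NHS Northumberland CCG',
        'NHS North East Hampshire and Farnham CCG',
    ],
})

merged = pd.merge(
    referrals_demo, ccgs_demo,
    how='left', left_on='ccg_code', right_on='CCG19CDH',
)

# Rows with no CCG name are the ones that found no match in the lookup
unmatched_codes = merged.loc[merged['CCG19NM'].isna(), 'ccg_code'].unique()
print(unmatched_codes)
```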


## Joining dataframes and aggregating data by a new column

Using the referrals data, we want to aggregate the referral numbers by ICS name.

In [24]:
referrals.head()

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84


Bring in the CCG to ICS mapping:

In [38]:
ccg_ics_mapping = pd.read_csv('https://raw.githubusercontent.com/carnall-farrar/python_club/master/lessons/augmenting_datasets/ccg_ics_mapping.csv')
ccg_ics_mapping.head()

Unnamed: 0,ccg_code,ccg_name,ics_name,ccg_id,ics_id,ics_code
0,02N,"NHS Airedale, Wharfedale and Craven CCG",West Yorkshire,E38000001,E54000054,QWO
1,02W,NHS Bradford City CCG,West Yorkshire,E38000018,E54000054,QWO
2,02R,NHS Bradford Districts CCG,West Yorkshire,E38000019,E54000054,QWO
3,02T,NHS Calderdale CCG,West Yorkshire,E38000025,E54000054,QWO
4,03A,NHS Greater Huddersfield CCG,West Yorkshire,E38000064,E54000054,QWO


In [26]:
ics_referrals = pd.merge(
    referrals, ccg_ics_mapping, on='ccg_code'
)
ics_referrals

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,ccg_name,ics_name,ccg_id,ics_id,ics_code
0,2019-10-07,00L,(blank),Routine,13,NHS Northumberland CCG,North East and North Cumbria,E38000130,E54000050,QHM
1,2019-10-07,00L,(blank),Urgent,1,NHS Northumberland CCG,North East and North Cumbria,E38000130,E54000050,QHM
2,2019-10-07,00L,2WW,2 Week Wait,349,NHS Northumberland CCG,North East and North Cumbria,E38000130,E54000050,QHM
3,2019-10-07,00L,Allergy,Routine,3,NHS Northumberland CCG,North East and North Cumbria,E38000130,E54000050,QHM
4,2019-10-07,00L,Cardiology,Routine,84,NHS Northumberland CCG,North East and North Cumbria,E38000130,E54000050,QHM
...,...,...,...,...,...,...,...,...,...,...
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2,NHS North East Hampshire and Farnham CCG,Frimley,E38000118,E54000034,QNQ
592680,2020-12-21,99M,Surgery - Vascular,Routine,2,NHS North East Hampshire and Farnham CCG,Frimley,E38000118,E54000034,QNQ
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2,NHS North East Hampshire and Farnham CCG,Frimley,E38000118,E54000034,QNQ
592682,2020-12-21,99M,Urology,Routine,25,NHS North East Hampshire and Farnham CCG,Frimley,E38000118,E54000034,QNQ


In [27]:
ics_referrals = ics_referrals.groupby('ics_name')['referrals'].sum().reset_index()

In [28]:
ics_referrals.head()

Unnamed: 0,ics_name,referrals
0,"Bath and North East Somerset, Swindon and Wilt...",287573
1,"Bedfordshire, Luton and Milton Keynes",298694
2,Birmingham and Solihull,278208
3,Black Country,361645
4,"Bristol, North Somerset and South Gloucestershire",208343
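The merge and the aggregation can also be written as one chain. The sketch below does this on small made-up tables (`referrals_demo`, `mapping_demo`, with invented referral counts), and it only merges in the two mapping columns that are actually needed, which keeps the intermediate table narrow.

```python
import pandas as pd

# Made-up referral counts; the CCG and ICS names follow the mapping shown above
referrals_demo = pd.DataFrame({
    'ccg_code': ['02N', '02W', '00L', '00L'],
    'referrals': [10, 20, 5, 7],
})
mapping_demo = pd.DataFrame({
    'ccg_code': ['02N', '02W', '00L'],
    'ics_name': [
        'West Yorkshire',
        'West Yorkshire',
        'North East and North Cumbria',
    ],
    'ics_code': ['QWO', 'QWO', 'QHM'],
})

ics_totals = (
    referrals_demo
    # Bring in only the key and the column we want to group by
    .merge(mapping_demo[['ccg_code', 'ics_name']], on='ccg_code')
    # as_index=False keeps ics_name as a normal column, like reset_index()
    .groupby('ics_name', as_index=False)['referrals']
    .sum()
)
print(ics_totals)
```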


## Maps

Maps are used for a similar purpose but often for a single column. They are a way of writing a translation dictionary for a coded column.

For example, you could translate the ethnicity codes in `df_patient` into descriptions, much like the CCG code lookup above, by making a dict like this:

```python
ethnicity_map = {
    'A': 'White British',
    'M': 'Black or Black British - Caribbean',
    'H': 'Asian or Asian British - Indian',
}
      
```

You can then apply this map by using the `.map()` method.

```python
df['mapped_column'] = df['to_be_mapped_column'].map(map_dictionary)
```
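One thing to watch for is that any value missing from the dictionary becomes `NaN` after `.map()`. The sketch below shows this on a small made-up Series, where `'Z'` is a hypothetical code that is deliberately absent from the map, and then fills the gap with a placeholder label of our own choosing.

```python
import pandas as pd

ethnicity_map = {
    'A': 'White British',
    'M': 'Black or Black British - Caribbean',
    'H': 'Asian or Asian British - Indian',
}

# 'Z' is a made-up code that is not in the map
codes = pd.Series(['A', 'H', 'M', 'Z'])

# Codes found in the dict are translated; anything else becomes NaN
described = codes.map(ethnicity_map)

# Optionally replace the NaN with an explicit placeholder
described_filled = described.fillna('Unknown')

print(described_filled)
```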


## Exercise

Can you map the priority types to the following target waiting times (weeks):
 - 2 Week Wait: 2
 - Urgent: 4
 - Routine: 18
 
(Please note these are not real numbers; do not use them in your analysis.)

In [29]:
priority_mapping = {
    '2 Week Wait': 2,
    'Urgent': 4,
    'Routine': 18
}

referrals['target_wait_time'] = referrals['priority'].map(priority_mapping)
referrals

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,target_wait_time
0,2019-10-07,00L,(blank),Routine,13,18
1,2019-10-07,00L,(blank),Urgent,1,4
2,2019-10-07,00L,2WW,2 Week Wait,349,2
3,2019-10-07,00L,Allergy,Routine,3,18
4,2019-10-07,00L,Cardiology,Routine,84,18
...,...,...,...,...,...,...
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2,4
592680,2020-12-21,99M,Surgery - Vascular,Routine,2,18
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2,4
592682,2020-12-21,99M,Urology,Routine,25,18


## Apply

You can also do something similar by defining your own functions and applying them to each element / row in a dataframe.

Applying a function to a column returns a new Series, which you can assign to a new column using the following syntax:

```python
df['new_column'] = df['old_column'].apply(function_name)
```

Note: you only pass the callable (i.e. the function name) rather than actually calling the function, so you don't need any brackets after it.

`apply` takes the column values one by one and passes each of them as the argument to the function.
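As a minimal sketch of both flavours, the example below applies one function to each element of a column and another to each row of a DataFrame (using `axis=1`). The table, its column names and the two helper functions are made up for illustration.

```python
import pandas as pd

# Made-up data: the week a referral arrived and its target wait in weeks
df = pd.DataFrame({
    'start_week': [1, 3],
    'target_wait_time': [2, 18],
})

def double(value):
    # Receives one cell value at a time
    return value * 2

def target_week(row):
    # Receives one whole row (as a Series) at a time
    return row['start_week'] + row['target_wait_time']

# Element-wise: pass the function itself, with no brackets after its name
df['doubled'] = df['target_wait_time'].apply(double)

# Row-wise: axis=1 tells apply to hand over rows instead of columns
df['target_week'] = df.apply(target_week, axis=1)

print(df)
```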

A very silly example:

In [33]:
ethnicity_map = {
    'A': 'White British',
    'M': 'Black or Black British - Caribbean',
    'H': 'Asian or Asian British - Indian',
}

df_patient['ethnicity_description'] = df_patient['ethnicity'].map(ethnicity_map)

df_patient

Unnamed: 0,id,name,ethnicity,ethnicity_description
0,1,Tom,A,White British
1,2,Jenny,H,Asian or Asian British - Indian
2,3,James,M,Black or Black British - Caribbean
3,4,Dan,A,White British


In [34]:
def string_formatting(my_string):
    return my_string.split(' ')[0]

df_patient['ethnicity_category'] = df_patient['ethnicity_description'].apply(string_formatting)

df_patient

Unnamed: 0,id,name,ethnicity,ethnicity_description,ethnicity_category
0,1,Tom,A,White British,White
1,2,Jenny,H,Asian or Asian British - Indian,Asian
2,3,James,M,Black or Black British - Caribbean,Black
3,4,Dan,A,White British,White


## Exercise

1) Use `apply` to do small number suppression. If there are fewer than 7 referrals for a category in a given week, change the number to zero.

In [35]:
def apply_small_number_suppression(value):
    if value < 7:
        return 0
    else:
        return value
    
referrals['referrals_sns'] = referrals['referrals'].apply(apply_small_number_suppression)

In [36]:
referrals

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,target_wait_time,referrals_sns
0,2019-10-07,00L,(blank),Routine,13,18,13
1,2019-10-07,00L,(blank),Urgent,1,4,0
2,2019-10-07,00L,2WW,2 Week Wait,349,2,349
3,2019-10-07,00L,Allergy,Routine,3,18,0
4,2019-10-07,00L,Cardiology,Routine,84,18,84
...,...,...,...,...,...,...,...
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2,4,0
592680,2020-12-21,99M,Surgery - Vascular,Routine,2,18,0
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2,4,0
592682,2020-12-21,99M,Urology,Routine,25,18,25
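`apply` calls the Python function once per value, which can be slow on a table with hundreds of thousands of rows. For a simple rule like this one, a vectorised alternative is `Series.where`, which keeps values where a condition holds and replaces the rest. The sketch below compares the two approaches on a small made-up column.

```python
import pandas as pd

# Made-up referral counts, including the boundary value 7
referrals_demo = pd.DataFrame({'referrals': [13, 1, 349, 3, 7]})

def apply_small_number_suppression(value):
    if value < 7:
        return 0
    return value

# Element-by-element version, as in the exercise
with_apply = referrals_demo['referrals'].apply(apply_small_number_suppression)

# Vectorised version: keep values of 7 or more, replace the rest with 0
with_where = referrals_demo['referrals'].where(
    referrals_demo['referrals'] >= 7, 0
)

print(with_where.tolist())
```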


2) Use the `ccgs` data and an inner join to get only the referrals for Cumbria and North East.

In [37]:
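# Keep only the CCGs that belong to the STP of interest
my_stp = ccgs.loc[ccgs['STP19NM'] == 'Cumbria and North East']

# The inner join then drops every referral whose CCG is not in my_stp
cumbria_referrals = pd.merge(
    referrals, my_stp, left_on='ccg_code', right_on='CCG19CDH', how='inner'
)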

cumbria_referrals

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,target_wait_time,referrals_sns,FID,CCG19CD,CCG19CDH,CCG19NM,STP19CD,STP19NM
0,2019-10-07,00L,(blank),Routine,13,18,13,187,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
1,2019-10-07,00L,(blank),Urgent,1,4,0,187,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
2,2019-10-07,00L,2WW,2 Week Wait,349,2,349,187,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
3,2019-10-07,00L,Allergy,Routine,3,18,0,187,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
4,2019-10-07,00L,Cardiology,Routine,84,18,84,187,E38000130,00L,NHS Northumberland CCG,E54000049,Cumbria and North East
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20785,2020-12-21,99C,Surgery - Not Otherwise Specified,Routine,14,18,14,186,E38000127,99C,NHS North Tyneside CCG,E54000049,Cumbria and North East
20786,2020-12-21,99C,Surgery - Plastic,Routine,7,18,7,186,E38000127,99C,NHS North Tyneside CCG,E54000049,Cumbria and North East
20787,2020-12-21,99C,Surgery - Vascular,Routine,8,18,8,186,E38000127,99C,NHS North Tyneside CCG,E54000049,Cumbria and North East
20788,2020-12-21,99C,Urology,Routine,19,18,19,186,E38000127,99C,NHS North Tyneside CCG,E54000049,Cumbria and North East
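If you only need to filter the referrals and do not want the extra CCG columns, an alternative to the inner join is `Series.isin`. The sketch below shows the idea on small made-up tables (`referrals_demo`, `ccgs_demo`, with invented referral counts); the CCG to STP pairings follow the outputs shown earlier.

```python
import pandas as pd

referrals_demo = pd.DataFrame({
    'ccg_code': ['00L', '99M', '99C'],
    'referrals': [13, 25, 19],
})
ccgs_demo = pd.DataFrame({
    'CCG19CDH': ['00L', '99C', '99M'],
    'STP19NM': [
        'Cumbria and North East',
        'Cumbria and North East',
        'Frimley Health',
    ],
})

# Codes of the CCGs that belong to the STP of interest
stp_codes = ccgs_demo.loc[
    ccgs_demo['STP19NM'] == 'Cumbria and North East', 'CCG19CDH'
]

# Keep the referral rows whose code is in that list; no columns are added
filtered = referrals_demo[referrals_demo['ccg_code'].isin(stp_codes)]

print(filtered)
```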
