In [1]:
# Importing Python packages we are likely to need
import pandas as pd  # useful for reading and manipulating data tables

# Augmenting Datasets

This week we will be covering augmenting datasets.

## What does that mean?

One dataset is good. Two datasets is better. One superpower that Pandas gives you is the ability to combine datasets together. 

For example, if you have a dataset of inpatient stays and a dataset of referrals, you can combine the two to find the referral source of every inpatient stay in your data.

| Patient_id   | Referral Source | Referral Consultant |
|--------------|-----------------|---------------------|
| 1            | Cardio          | Geoff               |
| 2            | GP              | Jeff                |
| 5            | GP              | Goff                |

<br>

| Patient_id   | Inpatient Start  | Inpatient End | Length of Stay |
|--------------|------------------|---------------|----------------|
| 1            | 2021-10-15       | 2021-10-19    | 4              |
| 2            | 2021-01-15       | 2021-02-15    | 31             |
| 3            | 2021-01-15       | 2021-03-15    | 59             |
| 4            | 2021-01-15       | 2021-02-12    | 28             |

### Inner Join

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |

### Left Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 5            | GP              | Goff                | NA              | NA            | NA             |


### Right Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                  | 2021-01-15      | 2021-03-15    | 59             |
| 4            | NA              | NA                  | 2021-01-15      | 2021-02-12    | 28             |


### Outer Join 

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                  | 2021-01-15      | 2021-03-15    | 59             |
| 4            | NA              | NA                  | 2021-01-15      | 2021-02-12    | 28             |
| 5            | GP              | Goff                | NA              | NA            | NA             |


These two datasets can be combined by joining on a column they have in common, in this case the patient ID.

There are several ways in which two tables can be joined. These are most easily visualised using Venn diagrams.

![](venn.png)
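
As a rough sketch of how the four join types behave in Pandas, the cell below rebuilds a cut-down version of the two example tables and runs `pd.merge` once per `how` option. The DataFrame and column names are illustrative choices, and only one column from each table is kept to keep the example short.

```python
import pandas as pd

# Cut-down copies of the two example tables (names are illustrative)
referrals_example = pd.DataFrame({
    'Patient_id': [1, 2, 5],
    'Referral Source': ['Cardio', 'GP', 'GP'],
})
stays_example = pd.DataFrame({
    'Patient_id': [1, 2, 3, 4],
    'Inpatient Start': ['2021-10-15', '2021-01-15', '2021-01-15', '2021-01-15'],
})

# Run each join type and record which patient ids end up in the result
ids_by_join = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = pd.merge(referrals_example, stays_example, on='Patient_id', how=how)
    ids_by_join[how] = sorted(joined['Patient_id'].tolist())
    print(how, ids_by_join[how])
```

The referrals table is the left table here, so a left join keeps every referral and a right join keeps every inpatient stay.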

In [2]:
df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_patient

Unnamed: 0,id,name
0,1,Tom
1,2,Jenny
2,3,James
3,4,Dan


In [3]:
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

df_info

Unnamed: 0,id,age,sex
0,2,31,F
1,3,20,M
2,4,40,M
3,5,70,F


In [4]:
df_new = pd.merge(df_patient, df_info, on='id') # Inner join

In [8]:
df_new

Unnamed: 0,id,name,age,sex
0,2,Jenny,31,F
1,3,James,20,M
2,4,Dan,40,M


## What if my columns don't have the same name?

In [5]:
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

In [6]:
pd.merge(
  df_patient, 
  df_info_2, 
  left_on='id', 
  right_on='patient_id'
)

Unnamed: 0,id,name,patient_id,age,sex
0,2,Jenny,2,31,F
1,3,James,3,20,M
2,4,Dan,4,40,M
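
Notice that the result keeps both key columns (`id` and `patient_id`). If that is not wanted, two possible tidy-ups are sketched below: drop the extra column after merging, or rename the key before merging. The variable names `merged_drop` and `merged_rename` are just for illustration.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# Option 1: merge on the differently named keys, then drop the duplicate key
merged_drop = pd.merge(
    df_patient, df_info_2, left_on='id', right_on='patient_id'
).drop(columns='patient_id')

# Option 2: rename the key first so that a plain on='id' works
merged_rename = pd.merge(
    df_patient, df_info_2.rename(columns={'patient_id': 'id'}), on='id'
)

print(merged_drop)
```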


In [11]:
# What would we expect this to look like with each method?
pd.merge(df_patient, df_info, on='id', how='outer')

Unnamed: 0,id,name,age,sex
0,1,Tom,,
1,2,Jenny,31.0,F
2,3,James,20.0,M
3,4,Dan,40.0,M
4,5,,70.0,F
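
When checking an outer join it can help to see which table each row came from. One way to do this, shown here as a small sketch, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'
merged = pd.merge(df_patient, df_info, on='id', how='outer', indicator=True)
print(merged[['id', '_merge']])
```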


## What if I don't want to lose data which does not have info?

In [12]:
df_patient = pd.DataFrame({
    'id': [1,2,3,4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_patient

Unnamed: 0,id,name
0,1,Tom
1,2,Jenny
2,3,James
3,4,Dan


In [13]:
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B' ,'A', 'C'],
    'length_of_stay': [31, 21, 20,40],
    'date': pd.date_range('2019-02-24', periods=4, freq='D')
})

df_stay

Unnamed: 0,id,treatment,length_of_stay,date
0,2,A,31,2019-02-24
1,2,B,21,2019-02-25
2,4,A,20,2019-02-26
3,4,C,40,2019-02-27


In [14]:
pd.merge(df_patient, df_stay, how='left', on='id')

Unnamed: 0,id,name,treatment,length_of_stay,date
0,1,Tom,,,NaT
1,2,Jenny,A,31.0,2019-02-24
2,2,Jenny,B,21.0,2019-02-25
3,3,James,,,NaT
4,4,Dan,A,20.0,2019-02-26
5,4,Dan,C,40.0,2019-02-27
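
Jenny and Dan each appear twice above because they have two stays each, so a left join can return more rows than the left table has. If you want Pandas to check your assumption about the relationship between the tables, the `validate` option of `pd.merge` is one way to do it. The sketch below leaves out the `date` column for brevity.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B', 'A', 'C'],
    'length_of_stay': [31, 21, 20, 40],
})

# Passes: each patient id appears once on the left, possibly many times on the right
merged = pd.merge(df_patient, df_stay, how='left', on='id', validate='one_to_many')
print(len(df_patient), 'patients ->', len(merged), 'rows')

# Fails: ids are repeated in df_stay, so the tables are not one-to-one
one_to_one_failed = False
try:
    pd.merge(df_patient, df_stay, how='left', on='id', validate='one_to_one')
except pd.errors.MergeError as err:
    one_to_one_failed = True
    print('Not one-to-one:', err)
```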


## Index = True

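Setting `left_index=True` and `right_index=True` joins on the row index of each table instead of on a column. Because both tables also contain an `id` column, Pandas adds the suffixes `_x` and `_y` to tell the two apart.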

In [15]:
pd.merge(df_patient, df_stay, how='left', left_index=True, right_index=True)

Unnamed: 0,id_x,name,id_y,treatment,length_of_stay,date
0,1,Tom,2,A,31,2019-02-24
1,2,Jenny,2,B,21,2019-02-25
2,3,James,4,A,20,2019-02-26
3,4,Dan,4,C,40,2019-02-27
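
In the output above the rows are matched by position (0, 1, 2, 3), not by patient, so Tom is paired with a stay that belongs to patient 2. An index join is usually more meaningful when the index holds the key. A minimal sketch, assuming we move `id` into the index first (the names `patient_by_id` and `stay_by_id` are illustrative, and `date` is left out for brevity):

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B', 'A', 'C'],
    'length_of_stay': [31, 21, 20, 40],
})

# Move the key into the index of both tables
patient_by_id = df_patient.set_index('id')
stay_by_id = df_stay.set_index('id')

# Join on the index, which now holds the patient id
merged = pd.merge(
    patient_by_id, stay_by_id, how='left', left_index=True, right_index=True
)

# DataFrame.join is a shorthand that joins on the index by default (how='left')
joined = patient_by_id.join(stay_by_id)
print(merged)
```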


## Joining on multiple columns
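
To join on more than one column, `on` (or `left_on` / `right_on`) can be given a list of column names, and rows then match only when every listed column agrees. The sketch below uses two small made-up tables; the `clinics` column and all of the values are invented purely for illustration.

```python
import pandas as pd

# Made-up data: weekly referrals and weekly clinic counts per CCG
weekly_referrals = pd.DataFrame({
    'ccg_code': ['00L', '00L', '02T'],
    'week_start': ['2019-10-07', '2019-10-14', '2019-10-07'],
    'referrals': [13, 20, 2],
})
weekly_clinics = pd.DataFrame({
    'ccg_code': ['00L', '02T', '02T'],
    'week_start': ['2019-10-07', '2019-10-07', '2019-10-14'],
    'clinics': [5, 3, 4],
})

# Rows match only when BOTH ccg_code and week_start are equal
merged = pd.merge(
    weekly_referrals, weekly_clinics, on=['ccg_code', 'week_start'], how='inner'
)
print(merged)
```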

## Exercise

Can you use the referrals data that we used last week and the CSV in this directory to add the CCG names to the referrals data?

In [17]:
referals = pd.read_csv('../../data/referrals_oct19_dec20.csv')
ccgs = pd.read_csv('ccg_2019.csv')

In [19]:
referals.head()

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84


In [20]:
ccgs.head()

Unnamed: 0,FID,CCG19CD,CCG19CDH,CCG19NM,STP19CD,STP19NM
0,1,E38000001,02N,"NHS Airedale, Wharfedale and Craven CCG",E54000005,West Yorkshire and Harrogate (Health and Care ...
1,2,E38000018,02W,NHS Bradford City CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
2,3,E38000019,02R,NHS Bradford Districts CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
3,4,E38000025,02T,NHS Calderdale CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...
4,5,E38000064,03A,NHS Greater Huddersfield CCG,E54000005,West Yorkshire and Harrogate (Health and Care ...


In [21]:
ccg_names = pd.merge(
    ccgs, referals, how='inner', left_on='CCG19CDH', right_on='ccg_code'
)

In [23]:
ccg_names[[
    'week_start',
    'ccg_code',
    'CCG19NM',
    'specialty',
    'priority',
    'referrals'
]]

Unnamed: 0,week_start,ccg_code,CCG19NM,specialty,priority,referrals
0,2019-10-07,02T,NHS Calderdale CCG,(blank),Routine,2
1,2019-10-07,02T,NHS Calderdale CCG,2WW,2 Week Wait,165
2,2019-10-07,02T,NHS Calderdale CCG,Cardiology,Routine,27
3,2019-10-07,02T,NHS Calderdale CCG,Cardiology,Urgent,27
4,2019-10-07,02T,NHS Calderdale CCG,Children's & Adolescent Services,Routine,57
...,...,...,...,...,...,...
437981,2020-12-21,01H,NHS North Cumbria CCG,Surgery - Plastic,Routine,1
437982,2020-12-21,01H,NHS North Cumbria CCG,Surgery - Vascular,Routine,13
437983,2020-12-21,01H,NHS North Cumbria CCG,Surgery - Vascular,Urgent,10
437984,2020-12-21,01H,NHS North Cumbria CCG,Urology,Routine,23


## Maps

Maps are used for a similar purpose but often for a single column. They are a way of writing a translation dictionary for a coded column.

For example, you could achieve the same CCG code translation as above by making a dict like this:

```python
ccg_name_map = {
    '02N': 'NHS Airedale, Wharfedale and Craven CCG',
    '02W': 'NHS Bradford City CCG',
    '02R': 'NHS Bradford Districts CCG',
    '02T': 'NHS Calderdale CCG',
    '03A': 'NHS Greater Huddersfield CCG',
    '03E': 'NHS Harrogate and Rural District CCG',
    # ... and so on for the remaining codes
}
```

You can then apply this map by using the `.map()` method.

```python
df['mapped_column'] = df['to_be_mapped_column'].map(map_dictionary)
```
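
One thing to watch for: any value that is not a key in the dictionary becomes `NaN` after `.map()`. A small sketch with a made-up code `'XXX'` that is deliberately missing from the dictionary:

```python
import pandas as pd

# 'XXX' is a made-up code that is not in the dictionary
codes = pd.DataFrame({'ccg_code': ['02N', '02T', 'XXX']})

ccg_name_map = {
    '02N': 'NHS Airedale, Wharfedale and Craven CCG',
    '02T': 'NHS Calderdale CCG',
}

# Codes without a dictionary entry are mapped to NaN
codes['ccg_name'] = codes['ccg_code'].map(ccg_name_map)

# Optionally replace the missing names with a placeholder
codes['ccg_name_filled'] = codes['ccg_name'].fillna('Unknown')
print(codes)
```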


In [24]:
# Exercise - Using this, can you add a column of CCG names to the referrals data?
ccg_code_dict = ccgs[['CCG19CDH', 'CCG19NM']].set_index('CCG19CDH')['CCG19NM'].to_dict()
# This dictionary is provided as an example: it is built from the CSV so that you have a dictionary to use.

In [28]:
referals['ccg_name'] = referals['ccg_code'].map(ccg_code_dict)
referals.head()

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals,ccg_name
0,2019-10-07,00L,(blank),Routine,13,NHS Northumberland CCG
1,2019-10-07,00L,(blank),Urgent,1,NHS Northumberland CCG
2,2019-10-07,00L,2WW,2 Week Wait,349,NHS Northumberland CCG
3,2019-10-07,00L,Allergy,Routine,3,NHS Northumberland CCG
4,2019-10-07,00L,Cardiology,Routine,84,NHS Northumberland CCG


In [30]:
referals_new = referals[['week_start', 'ccg_code', 'ccg_name', 'specialty']]

In [31]:
referals_new

Unnamed: 0,week_start,ccg_code,ccg_name,specialty
0,2019-10-07,00L,NHS Northumberland CCG,(blank)
1,2019-10-07,00L,NHS Northumberland CCG,(blank)
2,2019-10-07,00L,NHS Northumberland CCG,2WW
3,2019-10-07,00L,NHS Northumberland CCG,Allergy
4,2019-10-07,00L,NHS Northumberland CCG,Cardiology
...,...,...,...,...
592679,2020-12-21,99M,NHS North East Hampshire and Farnham CCG,Surgery - Not Otherwise Specified
592680,2020-12-21,99M,NHS North East Hampshire and Farnham CCG,Surgery - Vascular
592681,2020-12-21,99M,NHS North East Hampshire and Farnham CCG,Surgery - Vascular
592682,2020-12-21,99M,NHS North East Hampshire and Farnham CCG,Urology


## Apply

You can also do something similar by defining your own functions and applying them to each element / row in a dataframe.

A very silly example:

In [32]:
df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

In [33]:
def calculate_percentage_score(score):
    return score/100

Applying a function returns a new column, which you can assign using the following syntax:

```python 
df['new_column'] = df['old_column'].apply(function_name)
```

Note that you only pass the callable (i.e. the function name) rather than actually calling the function, so you don't need any brackets after it.

The callable takes the column values one by one and uses each of them as the argument to the function.

In [34]:
df_results['percentage_score_proper'] = df_results['score'].apply(calculate_percentage_score)


In [35]:
df_results

Unnamed: 0,id,score,sex,percentage_score_proper
0,1,31,F,0.31
1,2,20,M,0.2
2,3,40,M,0.4
3,4,70,F,0.7
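
To apply a function to each row rather than to each element of a single column, you can call `.apply()` on the whole dataframe with `axis=1`. The function then receives one row at a time and can read several columns from it. The function `describe_result` below is a made-up example for illustration.

```python
import pandas as pd

df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

def describe_result(row):
    # 'row' is one row of the dataframe; columns are accessed by name
    return f"{row['sex']}: {row['score']}"

# axis=1 means "call the function once per row"
df_results['summary'] = df_results.apply(describe_result, axis=1)
print(df_results)
```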


## Making column wise changes

We know how to do basic arithmetic between columns, but what if we wanted to do something more complicated?
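
As a hint for this kind of change, comparisons work on a whole column at once and return a column of True/False values, and `np.where` picks between two values based on such a condition. A small sketch using the `score` column (the threshold of 40 and the band labels are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F'],
})

# A comparison on a column gives a True/False column
df_results['high_score'] = df_results['score'] >= 40

# np.where(condition, value_if_true, value_if_false), element by element
df_results['band'] = np.where(df_results['score'] >= 40, 'high', 'low')
print(df_results)
```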

In [None]:
# Exercise - Can we make a column that is a true/false flag on sex, i.e. a "female" column which is True if the patient is female?


In [None]:
# Exercise - Using np.round() and the groupby we learnt last week, can you round the average number of referrals per CCG to the nearest 10?

import numpy as np 

np.round(10234,-1)

---

In [None]:
import pandas as pd
import numpy as np

In [None]:
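# Note: np.sum([1, 1]) and np.multiply(2, 2) are called here, when the dict is built,
# so the dict stores their results (2 and 4) rather than the functions themselves.
func_map = {
    "sum": np.sum([1, 1]),
    "multiply": np.multiply(2, 2)
}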

In [None]:
df_test = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'test': ["sum", "sum", "multiply", "multiply"]
})

In [None]:
df_test['output'] = df_test['test'].map(func_map)

In [None]:
df_test
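
If the aim is to pick a function per row, one possible variation is to store the callables themselves in the dictionary and call them inside `.apply()`. This is only a sketch: the `numbers` column is invented for illustration, and `np.prod` is swapped in for `np.multiply` so that both functions accept a single list.

```python
import numpy as np
import pandas as pd

# Store the functions themselves: no brackets, so nothing is called yet
func_lookup = {
    'sum': np.sum,
    'multiply': np.prod,
}

df_test = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'test': ['sum', 'sum', 'multiply', 'multiply'],
    'numbers': [[1, 1], [2, 3], [2, 2], [3, 4]],  # made-up inputs
})

# For each row, look up the function by name and call it on that row's numbers
df_test['output'] = df_test.apply(
    lambda row: func_lookup[row['test']](row['numbers']), axis=1
)
print(df_test)
```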