In [2]:
# Importing Python packages we are likely to need
import pandas as pd  # useful for reading and manipulating data tables

# Augmenting Datasets

This week we will be covering augmenting datasets.

## What does that mean?

One dataset is good. Two datasets are better. One superpower that pandas gives you is the ability to combine datasets.

For example, if you have a dataset of inpatient stays and a dataset of referrals, you can combine the two to find the referral source of every inpatient stay in your data.

| Patient_id   | Referral Source | Referral Consultant |
|--------------|-----------------|---------------------|
| 1            | Cardio          | Geoff               |
| 2            | GP              | Jeff                |
| 5            | GP              | Goff                |

<br>

| Patient_id   | Inpatient Start  | Inpatient End | Length of Stay |
|--------------|------------------|---------------|----------------|
| 1            | 2021-10-15       | 2021-10-19    | 4              |
| 2            | 2021-01-15       | 2021-02-15    | 31             |
| 3            | 2021-01-15       | 2021-03-15    | 59             |
| 4            | 2021-01-15       | 2021-02-12    | 28             |

### Inner Join

Keeps only the patients that appear in both tables.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |

### Left Join

Keeps every row of the left table (here, the referrals) and fills in NA where there is no matching inpatient stay.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 5            | GP              | Goff                | NA              | NA            | NA             |


### Right Join

Keeps every row of the right table (here, the inpatient stays) and fills in NA where there is no matching referral.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                  | 2021-01-15      | 2021-03-15    | 59             |
| 4            | NA              | NA                  | 2021-01-15      | 2021-02-12    | 28             |


### Outer Join

Keeps every row of both tables and fills in NA wherever one side has no match.

| Patient_id   | Referral Source | Referral Consultant | Inpatient Start | Inpatient End | Length of Stay |
|--------------|-----------------|---------------------|-----------------|---------------|----------------|
| 1            | Cardio          | Geoff               | 2021-10-15      | 2021-10-19    | 4              |
| 2            | GP              | Jeff                | 2021-01-15      | 2021-02-15    | 31             |
| 3            | NA              | NA                  | 2021-01-15      | 2021-03-15    | 59             |
| 4            | NA              | NA                  | 2021-01-15      | 2021-02-12    | 28             |
| 5            | GP              | Goff                | NA              | NA            | NA             |


These two datasets can be combined by joining on a column they have in common, in this case the patient ID.

There are several ways to join two tables, and they differ in what happens to rows that have no match in the other table. They are most easily visualised using Venn diagrams.

![](venn.png)
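As a rough sketch, the referral and inpatient tables above can be rebuilt as DataFrames and joined with `pd.merge`. The variable names and lower-case column names here are assumptions made for illustration, and the length-of-stay column is left out to keep the example short.

```python
import pandas as pd

# The referral table from above (column names are illustrative)
df_referrals = pd.DataFrame({
    'patient_id': [1, 2, 5],
    'referral_source': ['Cardio', 'GP', 'GP'],
    'referral_consultant': ['Geoff', 'Jeff', 'Goff'],
})

# The inpatient table from above, without the length-of-stay column
df_inpatient = pd.DataFrame({
    'patient_id': [1, 2, 3, 4],
    'inpatient_start': ['2021-10-15', '2021-01-15', '2021-01-15', '2021-01-15'],
    'inpatient_end': ['2021-10-19', '2021-02-15', '2021-03-15', '2021-02-12'],
})

# Inner join: only patients 1 and 2 appear in both tables
inner = pd.merge(df_referrals, df_inpatient, on='patient_id', how='inner')

# Outer join: every patient from either table, with NaN where data is missing
outer = pd.merge(df_referrals, df_inpatient, on='patient_id', how='outer')

print(inner)
print(outer)
```

The `how` argument is the only thing that changes between the four join types.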

In [2]:
df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_patient

Unnamed: 0,id,name
0,1,Tom
1,2,Jenny
2,3,James
3,4,Dan


In [3]:
df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

df_info

Unnamed: 0,id,age,sex
0,2,31,F
1,3,20,M
2,4,40,M
3,5,70,F


In [4]:
pd.merge(df_patient, df_info, on='id') # Inner join

Unnamed: 0,id,name,age,sex
0,2,Jenny,31,F
1,3,James,20,M
2,4,Dan,40,M


## What if my columns don't have the same name?

In [5]:
df_info_2 = pd.DataFrame({
    'patient_id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

In [6]:
pd.merge(
  df_patient, 
  df_info_2, 
  left_on='id', 
  right_on='patient_id'
)

Unnamed: 0,id,name,patient_id,age,sex
0,2,Jenny,2,31,F
1,3,James,3,20,M
2,4,Dan,4,40,M


In [7]:
# What would we expect this to look like with each method?
# Try 'inner', 'left', 'right' and 'outer' in turn and compare the results.
pd.merge(df_patient, df_info, on='id', how='inner')
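One way to compare the methods side by side is to loop over the valid values of `how` and record which ids survive each join. This is a minimal sketch using the `df_patient` and `df_info` tables defined earlier; the `ids_by_method` dictionary is just a helper name for this example.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_info = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

# Collect the ids kept by each join method
ids_by_method = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(df_patient, df_info, on='id', how=how)
    ids_by_method[how] = sorted(result['id'].tolist())
    print(how, ids_by_method[how])
```

Patient 1 has no info and patient 5 has no name, so they only appear in the joins that keep unmatched rows from their own table.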

## What if I don't want to lose patients who have no matching data?

In [8]:
df_patient = pd.DataFrame({
    'id': [1,2,3,4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_patient

Unnamed: 0,id,name
0,1,Tom
1,2,Jenny
2,3,James
3,4,Dan


In [9]:
df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B' ,'A', 'C'],
    'length_of_stay': [31, 21, 20,40],
    'date': pd.date_range('2019-02-24', periods=4, freq='D')
})

df_stay

Unnamed: 0,id,treatment,length_of_stay,date
0,2,A,31,2019-02-24
1,2,B,21,2019-02-25
2,4,A,20,2019-02-26
3,4,C,40,2019-02-27


In [10]:
pd.merge(df_patient, df_stay, how='left', on='id')

Unnamed: 0,id,name,treatment,length_of_stay,date
0,1,Tom,,,NaT
1,2,Jenny,A,31.0,2019-02-24
2,2,Jenny,B,21.0,2019-02-25
3,3,James,,,NaT
4,4,Dan,A,20.0,2019-02-26
5,4,Dan,C,40.0,2019-02-27
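If you want to see which patients had no stay at all, `pd.merge` accepts `indicator=True`, which adds a `_merge` column saying where each row came from. The sketch below reuses the tables above (without the date column, for brevity); `unmatched` is just a name chosen for this example.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B', 'A', 'C'],
    'length_of_stay': [31, 21, 20, 40],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df_patient, df_stay, how='left', on='id', indicator=True)

# Patients that appear in df_patient but have no row in df_stay
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['id', 'name']])

# The missing values turn the integer column into floats (31 becomes 31.0)
print(merged['length_of_stay'].dtype)
```

Notice also that Jenny and Dan each appear twice, because a left join returns one row per matching stay.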


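## Joining on the index

Setting `left_index=True` and `right_index=True` joins on the index of each DataFrame instead of a column. Here both tables have the default 0 to 3 index, so rows are paired by position rather than by patient, and the two `id` columns are kept apart as `id_x` and `id_y`.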

In [13]:
pd.merge(df_patient, df_stay, how='left', left_index=True, right_index=True)

Unnamed: 0,id_x,name,id_y,treatment,length_of_stay,date
0,1,Tom,2,A,31,2019-02-24
1,2,Jenny,2,B,21,2019-02-25
2,3,James,4,A,20,2019-02-26
3,4,Dan,4,C,40,2019-02-27
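In the output above Tom ends up next to a stay that belongs to patient 2, because the default index is just the row position. A sketch of one way to make an index join meaningful is to move `id` into the index first with `set_index`; the `_by_id` names are made up for this example and the date column is omitted for brevity.

```python
import pandas as pd

df_patient = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})

df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B', 'A', 'C'],
    'length_of_stay': [31, 21, 20, 40],
})

# Make the patient id the index of both tables
patients_by_id = df_patient.set_index('id')
stays_by_id = df_stay.set_index('id')

# Now the index join matches on patient id, not on row position
merged = pd.merge(
    patients_by_id,
    stays_by_id,
    how='left',
    left_index=True,
    right_index=True,
)
print(merged)
```

This gives the same pairing of patients and stays as the earlier `on='id'` left join, with `id` as the index rather than a column.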


## Joining on multiple columns
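To join on more than one column, pass a list of column names to `on` (or to `left_on` and `right_on`). Rows then match only when every listed column agrees. The sketch below uses a made-up `df_outcome` table purely for illustration.

```python
import pandas as pd

df_stay = pd.DataFrame({
    'id': [2, 2, 4, 4],
    'treatment': ['A', 'B', 'A', 'C'],
    'length_of_stay': [31, 21, 20, 40],
})

# Hypothetical table: one outcome per patient and treatment
df_outcome = pd.DataFrame({
    'id': [2, 2, 4],
    'treatment': ['A', 'B', 'C'],
    'outcome': ['discharged', 'transferred', 'discharged'],
})

# A row matches only if both 'id' and 'treatment' are equal
merged = pd.merge(df_stay, df_outcome, how='left', on=['id', 'treatment'])
print(merged)
```

Patient 4's treatment A has no row in `df_outcome`, so its outcome is missing in the result.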

## Exercise

Can you use the referrals data that we used last week and the CSV in this directory to add the CCG names to the referrals data?

In [4]:
# referals = pd.read_csv('~/Downloads/referrals_oct19_dec20.csv')
ccgs = pd.read_csv('ccg_2019.csv')

## Maps

Maps serve a similar purpose, but usually for a single column. A map is a translation dictionary for a coded column: each key is a code and each value is what that code should become.

For example, you could achieve the same CCG code translation as above with a dict like this:

```python
ccg_name_map = {
    '02N': 'NHS Airedale, Wharfedale and Craven CCG',
    '02W': 'NHS Bradford City CCG',
    '02R': 'NHS Bradford Districts CCG',
    '02T': 'NHS Calderdale CCG',
    '03A': 'NHS Greater Huddersfield CCG',
    '03E': 'NHS Harrogate and Rural District CCG',
    # ...
}
```

You can then apply this map by using the `.map()` method.

```python
df['mapped_column'] = df['to_be_mapped_column'].map(map_dictionary)
```
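Here is a small sketch of `.map()` in action. The DataFrame and the `'XXX'` code are made up for illustration; the two dictionary entries are taken from the example above.

```python
import pandas as pd

ccg_name_map = {
    '02N': 'NHS Airedale, Wharfedale and Craven CCG',
    '02W': 'NHS Bradford City CCG',
}

# Made-up data; 'XXX' is deliberately not in the dictionary
df = pd.DataFrame({'ccg_code': ['02N', '02W', 'XXX']})

# Each code is looked up in the dictionary
df['ccg_name'] = df['ccg_code'].map(ccg_name_map)
print(df)
```

Any code that is missing from the dictionary comes back as `NaN`, so it is worth checking for missing values after mapping.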

In [6]:
# Exercise: using this dictionary, can you add a CCG name column to the referrals data?
# The dictionary is built from the CSV provided, as an example for you to use.
ccg_code_dict = ccgs[['CCG19CDH', 'CCG19NM']].set_index('CCG19CDH')['CCG19NM'].to_dict()

## Apply

You can also do something similar by defining your own function and applying it to each element or row of a DataFrame.

A very silly example:

In [17]:
df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

In [18]:
def calculate_percentage_score(score):
    return score/100

Applying a function returns a new Series, which you can assign to a new column using the following syntax:

```python
df['new_column'] = df['old_column'].apply(function_name)
```

Note: you only pass the callable (i.e. the function name) rather than actually calling the function, so you don't need any brackets after it.

`.apply()` takes the column values one by one and passes each as the argument to the function.

In [19]:
df_results['percentage_score_proper'] = df_results['score'].apply(calculate_percentage_score)


In [20]:
df_results

Unnamed: 0,id,score,sex,percentage_score_proper
0,1,31,F,0.31
1,2,20,M,0.2
2,3,40,M,0.4
3,4,70,F,0.7
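To apply a function to each row rather than each element, call `.apply()` on the whole DataFrame with `axis=1`. The function then receives one row at a time and can read several columns. The `make_label` function below is another silly, made-up example.

```python
import pandas as pd

df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

def make_label(row):
    # 'row' holds one row; columns are accessed by name
    return f"{row['sex']}-{row['score']}"

# axis=1 means "pass each row to the function"
df_results['label'] = df_results.apply(make_label, axis=1)
print(df_results)
```

Row-wise `.apply()` is flexible but slow on large tables, so prefer plain column arithmetic when it can do the job.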


## Making column-wise changes

We know how to do basic arithmetic between columns, but what if we want to do something more complicated?
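As a hint, comparisons work on whole columns just like arithmetic does, and `np.where` chooses between two values based on a condition. The sketch below uses the score column; the `PASS_MARK` threshold and the new column names are invented for illustration.

```python
import numpy as np
import pandas as pd

df_results = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'score': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})

# Hypothetical threshold, chosen only for illustration
PASS_MARK = 40

# A comparison on a whole column returns a column of True/False values
df_results['passed'] = df_results['score'] >= PASS_MARK

# np.where picks the first value where the condition is True, else the second
df_results['band'] = np.where(df_results['score'] >= PASS_MARK, 'high', 'low')
print(df_results)
```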

In [21]:
# Exercise: can we make a column that is a true/false flag on sex? For example, a "female" column which is True if the patient is female.


In [22]:
# Exercise: using np.round() and the groupby we learnt last week, can you round the average number of referrals per CCG to the nearest 10?

import numpy as np

np.round(10234, -1)

10230
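`np.round` also works on a whole pandas Series at once, which is handy for the output of a groupby. A small sketch with made-up averages:

```python
import numpy as np
import pandas as pd

# Made-up averages, for illustration only
averages = pd.Series([10234.0, 187.4, 52.6])

# A negative second argument rounds to the left of the decimal point:
# -1 means the nearest 10, -2 the nearest 100, and so on
rounded = np.round(averages, -1)
print(rounded.tolist())
```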