## Peer Group creation

Occasionally, your source data will have comparison values - these could be simple aggregations (Scotland or Health Board) or more complex "peer" values where each location is compared against a group of similar locations.

This recipe demonstrates how you can create random peer groups and use them to drive the aggregation of your freshly anonymised data as part of post-processing. 

Let's look at the sample `inpatients` data:

In [1]:
import pandas as pd
import numpy as np

source = pd.read_csv("inpatients.csv")
source.head()

Unnamed: 0,quarter_date,hb_code,hb_name,loc_code,loc_name,measure,stays,los,avlos,sex,age
0,2018-12-31,S08000015,NHS Ayrshire & Arran,A101H,Arran War Memorial Hospital,Elective Inpatients,0,0,,Female,20-29
1,2018-03-31,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,27,1.35,Female,0-9
2,2018-06-30,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,28,35,1.25,Female,0-9
3,2018-09-30,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,21,1.05,Female,0-9
4,2018-12-31,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,22,1.1,Female,0-9


There are a couple of important considerations to keep in mind when calculating peer values:

- Peer values must respect other categorical dimensions, like dates, age and measure.
- If multiple geography levels are present, peer groups must be specific to each level.
- Locations in peer groups must be persistent and not change mid-processing.
- Processing needs to be efficient as the number of rows can be quite large.
- Any derived columns, like `avlos`, will need to have special treatment.

Note that in this dataset health boards and hospitals are not split into data levels so we'll treat them as equivalent. In a real situation you'd want to anonymise locations and then aggregate them into health board level separately.
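
As a rough sketch of how the two levels could be separated, the snippet below assumes that a row describes a health board when its `loc_code` equals its `hb_code`. That holds in the sample rows shown above, but it is an assumption about the full dataset. The `toy` frame is a made-up stand-in for `source`.

```python
import pandas as pd

# Toy stand-in for `source`; rows are illustrative only.
toy = pd.DataFrame({
    "hb_code":  ["S08000015", "S08000015", "S08000016"],
    "loc_code": ["A101H", "S08000015", "S08000016"],
    "loc_name": ["Arran War Memorial Hospital", "NHS Ayrshire & Arran", "NHS Borders"],
})

# Assumption: board-level rows carry the board code as their location code.
is_board = toy["loc_code"] == toy["hb_code"]

board_names = toy.loc[is_board, "loc_name"].unique()
hospital_names = toy.loc[~is_board, "loc_name"].unique()

print(list(board_names))
print(list(hospital_names))
```

Each array could then be passed to `generate_peer_groups` separately, so that boards are only ever peered with boards and hospitals with hospitals.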

Let's define a function that will generate our peer groups:

In [2]:
def generate_peer_groups(locations, min_peers, max_peers, seed=0):
    '''
    Helper function to generate random peer groups.
    
    Parameters:
    -----------
    locations : np.array
        a peer group will be generated for each of the locations
    min_peers : int
        the minimum number of peers a group should have
    max_peers : int
        the maximum number of peers a group should have
    seed : int
        random seed for reproducibility
    
    Returns:
    --------
    A dictionary with peer information.   
    '''
    
    def new_peer_group(min_peers, max_peers):
        '''
        Helper function to initialise a peer group dictionary.
        '''
        
        return {
            "peer_count" : np.random.randint(min_peers, max_peers + 1),
            "peers"      : []
        }
    
    np.random.seed(seed)
    result = {}
    
    #initialise all locations and their peer groups
    for location in locations:
        result[location] = new_peer_group(min_peers, max_peers)
    
    for i, location in enumerate(locations):
        peer_group = result[location]
        #add peer with itself, if able
        if location not in peer_group["peers"]:
            peer_group["peers"].append(location)
        #add random peers from the list
        remaining = peer_group["peer_count"] - len(peer_group["peers"])
        if remaining > 0:
            candidates = np.arange(len(locations))
            idx = np.random.choice(candidates[candidates != i], remaining, replace=False)
            peer_group["peers"].extend([locations[j] for j in idx])
            #try to add initial location to its peers' own groups
            for j in idx:
                peer_loc = locations[j]
                #make sure that the peer location is included in its own group
                if peer_loc not in result[peer_loc]["peers"]:
                    result[peer_loc]["peers"].append(peer_loc)
                if result[peer_loc]["peer_count"] > len(result[peer_loc]["peers"]):
                    result[peer_loc]["peers"].append(location)

    return result

Now let's generate peer groups for our dataset, with each group containing between 3 and 6 locations, including the location itself.

In [3]:
peer_col = "loc_name"
peer_groups = generate_peer_groups(source[peer_col].unique(), 3, 6)
peer_groups["Arran War Memorial Hospital"]

{'peer_count': 3,
 'peers': ['Arran War Memorial Hospital',
  'West Glasgow',
  'Lorn & Islands Hospital']}

In [4]:
peer_groups["West Glasgow"]

{'peer_count': 4,
 'peers': ['West Glasgow',
  'Arran War Memorial Hospital',
  'NHS Highland',
  'Vale of Leven General Hospital']}

You can see that each location is always grouped with itself, plus a random number of peers. Where possible, peering is reciprocal: if `West Glasgow` was picked for `Arran War Memorial Hospital`, we try to place `Arran War Memorial Hospital` in the `West Glasgow` peer group, provided that group still has room.
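
Because the groups are random, it can be worth checking them before they drive any aggregation. The following is a minimal sketch of such a check; `check_peer_groups` is a hypothetical helper, not part of the recipe, and `toy_groups` simply copies the two outputs shown above. It also flags duplicate peers, since a location listed twice would be counted twice when values are summed.

```python
# Toy peer groups copied from the two outputs shown above.
toy_groups = {
    "Arran War Memorial Hospital": {
        "peer_count": 3,
        "peers": ["Arran War Memorial Hospital", "West Glasgow",
                  "Lorn & Islands Hospital"],
    },
    "West Glasgow": {
        "peer_count": 4,
        "peers": ["West Glasgow", "Arran War Memorial Hospital",
                  "NHS Highland", "Vale of Leven General Hospital"],
    },
}

def check_peer_groups(groups, min_peers, max_peers):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for name, group in groups.items():
        peers = group["peers"]
        if name not in peers:
            problems.append(f"{name}: not a member of its own group")
        if len(peers) != group["peer_count"]:
            problems.append(f"{name}: peer_count does not match number of peers")
        if not min_peers <= len(peers) <= max_peers:
            problems.append(f"{name}: group size outside [{min_peers}, {max_peers}]")
        if len(set(peers)) != len(peers):
            problems.append(f"{name}: duplicate peers")
    return problems

problems = check_peer_groups(toy_groups, 3, 6)
print(problems)
```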

Before we can use these peer groups, we need to format them into a dataframe:

In [5]:
peer_names = []
peer_values = []

for key, value in peer_groups.items():
    #repeat the group name once per peer so both lists stay the same length
    peer_names.extend([key] * len(value["peers"]))
    peer_values.extend(value["peers"])

peer_reference_df = pd.DataFrame(data={
    "peer_group_name" : peer_names,
    "peer_location" : peer_values
})

peer_reference_df.head()

Unnamed: 0,peer_group_name,peer_location
0,Arran War Memorial Hospital,Arran War Memorial Hospital
1,Arran War Memorial Hospital,West Glasgow
2,Arran War Memorial Hospital,Lorn & Islands Hospital
3,NHS Ayrshire & Arran,NHS Ayrshire & Arran
4,NHS Ayrshire & Arran,NHS Greater Glasgow & Clyde


Remember the first point about respecting other categorical dimensions - we define them here...

In [6]:
join_cols = [
    "quarter_date", 
    "measure",
    "sex",
    "age"
]

...and use them as grouping keys after joining the two dataframes on location:

In [7]:
peer_aggregated_df = (pd
    .merge(peer_reference_df, source, how="left", left_on="peer_location", right_on="loc_name")
    .drop(columns="avlos")
    #sum numeric columns only, so text columns like hb_name are not concatenated
    .groupby(["peer_group_name"] + join_cols).sum(numeric_only=True)
    .rename(columns=lambda x: x + "_peer")
    .reset_index()
    .rename(columns={"peer_group_name" : "loc_name"}))

peer_aggregated_df.head()

Unnamed: 0,loc_name,quarter_date,measure,sex,age,stays_peer,los_peer
0,Aberdeen Royal Infirmary,2018-03-31,All Daycases,Female,0-9,35,0
1,Aberdeen Royal Infirmary,2018-03-31,All Daycases,Female,10-19,106,0
2,Aberdeen Royal Infirmary,2018-03-31,All Daycases,Female,20-29,423,0
3,Aberdeen Royal Infirmary,2018-03-31,All Daycases,Female,30-39,649,0
4,Aberdeen Royal Infirmary,2018-03-31,All Daycases,Female,40-49,974,0
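
The join-then-aggregate step can be easier to follow on a tiny example. The sketch below uses made-up locations `A`, `B` and `C` and a single measure; all names and numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Group A contains A and B; group B contains B and C.
toy_reference = pd.DataFrame({
    "peer_group_name": ["A", "A", "B", "B"],
    "peer_location":   ["A", "B", "B", "C"],
})
toy_source = pd.DataFrame({
    "loc_name": ["A", "B", "C"],
    "quarter_date": ["2018-03-31"] * 3,
    "stays": [1, 10, 100],
})

toy_peer_sums = (
    toy_reference
    # attach each peer's rows to the group it belongs to
    .merge(toy_source, how="left", left_on="peer_location", right_on="loc_name")
    # sum within each group, keeping the categorical dimension separate
    .groupby(["peer_group_name", "quarter_date"])["stays"].sum()
    .reset_index(name="stays_peer")
)
print(toy_peer_sums)
```

Group `A` sums the stays of `A` and `B`, while group `B` sums those of `B` and `C`, so a location can contribute to several peer groups.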


You may have noticed that we dropped the `avlos` column in the previous step. That's because averages can't simply be summed: the derived column has to be recalculated from the newly aggregated columns. Thankfully, it's very easy:

In [8]:
peer_aggregated_df["avlos_peer"] = peer_aggregated_df["los_peer"] / peer_aggregated_df["stays_peer"]
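
To illustrate why the average has to be rebuilt from the sums, here is a small example with made-up numbers. A small hospital with long stays and a large hospital with short stays give very different answers depending on the method.

```python
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "loc_name": ["Hospital A", "Hospital B"],
    "stays": [10, 90],
    "los": [50, 90],
})
toy["avlos"] = toy["los"] / toy["stays"]  # 5.0 and 1.0

# Averaging the averages ignores how many stays each location had.
mean_of_averages = toy["avlos"].mean()

# Dividing the summed length of stay by the summed stays weights each stay equally.
ratio_of_sums = toy["los"].sum() / toy["stays"].sum()

print(mean_of_averages, ratio_of_sums)
```

The ratio of sums is the approach used for `avlos_peer` above.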

Finally, we join our peer measures on to the source data:

In [9]:
final = pd.merge(source, peer_aggregated_df, how="left", on=["loc_name"] + join_cols)
final.head()

Unnamed: 0,quarter_date,hb_code,hb_name,loc_code,loc_name,measure,stays,los,avlos,sex,age,stays_peer,los_peer,avlos_peer
0,2018-12-31,S08000015,NHS Ayrshire & Arran,A101H,Arran War Memorial Hospital,Elective Inpatients,0,0,,Female,20-29,27,90,3.333333
1,2018-03-31,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,27,1.35,Female,0-9,463,996,2.151188
2,2018-06-30,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,28,35,1.25,Female,0-9,432,1016,2.351852
3,2018-09-30,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,21,1.05,Female,0-9,398,993,2.494975
4,2018-12-31,S08000015,NHS Ayrshire & Arran,S08000015,NHS Ayrshire & Arran,Elective Inpatients,20,22,1.1,Female,0-9,370,928,2.508108
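
A left join like this should leave the number of source rows unchanged. As a hedged sketch, pandas can enforce that through the `validate` argument of `pd.merge`, which raises an error if the right-hand keys are not unique. The frames below are toy stand-ins with made-up values.

```python
import pandas as pd

toy_source = pd.DataFrame({
    "loc_name": ["Hospital A", "Hospital A", "Hospital B"],
    "quarter_date": ["2018-03-31", "2018-06-30", "2018-03-31"],
    "stays": [10, 12, 90],
})
toy_peers = pd.DataFrame({
    "loc_name": ["Hospital A", "Hospital A", "Hospital B"],
    "quarter_date": ["2018-03-31", "2018-06-30", "2018-03-31"],
    "stays_peer": [100, 110, 100],
})

keys = ["loc_name", "quarter_date"]

# many_to_one: each key combination may appear only once in the peer table.
toy_final = pd.merge(toy_source, toy_peers, how="left", on=keys,
                     validate="many_to_one")

# Count source rows that found no peer values.
missing = int(toy_final["stays_peer"].isna().sum())
print(len(toy_final), missing)
```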
