### Notebook to explore additional pain medication details 

This notebook builds on notebook 1 by adding further drugs and data.  The dictionary developed in notebook 1 is expanded to include the new drugs.  The data is then processed to generate tables
for patients who have 6 month and 12 month follow ups.

After this, the new dictionary is used to process further data in notebook 3.

In [None]:
import os
import numpy as np
import pandas as pd
import pickle
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from pathlib import Path

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
path_to_data = Path(os.getcwd()).parent / "data"

### Read in the new medication table 

In [None]:
new_meds_file_name = "Medication 21.03.2023.csv"

In [None]:
raw_df = pd.read_csv(path_to_data / Path(new_meds_file_name))

In [None]:
raw_df.head()

In [None]:
# Check which columns are in the dataframe
raw_df.columns

In [None]:
raw_df.loc[raw_df['ssid']=='BO-08-113']

All medicines seem to be in the "medication_name_strength" column.  To check whether they are all in the dictionary from the earlier notebook, that dictionary will be loaded and each unique entry in this column will be checked to see whether it is present in the dictionary.

In [None]:
raw_df['medication_name_strength'] = raw_df['medication_name_strength'].apply(lambda x: str(x))

In [None]:
drugs_list = raw_df['medication_name_strength'].to_list()
unique_names = list(set(drugs_list))
unique_names.sort()
unique_names[:10]

This is a very different list to the original list of medications.  To produce a new, consolidated drug dictionary it was decided to start with the dictionary from the earlier notebook.  The earlier dictionary was loaded into final_dict

In [None]:
drug_dict_file = "drug_dict.pkl"
drug_dict_path = path_to_data / Path(drug_dict_file)
with open(drug_dict_path, 'rb') as dict_file:
    final_dict = pickle.load(dict_file)

In [None]:
list(final_dict.keys())

To see which drugs in the new list were not in the old, a list was generated

In [None]:
for drug in unique_names:
    if drug not in final_dict.keys():
        print(drug)

Next fuzzy matching was used to see which of the unmatched drugs might be associated with the keys from the final dictionary

In [None]:
def get_matches(drug_names, unique_names, pr_threshold, tr_threshold):
    """ Use fuzzy matching to identify which of the unique names are close to each drug name, where close 
    is defined by the thresholds.  If a threshold is set too high then potential matches can be missed; if too low 
    then the same name can become associated with multiple drugs.
    """
    unique_names_arr = np.asarray(list(unique_names))
    pr_drugs_dict = {}
    tr_drugs_dict = {}
    for drug in drug_names:
        partial_ratios = []
        token_ratios = []
        for name in unique_names:
            # lower-case both strings so that the partial ratio is case-insensitive
            partial_ratio = fuzz.partial_ratio(drug.lower(), name.lower())
            partial_ratios.append(partial_ratio)
            token_ratio = fuzz.token_sort_ratio(drug, name)
            token_ratios.append(token_ratio)
        pr_arr = np.asarray(partial_ratios)
        tr_arr = np.asarray(token_ratios)
        # keep only the names whose score is above the relevant threshold
        pr_matches = unique_names_arr[np.nonzero(pr_arr>pr_threshold)]
        tr_matches = unique_names_arr[np.nonzero(tr_arr>tr_threshold)]
        pr_drugs_dict[drug] = pr_matches.tolist()
        tr_drugs_dict[drug] = tr_matches.tolist()
    return pr_drugs_dict, tr_drugs_dict
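
The thresholding idea can be tried out without the real data. The sketch below is not the fuzzywuzzy scorer used above: it swaps in `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, scored from 0 to 1 rather than 0 to 100. The helper names `similarity` and `match_names` and the sample names are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0 and 1 (stand-in for a fuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_names(canonical, candidates, threshold=0.75):
    """Map each canonical drug name to the candidates scoring above the threshold."""
    return {
        drug: [name for name in candidates if similarity(drug, name) > threshold]
        for drug in canonical
    }


# Hypothetical sample data
canonical = ["Aspirin", "Tramadol"]
candidates = ["Asprin", "Tramadol 50mg", "Zapain"]
matches = match_names(canonical, candidates)
print(matches)
```

Raising the threshold towards 1 drops looser matches such as a name with a strength appended, while lowering it starts to pull unrelated names into the same group, which is the trade-off described in the docstring.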

See which drugs can be matched against the main drugs list

In [None]:
drug_names = list(final_dict.keys())[1:]
pr_drugs_dict, tr_drugs_dict = get_matches(drug_names, unique_names, pr_threshold=75, tr_threshold=75)

In [None]:
pr_drugs_dict

In [None]:
final_dict

In [None]:
pr_drugs_dict['Aspirin']

Check how many of the unique names already appear among the values in the existing dictionary.  Any that are already there can be removed and need no further attention

In [None]:
# Generate a list containing all of the names contained in the final_dict value items
assigned_names = []
for key in final_dict.keys():
    for value in final_dict[key]:
        assigned_names.append(value)

In [None]:
# Now check which of the drugs in unique names already exists in the final_dict values
for drug in unique_names:
    if drug in assigned_names:
        print(f'{drug} found')

The drugs that are already in assigned names can be removed from the unique names list so that it is more obvious which ones remain

In [None]:
len(unique_names)

In [None]:
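# Build a new list rather than removing items while iterating, which would skip entries
found = [drug for drug in unique_names if drug in assigned_names]
unique_names = [drug for drug in unique_names if drug not in assigned_names]
for drug in found:
    print(f'{drug} found and processed')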
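
One detail worth keeping in mind when pruning a list such as `unique_names`: calling `remove` on a list while looping over that same list skips the element that follows each removal. A minimal sketch with made-up names (both helper functions are hypothetical) shows the difference from filtering into a new list:

```python
def remove_while_iterating(names, assigned):
    """Prune in place while looping; adjacent assigned names can be skipped."""
    names = list(names)  # work on a copy so the caller's list is untouched
    for name in names:
        if name in assigned:
            names.remove(name)
    return names


def filter_names(names, assigned):
    """Build a new list holding only the names that are not yet assigned."""
    assigned_set = set(assigned)
    return [name for name in names if name not in assigned_set]


# Hypothetical sample data: two adjacent names are already assigned
names = ["Asprin", "Aspirin 75mg", "Zapain"]
assigned = ["Asprin", "Aspirin 75mg"]

print(remove_while_iterating(names, assigned))
print(filter_names(names, assigned))
```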

In [None]:
# See how many are left after removing the above
len(unique_names)

In [None]:
unique_names

In [None]:
assigned_names

### Load in pain meds drug summary
Since there were a lot of new meds in the list it was necessary to know which were pain meds.  Helen created a table of the new pain meds, their correct spelling, which group they belong to, etc.

In [None]:
path_to_new_pain_meds = path_to_data / Path("new_pain_meds.csv")
new_meds_df = pd.read_csv(path_to_new_pain_meds)

In [None]:
new_meds_df.head()

There were a few typos and additional characters in this table that needed to be corrected

In [None]:
new_meds_df = new_meds_df.replace("Morphine sulphate", "Morphine Sulphate")
new_meds_df = new_meds_df.replace("Fentanyl ", "Fentanyl")
new_meds_df = new_meds_df.replace("Tramadol\xa0", "Tramadol")

In [None]:
new_meds_df

Create a new dictionary from the above of the same form as the original dictionary

In [None]:
new_meds_dict = {}
for idx, row in new_meds_df.iterrows():
    group = row['Group']
    variant = row["New pain drugs"]
    if group not in new_meds_dict:
        new_meds_dict[group] = []
    if variant not in new_meds_dict[group]:
        new_meds_dict[group].append(variant)

In [None]:
new_meds_dict

### Create new drug dictionary

Merge the two dictionaries to create a new final dictionary which is applicable to everything

In [None]:
import copy
new_final_dict = copy.deepcopy(final_dict)

In [None]:
for key in new_meds_dict.keys():
    if key not in new_final_dict.keys():
        new_final_dict[key] = []
    for value in new_meds_dict[key]:
        if value not in new_final_dict[key]:
            new_final_dict[key].append(value)
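
The merge step can be expressed as a small reusable function. This is a sketch only: the name `merge_drug_dicts` and the sample dictionaries are made up, and it assumes both dictionaries map a group name to a list of variant spellings, as in the cells above.

```python
import copy


def merge_drug_dicts(base, extra):
    """Return a new dict holding every group and variant from both inputs, without duplicates."""
    merged = copy.deepcopy(base)  # leave the original dictionary untouched
    for group, variants in extra.items():
        target = merged.setdefault(group, [])
        for variant in variants:
            if variant not in target:
                target.append(variant)
    return merged


# Hypothetical sample data
old = {"Aspirin": ["Aspirin", "Asprin"]}
new = {"Aspirin": ["Asprin", "Aspirin 75mg"], "Tramadol": ["Tramadol"]}
merged = merge_drug_dicts(old, new)
print(merged)
```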

In [None]:
new_final_dict

### Update the dict with some additional pain meds found after the above. 
There are some cases that need to be added, mainly names that are not mis-spellings or near matches of an existing entry but different names for the same drug, such as brand names.  These will be added individually

In [None]:
# Add new values discovered subsequently
new_final_dict["Co-codamol"].append("Zapain")
new_final_dict["Co-codamol"].append("Zapai")
new_final_dict["Ibuprofen"].append("Ibugel")
new_final_dict["Duloxetine"].append("Duloxatin")
new_final_dict["Morphine Sulphate"].append("Zoomorph")
new_final_dict["Morphine Sulphate"].append("Aramorph")
new_final_dict["Dihydrocodiene"].append("Diharocodeine")
new_final_dict["Dihydrocodiene"].append("Dihydeocodeine")
new_final_dict["Aspirin"].append("Dispersable Asprin")
new_final_dict["Aspirin"].append("Dispersible Asprin")
new_final_dict["Aspirin"].append("Asprin 75mg")
new_final_dict["Co-dydramol"].append("Co-dyamol")
new_final_dict["Co-dydramol"].append("Co Dydramol")
new_final_dict["Buprenorphine"].append("Transdermal patch buprephine")
new_final_dict["Buprenorphine"].append("Butec patches")
new_final_dict["Co-proxamol"] = ["Co-proxamol", "Coproxanol"]
new_final_dict["Oxycodone"].append("Longtec")
new_final_dict["Oxycodone"].append("Oxycontin")
new_final_dict["Diclofenac Sodium"].append("Athrotec")
new_final_dict["Diclofenac Sodium"].append("Voltarol")
new_final_dict["Diclofenac Sodium"].append("Voltarol cream")
new_final_dict["Diclofenac Sodium"].append("Voltral")
new_final_dict["Diclofenac Sodium"].append("Arthrotec 50")
new_final_dict["Tapentadol"].append("Tapentodol")
new_final_dict["Tramadol"].append("Marol")
new_final_dict["Tramadol"].append("Malol")
new_final_dict["Tramadol"].append("Morol")
new_final_dict["Tramadol"].append("Tradorec")
new_final_dict["Diazepam"].append("Diazepan")
new_final_dict["Naproxen"].append("Naproxin")
new_final_dict["Capsaicin"].append("Axsain Cream")
new_final_dict["Pregabalin"].append("Lirica")

In [None]:
new_final_dict

In [None]:
# Save the new dictionary
path_to_new_dict = path_to_data / Path("new_final_dict.pkl")
with open(path_to_new_dict, 'wb') as nfd:
    pickle.dump(new_final_dict, nfd)

### Apply the new dictionary to correct the spelling and names of the pain meds dataframe

In [None]:
# Iterate around replacing any mis-spelt values
for key in new_final_dict.keys():
    for value in new_final_dict[key]:
        raw_df = raw_df.replace(value, key)
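
The same correction can be viewed as a lookup from each variant spelling to its group name. The sketch below uses plain Python with made-up data. The helpers `build_lookup` and `normalise` are hypothetical, and if a variant is listed under two groups the first group wins in this sketch.

```python
def build_lookup(drug_dict):
    """Invert {group: [variants]} into {variant: group}."""
    lookup = {}
    for group, variants in drug_dict.items():
        for variant in variants:
            lookup.setdefault(variant, group)  # keep the first group seen
    return lookup


def normalise(name, lookup):
    """Return the group name for a known variant, or the name unchanged."""
    return lookup.get(name, name)


# Hypothetical sample data
drug_dict = {"Tramadol": ["Tramadol", "Marol"], "Naproxen": ["Naproxin"]}
lookup = build_lookup(drug_dict)
cleaned = [normalise(name, lookup) for name in ["Marol", "Naproxin", "Paracetamol"]]
print(cleaned)
```

Building the lookup once means each name is resolved with a single dictionary access, which may be preferable to one full-table replacement per variant when the dictionary grows.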

In [None]:
raw_df

In [None]:
raw_df.columns

In [None]:
new_columns = ['ssid', 'study_event_oid', 'medication_name_strength', 'medication_dosage_form', 'medication_doseage_frequency',
       'how_long_using_medication', 'end_date_using_medication']

### Process the dataframe with the corrected drug names 
It is desirable to look at the data for the 6 month and 12 month follow ups separately.  The follow up period is embedded in the "study_event_oid" column and will be separated out

In [None]:
# Take a copy so that adding columns later does not operate on a view of raw_df
processed_df = raw_df[new_columns].copy()

In [None]:
list(processed_df["study_event_oid"])

To be able to separate the 6 month and 12 month cases a new column will be created defining the follow up period

In [None]:
processed_df['follow_up'] = processed_df.loc[:, 'study_event_oid'].map(lambda x: x.split(' ')[1])
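
The extraction takes the second space-separated token of the event label. A minimal sketch of that rule follows. The label strings are made up, since the real format of `study_event_oid` is not shown here, and the helper `follow_up_period` is hypothetical. Unlike the one-line version above, it returns `None` instead of raising when a label has no second token.

```python
def follow_up_period(event_label):
    """Return the second space-separated token of the label, or None if there is none."""
    parts = str(event_label).split(' ')
    return parts[1] if len(parts) > 1 else None


# Hypothetical labels, assumed to carry the period as the second token
labels = ["Month 6 follow up", "Month 12 follow up", "Baseline"]
periods = [follow_up_period(label) for label in labels]
print(periods)
```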

In [None]:
def concat_ssid_data(df):
    """ Create a new dataframe with one row per unique SSID.  Some cases are likely to have follow ups at both 6 and 12
    months, but not all.  This function places all of the base columns from the original rows into the same row,
    with the 6 month values in columns before the 12 month values.  In this way all of the data for a specific SSID is
    in the same row, with the column names mapped from the original names to the base names
    """
    original_col_names = ['medication_name_strength', 'medication_dosage_form', 
                          'medication_doseage_frequency', 'end_date_using_medication', 'follow_up']
    base_column_names = ['drug_name', 'dosage', 'frequency', 'used', 'follow_up']
    rows = []
    # generate unique list of ssids
    ssids = df['ssid'].unique()
    ssids.sort()
    for ssid in ssids:
        # process 6 months and then 12 months
        ssid_df = df[df['ssid']==ssid]
        filtered_df = ssid_df[ssid_df['follow_up']=='6']
        # At this point we have a subset of the dataframe for entries with the current ssid and for which the follow up was 6 months
        new_row_dict = {'ssid': ssid}
        counter = 1
        for key, row in filtered_df.iterrows():
            if pd.isnull(row[original_col_names[0]]):
                # if no valid medication name then skip this row
                continue
            for base_name, col_name in zip(base_column_names, original_col_names):
                new_row_dict[base_name+'_'+str(counter)] = row[col_name]
            counter += 1
        # process for 12 month
        filtered_df = ssid_df[ssid_df['follow_up']=='12']
        for key, row in filtered_df.iterrows():
            for base_name, col_name in zip(base_column_names, original_col_names):
                new_row_dict[base_name+'_'+str(counter)] = row[col_name]
            counter += 1
        # Collect the row for this ssid
        rows.append(new_row_dict)
    # DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
    new_df = pd.DataFrame(rows)
    return new_df

Check how many rows there are in the filtered dataframe that are in the new meds list

Create the new SSID based dataframe using the above function

The above table was not used; instead it was decided to create a separate dataframe for each follow up period

In [None]:
def create_new_table(df, follow_up_target):
    """ Similar to the function above, but produces a table for a single follow up period, so that the 6 month and
    12 month follow up cases can be held in separate tables
    """
    original_col_names = ['medication_name_strength', 'medication_dosage_form', 
                          'medication_doseage_frequency', 'end_date_using_medication', 'follow_up']
    base_column_names = ['drug_name', 'dosage', 'frequency', 'used', 'follow_up']
    rows = []
    # generate unique list of ssids
    ssids = df['ssid'].unique()
    ssids.sort()
    for ssid in ssids:
        # keep only the rows for this ssid and the requested follow up period
        filtered_df = df[df['ssid']==ssid]
        filtered_df = filtered_df[filtered_df['follow_up']==str(follow_up_target)]
        new_row_dict = {'ssid': ssid}
        counter = 1
        for key, row in filtered_df.iterrows():
            if pd.isnull(row[original_col_names[0]]):
                # if no valid medication name then skip this row
                continue
            for base_name, col_name in zip(base_column_names, original_col_names):
                new_row_dict[base_name+'_'+str(counter)] = row[col_name]
            counter += 1
        # Collect the row for this ssid
        rows.append(new_row_dict)
    # DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
    new_df = pd.DataFrame(rows)
    return new_df
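
The reshaping performed here, from one row per medication to one row per patient with numbered columns, can be illustrated in plain Python. The sketch below is a simplified stand-in rather than the notebook's function: it uses lists of dicts instead of a dataframe, only two of the columns, and made-up SSIDs. The helper name `to_wide` is hypothetical.

```python
def to_wide(records, follow_up):
    """Collapse per-medication records into one dict per ssid for a single follow up period."""
    # every ssid gets a row, even if it has no medication at this follow up
    rows = {rec["ssid"]: {"ssid": rec["ssid"]} for rec in records}
    counters = {ssid: 0 for ssid in rows}
    for rec in records:
        if rec["follow_up"] != follow_up or rec["drug_name"] is None:
            continue
        counters[rec["ssid"]] += 1
        n = counters[rec["ssid"]]
        rows[rec["ssid"]][f"drug_name_{n}"] = rec["drug_name"]
        rows[rec["ssid"]][f"dosage_{n}"] = rec["dosage"]
    return [rows[ssid] for ssid in sorted(rows)]


# Hypothetical sample data
records = [
    {"ssid": "A-02", "follow_up": "6", "drug_name": "Tramadol", "dosage": "50mg"},
    {"ssid": "A-01", "follow_up": "6", "drug_name": "Aspirin", "dosage": "75mg"},
    {"ssid": "A-01", "follow_up": "6", "drug_name": "Naproxen", "dosage": "250mg"},
    {"ssid": "A-01", "follow_up": "12", "drug_name": "Aspirin", "dosage": "75mg"},
]
six = to_wide(records, "6")
twelve = to_wide(records, "12")
print(six)
print(twelve)
```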

In [None]:
interim6_df = copy.deepcopy(processed_df)
six_month_df = create_new_table(interim6_df, 6)
six_month_df

In [None]:
interim12_df = copy.deepcopy(processed_df)
twelve_month_df = create_new_table(interim12_df, 12)
twelve_month_df

In [None]:
six_month_df.columns

### Add pain meds usage criteria to the dataframes 
The number of controlled, opioid and non-controlled drugs being used by the patient

In [None]:
cntr_pain_meds_plus_opioids = ['Gabapentin', 'Tramadol', 'Pregabalin', 'Morphine Sulphate', 'Fentanyl', 'Oxycodone', 
                          'Buprenorphine', 'Diazepam', 'Tapentadol', 'Co-codamol', 'Co-dydramol', 'Codeine', 
                          'Dihydrocodiene', 'Co-proxamol']
cntr_pain_meds = ['Gabapentin', 'Tramadol', 'Pregabalin', 'Morphine Sulphate', 'Fentanyl', 'Oxycodone', 
                          'Buprenorphine', 'Diazepam', 'Tapentadol', 'Co-proxamol']
all_pain_meds = list(new_final_dict.keys())
all_pain_meds

In [None]:
cntr_pain_meds

In [None]:
def count_pain_meds(row: pd.Series, drug_set: list) -> int:
    """ count the number of entries the patient is taking that are in the drug set given
    
    args:
        row (pd.Series): series representing a row of a DataFrame
        drug_set (list): list of the drugs to be included in the count
        
    return:
        count (int): Number of drugs in the set being taken by the patient
    """
    count=0
    for col in row:
        if col in drug_set:
            count +=1
    return count


def get_class_of_drug(row: pd.Series, col_to_test: str) -> int:
    """ returns a value to represent the class of the strongest pain drug being taken by a patient
    
    args:
        row (pd.Series): series representing a row of a DataFrame
        col_to_test (str): name of the column holding the count of controlled drugs
        
    returns:
        int: 0 if no pain meds, 1 if no controlled drugs, 2 if the patient is using controlled drugs
    """
    contr_drugs = row[col_to_test]
    all_drugs = row['All_pain_meds']
    # use logical `and` here: bitwise `&` binds more tightly than `==`
    if all_drugs == 0 and contr_drugs == 0:
        return 0
    if contr_drugs > 0:
        return 2
    return 1
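
The classification rule can be checked in isolation on plain integers. The sketch below uses a hypothetical helper, `classify`, and also shows why logical `and` is the safer way to combine the two zero checks: `&` binds more tightly than `==`, so `a == 0 & b == 0` is parsed as the chained comparison `a == (0 & b) == 0`.

```python
def classify(all_meds: int, controlled: int) -> int:
    """0 = no pain meds, 1 = only non-controlled meds, 2 = at least one controlled med."""
    if all_meds == 0 and controlled == 0:
        return 0
    return 2 if controlled > 0 else 1


# Compare the bitwise and logical forms on counts where they disagree
all_meds, controlled = 0, 3
bitwise_form = all_meds == 0 & controlled == 0    # parsed as all_meds == (0 & controlled) == 0
logical_form = all_meds == 0 and controlled == 0  # tests both counts

print(bitwise_form, logical_form)
print(classify(0, 0), classify(2, 0), classify(2, 1))
```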

In [None]:
summary_dict = {
    'All_pain_meds': all_pain_meds,
    'Num_cont_meds': cntr_pain_meds,
    'Num_cont_op_meds': cntr_pain_meds_plus_opioids
}

In [None]:
def add_summary_to_df(df, summary_dict):
    for key, value in summary_dict.items():
        df[key] = df.apply(lambda x: count_pain_meds(x, value), axis=1)
    return df

In [None]:
six_month_df = add_summary_to_df(six_month_df, summary_dict)
six_month_df.head(10)

In [None]:
from functools import partial

Add columns to define the strongest type of pain meds being taken

In [None]:
assign_class_controlled = partial(get_class_of_drug, col_to_test="Num_cont_meds")
assign_class_opioids = partial(get_class_of_drug, col_to_test="Num_cont_op_meds")

In [None]:
six_month_df['drug_class'] = six_month_df.apply(lambda x: assign_class_controlled(x), axis=1)

In [None]:
six_month_df['drug_class_op'] = six_month_df.apply(lambda x: assign_class_opioids(x), axis=1)

In [None]:
six_month_df.head(30)

In [None]:
six_month_df.loc[213]

Save the six month dataframe to a csv in the data directory

In [None]:
path_to_save = path_to_data / Path("six_month_summary.csv")
six_month_df.to_csv(path_to_save)

Repeat the process for the 12 month follow ups

In [None]:
twelve_month_df = add_summary_to_df(twelve_month_df, summary_dict)
twelve_month_df.head(10)

In [None]:
twelve_month_df['drug_class'] = twelve_month_df.apply(lambda x: assign_class_controlled(x), axis=1)
twelve_month_df['drug_class_op'] = twelve_month_df.apply(lambda x: assign_class_opioids(x), axis=1)

In [None]:
twelve_month_df.head(30)

In [None]:
path_to_save = path_to_data / Path("twelve_month_summary.csv")
twelve_month_df.to_csv(path_to_save)

### Count usage of drugs

In [None]:
drug_names = list(new_final_dict.keys())
drug_names

In [None]:
count_dict_6 = {}
for drug in drug_names:
    count_dict_6[drug] = np.sum(six_month_df.values == drug)
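
The counting step tallies how many cells in the wide table equal each drug name. A plain-Python sketch of the same idea, using `collections.Counter` on made-up rows, is shown below. The helper `count_drugs` is hypothetical and assumes every cell value is hashable.

```python
from collections import Counter


def count_drugs(rows, drug_names):
    """Count how often each drug name appears anywhere in the rows."""
    counts = Counter(value for row in rows for value in row.values())
    # Counter returns 0 for names that never appear
    return {drug: counts[drug] for drug in drug_names}


# Hypothetical sample data in the wide, one-row-per-patient layout
rows = [
    {"ssid": "A-01", "drug_name_1": "Aspirin", "drug_name_2": "Naproxen"},
    {"ssid": "A-02", "drug_name_1": "Aspirin"},
]
usage = count_drugs(rows, ["Aspirin", "Naproxen", "Tramadol"])
print(usage)
```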

In [None]:
count_dict_6

In [None]:
count_6_df = pd.DataFrame.from_dict(data=count_dict_6, orient='index', columns=['count'])
count_6_df

In [None]:
count_6_df = count_6_df.sort_values('count')

In [None]:
count_6_df

In [None]:
count_6_df.to_csv(path_to_data / Path("count_6_df.csv"))

In [None]:
count_dict_12 = {}
for drug in drug_names:
    count_dict_12[drug] = np.sum(twelve_month_df.values == drug)

In [None]:
count_12_df = pd.DataFrame.from_dict(data=count_dict_12, orient='index', columns=['count'])
count_12_df = count_12_df.sort_values('count')
count_12_df

In [None]:
count_12_df.to_csv(path_to_data / Path("count_12_df.csv"))