# Flag Analysis

This notebook gives a preliminary sample of what is possible after running the CLI and then the `examples/generate_record_flags.py` script, which converts the long-form search output to record-level flags. It uses the `data/wide_form_output.csv` file, which that script produces and which is included in the repo.

The script is required to generate flags that include metadata categories. It is somewhat tailored to our use case but should be adjustable. It is also required, as opposed to a join, in order to create binary flags for detections. If you have an idea for improving the script or for getting the same results with simpler methods (e.g. a join), please [Submit an Issue](https://github.com/UK-IPOP/drug-extraction/issues) 🙂. We use the script frequently, so any help is welcome.
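
As one possible starting point for a simpler method, the sketch below builds binary record-level flags with `pd.crosstab`. It is not the logic of `generate_record_flags.py`: the long-form column names `row_id` and `search_term` and the toy rows are assumptions made for illustration, and metadata categories (the `_meta` columns) are not handled.

```python
import pandas as pd

# Hypothetical long-form search output: one row per detection.
# The column names `row_id` and `search_term` are assumptions for this sketch.
long_df = pd.DataFrame(
    {
        "row_id": ["A1", "A1", "A1", "B2", "C3"],
        "search_term": ["FENTANYL", "HEROIN", "FENTANYL", "ALCOHOL", "FENTANYL"],
    }
)

# Count detections per record and term, then cap the counts at 1
# so that each column becomes a binary flag.
flags = (
    pd.crosstab(long_df["row_id"], long_df["search_term"])
    .clip(upper=1)
    .reset_index()
)
flags.columns.name = None  # drop the leftover "search_term" axis label
print(flags)
```

Records with no detections never appear in the long-form input, so they would still need to be brought in by a left join against the original records.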

## Imports

In [116]:
import pandas as pd
import numpy as np
from pathlib import Path

## Datasets

We use the original dataset that the CLI was run on (`data/cook_records.csv`) and the wide-form output (`data/wide_form_output.csv`).

In [117]:
# read the original data and show an example row
df = pd.read_csv(Path.cwd().parent / "data" / "cook_records.csv", low_memory=False)
print(df.shape)
# treat 'unknown' zip codes as missing values
zip_cols = ["Residence Zip", "Incident Zip Code"]
df[zip_cols] = df[zip_cols].replace({"unknown": np.nan})
df.sample()

(74976, 30)


| | Case Number | Date of Incident | Date of Death | Age | Gender | Race | Latino | Manner of Death | Primary Cause | Primary Cause Line A | ... | Incident Zip Code | longitude | latitude | location | Residence City | Residence Zip | OBJECTID | Chicago Ward | Chicago Community Area | COVID Related |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 71100 | ME2023-00855 | 02/09/2023 10:54:00 AM | 02/09/2023 09:54:00 AM | 75.0 | Male | White | False | NATURAL | ORGANIC CARDIOVASCULAR DISEASE | NaN | ... | 60640 | -87.676967 | 41.963394 | (41.96339363, -87.67696716) | Chicago | 60640 | 72800 | 47.0 | LINCOLN SQUARE | False |


In [118]:
# read the search output and show an example row
wide_df = pd.read_csv(Path.cwd().parent / "data" / "wide_form_output.csv")
print(wide_df.shape)
# fill missing flags with 0
wide_df = wide_df.fillna(0)
wide_df.sample()

(30582, 121)


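| | row_id | HEROIN | FENTANYL_meta | OXYMORPHONE | HYPOXIA_meta | ETH_ALC_meta | ANPP | CYCLOBENZAPRINE | ALCOHOL | HYDROXYBUPROPION | ... | ETHANOL | DRUG_meta | HYDROMORPHONE | DEXTRORPHAN | PCP | TIZANIDINE | CANNABINOIDS_meta | OPIATE_meta | CARFENTANIL | HYPOXIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11738 | ME2020-08299 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |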


Next we join the two datasets on their case identifier: `Case Number` in the original records and `row_id` in the wide-form output.

In [119]:
merged = pd.merge(
    left=df,
    right=wide_df,
    left_on="Case Number",
    right_on="row_id",
    how="left",
    validate="1:1",
)
print(merged.shape)
merged.sample()

(74976, 151)


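| | Case Number | Date of Incident | Date of Death | Age | Gender | Race | Latino | Manner of Death | Primary Cause | Primary Cause Line A | ... | ETHANOL | DRUG_meta | HYDROMORPHONE | DEXTRORPHAN | PCP | TIZANIDINE | CANNABINOIDS_meta | OPIATE_meta | CARFENTANIL | HYPOXIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28188 | ME2018-05782 | 12/03/2018 08:30:00 AM | 12/03/2018 08:39:00 AM | 69.0 | Male | Black | False | NATURAL | ORGANIC CARDIOVASCULAR DISEASE | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |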
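
Because this is a left join, records with no matching row in the wide-form output come back with missing flags, as in the sample row above. The sketch below uses toy data (the frames and values are made up for illustration) to show one way to see how many records matched, via the `indicator` option of `merge`, and to fill the unmatched flags with 0. Reading a missing row as "no detections" is an assumption that should be checked against how the wide-form file was produced.

```python
import pandas as pd

# Toy stand-ins for the original records and the wide-form flags.
records = pd.DataFrame({"Case Number": ["A1", "B2", "C3"], "Age": [30, 45, 69]})
wide = pd.DataFrame(
    {"row_id": ["A1", "C3"], "FENTANYL": [1.0, 0.0], "HEROIN": [0.0, 1.0]}
)

merged_toy = records.merge(
    wide,
    left_on="Case Number",
    right_on="row_id",
    how="left",
    validate="1:1",
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# Assumption: a record without a wide-form row had no detections.
flag_cols = ["FENTANYL", "HEROIN"]
merged_toy[flag_cols] = merged_toy[flag_cols].fillna(0).astype(int)

print(merged_toy["_merge"].value_counts())
print(merged_toy)
```

Leaving the flags as missing instead means any later `dropna()` restricts the analysis to records that had at least one detection.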


## Record-Level Analysis

With the datasets merged, we can look at some record-level correlations and cross-tabulate drug detections by a few demographic fields.

Both of these graphics are informative but don't tell the whole story. Remember to consider population characteristics and constraints before drawing conclusions: these graphics don't "control for" disparities within subpopulations.

In [128]:
# check the correlation between age, gun related, Chicago ward, latitude, longitude,
# opioid related, fentanyl related, and heroin related
# replace True/False booleans with 1/0 so the correlation works
# scroll a bunch :)
(
    merged[["Age", "Gun Related", "Chicago Ward", "latitude", "longitude", "Opioid Related", "FENTANYL_meta", "HEROIN"]]
    .replace({True: 1, False: 0})
    .corr(numeric_only=True)
    .style.format("{:.3}")
    .background_gradient(cmap="coolwarm", vmin=-1, vmax=1)
    .set_caption("Correlation between numeric variables and booleans")
)

| | Age | Gun Related | Chicago Ward | latitude | longitude | Opioid Related | FENTANYL_meta | HEROIN |
|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | -0.381 | 0.0708 | 0.0946 | -0.146 | -0.218 | -0.454 | -0.326 |
| Gun Related | -0.381 | 1.0 | -0.0954 | -0.0946 | 0.114 | -0.152 | -0.00716 | -0.00517 |
| Chicago Ward | 0.0708 | -0.0954 | 1.0 | 0.709 | -0.495 | 0.00252 | -0.0429 | -0.0354 |
| latitude | 0.0946 | -0.0946 | 0.709 | 1.0 | -0.5 | -0.00231 | -0.0409 | -0.0328 |
| longitude | -0.146 | 0.114 | -0.495 | -0.5 | 1.0 | 0.0316 | 0.121 | 0.0821 |
| Opioid Related | -0.218 | -0.152 | 0.00252 | -0.00231 | 0.0316 | 1.0 | 0.786 | 0.567 |
| FENTANYL_meta | -0.454 | -0.00716 | -0.0429 | -0.0409 | 0.121 | 0.786 | 1.0 | 0.401 |
| HEROIN | -0.326 | -0.00517 | -0.0354 | -0.0328 | 0.0821 | 0.567 | 0.401 | 1.0 |


In [129]:
# a cross tabulation examining the role of gender and race in fentanyl involved deaths
# this does no cleaning on the gender or race categories and assumes they are correct from the source data
# we do drop `Unknown/Other` to make the output simpler for this example
table = (
    merged[["Gender", "Race", "FENTANYL"]].replace({1.0: True, 0.0: False})
    .query("Gender != 'Unknown'")
    .query("Race != 'Unknown'")
    .query("Race != 'Other'")
    .dropna()
)
pd.crosstab(
    [table["Gender"], table["Race"]],
    table["FENTANYL"],
    margins=True,
    normalize=True,
).style.format("{:.2%}").background_gradient(cmap="Blues").set_caption(
    "Cross Tabulation for Fentanyl Involved Deaths"
)

| Gender | Race | FENTANYL: False | FENTANYL: True | All |
|---|---|---|---|---|
| Female | Am. Indian | 0.03% | 0.01% | 0.04% |
| Female | Asian | 0.88% | 0.03% | 0.91% |
| Female | Black | 10.24% | 3.35% | 13.59% |
| Female | White | 16.37% | 2.97% | 19.34% |
| Male | Am. Indian | 0.04% | 0.02% | 0.06% |
| Male | Asian | 1.31% | 0.13% | 1.44% |
| Male | Black | 13.91% | 11.42% | 25.32% |
| Male | White | 28.69% | 10.61% | 39.30% |
| All | | 71.47% | 28.53% | 100.00% |
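
With `normalize=True`, every cell above is a share of all rows in the table, so larger groups dominate simply because they contain more records. The sketch below uses made-up toy data to contrast that with `normalize="index"`, which reports the rate within each group. This is only one way to view the numbers and still does not adjust for the size of the underlying population.

```python
import pandas as pd

# Toy data, made up for illustration: 3 Female and 5 Male records.
toy = pd.DataFrame(
    {
        "Gender": ["Female"] * 3 + ["Male"] * 5,
        "FENTANYL": [1, 0, 0, 1, 1, 0, 0, 0],
    }
)

# Share of ALL records falling in each (Gender, FENTANYL) cell.
overall = pd.crosstab(toy["Gender"], toy["FENTANYL"], normalize=True)

# Rate WITHIN each gender: every row sums to 1.
within = pd.crosstab(toy["Gender"], toy["FENTANYL"], normalize="index")

print(overall)
print(within)
```

In this toy example the male share of all fentanyl-flagged records is twice the female share, while the within-group rates (2 of 5 versus 1 of 3) are much closer.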
