
# Attach UMLS CUIs with Auto‑Detected Drug Columns

This notebook tags your **MarketScan** and **FAERS** raw drug lists with **UMLS CUIs** using your standardized maps.
It **auto‑detects** the drug‑name column in each raw file, so you don't have to remember exact header names.



## 1) Configure paths (edit if needed)
The defaults use the absolute paths you provided. If you're running on another machine, change these.


In [10]:

from pathlib import Path

# --- EDIT THESE IF YOUR PATHS DIFFER ---
MS_RAW   = Path("/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/ms_druglist.csv")
FAERS_RAW= Path("/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/faers_druglist.csv")

MAP_DIR  = Path("/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz")
MS_MAP_FN    = MAP_DIR / "marketscan_standardized.csv"
FAERS_MAP_FN = MAP_DIR / "faers_standardized.csv"
# Optional references
FINAL_JOINED = MAP_DIR / "final_joined.csv"
FUZZY_REVIEW = MAP_DIR / "fuzzy_review_candidates.csv"

# Output directory — write outputs alongside the maps by default
OUT_DIR = MAP_DIR

MS_RAW, FAERS_RAW, OUT_DIR


(PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/ms_druglist.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/faers_druglist.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz'))
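
Because these are absolute paths on a synced drive, it can help to confirm that every input exists before loading anything. The following is a minimal sketch; `missing_inputs` is a hypothetical helper name, not part of this notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk (hypothetical helper)."""
    return [p for p in map(Path, paths) if not p.exists()]

# Demo: the current directory exists, the second (made-up) path does not
print(missing_inputs([Path("."), Path("no_such_dir_xyz/ms_druglist.csv")]))
```

In the notebook you would pass `[MS_RAW, FAERS_RAW, MS_MAP_FN, FAERS_MAP_FN]` and stop if the returned list is non-empty.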


## 2) Helpers
- `normalize`: lower/strip/collapse whitespace  
- `find_drug_column`: auto‑detect a likely drug string column from common names  
- `attach_cui`: normalized left join from raw → standardized map  
- `summarize_qc`: row, matched, and unmatched counts plus the match rate


In [11]:

import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    return (
        s.fillna("")
         .astype(str)
         .str.lower()
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)
    )

COMMON_DRUG_COL_CANDIDATES = [
    # very common FAERS/MarketScan headers
    "fda_drug", "drug_ms", "drugname", "prod_ai",
    # generic fallbacks
    "drug", "drug_str", "drug_name", "gname", "name",
]

def find_drug_column(df: pd.DataFrame, preferred: str | None = None) -> str:
    """Pick the best drug column present in df.
    If 'preferred' provided and exists, use it; otherwise search a candidate list.
    Raise ValueError if nothing reasonable found.
    """
    cols = list(df.columns)
    if preferred and preferred in cols:
        return preferred
    for c in COMMON_DRUG_COL_CANDIDATES:
        if c in cols:
            return c
    # soft heuristic: first column whose name contains 'drug' (case-insensitive)
    for c in cols:
        if "drug" in str(c).lower():
            return c
    raise ValueError(f"No suitable drug column found. Available columns: {cols}")

def attach_cui(raw_df, map_df, raw_drug_col: str, map_drug_col: str = "Drug"):
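    # sanity checks: fail early with a readable message instead of a bare KeyError
    if raw_drug_col not in raw_df.columns:
        raise ValueError(f"Raw file missing '{raw_drug_col}'. Columns: {list(raw_df.columns)}")
    for col in (map_drug_col, "UMLS_CUI"):
        if col not in map_df.columns:
            raise ValueError(f"Map file missing '{col}'. Columns: {list(map_df.columns)}")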
    left  = raw_df.copy()
    right = map_df.copy()
    left["_join_key"]  = normalize(left[raw_drug_col])
    right["_join_key"] = normalize(right[map_drug_col])
    # de-dup map on join key, prefer rows that have a CUI present
    right = (right.sort_values(by=["UMLS_CUI"], na_position="last")
                  .drop_duplicates(subset=["_join_key"], keep="first"))
    merged = (left.merge(
        right.drop(columns=[map_drug_col]),
        on="_join_key", how="left", validate="m:1")
        .drop(columns=["_join_key"]))
    return merged

def summarize_qc(df, cui_col="UMLS_CUI"):
    total = len(df)
    matched = df[cui_col].notna().sum() if cui_col in df.columns else 0
    return pd.DataFrame([{
        "rows": total,
        "matched": matched,
        "unmatched": total - matched,
        "match_rate": round(matched / total, 4) if total else 0.0
    }])
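
To make the join behaviour concrete, here is a small self-contained sketch on toy data. It repeats the same normalization steps and the same "prefer rows with a CUI" de-duplication as the helpers above. The drug strings and CUIs are made-up placeholders, and `norm_key` is a hypothetical stand-in for `normalize`.

```python
import pandas as pd

def norm_key(s: pd.Series) -> pd.Series:
    # Same steps as normalize(): fill NaN, lowercase, strip, collapse whitespace
    return (s.fillna("").astype(str).str.lower().str.strip()
             .str.replace(r"\s+", " ", regex=True))

raw = pd.DataFrame({"drug_ms": ["  Abacavir   Sulfate ", "ABATACEPT", "Unknown Drug"]})
toy_map = pd.DataFrame({
    "Drug": ["abacavir sulfate", "Abatacept", "abatacept"],
    "UMLS_CUI": ["C0000001", None, "C0000002"],  # placeholder CUIs, not real UMLS values
})

raw["_join_key"] = norm_key(raw["drug_ms"])
toy_map["_join_key"] = norm_key(toy_map["Drug"])

# Keep one map row per key, preferring a row whose CUI is present
toy_map = (toy_map.sort_values("UMLS_CUI", na_position="last")
                  .drop_duplicates("_join_key", keep="first"))

demo = (raw.merge(toy_map.drop(columns=["Drug"]), on="_join_key",
                  how="left", validate="m:1")
           .drop(columns=["_join_key"]))
print(demo)
```

The first two rows pick up a CUI despite differences in case and spacing, the duplicate "abatacept" map entry without a CUI is dropped, and "Unknown Drug" stays unmatched.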



## 3) Load inputs & auto‑inspect columns
This cell prints the columns of each raw file and the standardized maps, then chooses the drug column automatically.


In [12]:

# Load
ms_raw    = pd.read_csv(MS_RAW, dtype=str, low_memory=False)
faers_raw = pd.read_csv(FAERS_RAW, dtype=str, low_memory=False)
ms_map    = pd.read_csv(MS_MAP_FN, dtype=str, low_memory=False)
faers_map = pd.read_csv(FAERS_MAP_FN, dtype=str, low_memory=False)

print("MarketScan raw columns:", list(ms_raw.columns))
print("FAERS raw columns     :", list(faers_raw.columns))
print("MS map columns        :", list(ms_map.columns))
print("FAERS map columns     :", list(faers_map.columns))

# Auto-detect raw drug columns
ms_drug_col    = find_drug_column(ms_raw, preferred=None)   # if you know it, pass preferred="drug_ms"
faers_drug_col = find_drug_column(faers_raw, preferred=None)

ms_drug_col, faers_drug_col


MarketScan raw columns: ['drug_ms']
FAERS raw columns     : ['fda_drug']
MS map columns        : ['Drug', 'clean_drug', 'base_drug', 'UMLS_CUI', 'Preferred_Term', 'Preferred_TTY']
FAERS map columns     : ['Drug', 'clean_drug', 'base_drug', 'UMLS_CUI', 'Preferred_Term', 'Preferred_TTY']


('drug_ms', 'fda_drug')


## 4) Attach CUIs
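Performs a normalized-string left join from each raw list onto its corresponding map. Every raw row is kept; rows with no match in the map end up with `NaN` in `UMLS_CUI` and the other map columns.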


In [13]:

# Attach
ms_with    = attach_cui(ms_raw,    ms_map,    raw_drug_col=ms_drug_col,    map_drug_col="Drug")
faers_with = attach_cui(faers_raw, faers_map, raw_drug_col=faers_drug_col, map_drug_col="Drug")

ms_with.head(10), faers_with.head(10)


(                                        drug_ms  \
 0       1,1,1,3,3-Pentafluoropropane/Norflurane   
 1              5-Methyltetrahydrofolate Calcium   
 2                  5-Methyltetrahydrofolic Acid   
 3  5-Methyltetrahydrofolic Acid/Glucosamine HCl   
 4                              Abacavir Sulfate   
 5                   Abacavir Sulfate/Lamivudine   
 6        Abacavir Sulfate/Lamivudine/Zidovudine   
 7              Abacavir/Dolutegravir/Lamivudine   
 8                                     Abatacept   
 9                                   Abemaciclib   
 
                                      clean_drug  \
 0       1 1 1 3 3-pentafluoropropane/norflurane   
 1              5-methyltetrahydrofolate calcium   
 2                  5-methyltetrahydrofolic acid   
 3  5-methyltetrahydrofolic acid/glucosamine hcl   
 4                              abacavir sulfate   
 5                   abacavir sulfate/lamivudine   
 6        abacavir sulfate/lamivudine/zidovudine   
 7        
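
A missing `UMLS_CUI` after the join can mean two different things: the drug string is absent from the map, or it is present but the map itself has no CUI for it. One way to tell these apart is pandas' `indicator=True` option on `merge`. The sketch below uses toy data with a placeholder CUI.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"]})
right = pd.DataFrame({"key": ["a", "b"], "UMLS_CUI": ["C0000001", None]})  # placeholder CUI

tagged = left.merge(right, on="key", how="left", validate="m:1", indicator=True)

# "both"      -> key found in the map (the CUI may still be missing)
# "left_only" -> key not present in the map at all
status = tagged["_merge"].astype(str).tolist()
print(status)
```

Here "b" is in the map without a CUI and "c" is not in the map at all; both would count as unmatched in the QC summary, but they call for different follow-up.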


## 5) Write outputs & QC
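- `ms_with_cui.csv` / `faers_with_cui.csv`: raw rows with the map columns attached  
- `unmatched_ms.csv` / `unmatched_faers.csv`: rows whose `UMLS_CUI` is missing  
- `qc_attach_summary.csv`: per-source counts and match rate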


In [14]:

OUT_DIR.mkdir(parents=True, exist_ok=True)

ms_out_fn    = OUT_DIR / "ms_with_cui.csv"
faers_out_fn = OUT_DIR / "faers_with_cui.csv"

ms_with.to_csv(ms_out_fn, index=False)
faers_with.to_csv(faers_out_fn, index=False)

# QC + unmatched
qc_ms = summarize_qc(ms_with).assign(source="MarketScan")
qc_fa = summarize_qc(faers_with).assign(source="FAERS")
qc = pd.concat([qc_ms, qc_fa], ignore_index=True)[["source","rows","matched","unmatched","match_rate"]]

qc_fn   = OUT_DIR / "qc_attach_summary.csv"
um_ms   = OUT_DIR / "unmatched_ms.csv"
um_fa   = OUT_DIR / "unmatched_faers.csv"

qc.to_csv(qc_fn, index=False)
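# Index the column directly: .get() returns None if the column is absent,
# which would fail on .isna() with a confusing AttributeError
ms_with[ms_with["UMLS_CUI"].isna()].to_csv(um_ms, index=False)
faers_with[faers_with["UMLS_CUI"].isna()].to_csv(um_fa, index=False)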

qc, ms_out_fn, faers_out_fn, qc_fn, um_ms, um_fa


(       source   rows  matched  unmatched  match_rate
 0  MarketScan   2584     2145        439      0.8301
 1       FAERS  21816    13870       7946      0.6358,
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz/ms_with_cui.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz/faers_with_cui.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz/qc_attach_summary.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz/unmatched_ms.csv'),
 PosixPath('/Users/rahurkar.1/Library/CloudStorage/OneDrive-TheOhioStateUniversityWexnerMedicalCenter/FAERS/drug_id_platform/cross_quartz/unmatched_faers.csv'))
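
As a quick sanity check, the QC table can be verified for internal consistency: matched plus unmatched should equal the row count, and the match rate should equal matched divided by rows, rounded to four decimals. The sketch below uses the counts copied from the output above.

```python
import math

# Counts copied from the QC table printed above
reported = {
    "MarketScan": {"rows": 2584, "matched": 2145, "unmatched": 439, "match_rate": 0.8301},
    "FAERS": {"rows": 21816, "matched": 13870, "unmatched": 7946, "match_rate": 0.6358},
}

checks = {}
for source, r in reported.items():
    counts_ok = r["matched"] + r["unmatched"] == r["rows"]
    rate_ok = math.isclose(round(r["matched"] / r["rows"], 4), r["match_rate"], abs_tol=1e-9)
    checks[source] = counts_ok and rate_ok
    print(source, "consistent" if checks[source] else "INCONSISTENT")
```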