# Cohere Health Machine Learning Takehome Assignment
This notebook was prepared by members of the Cohere Health Machine Learning team to help candidates get started on the interview takehome assignment. The code here is only an example starting point, and using it is optional. Reach out to your recruiting team if you have any questions.

## Load in Data
Set `PATH_TO_ZIP` to your own local path: the directory that holds the contents of `sampleclinicalnotes.zip`. The code below expects to find the `training_20180910/` folder inside it.

In [1]:
from load_data import load_ann, load_txt
import pandas as pd
import text_utils
import df_utils

In [2]:
PATH_TO_ZIP = "data/"
DATA_PATH = f"{PATH_TO_ZIP}training_20180910/"
print(f"Full data path: {DATA_PATH}")

Full data path: data/training_20180910/
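
Before loading, it can help to confirm that the directory exists and that notes and annotations are paired. The sketch below is not part of the provided helpers: `check_data_dir` is a hypothetical name, and the layout it assumes (one `<id>.txt` per note with a matching `<id>.ann`) is inferred from the two loaders used in the following cells.

```python
from pathlib import Path


def check_data_dir(data_path):
    """Return (txt_count, ann_count, unpaired_stems) for a data directory.

    Hypothetical helper, not part of load_data. Assumes each note is stored
    as <id>.txt with its annotations in <id>.ann.
    """
    root = Path(data_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    txt_stems = {p.stem for p in root.glob("*.txt")}
    ann_stems = {p.stem for p in root.glob("*.ann")}
    # Stems present in only one of the two sets have no partner file.
    unpaired = sorted(txt_stems ^ ann_stems)
    return len(txt_stems), len(ann_stems), unpaired
```

Calling `check_data_dir(DATA_PATH)` before `load_txt` would surface a wrong path early instead of producing an empty DataFrame.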


In [3]:
# read in txt files
txt_df = load_txt(DATA_PATH)

Time taken to read .txt files: 0.13614726066589355


In [4]:
# read in entities (ent_df) and relations (rel_df) from .ann files
ent_df, rel_df = load_ann(DATA_PATH)

Time taken to read .ann files and extract all metadata: 0.25187253952026367
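
The implementation of `load_ann` is not shown here. The entity ids (`T1`, `T3`) and relation ids (`R1`, `R4`) in the tables below suggest the `.ann` files use the brat standoff format, so a single line could be parsed roughly as follows. This is a sketch under that assumption; `parse_ann_line` is a hypothetical name, and the dictionary keys simply mirror the column names that appear in `ent_df` and `rel_df`.

```python
def parse_ann_line(line):
    """Parse one brat standoff line into a dict, or return None if unrecognised.

    Assumed formats (brat standoff):
      entity:   "T1<TAB>Reason 10179 10197<TAB>recurrent seizures"
      relation: "R1<TAB>Reason-Drug Arg1:T1 Arg2:T3"
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None
    ann_id, body = fields[0], fields[1]
    if ann_id.startswith("T") and len(fields) >= 3:
        category, spans = body.split(" ", 1)
        # Discontinuous entities list several "start end" pairs joined by ";".
        # Keep the outermost offsets so the span covers every fragment.
        offsets = [int(n) for part in spans.split(";") for n in part.split()]
        return {
            "kind": "entity",
            "entity_id": ann_id,
            "category": category,
            "start_idx": offsets[0],
            "end_idx": offsets[-1],
            "text": fields[2],
        }
    if ann_id.startswith("R"):
        category, arg1, arg2 = body.split()
        return {
            "kind": "relation",
            "relationship_id": ann_id,
            "category": category,
            "entity1": arg1.split(":", 1)[1],
            "entity2": arg2.split(":", 1)[1],
        }
    return None
```

Keeping only the outermost offsets for discontinuous entities is a simplification: the span then covers more characters than the annotated text contains.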


In [5]:
all_dfs = [txt_df, ent_df, rel_df]
txt_df, ent_df, rel_df = [df_utils.fix_slashes(df) for df in all_dfs]

In [6]:
rel_df = df_utils.clean_rel_df(rel_df)
ent_df = df_utils.clean_ent_df(ent_df)

In [7]:
txt_df.head()

Unnamed: 0,file_idx,text
0,training_20180910/100035,Admission Date: [**2115-2-22**] ...
1,training_20180910/100039,Admission Date: [**2174-4-18**] ...
2,training_20180910/100187,Admission Date: [**2107-1-17**] ...
3,training_20180910/100229,Admission Date: [**2114-12-24**] ...
4,training_20180910/100564,Admission Date: [**2144-1-20**] ...


In [8]:
ent_df.head()

Unnamed: 0,file_idx,entity_id,category,start_idx,end_idx,text
0,training_20180910/100035,T1,Reason,10179,10197,recurrent seizures
1,training_20180910/100035,T3,Drug,10227,10233,ativan
2,training_20180910/100035,T5,Route,10240,10242,IM
3,training_20180910/100035,T6,Drug,10455,10465,Topiramate
4,training_20180910/100035,T7,Strength,10466,10470,25mg


In [9]:
rel_df.head()

Unnamed: 0,file_idx,relationship_id,category,entity1,entity2
0,training_20180910/100035,R1,Reason-Drug,T1,T3
1,training_20180910/100035,R4,Route-Drug,T5,T3
2,training_20180910/100035,R5,Strength-Drug,T7,T6
3,training_20180910/100035,R6,Route-Drug,T8,T6
4,training_20180910/100035,R7,Frequency-Drug,T9,T6


In [10]:
rel_df.category.value_counts()

category
Strength-Drug     6702
Form-Drug         6654
Frequency-Drug    6310
Route-Drug        5538
Reason-Drug       5169
Dosage-Drug       4225
ADE-Drug          1107
Duration-Drug      643
Name: count, dtype: int64

In [11]:
# sanity check: span width minus annotated text length (0 when the offsets match the text exactly)
ent_df['length_diff'] = (ent_df.end_idx - ent_df.start_idx) - ent_df.text.str.len()
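
`length_diff` only compares lengths, so a span that is shifted by a few characters would still score 0. A complementary check is to slice the note text with the offsets and compare it with the annotated text. The sketch below is plain Python with a made-up note; `span_matches` is a hypothetical helper, and collapsing whitespace assumes that an annotation crossing a line break stores the newline as a space.

```python
def span_matches(note_text, start_idx, end_idx, entity_text):
    """Return True if note_text[start_idx:end_idx] equals entity_text.

    Whitespace runs are collapsed on both sides, on the assumption that an
    annotation spanning a line break stores the newline as a space.
    """
    sliced = " ".join(note_text[start_idx:end_idx].split())
    expected = " ".join(entity_text.split())
    return sliced == expected


# Made-up note text for illustration only.
note = "Patient had recurrent\nseizures and was given ativan IM."
aligned = span_matches(note, 12, 30, "recurrent seizures")
shifted = span_matches(note, 46, 52, "ativan")
```

Here `aligned` is true because the slice covers exactly "recurrent\nseizures", while `shifted` is false even though the span width equals the text length.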

In [12]:
txt_df = df_utils.split_txt_df(txt_df)

In [13]:
full_df = df_utils.get_underlying_factors(txt_df, ent_df)

In [14]:
full_df[['primary_diagnosis', 'underlying_factors']]

Unnamed: 0,primary_diagnosis,underlying_factors
0,Anoxic Brain Injury s/p PEA arrest x,"asthma, seizures, seizure"
1,Abdominal Pain,nausea
2,Pulmonary Embolism with history of DVT and IVC...,"hematoma, anticoagulation, pneumonia, multifoc..."
3,Sepsis,"abdominal pain, infection, hyperkalemia, adren..."
4,Deep Vein Thrombosis of subclavian vein,"pain, thrombolysis"
...,...,...
298,Acute Renal Failure,
299,HCV/HCC,
300,congestive heart failure,"copd, ethylene glycol intoxication, hyperkalem..."
301,Lymphedema with superimposed cellulitis and un...,


In [15]:
for p_d in full_df.primary_diagnosis:
    print(p_d)

Anoxic Brain Injury s/p PEA arrest x
Abdominal Pain
Pulmonary Embolism with history of DVT and IVC filter
Sepsis
Deep Vein Thrombosis of subclavian vein
Healthcare associated pneumonia
Altered mental status secondary to excessive narcotics
Malignant pleural effusion
Diabetic Ketoacidosis
Chronic obstructive pulmonary disease
Disseminated intravascular coagulation
Pneumonia
N/A
Type A Aortic Dissection
Metastatic rectal cancer
Hypothyroidism
Vocal cord dysfunction
Secondary: Hypertension
Myocardial infarction and profound vagal reaction
Jejunal Ulcer
Diarrhea
lymphoblastic lymphoma / ALL
LGIB s/p colectomy
Alcoholic Cirrhosis
ACUTE ISSUES:
Supratherapeutic INR
Lower GI bleed
traumatic left frontal intaparychymal hemorrhage (contusion
alcohol intoxication compounded by librium intoxication
Free air on CXR s/p G-tube placement
Diffuse Large B Cell Lymphoma
Acute-on-chronic anemia with guaiac + stools and no further GI
# Fall
GI bleed secondary to colonic lesion
Coronary artery disease s/p
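
Several of the printed values look like extraction artifacts rather than diagnoses, for example `N/A`, `ACUTE ISSUES:`, `# Fall`, and `Secondary: Hypertension`. One way to surface them for review is a small rule-based flag. The rules below are illustrative assumptions drawn only from the values printed above, and `flag_primary_diagnosis` is a hypothetical name, not a function from `df_utils`.

```python
import re


def flag_primary_diagnosis(value):
    """Return a short reason if value looks like an extraction artifact, else None.

    The rules are illustrative assumptions based on the printed examples.
    """
    text = value.strip()
    if not text or text.upper() in {"N/A", "NA", "NONE"}:
        return "missing"
    if text.endswith(":"):
        return "section header"
    if re.match(r"secondary\s*:", text, flags=re.IGNORECASE):
        return "secondary diagnosis"
    if text.startswith("#"):
        return "list marker"
    return None
```

Applying this with something like `full_df.primary_diagnosis.map(flag_primary_diagnosis)` would give a column that is empty for plausible diagnoses and names the suspected problem otherwise.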

In [51]:
import spacy
import scispacy

from scispacy.linking import EntityLinker

# en_ner_bc5cdr_md is a scispaCy NER model that labels DISEASE and CHEMICAL entities
nlp = spacy.load("en_ner_bc5cdr_md")

In [59]:
doc = nlp(full_df.diagnosis[1])

In [63]:
doc.ents[1].text

'chronic renal failure -Systolic Heart failure'

In [54]:
full_df.diagnosis[1]

'Primary: -Abdominal Pain -Acute on chronic renal failure -Systolic Heart failure'
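
The entity `'chronic renal failure -Systolic Heart failure'` spans two separate list items, because the diagnosis string packs several items onto one line. One option is to split the string into items before running NER. The sketch below assumes items are introduced either by a leading dash (`-Abdominal Pain`) or by a number (`1) ...`), and that sections are introduced by `Primary:` or `Secondary:`. Those patterns are inferred from the two examples in this notebook and may not cover every note; `split_diagnosis_items` is a hypothetical helper.

```python
import re

# Item markers assumed from the examples in this notebook: a dash directly
# before the item text ("-Abdominal Pain") or a number ("1) ...").
_ITEM_MARKER = re.compile(r"(?:^|\s)(?:-(?=\S)|\d+\)\s*)")
_SECTION = re.compile(r"\b(Primary|Secondary)\s*:", flags=re.IGNORECASE)


def split_diagnosis_items(diagnosis):
    """Split a packed diagnosis string into {section: [items]}."""
    sections = {}
    parts = _SECTION.split(diagnosis)
    # parts = [preamble, name1, body1, name2, body2, ...]
    for name, body in zip(parts[1::2], parts[2::2]):
        items = [item.strip(" .,") for item in _ITEM_MARKER.split(body)]
        sections[name.lower()] = [item for item in items if item]
    return sections


example = "Primary: -Abdominal Pain -Acute on chronic renal failure -Systolic Heart failure"
items = split_diagnosis_items(example)
```

For the string above, `items["primary"]` holds three separate entries, and each could then be passed to `nlp` on its own so that an entity cannot cross an item boundary. A dash surrounded by spaces, as in `E. coli - quinolone resistant`, is deliberately not treated as an item marker.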

In [55]:
doc = nlp(full_df.diagnosis[2])

In [56]:
full_df.diagnosis[2]

'Primary: 1) Pulmonary Embolism with history of DVT and IVC filter placement in [**2106-7-8**] 2) Community Acquired Pneumonia 3) History of GI Bleed (extensive) in [**2106-7-8**] when anticoagulated 4) Abdominal wall hematoma, with acute blood loss anemia requiring 10 units PRBCs when anticoagulated for current pulmonary embolism 5) Noscomial Pneumonia with GNR in sputum, 6) Coagulopathy 7) Noscomial UTI with E. coli - quinolone resistant 8) Vagnitis, attributed to broad spectrum antibiotic usage 9) otitis externa 10) tachycardia 11) diarrhea 12) incidentally noted left renal cyst/mass NOS 13) Coagulase negative staphylococcal bacteremia 14) Rectus sheath hematoma in setting of anticoagulation . Secondary: 1) chronic orthostatic hypotension 2) recurrent otitis externa 3) ulcerative colitis in remission 4) chronic obstructive pulmonary disease 5) depression 6) h/o schizoaffective disorder'

In [57]:
doc.ents

(Pulmonary Embolism,
 DVT,
 Pneumonia,
 GI Bleed,
 hematoma,
 acute blood loss anemia,
 pulmonary embolism,
 Coagulopathy,
 quinolone,
 Vagnitis,
 otitis,
 tachycardia,
 diarrhea,
 bacteremia,
 hematoma,
 chronic orthostatic hypotension,
 otitis,
 ulcerative colitis,
 chronic obstructive pulmonary disease,
 depression,
 schizoaffective disorder)

In [58]:
for ent in doc.ents:
    print(f'Ent: {ent.text}\tLabel: {ent.label_}')

Ent: Pulmonary Embolism	Label: DISEASE
Ent: DVT	Label: DISEASE
Ent: Pneumonia	Label: DISEASE
Ent: GI Bleed	Label: DISEASE
Ent: hematoma	Label: DISEASE
Ent: acute blood loss anemia	Label: DISEASE
Ent: pulmonary embolism	Label: DISEASE
Ent: Coagulopathy	Label: DISEASE
Ent: quinolone	Label: CHEMICAL
Ent: Vagnitis	Label: CHEMICAL
Ent: otitis	Label: DISEASE
Ent: tachycardia	Label: DISEASE
Ent: diarrhea	Label: DISEASE
Ent: bacteremia	Label: DISEASE
Ent: hematoma	Label: DISEASE
Ent: chronic orthostatic hypotension	Label: DISEASE
Ent: otitis	Label: DISEASE
Ent: ulcerative colitis	Label: DISEASE
Ent: chronic obstructive pulmonary disease	Label: DISEASE
Ent: depression	Label: DISEASE
Ent: schizoaffective disorder	Label: DISEASE
