## Data description

### Important Terms

- **Clinical Case**: The scenario (e.g., symptoms, complaints, concerns) the Standardized Patient presents to the test taker (medical student, resident or physician). Ten clinical cases are represented in this dataset.
- **Patient Note**: Text detailing important information related by the patient during the encounter (physical exam and interview).
- **Feature**: A clinically relevant concept. A rubric describes the key concepts relevant to each case.

### Training Data

<ul>
    <li>
        <b>patient_notes.csv</b> - A collection of about 40,000 Patient Note history portions. Only a subset of these have features annotated. You may wish to apply unsupervised learning techniques to the notes without annotations. The patient notes in the test set are not included in the public version of this file.
        <ul>
            <li>pn_num - A unique identifier for each patient note.</li>
            <li>case_num - A unique identifier for the clinical case a patient note represents.</li>
            <li>pn_history - The text of the encounter as recorded by the test taker.</li>
        </ul>
    </li>
    <li>
        <b>features.csv</b> - The rubric of features (or key concepts) for each clinical case.
        <ul>
            <li>feature_num - A unique identifier for each feature.</li>
            <li>case_num - A unique identifier for each case.</li>
            <li>feature_text - A description of the feature.</li>
        </ul>
    </li>
    <li>
        <b>train.csv</b> - Feature annotations for 1000 of the patient notes, 100 for each of ten cases.
        <ul>
            <li>id - Unique identifier for each patient note / feature pair.</li>
            <li>pn_num - The patient note annotated in this row.</li>
            <li>feature_num - The feature annotated in this row.</li>
            <li>case_num - The case to which this patient note belongs.</li>
            <li>annotation - The text(s) within a patient note indicating a feature. A feature may be indicated multiple times within a single note.</li>
            <li>location - Character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent a single annotation, in which case the spans are delimited by a semicolon (<code>;</code>).</li>
        </ul>
    </li>
</ul>
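
The `location` format described above can be decoded with a few lines of standard-library Python. This is a minimal sketch, assuming each cell is stored as the string form of a Python list (for example `['70 91', '176 183']`) and that each entry holds one or more `start end` pairs separated by semicolons. The helper name `parse_location` and the semicolon sample value are hypothetical and used for illustration only.

```python
import ast


def parse_location(location_str):
    """Turn a raw `location` cell into a flat list of (start, end) character spans."""
    spans = []
    # Assumption: the cell is a string such as "['70 91', '176 183']".
    for entry in ast.literal_eval(location_str):
        # One annotation may consist of several spans separated by ';'.
        for part in entry.split(";"):
            start, end = part.split()
            spans.append((int(start), int(end)))
    return spans


print(parse_location("['70 91', '176 183']"))  # two separate annotations
print(parse_location("['10 20;30 40']"))       # hypothetical value: one annotation, two spans
print(parse_location("[]"))                    # feature not present in the note
```

Note that this sketch flattens all spans into one list, so it no longer records which spans belonged to the same annotation. If that grouping matters, returning a list of lists would be the simplest change.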

In [1]:
import sys
sys.path.append("..")

In [2]:
import os

import pandas as pd

## Constants

In [3]:
data_folder = os.path.join("..", "data")
raw_folder = os.path.join(data_folder, "raw")

patient_notes_file_path = os.path.join(raw_folder, "patient_notes.csv")
features_file_path = os.path.join(raw_folder, "features.csv")
train_file_path = os.path.join(raw_folder, "train.csv")
test_file_path = os.path.join(raw_folder, "test.csv")

## Extract

In [4]:
patient_notes_df = pd.read_csv(patient_notes_file_path)

patient_notes_df

Unnamed: 0,pn_num,case_num,pn_history
0,0,0,"17-year-old male, has come to the student heal..."
1,1,0,17 yo male with recurrent palpitations for the...
2,2,0,Dillon Cleveland is a 17 y.o. male patient wit...
3,3,0,a 17 yo m c/o palpitation started 3 mos ago; \...
4,4,0,17yo male with no pmh here for evaluation of p...
...,...,...,...
42141,95330,9,Ms. Madden is a 20 yo female presenting w/ the...
42142,95331,9,A 20 YO F CAME COMPLAIN A DULL 8/10 HEADACHE T...
42143,95332,9,Ms. Madden is a 20yo female who presents with ...
42144,95333,9,Stephanie madden is a 20 year old woman compla...


In [5]:
features_df = pd.read_csv(features_file_path)

features_df

Unnamed: 0,feature_num,case_num,feature_text
0,0,0,Family-history-of-MI-OR-Family-history-of-myoc...
1,1,0,Family-history-of-thyroid-disorder
2,2,0,Chest-pressure
3,3,0,Intermittent-symptoms
4,4,0,Lightheaded
...,...,...,...
138,912,9,Family-history-of-migraines
139,913,9,Female
140,914,9,Photophobia
141,915,9,No-known-illness-contacts


In [6]:
train_df = pd.read_csv(train_file_path)

train_df

Unnamed: 0,id,case_num,pn_num,feature_num,annotation,location
0,00016_000,0,16,0,['dad with recent heart attcak'],['696 724']
1,00016_001,0,16,1,"['mom with ""thyroid disease']",['668 693']
2,00016_002,0,16,2,['chest pressure'],['203 217']
3,00016_003,0,16,3,"['intermittent episodes', 'episode']","['70 91', '176 183']"
4,00016_004,0,16,4,['felt as if he were going to pass out'],['222 258']
...,...,...,...,...,...,...
14295,95333_912,9,95333,912,[],[]
14296,95333_913,9,95333,913,[],[]
14297,95333_914,9,95333,914,['photobia'],['274 282']
14298,95333_915,9,95333,915,['no sick contacts'],['421 437']


In [7]:
test_df = pd.read_csv(test_file_path)

test_df

Unnamed: 0,id,case_num,pn_num,feature_num
0,00016_000,0,16,0
1,00016_001,0,16,1
2,00016_002,0,16,2
3,00016_003,0,16,3
4,00016_004,0,16,4


## Analysis

### Create the full train dataset by merging:
1. train
2. features
3. patient notes

In [8]:
train_df = pd.merge(
    train_df, features_df, how='left', on=["feature_num", "case_num"])

train_df = pd.merge(
    train_df, patient_notes_df, how='left', on=['pn_num', 'case_num'])

train_df.head()

Unnamed: 0,id,case_num,pn_num,feature_num,annotation,location,feature_text,pn_history
0,00016_000,0,16,0,['dad with recent heart attcak'],['696 724'],Family-history-of-MI-OR-Family-history-of-myoc...,HPI: 17yo M presents with palpitations. Patien...
1,00016_001,0,16,1,"['mom with ""thyroid disease']",['668 693'],Family-history-of-thyroid-disorder,HPI: 17yo M presents with palpitations. Patien...
2,00016_002,0,16,2,['chest pressure'],['203 217'],Chest-pressure,HPI: 17yo M presents with palpitations. Patien...
3,00016_003,0,16,3,"['intermittent episodes', 'episode']","['70 91', '176 183']",Intermittent-symptoms,HPI: 17yo M presents with palpitations. Patien...
4,00016_004,0,16,4,['felt as if he were going to pass out'],['222 258'],Lightheaded,HPI: 17yo M presents with palpitations. Patien...
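
The left merges above silently keep rows whose keys find no match and silently duplicate rows if a key repeats on the right side. As a hedged sketch, pandas' `validate` and `indicator` arguments to `pd.merge` can make both conditions explicit. The toy frames `train_toy` and `features_toy` are stand-ins built from values shown in the outputs above, not the real files.

```python
import pandas as pd

# Toy stand-ins for train_df and features_df (values copied from the outputs above).
train_toy = pd.DataFrame({
    "id": ["00016_001", "00016_002"],
    "case_num": [0, 0],
    "pn_num": [16, 16],
    "feature_num": [1, 2],
})
features_toy = pd.DataFrame({
    "feature_num": [1, 2],
    "case_num": [0, 0],
    "feature_text": ["Family-history-of-thyroid-disorder", "Chest-pressure"],
})

merged = pd.merge(
    train_toy,
    features_toy,
    how="left",
    on=["feature_num", "case_num"],
    validate="many_to_one",  # raises MergeError if a (feature_num, case_num) key repeats on the right
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Rows that found no matching feature would be marked "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged) == len(train_toy), unmatched)
```

The same two arguments could be applied to the merge with `patient_notes_df`, where the key is `["pn_num", "case_num"]`.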


### Create the full test dataset by merging:
1. test
2. features
3. patient notes

In [9]:
test_df = pd.merge(
    test_df, features_df, how='left', on=["feature_num", "case_num"])

test_df = pd.merge(
    test_df, patient_notes_df, how='left', on=['pn_num', 'case_num'])

test_df.head()

Unnamed: 0,id,case_num,pn_num,feature_num,feature_text,pn_history
0,00016_000,0,16,0,Family-history-of-MI-OR-Family-history-of-myoc...,HPI: 17yo M presents with palpitations. Patien...
1,00016_001,0,16,1,Family-history-of-thyroid-disorder,HPI: 17yo M presents with palpitations. Patien...
2,00016_002,0,16,2,Chest-pressure,HPI: 17yo M presents with palpitations. Patien...
3,00016_003,0,16,3,Intermittent-symptoms,HPI: 17yo M presents with palpitations. Patien...
4,00016_004,0,16,4,Lightheaded,HPI: 17yo M presents with palpitations. Patien...


### So we need to predict the annotation from the patient note (pn_history) and the feature (feature_text)

In [15]:
train_df[train_df.pn_num == 16]

Unnamed: 0,id,case_num,pn_num,feature_num,annotation,location,feature_text,pn_history
0,00016_000,0,16,0,['dad with recent heart attcak'],['696 724'],Family-history-of-MI-OR-Family-history-of-myoc...,HPI: 17yo M presents with palpitations. Patien...
1,00016_001,0,16,1,"['mom with ""thyroid disease']",['668 693'],Family-history-of-thyroid-disorder,HPI: 17yo M presents with palpitations. Patien...
2,00016_002,0,16,2,['chest pressure'],['203 217'],Chest-pressure,HPI: 17yo M presents with palpitations. Patien...
3,00016_003,0,16,3,"['intermittent episodes', 'episode']","['70 91', '176 183']",Intermittent-symptoms,HPI: 17yo M presents with palpitations. Patien...
4,00016_004,0,16,4,['felt as if he were going to pass out'],['222 258'],Lightheaded,HPI: 17yo M presents with palpitations. Patien...
5,00016_005,0,16,5,[],[],No-hair-changes-OR-no-nail-changes-OR-no-tempe...,HPI: 17yo M presents with palpitations. Patien...
6,00016_006,0,16,6,"['adderall', 'adderrall', 'adderrall']","['321 329', '404 413', '652 661']",Adderall-use,HPI: 17yo M presents with palpitations. Patien...
7,00016_007,0,16,7,[],[],Shortness-of-breath,HPI: 17yo M presents with palpitations. Patien...
8,00016_008,0,16,8,[],[],Caffeine-use,HPI: 17yo M presents with palpitations. Patien...
9,00016_009,0,16,9,"['palpitations', 'heart beating/pounding']","['26 38', '96 118']",heart-pounding-OR-heart-racing,HPI: 17yo M presents with palpitations. Patien...
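
To make the relationship between `pn_history`, `location` and `annotation` concrete, the sketch below slices a note with character spans to recover the annotated text. It assumes spans are `(start, end)` pairs with an exclusive end, which is consistent with the rows shown above (for example, `203 217` covers the 14 characters of "chest pressure"). The note text and the helper `extract_spans` are hypothetical.

```python
def extract_spans(note, spans):
    """Return the substrings of `note` covered by (start, end) spans, end exclusive."""
    return [note[start:end] for start, end in spans]


# Hypothetical note text, not taken from the dataset.
note = "Pt reports chest pressure and intermittent episodes."
spans = [(11, 25), (30, 51)]

print(extract_spans(note, spans))
```

Running the same check over the real training rows (comparing each slice with the corresponding `annotation` entry) would be one way to confirm the span convention before modeling.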
