# n2c2 Shared Task
https://portal.dbmi.hms.harvard.edu/projects/n2c2-t1/

03/07/2018

## Parse the Record Files and Extract the Tags
There are 202 record files in the folder train/. Each record file contains the clinical notes for one patient. At the end of each record file there is a list of tags indicating whether the patient meets each of the 13 clinical trial criteria.

* **DRUG-ABUSE**: Drug abuse, current or past
* **ALCOHOL-ABUSE**: Current alcohol use over weekly recommended limits
* **ENGLISH**: Patient must speak English
* **MAKES-DECISIONS**: Patient must make their own medical decisions 
* **ABDOMINAL**: History of intra-abdominal surgery, small or large intestine resection, or small bowel obstruction
* **MAJOR-DIABETES**: Major diabetes-related complication
  - For the purposes of this annotation, we define “major complication” (as opposed to “minor complication”) as any of the following that are a result of (or strongly correlated with) uncontrolled diabetes:
   * Amputation
   * Kidney damage
   * Skin conditions
   * Retinopathy
   * Nephropathy
   * Neuropathy
* **ADVANCED-CAD**: Advanced cardiovascular disease
  - For the purposes of this annotation, we define “advanced” as having two or more of the following:
   * Taking two or more medications to treat CAD
   * History of myocardial infarction
   * Currently experiencing angina
   * Ischemia, past or present
* **MI-6MOS**: Myocardial infarction in the past 6 months
* **KETO-1YR**: Diagnosis of ketoacidosis in the past year
* **DIETSUPP-2MOS**: Taken a dietary supplement (excluding Vitamin D) in the past 2 months
* **ASP-FOR-MI**: Use of aspirin to prevent myocardial infarction
* **HBA1C**: Any HbA1c value between 6.5 and 9.5%
* **CREATININE**: Serum creatinine > upper limit of normal
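
The record layout described above (the notes in a `TEXT` element, followed by a `TAGS` element with one child per criterion) can be sketched with the standard library alone. The sample XML below is hypothetical and heavily shortened; the element names and the `met` attribute are inferred from the `xmltodict` lookups used later in this notebook, and `parse_record` is an illustrative helper, not part of the shared-task tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened record; real files contain all 13 tags.
SAMPLE_RECORD = """<PatientMatching>
<TEXT><![CDATA[Record date: 2068-02-04

Patient is a 53-year old female ...]]></TEXT>
<TAGS>
<ABDOMINAL met="not met" />
<HBA1C met="met" />
</TAGS>
</PatientMatching>"""


def parse_record(xml_string):
    """Return (notes_text, {tag_name: 'met' or 'not met'}) for one record."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")
    tags = {element.tag: element.get("met") for element in root.find("TAGS")}
    return text, tags


text, tags = parse_record(SAMPLE_RECORD)
print(tags)
```

The rest of the notebook uses `xmltodict` instead, where the same attribute appears under the key `'@met'`.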

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

### List the files in the train directory
All the files are located in the train directory.

In [5]:
import os
files = os.listdir("train")
len(files)

202

In [6]:
import xmltodict
with open("train/" + files[0], "r") as fd:
    doc = xmltodict.parse(fd.read())

### Test the XML Parse Functionality
The text field and the tags need to be extracted from each parsed record.

In [7]:
doc["PatientMatching"]["TAGS"]['ABDOMINAL']['@met']

'not met'

In [15]:
doc['PatientMatching']['TEXT']

'Record date: 2068-02-04\n\nASSOCIATED ARTHRITIS SPECIALISTS CENTER            Quijano, Baylee\n                                              2-03-68\n \n \n \nIdentification:  Patient is a 53-year old markedly obese female \ncomplaining of bilateral weight knee pain.  She denies any morning \nstiffness, any jelling phenomena.  She has had no trauma or effusions in \nthe knee.  She had noted on x-ray many years ago that she had had a \nchipped bone in the right knee and some mild osteoarthritis but it has \nbothered her only intermittently until this year.  Last year she did go \nto Briggs Stratton and dropped to 270 lb. at which point her knees felt \nbetter, but currently she is back up to over 300 lb. and at her height \nof 5 ft. 1 in., she is in pain.  She has not visited a physician.  She \nhas no internist who currently checks on her general medical health, but \nshe does report that the occasional Advil she takes does produce some \ndyspepsia.  She has no history of gout, no his

### Create a Pandas Dataframe from the Files
The dataframe contains the content of all the files. Each row is identified by its file name so we can refer back to it later.

In [35]:
record_files = []
record_text = []
tags2lists = {}

In [36]:
tags = ['ABDOMINAL', 'ADVANCED-CAD', 'ALCOHOL-ABUSE', 'ASP-FOR-MI', 'CREATININE', \
        'DIETSUPP-2MOS', 'DRUG-ABUSE', 'ENGLISH', 'HBA1C', 'KETO-1YR',\
        'MAJOR-DIABETES', 'MAKES-DECISIONS', 'MI-6MOS']
len(tags)

13

In [37]:
for file in files:
    with open("train/" + file, "r") as fd:
        doc = xmltodict.parse(fd.read())
        record_files.append(file)
        record_text.append(doc['PatientMatching']['TEXT'])
        for tag in tags:
            alist = []
            if tag in tags2lists.keys():
                alist = tags2lists[tag]
            else:
                tags2lists[tag] = alist
            
            alist.append(doc["PatientMatching"]["TAGS"][tag]['@met'])
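
The loop above builds one list per tag with an explicit `if`/`else`. A slightly shorter way to express the same bookkeeping is `collections.defaultdict`. This is only a sketch: `collect_columns` is a hypothetical helper, and the two parsed documents are made up to mimic the shape of `xmltodict` output.

```python
from collections import defaultdict


def collect_columns(docs, tags):
    """Build column lists from (file_name, parsed_doc) pairs.

    Each parsed_doc is assumed to look like xmltodict output:
    doc["PatientMatching"]["TAGS"][tag]["@met"].
    """
    columns = defaultdict(list)
    for file_name, doc in docs:
        columns["record_file"].append(file_name)
        columns["record_text"].append(doc["PatientMatching"]["TEXT"])
        for tag in tags:
            columns[tag].append(doc["PatientMatching"]["TAGS"][tag]["@met"])
    return dict(columns)


# Made-up documents, not taken from the training data.
fake_docs = [
    ("a.xml", {"PatientMatching": {"TEXT": "note a",
                                   "TAGS": {"HBA1C": {"@met": "met"}}}}),
    ("b.xml", {"PatientMatching": {"TEXT": "note b",
                                   "TAGS": {"HBA1C": {"@met": "not met"}}}}),
]
columns = collect_columns(fake_docs, ["HBA1C"])
print(columns["HBA1C"])
```

The resulting dictionary of equal-length lists can be passed to `pd.DataFrame` in the same way as `tags2lists`.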

In [38]:
len(tags2lists['MI-6MOS'])

202

In [39]:
len(record_text)

202

In [42]:
tags2lists['record_file'] = record_files
tags2lists['record_text'] = record_text
df = pd.DataFrame(tags2lists)

In [43]:
df.head()

Unnamed: 0,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS,record_file,record_files,record_text
0,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met,162.xml,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...
1,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met,176.xml,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...
2,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met,189.xml,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...
3,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met,214.xml,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...
4,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met,200.xml,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...


In [45]:
df = df[['record_file', 'record_text', 'ABDOMINAL', 'ADVANCED-CAD', 'ALCOHOL-ABUSE', 'ASP-FOR-MI', 'CREATININE', \
        'DIETSUPP-2MOS', 'DRUG-ABUSE', 'ENGLISH', 'HBA1C', 'KETO-1YR',\
        'MAJOR-DIABETES', 'MAKES-DECISIONS', 'MI-6MOS']]

In [46]:
df.head()

Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
0,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met
1,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met
2,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
3,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met
4,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 15 columns):
record_file        202 non-null object
record_text        202 non-null object
ABDOMINAL          202 non-null object
ADVANCED-CAD       202 non-null object
ALCOHOL-ABUSE      202 non-null object
ASP-FOR-MI         202 non-null object
CREATININE         202 non-null object
DIETSUPP-2MOS      202 non-null object
DRUG-ABUSE         202 non-null object
ENGLISH            202 non-null object
HBA1C              202 non-null object
KETO-1YR           202 non-null object
MAJOR-DIABETES     202 non-null object
MAKES-DECISIONS    202 non-null object
MI-6MOS            202 non-null object
dtypes: object(15)
memory usage: 23.8+ KB


## Save the Data as CSV 

In [48]:
df.to_csv("all-train.csv")

# Vectorization

In [8]:
train_df = pd.read_csv("all-train.csv")

In [13]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
0,0,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met
1,1,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met
2,2,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
3,3,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met
4,4,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met


In [10]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 16 columns):
Unnamed: 0         202 non-null int64
record_file        202 non-null object
record_text        202 non-null object
ABDOMINAL          202 non-null object
ADVANCED-CAD       202 non-null object
ALCOHOL-ABUSE      202 non-null object
ASP-FOR-MI         202 non-null object
CREATININE         202 non-null object
DIETSUPP-2MOS      202 non-null object
DRUG-ABUSE         202 non-null object
ENGLISH            202 non-null object
HBA1C              202 non-null object
KETO-1YR           202 non-null object
MAJOR-DIABETES     202 non-null object
MAKES-DECISIONS    202 non-null object
MI-6MOS            202 non-null object
dtypes: int64(1), object(15)
memory usage: 25.3+ KB


In [11]:
train_text = train_df['record_text']

In [12]:
train_text.describe()

count                                                   202
unique                                                  202
top       Record date: 2079-01-14\n\n\nPICH 9\n89 James ...
freq                                                      1
Name: record_text, dtype: object

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1,3), \
                            stop_words="english", dtype=np.float32)

In [15]:
train_vec = vectorizer.fit_transform(train_text)

In [16]:
train_vec.shape

(202, 505950)
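
The feature count is far larger than the number of records because `ngram_range=(1,3)` adds every two-word and three-word sequence on top of the single words. The sketch below only illustrates how word n-grams are enumerated; `word_ngrams` is a hypothetical helper, and it does not reproduce scikit-learn's tokenization, stop-word removal, or TF-IDF weighting.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["patient", "denies", "chest", "pain"]  # made-up example
grams = word_ngrams(tokens)
print(len(grams))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
print(grams[-2:])  # ['patient denies chest', 'denies chest pain']
```

Most bigrams and trigrams occur in only one record, which is why the vocabulary grows so quickly on long clinical notes.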

# Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
train_df.columns

Index(['Unnamed: 0', 'record_file', 'record_text', 'ABDOMINAL', 'ADVANCED-CAD',
       'ALCOHOL-ABUSE', 'ASP-FOR-MI', 'CREATININE', 'DIETSUPP-2MOS',
       'DRUG-ABUSE', 'ENGLISH', 'HBA1C', 'KETO-1YR', 'MAJOR-DIABETES',
       'MAKES-DECISIONS', 'MI-6MOS'],
      dtype='object')

In [21]:
ABDOMINAL_Y_train = train_df['ABDOMINAL']
logreg = LogisticRegression()
logreg.fit(train_vec, ABDOMINAL_Y_train)
acc_log = round(logreg.score(train_vec, ABDOMINAL_Y_train) * 100, 2)
acc_log

85.150000000000006

In [22]:
ADVANCED_CAD_Y_train = train_df['ADVANCED-CAD']
logreg = LogisticRegression()
logreg.fit(train_vec, ADVANCED_CAD_Y_train)
acc_log = round(logreg.score(train_vec, ADVANCED_CAD_Y_train) * 100, 2)
acc_log

90.099999999999994

In [23]:
ALCOHOL_ABUSE_Y_train = train_df['ALCOHOL-ABUSE']
logreg = LogisticRegression()
logreg.fit(train_vec, ALCOHOL_ABUSE_Y_train)
acc_log = round(logreg.score(train_vec, ALCOHOL_ABUSE_Y_train) * 100, 2)
acc_log

96.530000000000001

In [24]:
ASP_Y_train = train_df['ASP-FOR-MI']
logreg = LogisticRegression()
logreg.fit(train_vec, ASP_Y_train)
acc_log = round(logreg.score(train_vec, ASP_Y_train) * 100, 2)
acc_log

80.200000000000003

In [25]:
CREATININE_Y_train = train_df['CREATININE']
logreg = LogisticRegression()
logreg.fit(train_vec, CREATININE_Y_train)
acc_log = round(logreg.score(train_vec, CREATININE_Y_train) * 100, 2)
acc_log

96.530000000000001

In [26]:
DIETSUPP_Y_train = train_df['DIETSUPP-2MOS']
logreg = LogisticRegression()
logreg.fit(train_vec, DIETSUPP_Y_train)
acc_log = round(logreg.score(train_vec, DIETSUPP_Y_train) * 100, 2)
acc_log

99.5

In [31]:
DRUG_Y_train = train_df['DRUG-ABUSE']
logreg = LogisticRegression()
logreg.fit(train_vec, DRUG_Y_train)
acc_log = round(logreg.score(train_vec, DRUG_Y_train) * 100, 2)
acc_log

94.060000000000002

In [32]:
ENGLISH_Y_train = train_df['ENGLISH']
logreg = LogisticRegression()
logreg.fit(train_vec, ENGLISH_Y_train)
acc_log = round(logreg.score(train_vec, ENGLISH_Y_train) * 100, 2)
acc_log

95.049999999999997

In [92]:
HBA1C_Y_train = train_df['HBA1C']
logreg = LogisticRegression()
logreg.fit(train_vec, HBA1C_Y_train)
acc_log = round(logreg.score(train_vec, HBA1C_Y_train) * 100, 2)
acc_log

67.819999999999993

In [99]:
KETO_Y_train = train_df['KETO-1YR']
logreg = LogisticRegression()
logreg.fit(train_vec, KETO_Y_train)
acc_log = round(logreg.score(train_vec, KETO_Y_train) * 100, 2)
acc_log

99.5

In [98]:
MAJOR_Y_train = train_df['MAJOR-DIABETES']
logreg = LogisticRegression()
logreg.fit(train_vec, MAJOR_Y_train)
acc_log = round(logreg.score(train_vec, MAJOR_Y_train) * 100, 2)
acc_log

99.5

In [85]:
MAKES_Y_train = train_df['MAKES-DECISIONS']
logreg = LogisticRegression()
logreg.fit(train_vec, MAKES_Y_train)
acc_log = round(logreg.score(train_vec, MAKES_Y_train) * 100, 2)
acc_log

96.040000000000006

In [93]:
MI_Y_train = train_df['MI-6MOS']
logreg = LogisticRegression()
logreg.fit(train_vec, MI_Y_train)
acc_log = round(logreg.score(train_vec, MI_Y_train) * 100, 2)
acc_log

91.090000000000003
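
All of the scores above are accuracies measured on the same data the models were trained on. A quick sanity check is the accuracy of always predicting the most frequent label. The sketch below computes that baseline for two criteria, using label counts that appear elsewhere in this notebook (67 of the 202 records are `met` for HBA1C, and exactly one record is `met` for KETO-1YR); `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy (in percent) of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return round(100 * most_common_count / len(labels), 2)


# Label counts reconstructed from this notebook's own outputs.
hba1c_labels = ["met"] * 67 + ["not met"] * 135
keto_labels = ["met"] * 1 + ["not met"] * 201

hba1c_baseline = majority_baseline_accuracy(hba1c_labels)
keto_baseline = majority_baseline_accuracy(keto_labels)
print(hba1c_baseline)  # 66.83
print(keto_baseline)   # 99.5
```

These baselines are very close to the reported 67.82 (HBA1C) and 99.5 (KETO-1YR), which suggests that for at least these two criteria the classifier may be doing little more than predicting the majority class. Held-out or cross-validated scores would give a more reliable picture.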

In [None]:
def train_logisticRegression(X_train, y_train):
    param_grid = {'penalty': ['l1', 'l2']}

# Text Preprocessing

In [1]:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

In [2]:
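def preprocessing(text):
    removelist = ['\n','\t']
    tokens = word_tokenize(text.lower())
    processed = []
    # Tokens matching any of the following patterns are dropped.
    pattern = re.compile(r"\d+-\d+-\d+")   # dates such as 2068-02-04
    pattern2 = re.compile(r"\d+/\d+/\d+")  # dates such as 01/31/2096
    pattern3 = re.compile(r"was")          # any token containing "was"
    pattern4 = re.compile(r"\d+/\d+")      # partial dates or ratios such as 1/31
    for each in tokens:
        if each not in removelist:
            match = pattern.search(each)
            match2 = pattern2.search(each)
            match3 = pattern3.search(each)
            match4 = pattern4.search(each)
            if not match and not match2 and not match3 and not match4:
                processed.append(each)
    res = ' '.join(processed)
    # Keep only lowercase letters, digits, '.' and '%'.
    res = re.sub('[^a-z0-9.%]',' ',res)
    # Remove all whitespace, so the result is one unbroken string.
    res = re.sub(r'\s+', '', res)
    return res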
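
To see what this function does to a note without installing NLTK, the sketch below mirrors its steps with `str.split` standing in for `word_tokenize`. That substitution is an assumption made for illustration, so the token boundaries can differ from the real function, and `preprocess_sketch` is a hypothetical name.

```python
import re

DATE_DASH = re.compile(r"\d+-\d+-\d+")  # e.g. 2068-02-04
SLASHED = re.compile(r"\d+/\d+")        # e.g. 01/31 or 01/31/2096
WAS = re.compile(r"was")                # any token containing "was"


def preprocess_sketch(text):
    """Approximate the notebook's preprocessing using str.split as tokenizer."""
    tokens = text.lower().split()
    kept = [token for token in tokens
            if not (DATE_DASH.search(token)
                    or SLASHED.search(token)
                    or WAS.search(token))]
    res = " ".join(kept)
    res = re.sub("[^a-z0-9.%]", " ", res)  # keep letters, digits, '.' and '%'
    return re.sub(r"\s+", "", res)         # remove every whitespace character


example = preprocess_sketch("Record date: 2068-02-04 HbA1c was 7.2 %")
print(example)  # recorddatehba1c7.2%
```

Two consequences are worth keeping in mind: the output has no word boundaries, so later patterns must be written without spaces, and the `was` filter also drops unrelated tokens such as "washed".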

In [3]:
import pandas as pd
train_df = pd.read_csv("all-train.csv")
test_df = pd.read_csv("all-test.csv")

In [325]:
df_record = train_df['record_text'].copy()

In [326]:
df_record.shape

(202,)

In [327]:
for i, each in enumerate(df_record):
    df_record[i] = preprocessing(each)

In [328]:
df_record[198]

'recorddateccearpaincoughhpi49y.o.femalecocoughwithclearmucusleftearpainandnasalcongestionforthepastweek.shedeniesanychestpainshortnessofbreathfeverchillsorsweat.nosorethroat.shedoessmokeaboutonepackdayfor25yearsanddoesnotwanttoquit.shealsonotesproblemswithbendingherrightthumbwithouthotraumapainparathesiasornumbness.symptomspresentforpast3months.allergiesnkahealthmaintcholesterol222mammogramseereportinresultspapsmearseereportinresultshba1c10.70procedureshysterectomyflowsheetsbloodpressureweight206lbhentperrlaoralwoutlesionsnecksupplenasalturbinatesmildlyenlargedwithnosinustenderness.nolymphadenopathyandtmsclear.lungsctaextinabilitytobendrightthumbwithouttendernesswarmthorswelling.ap1.allergieswithcough.claritinandrobitussinforcoughandgivensmokinghistorywillcheckchestxray.2.rightthumbstiffness.checkxrayandreferraltoorthopedics.3.healthmaint.pt.declinedphysicalandonreviewingchartnotedtohavediabetes.recommendfuforphysicalandassessmentofdiabetes.gracec.valerieyunm.d.recorddateccfollowuplab

# KETO-1YR analysis

In [59]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
0,0,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met
1,1,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met
2,2,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
3,3,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met
4,4,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met


Only one patient has "KETO-1YR" set to "met". The next cell finds that patient's row index.

In [60]:
Flag = 0
for each in train_df['KETO-1YR']:
    if each == 'met':
        print(Flag)
    Flag = Flag + 1

141


The patient is at row index 141, which is the 142nd record in the CSV file.
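
The counter-based loop above can also be written with `enumerate`. This is a minimal sketch with made-up labels; `indices_of` is a hypothetical helper.

```python
def indices_of(labels, wanted="met"):
    """Return the positions at which `labels` equals `wanted`."""
    return [i for i, label in enumerate(labels) if label == wanted]


labels = ["not met", "met", "not met", "met"]  # made-up labels, not from the data
positions = indices_of(labels)
print(positions)  # [1, 3]
```

Applied to `train_df['KETO-1YR']`, the same helper would be expected to return a single-element list.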

In [20]:
train_df.loc[141,'record_file']

'291.xml'

This is the file "291.xml".

In [363]:
import re
renal_failure = []
creatinine = []
diabetes = []
HBA1C = []
pattern1 = re.compile(r"(?<!family history\s)renalfailure")
pattern2 = re.compile(r"(?<!family history\s)creatinine")
pattern3 = re.compile(r"(?<!family history\s)insulindependentdiabetesmellitus")
Flag = 0
Flag2 = 0
for each in df_record:
    match1 = pattern1.search(each)
    match2 = pattern2.search(each)
    match3 = pattern3.search(each) 
    if match1 and Flag not in renal_failure:
        renal_failure.append(Flag)
    if match2 and Flag not in creatinine:
        creatinine.append(Flag)
    if match3 and Flag not in diabetes:
        diabetes.append(Flag)
    Flag = Flag + 1
for each in train_df['HBA1C']:
    if each == 'met' and Flag2 not in HBA1C:
        HBA1C.append(Flag2)
    Flag2 = Flag2 + 1

In [364]:
set1 = set(renal_failure)
set2 = set(creatinine)
set3 = set(diabetes)
set4 = set(HBA1C)
len(set1 & set2 & set3 & set4)

2

In [365]:
set1 & set2 & set3 & set4

{36, 141}
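
One detail of the patterns above deserves a check. The negative lookbehind `(?<!family history\s)` contains spaces, but the preprocessed text has had all whitespace removed, so the lookbehind can never exclude anything. The sketch below demonstrates this on a made-up string; the `adapted` pattern is a hypothetical variant shown only for comparison, not the author's method.

```python
import re

# Made-up text in the same whitespace-free form as the preprocessed records.
squashed = "familyhistoryrenalfailure"

original = re.compile(r"(?<!family history\s)renalfailure")
adapted = re.compile(r"(?<!familyhistory)renalfailure")  # hypothetical variant

original_hit = bool(original.search(squashed))
adapted_hit = bool(adapted.search(squashed))
print(original_hit)  # True: the family-history mention is not excluded
print(adapted_hit)   # False: the space-free lookbehind excludes it
```

Switching to a space-free lookbehind could change which records land in the intersection, so the result above would need to be recomputed before drawing conclusions from it.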

In [362]:
train_df.iloc[141]

Unnamed: 0                                                       141
record_file                                                  291.xml
record_text        Record date: 2095-01-16\n\n                   ...
ABDOMINAL                                                        met
ADVANCED-CAD                                                     met
ALCOHOL-ABUSE                                                not met
ASP-FOR-MI                                                       met
CREATININE                                                       met
DIETSUPP-2MOS                                                not met
DRUG-ABUSE                                                   not met
ENGLISH                                                          met
HBA1C                                                            met
KETO-1YR                                                         met
MAJOR-DIABETES                                                   met
MAKES-DECISIONS                   

# HBA1C analysis

In [9]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
0,0,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met
1,1,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met
2,2,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
3,3,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met
4,4,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met


In [10]:
Flag = 0
HBA1C = []
for each in train_df['HBA1C']:
    if each == 'met' and Flag not in HBA1C:
        HBA1C.append(Flag)
    Flag = Flag + 1

In [11]:
len(HBA1C)

67

In [12]:
HBA1C

[2,
 3,
 10,
 14,
 21,
 22,
 26,
 28,
 30,
 33,
 34,
 35,
 36,
 37,
 39,
 40,
 41,
 43,
 50,
 54,
 55,
 60,
 67,
 75,
 76,
 77,
 81,
 82,
 86,
 90,
 91,
 97,
 98,
 102,
 104,
 107,
 108,
 115,
 116,
 118,
 122,
 124,
 125,
 128,
 135,
 139,
 141,
 143,
 147,
 150,
 152,
 153,
 154,
 156,
 158,
 160,
 168,
 171,
 173,
 175,
 177,
 180,
 192,
 197,
 198,
 200,
 201]

In [335]:
import re
HBA1C_MET = []
flag = 0
flag_group = []
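# A lookbehind must have a fixed width, so try every gap of 0-19 characters between "a1c" and the value.
for num in range(0,20):
    pattern = re.compile(r"(?<=a1c[a-z0-9\s]{" + str(num) + r"})\d+\.\d{0,2}")
    pattern2 = re.compile(r"(?<=a1c[a-z0-9\s]{" + str(num) + r"})\d\.?\d{0,2}(?=%)")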
    for each in df_record:
        match = pattern.findall(each)
        match2 = pattern2.findall(each)
        if match:
            #print(flag)
            #print(match)
            for each in match:
                float_each = float(each)
                if float_each >= 6.5 and float_each <= 9.5:
                    flag_group.append(flag)
                    HBA1C_MET.append(match)
        if match2:
            for each in match2:
                float_each = float(each)
                if float_each >= 6.5 and float_each <= 9.5:
                    flag_group.append(flag)
                    HBA1C_MET.append(match2)            
        flag =flag + 1
    flag = 0
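
The loop over gap widths exists because Python's `re` only supports fixed-width lookbehinds. An alternative is to match the gap directly and capture the value, which needs a single pattern. The sketch below is not the method used in this notebook and will not reproduce its results exactly: it restricts the gap to letters and whitespace so the gap cannot swallow part of the number. `A1C_VALUE`, `a1c_values`, and `meets_hba1c` are hypothetical names.

```python
import re

# "a1c", then up to 19 letters/whitespace (as few as possible), then the value.
A1C_VALUE = re.compile(r"a1c[a-z\s]{0,19}?(\d+(?:\.\d{1,2})?)")


def a1c_values(text):
    """Return every number that follows an 'a1c' mention, as floats."""
    return [float(value) for value in A1C_VALUE.findall(text)]


def meets_hba1c(text, low=6.5, high=9.5):
    """True if any extracted HbA1c value lies in [low, high]."""
    return any(low <= value <= high for value in a1c_values(text))


print(a1c_values("hba1c10.70procedures"))  # [10.7]
print(meets_hba1c("hgba1c7.2%"))           # True
print(meets_hba1c("hba1c10.70procedures")) # False
```

Because the preprocessed text has no spaces, a value can still run into the digits that follow it, so any such rule should be checked against the gold labels.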

In [336]:
flag_group_set = set(flag_group)
len(flag_group_set)

68

In [337]:
flag_group_set

{2,
 3,
 6,
 10,
 14,
 21,
 22,
 26,
 28,
 30,
 33,
 34,
 35,
 36,
 37,
 39,
 40,
 41,
 43,
 50,
 54,
 55,
 60,
 63,
 67,
 75,
 76,
 77,
 81,
 82,
 86,
 90,
 91,
 96,
 97,
 98,
 102,
 104,
 107,
 108,
 115,
 118,
 122,
 124,
 125,
 128,
 135,
 139,
 141,
 143,
 147,
 150,
 153,
 154,
 156,
 158,
 160,
 168,
 169,
 171,
 173,
 177,
 180,
 192,
 197,
 198,
 200,
 201}
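
The rule-based set (68 records) and the gold `met` labels (67 records) are close in size, but size alone does not show how well they agree. The sketch below compares the two index sets, copied from the outputs above, and reports the disagreements together with precision and recall.

```python
# Row indices labelled 'met' for HBA1C (copied from the HBA1C list above).
gold = {2, 3, 10, 14, 21, 22, 26, 28, 30, 33, 34, 35, 36, 37, 39, 40, 41, 43,
        50, 54, 55, 60, 67, 75, 76, 77, 81, 82, 86, 90, 91, 97, 98, 102, 104,
        107, 108, 115, 116, 118, 122, 124, 125, 128, 135, 139, 141, 143, 147,
        150, 152, 153, 154, 156, 158, 160, 168, 171, 173, 175, 177, 180, 192,
        197, 198, 200, 201}

# Row indices flagged by the regular expressions (copied from flag_group_set).
predicted = {2, 3, 6, 10, 14, 21, 22, 26, 28, 30, 33, 34, 35, 36, 37, 39, 40,
             41, 43, 50, 54, 55, 60, 63, 67, 75, 76, 77, 81, 82, 86, 90, 91,
             96, 97, 98, 102, 104, 107, 108, 115, 118, 122, 124, 125, 128,
             135, 139, 141, 143, 147, 150, 153, 154, 156, 158, 160, 168, 169,
             171, 173, 177, 180, 192, 197, 198, 200, 201}

true_positives = gold & predicted
false_positives = predicted - gold   # flagged by the rule, labelled 'not met'
false_negatives = gold - predicted   # labelled 'met', missed by the rule

precision = round(len(true_positives) / len(predicted), 3)
recall = round(len(true_positives) / len(gold), 3)

print(sorted(false_positives))  # [6, 63, 96, 169]
print(sorted(false_negatives))  # [116, 152, 175]
print(precision, recall)        # 0.941 0.955
```

Record 63, inspected in the next cells, is one of the four false positives.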

In [339]:
df_record[63]

'recorddatethesourceofthisnoteistherchemergencydept.informationsystem.allupdatesshouldoriginateinedis.redbudcommunityhospitalemergencydepartmentrecordobservationunitedisnotestatussignedpatientjacksyasseenmrn8841350dobsexmregistrationdatetime20900amedobsnoteadmissionnotetimepatientseen418chiefcomplaintdeltamshypertensionhpi82y.omwhxhtnhyperthyroidismbaselinemilddementiapwchangeinmentalstatus.perpt.sonheunabletorecognizehiswifeandwantedtogetoutofthehouseheagitatedbutnoncombativeepisodelastedforabout1hr.deniesanyfcnohanocppalpitationsnospeechdifficultynoneuromotordeficitnonumbnesstinglingofextremities.pfshxrospastmedicalhistoryseehpi.familyhistorynoncontributory.socialhistorynonsmokerquit10y.oallergynkarosconstitutionalmajorweightgain.fatigue.fever.chills.headeyesheadache.visionchanges.entnecknosignficiantfindings.chestrespiratoryshortnessofbreath.wheezing.dyspneaonexertion.cough.cardiovascularlegswelling.giabdominalvomiting.diarrhea.guflankcvapelvicnosignficiantfindings.musculoskeletalex

In [338]:
train_df.iloc[63]

Unnamed: 0                                                        63
record_file                                                  329.xml
record_text        Record date: 2149-06-08\n\n**The source of thi...
ABDOMINAL                                                    not met
ADVANCED-CAD                                                 not met
ALCOHOL-ABUSE                                                not met
ASP-FOR-MI                                                       met
CREATININE                                                   not met
DIETSUPP-2MOS                                                    met
DRUG-ABUSE                                                   not met
ENGLISH                                                      not met
HBA1C                                                        not met
KETO-1YR                                                     not met
MAJOR-DIABETES                                               not met
MAKES-DECISIONS                   

In [15]:
train_df.loc[141,'record_file']

'291.xml'

In [90]:
train_df.iloc[49]

Unnamed: 0                                                        49
record_file                                                  288.xml
record_text        Record date: 2065-10-20\n\n                   ...
ABDOMINAL                                                        met
ADVANCED-CAD                                                 not met
ALCOHOL-ABUSE                                                not met
ASP-FOR-MI                                                   not met
CREATININE                                                   not met
DIETSUPP-2MOS                                                not met
DRUG-ABUSE                                                   not met
ENGLISH                                                          met
HBA1C                                                        not met
KETO-1YR                                                     not met
MAJOR-DIABETES                                               not met
MAKES-DECISIONS                   

In [368]:
result_file = open("result.txt", 'w+')   
i = 1
u = 2
print("Year,Total principal and interest", file=result_file)  

In [370]:
len(train_df)

202

In [372]:
train_df.tail()

Unnamed: 0.1,Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
197,197,219.xml,Record date: 2070-01-05\n\n \n \n \n \n \nJanu...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
198,198,392.xml,"Record date: 2111-09-26\n\nCC: Ear pain, coug...",met,met,not met,not met,met,met,not met,met,met,not met,met,met,not met
199,199,345.xml,Record date: 2072-11-25\n\n ...,not met,not met,not met,met,met,not met,not met,met,not met,not met,met,met,not met
200,200,184.xml,Record date: 2062-01-29\n\nTRIBAL INTERNAL MED...,not met,met,not met,met,met,met,met,met,met,not met,met,met,not met
201,201,147.xml,"Record date: 2091-08-09\n\n\n\nAugust 9, 2091\...",not met,not met,not met,met,not met,met,not met,met,met,not met,not met,met,not met


In [373]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,record_file,record_text,ABDOMINAL,ADVANCED-CAD,ALCOHOL-ABUSE,ASP-FOR-MI,CREATININE,DIETSUPP-2MOS,DRUG-ABUSE,ENGLISH,HBA1C,KETO-1YR,MAJOR-DIABETES,MAKES-DECISIONS,MI-6MOS
0,0,162.xml,Record date: 2068-02-04\n\nASSOCIATED ARTHRITI...,met,met,not met,met,not met,not met,not met,met,not met,not met,not met,met,not met
1,1,176.xml,Record date: 2085-04-22\n\n \nThis patient wan...,met,not met,met,not met,not met,met,not met,met,not met,not met,not met,met,not met
2,2,189.xml,Record date: 2090-07-07\n\nWillow Gardens Care...,not met,met,not met,met,met,met,not met,met,met,not met,met,met,not met
3,3,214.xml,Record date: 2096-07-15\n\n\n\nResults01/31/20...,not met,met,not met,met,not met,met,not met,not met,met,not met,not met,met,met
4,4,200.xml,Record date: 2170-02-17\n\n \n\nReason for Vis...,met,not met,not met,met,not met,met,not met,met,not met,not met,met,met,not met
