## Document selection for manual annotation

Manual annotation is the process of marking up documents with annotations that can be used to train or test a natural language processing system. We are using the brat annotation tool to collect annotations for your final project.
Setting up a project in brat requires a set of files:
- `*.txt` files: text files to be annotated
- `*.ann` files: annotation files that will store the annotations
- annotation.conf: annotation type configuration
- visual.conf: annotation display configuration
- tools.conf: annotation tool configuration
- kb_shortcuts.conf: keyboard shortcut configuration

Each annotation project typically defines its own annotation.conf. The files visual.conf, tools.conf and kb_shortcuts.conf are optional: if they are not present, the system falls back on simple default visuals, tools and shortcuts.
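
To make the role of annotation.conf more concrete, here is a minimal sketch that writes a file with the four section headers used by brat's configuration format. The entity names (`Disease`, `Anatomy`) and the `write_annotation_conf` helper are hypothetical placeholders, not this project's actual schema, and the file is written to a temporary directory rather than a real brat project.

```python
import tempfile
from pathlib import Path

# Hypothetical entity types; replace with the types your project needs.
ENTITY_TYPES = ["Disease", "Anatomy"]

def write_annotation_conf(project_dir, entity_types):
    """Write a minimal annotation.conf that lists only entity types."""
    lines = ["[entities]"]
    lines += list(entity_types)
    # The remaining sections are left empty in this sketch.
    lines += ["", "[relations]", "", "[events]", "", "[attributes]", ""]
    conf_path = Path(project_dir) / "annotation.conf"
    conf_path.write_text("\n".join(lines), encoding="utf-8")
    return conf_path

# Demonstrate in a temporary directory, not in ~/BRAT.
demo_dir = tempfile.mkdtemp()
conf_path = write_annotation_conf(demo_dir, ENTITY_TYPES)
conf_text = conf_path.read_text(encoding="utf-8")
print(conf_text)
```

Relations, events and attributes can be added under their section headers later, once the annotation schema for the project is settled.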

First, we need to select the files to be annotated. For our project, we will draw documents from the MIMIC database.

Let's connect to the database (remember, the password to the database is the same as the user name):


In [1]:
import pymysql
import pandas as pd
import getpass
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),db='mimic2')

Enter MySQL passwd for jovyan········


We have connected to the database. Let's see the tables that we have to work with.

In [None]:
pd.read_sql("SELECT table_name, table_rows FROM information_schema.tables where table_schema='mimic2'", conn)

There are quite a few tables to explore, and you should explore them for your project. The documents are in the "noteevents" table. Let's see what columns this table has.

In [None]:
pd.read_sql("SELECT column_name, is_nullable, column_type FROM information_schema.columns WHERE table_name = 'noteevents'",conn)


Note that only two fields are not nullable, meaning they cannot be blank, so they are candidates for the primary key. Let's take a look at what the data looks like:

In [2]:
pd.read_sql("SELECT subject_id, charttime, text  from noteevents where text is not null limit 10",conn)


Unnamed: 0,subject_id,charttime,text
0,56,2644-01-17 00:00:00,\n \n \n \nAdmission Date: [**2644-1-17**] ...
1,56,2644-01-17 00:00:00,\n\n\n DATE: [**2644-1-17**] 10:53 AM\n ...
2,56,2644-01-17 00:00:00,\n\n\n DATE: [**2644-1-17**] 10:53 AM\n ...
3,56,2644-01-17 00:00:00,\n\n\n DATE: [**2644-1-17**] 10:43 AM\n ...
4,56,2644-01-17 00:00:00,\n\n\n DATE: [**2644-1-17**] 6:37 AM\n ...
5,56,2644-01-17 06:18:00,\nNSG Admit noteB:\nPlease refer to careview a...
6,56,2644-01-17 20:12:00,\nNursing Progress Note:\nPlease refer to Care...
7,56,2644-01-18 03:40:00,\nCondition Update\nD: See carevue for specifi...
8,56,2644-01-18 18:10:00,\nNursing Progress Note\nPlease see carvue for...
9,56,2644-01-19 00:00:00,\n\n\n DATE: [**2644-1-19**] 12:09 PM\n ...


The table does not have a document ID, but it has "subject_id" as a patient identifier. This is a limitation of the demo dataset: the NOTEID column exists in the full MIMIC database, but not in the current demo version. We will have to come up with a unique name for each document.
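
Whether a set of columns can actually serve as a key is easy to check in pandas. The sketch below uses a small made-up frame that mimics the pattern in the sample output (several notes for one patient sharing a chart time); the `is_unique_key` helper and the `notes` frame are illustrative assumptions, not part of the MIMIC schema.

```python
import pandas as pd

# Toy stand-in for a few noteevents rows (illustrative values only).
notes = pd.DataFrame({
    "subject_id": [56, 56, 56, 57],
    "charttime": ["2644-01-17 00:00:00", "2644-01-17 00:00:00",
                  "2644-01-18 03:40:00", "2644-01-17 00:00:00"],
})

def is_unique_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()

subject_only = is_unique_key(notes, ["subject_id"])
subject_and_time = is_unique_key(notes, ["subject_id", "charttime"])
print(subject_only, subject_and_time)
```

In this toy frame neither combination is unique, which is why the file names built later also include the data frame's row index.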
 
### Select documents from the database

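For this demo, we will select documents from the "noteevents" table that contain a keyword pattern we are interested in. The query below returns all matching documents in random order; append `LIMIT 10` to the query if you want to keep only 10 of them.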

In [4]:
docs_text = pd.read_sql("SELECT subject_id, text from noteevents   where text like '%ankle%brachial%index%' order by rand()",conn)
docs_text.head()

Unnamed: 0,subject_id,text
0,15011,\n\n\n DATE: [**3500-2-19**] 9:45 AM\n ...
1,23097,\n\n\n DATE: [**3185-1-18**] 2:22 PM\n ...
2,21380,\n\n\n DATE: [**3470-8-11**] 1:44 PM\n ...
3,18600,\n\n\n DATE: [**2840-12-31**] 12:32 PM\n ...
4,25879,\n\n\n DATE: [**3357-2-21**] 8:26 AM\n ...
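
If you prefer to do the random selection in pandas rather than in SQL, `DataFrame.sample` offers a repeatable alternative. This is a minimal sketch on a made-up stand-in frame (`docs_demo` is hypothetical; in the notebook you would sample from `docs_text` instead).

```python
import pandas as pd

# Toy stand-in for docs_text; the real frame comes from the noteevents query.
docs_demo = pd.DataFrame({
    "subject_id": range(100, 125),
    "text": [f"note {i}" for i in range(25)],
})

# Draw 10 rows without replacement; random_state makes the draw repeatable.
sample = docs_demo.sample(n=10, random_state=42)

# The original row index is preserved, so names built from it stay stable.
print(sample[["subject_id"]].head())
```

Fixing `random_state` means rerunning the notebook selects the same documents, which helps when several annotators need to work on an identical set.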


Iterate through the documents to see what they look like.

In [5]:
for index, row in docs_text.iterrows():
    print(index, row["subject_id"], row["text"])

0 15011 


     DATE: [**3500-2-19**] 9:45 AM
     ART EXT (REST ONLY)                                             Clip # [**Clip Number (Radiology) 14025**]
     Reason: Patient with gangrene 2nd toe right foot, s/p angioplasty,, 
     ______________________________________________________________________________
     UNDERLYING MEDICAL CONDITION:
      70 year old man with gangrene right foot
     REASON FOR THIS EXAMINATION:
      Patient with gangrene 2nd toe right foot, s/p angioplasty,
      Please evaluate lower extremity pulses doppler wveforms and PVR's, etc.
     ______________________________________________________________________________
                                     FINAL REPORT
     REASON:  Gangrene of toe.  In addition, patient is status post popliteal
     artery procedure.
     
     FINDINGS:
     
     Doppler evaluation was performed on both lower extremities at rest.
     
     On the right, Doppler tracings are triphasic at the femoral and popliteal
     

### Writing documents into files

To keep your documents separate from everyone else's, enter your UNID.

In [6]:
unid = 'u1166466'

Check the folder just to see that you have prepared your workspace.

In [7]:
%%bash  -s "$unid"
echo ~/BRAT/$1/*
ls   ~/BRAT/$1/ 

/home/u1166466/BRAT/u1166466/Example /home/u1166466/BRAT/u1166466/Project_1
Example
Project_1


Create a folder for your project. Let's name it "Project_pad".

In [8]:
%%bash  -s "$unid"
mkdir   ~/BRAT/$1/Project_pad 
echo ~/BRAT/$1/*
ls   ~/BRAT/$1/ 

/home/u1166466/BRAT/u1166466/Example /home/u1166466/BRAT/u1166466/Project_1 /home/u1166466/BRAT/u1166466/Project_pad
Example
Project_1
Project_pad


The project folder is created, so now we can write our files. Each ".txt" file will contain the note text from the "text" field, and its name will combine subject_id with the row index from the data frame, which gives each file a unique name. The ".ann" files are blank at first.

In [9]:
path = "/home/" + str(unid) + "/BRAT/" + str(unid) + "/Project_pad"
# Preview the file names before writing anything to disk
for index, row in docs_text.iterrows():
    new_file_path_txt = path + "/" + str(row["subject_id"]) + "_" + str(index) + ".txt"
    new_file_path_ann = path + "/" + str(row["subject_id"]) + "_" + str(index) + ".ann"
    print(new_file_path_txt)
    print(new_file_path_ann)
    

/home/u1166466/BRAT/u1166466/Project_pad/15011_0.txt
/home/u1166466/BRAT/u1166466/Project_pad/15011_0.ann
/home/u1166466/BRAT/u1166466/Project_pad/23097_1.txt
/home/u1166466/BRAT/u1166466/Project_pad/23097_1.ann
/home/u1166466/BRAT/u1166466/Project_pad/21380_2.txt
/home/u1166466/BRAT/u1166466/Project_pad/21380_2.ann
/home/u1166466/BRAT/u1166466/Project_pad/18600_3.txt
/home/u1166466/BRAT/u1166466/Project_pad/18600_3.ann
/home/u1166466/BRAT/u1166466/Project_pad/25879_4.txt
/home/u1166466/BRAT/u1166466/Project_pad/25879_4.ann
/home/u1166466/BRAT/u1166466/Project_pad/21223_5.txt
/home/u1166466/BRAT/u1166466/Project_pad/21223_5.ann
/home/u1166466/BRAT/u1166466/Project_pad/6809_6.txt
/home/u1166466/BRAT/u1166466/Project_pad/6809_6.ann
/home/u1166466/BRAT/u1166466/Project_pad/6677_7.txt
/home/u1166466/BRAT/u1166466/Project_pad/6677_7.ann
/home/u1166466/BRAT/u1166466/Project_pad/12272_8.txt
/home/u1166466/BRAT/u1166466/Project_pad/12272_8.ann
/home/u1166466/BRAT/u1166466/Project_pad/1795_9.tx

In [10]:
for index, row in docs_text.iterrows():
    # File name stem: <subject_id>_<row index>, unique within this data frame
    base_name = path + "/" + str(row["subject_id"]) + "_" + str(index)
    # Write the note text to the .txt file
    with open(base_name + ".txt", "w") as f:
        f.write(row["text"])
    # Create an empty .ann file for brat to fill with annotations
    with open(base_name + ".ann", "w") as f:
        f.write("")
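
The same step can also be written with `pathlib`, which handles path joining and file closing for you. The sketch below is one possible variant, not the notebook's own method: the `write_brat_files` helper and the two-row `demo_docs` frame are hypothetical, and the files go to a temporary directory so nothing in ~/BRAT is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

def write_brat_files(docs, project_dir):
    """Write one .txt and one empty .ann per row; return the .txt paths."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, row in docs.iterrows():
        # File name stem: <subject_id>_<row index>
        stem = f"{row['subject_id']}_{index}"
        txt_path = project_dir / f"{stem}.txt"
        txt_path.write_text(row["text"], encoding="utf-8")
        # An empty .ann file is enough for brat to start annotating.
        (project_dir / f"{stem}.ann").touch()
        written.append(txt_path)
    return written

# Made-up documents standing in for docs_text.
demo_docs = pd.DataFrame({
    "subject_id": [15011, 23097],
    "text": ["first note\n", "second note\n"],
})
demo_dir = Path(tempfile.mkdtemp())
txt_files = write_brat_files(demo_docs, demo_dir)

# Every .txt file should have a matching .ann file next to it.
missing_ann = [p for p in txt_files if not p.with_suffix(".ann").exists()]
print(sorted(p.name for p in demo_dir.iterdir()), missing_ann)
```

Returning the written paths makes it straightforward to confirm that each text file has its annotation file before opening the project in brat.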

Check to make sure that the files got written to the correct folder.

In [11]:
%%bash  -s "$unid"  
ls   ~/BRAT/$1/Project_pad/*

/home/u1166466/BRAT/u1166466/Project_pad/10083_67.ann
/home/u1166466/BRAT/u1166466/Project_pad/10083_67.txt
/home/u1166466/BRAT/u1166466/Project_pad/10594_30.ann
/home/u1166466/BRAT/u1166466/Project_pad/10594_30.txt
/home/u1166466/BRAT/u1166466/Project_pad/10594_47.ann
/home/u1166466/BRAT/u1166466/Project_pad/10594_47.txt
/home/u1166466/BRAT/u1166466/Project_pad/12272_8.ann
/home/u1166466/BRAT/u1166466/Project_pad/12272_8.txt
/home/u1166466/BRAT/u1166466/Project_pad/12403_18.ann
/home/u1166466/BRAT/u1166466/Project_pad/12403_18.txt
/home/u1166466/BRAT/u1166466/Project_pad/12403_19.ann
/home/u1166466/BRAT/u1166466/Project_pad/12403_19.txt
/home/u1166466/BRAT/u1166466/Project_pad/12573_45.ann
/home/u1166466/BRAT/u1166466/Project_pad/12573_45.txt
/home/u1166466/BRAT/u1166466/Project_pad/1266_34.ann
/home/u1166466/BRAT/u1166466/Project_pad/1266_34.txt
/home/u1166466/BRAT/u1166466/Project_pad/1266_43.ann
/home/u1166466/BRAT/u1166466/Project_pad/1266_43.txt
/home/u1166466/BRAT/u1166466/Proje

Now you have files to annotate.