## Document selection for manual annotation

Manual annotation is the process of marking up documents with annotations that can be used to train or test a natural language processing system. We are using the brat annotation tool to collect annotations for your final project.
Setting up a project in brat requires a set of files:
- *.txt files: text files to be annotated
- *.ann files: annotation files that will store the annotations
- annotation.conf: annotation type configuration
- visual.conf: annotation display configuration
- tools.conf: annotation tool configuration
- kb_shortcuts.conf: keyboard shortcut tool configuration

Each annotation project typically defines its own annotation.conf. Defining visual.conf, tools.conf and kb_shortcuts.conf is not necessary, and the system falls back on simple default visuals, tools and shortcuts if these files are not present. 
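
As a rough illustration of the one configuration file a project normally defines, the sketch below writes a minimal annotation.conf into a throwaway directory. The four section headers follow brat's configuration format; the entity types "Symptom" and "Medication" and the helper name `write_minimal_annotation_conf` are hypothetical placeholders, not a schema defined for this project.

```python
import tempfile
from pathlib import Path

# Hypothetical entity types, for illustration only; replace with your own schema.
MINIMAL_ANNOTATION_CONF = """\
[entities]
Symptom
Medication

[relations]

[events]

[attributes]
"""


def write_minimal_annotation_conf(project_dir):
    """Write annotation.conf into project_dir and return its path."""
    conf_path = Path(project_dir) / "annotation.conf"
    conf_path.write_text(MINIMAL_ANNOTATION_CONF, encoding="utf-8")
    return conf_path


# Demonstrate in a temporary directory rather than a real brat project folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    conf_file = write_minimal_annotation_conf(tmp_dir)
    conf_text = conf_file.read_text(encoding="utf-8")
    print(conf_text)
```

In a real project, the same file would be placed in the project folder next to the .txt and .ann files.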

First, we need to select the files to be annotated. For our project, we will draw documents from the MIMIC database.

Let's connect to the database (remember, the password to the database is the same as the user name):


In [1]:
import pymysql
import pandas as pd
import getpass
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),db='mimic2')

Enter MySQL passwd for jovyan········


We have connected to the database. Let's see the tables that we have to work with.

In [None]:
pd.read_sql("SELECT table_name, table_rows FROM information_schema.tables where table_schema='mimic2'", conn)

There are quite a few tables to explore, and you should explore them for your project. The documents are in the "noteevents" table. Let's see what columns this table has.

In [None]:
pd.read_sql("SELECT column_name, is_nullable, column_type FROM information_schema.columns WHERE table_name = 'noteevents'",conn)


Note that only two fields are not nullable, which means they cannot be blank, so they have the potential to serve as the primary key. Let's take a look at what the data looks like:

In [None]:
pd.read_sql("SELECT subject_id, charttime, text  from noteevents where text is not null limit 10",conn)


The table does not have a document ID, but it has "subject_id" as a patient identifier. This is a limitation of the demo dataset: the NOTEID column exists in the full MIMIC database, but not in the current demo version. We will have to come up with a unique name for each document.
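
To make the naming problem concrete, here is a minimal sketch using made-up rows rather than real MIMIC data. It checks whether a candidate key repeats, then builds names of the form `<subject_id>_<row position>`, which is the scheme used later in this notebook. The helper names `has_duplicates` and `document_names` are hypothetical.

```python
from collections import Counter

# Made-up (subject_id, charttime) pairs; the first two rows collide on purpose.
rows = [
    (1, "t1"),
    (1, "t1"),
    (2, "t2"),
]


def has_duplicates(keys):
    """Return True if any key occurs more than once."""
    return any(count > 1 for count in Counter(keys).values())


def document_names(subject_ids):
    """Build one name per row as '<subject_id>_<row position>'."""
    return [f"{subject_id}_{position}"
            for position, subject_id in enumerate(subject_ids)]


subject_ids = [subject_id for subject_id, _ in rows]
names = document_names(subject_ids)

print(has_duplicates(rows))   # the raw pairs repeat
print(names)
print(has_duplicates(names))  # the row position makes every name distinct
```

Because the row position is different for every row, the combined name is unique within one query result even when the same patient appears more than once.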
 
### Select documents from the database

For this demo, we will select 10 random documents from the "noteevents" table that contain a keyword that we are interested in.

In [2]:
docs_text = pd.read_sql("SELECT subject_id, text from noteevents   where text like '% fever %' order by rand() limit  10",conn)
docs_text

Unnamed: 0,subject_id,text
0,3974,\n\n\n DATE: [**2534-6-13**] 8:26 AM\n ...
1,10130,\n\n\n DATE: [**2528-1-31**] 4:17 PM\n ...
2,23395,\n\n\n DATE: [**3208-5-15**] 3:09 PM\n ...
3,10424,\n \n \n \nAdmission Date: [**3268-7-6**] ...
4,6112,\n\n\n DATE: [**2802-3-18**] 2:01 PM\n ...
5,17167,\n\n\n DATE: [**3025-6-12**] 5:59 AM\n ...
6,21280,\n\n\n DATE: [**2503-11-26**] 4:20 PM\n ...
7,8396,\n \n \n \nAdmission Date: [**2838-5-27**] ...
8,10912,\n\n\n DATE: [**3292-10-9**] 10:12 AM\n ...
9,16963,\nNursing Progress Note\nNo siginificant event...
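
The pattern '% fever %' matches the keyword only when it has a space on both sides, so notes where it is followed by punctuation or a line break are not selected. The sketch below contrasts that space-delimited test with a word-boundary regular expression in Python, using made-up sentences. The helper names `space_delimited_match` and `mentions_keyword` are hypothetical, and this is one possible refinement rather than part of the query above.

```python
import re


def space_delimited_match(text, keyword):
    """Mimic LIKE '% keyword %': the keyword needs a space on each side."""
    return f" {keyword} " in text


def mentions_keyword(text, keyword):
    """Match the keyword as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up example sentences, not taken from the database.
notes = [
    "Patient has fever and chills.",
    "No fever.",
    "Fever\nresolved overnight.",
    "Feverish appearance noted.",
]

for note in notes:
    print(space_delimited_match(note, "fever"),
          mentions_keyword(note, "fever"),
          repr(note))
```

The word-boundary version picks up "No fever." and a note that starts with "Fever", while still skipping "Feverish".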


Iterate through the documents to see what they look like.

In [None]:
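for index, row in docs_text.iterrows():
    # Use column names: positional row[0] on a labelled row is deprecated in recent pandas
    print(index, row["subject_id"], row["text"])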

### Writing documents into files

To keep your documents separate from everyone else's, enter your UNID.

In [3]:
unid = 'u0384041'

Check the folder just to see that you have prepared your workspace.

In [None]:
%%bash  -s "$unid"
echo ~/BRAT/$1/*
ls   ~/BRAT/$1/ 

Create a folder for your project. Let's name it "Project_1".

In [4]:
%%bash  -s "$unid"
mkdir   ~/BRAT/$1/Project_1 
echo ~/BRAT/$1/*
ls   ~/BRAT/$1/ 

/home/u0384041/BRAT/u0384041/Example /home/u0384041/BRAT/u0384041/Project_1
Example
Project_1


mkdir: cannot create directory ‘/home/u0384041/BRAT/u0384041/Project_1’: File exists


The project folder is in place (the "File exists" message above only means that it had already been created on an earlier run), so now we can write our files. Each ".txt" file will contain the note text from the "text" field, and its file name will combine subject_id with the row index from the data frame, which gives every file a unique name. The ".ann" files are empty at first.

In [5]:
path = "/home/"+str(unid)+"/BRAT/"+str(unid)+"/Project_1"
for index, row in docs_text.iterrows():
    # File name stem: <subject_id>_<row index>
    stem = path + "/" + str(row["subject_id"]) + "_" + str(index)
    new_file_path_txt = stem + ".txt"
    new_file_path_ann = stem + ".ann"
    print(new_file_path_txt)
    print(new_file_path_ann)
    

/home/u0384041/BRAT/u0384041/Project_1/3974_0.txt
/home/u0384041/BRAT/u0384041/Project_1/3974_0.ann
/home/u0384041/BRAT/u0384041/Project_1/10130_1.txt
/home/u0384041/BRAT/u0384041/Project_1/10130_1.ann
/home/u0384041/BRAT/u0384041/Project_1/23395_2.txt
/home/u0384041/BRAT/u0384041/Project_1/23395_2.ann
/home/u0384041/BRAT/u0384041/Project_1/10424_3.txt
/home/u0384041/BRAT/u0384041/Project_1/10424_3.ann
/home/u0384041/BRAT/u0384041/Project_1/6112_4.txt
/home/u0384041/BRAT/u0384041/Project_1/6112_4.ann
/home/u0384041/BRAT/u0384041/Project_1/17167_5.txt
/home/u0384041/BRAT/u0384041/Project_1/17167_5.ann
/home/u0384041/BRAT/u0384041/Project_1/21280_6.txt
/home/u0384041/BRAT/u0384041/Project_1/21280_6.ann
/home/u0384041/BRAT/u0384041/Project_1/8396_7.txt
/home/u0384041/BRAT/u0384041/Project_1/8396_7.ann
/home/u0384041/BRAT/u0384041/Project_1/10912_8.txt
/home/u0384041/BRAT/u0384041/Project_1/10912_8.ann
/home/u0384041/BRAT/u0384041/Project_1/16963_9.txt
/home/u0384041/BRAT/u0384041/Project_

In [6]:
for index, row in docs_text.iterrows():
    stem = path + "/" + str(row["subject_id"]) + "_" + str(index)
    new_file_path_txt = stem + ".txt"
    new_file_path_ann = stem + ".ann"
    # "with" closes each file even if a write fails
    with open(new_file_path_txt, "w") as f:
        f.write(row["text"])
    # Create the matching annotation file, empty for now
    with open(new_file_path_ann, "w") as f:
        f.write("")

Check to make sure that the files got written to the correct folder.

In [7]:
%%bash  -s "$unid"  
ls   ~/BRAT/$1/Project_1/*

/home/u0384041/BRAT/u0384041/Project_1/10130_1.ann
/home/u0384041/BRAT/u0384041/Project_1/10130_1.txt
/home/u0384041/BRAT/u0384041/Project_1/10424_3.ann
/home/u0384041/BRAT/u0384041/Project_1/10424_3.txt
/home/u0384041/BRAT/u0384041/Project_1/10912_8.ann
/home/u0384041/BRAT/u0384041/Project_1/10912_8.txt
/home/u0384041/BRAT/u0384041/Project_1/14574_1.ann
/home/u0384041/BRAT/u0384041/Project_1/14574_1.txt
/home/u0384041/BRAT/u0384041/Project_1/16963_9.ann
/home/u0384041/BRAT/u0384041/Project_1/16963_9.txt
/home/u0384041/BRAT/u0384041/Project_1/17167_5.ann
/home/u0384041/BRAT/u0384041/Project_1/17167_5.txt
/home/u0384041/BRAT/u0384041/Project_1/18671_6.ann
/home/u0384041/BRAT/u0384041/Project_1/18671_6.txt
/home/u0384041/BRAT/u0384041/Project_1/2014_2.ann
/home/u0384041/BRAT/u0384041/Project_1/2014_2.txt
/home/u0384041/BRAT/u0384041/Project_1/21280_6.ann
/home/u0384041/BRAT/u0384041/Project_1/21280_6.txt
/home/u0384041/BRAT/u0384041/Project_1/21553_0.ann
/home/u0384041/BRAT/u0384041/Proj

Now you have files to annotate.
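
The write-and-check steps above can also be sketched as two small helpers. This is a minimal illustration that runs in a temporary directory with made-up note text; the names `write_brat_pair` and `unpaired_stems` are hypothetical and are not part of brat.

```python
import tempfile
from pathlib import Path


def write_brat_pair(project_dir, stem, text):
    """Write <stem>.txt with the note text and an empty <stem>.ann beside it."""
    project_dir = Path(project_dir)
    txt_path = project_dir / f"{stem}.txt"
    ann_path = project_dir / f"{stem}.ann"
    txt_path.write_text(text, encoding="utf-8")
    ann_path.touch()  # creates an empty file; an existing .ann is left as it is
    return txt_path, ann_path


def unpaired_stems(project_dir):
    """Return stems that have a .txt without a .ann, or the reverse."""
    project_dir = Path(project_dir)
    txt_stems = {p.stem for p in project_dir.glob("*.txt")}
    ann_stems = {p.stem for p in project_dir.glob("*.ann")}
    return sorted(txt_stems ^ ann_stems)


with tempfile.TemporaryDirectory() as tmp_dir:
    write_brat_pair(tmp_dir, "1_0", "Example note text.")
    write_brat_pair(tmp_dir, "2_1", "Another example note.")
    # A stray text file with no annotation file, to show what the check reports.
    (Path(tmp_dir) / "3_2.txt").write_text("Unpaired note.", encoding="utf-8")

    created = sorted(p.name for p in Path(tmp_dir).iterdir())
    missing = unpaired_stems(tmp_dir)
    print(created)
    print(missing)
```

One design choice to note: the sketch uses `touch()` for the .ann file, so re-running it does not wipe annotations that already exist, whereas opening the file in "w" mode empties it each time.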