![title](in.jpg)
# Innoplexus Online Hiring Hackathon: Saving lives with AI

## Problem Statement

Clinical studies often require detailed patient information documented in clinical narratives. **Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that extracts entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives to support clinical and translational research.** Because clinical notes contain rich, detailed medical information, they are increasingly analyzed to extract information for clinical research and other healthcare operations.


In this challenge, hackers are invited to extract all disease names from the 20000 paragraphs/documents in the test set, given the labelled entities (diseases) for the 30000 documents in the train set.

For example, here is a sentence from a clinical report:

*We compared the inter-day reproducibility of post-occlusive **reactive hyperemia** (PORH) assessed by single-point laser Doppler flowmetry (LDF) and laser speckle contrast analysis (LSCI).*


In the sentence given, **reactive hyperemia (in bold)** is the named entity with the type disease/indication.

 

## Data Description
The train file has the following structure:
 
|Variable|Definition|
|---|---|
|id|Unique ID for a token/word|
|Doc_ID|Unique ID for a Document/Paragraph|
|Sent_ID|Unique ID for a Sentence|
|Word|Exact word/token|
|tag (Target)|Named Entity Tag|

The target 'tag' follows the **Inside-outside-beginning (IOB)** tagging format, a common scheme for labelling tokens in named-entity recognition.

- **The B-indications (beginning) tag indicates that the token is the beginning of a disease entity (disease name in this case)**
- **An I-indications (inside) tag indicates that the token is inside an entity**
- **An O (outside) tag indicates that a token is outside a disease entity**
 
**Example**

For more clarity, consider the same sample in the tabular format of the train file, where each row corresponds to a word/token.

The disease **'reactive hyperemia'** is labelled using **'B-indications'** for the word **'reactive'** and **'I-indications'** for the word **'hyperemia'**. All the other words that are outside **'reactive hyperemia'** are labelled with **'O'**.
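
The labelling described above can be sketched in a few lines of Python. This is only an illustration: the helper `iob_tags`, the token list, and the span indices are assumptions made for the example, and the tokenization in the actual train file may differ.

```python
def iob_tags(tokens, entity_spans, label="indications"):
    """Return one IOB tag per token.

    entity_spans holds (start, end) token indices, end exclusive.
    This helper is hypothetical and not part of the competition kit.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# A fragment of the example sentence, split by hand for illustration
tokens = ["reproducibility", "of", "post-occlusive",
          "reactive", "hyperemia", "(", "PORH", ")"]
spans = [(3, 5)]  # "reactive hyperemia"

for token, tag in zip(tokens, iob_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

Each printed row corresponds to one word/token, mirroring the `Word` and `tag` columns of the train file.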


## Evaluation Metric

The evaluation for this contest is based on a modified F1 score, explained below.
Suppose the ground truth has the following entities (marked with square brackets) for the given sentence:

**[Malaria] and [Yellow Fever] remain more deadly than [Hepatitis B] today**

This has 3 entities.
Suppose the prediction is the following:

**[Malaria] [and] [Yellow] Fever remain more deadly than Hepatitis B [today]**

We have an exact match for "Malaria", false positives for "and" and "today", a false negative for "Hepatitis B", and a substring match for "Yellow". We compute precision and recall by first defining matching criteria. The metric rewards partial matches as well as exact entity matches.

True positives are of two types, exact match and partial match, with a weight of 1 for an exact match and 0.5 for a partial match. The computations are as follows:

Exact Match = 1 ("Malaria"), Partial Match = 1 ("Yellow", which overlaps "Yellow Fever"), False Positives = 2 ("and" and "today"), False Negatives = 1 ("Hepatitis B")
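
The counts above can be reproduced with a small sketch. It assumes one plausible reading of the matching rule: a predicted span identical to a gold span is an exact match, a predicted span that overlaps a gold span is a partial match, any other prediction is a false positive, and a gold span touched by no prediction is a false negative. The helper `count_matches` is hypothetical, and the official script linked further down may resolve edge cases differently.

```python
def count_matches(gold, pred):
    """Count exact, partial, false-positive and false-negative entities.

    gold and pred are lists of (start, end) token spans, end exclusive.
    Hypothetical helper; not the official evaluation script.
    """
    exact = partial = false_pos = 0
    matched_gold = set()
    for p in pred:
        if p in gold:
            exact += 1
            matched_gold.add(p)
            continue
        # Two spans overlap when each one starts before the other ends
        overlaps = [g for g in gold if p[0] < g[1] and g[0] < p[1]]
        if overlaps:
            partial += 1
            matched_gold.update(overlaps)
        else:
            false_pos += 1
    false_neg = len(set(gold) - matched_gold)
    return exact, partial, false_pos, false_neg


# Token indices: Malaria(0) and(1) Yellow(2) Fever(3) remain(4) more(5)
#                deadly(6) than(7) Hepatitis(8) B(9) today(10)
gold = [(0, 1), (2, 4), (8, 10)]           # Malaria, Yellow Fever, Hepatitis B
pred = [(0, 1), (1, 2), (2, 3), (10, 11)]  # Malaria, and, Yellow, today

print(count_matches(gold, pred))  # (1, 1, 2, 1)
```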

**Precision** = (Exact Match + 0.5 * Partial Match) / (Exact Match + Partial Match + False Positives) = (1 + 0.5)/(1+1+2) = 0.375

**Recall** = (Exact Match + 0.5 * Partial Match) / (Exact Match + Partial Match + False Negatives) = (1 + 0.5)/(1+1+1) = 0.50

**F1 Score** = (2 * Precision * Recall)/(Precision + Recall) = (2 * 0.375 * 0.50)/(0.375 + 0.50) ≈ 0.429


The counts of exact matches, partial matches, false positives and false negatives are summed across all sentences in the test set, and the overall F1 Score computed from these totals is the leaderboard score.
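
As a check on the arithmetic, here is a minimal sketch of the weighted precision, recall and F1 formulas applied to counts summed over sentences. The function name `modified_f1` and the list `sentence_counts` are assumptions made for illustration, and this is not the official script linked below.

```python
def modified_f1(exact, partial, false_pos, false_neg, partial_weight=0.5):
    """Weighted precision, recall and F1 from summed match counts."""
    weighted_tp = exact + partial_weight * partial
    predicted = exact + partial + false_pos   # precision denominator
    actual = exact + partial + false_neg      # recall denominator
    precision = weighted_tp / predicted if predicted else 0.0
    recall = weighted_tp / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# One (exact, partial, false_pos, false_neg) tuple per sentence;
# only the worked example is included here
sentence_counts = [(1, 1, 2, 1)]

# Sum each count across sentences before computing the score
totals = [sum(column) for column in zip(*sentence_counts)]
p, r, f = modified_f1(*totals)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.3750 0.5000 0.4286
```

Because the counts are summed before the ratios are taken, the leaderboard score is a micro-average: sentences with many entities weigh more than sentences with few.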

Please find the script for the evaluation metric implemented in Python at this [link](https://gist.github.com/frenzy2106/3a12b7fefeb33941edea45d881d6f81a) 

In [1]:
import numpy as np
import pandas as pd

In [2]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
s=pd.read_csv('sample_submission.csv')
train.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag
0,1,1,1,Obesity,O
1,2,1,1,in,O
2,3,1,1,Low-,O
3,4,1,1,and,O
4,5,1,1,Middle-Income,O


In [3]:
test.head()

Unnamed: 0,id,Doc_ID,Sent_ID,Word
0,4543834,30001,191283,CCCVA
1,4543835,30001,191283,","
2,4543836,30001,191283,MANOVA
3,4543837,30001,191283,","
4,4543838,30001,191283,my


In [4]:
s.head()

Unnamed: 0,id,Sent_ID,tag
0,4543834,191283,O
1,4543835,191283,O
2,4543836,191283,O
3,4543837,191283,O
4,4543838,191283,O


In [None]:
train['tag'].value_counts()

In [5]:
train.dropna(inplace=True)
print(train.shape)

(4543703, 5)


In [6]:
# df=train[train['Doc_ID']<=20000]
df=train.copy()

In [9]:
df.tag.value_counts()

O                4446076
B-indications      53003
I-indications      44624
Name: tag, dtype: int64

In [11]:
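# pandas reads the literal string 'NA' as missing by default;
# presumably these rows held the token 'NA', so restore it
test['Word']=test['Word'].fillna('NA')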

In [12]:
from sklearn.metrics import classification_report

In [13]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter

### Cleaning

In [14]:
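import re
def clean(v):
    """Strip punctuation from tokens that end in '.' or contain ':', "'" or ','."""
    v=str(v)
    # Single-character tokens (e.g. a lone '.') are left untouched
    if len(v)!=1 and (v[-1]=='.' or ':' in v or "'" in v or ',' in v):
        r=re.sub(r'[^\w\s]','',v)
    else:
        r=v
    # A token made only of punctuation would become empty; map it to ','
    if r=='':
        return ','
    return r

df['Word']=df['Word'].apply(clean)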


In [15]:
test['Word']=test['Word'].apply(clean)
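
To make the effect of `clean` concrete, the sketch below restates the function so that the snippet runs on its own and applies it to a few hand-picked tokens (illustrative inputs, not rows from the dataset). A plausible motivation, not stated in the notebook, is to stop `word_tokenize` from splitting tokens such as `patient's` or a word with a trailing period, which would break the one-row-per-token alignment checked in the next section.

```python
import re


def clean(v):
    """Same logic as the notebook's clean(), restated for a standalone demo."""
    v = str(v)
    if len(v) != 1 and (v[-1] == '.' or ':' in v or "'" in v or ',' in v):
        r = re.sub(r'[^\w\s]', '', v)
    else:
        r = v
    return ',' if r == '' else r


samples = ["cells.", "patient's", "1,000", "3:1", "Low-", ",", "..."]
for token in samples:
    print(repr(token), '->', repr(clean(token)))
# 'cells.'    -> 'cells'
# "patient's" -> 'patients'
# '1,000'     -> '1000'
# '3:1'       -> '31'
# 'Low-'      -> 'Low-'
# ','         -> ','
# '...'       -> ','
```

Note the side effects: distinct surface forms can collapse into the same token (`1,000` and `1000`), and hyphens are kept unless the token also triggers one of the other conditions.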

In [16]:
df.shape

(4543703, 5)

**Grouping by Sent_ID and checking that the tokens produced by word_tokenize are consistent with the training rows**

**Extracting POS tags from the tokens using nltk.pos_tag**

Train

In [17]:
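from nltk import word_tokenize, pos_tag, ne_chunk
# Rebuild each sentence by joining its tokens with single spaces
agg_func = lambda s: " ".join(s['Word'].values.tolist())
sentids=df.groupby(['Sent_ID']).apply(agg_func)
post=[]
for lll in sentids:
    words = word_tokenize(lll)
    # Abort if NLTK re-tokenizes the sentence into a different number of tokens,
    # since the POS tags would then no longer line up with the rows of df
    if len(lll.split(" "))!=len(words):
        print(lll)
        print(words)
        raise ValueError('token count mismatch after word_tokenize')
    post.extend([q[1] for q in pos_tag(words)])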
    


In [18]:
len(post)

4543703
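
The matching length only shows that there is one tag per row. The tags land on the right rows only if the flattened per-sentence order equals the row order of `df`. Since pandas `groupby` sorts by key by default, this holds when `Sent_ID` is non-decreasing down the file. A minimal pure-Python sketch of the same idea follows, assuming the rows of each sentence are contiguous. `tag_by_sentence` and `toy_tagger` are hypothetical names. Because `nltk.pos_tag` accepts a list of tokens, it could be passed as `tagger` to tag the existing tokens directly without re-tokenizing.

```python
from itertools import groupby


def tag_by_sentence(rows, tagger):
    """Tag each sentence separately and return one tag per input row.

    rows is an iterable of (sent_id, word) pairs in file order;
    tagger maps a list of words to a list of (word, tag) pairs.
    """
    tags = []
    for sent_id, group in groupby(rows, key=lambda row: row[0]):
        words = [word for _, word in group]
        tagged = tagger(words)
        if len(tagged) != len(words):
            raise ValueError(f"tag count mismatch in sentence {sent_id}")
        tags.extend(tag for _, tag in tagged)
    return tags


def toy_tagger(words):
    # Stand-in with the same return shape as nltk.pos_tag; not a real tagger
    return [(w, "NUM" if w.isdigit() else "WORD") for w in words]


rows = [(1, "Obesity"), (1, "in"), (2, "42"), (2, "patients")]
print(tag_by_sentence(rows, toy_tagger))  # ['WORD', 'WORD', 'NUM', 'WORD']
```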

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk

Test

In [None]:

test=pd.read_csv('test.csv')
test['Word']=test['Word'].fillna('NA')
test['Word']=test['Word'].apply(clean)

In [None]:

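from nltk import word_tokenize, pos_tag, ne_chunk
pos=[]
tsentids=test.groupby(['Sent_ID']).apply(agg_func)
for ll in tsentids:
    # Unlike the train loop, no per-sentence token-count check is made here;
    # the total length is compared with test.shape in the next cell
    pos.extend([q[1] for q in pos_tag(word_tokenize(ll))])
print(len(pos))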

In [None]:
print(len(pos),test.shape)
import gc
gc.collect()

In [None]:
print(len(pos),pos)

In [None]:
df['pos']=post
test['pos']=pos

In [None]:
# post
df.head()

**Generating conlltags with the help of nltk.chunk and making another feature with IOB tagging as 'O' / 'I' / 'B'**

Train

In [None]:

from nltk.chunk import conlltags2tree, tree2conlltags
from nltk import word_tokenize, pos_tag, ne_chunk

pos_iob=[]
for lx in sentids:
    words = word_tokenize(lx)
    if len(lx.split(" "))!=len(words):
        print(lx)
        print(words)
        raise ValueError('token count mismatch after word_tokenize')
    # tree2conlltags yields (word, pos, chunk_tag) triples;
    # keep only the first letter of chunk_tag: 'B', 'I' or 'O'
    pos_iob.extend([q[2][0] for q in tree2conlltags(ne_chunk(pos_tag(words)))])

In [None]:
len(pos_iob)
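
The expression `q[2][0]` in the loop above keeps only the first character of each chunk tag. The sketch below shows that step on hand-written triples in the `(word, pos, chunk_tag)` shape returned by `tree2conlltags`. The triples are illustrative, not actual NLTK output, so the snippet needs no NLTK models.

```python
# Hand-written triples in the (word, pos, chunk_tag) shape; illustrative only
conll_triples = [
    ("Mark", "NNP", "B-PERSON"),
    ("Smith", "NNP", "I-PERSON"),
    ("visited", "VBD", "O"),
    ("Paris", "NNP", "B-GPE"),
]

# Dropping the entity type leaves a three-valued feature: 'B', 'I' or 'O'
tree_iob = [chunk_tag[0] for _, _, chunk_tag in conll_triples]
print(tree_iob)  # ['B', 'I', 'O', 'B']
```

Note that NLTK's general-purpose chunker marks entities such as persons, organizations and places rather than diseases, so this feature is a weak signal about entity boundaries and not a disease label in itself.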

Test

In [None]:

from nltk.chunk import conlltags2tree, tree2conlltags
from nltk import word_tokenize, pos_tag, ne_chunk

post_iob=[]
for lxx in tsentids:
    words = word_tokenize(lxx)
    if len(lxx.split(" "))!=len(words):
        print(lxx)
        print(words)
        raise ValueError('token count mismatch after word_tokenize')
    # Same reduction as for train: first letter of the chunk tag only
    post_iob.extend([q[2][0] for q in tree2conlltags(ne_chunk(pos_tag(words)))])

### Saving data

In [None]:
df['tree_iob']=pos_iob
test['tree_iob']=post_iob
df['tree_iob'].to_csv('train_treeiob.csv',index=False)
test['tree_iob'].to_csv('test_treeiob.csv',index=False)

In [None]:
test['tree_iob'].value_counts()

In [None]:
# X
df['pos'].to_csv('train_pos.csv',index=False)
test['pos'].to_csv('test_pos.csv',index=False)
# np.save('sent_X.npy',X)