<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Predicting-Entities" data-toc-modified-id="Predicting-Entities-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Predicting Entities<br></a></span></li><li><span><a href="#Set-up" data-toc-modified-id="Set-up-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Set-up</a></span><ul class="toc-item"><li><span><a href="#Import-necessary-packages" data-toc-modified-id="Import-necessary-packages-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import necessary packages</a></span></li><li><span><a href="#Helper-functions" data-toc-modified-id="Helper-functions-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Helper functions</a></span></li><li><span><a href="#Import-trained-model,-word-embedding,-and-data-to-build-validation-data-of-off" data-toc-modified-id="Import-trained-model,-word-embedding,-and-data-to-build-validation-data-of-off-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Import trained model, word embedding, and data to build validation data of off</a></span><ul class="toc-item"><li><span><a href="#Trained-NER-model" data-toc-modified-id="Trained-NER-model-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Trained NER model</a></span></li><li><span><a href="#Word-embedding-model" data-toc-modified-id="Word-embedding-model-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Word embedding model</a></span></li><li><span><a href="#Metadata" data-toc-modified-id="Metadata-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Metadata</a></span></li></ul></li></ul></li><li><span><a href="#Predict-metadata-for-each-class" data-toc-modified-id="Predict-metadata-for-each-class-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Predict metadata for each class</a></span></li></ul></div>

# Predicting Entities<br>
Adam Klie<br>
11/02/2019<br>
Script to predict entities from a trained model

# Set-up

## Import necessary packages

In [1]:
# Data processing
import numpy as np
import pandas as pd
from sklearn import preprocessing


# Data visualization
from tqdm import tqdm
import matplotlib
import seaborn as sns

# NLP
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from string import punctuation

# Neural nets
from keras.models import load_model

Using TensorFlow backend.



## Helper functions

In [2]:
# Map each token to its row index in the word-embedding matrix (0 if the token has no vector),
# truncating each document to max_length tokens and zero-padding shorter ones
def get_features(docs, max_length):
    docs = list(docs)
    Xs = np.zeros((len(docs), max_length), dtype='int32')
    for i, doc in enumerate(docs):
        for j, token in enumerate(doc):
            if j >= max_length:
                break
            vector_id = token.vocab.vectors.find(key=token.orth)
            if vector_id >= 0:  # a negative id means the token has no vector
                Xs[i, j] = vector_id
    return Xs
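
To see the shape of what `get_features` returns without loading spaCy, here is a minimal sketch that uses a hypothetical lookup table (`toy_vocab`) in place of the embedding's vector table. The names `toy_vocab` and `toy_features` are illustrative only and are not part of the pipeline.

```python
import numpy as np

# Hypothetical vocabulary: token text -> row in an embedding matrix.
# Row 0 doubles as the "unknown / padding" id, mirroring get_features.
toy_vocab = {"<pad>": 0, "liver": 1, "mouse": 2, "tissue": 3}

def toy_features(token_lists, max_length):
    """Map tokenized texts to a padded (n_docs, max_length) matrix of row ids."""
    X = np.zeros((len(token_lists), max_length), dtype="int32")
    for i, tokens in enumerate(token_lists):
        # Truncate to max_length; shorter documents keep trailing zeros
        for j, token in enumerate(tokens[:max_length]):
            X[i, j] = toy_vocab.get(token, 0)  # unknown tokens stay 0
    return X

toy_X = toy_features(
    [["mouse", "liver"], ["unknown", "tissue", "mouse", "liver"]],
    max_length=3,
)
print(toy_X)
```

The first row is padded with a trailing zero, and the second row is truncated to three tokens with the unknown token mapped to 0. Because 0 is used both for padding and for missing tokens, a count such as `(X != 0).sum(axis=1)` gives the number of tokens found in the vocabulary, provided no real token sits at row 0.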

## Import trained model, word embedding, and data to build validation data off of

### Trained NER model

In [3]:
model_iter = '11_class'
model_date = '2021_03_01'
grouping = pd.read_csv('../results/embedding/{model}/0.8_entity_merging.csv'.format(model=model_iter), index_col=0)
groups = grouping[grouping["I"] == 0]["GroupName"].values

In [24]:
le = preprocessing.LabelEncoder()
le.classes_ = np.load('../results/training/{model}/revision/classes.npy'.format(model = model_iter))
model = load_model('../models/revision/{model}_{date}_v4.h5'.format(model = model_iter, date=model_date))

### Word embedding model

In [5]:
nlp = spacy.load('../data/wikipedia-pubmed-and-PMC-w2v')



### Metadata

In [6]:
SRS_dir = "../data/sra/allSRS_05_15_2018.pickle"
allSRS = pd.read_pickle(SRS_dir)

# Predict metadata for each class

In [25]:
trial_num = 10
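
The prediction loop slides a window over the tokens of each sentence to build n-grams, and records the character offset at which each n-gram starts so the text can be recovered later. A small standalone sketch of that bookkeeping, written in plain Python rather than with `nltk.util.ngrams` (the function name `ngrams_with_offsets` and the toy tokens are assumptions for illustration):

```python
def ngrams_with_offsets(tokens, n):
    """Return (word_start, word_end, gram, starting_char_pos) for every n-gram."""
    out = []
    for j in range(len(tokens) - n + 1):
        gram = " ".join(tokens[j:j + n])
        # Text preceding the n-gram in the space-joined sentence,
        # including the separating space when the n-gram is not first
        text_before = " ".join(tokens[:j]) + ("" if j == 0 else " ")
        out.append((j, j + n, gram, len(text_before)))
    return out

toy_tokens = ["adult", "mouse", "liver", "tissue"]
toy_sent = " ".join(toy_tokens)
toy_hits = ngrams_with_offsets(toy_tokens, 2)

for word_start, word_end, gram, start in toy_hits:
    # Slicing the sentence at the stored offset gives back the n-gram
    assert toy_sent[start:start + len(gram)] == gram
    print(word_start, word_end, repr(gram), start)
```

This is the same slice that produces the `recovered_txt` column at the end of the loop: the original text indexed from `starting_char_pos` for `token_len` characters.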

In [26]:
for validation_class in groups:
    validation_class = validation_class.replace(' ', '_')
    validation_class = validation_class.replace('/', '_')
    print(validation_class)

    # Read in validation data for a specific class to predict on
    filename = '../results/validation/{model}/{myclass}_validation_set.pickle'.format(model = model_iter, 
                                                                                      myclass = validation_class)
    validation_data = pd.read_pickle(filename)

    processed_test = validation_data.str.split('[;.,]', expand = True).stack()
    processed_test = processed_test.str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace

    # Predict the empty state to use as baseline probability emission
    val_docs = list(nlp.pipe([' ']))  # nlp.pipe expects an iterable of texts
    val_X = get_features(val_docs, max_length = model.input_shape[1])
    emptyState = model.predict_proba(val_X)[0,:]

    stopWords = set(stopwords.words('english'))
    rows = []
    key_list = []
    for i, (key, sent) in enumerate(tqdm(processed_test.items(), total=len(processed_test))):

        # Sentence preprocessing
        #sent = re.sub(r'[^a-zA-Z0-9]+', ' ', sent)  # remove non alpha numeric characters
        tokens = re.split(pattern = ' ', string = sent)  # tokenize the description
        tokens = list(filter(lambda token:(token!='') and (token not in stopWords), tokens))  # filter out stopwords
        sent = ' '.join(tokens)

        n_gram_max = min([len(tokens), 7])
        for n_gram in range(2, n_gram_max + 1):

            # Get prediction for all current n-grams
            grams = list(map(lambda L:" ".join(L), list(ngrams(tokens, n_gram))))  
            val_docs = list(nlp.pipe(grams))  # get spacy objects for each token passed in
            val_X = get_features(val_docs, max_length = model.input_shape[1])
            predictM = model.predict_proba(val_X)

            # Zero out n-grams whose probabilities lie within 0.01 (summed absolute difference)
            # of the empty-state baseline and that have at least two tokens in the word embedding
            tmp_df = pd.DataFrame(data = predictM, columns = le.classes_, index = grams)
            empty_mask = (tmp_df - emptyState).abs().sum(axis=1) < 0.01
            moreThanTwoValToken_mask = (val_X != 0).sum(axis=1) >= 2
            tmp_df[empty_mask&moreThanTwoValToken_mask] = 0

            # Set up keys for dataframe with probabilities of each n-gram, will be useful later
            for j, gram in enumerate(tmp_df.index):
                i_end = j + n_gram
                textBefore = " ".join(tokens[:j]) + ('' if j==0 else ' ')
                start_char_pos = len(textBefore)
                key_list.append(key + (i, sent, n_gram, j, i_end, gram, start_char_pos)) 
                rows.append(tmp_df.iloc[j])

    proba_df = pd.concat(rows, keys = key_list, axis = 1).T
    proba_df.index.names = ['srs', 'attribute', 'sentence_number', 'kthSrs', 
                            'orig_text', 'n-gram_length', 'word_start', 'word_end', 'token', 
                            'starting_char_pos']

    textS = pd.Series(proba_df.index.get_level_values('orig_text').unique())
    textM = textS.str.count(' ') >= 0
    selectedTexts = textS[textM].values # get the original texts

    n_threshold = 2
    proba_sub = proba_df[(proba_df.index.get_level_values('n-gram_length') >= n_threshold) &
                         (proba_df.index.get_level_values('orig_text').isin(selectedTexts))]

    max_proba = proba_sub.max(axis=1)
    second_proba = proba_sub.quantile(0.999, interpolation='lower', axis = 1)
    scoreMargin_m = (max_proba-second_proba) > 0.1  # proba difference between 1st and 2nd must be greater than 0.1
    m_val = scoreMargin_m & (~proba_sub.index.get_level_values('token').str.contains('[0-9 ]+ [0-9 ]+'))

    tmpDf = pd.DataFrame({'predicted':proba_sub[m_val].idxmax(axis=1),'score':proba_sub[m_val].max(axis=1)})

    # tmpDf was built from proba_sub[m_val], so it is already filtered
    scoreSortedDf = tmpDf.sort_values(['orig_text','word_start','score'], ascending = False).reset_index()

    v = scoreSortedDf.copy()
    scoreSortedDf = scoreSortedDf.assign(OverlapGroup=(len(processed_test)*(v.kthSrs)+ 
                                              (v.word_end - v.word_start.shift(-1)).shift().lt(0).cumsum()))

    hitDf=scoreSortedDf.sort_values(['OverlapGroup','score'],ascending=False).drop_duplicates(['OverlapGroup','predicted']
                                                                                       ).sort_values('orig_text')
    hitDf['token_len']=hitDf['token'].str.len()
    hitDf['recovered_txt']=hitDf.apply(
        lambda tmpS2:tmpS2.loc['orig_text'][tmpS2.loc['starting_char_pos']:(tmpS2.loc['starting_char_pos']+tmpS2.loc['token_len'])],axis=1)

    hitDf.to_pickle('../results/prediction/{model}/revision/trial_{trial}/{trial}_{myclass}_prediction.pickle'.format(model = model_iter, 
                                                                                       trial = trial_num, myclass = validation_class))

Species


100%|██████████| 1517/1517 [00:39<00:00, 38.68it/s]

Strain


100%|██████████| 1633/1633 [00:40<00:00, 40.79it/s]

Cell_type


100%|██████████| 1078/1078 [00:27<00:00, 38.80it/s]

Genotype


100%|██████████| 808/808 [00:24<00:00, 33.05it/s]

Condition_Disease


100%|██████████| 158/158 [00:04<00:00, 36.65it/s]

Tissue


100%|██████████| 1432/1432 [00:40<00:00, 35.73it/s]

Sex


100%|██████████| 279/279 [00:07<00:00, 36.49it/s]

Age


100%|██████████| 1366/1366 [00:36<00:00, 37.21it/s]

Data_type


100%|██████████| 83/83 [00:07<00:00, 11.26it/s]

Platform


100%|██████████| 372/372 [00:09<00:00, 38.57it/s]

Protocol


100%|██████████| 21/21 [00:00<00:00, 29.71it/s]
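
The score-margin filter in the loop above keeps an n-gram only when its best class beats the runner-up by more than 0.1. The runner-up is obtained with `quantile(0.999, interpolation='lower')`, which for a small number of classes (11 here) lands on the second-largest value in each row. A toy sketch of that step, with made-up probabilities and class names chosen only for illustration:

```python
import pandas as pd

# Made-up class probabilities for three n-grams (rows) over three classes (columns)
toy_proba = pd.DataFrame(
    [[0.70, 0.20, 0.10],
     [0.40, 0.35, 0.25],
     [0.05, 0.90, 0.05]],
    columns=["Species", "Tissue", "Sex"],
    index=["homo sapiens", "whole blood", "liver biopsy"],
)

top = toy_proba.max(axis=1)
# With 'lower' interpolation the 0.999 quantile falls back to the
# second-largest entry of each row when there are only a few columns
second = toy_proba.quantile(0.999, interpolation="lower", axis=1)

margin_ok = (top - second) > 0.1  # best class must lead by more than 0.1
kept = pd.DataFrame({
    "predicted": toy_proba[margin_ok].idxmax(axis=1),
    "score": toy_proba[margin_ok].max(axis=1),
})
print(kept)
```

Here "whole blood" is dropped because its top two probabilities differ by only 0.05, while the other two rows pass. An explicit alternative that does not depend on the number of classes would be to sort each row and take the second-to-last value.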
