# Tokenization of sentences (preparation for LDA)

N.B.! This notebook requires a dataframe of "sentencized" texts.

Run "Text_Sentencizer.ipynb" and create a "sentencized.tsv" before running this notebook.

## Configuration

In [1]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import itertools

from tqdm import tqdm_notebook as tqdm 

from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import logging  # Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

## Data Loading

In [2]:
flatten_df = pd.read_csv('../pickles/wellcome_sentencized.tsv',sep='\t',quoting=csv.QUOTE_NONE)
flatten_df.drop(columns=['Unnamed: 0'],inplace=True)
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24542 entries, 0 to 24541
Data columns (total 3 columns):
manuscript_ID    24542 non-null object
review_ID        24542 non-null object
sentences        24542 non-null object
dtypes: object(3)
memory usage: 575.3+ KB


Remove standard sentences.

In [3]:
import re
std_sentence = ['I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.',
                'I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.',
                'I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.',
                'We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.',                
                'We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.',
                'We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.',
                'I believe that I have an appropriate levels of expertise to determine whether or not it meets an acceptable scientific standard.',
                'I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.',
                'Are sufficient details of methods and analysis provided to allow replication by others?',
                'Are sufficient details of the methods and analysis provided to allow replication by others?',
                'Are sufficient details of methods and materials provided to allow replication by others?',
                'Are sufficient details of the methods provided to allow replication by others?',
                'There are generally sufficient details of the methods and analysis provide to allow replication by others.',
                'Are the conclusions drawn adequately supported by the results?'
               ]

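def remove_std_sent(text, std_sentence):
    """Strip every standard sentence embedded in `text`."""
    for sent in std_sentence:
        # Escape the sentence: '.' and '?' are regex metacharacters
        text = re.sub(re.escape(sent), '', str(text))
    return text

# Drop rows that consist solely of a standard sentence
flatten_df = flatten_df[~flatten_df.sentences.isin(std_sentence)].copy()
# Strip standard sentences embedded in longer rows and keep the result
flatten_df['sentences'] = flatten_df['sentences'].apply(lambda x: remove_std_sent(x, std_sentence))
flatten_df.reset_index(drop=True, inplace=True)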
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23369 entries, 0 to 23368
Data columns (total 3 columns):
manuscript_ID    23369 non-null object
review_ID        23369 non-null object
sentences        23369 non-null object
dtypes: object(3)
memory usage: 547.8+ KB
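
The standard sentences contain regex metacharacters such as `.` and `?`, so they behave differently as raw patterns than as literal strings. A minimal sketch of the difference, using one of the sentences above and a made-up review text (the `text` value is illustrative, not taken from the dataset):

```python
import re

question = 'Are the conclusions drawn adequately supported by the results?'
# Illustrative text only; not a row from the dataset
text = 'Good paper. Are the conclusions drawn adequately supported by the results? Yes.'

# Raw pattern: 'results?' means "result" plus an optional "s",
# so the literal question mark is left behind in the text.
raw = re.sub(question, '', text)

# Escaped pattern: every character of the sentence is matched literally.
escaped = re.sub(re.escape(question), '', text)

print(repr(raw))
print(repr(escaped))
```

Only the escaped version removes the full sentence including its question mark.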


Test preprocessing function.

In [4]:
from peertax.tokenizer_LDA import custom_tokenizer as ct
from random import randint
num = randint(0, len(flatten_df) - 1)  # randint includes both endpoints
sent_test = [flatten_df.loc[num,'sentences']]
sent_after = ct(sent_test)

In [5]:
print(sent_test)

['Thus, reproducibility was not established, and the experiment is underpowered.']


In [6]:
print(sent_after)

['reproducibility establish experiment underpowered']


Run tokenizer.

In [7]:
t = time()
txt = ct(flatten_df['sentences'])
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 1.99 mins


Put the results in a DataFrame so that missing values (sentences left with no tokens) can be dropped.

Duplicates are kept: identical sentences can come from different reviews, so each one still counts.

In [8]:
df_clean = pd.DataFrame({'token': txt})
df_clean = df_clean.dropna()
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22858 entries, 0 to 23368
Data columns (total 1 columns):
token    22858 non-null object
dtypes: object(1)
memory usage: 357.2+ KB


Merge with the initial dataset so that rows stay aligned by index. The 'token' column is provisional; it is replaced after the bigrams and trigrams are created.

In [9]:
df_cleaned = pd.concat([flatten_df, df_clean], axis=1, join='inner').reset_index(drop=True)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22858 entries, 0 to 22857
Data columns (total 4 columns):
manuscript_ID    22858 non-null object
review_ID        22858 non-null object
sentences        22858 non-null object
token            22858 non-null object
dtypes: object(4)
memory usage: 714.4+ KB
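
The merge relies on index alignment: `dropna()` keeps the original index labels, and `pd.concat(..., axis=1, join='inner')` keeps only the labels present in both frames. A small sketch with made-up data (the three toy rows are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for flatten_df and the tokenizer output
original = pd.DataFrame({'sentences': ['First sentence.', '???', 'Third sentence.']})
tokens = pd.DataFrame({'token': ['first sentence', None, 'third sentence']})

# Dropping the missing row leaves index labels 0 and 2
tokens = tokens.dropna()

# The inner join on the index discards row 1 from `original` as well,
# so each remaining sentence stays paired with its own tokens
merged = pd.concat([original, tokens], axis=1, join='inner').reset_index(drop=True)
print(merged)
```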


In [10]:
df_cleaned.head()

| | manuscript_ID | review_ID | sentences | token |
|---|---|---|---|---|
| 0 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The Deciphering Mechanisms of Developmental Di... | deciphering mechanisms developmental disorders... |
| 1 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The lines chosen for this study were selected ... | line choose study select homozygous lethal sub... |
| 2 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The authors employ High Resolution Episcopic M... | author employ high resolution episcopic micros... |
| 3 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | They exploit this rich dataset with a systemat... | exploit rich dataset systematic depth annotati... |
| 4 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The result is a survey of impressive scope in ... | result survey impressive scope term annotation... |


Create bigrams.

In [11]:
from gensim.models.phrases import Phrases, Phraser

INFO - 08:14:40: 'pattern' package not found; tag filters are not available for English


In [12]:
sent = [row.split() for row in df_cleaned['token']]

In [13]:
phrases_bi = Phrases(sent, min_count=30, progress_per=100000)

INFO - 08:14:40: collecting all words and their counts
INFO - 08:14:40: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 08:14:41: collected 180477 word types from a corpus of 254818 words (unigram + bigrams) and 22858 sentences
INFO - 08:14:41: using 180477 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [14]:
bigram = Phraser(phrases_bi)

INFO - 08:14:41: source_vocab length 180477
INFO - 08:14:45: Phraser built with 55 phrasegrams
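
The log above shows `min_count=30` and `threshold=10.0`. Under gensim's default scoring (`original_scorer`), a word pair is promoted to a phrase when its score exceeds the threshold. The sketch below reimplements that formula in plain Python to show how the two parameters interact. The helper name `phrase_score` is hypothetical and the word counts are invented; only the vocabulary size is taken from the log.

```python
def phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Plain-Python sketch of gensim's default phrase score (an illustration,
    not the library code itself)."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

LEN_VOCAB = 180477   # "using 180477 counts as vocab" in the log above
MIN_COUNT = 30
THRESHOLD = 10.0

# Invented counts: two fairly rare words that mostly occur together
rare_pair = phrase_score(200, 150, 90, LEN_VOCAB, MIN_COUNT)
# Invented counts: two very frequent words that seldom occur together
common_pair = phrase_score(5000, 4000, 40, LEN_VOCAB, MIN_COUNT)

print(rare_pair > THRESHOLD, common_pair > THRESHOLD)
```

Raising `min_count` or `threshold` makes the phrase detection stricter, which is one way to explain why only a small number of phrasegrams survive here.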


Create trigrams.

In [15]:
phrases_tri = Phrases(phrases_bi[sent], min_count=30, progress_per=100000)

INFO - 08:14:45: collecting all words and their counts
INFO - 08:14:45: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 08:14:48: collected 181304 word types from a corpus of 251730 words (unigram + bigrams) and 22858 sentences
INFO - 08:14:48: using 181304 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [16]:
trigram = Phraser(phrases_tri)

INFO - 08:14:48: source_vocab length 181304
INFO - 08:14:52: Phraser built with 55 phrasegrams


Transform the corpus with the bigram and trigram models.

In [17]:
sentences = trigram[bigram[sent]]

Run figure_conv() to map the figure-related tokens and n-grams found in the previous step (fig, fig_a, figure_b, etc.) to the single token 'figure'.

In [18]:
def figure_conv(text):
    if text in ['figure','figure_a','figure_b','figure_c',
                'fig','fig_a','fig_b','fig_c','figure_figure_supplement']:
        return 'figure'
    else:
        return text
    
def figure_conv_array(doc):
    return [figure_conv(word) for word in doc]

In [19]:
sentences = [figure_conv_array(r) for r in sentences]
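
As a quick check of the mapping, the sketch below applies an equivalent set-based lookup to a made-up token list (the `FIGURE_TOKENS` name and the sample tokens are illustrative assumptions):

```python
# Same variants as in figure_conv above, stored in a set for constant-time lookup
FIGURE_TOKENS = {'figure', 'figure_a', 'figure_b', 'figure_c',
                 'fig', 'fig_a', 'fig_b', 'fig_c', 'figure_figure_supplement'}

def figure_conv(token):
    """Collapse any known figure variant to 'figure'; leave other tokens alone."""
    return 'figure' if token in FIGURE_TOKENS else token

# Made-up tokens; 'fig_d' is not in the list, so it passes through unchanged
sample = ['fig_a', 'show', 'figure_figure_supplement', 'fig_d']
converted = [figure_conv(tok) for tok in sample]
print(converted)
```

Variants that are not listed explicitly (such as `fig_d` here) are left as they are, so the list may need extending if other panels show up among the detected phrases.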

Do sanity check of the effectiveness of the cleaning and addition of bigrams & trigrams.

In [20]:
word_freq = defaultdict(int)
for doc in sentences:  # `doc` rather than `sent`, which already holds the split corpus
    for word in doc:
        word_freq[word] += 1
len(word_freq)

14657

In [21]:
sorted(word_freq, key=word_freq.get, reverse=True)[:20]

['author',
 'study',
 'datum',
 'result',
 'paper',
 'figure',
 'analysis',
 'include',
 'provide',
 'method',
 'need',
 'report',
 'cell',
 'present',
 'use',
 'different',
 'describe',
 'important',
 'research',
 'number']
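
The same sanity check can be written more compactly with `collections.Counter`. A minimal sketch on made-up token lists (`toy_sentences` stands in for `sentences`):

```python
from collections import Counter
from itertools import chain

# Toy stand-in for the tokenized corpus
toy_sentences = [['author', 'study', 'figure'],
                 ['author', 'figure'],
                 ['author']]

# Flatten the lists of tokens and count every token in one pass
word_counts = Counter(chain.from_iterable(toy_sentences))

vocab_size = len(word_counts)          # number of distinct tokens
top_two = word_counts.most_common(2)   # most frequent tokens with their counts
print(vocab_size, top_two)
```

`most_common` returns the counts along with the tokens, which makes it easier to judge how dominant the top words are.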

Replace the provisional 'token' column with the final tokens (including bigrams and trigrams).

In [22]:
df_cleaned['token'] = [r for r in sentences]
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22858 entries, 0 to 22857
Data columns (total 4 columns):
manuscript_ID    22858 non-null object
review_ID        22858 non-null object
sentences        22858 non-null object
token            22858 non-null object
dtypes: object(4)
memory usage: 714.4+ KB


Join each token list in the 'token' column into a single comma-separated string.

In [23]:
df_cleaned['token'] = df_cleaned['token'].str.join(',')
df_cleaned.head()

| | manuscript_ID | review_ID | sentences | token |
|---|---|---|---|---|
| 0 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The Deciphering Mechanisms of Developmental Di... | deciphering,mechanisms,developmental,disorders... |
| 1 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The lines chosen for this study were selected ... | line,choose,study,select,homozygous,lethal,sub... |
| 2 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The authors employ High Resolution Episcopic M... | author,employ,high,resolution,episcopic,micros... |
| 3 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | They exploit this rich dataset with a systemat... | exploit,rich,dataset,systematic,depth,annotati... |
| 4 | 10.12688/wellcomeopenres.9899.1 | 10.21956/wellcomeopenres.10670.r18405 | The result is a survey of impressive scope in ... | result,survey,impressive,scope,term,annotation... |


Save dataframe with tokens

In [24]:
path_save_tsv = "../pickles/wellcome_tokenized_LDA_sentence_0.tsv"
df_cleaned.to_csv(path_save_tsv, sep='\t', quoting=csv.QUOTE_NONE)
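
A later notebook that loads this file has to turn the comma-separated strings back into token lists. The sketch below shows one possible round trip on a toy frame, writing to an in-memory buffer instead of the real path (the toy rows and the `buffer` and `restored` names are assumptions for illustration):

```python
import csv
import io

import pandas as pd

# Toy stand-in for df_cleaned with list-valued tokens
df = pd.DataFrame({'review_ID': ['r1', 'r2'],
                   'token': [['author', 'study'], ['figure', 'cell', 'datum']]})

# Same condensing step as above: one comma-separated string per row
df['token'] = df['token'].str.join(',')

# Write as TSV to an in-memory buffer (stands in for the file on disk)
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', quoting=csv.QUOTE_NONE)
buffer.seek(0)

# index_col=0 reads the saved index back as the index,
# which avoids an extra 'Unnamed: 0' column
restored = pd.read_csv(buffer, sep='\t', quoting=csv.QUOTE_NONE, index_col=0)

# Split the strings back into lists of tokens
restored['token'] = restored['token'].str.split(',')
print(restored)
```

The comma separator is safe only as long as no token contains a comma itself.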