# Tokenization of sentences (preparation for LDA)

N.B.! This notebook requires a dataframe of "sentencized" texts.

Run "Text_Sentencizer.ipynb" first to create the sentencized pickle (loaded below as "f1000_sentencized.pkl").

## Configuration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import itertools

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import logging  # Set up logging to monitor gensim's progress
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

## Data Loading

In [46]:
flatten_df = pd.read_pickle('../pickles/f1000_sentencized.pkl')
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130846 entries, 0 to 130845
Data columns (total 3 columns):
manuscript_ID    130846 non-null object
review_ID        130846 non-null object
sentences        130846 non-null object
dtypes: object(3)
memory usage: 3.0+ MB


Remove the standard reviewer confirmation sentence, which is boilerplate and carries no review-specific content.

In [47]:
std_sentence = 'I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.'
flatten_df = flatten_df[flatten_df.sentences != std_sentence]
flatten_df.reset_index(drop=True,inplace=True)
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126818 entries, 0 to 126817
Data columns (total 3 columns):
manuscript_ID    126818 non-null object
review_ID        126818 non-null object
sentences        126818 non-null object
dtypes: object(3)
memory usage: 2.9+ MB


Test the preprocessing function on a randomly chosen sentence.

In [48]:
from peertax.tokenizer_LDA import custom_tokenizer as ct
from random import randint
num = randint(0, len(flatten_df) - 1)  # randint includes both endpoints
sent_test = [flatten_df.loc[num, 'sentences']]
sent_after = ct(sent_test)

In [49]:
print(sent_test)

['Please explain.']


In [50]:
print(sent_after)

[None]


Run tokenizer.

In [51]:
t = time()
txt = ct(flatten_df['sentences'])
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 8.89 mins


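Put the results in a DataFrame and drop missing values (the tokenizer returns None for some sentences, as the 'Please explain.' test above shows).

Do not remove duplicates: identical sentences that appear in different reviews are still valid review content.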

In [52]:
df_clean = pd.DataFrame({'token': txt})
df_clean = df_clean.dropna()
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 114523 entries, 1 to 126817
Data columns (total 1 columns):
token    114523 non-null object
dtypes: object(1)
memory usage: 1.7+ MB


Merge with the initial dataset on the index so that each token string stays attached to its manuscript and review IDs. The 'token' column is provisional and is replaced after the bigrams and trigrams are created.

In [53]:
df_cleaned = pd.concat([flatten_df, df_clean], axis=1, join='inner').reset_index(drop=True)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114523 entries, 0 to 114522
Data columns (total 4 columns):
manuscript_ID    114523 non-null object
review_ID        114523 non-null object
sentences        114523 non-null object
token            114523 non-null object
dtypes: object(4)
memory usage: 3.5+ MB
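
The drop-then-merge steps above rely on pandas index alignment: `dropna()` keeps the original index labels, and `join='inner'` keeps only the labels present in both frames. A minimal sketch on toy data (the frames `toy_source` and `toy_tokens` are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical stand-ins for flatten_df and the tokenizer output.
toy_source = pd.DataFrame({'sentences': ['Please explain.',
                                         'This paper has flaws.',
                                         'See figure 2.']})
toy_tokens = pd.DataFrame({'token': [None, 'paper flaw', 'figure']})

# dropna() removes the row with a missing token but keeps index labels 1 and 2.
toy_tokens = toy_tokens.dropna()

# The inner join keeps only index labels found in both frames, so every token
# string stays next to the sentence it came from; the index is then renumbered.
toy_merged = pd.concat([toy_source, toy_tokens], axis=1, join='inner').reset_index(drop=True)
print(toy_merged)
```

Resetting the index only after the merge matters here: renumbering `df_clean` before the concat would misalign tokens and sentences.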


In [54]:
df_cleaned.head()

Unnamed: 0,manuscript_ID,review_ID,sentences,token
0,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r101,"However, I am sure there are some sections whe...",sure section reader like information adjustmen...
1,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,This paper has a number of serious flaws.,paper number flaw
2,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,a) The literature is quoted selectively and is...,literature quote selectively issue oversimplify
3,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,The paper states that blood exposure is import...,paper state blood exposure important state pre...
4,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,For example it doesn t reference articles such...,example doesn reference article unsafe injecti...


Create bigrams.

In [55]:
from gensim.models.phrases import Phrases, Phraser

In [56]:
sent = [row.split() for row in df_cleaned['token']]

In [57]:
phrases_bi = Phrases(sent, min_count=30, progress_per=100000)

INFO - 13:14:22: collecting all words and their counts
INFO - 13:14:22: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 13:14:25: PROGRESS: at sentence #100000, processed 1017374 words and 569315 word types
INFO - 13:14:26: collected 632062 word types from a corpus of 1161874 words (unigram + bigrams) and 114523 sentences
INFO - 13:14:26: using 632062 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [58]:
bigram = Phraser(phrases_bi)

INFO - 13:14:26: source_vocab length 632062
INFO - 13:14:41: Phraser built with 387 phrasegrams


Create trigrams.

In [59]:
phrases_tri = Phrases(phrases_bi[sent], min_count=30, progress_per=100000)

INFO - 13:14:41: collecting all words and their counts
INFO - 13:14:41: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 13:14:53: PROGRESS: at sentence #100000, processed 973137 words and 580825 word types
INFO - 13:14:55: collected 645470 word types from a corpus of 1110962 words (unigram + bigrams) and 114523 sentences
INFO - 13:14:55: using 645470 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [60]:
trigram = Phraser(phrases_tri)

INFO - 13:14:55: source_vocab length 645470
INFO - 13:15:12: Phraser built with 457 phrasegrams
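
To give some intuition for what `min_count` and `threshold` do in the two `Phrases` calls above, here is a simplified pure-Python sketch modeled on the default scoring formula documented for gensim's `Phrases`. It is an illustration, not gensim's implementation: the function `score_bigrams` and the toy corpus are hypothetical, and the vocabulary size here counts unigram types only, which is an assumption.

```python
from collections import Counter

def score_bigrams(corpus, min_count, threshold):
    """Return {(a, b): score} for adjacent word pairs scoring above threshold.

    Simplified score:
        (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter(word for doc in corpus for word in doc)
    bigrams = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # assumption: unigram types only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy corpus; the notebook itself uses min_count=30 and the default threshold=10.0.
toy_corpus = [['sample', 'size', 'small'],
              ['sample', 'size', 'calculation'],
              ['increase', 'sample', 'size'],
              ['small', 'study']]
toy_scores = score_bigrams(toy_corpus, min_count=1, threshold=1.0)
print(toy_scores)
```

Pairs that co-occur often relative to how frequent each word is on its own score highest. Running the same procedure on the bigram-transformed corpus is what lets a detected pair such as `sample_size` combine with a neighbouring word into a trigram.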


Transform the corpus using the bigram and trigram models.

In [61]:
sentences = trigram[bigram[sent]]

Run figure_conv() to map the figure-related tokens and phrases found in the previous step (fig_a, figure_b, etc.) to the single token 'figure'.

In [62]:
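# Tokens and detected phrases that all refer to a figure
FIGURE_TOKENS = {'figure', 'figure_a', 'figure_b', 'figure_c',
                 'fig', 'fig_a', 'fig_b', 'fig_c', 'figure_figure_supplement'}

def figure_conv(text):
    """Map any figure-related token to 'figure'; leave other tokens unchanged."""
    return 'figure' if text in FIGURE_TOKENS else text

def figure_conv_array(doc):
    """Apply figure_conv to every token of a tokenized sentence."""
    return [figure_conv(word) for word in doc]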

In [63]:
sentences = [figure_conv_array(r) for r in sentences]

Sanity-check the cleaning and the added bigrams and trigrams by inspecting the vocabulary size and the most frequent tokens.

In [64]:
word_freq = defaultdict(int)
for tokens in sentences:  # avoid shadowing `sent`, the token list defined earlier
    for token in tokens:
        word_freq[token] += 1
len(word_freq)

36621

In [65]:
sorted(word_freq, key=word_freq.get, reverse=True)[:20]

['author',
 'study',
 'datum',
 'result',
 'paper',
 'provide',
 'analysis',
 'figure',
 'use',
 'method',
 'article',
 'present',
 'need',
 'patient',
 'include',
 'different',
 'describe',
 'manuscript',
 'work',
 'case']
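
The same frequency check can be written more compactly with `collections.Counter`. A minimal sketch on made-up token lists (`toy_sentences` is hypothetical; in the notebook the input would be `sentences`):

```python
from collections import Counter

# Hypothetical tokenized sentences standing in for `sentences`.
toy_sentences = [['paper', 'number', 'flaw'],
                 ['paper', 'state', 'blood', 'exposure'],
                 ['figure', 'paper'],
                 ['figure']]

# Count every token across all sentences in a single pass.
toy_freq = Counter(token for tokens in toy_sentences for token in tokens)

print(len(toy_freq))            # vocabulary size
print(toy_freq.most_common(2))  # most frequent tokens with their counts
```

`most_common(n)` returns the counts alongside the tokens, which makes it easier to spot a phrase that dominates the corpus unexpectedly.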

Replace the provisional 'token' column with the final token lists.

In [66]:
df_cleaned['token'] = list(sentences)
df_cleaned.head()

Unnamed: 0,manuscript_ID,review_ID,sentences,token
0,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r101,"However, I am sure there are some sections whe...","[sure, section, reader, like, information, adj..."
1,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,This paper has a number of serious flaws.,"[paper, number, flaw]"
2,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,a) The literature is quoted selectively and is...,"[literature, quote, selectively, issue, oversi..."
3,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,The paper states that blood exposure is import...,"[paper, state, blood, exposure, important, sta..."
4,10.12688/f1000research.1-1.v1,10.5256/f1000research.50.r100,For example it doesn t reference articles such...,"[example, doesn, reference, article, unsafe, i..."


Save the DataFrame with tokens.

In [67]:
# Save to pickle. TODO: wrap this in a function that checks whether the file already exists.
path_save_pickle = "../pickles/f1000_tokenized_LDA_sentence_0.pkl"
df_cleaned.to_pickle(path_save_pickle)
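
One possible shape for the existence check mentioned in the comment above is sketched here. The helper name `save_pickle_if_absent` and its `overwrite` flag are hypothetical, not part of the notebook or of pandas; the helper only assumes that the object has a `to_pickle(path)` method, as a pandas DataFrame does.

```python
from pathlib import Path

def save_pickle_if_absent(obj, path, overwrite=False):
    """Call obj.to_pickle(path) unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    path = Path(path)
    if path.exists() and not overwrite:
        print(f"{path} already exists; skipping (pass overwrite=True to replace it)")
        return False
    # Create the target folder if needed, then delegate to the object's own writer.
    path.parent.mkdir(parents=True, exist_ok=True)
    obj.to_pickle(path)
    return True
```

In this notebook the call would be `save_pickle_if_absent(df_cleaned, path_save_pickle)`, which leaves an earlier tokenization run untouched unless `overwrite=True` is passed.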