# Tokenization of sentences (preparation for LDA)

N.B. This notebook requires a DataFrame of "sentencized" texts.

Run "Text_Sentencizer.ipynb" first to create the sentencized pickle ("../pickles/f1000_sentencized.pkl", loaded below) before running this notebook.

## Configuration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import itertools

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import logging  # Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

## Data loading

In [2]:
flatten_df = pd.read_pickle('../pickles/f1000_sentencized.pkl')
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132419 entries, 0 to 132418
Data columns (total 3 columns):
manuscript_ID    132419 non-null object
review_ID        132419 non-null object
sentences        132419 non-null object
dtypes: object(3)
memory usage: 3.0+ MB


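Remove the standard reviewer confirmation statements, which are identical across reviews and carry no review-specific content.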

In [3]:
std_sentence = ['I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.',
                'I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.',
                'I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.']
flatten_df = flatten_df[~flatten_df.sentences.isin(std_sentence)]
flatten_df.reset_index(drop=True,inplace=True)
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125842 entries, 0 to 125841
Data columns (total 3 columns):
manuscript_ID    125842 non-null object
review_ID        125842 non-null object
sentences        125842 non-null object
dtypes: object(3)
memory usage: 2.9+ MB


Test the preprocessing function on one randomly chosen sentence.

In [4]:
from peertax.tokenizer_LDA import custom_tokenizer as ct
from random import randint
num = randint(0, len(flatten_df) - 1)  # randint includes both endpoints, so stop at the last valid index
sent_test = [flatten_df.loc[num,'sentences']]
sent_after = ct(sent_test)

In [5]:
print(sent_test)

['The author seeks to address this in part through an alternative Python implementation, but laments that external library dependencies required to do research in Python and the difficulties in tracking strong versioning available in the bytecode implementation.']


In [6]:
print(sent_after)

['author seek address alternative python implementation lament external library dependency require research python difficulty track strong versioning available bytecode implementation']
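
The source of `custom_tokenizer` is not shown in this notebook. As a rough, hypothetical stand-in that illustrates the input/output shape only (a list of sentences in, one cleaned string or `None` per sentence out), something like the sketch below could be used. The name `simple_tokenizer`, the tiny stopword list and the `min_len` parameter are assumptions, and unlike the real tokenizer this sketch does not lemmatize.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use
# a full list from an NLP library and would also lemmatize the words.
STOPWORDS = {"the", "a", "an", "to", "in", "of", "and", "but",
             "that", "this", "is", "are", "it", "through"}


def simple_tokenizer(sentences, min_len=3):
    """Hypothetical stand-in for custom_tokenizer.

    Returns one cleaned, space-joined string per input sentence,
    or None when nothing survives the filtering.
    """
    cleaned = []
    for text in sentences:
        # Lowercase and keep alphabetic runs only
        words = re.findall(r"[a-z]+", text.lower())
        # Drop stopwords and very short words
        kept = [w for w in words if w not in STOPWORDS and len(w) >= min_len]
        cleaned.append(" ".join(kept) if kept else None)
    return cleaned


print(simple_tokenizer(["The author seeks to address this.", "This is it."]))
```

Returning `None` for sentences that end up empty is what makes a later `dropna()` step meaningful.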


Run tokenizer.

In [7]:
t = time()
txt = ct(flatten_df['sentences'])
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 7.75 mins


Put the results in a DataFrame to remove missing values.

Do not drop duplicates: identical sentences can come from different reviews and should still be counted.

In [8]:
df_clean = pd.DataFrame({'token': txt})
df_clean = df_clean.dropna()
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112596 entries, 1 to 125841
Data columns (total 1 columns):
token    112596 non-null object
dtypes: object(1)
memory usage: 1.7+ MB


Merge with the initial dataset on the index so that each token string stays aligned with its manuscript and review IDs. The 'token' column is provisional and is overwritten once bigrams and trigrams have been created.

In [9]:
df_cleaned = pd.concat([flatten_df, df_clean], axis=1, join='inner').reset_index(drop=True)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112596 entries, 0 to 112595
Data columns (total 4 columns):
manuscript_ID    112596 non-null object
review_ID        112596 non-null object
sentences        112596 non-null object
token            112596 non-null object
dtypes: object(4)
memory usage: 3.4+ MB
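
The index-based merge above can be illustrated on a toy example. This is a minimal sketch with made-up data (the `left` and `right` frames are assumptions, not the notebook's data): `dropna()` keeps the original index labels, so `pd.concat(..., axis=1, join='inner')` keeps only the rows whose index survives in both frames.

```python
import pandas as pd

# Toy stand-ins for flatten_df and the tokenizer output (made-up data)
left = pd.DataFrame({"sentences": ["First sentence.", "...", "Third sentence."]})
right = pd.DataFrame({"token": ["first sentence", None, "third sentence"]})

# dropna() removes row 1 but keeps index labels 0 and 2
right = right.dropna()

# join='inner' keeps only index labels present in both frames,
# so each token string stays next to the sentence it came from
merged = pd.concat([left, right], axis=1, join="inner").reset_index(drop=True)
print(merged)
```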


In [10]:
df_cleaned.head()

| | manuscript_ID | review_ID | sentences | token |
|---|---|---|---|---|
| 0 | 10.12688/f1000research.1-1.v1 | 10.5256/f1000research.50.r101 | However, I am sure there are some sections whe... | sure section reader like information adjustmen... |
| 1 | 10.12688/f1000research.1-1.v1 | 10.5256/f1000research.50.r100 | This paper has a number of serious flaws. | paper number flaw |
| 2 | 10.12688/f1000research.1-1.v1 | 10.5256/f1000research.50.r100 | a) The literature is quoted selectively and is... | literature quote selectively issue oversimplify |
| 3 | 10.12688/f1000research.1-1.v1 | 10.5256/f1000research.50.r100 | The paper states that blood exposure is import... | paper state blood exposure important state pre... |
| 4 | 10.12688/f1000research.1-1.v1 | 10.5256/f1000research.50.r100 | For example it doesn t reference articles such... | example doesn reference article |


Create bigrams.

In [11]:
from gensim.models.phrases import Phrases, Phraser

INFO - 14:02:26: 'pattern' package not found; tag filters are not available for English


In [12]:
sent = [row.split() for row in df_cleaned['token']]

In [13]:
phrases_bi = Phrases(sent, min_count=30, progress_per=100000)

INFO - 14:02:27: collecting all words and their counts
INFO - 14:02:27: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 14:02:30: PROGRESS: at sentence #100000, processed 999083 words and 582204 word types
INFO - 14:02:30: collected 638689 word types from a corpus of 1122171 words (unigram + bigrams) and 112596 sentences
INFO - 14:02:30: using 638689 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [14]:
bigram = Phraser(phrases_bi)

INFO - 14:02:30: source_vocab length 638689
INFO - 14:02:45: Phraser built with 389 phrasegrams
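
To see why only a few hundred word pairs become phrases, it helps to look at the default scoring rule. The sketch below re-implements, in plain Python, the formula used by gensim's default ("original") scorer: a pair is merged when its score exceeds `threshold` (10.0 in the log above). The function name `phrase_score` and the example counts are assumptions for illustration; only the vocabulary size is taken from the log.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the candidate bigram (a, b) under the default scoring rule.

    count_a, count_b: occurrences of each word on its own
    count_ab:         occurrences of the pair "a b"
    vocab_size:       number of entries in the collected vocabulary
    min_count:        pairs at or below this count never score above zero
    """
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Hypothetical counts; vocab_size matches the log output above
score = phrase_score(count_a=500, count_b=400, count_ab=130,
                     vocab_size=638689, min_count=30)
threshold = 10.0
print(score > threshold)  # True here means the pair would be joined as "a_b"
```

Frequent words that rarely co-occur get low scores, which is why `min_count=30` together with the default threshold keeps the phrase list short.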


Create trigrams.

In [None]:
# Use the frozen Phraser built above: same output as phrases_bi[sent], but faster
phrases_tri = Phrases(bigram[sent], min_count=30, progress_per=100000)

INFO - 14:02:45: collecting all words and their counts
INFO - 14:02:45: PROGRESS: at sentence #0, processed 0 words and 0 word types


In [None]:
trigram = Phraser(phrases_tri)

Transform the corpus based on the bigrams & trigrams

In [None]:
sentences = trigram[bigram[sent]]

Run figure_conv() to map the figure-related tokens and bigrams found in the previous step (such as fig_a) to the single token 'figure'.

In [None]:
def figure_conv(text):
    if text in ['figure','figure_a','figure_b','figure_c',
                'fig','fig_a','fig_b','fig_c','figure_figure_supplement']:
        return 'figure'
    else:
        return text
    
def figure_conv_array(doc):
    return [figure_conv(word) for word in doc]

In [None]:
sentences = [figure_conv_array(r) for r in sentences]
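
As an alternative to the hard-coded list, a regular expression can cover any single-letter panel suffix. This is a sketch of a different, broader technique than the one used above: it also matches tokens such as `fig_d` that the original list leaves untouched. The names `FIGURE_TOKEN`, `EXTRA_FIGURE_TOKENS` and `normalize_figure_token` are hypothetical.

```python
import re

# Matches "fig" or "figure", optionally followed by "_" and one lowercase letter
FIGURE_TOKEN = re.compile(r"(?:fig|figure)(?:_[a-z])?")

# Tokens that do not fit the pattern but should still map to "figure"
EXTRA_FIGURE_TOKENS = {"figure_figure_supplement"}


def normalize_figure_token(token):
    """Map figure-like tokens to 'figure'; return every other token unchanged."""
    if FIGURE_TOKEN.fullmatch(token) or token in EXTRA_FIGURE_TOKENS:
        return "figure"
    return token


print([normalize_figure_token(t) for t in ["fig_b", "configure", "fig_d"]])
```

Using `fullmatch` rather than `search` avoids rewriting unrelated words that merely contain "fig".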

Sanity-check the effect of the cleaning and of the added bigrams and trigrams by counting token frequencies.

In [None]:
word_freq = defaultdict(int)
for doc in sentences:  # use 'doc' so the tokenized list 'sent' is not overwritten
    for word in doc:
        word_freq[word] += 1
len(word_freq)  # number of distinct tokens

In [None]:
sorted(word_freq, key=word_freq.get, reverse=True)[:20]
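
The same frequency check can be written more compactly with `collections.Counter`. This is a minimal sketch on made-up data; the helper name `top_tokens` and `example_docs` are assumptions.

```python
from collections import Counter
from itertools import chain


def top_tokens(docs, n=20):
    """Return the n most frequent tokens across a list of token lists."""
    counts = Counter(chain.from_iterable(docs))
    return counts.most_common(n)


# Made-up example documents
example_docs = [["paper", "figure", "clear"], ["figure", "legend"], ["paper", "figure"]]
print(top_tokens(example_docs, n=2))  # [('figure', 3), ('paper', 2)]
```

Unlike the `sorted(...)` call above, `most_common` returns the counts together with the tokens, which makes it easier to spot leftover noise.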

Replace the provisional 'token' column with the final tokens.

In [None]:
df_cleaned['token'] = [r for r in sentences]

Save dataframe with tokens

In [None]:
# Save to pickle. TODO: wrap this in a function that checks whether the file already exists.
path_save_pickle = "../pickles/f1000_tokenized_LDA_sentence_0.pkl"
df_cleaned.to_pickle(path_save_pickle)
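
The existence check mentioned in the comment could look like the sketch below. The helper name `save_pickle_if_absent` and its `overwrite` flag are assumptions, not part of the original notebook; it only relies on the object having a `to_pickle(path)` method, as a pandas DataFrame does.

```python
from pathlib import Path


def save_pickle_if_absent(frame, path, overwrite=False):
    """Pickle `frame` to `path` unless the file already exists.

    frame:     any object exposing to_pickle(path), e.g. a pandas DataFrame
    overwrite: set to True to replace an existing file
    Returns True if the file was written, False if it was skipped.
    """
    target = Path(path)
    if target.exists() and not overwrite:
        return False
    # Create the output directory if it is missing
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_pickle(target)
    return True
```

In this notebook it would be called as `save_pickle_if_absent(df_cleaned, path_save_pickle)`, which leaves an existing pickle untouched unless `overwrite=True` is passed.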