# Text sentencizer

## Configuration

In [1]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
import itertools
from tqdm import tqdm_notebook as tqdm 
from time import time  # To time our operations

## Data Extraction & Cleaning

In [2]:
df = pd.read_csv('../data/reviews/wellcome_reviews.csv')
df.drop(columns=['Unnamed: 0'],inplace=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1735 entries, 0 to 1734
Data columns (total 3 columns):
manuscript_ID    1735 non-null object
review_ID        1735 non-null object
review           1735 non-null object
dtypes: object(3)
memory usage: 40.7+ KB


Remove rows with null entries in any column, then remove duplicate reviews (rows whose 'review' text is identical).

In [4]:
# Replace the strings 'None' and 'none' with NaN
df = df.replace(to_replace=['None','none'], value=np.nan)
d = df.isna().any()
data_df = df.dropna(subset = d[d.values == True].index.values).drop_duplicates(subset = ['review']).copy()
#Drop entries whose 'review' text has fewer than two characters
data_df.drop(data_df.review[data_df.review.str.len() < 2].index,inplace=True)
data_df.reset_index(drop=True,inplace=True)

In [5]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1206 entries, 0 to 1205
Data columns (total 3 columns):
manuscript_ID    1206 non-null object
review_ID        1206 non-null object
review           1206 non-null object
dtypes: object(3)
memory usage: 28.3+ KB
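
The cleaning steps above can be reproduced on a small, made-up frame to see which rows survive. This is only a sketch: the toy rows and the name `cleaned` are illustrative assumptions, and `dropna()` is called without `subset`, which removes the same rows as the cell above because columns without NaNs cannot trigger a drop.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the Wellcome reviews)
toy = pd.DataFrame({
    "manuscript_ID": ["m1", "m1", "m2", "m3", "m4"],
    "review_ID": ["r1", "r2", "r3", "r4", "r5"],
    "review": ["Good paper.", "Good paper.", "None", "x", "Needs more controls."],
})

# Turn the placeholder strings into real missing values
toy = toy.replace(to_replace=["None", "none"], value=np.nan)

cleaned = (
    toy.dropna()                            # drop rows with any missing value
       .drop_duplicates(subset=["review"])  # keep the first copy of each review text
)

# Keep only reviews with at least two characters, then renumber the rows
cleaned = cleaned[cleaned["review"].str.len() >= 2].reset_index(drop=True)
print(cleaned)
```

Only `r1` and `r5` remain: `r2` duplicates `r1`, `r3` was the string 'None', and `r4` is a single character.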


In [6]:
data_df[data_df['review_ID'] == '10.21956/wellcomeopenres.13502.r26294']

Unnamed: 0,manuscript_ID,review_ID,review
466,10.12688/wellcomeopenres.12469.1,10.21956/wellcomeopenres.13502.r26294,\n This is a well-written report fr...


Test the preprocessing function on a single review.

In [7]:
import re
boundary_d = re.compile(r'[0-9]')
boundary_d.match('2015')
            

<_sre.SRE_Match object; span=(0, 1), match='2'>
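
The span `(0, 1)` in the output above follows from two facts: `match()` only tests at the start of the string, and `[0-9]` without a quantifier consumes a single character. The short sketch below contrasts `match()`, `search()` and `fullmatch()`; the names `digit` and `all_digits` are illustrative and do not come from the sentencizer.

```python
import re

digit = re.compile(r"[0-9]")

# match() only tests at the start of the string and [0-9] consumes one character
first = digit.match("2015")
print(first.group(), first.span())

# match() returns None when the string does not start with a digit
print(digit.match("year 2015"))

# search() scans the whole string instead
found = digit.search("year 2015")
print(found.group(), found.span())

# A quantifier plus fullmatch() checks that the entire token is numeric
all_digits = re.compile(r"[0-9]+")
print(bool(all_digits.fullmatch("2015")), bool(all_digits.fullmatch("2015a")))
```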

In [8]:
from peertax.sentencizer_LDA import custom_sentencizer as cs
from random import randint
#num = randint(0, len(data_df) - 1)  # randint includes both endpoints
num = 466
txt_test = [data_df.loc[num,'review']]
txt_after = cs(txt_test)


In [9]:
for row in txt_test:
    print(repr(row))

'\n            This is a well-written report from leading investigators in the field of HTLV-I replication. In this study, the authors screened more than 19 000 cells derived from five naturally infected T cell clones isolated from PBMCs of HTLV-1 healthy donors. They used single-molecule RNA FISH to quantify, at the single cell level, the transcripts of two main products of the HTLV-I pX region, Tax encoded by the plus-strand of the provirus and HBZ encoded by its minus-strand. They observed a strong intra- and inter-clonal heterogeneity in the expression of \n                tax and\n                 hbz genes with Tax being expressed at high levels in few cells whereas \n                hbz is expressed at lower levels in most cells. They also report that both genes are transcribed in intermittent bursts and that tax expression is enhanced in the absence of HBZ but that \n                hbz expression is enhanced by Tax. Finally, they show that HBZ expression is mostly associated w

In [10]:
for i in txt_after[0]:
    print(i)
    print('\n')
#txt_after

This is a well-written report from leading investigators in the field of HTLV-I replication.


In this study, the authors screened more than 19 000 cells derived from five naturally infected T cell clones isolated from PBMCs of HTLV-1 healthy donors.


They used single-molecule RNA FISH to quantify, at the single cell level, the transcripts of two main products of the HTLV-I pX region, Tax encoded by the plus-strand of the provirus and HBZ encoded by its minus-strand.


They observed a strong intra- and inter-clonal heterogeneity in the expression of tax and hbz genes with Tax being expressed at high levels in few cells whereas hbz is expressed at lower levels in most cells.


They also report that both genes are transcribed in intermittent bursts and that tax expression is enhanced in the absence of HBZ but that hbz expression is enhanced by Tax.


Finally, they show that HBZ expression is mostly associated with G2 M phases of the cell cycle, and that its abundance correlates with the
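
The before/after comparison shows two effects: runs of whitespace and newlines are collapsed, and the text is split into one sentence per list item. The internals of `custom_sentencizer` are not shown in this notebook, so the function below is not its implementation. It is a naive stand-in, under the assumption that a sentence ends at `.`, `!` or `?` followed by whitespace and a capital letter; `naive_sentencize` is a hypothetical name.

```python
import re

def naive_sentencize(text):
    """Collapse whitespace, then split after ., ! or ? when a capital letter follows."""
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", flat)

# Shortened, made-up variant of the review text printed above
raw = (
    "\n    This is a well-written report. They observed heterogeneity "
    "in the expression of \n    tax and\n    hbz genes. They also report bursts."
)
sentences = naive_sentencize(raw)
for s in sentences:
    print(s)
```

A rule this simple also splits after abbreviations such as "et al." when a capital letter follows, which is one reason a dedicated sentencizer is used here.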

Run sentencizer.

In [11]:
t = time()
data_df['sentences'] = cs(data_df['review'])
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 0.98 mins


In [12]:
#Replace 'nan' with proper NaN
data_df.replace(to_replace=['nan'], value=np.nan, inplace=True)
#Drop NaNs
data_df.drop(data_df.sentences[data_df.sentences.isna() == True].index,inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1206 entries, 0 to 1205
Data columns (total 4 columns):
manuscript_ID    1206 non-null object
review_ID        1206 non-null object
review           1206 non-null object
sentences        1206 non-null object
dtypes: object(4)
memory usage: 47.1+ KB


Flatten the dataframe.

In [13]:
flatten_df = pd.DataFrame({
        "manuscript_ID": np.repeat(data_df.manuscript_ID.values, data_df.sentences.str.len()),
        "review_ID": np.repeat(data_df.review_ID.values, data_df.sentences.str.len()), 
        "sentences": list(itertools.chain.from_iterable(data_df.sentences))})
flatten_df.head()

Unnamed: 0,manuscript_ID,review_ID,sentences
0,10.12688/wellcomeopenres.9899.1,10.21956/wellcomeopenres.10670.r18405,The Deciphering Mechanisms of Developmental Di...
1,10.12688/wellcomeopenres.9899.1,10.21956/wellcomeopenres.10670.r18405,The lines chosen for this study were selected ...
2,10.12688/wellcomeopenres.9899.1,10.21956/wellcomeopenres.10670.r18405,The authors employ High Resolution Episcopic M...
3,10.12688/wellcomeopenres.9899.1,10.21956/wellcomeopenres.10670.r18405,They exploit this rich dataset with a systemat...
4,10.12688/wellcomeopenres.9899.1,10.21956/wellcomeopenres.10670.r18405,The result is a survey of impressive scope in ...
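
The flattening step repeats each ID once per sentence and chains the sentence lists into one long column. The sketch below shows the same pattern on made-up data and compares it with `DataFrame.explode` (available from pandas 0.25), which is offered here as a possible alternative, not as the method used in this notebook. The names `toy`, `flat` and `exploded` are illustrative.

```python
import itertools
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "review_ID": ["r1", "r2"],
    "sentences": [["First sentence.", "Second sentence."], ["Only sentence."]],
})

# Number of sentences per review; each ID is repeated that many times
counts = toy["sentences"].str.len()
flat = pd.DataFrame({
    "review_ID": np.repeat(toy["review_ID"].to_numpy(), counts),
    "sentences": list(itertools.chain.from_iterable(toy["sentences"])),
})

# Alternative: explode() produces one row per list element
exploded = toy.explode("sentences").reset_index(drop=True)
print(flat)
```

One difference to keep in mind: a review with an empty sentence list contributes no rows in the `np.repeat` version, whereas `explode` keeps the row with NaN in 'sentences'.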


In [14]:
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25711 entries, 0 to 25710
Data columns (total 3 columns):
manuscript_ID    25711 non-null object
review_ID        25711 non-null object
sentences        25711 non-null object
dtypes: object(3)
memory usage: 602.7+ KB


Clean the results.

In [15]:
#Replace 'nan' with nan in sentences
flatten_df.replace(to_replace=['nan'], value=np.nan,inplace=True)
#Drop NaNs
flatten_df.drop(flatten_df.sentences[flatten_df.sentences.str.len().isna() == True].index,inplace=True)
#Drop entries with fewer than 30 characters (garbage)
flatten_df.drop(flatten_df.sentences[flatten_df.sentences.str.len() < 30].index,inplace=True)
#Reset index
flatten_df.reset_index(drop=True,inplace=True)
flatten_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24542 entries, 0 to 24541
Data columns (total 3 columns):
manuscript_ID    24542 non-null object
review_ID        24542 non-null object
sentences        24542 non-null object
dtypes: object(3)
memory usage: 575.3+ KB
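
The two drops in the cell above (missing values, then short fragments) can also be written as a single boolean mask, because a comparison against a missing length does not select the row. The sketch below uses made-up sentences; `MIN_CHARS`, `keep` and `filtered` are illustrative names, and the threshold of 30 is copied from the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"sentences": [
    "The methods are described clearly and in detail.",
    "Minor points:",
    "nan",
    "The conclusions follow from the data presented here.",
]})

MIN_CHARS = 30  # same threshold as the cell above

# Turn the string 'nan' into a real missing value
toy = toy.replace(to_replace=["nan"], value=np.nan)

# Missing lengths never satisfy the comparison, so one mask drops
# both the missing entries and the short fragments
keep = toy["sentences"].str.len() >= MIN_CHARS
filtered = toy[keep].reset_index(drop=True)
print(filtered)
```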


Save dataframe with sentences.

In [16]:
path_save_tsv = "../pickles/wellcome_sentencized.tsv"
flatten_df.to_csv(path_save_tsv, sep = '\t', quoting=csv.QUOTE_NONE)
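
When the file is read back, the reader should use the same separator and quoting mode as the writer. The sketch below checks the round trip on made-up rows, with an in-memory buffer standing in for the `.tsv` file; `toy`, `buffer` and `restored` are illustrative names. With `csv.QUOTE_NONE` nothing is quoted, so a sentence containing a tab character would likely need an `escapechar` (or prior removal of tabs) to be written safely.

```python
import csv
import io
import pandas as pd

toy = pd.DataFrame({
    "review_ID": ["r1", "r1"],
    "sentences": ["The methods are described clearly.", "The figures need larger labels."],
})

buffer = io.StringIO()  # in-memory stand-in for the .tsv file
toy.to_csv(buffer, sep="\t", quoting=csv.QUOTE_NONE)

buffer.seek(0)
# index_col=0 restores the row index that to_csv wrote as the first column
restored = pd.read_csv(buffer, sep="\t", quoting=csv.QUOTE_NONE, index_col=0)
print(restored)
```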