<a href="https://nbviewer.jupyter.org/github/alisonmitchell/Biomedical-Knowledge-Graph/blob/main/02_Exploratory_Data_Analysis/Preprocessing.ipynb"
   target="_parent">
   <img src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg"
      width="109" height="20" alt="render in nbviewer">
</a>

# Preprocessing


## 1. Introduction

Mainstream NLP approaches fall into three categories: rule-based, statistical, and neural. A traditional NLP pipeline includes preprocessing, feature extraction, and modelling.

NLP techniques have evolved over time, beginning with rule-based approaches requiring several preprocessing steps such as pattern matching and parsing. These were superseded by statistical machine learning methods and feature-based approaches, in which supervised algorithms required general linguistic, orthographic or dictionary look-up features to describe the input data, which would then be ingested into a different model.

This feature engineering focus then shifted towards word embeddings, or encoding raw text data into meaningful numeric feature vector representations that machine learning algorithms could understand and process. Words that are closer in the vector space are expected to be semantically similar and various methods emerged to generate the mapping of words to vectors.
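The idea that nearby vectors indicate similar meaning is usually measured with cosine similarity. The following is a minimal sketch using only the standard library; the three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "drug": [0.9, 0.1, 0.3],
    "medicine": [0.8, 0.2, 0.4],
    "galaxy": [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(toy_vectors["drug"], toy_vectors["medicine"])
sim_unrelated = cosine_similarity(toy_vectors["drug"], toy_vectors["galaxy"])
print(f"drug vs medicine: {sim_related:.3f}")
print(f"drug vs galaxy:   {sim_unrelated:.3f}")
```

With real embeddings the vectors typically have hundreds of dimensions, but the comparison works the same way.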

Frequency-based (statistical) approaches such as Bag-of-Words, TF-IDF and n-grams were limited, giving way to pretrained representations that use neural networks to learn word embeddings from context, word co-occurrence, and semantic and syntactic similarity. Models include the prediction-based Word2Vec and fastText, which use shallow neural networks, and the count-based GloVe, which is trained on global word co-occurrence statistics. All are representative of non-contextual embeddings, where a word's vector is static and does not change as its context changes.
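To make the frequency-based representations concrete, here is a small sketch of Bag-of-Words counts and one common TF-IDF variant (term frequency as a proportion of document length, inverse document frequency as `log(N / df)`). The corpus is hypothetical, and libraries such as scikit-learn use slightly different smoothing, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, already lowercased and tokenised
docs = [
    ["drug", "repurposing", "covid-19"],
    ["drug", "discovery", "pipeline"],
    ["covid-19", "vaccine", "trial"],
]

def bag_of_words(doc):
    """Raw term counts for a single document."""
    return Counter(doc)

def tf_idf(doc, corpus):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        # Document frequency: number of documents containing the term
        df = sum(1 for d in corpus if term in d)
        scores[term] = (count / len(doc)) * math.log(n_docs / df)
    return scores

bow = bag_of_words(docs[0])
scores = tf_idf(docs[0], docs)
print(bow)
print(scores)
```

Terms that appear in fewer documents ('repurposing') receive a higher weight than terms shared across documents ('drug'), but neither representation captures word meaning or context.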

Dimensionality reduction techniques for word embeddings include Principal Component Analysis (PCA), Singular Value Decomposition (SVD)-based Latent Semantic Analysis (LSA), T-distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). Underlying patterns in the data can then be visualised by training a clustering model such as K-means.

The Transformer architecture, based on the self-attention mechanism, revolutionised the field of NLP and laid the foundation for Large Language Models (LLMs) starting with encoder-only BERT (2018) used for discriminative tasks. Deep neural language models could now produce dynamic, bidirectional, context-aware representations of text by considering all words in a sequence simultaneously. The self-attention mechanism allowed the model to 'pay more attention' to the most relevant parts of the input, resulting in different vectors for the same word depending on its context.
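The self-attention step can be sketched in a few lines. This is a deliberately simplified version, assuming queries, keys and values are all the raw input vectors; real Transformers first apply learned projection matrices and use multiple attention heads. The token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        w = softmax(scores)
        weights.append(w)
        # Context-aware vector: attention-weighted average of all token vectors
        outputs.append([sum(w_j * value[i] for w_j, value in zip(w, x)) for i in range(d)])
    return outputs, weights

# Hypothetical 2-dimensional token vectors for a three-token sequence
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attn = self_attention(tokens)
for row in attn:
    print([round(w, 3) for w in row])
```

Each output vector mixes information from the whole sequence, which is why the same word can end up with different representations in different contexts.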

GPT (2018) was the first autoregressive, decoder-only model based on the Transformer architecture, with a unidirectional approach focusing on left-to-right text processing. Trained on a causal language modelling objective, it learned to predict the next word in a sentence and excelled at generative tasks.

Modifications to the encoder and decoder components of the Transformer architecture have given rise to a diverse range of LLMs optimised for contextual understanding, text-to-text tasks, and text generation. One of the most popular use cases for embeddings is in Retrieval Augmented Generation (RAG) applications which reduce hallucination in LLM-generated responses by grounding the model in external domain-specific data sources.

We will build preprocessing, feature extraction and modelling pipelines as part of the Exploratory Data Analysis process before progressing to the Information Extraction stage.



## 2. Install/import libraries

In [None]:
!pip install spacy scispacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz

In [None]:
import pandas as pd
import re
import pickle
import spacy
import scispacy
import warnings
warnings.filterwarnings("ignore")

from concurrent.futures import ProcessPoolExecutor
from spacy.lang.en import stop_words
from spacy.attrs import ORTH
from spacy.tokenizer import Tokenizer

## 3. Import data




In [None]:
with open('2024-02-24_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_updated.pickle', 'rb') as f:
    pmc_arxiv_full_text_merged_plus_cleaned_article_titles_updated = pickle.load(f)

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11687 entries, 0 to 11686
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     11687 non-null  object
 1   published      11687 non-null  object
 2   revised        11687 non-null  object
 3   title          11687 non-null  object
 4   title_cleaned  11687 non-null  object
 5   journal        11426 non-null  object
 6   authors        11687 non-null  object
 7   doi            11450 non-null  object
 8   pdf_url        11687 non-null  object
 9   text           11687 non-null  object
 10  word_count     11687 non-null  int64 
 11  sent_count     11687 non-null  int64 
dtypes: int64(2), object(10)
memory usage: 1.1+ MB


## 4. Preprocessing

The first stage in any NLP project is text preprocessing which facilitates feature extraction and modelling by making natural language normalised and machine-readable.

A preprocessing pipeline would typically include upstream tasks such as lowercasing, punctuation and stopword removal, tokenisation, part-of-speech tagging, dependency parsing, and stemming or lemmatisation.
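As a rough illustration of the simpler steps (lowercasing, punctuation removal, tokenisation and stopword removal), the sketch below uses only the standard library. The stopword set and the `simple_preprocess` helper are hypothetical; the notebook itself relies on spaCy for these steps.

```python
import re

# Small hypothetical stopword list; spaCy ships a much larger one
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to"}

def simple_preprocess(text):
    """Lowercase, strip punctuation (keeping hyphens), tokenise, drop stopwords."""
    lowered = text.lower()
    # Replace anything that is not a word character, whitespace or hyphen
    no_punct = re.sub(r"[^\w\s-]", " ", lowered)
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

result = simple_preprocess("Drug repositioning: A bibliometric analysis.")
print(result)
```

Steps such as part-of-speech tagging, dependency parsing and lemmatisation need a trained model, which is where a library pipeline becomes necessary.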

![spaCy processing pipeline](images/spacy_pipeline.svg)


[spaCy's processing pipeline](https://spacy.io/usage/processing-pipelines) contains the following components by default:

["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]

### 4.1 Clean title text

We will use the functions defined previously for customising stopwords and special cases, and add a clean text function with a regex pattern to match alphanumeric words (including both English and Greek characters), hyphens, and full stops, but which avoids matching sequences that are just numbers by themselves. This will ensure that hyphenated words, and terms such as 'methyl-β-cyclodextrin' and 'BA.2.75' (Omicron BA.2.75 variant) are included.
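The behaviour of the pattern can be checked in isolation. The sketch below compiles the same regular expression that `clean_text` uses and applies it to a hypothetical title written for this illustration.

```python
import re

# Same pattern as in the notebook's clean_text() function
pattern = re.compile(r"(?![0-9]+\b)[A-Za-z0-9\-α-ωΑ-Ω.]+")

# Hypothetical title, invented for this illustration
sample = "Effect of methyl-β-cyclodextrin on Omicron BA.2.75 in 2022 (n = 120)"

kept = pattern.findall(sample)
print(kept)
print(" ".join(kept))
```

The negative lookahead `(?![0-9]+\b)` rejects any position where a run of digits is followed by a word boundary, so '2022' and '120' are dropped, while 'BA.2.75' survives because the match starts on a letter. Parentheses and '=' are simply not in the character class.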

In [None]:
# load small scispacy model with unused components disabled
nlp = spacy.load("en_core_sci_sm", disable=['parser', 'ner'])

In [None]:
def preprocess_stop_words():
    stop = set(word.lower() for word in nlp.Defaults.stop_words)
    custom_stop_words = ["et", "al", "al."]
    for item in custom_stop_words:
        stop.add(item)
        nlp.vocab[item].is_stop = True
    return stop

def add_special_cases():
    special_cases = [
        {ORTH: "in vitro"},
        {ORTH: "in silico"},
        {ORTH: "in vivo"}
    ]
    for special_case in special_cases:
        nlp.tokenizer.add_special_case(special_case[ORTH], [special_case])


def clean_text(text):
    pattern = re.compile(r"(?![0-9]+\b)[A-Za-z0-9\-α-ωΑ-Ω.]+")
    cleaned_text = " ".join(pattern.findall(text))
    return cleaned_text

We will create a preprocessing pipeline function to apply to the title_cleaned column.

In [None]:
def preprocess_pipe(df):
    stop = preprocess_stop_words()

    add_special_cases()

    # Apply cleaning and tokenisation to the title column and remove stopwords
    df['title_preproc'] = df['title_cleaned'].apply(
        lambda x: " ".join(
            token.text.lower()
            for token in nlp(clean_text(x))
            if token.text.lower() not in stop
        )
    )
    return df

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc = pmc_arxiv_full_text_merged_plus_cleaned_article_titles_updated.copy()

In [None]:
%%time
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc = preprocess_pipe(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc)

CPU times: user 44.1 s, sys: 138 ms, total: 44.2 s
Wall time: 44.5 s


In [None]:
with open('2024-03-08_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc, f)

We will validate that the clean text function has worked by checking a few titles.

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.title_preproc[0]

'drug repositioning bibliometric analysis .'

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.title_preproc[13]

'targeting sars-cov-2 nsp13 helicase assessment druggability pockets identification potent inhibitors multi-site in silico drug repurposing approach .'

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.title_preproc[71]

'sars-cov-2 targeted human rna binding proteins network biology investigate covid-19 associated manifestations .'

### 4.2 Lemmatisation

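Lemmatisation normalises tokens by reducing each word to its base form (lemma), using the part-of-speech (POS) tag to choose the appropriate lemma. The function below returns lemmas only for tokens whose POS tag is in an allowed set: nouns, verbs, adjectives and adverbs.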



In [None]:
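def lemmatization(texts, allowed_postags=('NOUN', 'VERB', 'ADJ', 'ADV')):
    """Return, for each text, the lemmas of tokens whose POS tag is allowed.

    POS tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for doc in nlp.pipe(texts, batch_size=20):
        # Keep the lemma only when the coarse-grained POS tag is in the allowed set
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out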

In [None]:
%%time
title_preproc_lemm = lemmatization(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.title_preproc)

CPU times: user 13.8 s, sys: 53 ms, total: 13.8 s
Wall time: 15.5 s


In [None]:
title_preproc_lemm[0:10]

[['drug', 'reposition', 'bibliometric', 'analysis'],
 ['review',
  'computer-aided',
  'chemogenomic',
  'drug',
  'reposition',
  'rational',
  'covid-19',
  'drug',
  'discovery'],
 ['repurpose', 'molnupiravir', 'new', 'opportunity', 'treat', 'covid-19'],
 ['scope',
  'repurpose',
  'drug',
  'potential',
  'target',
  'late',
  'variant',
  'sars-cov-2'],
 ['drug',
  'repurpose',
  'gene',
  'co-expression',
  'module',
  'preservation',
  'analysis',
  'acute',
  'respiratory',
  'distress',
  'syndrome',
  'ard',
  'systemic',
  'inflammatory',
  'response',
  'syndrome',
  'sir',
  'sepsis',
  'covid-19'],
 ['novel',
  'drug',
  'design',
  'treatment',
  'covid-19',
  'systematic',
  'review',
  'preclinical',
  'study'],
 ['repurpose',
  'fda-approved',
  'drug',
  'cetilistat',
  'abiraterone',
  'diiodohydroxyquinoline',
  'bexarotene',
  'remdesivir',
  'potential',
  'inhibitor',
  'rna',
  'dependent',
  'rna',
  'polymerase',
  'sars-cov-2',
  'comparative',
  'in silico'

Here the model has reduced 'repositioning' to 'reposition' and 'repurposing' to 'repurpose'. This is because 'drug repositioning' and 'drug repurposing' are not being treated as noun chunks, and the code is word tokenising on unigrams without taking into account bigrams and trigrams.
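One possible workaround, shown here only as a sketch, is to join known multi-word terms into single tokens before lemmatisation. The phrase list and the `merge_phrases` helper are hypothetical, and how a downstream tokeniser and lemmatiser treat the underscore-joined token would need checking. spaCy also provides `Doc.noun_chunks` and a built-in `merge_noun_chunks` component, but both rely on the dependency parse, which is disabled in the model loaded above.

```python
# Hypothetical list of domain phrases to keep intact
PHRASES = ["drug repositioning", "drug repurposing"]

def merge_phrases(text, phrases=PHRASES):
    """Join each listed phrase with an underscore so it reads as one token."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

title = "drug repositioning bibliometric analysis ."
merged = merge_phrases(title)
print(merged)
```

A data-driven alternative would be to detect frequent bigrams and trigrams from the corpus rather than maintaining the list by hand.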

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc['title_preproc_lemm'] = title_preproc_lemm

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11687 entries, 0 to 11686
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   article_id          11687 non-null  object
 1   published           11687 non-null  object
 2   revised             11687 non-null  object
 3   title               11687 non-null  object
 4   title_cleaned       11687 non-null  object
 5   journal             11426 non-null  object
 6   authors             11687 non-null  object
 7   doi                 11450 non-null  object
 8   pdf_url             11687 non-null  object
 9   text                11687 non-null  object
 10  word_count          11687 non-null  int64 
 11  sent_count          11687 non-null  int64 
 12  title_preproc       11687 non-null  object
 13  title_preproc_lemm  11687 non-null  object
dtypes: int64(2), object(12)
memory usage: 1.2+ MB


In [None]:
with open('2024-03-08_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_preproc, f)

### 4.3 Clean article text

The full text for the first 20 articles will be cleaned by calling the same functions as before.



In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test = pmc_arxiv_full_text_merged_plus_cleaned_article_titles_updated.head(20).copy()

In [None]:
def preprocess_pipe(df):
    stop = preprocess_stop_words()
    add_special_cases()

    # Apply cleaning and tokenisation to the text column
    df['text_cleaned'] = df['text'].apply(lambda x: " ".join([token.text.lower() for token in nlp(clean_text(x))
                                                         if token.text.lower() not in stop]))
    # Remove spaces before full stops (literal match, so "." is not treated as a regex wildcard)
    df['text_cleaned'] = df['text_cleaned'].str.replace(" .", ".", regex=False)

    return df

One fix has been made to the `preprocess_pipe()` function: spaces before full stops are now removed, as these appeared in the earlier output for titles. We will validate on a test sentence with empty parentheses added at the end.
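When removing the space before a full stop, it matters whether the pattern is treated as a literal string or as a regular expression, because an unescaped `.` in a regex matches any character. The sketch below shows the difference with plain Python strings; the same distinction applies to pandas `Series.str.replace`, whose `regex` default has differed between pandas versions.

```python
import re

text = "new drug discovery . ted t. ashburn summarized previous research ."

# Literal replacement: only a space followed by a full stop is rewritten
literal = text.replace(" .", ".")

# Regex with an unescaped dot: " ." matches a space plus ANY character
regex_unescaped = re.sub(" .", ".", text)

# Regex with the dot escaped behaves like the literal version
regex_escaped = re.sub(r" \.", ".", text)

print(literal)
print(regex_unescaped)
print(regex_escaped)
```

Passing the match mode explicitly avoids depending on a library default.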

In [None]:
test = "Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery ()."

In [None]:
clean_text(test)

'Sir James Black a winner of the 1988 Nobel Prize clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery.'

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test = preprocess_pipe(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test)

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     20 non-null     object
 1   published      20 non-null     object
 2   revised        20 non-null     object
 3   title          20 non-null     object
 4   title_cleaned  20 non-null     object
 5   journal        20 non-null     object
 6   authors        20 non-null     object
 7   doi            20 non-null     object
 8   pdf_url        20 non-null     object
 9   text           20 non-null     object
 10  word_count     20 non-null     int64 
 11  sent_count     20 non-null     int64 
 12  text_cleaned   20 non-null     object
dtypes: int64(2), object(11)
memory usage: 2.2+ KB


In [None]:
# First article text before cleaning
print(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.text[0])

Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery (). In 2004, Ted T. Ashburn et al. () summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted (). The definition of the term “drug repurposing” has been endorsed by scholars () and used by them (; ). It should be pointed out that the synonyms of “drug repurposing” often used by academics also include drug repositioning (), drug rediscovery (), drug redirecting (), drug retasking (), and therapeutic switching (; ). After the research study by Ashburn et al., 

In [None]:
# First article text after cleaning
print(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.text_cleaned[0])

sir james black winner nobel prize clearly recognized 21st century drug repurposing strategies occupy important place future new drug discovery. ted t. ashburn summarized previous research developed general approach drug development drug repurposing retrospectively looking new indications approved drugs molecules waiting approval new pathways action targets. molecules usually safe clinical trials sufficient efficacy treatment disease originally targeted. definition term drug repurposing endorsed scholars. pointed synonyms drug repurposing academics include drug repositioning drug rediscovery drug redirecting drug retasking therapeutic switching. research study ashburn allarakhia expanded starting materials drug repositioning include products discontinued commercial reasons expired patents candidates laboratory testing. discovery process completely new drug difficulty usually lies safety efficacy main potential causes failure drugs approval clinical development stage. existing knowledge

Comparing the text column before and after preprocessing, we can see that the cleaning functions have removed stopwords, standalone numbers such as 1988, and punctuation other than full stops, which are kept so that sentence boundaries remain available for segmentation.

In [None]:
# Reposition text_cleaned column next to text column in DataFrame.
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.insert(10, 'text_cleaned', pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.pop('text_cleaned'))

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     20 non-null     object
 1   published      20 non-null     object
 2   revised        20 non-null     object
 3   title          20 non-null     object
 4   title_cleaned  20 non-null     object
 5   journal        20 non-null     object
 6   authors        20 non-null     object
 7   doi            20 non-null     object
 8   pdf_url        20 non-null     object
 9   text           20 non-null     object
 10  text_cleaned   20 non-null     object
 11  word_count     20 non-null     int64 
 12  sent_count     20 non-null     int64 
dtypes: int64(2), object(11)
memory usage: 2.2+ KB


In [None]:
with open('2024-03-02_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test, f)

### 4.4 Segmentation

We will perform segmentation to split the text into sentences, using spaCy's `senter` component (the SentenceRecognizer) after disabling unused components. This speeds up processing and avoids unwanted interactions with other components that set sentence boundaries, in particular the parser.



In [None]:
# Load model and disable unused pipeline components
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
# Enable senter pipeline component
nlp.enable_pipe("senter")

def sent_tok_text(text):
    # Create spaCy Doc object
    doc = nlp(text)
    # Return the sentences in the document
    return [sent.text for sent in doc.sents]

In [None]:
def apply_sent_tok_text(text):
    # Create a ProcessPoolExecutor
    with ProcessPoolExecutor() as executor:
        # Apply the sent_tok_text function to the text column
        sentences = list(executor.map(sent_tok_text, text))

    return sentences

In [None]:
%%time
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test['sent_tok'] = apply_sent_tok_text(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.text_cleaned)

CPU times: user 23.8 ms, sys: 97.8 ms, total: 122 ms
Wall time: 989 ms


### 4.5 Word tokenisation

Tokenisation is the process of splitting text into smaller syntactic units such as words, characters or sub-words. We will word tokenise and lowercase the text from the text_cleaned column before encoding the tokens into numeric representations that a machine learning model can process.
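The three granularities mentioned above can be illustrated with the standard library alone. The helpers below are hypothetical; the character n-grams with `<` and `>` boundary markers follow the style used by fastText, whereas the sub-word tokenisers used by Transformer models (such as BPE or WordPiece) learn their vocabulary from data and are not reproduced here.

```python
def word_tokens(text):
    """Whitespace word tokenisation of already-cleaned text."""
    return text.lower().split()

def char_tokens(word):
    """Character-level tokens."""
    return list(word)

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, in the style used by fastText."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

text = "Drug repurposing strategies"
print(word_tokens(text))
print(char_tokens("drug"))
print(char_ngrams("drug"))
```

Word-level tokens are what the notebook produces next; sub-word units become relevant when working with models that must handle rare or unseen biomedical terms.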

In [None]:
def word_tok_text(texts):
    word_tok = []

    for doc in nlp.pipe(texts, batch_size=20):
        # Tokenise each document into words
        tokenized_words = [token.text.lower() for token in doc]
        word_tok.append(tokenized_words)

    return word_tok

In [None]:
%%time
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test['word_tok'] = word_tok_text(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test['text_cleaned'])

CPU times: user 2.95 s, sys: 704 ms, total: 3.66 s
Wall time: 3.73 s


In [None]:
# Reposition word_tok and sent_tok columns in DataFrame.
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.insert(11, 'word_tok', pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.pop('word_tok'))
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.insert(12, 'sent_tok', pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.pop('sent_tok'))

In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.head(20)

Unnamed: 0,article_id,published,revised,title,title_cleaned,journal,authors,doi,pdf_url,text,text_cleaned,word_tok,sent_tok,word_count,sent_count
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr...",sir james black winner nobel prize clearly rec...,"[sir, james, black, winner, nobel, prize, clea...",[sir james black winner nobel prize clearly re...,7309,253
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...,tight selective interaction ligands target pro...,"[tight, selective, interaction, ligands, targe...",[tight selective interaction ligands target pr...,7582,251
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...,severe acute respiratory syndrome coronavirus-...,"[severe, acute, respiratory, syndrome, coronav...",[severe acute respiratory syndrome coronavirus...,3421,129
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...,sudden outbreak sars-cov-2 took world storm de...,"[sudden, outbreak, sars-cov-2, took, world, st...",[sudden outbreak sars-cov-2 took world storm d...,8465,383
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co...",novel coronavirus dubbed sars-cov-2 led global...,"[novel, coronavirus, dubbed, sars-cov-2, led, ...",[novel coronavirus dubbed sars-cov-2 led globa...,5499,226
5,PMC9527439,2022-09-25,2022-10-07,Novel Drug Design for Treatment of COVID-19: A...,Novel Drug Design for Treatment of COVID-19: A...,The Canadian journal of infectious diseases & ...,"Mousavi S, Zare S, Mirzaei M, Feizi A.",10.1155/2022/2044282,https://europepmc.org/articles/PMC9527439?pdf=...,"Coronavirus disease 2019 (COVID-19), which was...",coronavirus disease covid-19 identified decemb...,"[coronavirus, disease, covid-19, identified, d...",[coronavirus disease covid-19 identified decem...,4041,158
6,PMC9729590,2022-12-08,2023-01-03,"Repurposing FDA-approved drugs cetilistat, abi...","Repurposing FDA-approved drugs cetilistat, abi...",Informatics in medicine unlocked,"Shahabadi N, Zendehcheshm S, Mahdavi M, Khadem...",10.1016/j.imu.2022.101147,https://europepmc.org/articles/PMC9729590?pdf=...,COVID-19 is an infectious disease caused by Co...,covid-19 infectious disease caused coronavirus...,"[covid-19, infectious, disease, caused, corona...",[covid-19 infectious disease caused coronaviru...,2685,145
7,PMC9236981,2022-06-28,2022-12-21,A comprehensive review of artificial intellige...,A comprehensive review of artificial intellige...,Biomedicine & pharmacotherapy = Biomedecine & ...,"Ahmed F, Soomro AM, Chethikkattuveli Salih AR,...",10.1016/j.biopha.2022.113350,https://europepmc.org/articles/PMC9236981?pdf=...,A novel coronavirus (CoV) first appeared by th...,novel coronavirus cov appeared end wuhan china...,"[novel, coronavirus, cov, appeared, end, wuhan...",[novel coronavirus cov appeared end wuhan chin...,11172,458
8,PMC9694939,2022-11-10,2022-12-13,Structural Homology-Based Drug Repurposing App...,Structural Homology-Based Drug Repurposing App...,"Molecules (Basel, Switzerland)","Aljuaid A, Salam A, Almehmadi M, Baammi S, Als...",10.3390/molecules27227732,https://europepmc.org/articles/PMC9694939?pdf=...,Drug discovery is a time-consuming and costly ...,drug discovery time-consuming costly process i...,"[drug, discovery, time-consuming, costly, proc...",[drug discovery time-consuming costly process ...,4482,183
9,PMC9556799,2022-10-13,2022-11-01,Rational drug repositioning for coronavirus-as...,Rational drug repositioning for coronavirus-as...,iScience,"Wang J, Liu J, Luo M, Cui H, Zhang W, Zhao K, ...",10.1016/j.isci.2022.105348,https://europepmc.org/articles/PMC9556799?pdf=...,"Coronavirus disease 2019 (COVID-19), caused by...",coronavirus disease covid-19 caused severe acu...,"[coronavirus, disease, covid-19, caused, sever...",[coronavirus disease covid-19 caused severe ac...,5345,253


In [None]:
pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     20 non-null     object
 1   published      20 non-null     object
 2   revised        20 non-null     object
 3   title          20 non-null     object
 4   title_cleaned  20 non-null     object
 5   journal        20 non-null     object
 6   authors        20 non-null     object
 7   doi            20 non-null     object
 8   pdf_url        20 non-null     object
 9   text           20 non-null     object
 10  text_cleaned   20 non-null     object
 11  word_tok       20 non-null     object
 12  sent_tok       20 non-null     object
 13  word_count     20 non-null     int64 
 14  sent_count     20 non-null     int64 
dtypes: int64(2), object(13)
memory usage: 2.5+ KB


In [None]:
with open('2024-03-02_pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test_tokenized.pickle', 'wb') as f:
  pickle.dump(pmc_arxiv_full_text_merged_plus_cleaned_article_titles_test, f)

### References

* https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad

* https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/#Sentence_Tokenization