# Text Processing Pipeline

## Introduction

After extracting the text data from the file, the next step is to process it by chunking the text into manageable pieces, generating embeddings, and enriching the chunks with keywords. This notebook walks you through each step and the reasoning behind it. We conclude with a discussion of benefits, limitations, and next steps.

## Setup

Before running this notebook, make sure the local model server is running, since we will use it to generate embeddings.

In [10]:
from time import time
from pathlib import Path
import os
import sys
import spacy
import pytextrank

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Ensure the src directory is in the Python path
notebook_dir = Path(os.getcwd())
project_root = notebook_dir.parents[0]
sys.path.append(str(project_root))

from src.backend.textExtraction import Article

start = time()
article = Article("../corpus/SlamonetalSCIENCE1987.pdf")
end = time()

print(f"Processing time: {end - start:.2f} seconds\n")
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")

print(f"Body:\n\n{article.body[:200]}...")

Processing time: 0.03 seconds

Title: Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2lneu Oncogene
Authors: DENNIS J. SLAMON,* GARY M. CLARK, STEVEN G. WONG, WENDY J. LEVIN, AxEL ULLRICH, WILLiAM L. McGuIRE
Body:

The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt ca...


## Chunk Article Body

### Why chunk the text?

#### 1. Context limit

LLMs have a context limit, i.e., a maximum number of tokens they can process during inference. While recent advances have brought about commercial and open-source models with context limits in the tens or hundreds of thousands of tokens, many smaller models may not be able to fit the entire article in context. Breaking the article into chunks allows us to use smaller models, which only need to read the relevant portions of the document at inference time.

To serve the model the right portion of the document for a given query, we will implement a retrieval pipeline that uses an embedding model. Like LLMs, embedding models have their own context windows, so each chunk must also fit within the embedding model's limit.

#### 2. Latency

When generating a response, the LLM will need to process each input token before generating its first output token. Even if a model could fit the entire article in context, it would take much longer to respond if we asked it to read the entire article instead of just a relevant snippet.


### Chunking Strategy

We will be using a fairly standard Recursive Character Text Splitting algorithm, outlined below.

The goal is to split text into chunks that:
- Stay under a maximum chunk size (e.g., 1000 characters)
- Share a target amount of overlap with adjacent chunks (e.g., 200 characters)
- Preserve natural language structure, like paragraphs or sentences, wherever possible

You initialize the text splitter with:
- `separators`: List of separator strings to try splitting on, in descending priority.
- `chunk_size`: Maximum size of a chunk, in characters.
- `chunk_overlap`: Target number of characters shared between adjacent chunks.

Algorithm:
1. Split the text using the first separator.
2. Check the size of each split. For each candidate chunk:
    - If it's smaller than or equal to `chunk_size`, accept it.
    - If it's larger than `chunk_size`, recursively split it using the next separator in the list.
3. Repeat Step 2 recursively until all candidate chunks are of an acceptable size. This usually leads to over-splitting.
4. Iteratively merge candidate chunks to get as close to the `chunk_size` as possible (without going over), carrying up to `chunk_overlap` characters from the end of one chunk into the start of the next.
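
To make the steps above concrete, here is a minimal sketch of the split-then-merge idea in plain Python. It is a simplified illustration and not the library's actual implementation: the function names `recursive_split` and `merge_pieces` are hypothetical, separators are always kept at the end of each piece, and the overlap is treated as an upper bound on how much text is carried over.

```python
def recursive_split(text, separators, chunk_size):
    """Steps 1-3: split until every piece is at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    parts = text.split(sep)
    pieces = []
    for i, part in enumerate(parts):
        # Keep the separator at the end of every piece except the last.
        piece = part + sep if i < len(parts) - 1 else part
        if not piece:
            continue
        # Pieces that are still too long are split with the next separator.
        pieces.extend(recursive_split(piece, remaining, chunk_size))
    return pieces


def merge_pieces(pieces, chunk_size, chunk_overlap):
    """Step 4: greedily merge small pieces into chunks with some overlap."""
    chunks, current, length = [], [], 0
    for piece in pieces:
        if current and length + len(piece) > chunk_size:
            chunks.append("".join(current))
            # Drop pieces from the front until the carried-over tail is
            # within the overlap budget and leaves room for the new piece.
            while current and (
                length > chunk_overlap or length + len(piece) > chunk_size
            ):
                length -= len(current.pop(0))
        current.append(piece)
        length += len(piece)
    if current:
        chunks.append("".join(current))
    return chunks


# Toy input, chosen only for illustration.
demo_pieces = recursive_split("aaa bbb ccc ddd", [" ", ""], chunk_size=8)
demo_chunks = merge_pieces(demo_pieces, chunk_size=8, chunk_overlap=4)
print(demo_chunks)
```

In this toy run each chunk repeats the last word of the previous one, which is the small-scale analogue of sharing a sentence or two between chunks.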

#### How to set parameters?

**Chunk Size**

Downstream, I want to generate embeddings for each chunk using the [MedCPT-Article-Encoder](https://huggingface.co/ncbi/MedCPT-Article-Encoder), an embedding model that was trained on 255M query-article pairs from PubMed search logs, with articles represented by their titles and abstracts (more on this below). Because the model was trained on abstract-length text, I want each chunk to be roughly the length of an abstract, which is usually 150-250 words. At an average of 5 characters per word, this gives roughly 750-1250 characters. This is broadly consistent with [this article](https://urds.uoregon.edu/symposium/abstracts) from the University of Oregon, which gives a typical upper bound of 1500 characters for abstracts.

Hence, I will use a maximum chunk size of 1500 characters.
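
The arithmetic behind this choice can be checked in a few lines. The sketch below uses the word counts and the 5-characters-per-word average from the text. The optional extra character per word for spaces is my own assumption, added to show why the estimate lands near the 1500-character bound. The helper name `abstract_char_range` is hypothetical.

```python
def abstract_char_range(words_low, words_high, chars_per_word, count_spaces=False):
    """Estimate the character length range of an abstract from its word count."""
    # Assumption: one space per word when spaces are counted.
    per_word = chars_per_word + (1 if count_spaces else 0)
    return words_low * per_word, words_high * per_word


print(abstract_char_range(150, 250, 5))                     # letters only
print(abstract_char_range(150, 250, 5, count_spaces=True))  # letters plus spaces
```

Counting letters only gives 750-1250 characters. Counting one space per word gives 900-1500, which puts the upper end at the chosen `chunk_size`.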

**Separators**

I started from the library's standard list of separators and added `". "` to encourage splits at the ends of sentences (the list used below also omits the default paragraph separator `"\n\n"`, so the first split is on single newlines). The goal is to make each chunk interpretable on its own, since the LLM won't be able to look back or ahead for more context.

**Chunk Overlap**

I set this to 150 characters, which is roughly a sentence or two of overlap. Again, the goal is to make each chunk interpretable on its own.

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=["\n", ". ", " ", ""],
    keep_separator="end",
)
chunks = text_splitter.split_text(article.body)

print(f"\nNumber of chunks: {len(chunks)}")
print(f"Average chunk length: {sum(len(chunk) for chunk in chunks) / len(chunks):.0f} characters\n")

for chunk in chunks[:5]:
    print(f"Chunk length: {len(chunk)} characters")
    print("Content:")
    print(chunk + "\n")


Number of chunks: 14
Average chunk length: 1260 characters

Chunk length: 1018 characters
Content:
The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/neu gene was a significant pre- dictor of both overall survival and time to relapse in patients with breast cancer. It retained its significance even when adjustments were made for other known prognostic factors. Moreover, HER-2/neu amplification had greater prognostic value than most currently used prognostic factors, incuding hormonal-receptor status, in lymph node-positive disease

Success!

- The average chunk size is right around the upper bound of what is typical for abstracts. 
- From inspecting the first few chunks, the boundary between successive chunks either falls at a paragraph break in the article or the two chunks share one or more sentences.
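
The overlap observation can also be checked programmatically instead of by eye. The helper below is a hypothetical sketch: it measures the longest suffix of one chunk that is also a prefix of the next. It assumes the overlap is an exact character match, which may not hold if whitespace was stripped differently at the two boundaries.

```python
def shared_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0


# Toy chunks, for illustration only.
toy_chunks = ["First sentence. Second sentence.", "Second sentence. Third sentence."]
overlaps = [shared_overlap(a, b) for a, b in zip(toy_chunks, toy_chunks[1:])]
print(overlaps)
```

Running the same list comprehension over `chunks` from the cell above should show a positive overlap wherever two chunks share text, and zero where the boundary is a clean break.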

## Keyword Extraction

The next step is to extract keywords or phrases from each chunk. We will use these keywords/phrases to perform a hybrid search when a query comes in.

We compute the keywords using the TextRank algorithm. At a high level, this algorithm builds a graph in which "words" (after some pre-processing) are nodes and edges connect words that co-occur. It then applies PageRank to rank the nodes, identifying the most important words. These words can be aggregated into phrases or sentences in a post-processing step. Ranking multi-word phrases out of the box is one of the main benefits of this approach: some common alternatives, such as KeyBERT with its default settings, return only single words unless configured otherwise. For example, in our use case we would want to identify "breast cancer" as a keyword, not just "breast" or "cancer".
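
To illustrate the idea, here is a toy version of TextRank in plain Python. It is a sketch under simplifying assumptions and not the PyTextRank implementation used in this notebook: the input is assumed to be already tokenized and filtered, co-occurrence means "within `window` tokens", and phrases are formed by joining top-ranked words that sit next to each other. The names `textrank_words`, `merge_adjacent`, and `toy_tokens` are hypothetical.

```python
from collections import defaultdict


def textrank_words(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by running PageRank on an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keyword tokens into phrases (deduplicated)."""
    phrases, run = [], []
    for token in tokens + [None]:  # None is a sentinel that flushes the last run
        if token in keywords:
            run.append(token)
        else:
            if run:
                phrases.append(" ".join(run))
            run = []
    return sorted(set(phrases))


# Toy, pre-filtered token sequence, for illustration only.
toy_tokens = [
    "breast", "cancer", "gene", "amplification", "breast", "cancer",
    "survival", "breast", "cancer", "relapse",
]
ranked = textrank_words(toy_tokens)
top_words = {word for word, _ in ranked[:2]}
print(merge_adjacent(toy_tokens, top_words))
```

Because "breast" and "cancer" co-occur with more distinct words than anything else in the toy sequence, they rank highest, and the merge step recovers the phrase "breast cancer" instead of two separate words.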

In this notebook we will use spaCy's `en_core_web_sm` model for the pre-processing (tokenization, lemmatization) and the PyTextRank library's implementation of the TextRank algorithm.

Let's test it below on the first chunk, which happens to be exactly the abstract of the article.


In [16]:
chunk = chunks[0]

# load a spaCy model, used for tokenization and lemmatization
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(chunk)

print("\nTop 10 phrases in the chunk:\n")
for phrase in doc._.phrases[:10]:
    print(f"{phrase.text} (rank: {phrase.rank:.4f}, count: {phrase.count})")

print("\nOriginal text:")
print(chunk)


Top 10 phrases in the chunk:

other known prognostic factors (rank: 0.1058, count: 1)
breast cancer (rank: 0.1049, count: 1)
thbis gene (rank: 0.1003, count: 1)
HER-2/ neu (rank: 0.0973, count: 1)
greater prognostic value (rank: 0.0911, count: 1)
several disease parameters (rank: 0.0886, count: 1)
distinct firom (rank: 0.0729, count: 1)
189 primary human breast cancers (rank: 0.0713, count: 1)
HER-2 (rank: 0.0704, count: 3)
Am- plification (rank: 0.0698, count: 1)

Original text:
The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/