# Text Processing Pipeline

## Introduction

After extracting the text data from the file, the next step is to process it by chunking the text into manageable pieces, generating embeddings, and enriching the chunks with keywords. This notebook walks you through each step and the reasoning behind it. We conclude with a discussion of benefits, limitations, and next steps.

## Setup

Before running this notebook, ensure that the local model server is running, as we will use it to generate embeddings.

In [1]:
from time import time
from pathlib import Path
from typing import List
import os
import sys

import spacy
import pytextrank
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Ensure the src directory is in the Python path
notebook_dir = Path(os.getcwd())
project_root = notebook_dir.parents[0]
sys.path.append(str(project_root))

from src.backend.textExtraction import Article

start = time()
article = Article("../corpus/SlamonetalSCIENCE1987.pdf")
end = time()

print(f"Processing time: {end - start:.2f} seconds\n")
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")

print(f"Body:\n\n{article.body[:200]}...")

Processing time: 0.03 seconds

Title: Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2lneu Oncogene
Authors: DENNIS J. SLAMON,* GARY M. CLARK, STEVEN G. WONG, WENDY J. LEVIN, AxEL ULLRICH, WILLiAM L. McGuIRE
Body:

The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt ca...


## Chunk Article Body

### Why chunk the text?

#### 1. Context limit

LLMs have a context limit, i.e., a maximum number of tokens they can process during inference. While recent advances have brought about commercial and open-source models with context limits in the tens or hundreds of thousands of tokens, many smaller models may not be able to fit the entire article in context. Breaking the article into chunks allows us to use smaller models, which only need to read the relevant portions of the document at inference time.

To serve the model the right portion of the document for a given query, we will implement a retrieval pipeline that uses an embedding model. Like LLMs, embedding models have their own context windows.

#### 2. Latency

When generating a response, the LLM will need to process each input token before generating its first output token. Even if a model could fit the entire article in context, it would take much longer to respond if we asked it to read the entire article instead of just a relevant snippet.


### Chunking Strategy

We will be using a fairly standard Recursive Character Text Splitting algorithm, outlined below.

The goal is to split text into chunks with:
- A maximum chunk size (e.g., 1000 characters)
- A target amount of overlap between adjacent chunks (e.g., 200 characters)
- Preserving natural language structure, like paragraphs or sentences, wherever possible

You initialize the text splitter with:
- `separators`: List of strings to try splitting on, in descending order of priority.
- `chunk_size`: Maximum size of a chunk, in characters.
- `chunk_overlap`: Maximum number of characters shared between adjacent chunks.

Algorithm:
1. Split the text using the first separator.
2. Check the size of each split. For each candidate chunk:
    - If it's smaller than or equal to chunk_size, accept it.
    - If it's larger than chunk_size, recursively split it using the next separator in the list.
3. Repeat Step 2 recursively until all candidate chunks are of an acceptable size. This usually leads to over-splitting.
4. Iteratively merge candidate chunks to get as close to the `chunk_size` as possible (without going over), carrying up to `chunk_overlap` characters from the end of one chunk into the start of the next.
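
To make the four steps concrete, here is a minimal pure-Python sketch of the split-then-merge idea. It is a simplified illustration and not the library's actual implementation: the function names `recursive_split` and `merge_pieces` are hypothetical, and the toy text and parameter values are chosen only to keep the example small.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Steps 1-3: split until every piece is at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep the separator attached to the end of every part except the last.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]

    pieces: List[str] = []
    for part in parts:
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


def merge_pieces(pieces: List[str], chunk_size: int, chunk_overlap: int) -> List[str]:
    """Step 4: greedily merge pieces, carrying a tail of at most `chunk_overlap`."""
    chunks: List[str] = []
    window: List[str] = []
    length = 0
    for piece in pieces:
        if window and length + len(piece) > chunk_size:
            chunks.append("".join(window))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the incoming piece.
            while window and (length > chunk_overlap or length + len(piece) > chunk_size):
                length -= len(window.pop(0))
        window.append(piece)
        length += len(piece)
    if window:
        chunks.append("".join(window))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
pieces = recursive_split(text, ["\n", ". ", " ", ""], chunk_size=35)
chunks = merge_pieces(pieces, chunk_size=35, chunk_overlap=16)
for chunk in chunks:
    print(repr(chunk))
```

In this toy run the sentence "Four five six. " ends the first chunk and starts the second, which is the overlap behaviour described in step 4. The last chunk carries no overlap because the preceding sentence is longer than the overlap budget.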

#### How to set parameters?

**Chunk Size**

Downstream, I want to generate embeddings for each chunk using the [MedCPT-Article-Encoder](https://huggingface.co/ncbi/MedCPT-Article-Encoder), an embedding model that was trained on abstracts from PubMed (more on this below). Because this model was trained on abstracts, I want each chunk to be roughly the length of an abstract, which is usually 150-250 words. Assuming an average of 5 characters per word, this gives us 750-1250 characters. This roughly agrees with [this article](https://urds.uoregon.edu/symposium/abstracts) I found from the University of Oregon, which gives a typical upper bound of 1500 characters for abstracts.

Hence, I will use a maximum chunk size of 1500 characters.
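
The estimate above is simple enough to check directly. The snippet below reproduces it. It also shows what happens if each word is counted with one trailing space, which is my own assumption about how the 5-characters-per-word figure relates to raw character counts.

```python
AVG_CHARS_PER_WORD = 5        # figure from the text; assumed here to exclude spaces
ABSTRACT_WORDS = (150, 250)   # typical abstract length in words

low, high = (n * AVG_CHARS_PER_WORD for n in ABSTRACT_WORDS)
# Assumption: one space per word, i.e. one extra character per word.
low_spaced, high_spaced = (n * (AVG_CHARS_PER_WORD + 1) for n in ABSTRACT_WORDS)

print(f"Without spaces: {low}-{high} characters")
print(f"With spaces:    {low_spaced}-{high_spaced} characters")
```

Under that assumption the upper end of the range lands on 1500 characters, which matches the chosen maximum chunk size.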

**Separators**

I used the standard list of separators for this library, except I added `". "` to try to force more splits at the end of sentences. The goal is to make sure each chunk is interpretable on its own, since the LLM won't be able to look back or ahead to get more context.

**Chunk Overlap**

I set this to 150 characters, which roughly equals a sentence or two of overlap. Again, the goal is to make each chunk interpretable on its own.

In [2]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=["\n", ". ", " ", ""],
    keep_separator="end",
)
chunks = text_splitter.split_text(article.body)

print(f"\nNumber of chunks: {len(chunks)}")
print(
    f"Average chunk length: {sum(len(chunk) for chunk in chunks) / len(chunks):.0f} characters\n"
)

for chunk in chunks[:5]:
    print(f"Chunk length: {len(chunk)} characters")
    print("Content:")
    print(chunk + "\n")


Number of chunks: 14
Average chunk length: 1260 characters

Chunk length: 1018 characters
Content:
The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/neu gene was a significant pre- dictor of both overall survival and time to relapse in patients with breast cancer. It retained its significance even when adjustments were made for other known prognostic factors. Moreover, HER-2/neu amplification had greater prognostic value than most currently used prognostic factors, incuding hormonal-receptor status, in lymph node-positive disease

Success!

- The average chunk size is right around the upper bound of what is typical for abstracts. 
- From inspecting the first few chunks, successive chunks typically share one or more sentences or occur at a paragraph break in the article.
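
Rather than inspecting chunks by eye, the overlap between neighbours can be measured. The sketch below is one way to do that, assuming overlap shows up as a suffix of one chunk repeated as the prefix of the next. The helper names `shared_overlap` and `overlap_lengths` and the toy chunks are hypothetical.

```python
from typing import List


def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


def overlap_lengths(chunks: List[str]) -> List[int]:
    """Overlap length, in characters, for each pair of adjacent chunks."""
    return [len(shared_overlap(a, b)) for a, b in zip(chunks, chunks[1:])]


toy_chunks = [
    "Alpha beta. Gamma delta. ",
    "Gamma delta. Epsilon zeta. ",
    "Eta theta.",
]
print(overlap_lengths(toy_chunks))  # [13, 0]
```

Running `overlap_lengths(chunks)` on the real chunks would show which boundaries share text and which fall on a clean break.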

## Keyword Extraction

The next step is to extract keywords or phrases from each chunk. We will use these keywords/phrases to perform a hybrid search when a query comes in.

We compute the keywords using the TextRank algorithm. At a high level, this algorithm works by creating a graph where "words" (after some pre-processing) are nodes and their connections are based on co-occurrence. The algorithm then applies PageRank to rank the nodes, identifying the most important words. These words can be aggregated into phrases or sentences in a post-processing step. This is one of the main benefits of this algorithm over approaches that only return single words. For example, in our use case we would want to identify "breast cancer" as a keyword, not just "breast" or "cancer".
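
To illustrate the graph-and-rank idea, here is a stripped-down sketch in pure Python. It is not how `PyTextRank` is implemented: it skips lemmatization, part-of-speech filtering, and phrase aggregation, and the stopword list, window size, and function name `textrank_words` are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "of", "a", "an", "in", "is", "and", "to", "with", "was", "were", "for"}


def textrank_words(
    text: str, window: int = 3, damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Score words by running PageRank over a word co-occurrence graph."""
    tokens = [t for t in re.findall(r"[a-z0-9-]+", text.lower()) if t not in STOPWORDS]

    # Undirected graph: link each token to the tokens that follow it within the window.
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != token:
                neighbours[token].add(other)
                neighbours[other].add(token)

    # PageRank by power iteration, starting from a uniform distribution.
    n_nodes = len(neighbours)
    scores = {node: 1.0 / n_nodes for node in neighbours}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / n_nodes
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[node])
            for node in neighbours
        }
    return scores


sample = (
    "Breast cancer survival correlates with gene amplification. "
    "Gene amplification predicts relapse in breast cancer. "
    "Breast cancer patients were studied."
)
scores = textrank_words(sample)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{word}: {score:.3f}")
```

Words that co-occur with many different neighbours accumulate a higher score. A full implementation would then merge adjacent high-scoring words into phrases such as "breast cancer".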

In this implementation we will use spaCy's `en_core_web_sm` model and the `PyTextRank` library's implementation of the TextRank algorithm.

Let's test it below on the first chunk, which happens to be exactly the abstract of the article.


In [3]:
chunk = chunks[0]

# load spaCy model
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(chunk)

print("\nTop 10 phrases in the chunk:\n")
for phrase in doc._.phrases[:10]:
    print(f"{phrase.text} (rank: {phrase.rank:.4f}, count: {phrase.count})")

print("\nOriginal text:")
print(chunk)


Top 10 phrases in the chunk:

other known prognostic factors (rank: 0.1058, count: 1)
breast cancer (rank: 0.1049, count: 1)
thbis gene (rank: 0.1003, count: 1)
HER-2/ neu (rank: 0.0973, count: 1)
greater prognostic value (rank: 0.0911, count: 1)
several disease parameters (rank: 0.0886, count: 1)
distinct firom (rank: 0.0729, count: 1)
189 primary human breast cancers (rank: 0.0713, count: 1)
HER-2 (rank: 0.0704, count: 3)
Am- plification (rank: 0.0698, count: 1)

Original text:
The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/

## Generate Embeddings

Downstream, when a user asks a question, I plan to perform a hybrid search using keyword matching (above) and vector similarity. To generate embeddings for user questions and source chunks, I plan to use the `MedCPT` family of models.

`MedCPT` consists of two embedding models, [MedCPT-Article-Encoder](https://huggingface.co/ncbi/MedCPT-Article-Encoder) and [MedCPT-Query-Encoder](https://huggingface.co/ncbi/MedCPT-Query-Encoder), as well as [MedCPT-Cross-Encoder](https://huggingface.co/ncbi/MedCPT-Cross-Encoder), a cross-encoder model for reranking the results of a vector similarity nearest neighbor search. These models were trained on PubMed article abstracts using 255M+ query-article pairs drawn from real user data.

The benefits of MedCPT are:
- **Trained on medical language**: the given article contains lots of medical jargon that general-purpose embedding models may not have encountered as much during training. As a result, general-purpose models may struggle when used for retrieval on our data.
- **Includes a cross-encoder for reranking**: Many [studies](https://ieeexplore.ieee.org/abstract/document/10711642) and [guides](https://weaviate.io/blog/cross-encoders-as-reranker?utm_source=chatgpt.com) suggest that reranking with cross-encoders improves search result accuracy for RAG systems. I have also seen this improvement firsthand in my prior work.
- **Lightweight**: the models have only 109M parameters, making them much lighter than state-of-the-art general-purpose embedding models. For reference, 9 of the Top 10 embedding models on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) are confirmed or suspected to have over *7 billion parameters* (with some greatly exceeding that number). This is particularly important for this prototype as we have limited GPU RAM, most of which will be consumed by the text generation model.

Read more about MedCPT in their paper [here](https://arxiv.org/abs/2307.00589).
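
Before loading the real models, it may help to see the retrieve-then-rerank pattern in isolation. The sketch below uses hand-written two-dimensional vectors in place of real embeddings and a term-counting function in place of a cross-encoder. Every name and value in it is a stand-in chosen for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    chunks: List[str],
    rerank_score: Callable[[str], float],
    k_retrieve: int,
    k_final: int,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap vector similarity over every chunk, keep the top k_retrieve.
    by_similarity = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:k_retrieve]
    # Stage 2: a more expensive scorer, applied only to the shortlist.
    reranked = sorted(
        ((chunks[i], rerank_score(chunks[i])) for i in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:k_final]


toy_chunks = [
    "EGFR is on chromosome 7",
    "HER-2/neu differs from EGFR",
    "Tumor samples were frozen",
]
toy_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # stand-in embeddings
toy_query_vec = [1.0, 0.1]


def toy_reranker(chunk: str) -> float:
    # Stand-in for a cross-encoder: count how many query terms appear in the chunk.
    return float(sum(term in chunk for term in ("HER-2/neu", "EGFR")))


results = retrieve_then_rerank(
    toy_query_vec, toy_vecs, toy_chunks, toy_reranker, k_retrieve=2, k_final=1
)
print(results)
```

The design point is that the second-stage scorer sees the query and the chunk together, so it can reorder the shortlist. In the toy data it promotes the chunk that mentions both terms even though its vector similarity was lower.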

In [4]:
# Load Models

article_model = AutoModel.from_pretrained(
    "../models/MedCPT-Article-Encoder", device_map="cuda"
)
article_tokenizer = AutoTokenizer.from_pretrained("../models/MedCPT-Article-Encoder")

query_model = AutoModel.from_pretrained(
    "../models/MedCPT-Query-Encoder", device_map="cuda"
)
query_tokenizer = AutoTokenizer.from_pretrained("../models/MedCPT-Query-Encoder")


cross_encoder_model = AutoModelForSequenceClassification.from_pretrained(
    "../models/MedCPT-Cross-Encoder", device_map="cuda"
)
cross_encoder_tokenizer = AutoTokenizer.from_pretrained("../models/MedCPT-Cross-Encoder")

In [5]:
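def encode_texts(
    texts: List[str], titles: List[str], model: AutoModel, tokenizer: AutoTokenizer
) -> torch.Tensor:
    """Embed each (text, title) pair; returns a tensor of shape (len(texts), hidden_size)."""

    # Tokenize each (text, title) pair as a single two-segment input
    inputs = tokenizer(
        [[text, title] for text, title in zip(texts, titles)],
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=512,
    ).to("cuda")

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the last hidden state over all token positions (padding included).
    # Note: the MedCPT model card example takes the [CLS] vector instead,
    # i.e. outputs.last_hidden_state[:, 0, :].
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings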

In [6]:
# Encode the chunks

titles = [article.title] * len(chunks)

chunk_embeddings = encode_texts(chunks, titles, article_model, article_tokenizer)

print(f"Output shape: {chunk_embeddings.shape}")
print(f"Sample embedding: {chunk_embeddings[0][:5]} ...")  # Print first 5 dimensions of the first embedding

Output shape: torch.Size([14, 768])
Sample embedding: tensor([-0.0844, -0.1338, -0.1089, -0.3224, -0.1751], device='cuda:0') ...


In [7]:
# Encode a query

query = "How is HER-2/neu related to EGFR?"
query_embedding = encode_texts([query], [article.title], query_model, query_tokenizer)

print(f"Output shape: {query_embedding.shape}")

Output shape: torch.Size([1, 768])


In [8]:
# Compute similarity

from torch.nn.functional import cosine_similarity

similarities = cosine_similarity(
    query_embedding, chunk_embeddings, dim=1
)
print(f"Similarities: {similarities}")


Similarities: tensor([0.7249, 0.6910, 0.6763, 0.7295, 0.7566, 0.7152, 0.7299, 0.7500, 0.7545,
        0.7392, 0.7609, 0.7401, 0.7498, 0.7657], device='cuda:0')


In [9]:
# Get the 10 most similar chunks
top_indices = similarities.topk(10).indices
top_chunks = [chunks[i] for i in top_indices]

In [10]:
# Rerank using cross-encoder
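def rerank_chunks(
    query: str,
    chunks: List[str],
    model: AutoModelForSequenceClassification,
    tokenizer: AutoTokenizer,
) -> torch.Tensor:
    """Return one relevance logit per chunk (higher means more relevant to the query)."""

    pairs = [[query, chunk] for chunk in chunks]

    encoded = tokenizer(
        pairs,
        truncation=True,
        padding=True,
        return_tensors="pt",
        max_length=512,
    ).to("cuda")

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(**encoded).logits.squeeze(dim=1)

    return logits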

In [11]:
scores = rerank_chunks(query, top_chunks, cross_encoder_model, cross_encoder_tokenizer)

# Get highest score indices (higher means more similar)
top_indices = scores.topk(5).indices

# Print the top 5 chunks with their scores
for i in top_indices:
    print(f"Score: {scores[i]:.4f}")
    print(f"Chunk: {top_chunks[i]}\n")

Score: 13.3224
Chunk: The incidence of EGFR amplification and rearrangement was too small to attempt statistical correlation. Comparison of HER-2/ neu amplification (53/189 or 28%) with that of the EGFR gene reveals the incidence of the former to be 14 times greater than that of the latter, indicating that the phenomenon of gene amplification is not a general one for a related tyrosine kinase-specific receptor in human breast cancer. Moreover, studies examining alterations of two other tyrosine kinase-specific proto-oncogenes, abi andfes, in breast cancer did not show amplification of these genes (33). Alterations of non-tyrosine kinase-related proto-oncogenes in thesereceptors in the biology of breast cancer is well established (29, 39, 40). It is easy to speculate that a gene encoding a putative growth factor receptor, when expressed in inappropriate amounts, may give a growth advantage to the cells expressing it. Alternatively, alter- ation in the gene product itself may lead to a c

Success!

Note that the model did a good job of filtering the article down to only 2-4 important chunks (depending on where one draws the similarity cutoff). After inspecting the second chunk more closely, we see that it contains our answer (emphasis mine).

**Query**: "How is HER-2/neu related to EGFR?"

**Chunk**:

Subsequent sequence analysis and chromo- somal mapping studies revealed all three genes (neu, c-erbB-2, and HER-2) to be the same (22, 24, 25). A fourth group, also using v- erbB as a probe, identified the same gene in a mammary carcinoma cell line, MAC 117, where it was found to be amplified five- to ten- fold (26). This gene, which we will call HER-2/neu, encodes a new member of the tyrosine kinase family; and is dosely related to, but distinct from, the EGFR gene (22). ***HER-2/neu differs from EGFR in that itis found on band q21 of chromosome 17 (22, 24, 25), as compared to band pll.-p13 of chromosome 7, where the EGFR gene is located (27). Also, the HER-2/neu gene generates a messenger RNA (mRNA) of 4.8 kb (22), which differs from the 5.8- and 10- kb transcipts for the EGFR gene (28). Finally, the protein encoded by the HER-2/neu gene is 185,000 daltons (21), as compared to the 170,000-dalton protein encoded by the EGFR gene.*** Conversely, on the basis of sequence data, HER-2/neu is more closely related to the EGFR gene than to other members of the tyrosine kinase family (22). Like the EGFR protem, HER-2/neu has an extracellular domain, a transmembrane domain that includes two cysteine-rich repeat dusters, and an intracellular kinase domain (21), indicatingthat it too is likely to be a cellular receptor for an as yet unidentified ligand.
