# Semantic Retrieval from Text Data

## Fundamental Aspects
#### How does a computer understand semantic information from written language?

I want to essentially create a short/basic non-scientific text sample to initially make sure that the computer is properly breaking down and processing each element of the text the way that I need it to

**SAMPLE TEXT DATA:**
create a short paragraph (6-8 sentences):

- all the sentences belong in the same larger story; however, they do not all share the same subtopic (AKA the keyword)
- have a key word in mind
- sentence which directly contain the key word---> 2 sentences 
    - The rabbit looked at the goose.
    - It lived near the rabbit.
- sentence which is completely unrelated and does NOT contain the key word---> 2 sentences
    - There was a goose in the garden
    - A snake was slithering in the sun slowly.
- sentence which is somewhat semantically related but does NOT contain the key word---> 2 sentences
    - It then hopped over the fence at 1:15 PM.
    - The former was 1/5 the height of the latter.
- sentence which incorporates a synonym of the key word but not explicitly the keyword itself
    - A hare was relaxing in the shade.
- sentence containing numerical characters
    - time: "at 1:15 PM"
    - "The year was 2004."
    - statistical in nature: "1/5 the height"



EX: "There was a goose in the garden. The rabbit looked at the goose. It then hopped over the fence at 1:15 PM. A hare was relaxing in the shade. It lived near the rabbit. The former was 1/5 the height of the latter. A snake was slithering in the sun slowly. The year was 2004."

In [None]:
# enter sample text data as input
text = "There was a goose in the garden. The rabbit looked at the goose. It then hopped over the fence at 1:15 PM. A hare was relaxing in the shade. It lived near the rabbit. The former was 1/5 the height of the latter. A snake was slithering in the sun slowly. The year was 2004."

### Natural Language Processing using SpaceCy

Industrial Strength Cython Tool for NLP

the main question is which English pipeline do we want to use? There is a trade between how complex the pipeline is and the speed that it takes to process data

In [None]:
import spacy

# Load the large model
nlp_lg = spacy.load("en_core_web_lg")
doc_lg = nlp_lg("Sally speaks English")


# Print named entities 
print("Large model named entities:", [(ent.text, ent.label_) for ent in doc_lg.ents])


### Tokenization vs Embedding

as you can see, the output for embedding is printing a vector representation for each token instead

In [None]:
# TOKENIZATION
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
tokens = [token.text for token in doc]
print(tokens)

# EMBEDDING
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
embeddings = [token.vector for token in doc]
print(embeddings)


## Introducing BERT
### Implementing a Transformer rather than Tokenizer

After seeing the flaws in Named Entity Recognition from the Tokenizers in SpaCy

In [91]:
import torch
from transformers import BertTokenizer, BertForTokenClassification
# Adam optimizer with weight decay regularization
# prevents overfitting and helps the model generalize better
from transformers import AdamW, get_linear_schedule_with_warmup
# PyTorch utility to handle and batch data better
from torch.utils.data import DataLoader

In [92]:
# set a number of labels
NUM_LABELS = 52

In [93]:

# Load the BERT tokenizer and model with whole word masking
tokenizer = BertTokenizer.from_pretrained("bert-large-cased-whole-word-masking")
model = BertForTokenClassification.from_pretrained("bert-large-cased-whole-word-masking", num_labels=NUM_LABELS)  # Set NUM_LABELS according to your dataset


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-cased-whole-word-masking and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Label Creation for NER

(with B and I Tags- beginning and end of the category)


### GENERAL/BASICS


**People:**
- B-PER
- I-PER


**Locations (Country):**
- B-LOC
- I-LOC


**Geographic/Geological Formations:**
- B-GEO
- I-GEO


**Animals:**
- B-ANIM
- I-ANIM


**Numbers:**
- B-NUM
- I-NUM


**Colors:**
- B-COLOR
- I-COLOR


**University/Institution/Company:**
- B-INSTIT
- I-INSTIT


**Scientific Terminology/Technique:**
- B-SCI_TERM
- I-SCI_TERM



**General Subject/Topic/Field**
- B-FIELD
- I-FIELD


**Statistical Tests or Analysis:**
- B-STAT
- I-STAT


**Mathematical Concepts:**
- B-MATH
- I-MATH


**Programming Concepts:**
- B-CODE
- I-CODE


**Percentages/Fractions:**
- B-PERCENT
- I-PERCENT


-------


#### NEURO RESEARCH SPECIFIC


**Neuroimaging Modalities:**
- B-NEUROIMG
- I-NEUROIMG


-----

#### BASIC SCIENCE FIELDS/CONCEPTS


**Chemical Compound, Ion Names, Neurotransmitters:**
- B-CHEM
- I-CHEM


**Drugs:**
- B-DRUG
- I-DRUG



**Diseases/Disorders:**
- B-DISORD
- I-DISORD


**Biology (or Neuro) Concepts/Pathology/Anatomy:**
- B-BIO
- I-BIO


**Physics Concepts:**
- B-PHYSICS
- I-PHYSICS

----


#### LANGUAGE/LINGUISTICS


**Linguistic Concepts and Theories:**
- B-LING
- I-LING


**Languages:**
- B-LANG
- I-LANG


----


#### Additional Labels:


**Citations:**
- B-CITE
- I-CITE


**Padding:**
- PAD


**Outside:**
- O


In [94]:
# Create a list of all labels
labels = [
    "B-PER", "I-PER", "B-LOC", "I-LOC", "B-GEO", "I-GEO",
    "B-ANIM", "I-ANIM", "B-NUM", "I-NUM", "B-COLOR", "I-COLOR",
    "B-INSTIT", "I-INSTIT", "B-SCI_TERM", "I-SCI_TERM",
    "B-FIELD", "I-FIELD", "B-STAT", "I-STAT", "B-MATH", "I-MATH",
    "B-CODE", "I-CODE", "B-PERCENT", "I-PERCENT", "B-NEUROIMG", "I-NEUROIMG",
    "B-CHEM", "I-CHEM", "B-DRUG", "I-DRUG", "B-DISORD", "I-DISORD",
    "B-BIO", "I-BIO", "B-PHYSICS", "I-PHYSICS", "B-LING", "I-LING",
    "B-LANG", "I-LANG", "B-CITE", "I-CITE", "PAD", "O"
]

# Mapping labels to integers
# we do this by creating a dictionary {}
# start with the empty dict, label_map
    # looping through the list 'labels'
    # key:value so that label is the key and idx is the value
    # iterate over pairs of idx, label
label_map = {label: idx for idx, label in enumerate(labels)}


# Reverse mapping (optional, useful for decoding predictions)
idx_to_label = {idx: label for label, idx in label_map.items()}

## POS Tagging

So that the algorithm can detect the Part Of Speech

In [95]:
# Using the Natural Language ToolKit for this!
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk import pos_tag, word_tokenize

# Creating a function, pos_tagging()
# input is text data
def pos_tagging(texts):
    # Create a list for storing the text post- POS tagging
    tagged_texts = []
    # for a given iteration in the text input
    for text in texts:
        # tokenize the string into each word
        tokens = word_tokenize(" ".join(text))
        # apply POS tagging to each token
        pos_tags = pos_tag(tokens)
        # add each tag to the initial list created
        tagged_texts.append(pos_tags)
    return tagged_texts

# Example usage
texts = [["The patient has Alzheimer's disease."]]
pos_tagged_texts = pos_tagging(texts)
print(pos_tagged_texts)

[[('The', 'DT'), ('patient', 'NN'), ('has', 'VBZ'), ('Alzheimer', 'NNP'), ("'s", 'POS'), ('disease', 'NN'), ('.', '.')]]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/KeerthiStanley/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/KeerthiStanley/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Creating A Function to Load in Text Data, Have it Tokenized and the Label Each Corresponding Token

## Training with a Small Sample Sentence/Phrase Input and Its Labels

" John Doe has Alzheimer's disease, which affects his brain. They used an MRI scan to diagnose this. "

In [101]:

# Tokenize sentences into words using NLTK
def tokenize_sentences(sentences):
    return [word_tokenize(sentence) for sentence in sentences]


# Example sentences and their labels
sentences = [
    "John Doe has Alzheimer's disease, which affects his brain. They used an MRI scan to diagnose this."
]

tokenized_sentences = tokenize_sentences(sentences)

# Example labels corresponding to tokens (manually for demo purposes)
labels = [
    ["B-PER", "I-PER", "O", "B-DISORD", "I-DISORD", "I-DISORD", "O", "O", "O", "O", "B-BIO", "O",
    "O", "O", "O", "B-NEUROIMG", "I-NEUROIMG", "O", "O", "O", "O"]
]


# # Print the tokens and labels 
# for token, label in zip(tokens, label_sequence):
#     print(f"Token: {token}, Label: {label}")

# Print the tokens and labels 
for sentence_tokens, sentence_labels in zip(tokenized_sentences, labels):
    print(f"Tokens: {sentence_tokens}")
    print(f"Labels: {sentence_labels}")
    print("\n" + "="*50 + "\n")

Tokens: ['John', 'Doe', 'has', 'Alzheimer', "'s", 'disease', ',', 'which', 'affects', 'his', 'brain', '.', 'They', 'used', 'an', 'MRI', 'scan', 'to', 'diagnose', 'this', '.']
Labels: ['B-PER', 'I-PER', 'O', 'B-DISORD', 'I-DISORD', 'I-DISORD', 'O', 'O', 'O', 'O', 'B-BIO', 'O', 'O', 'O', 'O', 'B-NEUROIMG', 'I-NEUROIMG', 'O', 'O', 'O', 'O']




In [104]:
# Function to tokenize and preserve labels

# INPUTS:
    # sentences: a list of sentences where each sentence is a list of words
    # corresponding labels for each token of the sequence
def tokenize_and_preserve_labels(sentences, text_labels):
    # creates two empty storage lists for tokenized sentences and for the labels
    tokenized_sentences = []
    label_list = []
    
    # for each iteration of the pairs of sentences and their corresponding labels...
    for sentence, labels in zip(sentences, text_labels):
        # store the tokenized sentence and it's label
        tokenized_sentence = []
        sentence_labels = []
        # for each iteration of the pairs of sentences and their corresponding labels...
        for word, label in zip(sentence, labels):
            # using BERT's tokenizer to break down the word into subwords
            tokenized_word = tokenizer.tokenize(word)
            # then we measure the length of the tokenized word list to get the # of subwords
            n_subwords = len(tokenized_word)
            # add each tokenized word list to the tokenized sentence list
            tokenized_sentence.extend(tokenized_word)
            # [label] = the label for THAT iteration of the loop
            # multiply it by the number of subwords so that even if the word is split into smaller parts, it maintains the same label throughout
            sentence_labels.extend([label] * n_subwords)
        tokenized_sentences.append(tokenized_sentence) # append the tokenized sentence from that iteration onto the longer list
        label_list.append(sentence_labels) # append the label from that iteration onto the longer list
    return tokenized_sentences, label_list

# Call the function with example data
tokenized_sentences, preserved_labels = tokenize_and_preserve_labels(tokenized_sentences, labels)


## NERDataset Class: Format to Make Training Compatible with BERT

In [115]:
# maximum length of each input sentence accepted by the model
# will be padded if shorter than this length
MAX_LEN = 128
# the number of samples processed simultaneously in a forward/backward pass during training or inference
# larger batch sizes = faster training but also require more memory
BATCH_SIZE = 60
# number of complete passes through the training data
EPOCHS = 10

#### now that my parameters are established, I'm going to establish the NER Dataset class

#### the Dataset class imported, is a base class that you can inherit from to create your own custom datasets

There are other common datasets which can be used to training our model, but this allows us to train the model with our own design

Some examples of specfic pre-made datasets:
- SemEval Datasets: Semantic Analysis
- TREC (Text REtrieval Conference): Question Classification
- SQuAD (Stanford Question Answering Dataset)
- Common Crawl: Used for Chat GPT

Bio specific:
- PubMed Dataset
- BioNLP / BioBERT
- NeuroSynth
- ScispaCy: these models are trained on scientific texts and include pretrained NER models that can identify entities like chemicals, genes, and organisms. The datasets used to train these models (like CRAFT or JNLPBA) are labeled *

NCBI Disease Corpus:

dataset specifically for disease named entity recognition, containing clinical and biomedical texts with annotations for disease names.

MedMentions:

dataset that includes a large number of medical mentions, with annotations for entities such as diseases, drugs, and genes.

PubTator Central:
provides annotations for a large number of biomedical articles, including entities like genes, diseases, and chemicals.
Use: Can be used to train models for biomedical text mining and entity recognition.

In [116]:
from torch.utils.data import Dataset
import torch

class NERDataset(Dataset):
    def __init__(self, sentences, labels, tokenizer, max_len):
        """
        Initialize the NERDataset class.

        Parameters:
        - sentences (list of list of str): Tokenized sentences.
        - labels (list of list of str): Corresponding labels for each token in the sentences.
        - tokenizer (transformers.PreTrainedTokenizer): Tokenizer used to convert text into token IDs.
        - max_len (int): Maximum sequence length for padding/truncation.
        """
        self.sentences = sentences  # Store the list of tokenized sentences.
        self.labels = labels  # Store the list of corresponding labels.
        self.tokenizer = tokenizer  # Store the tokenizer for later use.
        self.max_len = max_len  # Store the maximum length for padding/truncation.
    
    def __len__(self):
        """
        Return the number of sentences in the dataset.

        Returns:
        - int: Number of sentences in the dataset.
        """
        return len(self.sentences)
    
    def __getitem__(self, idx):
        """
        Fetch the item at the given index, tokenize it, and align the labels.

        Parameters:
        - idx (int): Index of the item to fetch.

        Returns:
        - dict: A dictionary containing 'input_ids', 'attention_mask', and 'labels' tensors.
        """
        sentence = self.sentences[idx]  # Get the sentence at the specified index.
        label = self.labels[idx]  # Get the corresponding labels for the sentence.
        
        # Tokenize the sentence and align labels
        tokens = self.tokenizer(
            sentence, 
            max_length=self.max_len, 
            truncation=True, 
            padding='max_length', 
            is_split_into_words=True
        )
        
        input_ids = tokens['input_ids']  # Extract input token IDs from tokenized output.
        attention_mask = tokens['attention_mask']  # Extract attention mask from tokenized output.
        
        # Initialize labels with -100 for padding (ignored in loss calculation).
        labels = [-100] * self.max_len
        # Align labels with the tokenized words.
        for i, label_id in enumerate(label):
            if i < self.max_len:
                labels[i] = label_map.get(label_id, -100)  # Map label to its integer representation.
        
        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),  # Convert input IDs to a PyTorch tensor.
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),  # Convert attention mask to a PyTorch tensor.
            'labels': torch.tensor(labels, dtype=torch.long)  # Convert labels to a PyTorch tensor.
        }

# Example usage:
# Create your dataset
dataset = NERDataset(tokenized_sentences, preserved_labels, tokenizer, MAX_LEN)

# DataLoader
# Initialize DataLoader to handle batching and shuffling of the dataset.
data_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)


In [117]:
from transformers import get_linear_schedule_with_warmup

# Define the loss function
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Set up a learning rate scheduler
total_steps = len(data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

In [118]:

## TRAINING LOOP
## generate the loss per batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        optimizer.zero_grad()
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
    
    print(f"Epoch {epoch + 1}/{EPOCHS}, Loss: {total_loss / len(data_loader)}")


Epoch 1/10, Loss: 1.6293100118637085
Epoch 2/10, Loss: 2.0066909790039062
Epoch 3/10, Loss: 1.3336238861083984
Epoch 4/10, Loss: 0.7771637439727783
Epoch 5/10, Loss: 0.5990918874740601
Epoch 6/10, Loss: 0.54462069272995
Epoch 7/10, Loss: 0.4330599904060364
Epoch 8/10, Loss: 0.314517080783844
Epoch 9/10, Loss: 0.2986332178115845
Epoch 10/10, Loss: 0.22897063195705414


In [119]:
# EVALUATE THE MODEL
# let's use an Evaluation Loop

model.eval()
total_loss = 0

with torch.no_grad():
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        
        total_loss += loss.item()

print(f"Validation Loss: {total_loss / len(data_loader)}")


Validation Loss: 0.13250534236431122


In [127]:
# GENERATE PREDICTIONS

def predict(sentence):
    tokens = tokenizer(sentence, return_tensors='pt', truncation=True, padding='max_length', max_length=MAX_LEN)
    input_ids = tokens['input_ids'].to(device)
    attention_mask = tokens['attention_mask'].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Decode predictions
    predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()[0]  # Get the first (and only) sequence
    predicted_labels = [idx_to_label[id] for id in predictions if id != -100]  # Ignore padding tokens
    
    return predicted_labels



In [132]:
# RETRIEVE RELEVANT SENTENCES

def retrieve_relevant_sentences(text, keyword):
    # Split text into sentences based on periods
    sentences = text.split('.') 
    relevant_sentences = []
    
    # Map keyword to its label if it exists in your label set
    keyword_label = None
    for label in labels:
        if keyword in label:
            keyword_label = label
            break
    
    # Ensure that the keyword_label is valid
    if not keyword_label:
        print("Keyword not found in label set.")
        return relevant_sentences
    
    # Process each sentence
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        
        # Predict the labels for the sentence
        predictions = predict(sentence)
        
        # Map prediction indices back to labels
        predicted_labels = [idx_to_label[idx] for idx in predictions[0]]
        
        # Check if the keyword_label is in the predicted labels
        if keyword_label in predicted_labels:
            relevant_sentences.append(sentence)
    
    return relevant_sentences



In [134]:
text = """
John Doe has Alzheimer's disease, which affects his brain. They used an MRI scan to diagnose this.
The treatment involves medication and therapy.
"""
keyword = "John"  # Example keyword for testing

relevant_sentences = retrieve_relevant_sentences(text, keyword)

print("Relevant Sentences:")
for sentence in relevant_sentences:
    print(sentence)

RuntimeError: Tensor.__contains__ only supports Tensor or scalar, but you passed in a <class 'str'>.