# Sentence Transformers

When working with both `TF-IDF` and `GloVe`, the focus was on the individual words within the Psalms. We were not getting the models to approach understanding the meaning of an entire psalm or a verse of a psalm. An example of this is 
> “The cat sat on the mat.” ≈ “A feline is resting on a rug.”  
> → **Sentence Transformers** give these two nearly identical vectors, unlike **TF-IDF** or **GloVe**.

We can look at other methods, such as `BERT`, to start to train a model to understand the meanings of text. Once we train and test out a **BERT** model, we can extend it to focus one sentences or verses by using **SBERT**.

**Source:** *Medium* - [Mastering BERT Model: Building it from Scratch with Pytorch](https://medium.com/data-and-beyond/complete-guide-to-building-bert-model-from-sratch-3e6562228891)


In [2]:
# Importing the needed libraries
import os
from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import torch

In [21]:
# Importing the Psalms data

# Psalms (Bible & Psalter)
full_psalms = pd.read_csv("../../../data/csv/grouped_psalm.csv")


# Verses
psalms_verses = pd.read_csv("../../../data/csv/cleaned_psalm_verses.csv")

In [22]:
full_psalms.head()

Unnamed: 0.1,Unnamed: 0,tradition,text,psalm_num,verse,cleaned_verse
0,0,Orthodox,Bible,1,Blessed is the man Who walks not in the counse...,blessed man walk counsel ungodly stand way sin...
1,1,Orthodox,Bible,2,Why do the nations rage And the people meditat...,nation rage people meditate vain thing king ea...
2,2,Orthodox,Bible,3,A psalm by David when he fled from the face of...,psalm david fled face son absalom olord afflic...
3,3,Orthodox,Bible,4,For the End in psalms an ode by David You hear...,end psalm ode david heard icalled god righteou...
4,4,Orthodox,Bible,5,For the End concerning the inheritance a psalm...,end concerning inheritance psalm david give ea...


In [5]:
psalms_verses.head()

Unnamed: 0,tradition,text,psalm_num,verse_num,verse
0,Orthodox,Bible,1,1,Blessed is the man Who walks not in the counse...
1,Orthodox,Bible,1,2,But his will is in the law of the Lord And in ...
2,Orthodox,Bible,1,3,He shall be like a tree Planted by streams of ...
3,Orthodox,Bible,1,4,Not so are the ungodly not so But they are lik...
4,Orthodox,Bible,1,5,Therefore the ungodly shall not rise in the ju...


In [6]:
# Importing the clean `txt` files of the psalms.

# psalms_bert.py
from transformers import BertTokenizer, BertModel
import torch

# Load your text data
with open("../word_embeddings/corpus.txt", "r", encoding="utf-8") as f:
    psalms = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(psalms)} psalms.")


Loaded 301 psalms.


In [7]:
import os

# Define the folder and file paths
data_dir = "../data"
verses_path = os.path.join(data_dir, "verses.txt")
verses_index_path = os.path.join(data_dir, "verses_index.txt")

# Ensure the directory exists
os.makedirs(data_dir, exist_ok=True)


# Building the text file above but as each line is a verse. 
with open("../data/verses.txt", "w", encoding="utf-8") as verses_file, \
     open("../data/verses_index.txt", "w", encoding="utf-8") as verse_index_file:
    
    for line_number, row in enumerate(psalms_verses.itertuples(index=False), start=1):
        # 1. Corpus: cleaned text, one Psalm per line
        verses_file.write(str(row.verse).strip().replace("\n", " ") + "\n")
    
        
        # 2. Index file: line number → Psalm ## + tradition
        verse_index_file.write(f"{line_number}\tPsalm {row.psalm_num}, Verse {row.verse_num}\t{row.text}\n")

We now have the data prepared in into a text file of psalms and a separate one of verses. We can now build a `BERT` encoder.

## Encoding the Psalms
Testing that the encoder works. 

In [33]:
import torch
import transformers

print("Torch version:", torch.__version__)
print("Transformers version:", transformers.__version__)


Torch version: 2.2.2
Transformers version: 4.57.1


After Confirmation of the package versions, lets test it out. I think I had problems with the differing kernels that I was working in. The following code is for testing from `Chat GPT`

In [34]:
from transformers import AutoTokenizer, AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
model = AutoModel.from_pretrained("bert-base-uncased").to(device)
model.eval()

text = "Hello world"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

print("Embedding shape:", embedding.shape)


Using device: cpu
Embedding shape: (768,)


With the Code above finally working, we can build the embeddings for the psalms and then for each of the verses. First we need to write the algorithm to do the encoding. 

In [50]:
# --- Clean Psalm Encoder using BERT ---
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# 1️⃣ Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# 2️⃣ Load tokenizer and model (fresh instances)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
bert_model = AutoModel.from_pretrained("bert-base-uncased").to(device)
bert_model.eval()  # evaluation mode

# 3️⃣ Encoding function with attention-mask weighted pooling
def encode_text_bert(text: str) -> np.ndarray:
    """
    Encode a single text string into a 1D numpy array (hidden_size,)
    Uses attention-mask weighted mean to ignore padding.
    """
    # Tokenize
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )
    
    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)  # last_hidden_state: (1, seq_len, hidden)
        hidden = outputs.last_hidden_state
        mask = inputs.get("attention_mask")
        
        if mask is None:
            pooled = hidden.mean(dim=1)
        else:
            mask = mask.unsqueeze(-1)  # (1, seq_len, 1)
            masked_hidden = hidden * mask
            summed = masked_hidden.sum(dim=1)
            counts = mask.sum(dim=1).clamp(min=1e-9)
            pooled = summed / counts
    
    return pooled.squeeze(0).cpu().numpy()

# 4️⃣ Example usage: encode a list of psalms
# Make sure `psalms` is defined, e.g.
# psalms = ["Psalm 1 text...", "Psalm 2 text...", ...]



Using device: cpu


In [51]:

for i, psalm in enumerate(psalms, start=1):
    embedding = encode_text_bert(psalm)
    np.save(f"../data/bert/psalm_{i}_embedding.npy", embedding)
    if i % 10 == 0:
        print(f"Encoded {i} psalms.")

print("✅ All psalms encoded successfully.")

Encoded 10 psalms.
Encoded 20 psalms.
Encoded 30 psalms.
Encoded 40 psalms.
Encoded 50 psalms.
Encoded 60 psalms.
Encoded 70 psalms.
Encoded 80 psalms.
Encoded 90 psalms.
Encoded 100 psalms.
Encoded 110 psalms.
Encoded 120 psalms.
Encoded 130 psalms.
Encoded 140 psalms.
Encoded 150 psalms.
Encoded 160 psalms.
Encoded 170 psalms.
Encoded 180 psalms.
Encoded 190 psalms.
Encoded 200 psalms.
Encoded 210 psalms.
Encoded 220 psalms.
Encoded 230 psalms.
Encoded 240 psalms.
Encoded 250 psalms.
Encoded 260 psalms.
Encoded 270 psalms.
Encoded 280 psalms.
Encoded 290 psalms.
Encoded 300 psalms.
✅ All psalms encoded successfully.


In [52]:
import numpy as np
import os

output_dir = "../data/bert"
psalm_embeddings = []

# Load all saved embeddings
for filename in sorted(os.listdir(output_dir)):
    if filename.endswith(".npy") and "psalm_" in filename:
        emb = np.load(os.path.join(output_dir, filename))
        psalm_embeddings.append(emb)

psalm_embeddings = np.stack(psalm_embeddings)  # shape: (num_psalms, 768)
print("Loaded psalm embeddings:", psalm_embeddings.shape)


Loaded psalm embeddings: (301, 768)


In [49]:
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two 1D numpy arrays."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    return float(np.dot(a_norm, b_norm))

In [71]:
def query_bert():
    query = "for the peace of the world"

    print("Query: ", query)

    if not query:
        print("Empty query. Exiting.")
        return
    
    query_emb = encode_text(query)

    similarities = []

    for psalm_emb in psalm_embeddings:
        sim = cosine_similarity(query_emb, psalm_emb)
        similarities.append(round(sim*100, 2))

    
    top_indices = np.argsort(similarities)[-5:][::-1]

    # Checking the Output
    # print(top_indices)
    
    print("\nTop 5 matching psalms:")
    for rank, idx in enumerate(top_indices, start=1):
        text = "Bible"
        if idx > 151:
            num = idx - 151
            text = "Psalter"
        else: 
            num = idx
        print(f"{rank}. {text} Psalm {num + 1} - Similarity: {similarities[idx]}% \n {full_psalms.iloc[idx]['verse']}")
    
query_bert()


Query:  for the peace of the world

Top 5 matching psalms:
1. Psalter Psalm 34 - Similarity: 59.61% 
 Judge Thou, O Lord, them that do me injustice; fight against them that fight against me. Take hold of weapon and buckler, and rise up for mine help. Draw out the sword, and stop the way against them that persecute me; say unto my soul, I am Thy salvation. Let them be ashamed and confounded that seek after my soul; let them be turned back and put to shame that devise evil against me. Let them become as chaff before the face of the wind, and let the angel of the Lord afflict them. Let their way be dark and slippery, and let the angel of the Lord pursue them. For without cause have they hid for me destruction in their snare; without cause have they reproached my soul. Let a snare come upon him unawares, and let the snare that he hath hid catch himself; and into that very snare let him fall. But my soul shall exult in the Lord; it shall delight in His salvation. All my bones shall say, Lor

Now that `BERT` is working, We can further extend it to apply `SBERT`. Similar to `Bert`, `SBERT` focuses more on sentences of documents compared to just individual words. 
I will start to take the embeddeing code and change it to use `SBERT`'s methods. 

In [54]:
from sentence_transformers import SentenceTransformer

# Use a pretrained SBERT model
sbert_model = SentenceTransformer('all-mpnet-base-v2')  # or any SBERT variant


After reading some of the documentation for `SBERT`, there is code I had to write for `BERT` that is already being handled, such as, tokenization, attention masks, and pooling. Lets write the function. 

In [59]:
def encode_text_SBERT(text):
    return sbert_model.encode(text, convert_to_numpy=True)



In [None]:
# Encoding each full Psalm
for i, psalm in enumerate(psalms, start=1):
    try:
        embedding = encode_text_SBERT(psalm)
        np.save(f"../data/sbert/psalm_{i}_embedding.npy", embedding)
        if i % 10 == 0:
            print(f"Encoded {i} psalms.")
    except Exception as e:
        print(f"Failed at psalm {i}: {e}")
        


Encoded 10 psalms.
Encoded 20 psalms.
Encoded 30 psalms.
Encoded 40 psalms.
Encoded 50 psalms.
Encoded 60 psalms.
Encoded 70 psalms.
Encoded 80 psalms.
Encoded 90 psalms.
Encoded 100 psalms.
Encoded 110 psalms.
Encoded 120 psalms.
Encoded 130 psalms.
Encoded 140 psalms.
Encoded 150 psalms.
Encoded 160 psalms.
Encoded 170 psalms.
Encoded 180 psalms.
Encoded 190 psalms.
Encoded 200 psalms.
Encoded 210 psalms.
Encoded 220 psalms.
Encoded 230 psalms.
Encoded 240 psalms.
Encoded 250 psalms.
Encoded 260 psalms.
Encoded 270 psalms.
Encoded 280 psalms.
Encoded 290 psalms.
Encoded 300 psalms.


Condensing the embedding in the format to reference them. 

In [57]:
output_dir = "../data/sbert"
psalm_SBERT_embeddings = []

# Load all saved embeddings
for filename in sorted(os.listdir(output_dir)):
    if filename.endswith(".npy") and "psalm_" in filename:
        emb = np.load(os.path.join(output_dir, filename))
        psalm_SBERT_embeddings.append(emb)

psalm_SBERT_embeddings = np.stack(psalm_SBERT_embeddings)  # shape: (num_psalms, 768)
print("Loaded psalm embeddings:", psalm_SBERT_embeddings.shape)

Loaded psalm embeddings: (301, 768)


In [68]:
def query_sbert():
    query = "for the peace of the world"

    print("Query: ", query)

    if not query:
        print("Empty query. Exiting.")
        return
    
    query_emb = encode_text_SBERT(query)

    similarities = []

    for psalm_emb in psalm_SBERT_embeddings:
        sim = cosine_similarity(query_emb, psalm_emb)
        similarities.append(round(sim*100, 2))

    
    top_indices = np.argsort(similarities)[-5:][::-1]

    # Checking the Output
    # print(top_indices)
    
    print("\nTop 5 matching psalms:")
    for rank, idx in enumerate(top_indices, start=1):
        text = "Bible"
        if idx > 151:
            num = idx - 151
            text = "Psalter"
        else: 
            num = idx
        print(f"{rank}. {text} Psalm {num + 1} - Similarity: {similarities[idx]}% \n {full_psalms.iloc[idx]['verse']}")
        
    
query_sbert()


Query:  for the peace of the world

Top 5 matching psalms:
1. Psalter Psalm 85 - Similarity: 34.36% 
 Bow down Thine ear, O Lord, and hear me, for I am poor and needy. Preserve my soul, for I am holy; O Thou my God, save Thy servant that hopeth in Thee. Be merciful unto me, O Lord; for unto Thee will I cry all the day long. Rejoice the soul of Thy servant, for unto Thee have I lifted up my soul. For Thou, Lord, art good and gentle, and plenteous in mercy unto all them that call upon Thee. Give ear, O Lord, unto my prayer, and attend to the voice of my supplication. In the day of my trouble I cried to Thee, for Thou hast heard me. Among the gods there is none like unto Thee, O Lord; neither are there any works like unto Thy works. All the nations whom Thou hast made shall come and worship before Thee, O Lord, and shall glorify Thy name. For Thou art great, and doest wondrous things, Thou art God alone. Guide me in Thy way, O Lord, and I will walk in Thy truth; let my heart rejoice that 

In [72]:
query_bert()


Query:  for the peace of the world

Top 5 matching psalms:
1. Psalter Psalm 34 - Similarity: 59.61% 
 Judge Thou, O Lord, them that do me injustice; fight against them that fight against me. Take hold of weapon and buckler, and rise up for mine help. Draw out the sword, and stop the way against them that persecute me; say unto my soul, I am Thy salvation. Let them be ashamed and confounded that seek after my soul; let them be turned back and put to shame that devise evil against me. Let them become as chaff before the face of the wind, and let the angel of the Lord afflict them. Let their way be dark and slippery, and let the angel of the Lord pursue them. For without cause have they hid for me destruction in their snare; without cause have they reproached my soul. Let a snare come upon him unawares, and let the snare that he hath hid catch himself; and into that very snare let him fall. But my soul shall exult in the Lord; it shall delight in His salvation. All my bones shall say, Lor