Quote from the GloVe GitHub repository, under [`src/README.md`](https://github.com/stanfordnlp/GloVe/tree/master/src):

   > To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by new line characters. Cooccurrence contexts for words do not extend past newline characters. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.

# Importing the Data

In [4]:
import pandas as pd
import numpy as np

In [13]:
import os
print(os.getcwd())

/Users/caden/st_david-s-beacon/website/scripts/fall 2025/word_embeddings


In [14]:

psalms_verses = pd.read_csv("../../../data/csv/cleaned_psalm_verses.csv")

psalms_verses

Unnamed: 0,tradition,text,psalm_num,verse_num,verse
0,Orthodox,Bible,1,1,Blessed is the man Who walks not in the counse...
1,Orthodox,Bible,1,2,But his will is in the law of the Lord And in ...
2,Orthodox,Bible,1,3,He shall be like a tree Planted by streams of ...
3,Orthodox,Bible,1,4,Not so are the ungodly not so But they are lik...
4,Orthodox,Bible,1,5,Therefore the ungodly shall not rise in the ju...
...,...,...,...,...,...
5000,Orthodox,Psalter,150,64,"Butter of kine, and milk of sheep, with fat of..."
5001,Orthodox,Psalter,150,65,"So Jacob ate, and was filled; and the beloved ..."
5002,Orthodox,Psalter,150,66,"They provoked Me to anger with strange gods, a..."
5003,Orthodox,Psalter,150,67,"They sacrificed unto demons, not to God; to go..."


In [16]:
# Grouped Psalms (Bible & Psalter)
psalms = pd.read_csv("../../../data/csv/grouped_psalm.csv")
psalms

Unnamed: 0.1,Unnamed: 0,tradition,text,psalm_num,verse,cleaned_verse
0,0,Orthodox,Bible,1,Blessed is the man Who walks not in the counse...,blessed man walk counsel ungodly stand way sin...
1,1,Orthodox,Bible,2,Why do the nations rage And the people meditat...,nation rage people meditate vain thing king ea...
2,2,Orthodox,Bible,3,A psalm by David when he fled from the face of...,psalm david fled face son absalom olord afflic...
3,3,Orthodox,Bible,4,For the End in psalms an ode by David You hear...,end psalm ode david heard icalled god righteou...
4,4,Orthodox,Bible,5,For the End concerning the inheritance a psalm...,end concerning inheritance psalm david give ea...
...,...,...,...,...,...,...
296,296,Orthodox,Psalter,146,The Lord doth build up Jerusalem; He shall gat...,lord doth build jerusalem ; shall gather toget...
297,297,Orthodox,Psalter,147,"Praise the Lord, O Jerusalem; praise thy God, ...","praise lord , jerusalem ; praise thy god , zio..."
298,298,Orthodox,Psalter,148,Praise ye the Lord from the heavens; praise Hi...,praise ye lord heaven ; praise highest . prais...
299,299,Orthodox,Psalter,149,"Sing unto the Lord a new song, His praise is i...","sing unto lord new song , praise congregation ..."


In [None]:
# Rename the last two columns: each row now holds a whole psalm, not a single verse
psalms = psalms.rename(columns={'verse':'psalm',"cleaned_verse": "cleaned_psalm"})
psalms

Unnamed: 0.1,Unnamed: 0,tradition,text,psalm_num,psalm,cleaned_psalm
0,0,Orthodox,Bible,1,Blessed is the man Who walks not in the counse...,blessed man walk counsel ungodly stand way sin...
1,1,Orthodox,Bible,2,Why do the nations rage And the people meditat...,nation rage people meditate vain thing king ea...
2,2,Orthodox,Bible,3,A psalm by David when he fled from the face of...,psalm david fled face son absalom olord afflic...
3,3,Orthodox,Bible,4,For the End in psalms an ode by David You hear...,end psalm ode david heard icalled god righteou...
4,4,Orthodox,Bible,5,For the End concerning the inheritance a psalm...,end concerning inheritance psalm david give ea...
...,...,...,...,...,...,...
296,296,Orthodox,Psalter,146,The Lord doth build up Jerusalem; He shall gat...,lord doth build jerusalem ; shall gather toget...
297,297,Orthodox,Psalter,147,"Praise the Lord, O Jerusalem; praise thy God, ...","praise lord , jerusalem ; praise thy god , zio..."
298,298,Orthodox,Psalter,148,Praise ye the Lord from the heavens; praise Hi...,praise ye lord heaven ; praise highest . prais...
299,299,Orthodox,Psalter,149,"Sing unto the Lord a new song, His praise is i...","sing unto lord new song , praise congregation ..."


# Converting Data to `txt` Files

Based on the GitHub repo, training has to run on a single `txt` file. I am treating each psalm as a single document, so we need to take the **cleaned_psalm** column (renamed from **cleaned_verse** above) and combine it into a single `txt` file, one psalm per line. Since adding a label to each document would get in the way of training, I am making 2 parallel, line-aligned files:

**Corpus for GloVe - `corpus.txt`**
1. Blessed is the man who walks not in the counsel of the ungodly...
2. Why do the nations rage, and the people plot in vain...
3. Blessed is the man that walketh not in the counsel of the ungodly...
4. Why do the heathen rage, and the people imagine a vain thing...

**Psalm Index - `corpus_index.txt`**

| Line | Psalm   | Tradition |
|------|---------|-----------|
| 1    | Psalm 1 | Bible     |
| 2    | Psalm 2 | Bible     |
| 3    | Psalm 1 | Psalter   |
| 4    | Psalm 2 | Psalter   |
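
Because the two files are line-aligned, line *n* of the corpus can be traced back to its psalm by looking up line *n* in the index. A minimal sketch, assuming a tab-separated index with three fields (line number, psalm label, tradition). `load_index` is a hypothetical helper, not part of the GloVe tooling, and the sample rows are illustrative:

```python
import csv
import io

def load_index(handle):
    """Map corpus line number -> (psalm label, tradition) from a tab-separated index."""
    index = {}
    for line_number, psalm, tradition in csv.reader(handle, delimiter="\t"):
        index[int(line_number)] = (psalm, tradition)
    return index

# In-memory stand-in for the index file (illustrative rows)
sample = io.StringIO("1\tPsalm 1\tBible\n2\tPsalm 2\tBible\n")
index = load_index(sample)
print(index[2])
```

Keeping the labels out of the corpus file means GloVe never sees tokens such as "Psalm" or "1" that are not part of the text itself.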



In [19]:
with open("../data/psalms.txt", "w", encoding="utf-8") as corpus_file, \
     open("../psalms_index.txt", "w", encoding="utf-8") as index_file:

    # itertuples yields namedtuples, so columns are read as attributes (row.col), not row['col']
    for line_number, row in enumerate(psalms.itertuples(index=False), start=1):
        # 1. Corpus: cleaned text, one Psalm per line
        corpus_file.write(str(row.cleaned_psalm).strip().replace("\n", " ") + "\n")

        # 2. Index file: line number → Psalm ## + tradition
        index_file.write(f"{line_number}\tPsalm {row.psalm_num}\t{row.text}\n")

# Generating GloVe Models

After reading through the GitHub repo and with a bit of help from ChatGPT, I was able to compile some functions that generate the `GloVe` components needed. This is the same as running the command-line tools from the terminal. Let's test them out.
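
For reference, the four tools the GloVe README refers to are `vocab_count`, `cooccur`, `shuffle`, and `glove`, run in that order in the repo's `demo.sh`. The sketch below shows one way a Python wrapper could assemble those calls. It is an assumption about what such a wrapper might look like, not the actual contents of `glove_utils`: the names `build_glove_commands` and `run_glove_pipeline`, the `build_dir` default, and the hyperparameter values are all illustrative.

```python
import subprocess
from pathlib import Path

def build_glove_commands(corpus, build_dir="./glove/build", vector_size=50,
                         window_size=10, min_count=1, iterations=15):
    """Return (argv, stdin_path, stdout_path) triples for the four GloVe tools."""
    build = Path(build_dir)  # assumed location of the compiled binaries
    return [
        ([str(build / "vocab_count"), "-min-count", str(min_count)],
         corpus, "vocab.txt"),
        ([str(build / "cooccur"), "-vocab-file", "vocab.txt",
          "-window-size", str(window_size)],
         corpus, "cooccurrence.bin"),
        ([str(build / "shuffle")],
         "cooccurrence.bin", "cooccurrence.shuf.bin"),
        ([str(build / "glove"), "-input-file", "cooccurrence.shuf.bin",
          "-vocab-file", "vocab.txt", "-save-file", "vectors",
          "-vector-size", str(vector_size), "-iter", str(iterations),
          "-binary", "0"],  # 0 = text output only
         None, None),
    ]

def run_glove_pipeline(corpus, **kwargs):
    """Run each tool in order, redirecting stdin/stdout to files as demo.sh does."""
    for argv, stdin_path, stdout_path in build_glove_commands(corpus, **kwargs):
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()
    return "vectors.txt"

commands = build_glove_commands("corpus.txt")
```

Separating command construction from execution makes it possible to inspect the exact arguments before anything is run.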


In [None]:
import glove_utils as gu

# Make sure corpus.txt already exists and contains your training text

vectors_file = gu.train_glove("corpus.txt", glove_path="./glove")


BUILDING VOCABULARY
Processed 50280 tokens.
Counted 3473 unique words.
Using vocabulary of size 3473.

COUNTING COOCCURRENCES
window size: 10
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 3473 words.
Building lookup table...table contains 12061730 elements.
Processed 50280 tokens.
Writing cooccurrences to disk.......2 files in total.
Merging cooccurrence files: processed 311775 lines.

Using random seed 1759931794
SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 311775 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 311775 lines.

TRAINING MODEL
Read 311775 lines.
Initializing parameters...Using random seed 1759931795
done.
vector size:

Vectors saved to vectors.txt


In [None]:
# Generating vectors
glove_vectors = gu.load_glove(vectors_file)
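
The rest of the notebook treats `glove_vectors` as a dictionary from word to NumPy vector. GloVe's text output stores one word per line followed by its space-separated components, so a loader can be quite small. The following is a sketch under that assumption, not the actual `gu.load_glove`; the name `load_vectors` and the `float32` dtype are illustrative choices:

```python
import io
import numpy as np

def load_vectors(handle):
    """Parse GloVe text output: a word followed by its float components on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors

# In-memory stand-in for vectors.txt with made-up 3-dimensional vectors
sample = io.StringIO("lord 0.1 0.2 0.3\npraise -0.4 0.5 0.6\n")
vecs = load_vectors(sample)
print(vecs["lord"].shape)
```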

In [None]:
# Testing 
# User input
query = input("Enter something to search for: ")
print(glove_vectors.get(query))


KeyboardInterrupt: Interrupted by user

With the GloVe model trained, let's prototype the search results just like the `TF-IDF` results.

In [None]:
def psalm_embedding(text):
    words = text.lower().split()  # simple tokenization
    vectors = []
    for w in words:
        if w in glove_vectors:
            vectors.append(glove_vectors[w])

    if len(vectors) == 0:
        return np.zeros(next(iter(glove_vectors.values())).shape)
    return np.mean(vectors, axis=0)

# Add a column with embeddings
psalms['glove_vec'] = psalms['cleaned_psalm'].apply(psalm_embedding)



In [None]:
psalms

Unnamed: 0.1,Unnamed: 0,tradition,text,psalm_num,psalm,cleaned_psalm,glove_vec
0,0,Orthodox,Bible,1,Blessed is the man Who walks not in the counse...,blessed man walk counsel ungodly stand way sin...,"[0.23288535, 0.2566583, -0.24532774, -0.296843..."
1,1,Orthodox,Bible,2,Why do the nations rage And the people meditat...,nation rage people meditate vain thing king ea...,"[0.22580087, -0.0008831446, -0.08571188, -0.06..."
2,2,Orthodox,Bible,3,A psalm by David when he fled from the face of...,psalm david fled face son absalom olord afflic...,"[0.2679192, 0.015625622, -0.117098495, 0.11953..."
3,3,Orthodox,Bible,4,For the End in psalms an ode by David You hear...,end psalm ode david heard icalled god righteou...,"[0.21202159, -0.019333934, -0.13325356, 0.1017..."
4,4,Orthodox,Bible,5,For the End concerning the inheritance a psalm...,end concerning inheritance psalm david give ea...,"[0.28035945, -0.024700804, -0.1479579, -0.0643..."
...,...,...,...,...,...,...,...
296,296,Orthodox,Psalter,146,The Lord doth build up Jerusalem; He shall gat...,lord doth build jerusalem ; shall gather toget...,"[0.22897509, 0.043269653, -0.20230761, 0.11760..."
297,297,Orthodox,Psalter,147,"Praise the Lord, O Jerusalem; praise thy God, ...","praise lord , jerusalem ; praise thy god , zio...","[0.19396736, 0.06622269, -0.22698525, 0.069075..."
298,298,Orthodox,Psalter,148,Praise ye the Lord from the heavens; praise Hi...,praise ye lord heaven ; praise highest . prais...,"[0.113614544, -0.10495038, -0.21939266, 0.0929..."
299,299,Orthodox,Psalter,149,"Sing unto the Lord a new song, His praise is i...","sing unto lord new song , praise congregation ...","[0.22333209, 0.012654127, -0.19451733, 0.08808..."


$$
\text{cosine\_similarity}(\mathbf{q}, \mathbf{d}) =
\frac{\mathbf{q} \cdot \mathbf{d}}
{\|\mathbf{q}\| \, \|\mathbf{d}\|}
$$

$$
\text{cosine\_similarity}(\mathbf{q}, \mathbf{d}) =
\frac{\sum_{i=1}^{n} q_i d_i}
{\sqrt{\sum_{i=1}^{n} q_i^2} \; \sqrt{\sum_{i=1}^{n} d_i^2}}
$$
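
A small worked example of the formula, using made-up 2-dimensional vectors rather than real GloVe output. Returning `0.0` when either vector has zero length is a convention chosen here to avoid dividing by zero; it is not part of the definition:

```python
import numpy as np

def cosine_similarity(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(np.dot(q, d) / denom) if denom else 0.0

q = np.array([1.0, 0.0])
same_direction = cosine_similarity(q, np.array([2.0, 0.0]))  # length does not matter
orthogonal = cosine_similarity(q, np.array([0.0, 3.0]))      # no shared direction
zero_case = cosine_similarity(q, np.zeros(2))                # guarded division

print(same_direction, orthogonal, zero_case)
```

The score depends only on direction, so a long psalm and a short query can still score highly if their averaged vectors point the same way.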

In [None]:
def query_glove(query):
    q_vec = psalm_embedding(query)
    q_norm = np.linalg.norm(q_vec)

    # No query word is in the GloVe vocabulary: a zero vector cannot be compared
    if q_norm == 0:
        return []

    results = []
    for _, row in psalms.iterrows():
        d_norm = np.linalg.norm(row['glove_vec'])
        # Cosine similarity; a psalm with a zero vector scores 0
        sim = np.dot(q_vec, row['glove_vec']) / (q_norm * d_norm) if d_norm else 0.0
        results.append({
            "doc": row['text'],
            "psalm_num": row['psalm_num'],
            "psalm": row['psalm'],
            "similarity": round(sim*100, 2)
        })

    # Sort by similarity, highest first
    results.sort(key=lambda x: x['similarity'], reverse=True)

    return results[:6]  # top 6 results

In [6]:
query = input("Enter a query to search: ")
results = query_glove(query)

print("Query: ", query)
for psalm in results:
    print(f"{psalm['doc']} {psalm['psalm_num']} ({psalm['similarity']}%): \n  {psalm['psalm']}")

NameError: name 'psalm_embedding' is not defined

# Comparisons

Let's see if this has given us different results than the TF-IDF options. 
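
One simple way to quantify "different results" is the overlap between the two top-k lists. The sketch below assumes both searches return a list of dicts with `doc` and `psalm_num` keys, as `query_glove` does; whether the TF-IDF search uses the same format is an assumption. `top_k_overlap` is a hypothetical helper and the sample lists are placeholders, not real search output:

```python
def top_k_overlap(results_a, results_b):
    """Jaccard overlap between two result lists, keyed on (doc, psalm_num)."""
    keys_a = {(r["doc"], r["psalm_num"]) for r in results_a}
    keys_b = {(r["doc"], r["psalm_num"]) for r in results_b}
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0

# Placeholder result lists for illustration only
glove_hits = [{"doc": "Bible", "psalm_num": 1}, {"doc": "Psalter", "psalm_num": 23}]
tfidf_hits = [{"doc": "Bible", "psalm_num": 1}, {"doc": "Bible", "psalm_num": 50}]
overlap = top_k_overlap(glove_hits, tfidf_hits)
print(round(overlap, 2))
```

A value near 1 means the two methods surface the same psalms; a value near 0 means they disagree almost entirely.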



In [1]:
import sys
import os

notebook_dir = os.getcwd()
print("Notebook dir:", notebook_dir)

# Add scripts folder to sys.path
sys.path.append(os.path.abspath(os.path.join(notebook_dir, "..", "vectorization")))

print("Updated sys.path:", sys.path[-1])

# Import the function
from psalm_search import search_psalms
search_psalms(query)

Notebook dir: /Users/caden/st_david-s-beacon/website/scripts/word_embeddings
Updated sys.path: /Users/caden/st_david-s-beacon/website/scripts/vectorization


ModuleNotFoundError: No module named 'data_pipeline'