# Handling Queries

When performing information retrieval, user queries are usually significantly shorter than the documents they are matched against. This creates problems when weighting queries versus documents. The book *Speech and Language Processing* reports on an examination of over 1,000,000 queries, which found that the average query length was 2-3 words.
> Page 654, 1st edition.


## A Mitigation
As a mitigation, the same book recommends the following equation, taken from Salton and Buckley (1988):
$$
w_{kj} = \left(0.5 + \frac{0.5 \cdot tf_{jk}}{\max_j tf_{jk}}\right) \cdot idf_j
$$


This formula represents a weighting scheme for term frequency-inverse document frequency (TF-IDF), often used in information retrieval. The key components are:
- **$tf_{jk}$**: Term frequency of term **$j$** in document **$k$**.  
- **$\max_j tf_{jk}$**: Maximum term frequency of any term in document **$k$** (normalization factor).  
- **$idf_j$**: Inverse document frequency of term **$j$**, which accounts for how common or rare a term is across the entire corpus.  
- The term **$0.5 + \left(0.5 \cdot \frac{tf_{jk}}{\max_j tf_{jk}}\right)$** is known as **augmented term frequency**, which helps reduce the impact of large differences in raw term frequencies.
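
As a rough illustration of the weighting above, here is a minimal standard-library sketch. The function name `augmented_tf_idf`, the `log(N / df)` form of the inverse document frequency, and the document-frequency numbers are all assumptions made for this example; they do not come from the textbook or from this project's pipeline.

```python
import math
from collections import Counter


def augmented_tf_idf(query_tokens, doc_freq, num_docs):
    """Weight each term as (0.5 + 0.5 * tf / max_tf) * idf."""
    counts = Counter(query_tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    weights = {}
    for term, tf in counts.items():
        # Assumed idf form: log(N / df); terms unseen in the corpus get idf 0.
        df = doc_freq.get(term, 0)
        idf = math.log(num_docs / df) if df else 0.0
        weights[term] = (0.5 + 0.5 * tf / max_tf) * idf
    return weights


# Hypothetical document frequencies for a 100-document corpus.
doc_freq = {"lord": 50, "praise": 10, "soul": 5}
weights = augmented_tf_idf(["praise", "lord", "praise", "soul"], doc_freq, 100)
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

Because the term-frequency factor always lies between 0.5 and 1, a term that appears once in a short query is never weighted less than half as much as the most frequent term, before the idf is applied.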



### Cosine Similarity
Cosine similarity measures the similarity between two text vectors by calculating the cosine of the angle between them in a multi-dimensional space. It is commonly used in **TF-IDF** to compare query vectors with document vectors and find the most relevant matches.

> The textbook *Speech and Language Processing* gives the following equation for computing the similarity between a query and a document as the dot product of their term-weight vectors:
>
> $$
> sim(\vec{q_k},\vec{d_j}) = \vec{q_k}\cdot \vec{d_j} = \sum_{i=1}^N w_{i,k} \times w_{i,j}
> $$
>
> Where:
> - $\vec{q_k}$ is a specific query's vector
> - $\vec{d_j}$ is a specific document's vector
> - $w_{i,k}$ is the weight of term $i$ in the query
> - $w_{i,j}$ is the weight of term $i$ in the document

#### Fully Normalized
With all of that being said, we can arrive at a final cosine similarity equation that takes into account the normalization that has been discussed.

$$
sim(\vec{q_k},\vec{d_j}) = \frac{\sum_{i=1}^N w_{i,k} \times w_{i,j}}{\sqrt{\sum_{i=1}^N w_{i,k}^2} \times \sqrt{\sum_{i=1}^N w_{i,j}^2}}
$$


The numerator of the formula above is the dot product. On its own, the dot product is usually not a useful similarity measure, because it is sensitive to the actual magnitudes of the document and query vectors. Dividing by the vector lengths removes that sensitivity: the cosine is simply the dot product of the **normalized vectors**, which is the normalization introduced above.
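
To make the difference between the two measures concrete, here is a small standard-library sketch. The helper names `dot` and `cosine` and the toy weight vectors are made up for this example; returning 0 for an all-zero vector is a convention chosen here, not something taken from the textbook.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(u, v) / (norm_u * norm_v)


query = [1.0, 2.0, 0.0]
doc = [2.0, 1.0, 1.0]
long_doc = [3 * w for w in doc]  # same proportions, three times the magnitude

# The dot product grows with the longer document...
print(dot(query, doc), dot(query, long_doc))
# ...while the cosine stays the same, up to floating-point rounding.
print(cosine(query, doc), cosine(query, long_doc))
```

The tripled vector stands in for a document that repeats the same content three times: its raw dot product with the query triples, but its cosine with the query does not change.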

As another note, when the cosine of two vectors is 1, the vectors point in the same direction, meaning the two documents have identical relative term weights. When the cosine of two documents is 0, they share no weighted terms and are completely different from each other.

> This has all come from the textbook *Speech and Language Processing*, $1^{st}$ edition.

### Sentence Transformers
Sentence Transformers is a Python framework for computing dense vector representations (embeddings) of sentences, paragraphs, and documents. It extends BERT (Bidirectional Encoder Representations from Transformers) and related transformer-based models to generate meaningful numerical representations of text, enabling tasks such as semantic similarity, clustering, and retrieval.

This is going to be applied to each query after the query runs through the previously created data pipeline that all the psalms originally went through.

Enough has been said. Let's get back to work.

# Importing Some Pickles

## Data Pipeline

In [17]:
# Define the load directory where the pickle file is stored
load_dir = "../../data/pickles/"

!pip install num2words


In [18]:
import sys
import os
import pickle

# Get the current working directory and construct the cleaning path
current_dir = os.getcwd()
cleaning_dir = os.path.abspath(os.path.join(current_dir, "../cleaning"))

# Add the directory to sys.path
sys.path.append(cleaning_dir)

# Now import the class
from data_pipeline import TextPreprocessingPipeline

# Load the pickled preprocessing pipeline (unpickling needs TextPreprocessingPipeline to be importable)

with open("../../data/pickles/pipeline.pickle", "rb") as f:
    pipeline = pickle.load(f)

## Vectorizer

In [19]:
os.getcwd()

'/opt/notebooks/website/scripts/vectorization'

In [20]:
# Load the pickled vectorizer
vectorizer_path = os.path.join(load_dir, "psalms_tfidf_vectorizer.pickle")
with open(vectorizer_path, "rb") as file:
    psalm_vectorizer = pickle.load(file)

## TF-IDF Models

### Full Psalms Model
This model applies TF-IDF to entire psalms, treating each psalm as a single document. It captures the overall thematic and linguistic patterns across different traditions and translations.

In [21]:
# Load the precomputed TF-IDF matrix for full psalms

with open("../../data/pickles/psalms_tfidf_matrix.pickle", "rb") as file:
    tf_idf_psalms = pickle.load(file)

### Individual Verses Model

This model applies TF-IDF at the verse level, treating each verse as an individual document. It highlights local word usage variations and allows for a more granular analysis of textual similarities between specific verses.
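
The difference between the two granularities can be sketched with plain Python. The rows and placeholder verse strings below are invented for illustration, and this is not the code that built the pickled matrices; it only shows how the same verses can be keyed either per verse or per psalm.

```python
from collections import defaultdict

# Hypothetical rows: (text, psalm_num, verse_num, verse)
rows = [
    ("Bible", 1, 1, "first verse of psalm one"),
    ("Bible", 1, 2, "second verse of psalm one"),
    ("Bible", 2, 1, "first verse of psalm two"),
]

# Verse level: every verse is its own document, keyed by (text, psalm, verse).
verse_docs = {
    (text, psalm, verse_num): verse for text, psalm, verse_num, verse in rows
}

# Psalm level: verses of the same psalm are joined, in order, into one document.
grouped = defaultdict(list)
for text, psalm, verse_num, verse in sorted(rows):
    grouped[(text, psalm)].append(verse)
psalm_docs = {key: " ".join(verses) for key, verses in grouped.items()}

print(len(verse_docs), "verse-level documents")
print(len(psalm_docs), "psalm-level documents")
```

The keys mirror the index tuples that appear later in this notebook: two-part for whole psalms and three-part for individual verses.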

In [22]:
with open("../../data/pickles/psalm_verses_tfidf_matrix.pickle", "rb") as file:
    tf_idf_psalm_verses = pickle.load(file)

## Function to Process Queries and Find Similarities

A lot of this process and code comes from working through the steps needed to get the model running. The cosine similarity work was inspired by the *Medium* article: [TF-IDF Vectorization with Cosine Similarity](https://medium.com/@anurag-jain/tf-idf-vectorization-with-cosine-similarity-eca3386d4423).
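
The ranking idea itself is simple: score every document against the query, then keep the positions of the highest scores. Here is a standard-library sketch of that step, with a made-up helper name `top_n_indices` and made-up scores; it is an illustration, not the code used in the cells of this notebook.

```python
def top_n_indices(scores, n):
    """Return the positions of the n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical similarity scores for five documents.
scores = [0.10, 0.27, 0.00, 0.22, 0.19]
print(top_n_indices(scores, 3))  # [1, 3, 4]
```

In this sketch, documents with equal scores keep their original order, because Python's sort is stable even with `reverse=True`.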

In [23]:
from sklearn.metrics.pairwise import cosine_similarity


# sample query
query = "Praise the Lord with all my soul"

# Displaying the query
print(f"\033[1mSearching for:\033[0m '{query}'.\n")

# Running the query through the data pipeline
#clean_query = pipeline.pipeline(query)


# Transform the query using the loaded vectorizer
clean_vec = psalm_vectorizer.transform([query])

# Calculate the cosine similarity between the query vector and the TF-IDF matrix
cosine_similarities = cosine_similarity(clean_vec, tf_idf_psalms).flatten()

# Get the indices of the num_top most similar Psalms, best match first
num_top = 5
top_indices = cosine_similarities.argsort()[-num_top:][::-1]

# Retrieve the text and Psalm numbers of the top matching Psalms
top_psalms = tf_idf_psalms.iloc[top_indices]

n = 1
# Print the results with proper handling of the MultiIndex
for idx in top_indices:
    psalm_index = tf_idf_psalms.index[idx]
    psalm_num = psalm_index[1]  # Assuming psalm_num is the second part of the tuple

    # Adjust the index lookup based on top_psalms structure
    try:
        # If top_psalms uses a single index (psalm_num), use that for lookup
        print(f"#{n} -> Psalm {psalm_index}: Similarity: {cosine_similarities[idx]}")
        print(f"Text: {top_psalms.loc[psalm_num]['text']}")
    except KeyError:
        print(f"KeyError: {psalm_num} not found in top_psalms index.")
    
    n +=1

**Searching for:** 'Praise the Lord with all my soul'.

#1 -> Psalm ('Bible', 150): Similarity: 0.2741997878367659
KeyError: 150 not found in top_psalms index.
#2 -> Psalm ('Bible', 148): Similarity: 0.21750387527655166
KeyError: 148 not found in top_psalms index.
#3 -> Psalm ('Psalter', 148): Similarity: 0.19865598629932982
KeyError: 148 not found in top_psalms index.
#4 -> Psalm ('Psalter', 33): Similarity: 0.19586927071031507
KeyError: 33 not found in top_psalms index.
#5 -> Psalm ('Bible', 116): Similarity: 0.19307330426963282
KeyError: 116 not found in top_psalms index.

The similarity ranking works, but the text lookup fails: `tf_idf_psalms` is indexed by the pair (text, psalm number), so looking up `top_psalms.loc[psalm_num]` with the psalm number alone raises a `KeyError`. The cells below retrieve the psalm text from the cleaned verses CSV instead.


In [24]:
def search_psalms(query, vectorizer, model, num_results):
    # Displaying the query
    print(f"\033[1mSearching for:\033[0m {query}.\n")
    
    # Running the query through the data pipeline
    clean_query = pipeline.pipeline(query)

    # Transform the query using the loaded vectorizer
    clean_vec = vectorizer.transform([clean_query])

    # Calculate the cosine similarity between the query vector and the TF-IDF matrix
    cosine_similarities = cosine_similarity(clean_vec, model).flatten()

    # Get the indices of the top_n most similar Psalms, best match first
    top_indices = cosine_similarities.argsort()[-int(num_results):][::-1]

    return top_indices


## Indexing Function to Show Found Psalms
This is going to look up and print the first part of each psalm in the search results.

In [25]:
import pandas as pd
psalms = pd.read_csv("../../data/csv/cleaned_psalm_verses.csv")

psalms.head()

| Unnamed: 0 | tradition | text | psalm_num | verse_num | verse |
|---|---|---|---|---|---|
| 0 | Orthodox | Bible | 1 | 1 | Blessed is the man Who walks not in the counse... |
| 1 | Orthodox | Bible | 1 | 2 | But his will is in the law of the Lord And in ... |
| 2 | Orthodox | Bible | 1 | 3 | He shall be like a tree Planted by streams of ... |
| 3 | Orthodox | Bible | 1 | 4 | Not so are the ungodly not so But they are lik... |
| 4 | Orthodox | Bible | 1 | 5 | Therefore the ungodly shall not rise in the ju... |


In [26]:
found_indices = [24, 143, 69]
# Looping through the given indices
for index in found_indices:
    target_documents = tf_idf_psalms.index[index]
    
    # debug print statement
    print(target_documents, "\n")
    
    # Separating the index from "('<DOC>', ##)" -> <DOC> & ##
    doc, psalm_num = tf_idf_psalms.index[index]
    
    print(f"    Text: {doc}")
    print(f"    Psalm Number: {psalm_num}\n")
    
    # Using these to index the psalm text to print
    print("     "," ".join(psalms.loc[(psalms['text'] == doc)
                                      & (psalms['psalm_num'] == psalm_num), 'verse'].tolist())[:250] + "...")



('Bible', 25) 

    Text: Bible
    Psalm Number: 25

      Of David Judge me O Lord for I walk in my innocence And by hoping in the Lord I shall not weaken Prove me O Lord and test me Try my reins and my heart in the fire For Your mercy is before my eyes And I was wellpleasing in Your truth I have not sat do...
('Bible', 144) 

    Text: Bible
    Psalm Number: 144

      A praise by David Ishall exalt You my God and my King And I shall bless Your name forever and unto ages of ages Every day I shall bless You And praise Your name forever and unto ages of ages Great is the Lord and exceedingly praiseworthy And His grea...
('Bible', 70) 

    Text: Bible
    Psalm Number: 70

      By David of the sons of Jonadab and the first ones taken captive OGod in You I hope may I never be put to shame Deliver me in Your righteousness and set me free Incline Your ear to me and save me Be to me a God for protection And a strong place for s...


Since this has worked, let's turn it into a function. Then we can add it to the `get_sim_psalm` function.

In [27]:
def search_psalms(query, model, num_results):
    # Displaying the query
    print(f"\033[1mSearching for:\033[0m {query}.\n")
    
    # Reporting the number of results
    print(f"Top {num_results} results.")
    # Running the query through the data pipeline
    clean_query = pipeline.pipeline(query)

    # Transform the query using the loaded vectorizer
    clean_vec = psalm_vectorizer.transform([clean_query])

    # Calculate the cosine similarity between the query vector and the TF-IDF matrix
    cosine_similarities = cosine_similarity(clean_vec, model).flatten()
    
    # Get the indices of the top_n most similar Psalms
    top_indices = cosine_similarities.argsort()[-num_results:][::-1]

    # Build the results based on which model was given
    # for full psalms
    if model is tf_idf_psalms:
        # Looping through the indices to print them out
        for index in top_indices:
            # Debugging the index given
            # print(index)
            # Prepare results
            result = {"doc": model.index[index][0],
                      "psalm_num": model.index[index][1],
                      "similarity": round(cosine_similarities[index] * 100, 2)
                      }
            # Confirming the output
            # print(result)

            # Getting the Psalm text
            retrieve_psalm(result["doc"], result["psalm_num"])

    # for psalm verses
    elif model is tf_idf_psalm_verses:
        # Looping through the indices to print them out
        for index in top_indices:
            # Debugging the index given
            # print(index)
            # Prepare results
            result = {"doc": model.index[index][0],
                      "psalm_num": model.index[index][1],
                      "verse_num": model.index[index][2],
                      "similarity": round(cosine_similarities[index] * 100, 2)
                      }

            # Confirming the output
            # print(result)

            # Getting the Psalm verse
            retrieve_psalm_verses(result["doc"], result["psalm_num"], result["verse_num"])



In [28]:
def retrieve_psalm(doc, psalm_num):

    print(f"    Text: {doc}")
    print(f"    Psalm Number: {psalm_num}\n")

    # Retrieve and format the verse text as a paragraph
    matching_verses = psalms.loc[(psalms['text'] == doc) & (psalms['psalm_num'] == psalm_num), 'verse']

    if matching_verses.empty:
        print("    No matching Psalm found.")
        return

    # Join the verses into one string and remove leading/trailing spaces
    verse_text = " ".join(matching_verses.tolist()).strip()

    # Ensure the last full word is displayed within the first 200 characters
    if len(verse_text) > 200:
        verse_text = verse_text[:200]  # Slice to the first 200 characters
        last_space = verse_text.rfind(' ')  # Find the last space in the first 200 characters
        verse_text = verse_text[:last_space]  # Trim to the last full word

    
    # Print the first 200 characters (or last full word if it's too long)
    print("   " + verse_text + "...\n")
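
As an aside, the standard library offers a similar word-boundary truncation through `textwrap.shorten`. The sketch below uses a made-up sentence and a small `width` so the effect is visible. It behaves slightly differently from the function above: it collapses runs of whitespace, counts the placeholder as part of the width, and leaves text that already fits unchanged.

```python
import textwrap

# Hypothetical text with irregular spacing, for illustration only.
verse_text = "Praise the Lord   O my soul and all that is within me"

# Cut at a word boundary so the result, including "...", fits in 30 characters.
preview = textwrap.shorten(verse_text, width=30, placeholder="...")
print(preview)

# Text that already fits is returned without a placeholder.
print(textwrap.shorten("short text", width=30, placeholder="..."))
```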



Now that the function is working, we are going to put it into the `search_psalms` function. We are also going to take the above function and tailor a copy of it to the individual verses. I am also trying to print the matching verse in the context of where it is found within the target psalm itself. This is similar to seeing the bold text in the content of a result on **Google**. An example of this can be seen in the image below.
![bolded_text_ex](https://storage.googleapis.com/support-forums-api/attachment/thread-99245237-2782387974984291760.png)
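
The idea of showing a match inside its neighbouring verses can be sketched as follows. Both helper names (`context_window`, `render_excerpt`) and the placeholder verse texts are hypothetical. The window is clamped to the range 1..`last_verse`, which is one possible way to handle the first verse, the last verse, and psalms with fewer than three verses.

```python
def context_window(verse_num, last_verse, size=3):
    """Return up to `size` consecutive verse numbers around verse_num,
    kept within 1..last_verse."""
    size = min(size, last_verse)
    start = verse_num - size // 2
    start = max(1, min(start, last_verse - size + 1))
    return list(range(start, start + size))


def render_excerpt(verses, matched):
    """Join verse texts, wrapping the matched verse in ANSI bold codes."""
    parts = []
    for num, text in verses.items():
        parts.append(f"\033[1m{text}\033[0m" if num == matched else text)
    return " ".join(parts)


# The last verse of an 11-verse psalm is shown with the two verses before it.
window = context_window(11, 11)
print(window)  # [9, 10, 11]

# Placeholder texts stand in for the real verses.
verses = {num: f"verse {num} text." for num in window}
print(render_excerpt(verses, matched=11))
```

The ANSI codes `\033[1m` and `\033[0m` turn bold on and off in terminals and notebook output that support them.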


In [29]:
def retrieve_psalm_verses(doc, psalm_num, verse_num):

    print(f"    Text: {doc}")
    print(f"    Psalm Number: {psalm_num}")
    print(f"    Verse Number: {verse_num}\n")
    
    # Finding the number of verses for the specific psalm
    last_verse = psalms.loc[(psalms['text'] == doc) & 
                            (psalms['psalm_num'] == psalm_num),
                            'verse_num'].max()

    # Confirming `last_verse` is working
    #print(f"The last verse of the Psalm is: {last_verse}.")

    # Choose three verse numbers: the matching verse plus its neighbours
    if verse_num == 1:
        # First verse: show it with the two verses that follow
        verses = [verse_num, verse_num + 1, verse_num + 2]
    elif verse_num == last_verse:
        # Last verse: show it with the two verses before it
        verses = [verse_num-2, verse_num-1, verse_num]
    else:
        # Otherwise: show one verse on either side
        verses = [verse_num-1, verse_num, verse_num+1]

    # Checking the verses generated
    # print(f"The verses to print: {verses}")
    
    target_verses = {}

    # Loop to capture all the needed verses
    for verse in verses:
        # Retrieve and format the verse text as a paragraph
        matching_verse = psalms.loc[
            (psalms['text'] == doc) & 
            (psalms['psalm_num'] == psalm_num) &
            (psalms['verse_num'] == verse), 
            "verse"
        ]

        # Extract the first value from the Series if it exists
        if not matching_verse.empty:
            target_verses[verse] = matching_verse.iloc[0]  # Extract the first value
        else:
            target_verses[verse] = "Verse not found"

    
    
    # Building the Psalm excerpt
    excerpt = "    "
    # Add each verse to the excerpt
    for verse, verse_text in target_verses.items():
        # Bold the main verse found
        if verse == verse_num:
            excerpt += f"\033[1m {verse_text.strip('.')}.\033[0m"
        else:
            excerpt += f" {verse_text.strip('.')}. "

    print(excerpt)


## Testing Verses

In [30]:
query = "For the peace of the world."

search_psalms(query, tf_idf_psalm_verses, 5)

[1mSearching for:[0m For the peace of the world..

Top 5 results.
    Text: Psalter
    Psalm Number: 95
    Verse Number: 11

     Let the heavens rejoice, and let the earth be glad; let the sea be moved, and the fulness thereof. The fields shall rejoice, and all that is therein.  Then shall all the trees of the wood rejoice before the Lord; for He cometh, for He cometh to judge the earth. **He shall judge the world with righteousness, and the peoples with His truth.**
    Text: Psalter
    Psalm Number: 97
    Verse Number: 8

     Shout with jubilation before the Lord our King; let the sea be moved, and the fulness thereof; the world, and all they that dwell therein.  The rivers shall clap their hands, the mountains shall rejoice at the presence of the Lord; for He cometh, He cometh to judge the earth. **He shall judge the world with righteousness, and the peoples with equity.**
    Text: Psalter
    Psalm Number: 18
    Verse Number: 4

     There is no speech nor langu


## Testing Full Psalms

In [31]:
import nltk
# nltk.download('omw-1.4')

In [32]:
#query = input("Enter text to search the Psalms: ")
query = "For the peace of the world"

search_psalms(query, tf_idf_psalms, 5)

[1mSearching for:[0m For the peace of the world.

Top 5 results.
    Text: Psalter
    Psalm Number: 121

   I was glad because of them that said to me, Let us go into the house of the Lord. Our feet have stood within thy courts, O Jerusalem. Jerusalem is builded as a city that is compact together. For...

    Text: Bible
    Psalm Number: 121

   1An ode of ascents Iwas glad when they said to me Let us go into the house of the Lord Our feet stand in your courts O Jerusalem Jerusalem is built as a city Whose compactness is complete There the...

    Text: Psalter
    Psalm Number: 97

   O sing unto the Lord a new song, for the Lord hath done marvellous things. His right hand, and His holy arm, have wrought salvation for Him. The Lord hath made known His salvation; His righteousness...

    Text: Bible
    Psalm Number: 97

   A psalm by David Sing a new song to the Lord For He did wondrous things His right hand and His holy arm Saved peoples for Him The Lord made known His salvation