# Harry Potter corpus with transformers

This notebook builds a small semantic search pipeline over a Harry Potter-themed text corpus. Sentences are embedded with a transformer model, candidate sentences are pre-filtered with a keyword inverted index, and the candidates are then ranked by cosine similarity to the query.

## Objectives:
1. **Embedding Generation**: Use the `all-MiniLM-L6-v2` sentence-transformer model from Hugging Face to create an embedding for each sentence in the corpus.
2. **Inverted Index Construction**: Build an inverted index for efficient text retrieval.
3. **Similarity Computation**: Employ cosine similarity to evaluate query-to-document relationships.
4. **Interactive Search Application**: Integrate the developed system into an interactive app.


In [1]:
import pandas as pd  # The remaining libraries are imported in the cells that use them

In [6]:
# Book 1: the file is semicolon-delimited and Latin-1 encoded
hp1 = pd.read_csv('Harry Potter 1.csv', delimiter=';', encoding='latin1')
corpus = hp1['Sentence'].tolist()  # Corpus as a list of sentences; reassigned by the later loading cells

In [20]:
hp2 = pd.read_csv('Harry Potter 2.csv', delimiter=';', encoding='latin1')
# The sentence column's header carries trailing commas in this file; normalize it
hp2.rename(columns={'Sentence,,,,,,': 'Sentence'}, inplace=True)

corpus = hp2['Sentence'].tolist()  # Replaces the corpus loaded from book 1

In [36]:
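hp3 = pd.read_csv('Harry Potter 3.csv', delimiter=';', encoding='latin1')
# This file names the column 'SENTENCE'; rename it for consistency with the other books
hp3.rename(columns={'SENTENCE': 'Sentence'}, inplace=True)

corpus = hp3['Sentence'].tolist()  # Replaces the book 2 corpus; this is the corpus used in the steps below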
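
Each loading cell above assigns `corpus` afresh, so only the last book loaded is searched. If the intent were to search all three books at once, one option would be to concatenate the sentence columns. The following is a minimal sketch under that assumption: it uses small in-memory frames in place of the CSV files, and `combine_sentences` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the three CSV files, using the column names seen above
toy_frames = [
    pd.DataFrame({"Sentence": ["A1", "A2"]}),
    pd.DataFrame({"Sentence,,,,,,": ["B1"]}),
    pd.DataFrame({"SENTENCE": ["C1", None]}),
]
toy_columns = ["Sentence", "Sentence,,,,,,", "SENTENCE"]


def combine_sentences(frames, columns):
    """Concatenate the sentence column of each frame into one list (hypothetical helper)."""
    sentences = []
    for frame, column in zip(frames, columns):
        # Drop missing rows so that only real strings reach the encoder
        sentences.extend(frame[column].dropna().astype(str).tolist())
    return sentences


combined = combine_sentences(toy_frames, toy_columns)
print(combined)  # ['A1', 'A2', 'B1', 'C1']
```

With the real files, the list returned here would take the place of `corpus` before the encoding step.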

In [33]:
# Step 3: Load the Sentence Transformer Model
# This model will be used to transform sentences in the corpus into embeddings:


from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [37]:
# Step 4: Encode the Corpus into Embeddings
# Transform each sentence in the corpus into a 384-dimensional vector (the output size of all-MiniLM-L6-v2):


embeddings = model.encode(corpus)

In [38]:
# Step 5: Build an Inverted Index
# An inverted index will facilitate quick look-ups by keywords found in the corpus:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
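matrix = vectorizer.fit_transform(corpus).tocsc()  # CSC format makes column slicing cheap
features = vectorizer.get_feature_names_out()

# Create the inverted index: each word maps to the row indices of the sentences that contain it
inverted_index = {}
for i, word in enumerate(features):
    indices = matrix[:, i].nonzero()[0]
    inverted_index[word] = indices.tolist()  # Plain Python ints instead of NumPy scalars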
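
To make the data structure concrete, here is a dependency-free sketch of an inverted index over a three-sentence toy corpus. It is an illustration rather than the notebook's implementation: `build_inverted_index` is a hypothetical helper, the token pattern is chosen to resemble `CountVectorizer`'s default, and stop-word removal is omitted for brevity.

```python
import re
from collections import defaultdict


def build_inverted_index(sentences):
    """Map each lowercase token to the sorted positions of the sentences containing it."""
    index = defaultdict(set)
    for position, sentence in enumerate(sentences):
        # Tokens are runs of two or more word characters (an assumption of this sketch)
        for token in re.findall(r"\b\w\w+\b", sentence.lower()):
            index[token].add(position)
    return {token: sorted(positions) for token, positions in index.items()}


toy_corpus = ["Harry Potter.", "Professor McGonagall!", "Harry, look at Hedwig."]
toy_index = build_inverted_index(toy_corpus)

print(toy_index["harry"])   # [0, 2]
print(toy_index["hedwig"])  # [2]
```

A lookup is a single dictionary access, which is what makes the index a cheap pre-filter before the more expensive similarity ranking.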

In [39]:
# Step 6: Define a Function for Cosine Similarity Searches
# This function will rank sentences in the corpus based on their semantic similarity to the query:


from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

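def find_similar_sentences(query, embeddings, corpus, top_k=5):
    """Return up to top_k (sentence, score) pairs, most similar first."""
    # Encode the query with the same model used for the corpus
    query_embedding = model.encode([query])
    # One similarity score per row of `embeddings`
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    # argsort is ascending, so reverse it and keep the first top_k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [(corpus[i], similarities[i]) for i in top_indices]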
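
The ranking step can be checked by hand on toy vectors. Cosine similarity between a query q and a document d is q·d / (‖q‖ ‖d‖). The sketch below computes it directly with NumPy instead of calling scikit-learn, and `cosine_scores` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np


def cosine_scores(query_vector, document_matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_norm = np.linalg.norm(query_vector)
    document_norms = np.linalg.norm(document_matrix, axis=1)
    return document_matrix @ query_vector / (document_norms * query_norm)


# Three 2-D "documents": same direction as the query, 45 degrees off, and orthogonal
toy_documents = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_query = np.array([1.0, 0.0])

scores = cosine_scores(toy_query, toy_documents)
ranking = np.argsort(scores)[::-1]  # Indices from most to least similar

print(ranking.tolist())  # [0, 1, 2]
```

The third document is twice as long as the others yet scores zero, which shows that only direction matters to the measure, not vector length.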

In [40]:
# Step 7: Search Function Using the Inverted Index
# This function first finds relevant documents using the inverted index and then ranks them using cosine similarity:


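def search_query(query):
    # Tokenize by lowercasing and splitting on whitespace. Punctuation is not
    # stripped, so a token such as "philosopher's" will not match an index key.
    query_words = query.lower().split()
    relevant_docs = set()
    for word in query_words:
        if word in inverted_index:
            relevant_docs.update(inverted_index[word])

    # Fix the candidate order once so sentences and embeddings stay aligned
    doc_ids = sorted(relevant_docs)
    relevant_corpus = [corpus[i] for i in doc_ids]
    relevant_embeddings = [embeddings[i] for i in doc_ids]

    if relevant_embeddings:
        return find_similar_sentences(query, np.array(relevant_embeddings), relevant_corpus)
    else:
        # Note: this branch returns a string, while the branch above returns a list
        return "No relevant sentences found."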
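
The search function splits the query on whitespace, whereas the index keys were produced by `CountVectorizer`, whose default token pattern is `(?u)\b\w\w+\b`. The two can disagree on punctuated words. The sketch below compares them on the first query used in this notebook; it applies the pattern with `re` directly and leaves out stop-word removal, so it approximates the vectorizer's behavior rather than reproducing it.

```python
import re

example_query = "Harry Potter and the Philosopher's Stone"

# Tokenization as done in search_query
whitespace_tokens = example_query.lower().split()

# Tokenization with CountVectorizer's default token pattern
pattern_tokens = re.findall(r"(?u)\b\w\w+\b", example_query.lower())

print(whitespace_tokens)
# ['harry', 'potter', 'and', 'the', "philosopher's", 'stone']
print(pattern_tokens)
# ['harry', 'potter', 'and', 'the', 'philosopher', 'stone']
```

One way to keep both sides consistent would be to tokenize queries with `vectorizer.build_analyzer()`, which also applies the stop-word list. That would be a change to the notebook's behavior and is not what the cell above does.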

In [41]:
# Step 8: Execute a Query
# Run a search to see how the system performs:


query = "Harry Potter and the Philosopher's Stone"
result = search_query(query)
print(result)

[('Harry Potter', np.float32(0.75656706)), ('Harry Potter.', np.float32(0.7229435)), ('Harry Potter.', np.float32(0.7229435)), ('Potter.', np.float32(0.7177903)), ('Harry Potter?', np.float32(0.713891))]


In [42]:

query = "Harry Potter"
result = search_query(query)
print(result)

[('Harry Potter', np.float32(1.0000001)), ('Harry Potter.', np.float32(0.96186876)), ('Harry Potter.', np.float32(0.96186876)), ('Harry Potter?', np.float32(0.881592)), ('Potter.', np.float32(0.86562777))]


In [43]:
query = "McGonagall"
result = search_query(query)
print(result)

[('Professor McGonagall!', np.float32(0.828662)), ('McGonagall gave it to me first term.', np.float32(0.697512)), ("He won't keep it. He'll turn it over to Professor McGonagall. Aren't you?", np.float32(0.5410545))]


In [44]:
query = "Hedwig"
result = search_query(query)
print(result)

[('Hedwig.', np.float32(0.9656972)), ('Hedwig.', np.float32(0.9656972))]
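
Several of the outputs above contain the same sentence more than once, because the corpus itself has repeated lines. If repeated hits are not wanted, the result list could be filtered after ranking. This is a minimal sketch, with `deduplicate_results` as a hypothetical helper and the sample scores copied from the `Hedwig` output as plain floats.

```python
def deduplicate_results(results):
    """Keep the first (highest-ranked) occurrence of each sentence."""
    seen = set()
    unique = []
    for sentence, score in results:
        if sentence not in seen:
            seen.add(sentence)
            unique.append((sentence, score))
    return unique


sample_results = [("Hedwig.", 0.9656972), ("Hedwig.", 0.9656972)]
deduplicated = deduplicate_results(sample_results)

print(deduplicated)  # [('Hedwig.', 0.9656972)]
```

Because the input is already sorted by score, keeping the first occurrence preserves the ranking.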


## The procedure to run the app

- Use Python 3.11 (`python=3.11`); the commands below refer to a conda environment named `harry`.
- `conda install -n harry ipywidgets`
- `pip install pandas scikit-learn sentence-transformers`
- `pip install nltk`
- `pip install flask`
- `python app.py` (to run the app)