
# Install libraries
*   **Purpose**: Install necessary Python libraries.
*   **Operations**: Uses `!pip install` to install `sentence-transformers`, `faiss-cpu`, `pandas`, `transformers`, `torch`, and `datasets`.
*   **Contribution**: Makes the external dependencies available for the later steps: sentence embeddings, vector search, dataset loading, and data manipulation for the quote retrieval system.

In [1]:
!pip install sentence-transformers faiss-cpu pandas transformers torch datasets

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.2


# Import modules
*   **Purpose**: Import all required Python modules and set up an environment variable.
*   **Operations**: Imports `torch`, `load_dataset` (from `datasets`), `SentenceTransformer`, `InputExample`, `losses`, `models`, `util` (from `sentence_transformers`), `DataLoader` (from `torch.utils.data`), `AutoTokenizer`, `AutoModelForSequenceClassification` (from `transformers`), `faiss`, `numpy`, `pickle`, and `os`. Sets `os.environ["WANDB_DISABLED"] = "true"` to disable Weights & Biases logging.
*   **Contribution**: Makes all necessary functions and classes available for use throughout the notebook, preparing the environment for data loading, model definition, training, and indexing operations.

In [8]:
import os
import pickle

import faiss
import numpy as np
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Disable Weights & Biases logging during training.
# Note: the training output later in this notebook reports this variable as deprecated.
os.environ["WANDB_DISABLED"] = "true"

# Data Preparation
*   **Purpose**: Load the quotes dataset into a pandas DataFrame.
*   **Operations**: Uses `load_dataset("Abirate/english_quotes", split='train')` to fetch data from Hugging Face and then `data.to_pandas()` to convert it into a DataFrame named `quotes_data`.
*   **Contribution**: Provides the raw text data that will be cleaned, processed, and used to train and build the quote retrieval system.

In [9]:
data = load_dataset("Abirate/english_quotes", split='train')

quotes_data = data.to_pandas()

In [10]:
quotes_data


| | quote | author | tags |
|---|---|---|---|
| 0 | “Be yourself; everyone else is already taken.” | Oscar Wilde | [be-yourself, gilbert-perreira, honesty, inspi... |
| 1 | “I'm selfish, impatient and a little insecure.... | Marilyn Monroe | [best, life, love, mistakes, out-of-control, t... |
| 2 | “Two things are infinite: the universe and hum... | Albert Einstein | [human-nature, humor, infinity, philosophy, sc... |
| 3 | “So many books, so little time.” | Frank Zappa | [books, humor] |
| 4 | “A room without books is like a body without a... | Marcus Tullius Cicero | [books, simile, soul] |
| ... | ... | ... | ... |
| 2503 | “Morality is simply the attitude we adopt towa... | Oscar Wilde, | [morality, philosophy] |
| 2504 | “Don't aim at success. The more you aim at it ... | Viktor E. Frankl, | [happiness, success] |
| 2505 | “In life, finding a voice is speaking and livi... | John Grisham | [inspirational-life] |
| 2506 | “Winter is the time for comfort, for good food... | Edith Sitwell | [comfort, home, winter] |


# EDA
*   **Purpose**: Illustrate potential data inconsistencies or encoding issues in the raw quote text.
*   **Operations**: Iterates and prints a small sample of quotes (`quotes_data.quote[480:490]`).
*   **Contribution**: Highlights the need for data cleaning by showcasing problematic characters (e.g., "Donâ€™") that require preprocessing for accurate text analysis.
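
Printing a slice shows one affected quote, but it does not show how widespread the problem is. The sketch below counts affected strings in a small hand-made sample; the `sample_quotes` list and the `MOJIBAKE` pattern are illustrative assumptions, not part of the notebook.

```python
import re

# "â€" is the usual prefix left behind when UTF-8 punctuation is decoded as cp1252.
MOJIBAKE = re.compile("â€")

# Three strings abbreviated from the notebook's printed sample (rows 480-489).
sample_quotes = [
    "“Donâ€™t go around saying the world owes you a living.”",
    "“...but you laugh inside â€” remembering all the times you've felt that way.”",
    "“The world is a book and those who do not travel read only one page.”",
]

flagged = [q for q in sample_quotes if MOJIBAKE.search(q)]
print(f"{len(flagged)} of {len(sample_quotes)} sample quotes contain mojibake")
```

On the full DataFrame, a comparable count could be obtained with `quotes_data["quote"].str.contains("â€").sum()`.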

In [108]:
# Rows 480-489 include mojibake: the sixth quote printed begins with "Donâ€™t".
for quote in quotes_data.quote[480:490]:
    print(quote, "\n")


“Sometimes you climb out of bed in the morning and you think, I'm not going to make it, but you laugh inside â€” remembering all the times you've felt that way.” 

“The world is a book and those who do not travel read only one page.” 

“There are no good girls gone wrong - just bad girls found out.” 

“You get a little moody sometimes but I think that's because you like to read. People that like to read are always a little fucked up.” 

“Never go to bed mad. Stay up and fight.” 

“Donâ€™t go around saying the world owes you a living. The world owes you nothing. It was here first.” 

“If we knew what it was we were doing, it would not be called research, would it?” 

“One day I will find the right words, and they will be simple.” 

“Life can only be understood backwards; but it must be lived forwards.” 

“Faithless is he that says farewell when the road darkens.” 



*   **Purpose**: Demonstrate the text cleaning logic on a single string.
*   **Operations**: Takes a sample string `text`. If it contains 'â', the cell re-encodes the string as `cp1252` and decodes the bytes as UTF-8 to reverse the mojibake. It then replaces any remaining known artifacts from a dictionary, applies NFKD Unicode normalization, and maps curly quotation marks and apostrophes to their ASCII equivalents. The cell imports `re` and `unicodedata` itself, so it does not depend on another cell having run first.
*   **Contribution**: Shows the cleaning steps in isolation, so the transformation is easy to follow before it is wrapped in a reusable function.

In [109]:
import re
import unicodedata

text = "I guess thatâ€™s just part of loving people: You have to give things up. Sometimes you even have to give them up."

# "â€™" is a UTF-8 apostrophe that was wrongly decoded as cp1252; reverse that round trip.
if 'â' in text:
    text = text.encode('cp1252').decode('utf-8')

# Fallback for artifacts that survive. The bare "â€" prefix comes last
# so it cannot clobber the longer sequences that start with it.
artifacts = {
    "â€™": "'", "â€œ": '"', "â€˜": "'",
    "â€”": "—", "â€“": "-", "â€": '"'
}
for art, fix in artifacts.items():
    text = text.replace(art, fix)

text = unicodedata.normalize('NFKD', text)
text = re.sub(r"[\u201c\u201d\u201e\u201f\u2033\u2036]", '"', text)
text = re.sub(r"[\u2018\u2019\u201a\u201b\u2032\u2035\u0060]", "'", text)
text

"I guess that's just part of loving people: You have to give things up. Sometimes you even have to give them up."
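
To see where the `â€™` sequence comes from and why the `cp1252` round trip repairs it, the following standard-library sketch reproduces the corruption on a short string. It also shows a limitation: the reversal applies to the whole string, so it fails when mojibake is mixed with genuine curly quotes, as in the dataset's quotes that are wrapped in `“ ”`. The variable names are illustrative.

```python
# A right single quotation mark (U+2019) occupies three bytes in UTF-8.
original = "Don\u2019t"
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as cp1252 produces the familiar artifact.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Donâ€™t

# Reversing the mistake restores the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

# The reversal is all-or-nothing: genuine curly quotes encode to single
# cp1252 bytes (0x93, 0x94) that are not valid UTF-8, so decoding fails.
mixed = "\u201c" + garbled + "\u201d"
try:
    mixed.encode("cp1252").decode("utf-8")
    round_trip_ok = True
except (UnicodeEncodeError, UnicodeDecodeError):
    round_trip_ok = False

print(round_trip_ok)  # False
```

When the round trip fails like this, only a substring-level fallback such as the `artifacts` dictionary can still repair the text.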

*   **Purpose**: Define a comprehensive function for cleaning quote text.
*   **Operations**: Defines `clean_quote_text(quote_text)`, which returns an empty string for non-string input and otherwise unescapes HTML entities, attempts to reverse `cp1252` mojibake, replaces known mojibake artifacts, applies NFKD Unicode normalization, maps curly quotes and apostrophes to ASCII, strips remaining non-ASCII characters, collapses whitespace, lowercases the text, and replaces `...` with a space. The cell imports `re`, `html`, `unicodedata`, and `numpy` at the top level, not inside the function.
*   **Contribution**: Encapsulates all necessary text preprocessing steps into a reusable function, ensuring consistent and robust cleaning of the entire dataset.
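
A detail that matters for any pipeline combining these steps is the order of NFKD normalization and artifact replacement. The standard-library sketch below, written for illustration only, suggests that NFKD rewrites the characters of `â€™`, so a dictionary lookup for that sequence can only match if it runs before normalization.

```python
import unicodedata

artifact = "\u00e2\u20ac\u2122"  # the three characters "â€™"

# NFKD splits "â" into "a" plus a combining circumflex and expands "™" to "TM".
normalized = unicodedata.normalize("NFKD", artifact)
print([hex(ord(ch)) for ch in normalized])

# A lookup for the original artifact no longer matches after normalization ...
matches_after = artifact in normalized

# ... and stripping non-ASCII characters then leaves stray letters behind.
leftover = "".join(ch for ch in normalized if ord(ch) < 128)
print(matches_after, repr(leftover))
```

Under this behaviour, "Donâ€™t" would end up as "DonaTMt" if normalization and ASCII stripping ran first and the round-trip repair had failed.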

In [11]:
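import re
import html
import unicodedata
import numpy as np

def clean_quote_text(quote_text):
    # Non-string input (None, NaN, ...) becomes an empty string.
    if not isinstance(quote_text, str):
        return ""

    text = html.unescape(quote_text)

    # Try to reverse mojibake: UTF-8 bytes that were decoded as cp1252.
    # This fails when the string also holds genuine curly quotes, hence the fallback.
    try:
        if 'â' in text:
            text = text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass

    # Fallback replacements run before NFKD, which would rewrite "â" and "™"
    # and stop these keys from matching. The bare "â€" prefix must come last.
    artifacts = {
        "â€™": "'", "â€œ": '"', "â€˜": "'",
        "â€”": "—", "â€“": "-", "â€": '"'
    }
    for art, fix in artifacts.items():
        text = text.replace(art, fix)

    text = unicodedata.normalize('NFKD', text)
    text = re.sub(r"[\u201c\u201d\u201e\u201f\u2033\u2036]", '"', text)
    text = re.sub(r"[\u2018\u2019\u201a\u201b\u2032\u2035\u0060]", "'", text)

    # Drop remaining non-ASCII characters, then lowercase and replace ellipses
    # before collapsing whitespace so no double or trailing spaces remain.
    text = re.sub(r'[^\x00-\x7f]', '', text)
    text = text.lower().replace("...", " ")
    return re.sub(r'\s+', ' ', text).strip()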



*   **Purpose**: Apply the cleaning function to the dataset's quote and tags columns.
*   **Operations**: Creates `quotes_data['quote_clean']` by applying `clean_quote_text` to each quote, and `quotes_data['tags_clean']` by applying it to every tag in each row's tag list (rows whose tags are not a list or NumPy array get an empty list). It also prints the raw and cleaned quote for the first row.
*   **Contribution**: Transforms the raw text data into a clean, standardized format suitable for model training and embedding generation, removing noise and inconsistencies.

In [12]:
quotes_data['quote_clean'] = quotes_data['quote'].apply(clean_quote_text)

quotes_data['tags_clean'] = quotes_data['tags'].apply(lambda t_list: [clean_quote_text(t) for t in t_list] if isinstance(t_list, (list, np.ndarray)) else [])

print(quotes_data[['quote', 'quote_clean']].iloc[0].values)

['“Be yourself; everyone else is already taken.”'
 '"be yourself; everyone else is already taken."']


In [13]:
quotes_data

| | quote | author | tags | quote_clean | tags_clean |
|---|---|---|---|---|---|
| 0 | “Be yourself; everyone else is already taken.” | Oscar Wilde | [be-yourself, gilbert-perreira, honesty, inspi... | "be yourself; everyone else is already taken." | [be-yourself, gilbert-perreira, honesty, inspi... |
| 1 | “I'm selfish, impatient and a little insecure.... | Marilyn Monroe | [best, life, love, mistakes, out-of-control, t... | "i'm selfish, impatient and a little insecure.... | [best, life, love, mistakes, out-of-control, t... |
| 2 | “Two things are infinite: the universe and hum... | Albert Einstein | [human-nature, humor, infinity, philosophy, sc... | "two things are infinite: the universe and hum... | [human-nature, humor, infinity, philosophy, sc... |
| 3 | “So many books, so little time.” | Frank Zappa | [books, humor] | "so many books, so little time." | [books, humor] |
| 4 | “A room without books is like a body without a... | Marcus Tullius Cicero | [books, simile, soul] | "a room without books is like a body without a... | [books, simile, soul] |
| ... | ... | ... | ... | ... | ... |
| 2503 | “Morality is simply the attitude we adopt towa... | Oscar Wilde, | [morality, philosophy] | "morality is simply the attitude we adopt towa... | [morality, philosophy] |
| 2504 | “Don't aim at success. The more you aim at it ... | Viktor E. Frankl, | [happiness, success] | "don't aim at success. the more you aim at it ... | [happiness, success] |
| 2505 | “In life, finding a voice is speaking and livi... | John Grisham | [inspirational-life] | "in life, finding a voice is speaking and livi... | [inspirational-life] |
| 2506 | “Winter is the time for comfort, for good food... | Edith Sitwell | [comfort, home, winter] | "winter is the time for comfort, for good food... | [comfort, home, winter] |


*   **Purpose**: Initialize a pre-trained Sentence Transformer model.
*   **Operations**: Loads the `all-MiniLM-L6-v2` model with `SentenceTransformer("all-MiniLM-L6-v2")`. This compact model maps each sentence to a 384-dimensional vector.
*   **Contribution**: Provides the pre-trained sentence-embedding model that the later cells fine-tune and then use to encode quotes and queries.


In [116]:
model = SentenceTransformer("all-MiniLM-L6-v2")

*   **Purpose**: Prepare training examples for fine-tuning the Sentence Transformer model.
*   **Operations**: Iterates through `quotes_data` and, for each quote, builds two query strings from the author and the first two cleaned tags: `"quotes about {tags_str} by {author}"` and `"{tags_str} wisdom from {author}"`. Each query is paired with the cleaned quote text in an `InputExample`, and the pairs are collected in `train_examples`.
*   **Contribution**: Generates a structured dataset for contrastive learning, enabling the fine-tuning of the Sentence Transformer to better understand the semantic relationship between descriptive queries and relevant quotes.
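
As a worked example of the query templates, the sketch below builds both queries for row 3 of the dataset (Frank Zappa, tags `[books, humor]`). The helper `build_queries` is hypothetical and plain tuples stand in for `InputExample` so that the snippet runs without `sentence_transformers`.

```python
def build_queries(author, tags):
    """Build the two query strings for one quote, using at most the first two tags."""
    tags_str = ", ".join(tags[:2])
    return [f"quotes about {tags_str} by {author}", f"{tags_str} wisdom from {author}"]

# Row 3 as displayed earlier in the notebook.
quote_text = '"so many books, so little time."'
pairs = [(query, quote_text) for query in build_queries("Frank Zappa", ["books", "humor"])]
for query, quote in pairs:
    print(query, "->", quote)

# Edge case: a row without tags yields a query with an empty topic.
no_tag_queries = build_queries("Frank Zappa", [])
print(no_tag_queries)
```

The edge case suggests that rows with empty tag lists produce queries such as `quotes about  by Frank Zappa`, which may be worth filtering out before training.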

In [114]:
from sentence_transformers import InputExample

train_examples = []

for _, row in quotes_data.iterrows():
    author = row['author']

    tags_str = ", ".join(row['tags_clean'][:2])
    quote_text = row['quote_clean']


    query_1 = f"quotes about {tags_str} by {author}"

    query_2 = f"{tags_str} wisdom from {author}"

    train_examples.append(InputExample(texts=[query_1, quote_text]))
    train_examples.append(InputExample(texts=[query_2, quote_text]))



*   **Purpose**: Confirm the number of training pairs generated.
*   **Operations**: Prints the length of the `train_examples` list. Each quote contributes two pairs, so the 5016 pairs reported below correspond to 2508 quotes.
*   **Contribution**: Provides a quick check of the dataset size for fine-tuning, which can be useful for estimating training time and resource allocation.

In [115]:
print(f"Created {len(train_examples)} training pairs.")

Created 5016 training pairs.


*   **Purpose**: Set up the DataLoader and loss function for model fine-tuning.
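*   **Operations**: Wraps `train_examples` in a `DataLoader` with shuffling and a batch size of 24. Initializes `MultipleNegativesRankingLoss`, a contrastive loss commonly used to train retrieval (bi-encoder) models: within a batch, each query's paired quote is the positive, and the quotes paired with the other queries serve as negatives.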
*   **Contribution**: Configures the training pipeline by preparing data for efficient batch processing and selecting an appropriate loss function to optimize the model's ability to rank relevant items higher.
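
The idea behind in-batch negatives can be sketched in NumPy. This is a simplified illustration, not the library's implementation: it assumes cosine similarity and a scale factor of 20, and the function name `mnr_loss` and the random toy embeddings are made up for the example.

```python
import numpy as np

def mnr_loss(query_emb, quote_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    # Normalize rows so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = quote_emb / np.linalg.norm(quote_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # shape: (batch, batch)

    # Row i should give its highest score to column i, its own quote.
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
quotes = rng.normal(size=(4, 8))
aligned_queries = quotes + 0.01 * rng.normal(size=(4, 8))  # each query near its quote
shuffled_queries = aligned_queries[[1, 2, 3, 0]]           # each query near a wrong quote

loss_aligned = mnr_loss(aligned_queries, quotes)
loss_shuffled = mnr_loss(shuffled_queries, quotes)
print(loss_aligned < loss_shuffled)
```

One consequence for this notebook: every quote appears in two training pairs, so both copies can land in the same batch, where the second copy is treated as a negative for the first.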

In [117]:
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=24)

train_loss = losses.MultipleNegativesRankingLoss(model=model)


# Fit Model
*   **Purpose**: Fine-tune the Sentence Transformer model.
*   **Operations**: Calls `model.fit()` for one epoch with `train_dataloader` and `train_loss`. The learning rate warms up over the first 10% of the training steps.
*   **Contribution**: Adapts the pre-trained model to quotes and their descriptive queries, with the aim of making its embeddings more useful for this retrieval task. The notebook does not evaluate the model before and after fine-tuning, so the size of any improvement is not measured.
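
The warm-up length follows from the dataset size and batch size reported in the notebook. A short calculation, using only those two numbers:

```python
import math

num_pairs = 5016   # training pairs reported by the notebook
batch_size = 24    # batch size passed to the DataLoader

# len(train_dataloader) is the number of batches per epoch.
steps_per_epoch = math.ceil(num_pairs / batch_size)
warmup_steps = int(steps_per_epoch * 0.1)

print(steps_per_epoch, warmup_steps)  # 209 20
```

With a single epoch, the model therefore sees 209 optimizer steps in total, 20 of them during warm-up.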

In [118]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(len(train_dataloader) * 0.1),
    show_progress_bar=True
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).




*   **Purpose**: Save the fine-tuned Sentence Transformer model.
*   **Operations**: Uses `model.save('fine-tuned-quote-retriever')` to store the trained model to a local directory.
*   **Contribution**: Persists the fine-tuned model, allowing it to be reused later without needing to retrain, saving computational resources and time during future inference or deployment.

In [125]:
model.save('fine-tuned-quote-retriever')

*   **Purpose**: Generate embeddings for all quotes, build a FAISS index, and save all components of the retrieval system.
*   **Operations**: Encodes all `quote_clean` texts into vector embeddings with the model held in memory. Initializes a `faiss.IndexFlatL2` index with the embedding dimension and adds the embeddings; this index type performs exact, brute-force search using squared Euclidean distance. Saves the FAISS index to 'quotes_vector_db.faiss' and the `quotes_data` DataFrame (metadata) to 'quotes_metadata.pkl'.
*   **Contribution**: Creates the two stored components of the retrieval system: a searchable vector index for similarity lookups and a metadata file that maps result positions back to quotes and authors.
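
What a flat L2 index computes can be sketched in NumPy. The function `flat_l2_search` below is an illustrative stand-in rather than FAISS code, and the random vectors are toy data. It also checks a property relevant to sentence embeddings: for unit-length vectors, ranking by squared L2 distance matches ranking by cosine similarity.

```python
import numpy as np

def flat_l2_search(database, queries, k):
    """Return (squared distances, indices) of the k nearest database rows."""
    # Pairwise squared Euclidean distances, shape (num_queries, num_database).
    diff = queries[:, None, :] - database[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, nearest, axis=1), nearest

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 16)).astype("float32")
db /= np.linalg.norm(db, axis=1, keepdims=True)        # unit-length rows
query = db[[7]] + 0.01                                 # a vector close to row 7
query /= np.linalg.norm(query, axis=1, keepdims=True)

distances, indices = flat_l2_search(db, query, k=3)

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
cosine = (query @ db.T)[0]
print(indices[0], distances[0])
```

Because every query is compared against every stored vector, the cost grows linearly with the number of quotes, which is unproblematic for a few thousand rows.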

In [14]:
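print("Generating embeddings for the dataset... (this may take a minute)")
embeddings = model.encode(quotes_data['quote_clean'].tolist(), show_progress_bar=True)

# Exact (brute-force) L2 index sized to the embedding dimension.
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))

faiss.write_index(index, "quotes_vector_db.faiss")
print("Vector database saved as 'quotes_vector_db.faiss'")

quotes_data.to_pickle("quotes_metadata.pkl")
print("Metadata saved as 'quotes_metadata.pkl'")

print("\nAll components saved! You can now close this session.")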


Generating embeddings for the dataset... (this may take a minute)


Batches:   0%|          | 0/79 [00:00<?, ?it/s]

Vector database saved as 'quotes_vector_db.faiss'
Metadata saved as 'quotes_metadata.pkl'

All components saved! You can now close this session.


*   **Purpose**: Implement and demonstrate the quote retrieval system.
*   **Operations**: Defines `load_rag_system()` to load the fine-tuned model, FAISS index, and metadata from their saved files. Defines `search(query, model, index, metadata, k=3)`, which encodes a user query into a vector, runs a FAISS nearest-neighbor search, and prints the top `k` quotes with their authors. Finally, the cell loads the system and runs a sample search for "quotes about hope by Oscar Wilde".
*   **Contribution**: Provides the query interface for the quote retrieval system and demonstrates the pipeline end to end. In the sample output, all three results are about hope but none is by Oscar Wilde, which shows that the author named in a query influences the embedding but is not enforced as a filter.

In [18]:
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer

def load_rag_system():
    print("Loading model...")
    model = SentenceTransformer('fine-tuned-quote-retriever')

    print("Loading vector database...")
    index = faiss.read_index("quotes_vector_db.faiss")

    print("Loading metadata...")
    metadata = pd.read_pickle("quotes_metadata.pkl")

    return model, index, metadata

def search(query, model, index, metadata, k=3):

    query_vector = model.encode([query]).astype('float32')


    distances, indices = index.search(query_vector, k)


    print(f"\nTop {k} Results for: '{query}'")
    for i, idx in enumerate(indices[0]):
        quote = metadata.iloc[idx]['quote_clean']
        author = metadata.iloc[idx]['author']
        print(f"{i+1}. \"{quote}\" — {author}")


model_loaded, index_loaded, metadata_loaded = load_rag_system()


search("quotes about hope by Oscar Wilde", model_loaded, index_loaded, metadata_loaded)

Loading model...
Loading vector database...
Loading metadata...

Top 3 Results for: 'quotes about hope by Oscar Wilde'
1. ""hope is a waking dream."" — Aristotle
2. ""hope is the thing with feathers that perches in the soul and sings the tune without the words and never stops at all."" — Emily Dickinson
3. ""when you have lost hope, you have lost everything. and when you think all is lost, when all is dire and bleak, there is always hope."" — Pittacus Lore,
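
If results should be restricted to the requested author, one possible approach is to retrieve more candidates than needed and filter them afterwards. The sketch below is an assumption-laden illustration: `filter_by_author` is a hypothetical helper, and the `hits` list uses placeholder text rather than real search results. It strips stray commas because some author fields in the dataset end with one (for example `Oscar Wilde,`).

```python
def filter_by_author(hits, author, k=3):
    """Keep the first k ranked hits whose author matches, ignoring case and stray commas."""
    wanted = author.strip(" ,").lower()
    matches = [h for h in hits if h["author"].strip(" ,").lower() == wanted]
    return matches[:k]

# Hypothetical ranked candidates, as a wider search (for example k=50) might return them.
hits = [
    {"quote": "placeholder quote 1", "author": "Aristotle"},
    {"quote": "placeholder quote 2", "author": "Oscar Wilde,"},
    {"quote": "placeholder quote 3", "author": "Emily Dickinson"},
    {"quote": "placeholder quote 4", "author": "Oscar Wilde"},
]

wilde_hits = filter_by_author(hits, "Oscar Wilde")
for rank, hit in enumerate(wilde_hits, start=1):
    print(rank, hit["quote"], "-", hit["author"])
```

The filter preserves the ranking order of the candidates, so the remaining hits are still sorted by similarity to the query.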
