<a href="https://colab.research.google.com/github/Zeaxanthin80/CAI2300C/blob/main/Assignments/Assignment%203.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 3**

## Assignment: Building an Embeddings-Based Recommender & Classifier


### Objective:
In this assignment, I will explore the power of **embeddings** by creating:

1. A **Recommender System** using OpenAI’s embedding models.
2. A **Classifier** that categorizes text data based on embeddings.
3. (**Extra Credit**) Integration of a **Vector Database** (AstraDB, Pinecone, or ChromaDB) for efficient similarity searches.



---



## Part 1: Understanding Embeddings
1. Research how OpenAI’s text embeddings work and their use cases.
2. Choose or generate a synthetic dataset relevant to a business or industry problem.

>**Example datasets:**
* Customer reviews for a product recommendation system
* News articles for topic classification
* Medical reports for categorizing health conditions
* Movie descriptions for genre recommendations

In [2]:
from openai import OpenAI

# Understanding OpenAI's Text Embeddings

# What are Text Embeddings?
# Text embeddings are numerical vector representations of text that capture its semantic meaning.
# Instead of treating words as isolated entities, embeddings transform text into a high-dimensional space
# where similar meanings are closer together.

# How Do OpenAI's Embeddings Work?
# 1. Tokenization: The input text is split into smaller units (tokens).
# 2. Neural Network Processing: A pre-trained model processes the tokens to generate dense numerical vectors.
# 3. Semantic Representation: Similar words, phrases, or documents have embeddings that are **closer** together in the vector space.

# Use Cases of OpenAI’s Embeddings:
# - Semantic Search: Find documents or responses based on meaning, not just keywords.
# - Text Similarity: Compare two texts to determine how similar they are.
# - Recommendation Systems: Suggest content based on similarity in meaning.
# - Clustering & Categorization: Group similar documents together for topic classification.
# - Chatbots & Virtual Assistants: Improve contextual understanding in conversations.
# - Anomaly Detection: Identify outliers in text data.

# Example of Embedding Usage:
# A query like "How to cook pasta?" and a document titled "Easy pasta recipes" will have similar embeddings,
# allowing a search system to match them based on meaning, even without exact word matches.

In [3]:
# This code is primarily used to retrieve and set up your OpenAI API key for use within a Google Colab environment.
# these lines are setting up secure access to my OpenAI API within your Colab notebook so I can
# use OpenAI's capabilities without exposing your API key directly in the code.
from google.colab import userdata
key = userdata.get('OPENAI_API_KEY')

client = OpenAI(api_key = key)

In [5]:
# This code generates a synthetic movie dataset with movie IDs, titles, descriptions, and genres and
# then neatly organizes it into a DataFrame and saves it for later use.
# This synthetic dataset can then be utilized for building and testing a recommendation system.

import pandas as pd
import random

# Sample words for generating random movie titles
adjectives = ["Lost", "Mysterious", "Dark", "Galactic", "Quantum", "Eternal", "Forbidden", "Secret", "Hidden", "Haunted"
"Adventurous", "Ambitious", "Brave", "Calm", "Charming", "Clever", "Cunning", "Daring", "Dark", "Delicate",
"Determined", "Elegant", "Enchanted", "Eternal", "Fierce", "Forbidden", "Gentle", "Glorious", "Graceful", "Grim",
"Heroic", "Hidden", "Hollow", "Humble", "Icy", "Legendary", "Lively", "Lonely", "Lost", "Luminous",
"Majestic", "Mysterious", "Noble", "Obscure", "Powerful", "Radiant", "Reckless", "Restless", "Rugged", "Sacred",
"Secret", "Shadowy", "Shimmering", "Silent", "Sinister", "Stubborn", "Swift", "Timeless", "Unbreakable", "Vengeful"
]
nouns = ["Journey", "Warrior", "Galaxy", "Legacy", "Escape", "Revolt", "Curse", "Paradox", "Fortune", "Mystery"
"Adventure", "Battle", "Castle", "Champion", "Chaos", "Chronicle", "City", "Conquest", "Courage", "Curse",
"Darkness", "Destiny", "Dragon", "Dream", "Empire", "Enemy", "Escape", "Explorer", "Fate", "Fortune",
"Galaxy", "Guardian", "Hero", "Honor", "Horizon", "Journey", "Knight", "Labyrinth", "Legacy", "Legend",
"Magic", "Master", "Maze", "Miracle", "Mission", "Mystery", "Nightmare", "Oracle", "Paradox", "Phantom",
"Prophecy", "Quest", "Realm", "Rebel", "Revenge", "Revolt", "Shadow", "Sorcerer", "Storm", "Treasure", "Warrior"
]

# Sample genres
genres = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Cyberpunk", "Dark Fantasy", "Disaster", "Documentary",
"Drama", "Dystopian", "Epic", "Family", "Fantasy", "Film Noir", "Historical", "Horror", "Indie", "Inspirational",
"Martial Arts", "Medieval", "Melodrama", "Military", "Mockumentary", "Musical", "Mystery", "Mythology", "Neo-Noir", "Paranormal",
"Period Drama", "Political", "Post-Apocalyptic", "Psychological", "Reality", "Rom-Com", "Romance", "Samurai", "Satire", "Science Fiction",
"Slapstick", "Space Opera", "Sports", "Spy", "Steampunk", "Superhero", "Supernatural", "Survival", "Thriller", "Western"
]

# Sample phrases for generating random movie descriptions
description_templates = [
    "A {adj} {noun} embarks on an epic adventure to save the world.",
    "In a world filled with {adj} secrets, a {noun} uncovers a hidden truth.",
    "A group of {adj} explorers sets out on a mission to find the legendary {noun}.",
    "When a {noun} discovers a {adj} power, everything changes forever.",
    "A {adj} battle between good and evil takes place in the heart of {noun}.",
    "The fate of humanity rests in the hands of a {adj} {noun}.",
    "A {noun} with a {adj} past is forced to confront their destiny.",
    "An unexpected romance blossoms between a {adj} hero and a {noun}.",
    "A detective must solve the {adj} case of the missing {noun}.",
    "A haunted {noun} holds secrets of a {adj} past.",
    "In a {adj} kingdom, a young {noun} rises to claim their rightful throne.",
    "A {adj} scientist makes a groundbreaking discovery that changes history forever.",
    "A team of {adj} explorers stumbles upon an ancient {noun} with unimaginable power.",
    "A {noun} is chosen to fulfill a {adj} prophecy that could alter the fate of the universe.",
    "When a {adj} rebellion ignites, a lone {noun} must stand against the empire.",
    "A {adj} rivalry between two powerful {noun}s threatens to destroy the world.",
    "Trapped in a {adj} dimension, a {noun} fights to find their way home.",
    "A {adj} warrior is forced to fight in a deadly tournament of champions.",
    "A {adj} storm wipes out civilization, leaving a {noun} struggling for survival.",
    "After discovering a {adj} artifact, a {noun} is hunted by a secret organization.",
    "A {adj} stranger arrives in town, carrying a {noun} that holds a dark secret.",
    "A group of {adj} survivors bands together to fight against an army of {noun}s.",
    "A {adj} experiment goes wrong, unleashing chaos upon an unsuspecting {noun}.",
    "A {adj} hacker uncovers a conspiracy that could shake the foundations of {noun}.",
    "A {noun} must embrace their {adj} destiny to restore balance to the world.",
    "A {adj} time traveler finds themselves trapped in an endless loop of {noun}.",
    "A {adj} thief plans the ultimate heist to steal a priceless {noun}.",
    "A {adj} assassin is given an impossible mission—to eliminate the {noun}.",
    "A {adj} rivalry between two legendary {noun}s leads to a final showdown.",
    "A {noun} discovers a {adj} gateway to another dimension.",
    "A {adj} musician stumbles upon a cursed {noun} that changes their life forever.",
    "A {adj} creature awakens from the depths of the ocean, threatening humanity.",
    "A {noun} with a {adj} secret is pursued by a relentless group of hunters.",
    "A {adj} AI develops emotions and forms an unlikely bond with a {noun}.",
    "A {noun} must outwit a {adj} mastermind in a high-stakes game of survival.",
    "A {adj} bounty hunter is on the trail of the most wanted {noun} in the galaxy.",
    "A {adj} journalist uncovers the shocking truth about a mysterious {noun}.",
    "A {adj} orphan discovers they are the heir to a lost {noun}.",
    "A {adj} knight must retrieve a sacred {noun} to prevent the kingdom's fall.",
    "A {adj} scientist creates a {noun} that can alter reality itself.",
    "A {adj} soldier returns home to find their city ruled by a ruthless {noun}.",
    "A {adj} witch forms an unlikely alliance with a {noun} to defeat an ancient evil.",
    "A {noun} awakens to find themselves in a {adj} version of their own reality.",
    "A {adj} gambler makes a dangerous bet with a powerful {noun}.",
    "A {adj} explorer maps out an uncharted land filled with mythical {noun}s.",
    "A {adj} shipwreck survivor stumbles upon an island ruled by a {noun}.",
    "A {adj} outlaw must team up with a {noun} to escape a deadly pursuit.",
    "A {adj} prisoner is given a chance at freedom in exchange for hunting down a {noun}.",
    "A {adj} rebel leads an uprising against a corrupt {noun} in a dystopian city."
]

# Number of records to generate
num_movies = 250

# Generate movie dataset
# This loop iterates num_movies (100) times to create each movie entry
# The generated data for the movie is appended as a list to the movies list.
movies = []
for i in range(1, num_movies + 1):
    title = f" {random.choice(adjectives)} {random.choice(nouns)}"
    description = random.choice(description_templates).format(adj=random.choice(adjectives), noun=random.choice(nouns))
    genre = random.choice(genres)
    movies.append([i, title, description, genre])

# Create DataFrame
df = pd.DataFrame(movies, columns=["Movie_ID", "Title", "Description", "Genre"])

# Display first few rows
print(df.head())

# Save dataset
df.to_csv("synthetic_movies_dynamic.csv", index=False)

   Movie_ID              Title  \
0         1   Graceful Mystery   
1         2      Sinister Fate   
2         3    Quantum Warrior   
3         4   Restless Mystery   
4         5    Legendary Quest   

                                         Description        Genre  
0    A haunted Rebel holds secrets of a Secret past.      Romance  
1  In a world filled with Gentle secrets, a Treas...    Slapstick  
2  A Mysterious rebel leads an uprising against a...   Paranormal  
3  A Eternal rebel leads an uprising against a co...     Medieval  
4  A Grim AI develops emotions and forms an unlik...  Documentary  


In [6]:
df

Unnamed: 0,Movie_ID,Title,Description,Genre
0,1,Graceful Mystery,A haunted Rebel holds secrets of a Secret past.,Romance
1,2,Sinister Fate,"In a world filled with Gentle secrets, a Treas...",Slapstick
2,3,Quantum Warrior,A Mysterious rebel leads an uprising against a...,Paranormal
3,4,Restless Mystery,A Eternal rebel leads an uprising against a co...,Medieval
4,5,Legendary Quest,A Grim AI develops emotions and forms an unlik...,Documentary
...,...,...,...,...
245,246,HauntedAdventurous Nightmare,A Hollow hacker uncovers a conspiracy that cou...,Epic
246,247,Obscure Chaos,A Noble scientist creates a Treasure that can ...,Disaster
247,248,Vengeful Dream,"When a Clever rebellion ignites, a lone Shadow...",Adventure
248,249,Lonely Miracle,A Hollow battle between good and evil takes pl...,Historical


In [9]:
def create_movie_text(movie):
    return (
        f"Title: {movie['Title']}\n"
        f"Genre: {movie['Genre']}\n"
        f"Description: {movie['Description']}"
    )

# Assuming 'df' is your DataFrame containing the movie data
movie_texts = df.apply(create_movie_text, axis=1).tolist()
print(movie_texts)

['Title:  Graceful Mystery\nGenre: Romance\nDescription: A haunted Rebel holds secrets of a Secret past.', 'Title:  Sinister Fate\nGenre: Slapstick\nDescription: In a world filled with Gentle secrets, a Treasure uncovers a hidden truth.', 'Title:  Quantum Warrior\nGenre: Paranormal\nDescription: A Mysterious rebel leads an uprising against a corrupt Galaxy in a dystopian city.', 'Title:  Restless Mystery\nGenre: Medieval\nDescription: A Eternal rebel leads an uprising against a corrupt Chronicle in a dystopian city.', 'Title:  Legendary Quest\nGenre: Documentary\nDescription: A Grim AI develops emotions and forms an unlikely bond with a Fortune.', 'Title:  Eternal Escape\nGenre: Military\nDescription: Trapped in a Brave dimension, a Champion fights to find their way home.', 'Title:  Delicate Horizon\nGenre: Neo-Noir\nDescription: A team of Obscure explorers stumbles upon an ancient Revenge with unimaginable power.', 'Title:  Heroic Darkness\nGenre: Musical\nDescription: A Secret soldie

In [19]:
def create_movie_text(movie):
    # Remove newlines within movie attributes
    title = movie['Title'].replace('\n', ' ')
    genre = movie['Genre'].replace('\n', ' ')
    description = movie['Description'].replace('\n', ' ')

    return (
        f"Title: {title}\n"
        f"Genre: {genre}\n"
        f"Description: {description}"
    )

# Assuming 'df' is your DataFrame containing the movie data
movie_texts = df.apply(create_movie_text, axis=1).tolist()
print(movie_texts)

['Title:  Graceful Mystery\nGenre: Romance\nDescription: A haunted Rebel holds secrets of a Secret past.', 'Title:  Sinister Fate\nGenre: Slapstick\nDescription: In a world filled with Gentle secrets, a Treasure uncovers a hidden truth.', 'Title:  Quantum Warrior\nGenre: Paranormal\nDescription: A Mysterious rebel leads an uprising against a corrupt Galaxy in a dystopian city.', 'Title:  Restless Mystery\nGenre: Medieval\nDescription: A Eternal rebel leads an uprising against a corrupt Chronicle in a dystopian city.', 'Title:  Legendary Quest\nGenre: Documentary\nDescription: A Grim AI develops emotions and forms an unlikely bond with a Fortune.', 'Title:  Eternal Escape\nGenre: Military\nDescription: Trapped in a Brave dimension, a Champion fights to find their way home.', 'Title:  Delicate Horizon\nGenre: Neo-Noir\nDescription: A team of Obscure explorers stumbles upon an ancient Revenge with unimaginable power.', 'Title:  Heroic Darkness\nGenre: Musical\nDescription: A Secret soldie

In [26]:
# load & inspect dataset

df = df[["Title", "Genre", "Description"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Title.str.strip() + "; Genre: " + df.Genre.str.strip() + "; Description: " + df.Description.str.strip()
)
df.head(2)

Unnamed: 0,Title,Genre,Description,combined
0,Graceful Mystery,Romance,A haunted Rebel holds secrets of a Secret past.,Title: Graceful Mystery; Genre: Romance; Descr...
1,Sinister Fate,Slapstick,"In a world filled with Gentle secrets, a Treas...",Title: Sinister Fate; Genre: Slapstick; Descri...




---



## Part 2: Build a Recommender System
3. Use OpenAI’s embedding models to **generate vector representations** of your dataset.
4. Implement a **similarity search** function to recommend similar items based on user input.
5. Test your system by providing different queries and observing the quality of recommendations.
**Deliverables for Recommender System:**

* Code implementation (Jupyter Notebook or Python script).
* Explanation of your use case, dataset, and how recommendations are generated.
* Sample queries and the system’s responses.

In [42]:
import tiktoken

from utils.embeddings_utils import get_embedding

ModuleNotFoundError: No module named 'tiktoken'

In [35]:
def get_embedding(text, model="text-embedding-3-small"):
    # text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding

df['embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))

In [37]:
# Now you can save the DataFrame to CSV
df.to_csv('./embedded_250_movies.csv', index=False)

In [39]:
a = get_embedding("hi", model='text-embedding-3-small')
b = get_embedding("hello", model='text-embedding-3-small')

### Generate Embeddings

In [32]:
def create_embeddings(texts):
    """
    Generates embeddings for a list of texts using the OpenAI API.

    Args:
        texts (list): A list of text strings to generate embeddings for.

    Returns:
        list: A list of embedding vectors, one for each input text.
    """
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small"  # You can change the embedding model if needed
        )
        embeddings.append(response.data[0].embedding)
    return embeddings

In [33]:
# movie_texts = [create_movie_text(movie) for movie in articles]
# article_embeddings = create_embeddings(article_texts)
# print(article_embeddings)

movie_texts = [create_movie_text(movie) for movie in df.to_dict('records')]
movies_embeddings = create_embeddings(movie_texts) # Using movie_texts instead of article_texts
print(movies_embeddings)

KeyboardInterrupt: 

### Similarity Search Function

In [31]:
def cosine_similarity(a, b):
    """
    Calculates the cosine similarity between two vectors.

    Args:
        a (list): The first vector.
        b (list): The second vector.

    Returns:
        float: The cosine similarity between the two vectors.
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend_similar_items(query, items, embeddings, top_k=3):
    """
    Recommends similar items based on a user query.

    Args:
        query (str): The user's query.
        items (list): A list of items to search through.
        embeddings (list): A list of embeddings corresponding to the items.
        top_k (int, optional): The number of similar items to return. Defaults to 3.

    Returns:
        list: A list of the most similar items to the query.
    """
    query_embedding = create_embeddings([query])[0]  # Get embedding for the query

    similarities = [cosine_similarity(query_embedding, item_embedding) for item_embedding in embeddings]

    # Get the indices of the most similar items
    sorted_indices = np.argsort(similarities)[::-1]

    # Return the top_k most similar items
    return [items[i] for i in sorted_indices[:top_k]]

### Testing and Recommendations

In [39]:
import numpy as np

# User query
user_query = "Toy Story"

# Get recommendations, using df instead of articles and movies_embeddings instead of article_embeddings
recommendations = recommend_similar_items(user_query, df.to_dict('records'), movies_embeddings)

# Print recommendations
for recommendation in recommendations:
    print(recommendation['Title'])  # Accessing the 'Title' from the movie dictionary

# Try another query
user_query = "Something about sports"
recommendations = recommend_similar_items(user_query, df.to_dict('records'), movies_embeddings)
for recommendation in recommendations:
    print(recommendation['Title']) # Accessing the 'Title' from the movie dictionary

The Forbidden Journey
The Eternal Galaxy
The Lost Journey
The Haunted Warrior
The Quantum Legacy
The Hidden Escape




---



## Part 3: Build a Classifier
6. Build an embedding centric classifier.
7. Define **clear categories** (e.g., positive/negative sentiment, topic classification, fraud detection).
8. Evaluate its accuracy and effectiveness.

**Deliverables for Classifier:**
* Code implementation.
* Explanation of classification logic and dataset.
* Performance analysis (accuracy, precision, recall, or confusion matrix).



---



## Extra Credit: Integrate a Vector Database
🔹 **Challenge**: Instead of storing and retrieving embeddings in memory, integrate a **vector database** such as:

* **AstraDB** (built on Apache Cassandra)
* **Pinecone** (real-time vector search)
* **ChromaDB** (open-source vector store)

🔹 Store embeddings in the database and retrieve relevant results dynamically.

🔹 Explain how the integration improves scalability and search efficiency.