<a href="https://colab.research.google.com/github/Zeaxanthin80/CAI2300C/blob/main/Assignments/Assignment%203.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 3**

## Assignment: Building an Embeddings-Based Recommender & Classifier


>**Objective**:

>In this assignment, I will explore the power of **embeddings** by creating:

>1. A **Recommender System** using OpenAI’s embedding models.
>2. A **Classifier** that categorizes text data based on embeddings.
>3. (**Extra Credit**) Integration of a **Vector Database** (AstraDB, Pinecone, or ChromaDB) for efficient similarity searches.



---



## Part 1: Understanding Embeddings
1. Research how OpenAI’s text embeddings work and their use cases.
2. Choose or generate a synthetic dataset relevant to a business or industry problem.

>**Example datasets:**
* Customer reviews for a product recommendation system
* News articles for topic classification
* Medical reports for categorizing health conditions
* Movie descriptions for genre recommendations

In [2]:
from openai import OpenAI

# Understanding OpenAI's Text Embeddings

# What are Text Embeddings?
# Text embeddings are numerical vector representations of text that capture its semantic meaning.
# Instead of treating words as isolated entities, embeddings transform text into a high-dimensional space
# where similar meanings are closer together.

# How Do OpenAI's Embeddings Work?
# 1. Tokenization: The input text is split into smaller units (tokens).
# 2. Neural Network Processing: A pre-trained model processes the tokens to generate dense numerical vectors.
# 3. Semantic Representation: Similar words, phrases, or documents have embeddings that are closer together in the vector space.

# Use Cases of OpenAI’s Embeddings:
# - Semantic Search: Find documents or responses based on meaning, not just keywords.
# - Text Similarity: Compare two texts to determine how similar they are.
# - Recommendation Systems: Suggest content based on similarity in meaning.
# - Clustering & Categorization: Group similar documents together for topic classification.
# - Chatbots & Virtual Assistants: Improve contextual understanding in conversations.
# - Anomaly Detection: Identify outliers in text data.

# Example of Embedding Usage:
# A query like "How to cook pasta?" and a document titled "Easy pasta recipes" will have similar embeddings,
# allowing a search system to match them based on meaning, even without exact word matches.
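
To make the "closer together" idea concrete, here is a minimal sketch that compares hand-made three-dimensional vectors with cosine similarity. The vectors are invented stand-ins for illustration only; real OpenAI embeddings have many more dimensions, and the helper name `cosine_similarity` is an assumption rather than part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings (illustrative values only)
toy_vectors = {
    "How to cook pasta?": [0.9, 0.1, 0.0],
    "Easy pasta recipes": [0.8, 0.2, 0.1],
    "Stock market news": [0.0, 0.1, 0.9],
}

query = toy_vectors["How to cook pasta?"]
for text, vector in toy_vectors.items():
    # Higher similarity means the two texts sit closer together in the vector space
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

The two pasta texts score much higher against each other than against the finance text, which is the property a semantic search relies on.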

In [3]:
# Retrieve the OpenAI API key from Colab's secret storage so it is never hard-coded in the notebook,
# then create the client used by every embedding call below.
from google.colab import userdata
key = userdata.get('OPENAI_API_KEY')

client = OpenAI(api_key = key)
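
The cell above only works inside Google Colab. As a hedged sketch, the key lookup could fall back to an environment variable when the notebook runs elsewhere. The helper name `load_api_key` is hypothetical, and the sketch assumes the key is exported as `OPENAI_API_KEY` outside Colab.

```python
import os

def load_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key was exported as an environment variable
        return os.environ.get(name)

key = load_api_key()
print("key found" if key else "key missing")
```

Passing the result to `OpenAI(api_key=key)` then works the same way in both environments.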

In [4]:
# This code generates a synthetic movie dataset with movie IDs, titles, descriptions, and genres and
# then neatly organizes it into a DataFrame and saves it for later use.
# This synthetic dataset can then be utilized for building and testing a recommendation system.

import pandas as pd
import random

# Sample words for generating random movie titles
adjectives = ["Lost", "Mysterious", "Dark", "Galactic", "Quantum", "Eternal", "Forbidden", "Secret", "Hidden", "Haunted",
"Adventurous", "Ambitious", "Brave", "Calm", "Charming", "Clever", "Cunning", "Daring", "Dark", "Delicate",
"Determined", "Elegant", "Enchanted", "Eternal", "Fierce", "Forbidden", "Gentle", "Glorious", "Graceful", "Grim",
"Heroic", "Hidden", "Hollow", "Humble", "Icy", "Legendary", "Lively", "Lonely", "Lost", "Luminous",
"Majestic", "Mysterious", "Noble", "Obscure", "Powerful", "Radiant", "Reckless", "Restless", "Rugged", "Sacred",
"Secret", "Shadowy", "Shimmering", "Silent", "Sinister", "Stubborn", "Swift", "Timeless", "Unbreakable", "Vengeful"
]
nouns = ["Journey", "Warrior", "Galaxy", "Legacy", "Escape", "Revolt", "Curse", "Paradox", "Fortune", "Mystery",
"Adventure", "Battle", "Castle", "Champion", "Chaos", "Chronicle", "City", "Conquest", "Courage", "Curse",
"Darkness", "Destiny", "Dragon", "Dream", "Empire", "Enemy", "Escape", "Explorer", "Fate", "Fortune",
"Galaxy", "Guardian", "Hero", "Honor", "Horizon", "Journey", "Knight", "Labyrinth", "Legacy", "Legend",
"Magic", "Master", "Maze", "Miracle", "Mission", "Mystery", "Nightmare", "Oracle", "Paradox", "Phantom",
"Prophecy", "Quest", "Realm", "Rebel", "Revenge", "Revolt", "Shadow", "Sorcerer", "Storm", "Treasure", "Warrior"
]

# Sample genres
genres = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Cyberpunk", "Dark Fantasy", "Disaster", "Documentary",
"Drama", "Dystopian", "Epic", "Family", "Fantasy", "Film Noir", "Historical", "Horror", "Indie", "Inspirational",
"Martial Arts", "Medieval", "Melodrama", "Military", "Mockumentary", "Musical", "Mystery", "Mythology", "Neo-Noir", "Paranormal",
"Period Drama", "Political", "Post-Apocalyptic", "Psychological", "Reality", "Rom-Com", "Romance", "Samurai", "Satire", "Science Fiction",
"Slapstick", "Space Opera", "Sports", "Spy", "Steampunk", "Superhero", "Supernatural", "Survival", "Thriller", "Western"
]

# Sample phrases for generating random movie descriptions
description_templates = [
    "A {adj} {noun} embarks on an epic adventure to save the world.",
    "In a world filled with {adj} secrets, a {noun} uncovers a hidden truth.",
    "A group of {adj} explorers sets out on a mission to find the legendary {noun}.",
    "When a {noun} discovers a {adj} power, everything changes forever.",
    "A {adj} battle between good and evil takes place in the heart of {noun}.",
    "The fate of humanity rests in the hands of a {adj} {noun}.",
    "A {noun} with a {adj} past is forced to confront their destiny.",
    "An unexpected romance blossoms between a {adj} hero and a {noun}.",
    "A detective must solve the {adj} case of the missing {noun}.",
    "A haunted {noun} holds secrets of a {adj} past.",
    "In a {adj} kingdom, a young {noun} rises to claim their rightful throne.",
    "A {adj} scientist makes a groundbreaking discovery that changes history forever.",
    "A team of {adj} explorers stumbles upon an ancient {noun} with unimaginable power.",
    "A {noun} is chosen to fulfill a {adj} prophecy that could alter the fate of the universe.",
    "When a {adj} rebellion ignites, a lone {noun} must stand against the empire.",
    "A {adj} rivalry between two powerful {noun}s threatens to destroy the world.",
    "Trapped in a {adj} dimension, a {noun} fights to find their way home.",
    "A {adj} warrior is forced to fight in a deadly tournament of champions.",
    "A {adj} storm wipes out civilization, leaving a {noun} struggling for survival.",
    "After discovering a {adj} artifact, a {noun} is hunted by a secret organization.",
    "A {adj} stranger arrives in town, carrying a {noun} that holds a dark secret.",
    "A group of {adj} survivors bands together to fight against an army of {noun}s.",
    "A {adj} experiment goes wrong, unleashing chaos upon an unsuspecting {noun}.",
    "A {adj} hacker uncovers a conspiracy that could shake the foundations of {noun}.",
    "A {noun} must embrace their {adj} destiny to restore balance to the world.",
    "A {adj} time traveler finds themselves trapped in an endless loop of {noun}.",
    "A {adj} thief plans the ultimate heist to steal a priceless {noun}.",
    "A {adj} assassin is given an impossible mission—to eliminate the {noun}.",
    "A {adj} rivalry between two legendary {noun}s leads to a final showdown.",
    "A {noun} discovers a {adj} gateway to another dimension.",
    "A {adj} musician stumbles upon a cursed {noun} that changes their life forever.",
    "A {adj} creature awakens from the depths of the ocean, threatening humanity.",
    "A {noun} with a {adj} secret is pursued by a relentless group of hunters.",
    "A {adj} AI develops emotions and forms an unlikely bond with a {noun}.",
    "A {noun} must outwit a {adj} mastermind in a high-stakes game of survival.",
    "A {adj} bounty hunter is on the trail of the most wanted {noun} in the galaxy.",
    "A {adj} journalist uncovers the shocking truth about a mysterious {noun}.",
    "A {adj} orphan discovers they are the heir to a lost {noun}.",
    "A {adj} knight must retrieve a sacred {noun} to prevent the kingdom's fall.",
    "A {adj} scientist creates a {noun} that can alter reality itself.",
    "A {adj} soldier returns home to find their city ruled by a ruthless {noun}.",
    "A {adj} witch forms an unlikely alliance with a {noun} to defeat an ancient evil.",
    "A {noun} awakens to find themselves in a {adj} version of their own reality.",
    "A {adj} gambler makes a dangerous bet with a powerful {noun}.",
    "A {adj} explorer maps out an uncharted land filled with mythical {noun}s.",
    "A {adj} shipwreck survivor stumbles upon an island ruled by a {noun}.",
    "A {adj} outlaw must team up with a {noun} to escape a deadly pursuit.",
    "A {adj} prisoner is given a chance at freedom in exchange for hunting down a {noun}.",
    "A {adj} rebel leads an uprising against a corrupt {noun} in a dystopian city."
]

# Number of records to generate
num_movies = 250

# Generate movie dataset
# This loop iterates num_movies times (250 here) to create each movie entry.
# The generated data for each movie is appended as a list to the movies list.
movies = []
for i in range(1, num_movies + 1):
    title = f" {random.choice(adjectives)} {random.choice(nouns)}"
    description = random.choice(description_templates).format(adj=random.choice(adjectives), noun=random.choice(nouns))
    genre = random.choice(genres)
    movies.append([i, title, description, genre])

# Create DataFrame
df = pd.DataFrame(movies, columns=["Movie_ID", "Title", "Description", "Genre"])

# Display first few rows
print(df.head())

# Save dataset
df.to_csv("synthetic_movies_dynamic.csv", index=False)

   Movie_ID             Title  \
0         1    Fierce Warrior   
1         2   Heroic Conquest   
2         3    Glorious Dream   
3         4    Rugged Warrior   
4         5    Delicate Enemy   

                                         Description         Genre  
0  A Graceful gambler makes a dangerous bet with ...     Superhero  
1  A Grim scientist makes a groundbreaking discov...         Crime  
2  A team of Glorious explorers stumbles upon an ...  Dark Fantasy  
3  A Secret rivalry between two legendary Revenge...     Slapstick  
4  A Powerful prisoner is given a chance at freed...  Period Drama  
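
Because the generator draws from `random` without a seed, every run produces a different dataset, which makes results hard to compare between runs. A minimal sketch of a seeded variant follows; the helper `make_titles` and the short word lists are illustrative assumptions, not the notebook's actual lists.

```python
import random

def make_titles(adjectives, nouns, count, seed=42):
    """Generate `count` two-word titles; the same seed always yields the same titles."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return [f"{rng.choice(adjectives)} {rng.choice(nouns)}" for _ in range(count)]

# Shortened word lists, for illustration only
sample_adjectives = ["Lost", "Dark", "Quantum"]
sample_nouns = ["Journey", "Galaxy", "Paradox"]

first = make_titles(sample_adjectives, sample_nouns, 5)
second = make_titles(sample_adjectives, sample_nouns, 5)
print(first)
print(first == second)  # identical because both calls use the same seed
```

The same idea applies to descriptions and genres: draw every random choice from one seeded `random.Random` instance.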


In [5]:
def create_movie_text(movie):
    # Remove newlines within movie attributes
    title = movie['Title'].replace('\n', ' ')
    genre = movie['Genre'].replace('\n', ' ')
    description = movie['Description'].replace('\n', ' ')

    return (
        f"Title: {title}\n"
        f"Genre: {genre}\n"
        f"Description: {description}"
    )

# Assuming 'df' is your DataFrame containing the movie data
movie_texts = df.apply(create_movie_text, axis=1).tolist()
print(movie_texts)

['Title:  Fierce Warrior\nGenre: Superhero\nDescription: A Graceful gambler makes a dangerous bet with a powerful Journey.', 'Title:  Heroic Conquest\nGenre: Crime\nDescription: A Grim scientist makes a groundbreaking discovery that changes history forever.', 'Title:  Glorious Dream\nGenre: Dark Fantasy\nDescription: A team of Glorious explorers stumbles upon an ancient Prophecy with unimaginable power.', 'Title:  Rugged Warrior\nGenre: Slapstick\nDescription: A Secret rivalry between two legendary Revenges leads to a final showdown.', 'Title:  Delicate Enemy\nGenre: Period Drama\nDescription: A Powerful prisoner is given a chance at freedom in exchange for hunting down a Warrior.', 'Title:  Powerful Battle\nGenre: Science Fiction\nDescription: When a Unbreakable rebellion ignites, a lone Oracle must stand against the empire.', 'Title:  Grim Treasure\nGenre: Steampunk\nDescription: A Luminous explorer maps out an uncharted land filled with mythical Oracles.', 'Title:  Glorious Fate\nGe

In [6]:
def combine_movie_info_to_string(movie_text):
  """Combines the title, genre, and description lines of a movie text into one comma-separated string."""
  # movie_text has three lines: "Title: ...", "Genre: ...", "Description: ..."
  title_line, genre_line, description_line = movie_text.split('\n')
  # Split each line on the first ': ' only, so a colon inside the value is kept
  title = title_line.split(': ', 1)[1]
  genre = genre_line.split(': ', 1)[1]
  description = description_line.split(': ', 1)[1]

  # Create the combined string with commas
  combined_string = f"Title: {title}, Genre: {genre}, Description: {description}"
  return combined_string

# movie_texts holds newline-separated strings; combined holds the comma-separated form
# Example: "Title: Movie Title, Genre: Action, Description: Movie description here"
combined = [combine_movie_info_to_string(text) for text in movie_texts]

# Print the combined movie texts
print(combined)

['Title:  Fierce Warrior, Genre: Superhero, Description: A Graceful gambler makes a dangerous bet with a powerful Journey.', 'Title:  Heroic Conquest, Genre: Crime, Description: A Grim scientist makes a groundbreaking discovery that changes history forever.', 'Title:  Glorious Dream, Genre: Dark Fantasy, Description: A team of Glorious explorers stumbles upon an ancient Prophecy with unimaginable power.', 'Title:  Rugged Warrior, Genre: Slapstick, Description: A Secret rivalry between two legendary Revenges leads to a final showdown.', 'Title:  Delicate Enemy, Genre: Period Drama, Description: A Powerful prisoner is given a chance at freedom in exchange for hunting down a Warrior.', 'Title:  Powerful Battle, Genre: Science Fiction, Description: When a Unbreakable rebellion ignites, a lone Oracle must stand against the empire.', 'Title:  Grim Treasure, Genre: Steampunk, Description: A Luminous explorer maps out an uncharted land filled with mythical Oracles.', 'Title:  Glorious Fate, Ge



---



## Part 2: Build a Recommender System
3. Use OpenAI’s embedding models to **generate vector representations** of your dataset.
4. Implement a **similarity search** function to recommend similar items based on user input.
5. Test your system by providing different queries and observing the quality of recommendations.

**Deliverables for Recommender System:**

* Code implementation (Jupyter Notebook or Python script).
* Explanation of your use case, dataset, and how recommendations are generated.
* Sample queries and the system’s responses.

### Generate Embeddings

In [7]:
def create_embeddings(texts, model = "text-embedding-3-small"):
    """
    Generates embeddings for one or more texts using the OpenAI API.

    Args:
        texts (str or list): A single text string or a list of text strings.
        model (str): Name of the OpenAI embedding model to use.

    Returns:
        list: A list of embedding vectors, one for each input text.
    """
    # A bare string would otherwise be iterated character by character
    if isinstance(texts, str):
        texts = [texts]
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            input = text,
            model = model
        )
        embeddings.append(response.data[0].embedding)
    return embeddings
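
The function above sends one request per text. The embeddings endpoint also accepts a list of inputs, so the texts could instead be sent in chunks. The sketch below shows only the chunking logic and takes the embedding call as a parameter; `embed_in_batches`, `embed_batch`, and `fake_embed_batch` are hypothetical names, and the fake embedder exists only so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts chunk by chunk, preserving input order.

    embed_batch is any callable that maps list[str] -> list[list[float]].
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_batch(chunk))
    return vectors

# Stand-in embedder so the sketch runs without network access:
# each "embedding" is just the length of the text.
def fake_embed_batch(chunk):
    return [[float(len(text))] for text in chunk]

demo = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo)  # [[1.0], [2.0], [3.0]]
```

With the real client, `embed_batch` would wrap a single `client.embeddings.create(input=chunk, model=model)` call and return the `embedding` of each item in `response.data`.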

In [8]:
movies_texts = combined  # combined already contains the desired movie texts
movies_embeddings = create_embeddings(movies_texts)
# print(movies_embeddings)

In [None]:
movies_embeddings

### Similarity Search Function

In [10]:
from scipy.spatial import distance
def find_n_closest(query_vector, embeddings, n=3):
  distances = []
  for index, embedding in enumerate(embeddings):
    dist = distance.cosine(query_vector, embedding)
    distances.append({"distance": dist, "index": index})
  distances_sorted = sorted(distances, key=lambda x: x["distance"])
  return distances_sorted[0:n]
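
`find_n_closest` sorts every distance even though only the top `n` are needed. As an optional variation, `heapq.nsmallest` keeps just the best `n` candidates. This sketch is self-contained, so it computes cosine distance by hand instead of calling SciPy; `find_n_closest_heap` and the toy vectors are illustrative assumptions.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity scipy.spatial.distance.cosine returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def find_n_closest_heap(query_vector, embeddings, n=3):
    """Return the n entries with the smallest cosine distance to query_vector."""
    scored = (
        {"distance": cosine_distance(query_vector, embedding), "index": index}
        for index, embedding in enumerate(embeddings)
    )
    return heapq.nsmallest(n, scored, key=lambda item: item["distance"])

# Two-dimensional toy vectors, for illustration only
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
hits = find_n_closest_heap([1.0, 0.0], toy_embeddings, n=2)
print([hit["index"] for hit in hits])  # [0, 3]
```

The return shape matches `find_n_closest`, so either function can feed the recommendation loop.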

### Testing and Recommendations

In [16]:
query_text = "AI"
query_vector = create_embeddings([query_text])[0]  # pass a list so the whole query is embedded as one text
hits = find_n_closest(query_vector, movies_embeddings)
for hit in hits:
  movie = movies_texts[hit['index']]  # Use movies_texts

  # Split into at most three fields so commas inside the description are preserved
  fields = movie.split(', ', 2)

  # Extract title, genre, and description, keeping everything after the first ': '
  title = fields[0].split(': ', 1)[1].strip() if len(fields) > 0 else "Title not found"
  genre = fields[1].split(': ', 1)[1].strip() if len(fields) > 1 else "Genre not found"
  description = fields[2].split(': ', 1)[1].strip() if len(fields) > 2 else "Description not found"

  # Print the title, genre, and description with aligned colons
  print(f"      Title: {title}")
  print(f"      Genre: {genre}")
  print(f"Description: {description}")

  print("\n")  # Add an empty line for better readability

      Title: Shimmering Destiny
      Genre: Martial Arts
Description: A Charming AI develops emotions and forms an unlikely bond with a Quest.


      Title: Stubborn Honor
      Genre: Musical
Description: A Lost AI develops emotions and forms an unlikely bond with a Master.


      Title: Rugged Galaxy
      Genre: Sports
Description: A detective must solve the Noble case of the missing Destiny.






---



## Part 3: Build a Classifier
6. Build an embedding-centric classifier.
7. Define **clear categories** (e.g., positive/negative sentiment, topic classification, fraud detection).
8. Evaluate its accuracy and effectiveness.

**Deliverables for Classifier:**
* Code implementation.
* Explanation of classification logic and dataset.
* Performance analysis (accuracy, precision, recall, or confusion matrix).
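
For the performance analysis, accuracy, per-class precision and recall, and a confusion matrix can all be derived from two parallel lists of true and predicted labels. The sketch below uses only the standard library. The `evaluate` helper and the six example labels are invented for illustration and are not results from this notebook.

```python
from collections import Counter

def evaluate(true_labels, predicted_labels):
    """Return (report, confusion), where confusion counts (true, predicted) pairs."""
    confusion = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    correct = sum(confusion[(label, label)] for label in labels)
    report = {"accuracy": correct / len(true_labels), "per_class": {}}
    for label in labels:
        true_positive = confusion[(label, label)]
        predicted_total = sum(confusion[(t, label)] for t in labels)
        actual_total = sum(confusion[(label, p)] for p in labels)
        report["per_class"][label] = {
            "precision": true_positive / predicted_total if predicted_total else 0.0,
            "recall": true_positive / actual_total if actual_total else 0.0,
        }
    return report, confusion

# Made-up labels, for illustration only
y_true = ["Tech", "Tech", "Sport", "Business", "Science", "Sport"]
y_pred = ["Tech", "Business", "Sport", "Business", "Science", "Tech"]

report, confusion = evaluate(y_true, y_pred)
print(f"accuracy = {report['accuracy']:.2f}")
for label, scores in report["per_class"].items():
    print(f"{label:>8}: precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

In practice `y_pred` would come from running the classifier over a labelled set of articles and `y_true` from their known topics.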

In [None]:
topics = [
{'label': 'Tech'},
{'label': 'Science'},
{'label': 'Sport'},
{'label': 'Business'},
]
class_descriptions = [topic['label'] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)

In [None]:
article = {"headline": "How NVIDIA GPUs Could Decide Who Wins the AI Race", "keywords": ["ai", "business", "computers"]}

def create_article_text(article):
  return f"""Headline: {article['headline']} Keywords: {', '.join(article['keywords'])}"""

article_text = create_article_text(article)
article_embeddings = create_embeddings([article_text])[0]  # pass a list so the whole text is embedded at once

In [None]:
def find_closest(query_vector, embeddings):
  distances = []
  for index, embedding in enumerate(embeddings):
    dist = distance.cosine(query_vector, embedding)
    distances.append({"distance": dist, "index": index})
  return min(distances, key=lambda x: x["distance"])

closest = find_closest(article_embeddings, class_embeddings)
# Map the closest embedding back to its label
print(topics[closest['index']]['label'])



---



## Extra Credit: Integrate a Vector Database
🔹 **Challenge**: Instead of storing and retrieving embeddings in memory, integrate a **vector database** such as:

* **AstraDB** (built on Apache Cassandra)
* **Pinecone** (real-time vector search)
* **ChromaDB** (open-source vector store)

🔹 Store embeddings in the database and retrieve relevant results dynamically.

🔹 Explain how the integration improves scalability and search efficiency.
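
Before wiring in a real database, it can help to see the two operations such a store provides: write a vector with metadata, then query for the nearest neighbours. The class below is a hypothetical in-memory stand-in written for illustration. It is not the API of AstraDB, Pinecone, or ChromaDB, and it scans every row, whereas a real vector database uses an index to avoid that.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (not a real client API)."""

    def __init__(self):
        self._rows = {}  # item_id -> (unit-length vector, metadata)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def upsert(self, item_id, vector, metadata):
        """Insert or overwrite one item; vectors are normalized once at write time."""
        self._rows[item_id] = (self._normalize(vector), metadata)

    def query(self, vector, top_k=3):
        """Return up to top_k (similarity, item_id, metadata) tuples, best first."""
        unit = self._normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(unit, stored)), item_id, metadata)
            for item_id, (stored, metadata) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]

# Toy two-dimensional vectors standing in for real movie embeddings
store = TinyVectorStore()
store.upsert("m1", [1.0, 0.0], {"title": "Fierce Warrior"})
store.upsert("m2", [0.0, 1.0], {"title": "Heroic Conquest"})
store.upsert("m3", [0.6, 0.8], {"title": "Glorious Dream"})

results = store.query([0.9, 0.1], top_k=2)
print([(item_id, meta["title"]) for _, item_id, meta in results])
```

A real integration would replace `upsert` and `query` with the chosen database's own client calls while the rest of the recommender stays unchanged.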