<a href="https://colab.research.google.com/github/Zeaxanthin80/CAI2300C/blob/main/Assignments/Assignment%203.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CAI2300C – Assignment: Building an Embeddings-Based Recommender & Classifier


## Objective:
In this assignment, you will explore the power of **embeddings** by creating:

1. A **Recommender System** using OpenAI’s embedding models.
2. A **Classifier** that categorizes text data based on embeddings.
3. (**Extra Credit**) Integration of a **Vector Database** (AstraDB, Pinecone, or ChromaDB) for efficient similarity searches.



---



## Part 1: Understanding Embeddings
1. Research how OpenAI’s text embeddings work and their use cases.
2. Choose or generate a synthetic dataset relevant to a business or industry problem.

>**Example datasets:**
* Customer reviews for a product recommendation system
* News articles for topic classification
* Medical reports for categorizing health conditions
* Movie descriptions for genre recommendations

In [9]:
from openai import OpenAI
from google.colab import userdata
key = userdata.get('OPENAI_API_KEY')

In [10]:
client = OpenAI(api_key = key)

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-small"
)

print(response.data[0].embedding)

[0.005172153003513813, 0.017217181622982025, -0.018686940893530846, ...]

In [11]:
import pandas as pd

# Sample movie titles and descriptions
movies = [
    ("The Lost Adventure", "A group of explorers embarks on a journey to uncover ancient mysteries."),
    ("Cybernetic Revolt", "In a future dominated by AI, humans fight for their survival."),
    ("Love in Paris", "A heartwarming tale of two strangers finding love in the city of romance."),
    ("Mystery Manor", "A detective must solve a series of murders in an eerie mansion."),
    ("Galactic Wars", "An interstellar battle between rival factions for control of the galaxy."),
    ("Chef’s Delight", "A young chef competes in a world-renowned cooking competition."),
    ("The Haunted Lake", "A town discovers the dark secrets hidden beneath their local lake."),
    ("Speed Chase", "An elite driver is forced into high-stakes illegal racing."),
    ("The Quantum Paradox", "A scientist accidentally opens a portal to another dimension."),
    ("Last Stand", "A retired soldier is called back for one final mission."),
]

# Corresponding genres
genres = ["Adventure", "Sci-Fi", "Romance", "Mystery", "Sci-Fi", "Drama", "Horror", "Action", "Sci-Fi", "Action"]

# Generating dataset: one row per movie, with a 1-based ID
data = [
    [movie_id, title, description, genre]
    for movie_id, ((title, description), genre) in enumerate(zip(movies, genres), start=1)
]

# Creating DataFrame
df = pd.DataFrame(data, columns=["Movie_ID", "Title", "Description", "Genre"])

# Display first few rows
print(df.head())

# Save dataset
df.to_csv("synthetic_movies.csv", index=False)


   Movie_ID               Title  \
0         1  The Lost Adventure   
1         2   Cybernetic Revolt   
2         3       Love in Paris   
3         4       Mystery Manor   
4         5       Galactic Wars   

                                         Description      Genre  
0  A group of explorers embarks on a journey to u...  Adventure  
1  In a future dominated by AI, humans fight for ...     Sci-Fi  
2  A heartwarming tale of two strangers finding l...    Romance  
3  A detective must solve a series of murders in ...    Mystery  
4  An interstellar battle between rival factions ...     Sci-Fi  


In [23]:
articles = [
{"headline": "Economic Growth Continues Amid Global Uncertainty",
"topic": "Business",
"keywords": ["economy", "business", "finance"]},

{"headline": "1.5 Billion Tune-in to the World Cup Final",
"topic": "Sport",
"keywords": ["soccer", "world cup", "tv"]}
]

In [36]:
# Combine an article's fields into a single string that can be passed to the embedding model

def create_article_text(article):
    return (
        f"Headline: {article['headline']}\n"
        f"Topic: {article['topic']}\n"
        f"Keywords: {', '.join(article['keywords'])}"
    )

newtext = create_article_text(articles[0])
print(newtext)

Headline: Economic Growth Continues Amid Global Uncertainty
Topic: Business
Keywords: economy, business, finance





---



## Part 2: Build a Recommender System
3. Use OpenAI’s embedding models to **generate vector representations** of your dataset.
4. Implement a **similarity search** function to recommend similar items based on user input.
5. Test your system by providing different queries and observing the quality of recommendations.

**Deliverables for Recommender System:**

* Code implementation (Jupyter Notebook or Python script).
* Explanation of your use case, dataset, and how recommendations are generated.
* Sample queries and the system’s responses.

### Generate Embeddings

In [44]:
def create_embeddings(texts):
    """
    Generates embeddings for a list of texts using the OpenAI API.

    Args:
        texts (list): A list of text strings to generate embeddings for.

    Returns:
        list: A list of embedding vectors, one for each input text.
    """
    embeddings = []
    for text in texts:
        response = client.embeddings.create(
            input=text,
            model="text-embedding-3-small"  # You can change the embedding model if needed
        )
        embeddings.append(response.data[0].embedding)
    return embeddings
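
The function above sends one API request per text. The embeddings endpoint also accepts a list of strings as `input`, so several texts can be embedded per request. The following is a minimal sketch of that idea, not a replacement for the cell above: the name `create_embeddings_batched`, the explicit `client` parameter, and the `batch_size` default are illustrative assumptions.

```python
def create_embeddings_batched(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed `texts` in batches, returning one vector per text in input order.

    `client` is assumed to be an OpenAI client like the one created earlier;
    `batch_size` is an arbitrary illustrative default.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries an `index` into the submitted batch;
        # sorting by it keeps the vectors aligned with their texts.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With only two articles the difference is negligible, but for a dataset of a hundred or more rows batching reduces the number of round trips considerably.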

In [45]:
article_texts = [create_article_text(article) for article in articles]
article_embeddings = create_embeddings(article_texts)
print(article_embeddings)

[[0.03193935751914978, 0.02233048342168331, 0.038119714707136154, ...], ...]

### Similarity Search Function

In [46]:
import numpy as np

def cosine_similarity(a, b):
    """
    Calculates the cosine similarity between two vectors.

    Args:
        a (list): The first vector.
        b (list): The second vector.

    Returns:
        float: The cosine similarity between the two vectors.
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend_similar_items(query, items, embeddings, top_k=3):
    """
    Recommends similar items based on a user query.

    Args:
        query (str): The user's query.
        items (list): A list of items to search through.
        embeddings (list): A list of embeddings corresponding to the items.
        top_k (int, optional): The number of similar items to return. Defaults to 3.

    Returns:
        list: A list of the most similar items to the query.
    """
    query_embedding = create_embeddings([query])[0]  # Get embedding for the query

    similarities = [cosine_similarity(query_embedding, item_embedding) for item_embedding in embeddings]

    # Get the indices of the most similar items
    sorted_indices = np.argsort(similarities)[::-1]

    # Return the top_k most similar items
    return [items[i] for i in sorted_indices[:top_k]]
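
When judging recommendation quality (step 5), it can help to see the similarity score next to each result rather than only the ranked items. The sketch below computes cosine similarity against all stored vectors in one matrix product and returns index and score pairs. `rank_by_similarity` is a hypothetical helper, and the toy 3-dimensional vectors only stand in for real embeddings.

```python
import numpy as np

def rank_by_similarity(query_embedding, embeddings, top_k=3):
    """Return (index, cosine similarity) pairs for the top_k closest rows."""
    matrix = np.asarray(embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Cosine similarity of the query against every row in one matrix product.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (illustrative values only).
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]
for index, score in rank_by_similarity(toy_query, toy_embeddings, top_k=2):
    print(index, round(score, 3))
```

The returned indices can be used to look up the matching entries in `articles`, and a large gap between consecutive scores is a hint that the lower-ranked items are weak matches.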

### Testing and Recommendations

In [48]:
import numpy as np

# User query
user_query = "News about the economy and finance"

# Get recommendations
recommendations = recommend_similar_items(user_query, articles, article_embeddings)

# Print recommendations
for recommendation in recommendations:
    print(recommendation['headline'])

# Try another query
user_query = "Something about sports"
recommendations = recommend_similar_items(user_query, articles, article_embeddings)
for recommendation in recommendations:
    print(recommendation['headline'])

Economic Growth Continues Amid Global Uncertainty
1.5 Billion Tune-in to the World Cup Final
1.5 Billion Tune-in to the World Cup Final
Economic Growth Continues Amid Global Uncertainty




---



## Part 3: Build a Classifier
6. Build an embedding-centric classifier.
7. Define **clear categories** (e.g., positive/negative sentiment, topic classification, fraud detection).
8. Evaluate its accuracy and effectiveness.

**Deliverables for Classifier:**
* Code implementation.
* Explanation of classification logic and dataset.
* Performance analysis (accuracy, precision, recall, or confusion matrix).
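
One possible starting point for this part is a nearest-centroid classifier: average the embeddings of each category, then assign new text to the category whose centroid is most similar. The sketch below is one option under stated assumptions, not the required approach. The helper names are hypothetical, and the toy 2-dimensional vectors and labels stand in for real embeddings and a real train/test split.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Average the unit-normalized embeddings of each class into one centroid."""
    vectors = np.asarray(embeddings, dtype=float)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    classes = sorted(set(labels))
    label_array = np.asarray(labels)
    centroids = np.stack([vectors[label_array == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding to the class whose centroid is most cosine-similar."""
    vectors = np.asarray(embeddings, dtype=float)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return [classes[i] for i in (vectors @ unit_centroids.T).argmax(axis=1)]

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    position = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for truth, guess in zip(y_true, y_pred):
        matrix[position[truth], position[guess]] += 1
    return matrix

# Toy 2-d vectors standing in for real embeddings (illustrative values only).
train_x = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_y = ["Business", "Business", "Sport", "Sport"]
test_x = [[0.8, 0.3], [0.3, 0.8], [0.6, 0.5]]
test_y = ["Business", "Sport", "Sport"]

classes, centroids = fit_centroids(train_x, train_y)
predictions = predict(test_x, classes, centroids)
matrix = confusion_matrix(test_y, predictions, classes)
accuracy = np.trace(matrix) / matrix.sum()
# Per-class precision (column-wise) and recall (row-wise); maximum(..., 1) avoids 0/0.
precision = np.diag(matrix) / np.maximum(matrix.sum(axis=0), 1)
recall = np.diag(matrix) / np.maximum(matrix.sum(axis=1), 1)
print(matrix, accuracy, precision, recall)
```

One caveat when choosing data for evaluation: in the randomly generated movie dataset at the end of this notebook, each genre is drawn independently of the description, so a classifier evaluated on it should not be expected to do better than chance.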



---



## Extra Credit: Integrate a Vector Database
🔹 **Challenge**: Instead of storing and retrieving embeddings in memory, integrate a **vector database** such as:

* **AstraDB** (built on Apache Cassandra)
* **Pinecone** (real-time vector search)
* **ChromaDB** (open-source vector store)

🔹 Store embeddings in the database and retrieve relevant results dynamically.

🔹 Explain how the integration improves scalability and search efficiency.
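
Before wiring in a real service, it may help to see the add-then-query shape that vector databases generally expose. The class below is a hypothetical in-memory stand-in, not AstraDB, Pinecone, or ChromaDB, and its method and parameter names are assumptions chosen for illustration. It scans every stored vector on each query, whereas real vector databases typically rely on approximate nearest-neighbor indexes to avoid that full scan, which is where the scalability gain comes from.

```python
import numpy as np

class InMemoryVectorStore:
    """Hypothetical stand-in that mimics the add/query shape of a vector database."""

    def __init__(self):
        self._ids, self._vectors, self._metadata = [], [], []

    def add(self, ids, embeddings, metadatas):
        for item_id, vector, metadata in zip(ids, embeddings, metadatas):
            vector = np.asarray(vector, dtype=float)
            # Normalize once at write time so queries only need a dot product.
            self._ids.append(item_id)
            self._vectors.append(vector / np.linalg.norm(vector))
            self._metadata.append(metadata)

    def query(self, query_embedding, n_results=3, where=None):
        """Return (id, score, metadata) for the closest items matching `where`."""
        query = np.asarray(query_embedding, dtype=float)
        query = query / np.linalg.norm(query)
        hits = []
        for item_id, vector, metadata in zip(self._ids, self._vectors, self._metadata):
            # Skip items whose metadata does not match every filter key.
            if where and any(metadata.get(k) != v for k, v in where.items()):
                continue
            hits.append((item_id, float(vector @ query), metadata))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:n_results]

store = InMemoryVectorStore()
store.add(
    ids=["1", "2", "3"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]],  # toy vectors, not real embeddings
    metadatas=[{"Genre": "Sci-Fi"}, {"Genre": "Romance"}, {"Genre": "Sci-Fi"}],
)
print(store.query([0.9, 0.1], n_results=2, where={"Genre": "Sci-Fi"}))
```

Swapping this stand-in for a real database client would mainly change where the vectors live and how the nearest neighbors are found; the surrounding recommender code can stay largely the same.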

In [49]:
import pandas as pd
import random

# Sample words for generating random movie titles
adjectives = ["Lost", "Mysterious", "Dark", "Galactic", "Quantum", "Eternal", "Forbidden", "Secret", "Hidden", "Haunted"]
nouns = ["Journey", "Warrior", "Galaxy", "Legacy", "Escape", "Revolt", "Curse", "Paradox", "Fortune", "Mystery"]

# Sample genres
genres = ["Action", "Adventure", "Sci-Fi", "Romance", "Horror", "Mystery", "Drama", "Comedy", "Thriller", "Fantasy"]

# Sample phrases for generating random movie descriptions
description_templates = [
    "A {adj} {noun} embarks on an epic adventure to save the world.",
    "In a world filled with {adj} secrets, a {noun} uncovers a hidden truth.",
    "A group of {adj} explorers sets out on a mission to find the legendary {noun}.",
    "When a {noun} discovers a {adj} power, everything changes forever.",
    "A {adj} battle between good and evil takes place in the heart of {noun}.",
    "The fate of humanity rests in the hands of a {adj} {noun}.",
    "A {noun} with a {adj} past is forced to confront their destiny.",
    "An unexpected romance blossoms between a {adj} hero and a {noun}.",
    "A detective must solve the {adj} case of the missing {noun}.",
    "A haunted {noun} holds secrets of a {adj} past."
]

# Number of movies to generate
num_movies = 100

# Generate movie dataset
movies = []
for i in range(1, num_movies + 1):
    title = f"The {random.choice(adjectives)} {random.choice(nouns)}"
    description = random.choice(description_templates).format(adj=random.choice(adjectives), noun=random.choice(nouns))
    genre = random.choice(genres)
    movies.append([i, title, description, genre])

# Create DataFrame
df = pd.DataFrame(movies, columns=["Movie_ID", "Title", "Description", "Genre"])

# Display first few rows
print(df.head())

# Save dataset
df.to_csv("synthetic_movies_dynamic.csv", index=False)


   Movie_ID                Title  \
0         1   The Secret Journey   
1         2     The Dark Warrior   
2         3  The Galactic Galaxy   
3         4   The Secret Journey   
4         5     The Hidden Curse   

                                         Description     Genre  
0  In a world filled with Hidden secrets, a Parad...   Fantasy  
1  A detective must solve the Forbidden case of t...  Thriller  
2  A Lost Curse embarks on an epic adventure to s...   Mystery  
3  The fate of humanity rests in the hands of a S...  Thriller  
4  An unexpected romance blossoms between a Haunt...    Action  
