## MLDS 424 GenAI: Assignment 1 Part 1

The code below tests the work done on the vector database.

### Setup Connections to Pinecone and MotherDuck

In [1]:
import polars as pl
from sentence_transformers import SentenceTransformer
from src.data_ingestion.mdutils import motherduck_setup
from src.data_ingestion.pcutils import pinecone_setup

# Get MotherDuck instance and Pinecone instance
md = motherduck_setup.MotherDucking("mlds-database", True)
pc = pinecone_setup.PineconeInstance("news-index", 768, "aws", "us-east-1", rebuild_index=False)
pc_index = pc.pinecone_setup()

### Query Data from MotherDuck

Query the last 10 rows of data (sorted by `Id`) from MotherDuck for testing purposes. Apply the same data cleaning and encoding strategies to these news headlines and short descriptions.

In [19]:
# Query the bottom 10 rows of data from MotherDuck table
query_string = 'SELECT * FROM "gen-ai".RawNewsCategory ORDER BY Id DESC LIMIT 10;'
df = (
    motherduck_setup.md_read_table(
        duck_engine=md.duckdb_engine,
        md_schema="gen-ai",
        md_table="RawNewsCategory",
        keep_columns=["Id", "NewsHeadline", "ShortDescription", "NewsDate", "NewsCategory"],
        custom_query=query_string
    )
    .with_columns(
        pl.concat_str([pl.col("NewsHeadline"), pl.col("ShortDescription")], separator=" ").alias("NewsDetails")
    )
    .select(pl.exclude("NewsHeadline", "ShortDescription"))
)

# Perform vector embeddings on test sentences
text_list = df.collect().select("NewsDetails").to_series().to_list()
eb_model = SentenceTransformer(model_name_or_path="sentence-transformers/all-mpnet-base-v2", device="mps")
embeddings = eb_model.encode(text_list, show_progress_bar=True)

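The Pinecone index above is created with dimension 768, so every embedding sent to it needs to have that length. A minimal sketch of a sanity check, written against plain lists so that it runs without loading the model; the `check_embedding_dim` helper is hypothetical, and in this notebook it could be called on `embeddings.tolist()`:

```python
def check_embedding_dim(vectors, expected_dim=768):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has dimension {len(vector)}, expected {expected_dim}"
            )
    # Return the number of vectors that passed the check
    return len(vectors)


# Toy stand-ins for real embeddings (assumption: 10 test rows, 768 dimensions)
fake_embeddings = [[0.0] * 768 for _ in range(10)]
print(check_embedding_dim(fake_embeddings))
```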

### Query Pinecone and Get Metadata

Using the embeddings from the previous cell, query Pinecone for the top 3 most similar vectors for each test item. The IDs returned by Pinecone are then used to retrieve the exact text from MotherDuck.
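As a rough illustration of what a top-3 similarity query computes, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It assumes the index uses cosine similarity, which this notebook does not state; the `top_n_cosine` helper and the toy vectors are hypothetical and are not part of `pinecone_setup`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_cosine(query, vectors, top_n=3):
    """Return the ids of the top_n vectors most similar to query, best first.

    `vectors` maps an id to its embedding (toy data here).
    """
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [vec_id for vec_id, _ in scored[:top_n]]


toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(top_n_cosine([1.0, 0.0], toy_vectors, top_n=3))
```

A real vector database typically replaces this exhaustive scan with an approximate nearest-neighbour index, but the ranking idea is the same.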

In [20]:
# Save Pinecone results in a list
pinecone_results = []
for embedding in embeddings:
    results = pc.query_pinecone(pinecone_index=pc_index, vector_embedding=embedding.tolist(), top_n=3)
    pinecone_results.append(results)

# From each item in the list, create a list of queries to get data from MotherDuck
queries = []
for result in pinecone_results:
    ids = [x["id"] for x in result]
    # Build the quoted, comma-separated id list separately for readability
    id_list = ", ".join(f"'{new_id}'" for new_id in ids)
    query_string = f'SELECT * FROM "gen-ai".RawNewsCategory WHERE Id IN ({id_list})'
    queries.append(query_string)

# Using these queries save the dataframes into a list
dfs = []
for query in queries:
    query_df = (
        motherduck_setup.md_read_table(
            duck_engine=md.duckdb_engine,
            md_schema="gen-ai",
            md_table="RawNewsCategory",
            keep_columns=["Id", "NewsHeadline", "ShortDescription", "NewsDate", "NewsCategory"],
            custom_query=query
        )
        .with_columns(
            pl.concat_str([pl.col("NewsHeadline"), pl.col("ShortDescription")], separator=" ").alias("NewsDetails")
        )
        .select(pl.exclude("NewsHeadline", "ShortDescription"))
    )
    dfs.append(query_df)
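
One thing to watch for: a SQL `IN (...)` filter does not guarantee that rows come back in the order the IDs were listed, so the "Top 1/2/3" labels printed later may not match the Pinecone ranking. A minimal sketch of restoring that ranking, using plain dictionaries in place of the MotherDuck rows (the `order_by_rank` helper and the sample rows are hypothetical):

```python
def order_by_rank(ranked_ids, rows, id_key="Id"):
    """Reorder rows so they follow ranked_ids (best match first).

    Ids are compared as strings, since the vector store may return
    string ids while the table stores integers. Ids with no matching
    row are skipped; rows not listed in ranked_ids are dropped.
    """
    rows_by_id = {str(row[id_key]): row for row in rows}
    return [rows_by_id[str(i)] for i in ranked_ids if str(i) in rows_by_id]


# Ranking as returned by the similarity search (toy values)
ranked_ids = ["42", "7", "19"]
# Rows as returned by the database, in arbitrary order (toy values)
rows = [
    {"Id": 7, "NewsCategory": "SPORTS"},
    {"Id": 19, "NewsCategory": "POLITICS"},
    {"Id": 42, "NewsCategory": "SPORTS"},
]
ordered = order_by_rank(ranked_ids, rows)
print([row["Id"] for row in ordered])
```

The same idea could be applied to each collected frame before printing, for example by converting it to a list of row dictionaries first.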

In [27]:
# Loop through everything and print out results
categories = df.collect().select("NewsCategory").to_series().to_list()
for idx, value in enumerate(text_list):
    top_string = ""

    data = dfs[idx].collect()
    top_categories = data.select("NewsCategory").to_series().to_list()
    top_news = data.select("NewsDetails").to_series().to_list()

    for inner_idx, inner_value in enumerate(top_categories):
        print_string = (
            f"""
            Top {inner_idx + 1} News Category: {inner_value}
            Top {inner_idx + 1} News Details: {top_news[inner_idx]}
            """
        )
        top_string = top_string + print_string

    print(
        f"""
        Requested News Category: {categories[idx]}
        Requested News Details: {value}
        ----------
        {top_string}
        """
    )


        Requested News Category: SPORTS
        Requested News Details: Dwight Howard Rips Teammates After Magic Loss To Hornets The five-time all-star center tore into his teammates Friday night after Orlando committed 23 turnovers en route to losing
        ----------
        
            Top 1 News Category: SPORTS
            Top 1 News Details: Dwight Howard Is Finished Masquerading As A Superstar 
            
            Top 2 News Category: SPORTS
            Top 2 News Details: Dwight Howard Responds To LeBron James' Full-Court Shot With One Of His Own Amazing.
            
            Top 3 News Category: SPORTS
            Top 3 News Details: Kobe Bryant Has Every Right To Be Upset With Laker Teammates When Kobe Bryant talks, we listen. That's what five NBA titles and a league MVP gets you. As a result, it became national news when the 36-year-old Bryant was caught on camera going after his teammates for a lackluster effort during practice. "Soft," he called them.
         