# AIChampionsHub : Academy

### Module 2: Adapting AI for Enterprise Use : Retrieval Augmented Generation

### Use Case 07 : Using Semi-Structured Data for "Chat with Data"
This notebook is part of the courses by **AIChampionsHub**. The AI Fundamentals and AI Engineering courses both use it.

---
<a href="https://github.com/aichampionslearn/01_LLM_Basics"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/aichampionslearn/01_LLM_Basics/blob/main/AICH_L2_AIAgents_M1_D3_BasicLLMAppv01.ipynb)

### Objective

- AG is a collection of more than 1 million news articles, gathered from more than 2,000 news sources by ComeToMyHead over more than a year of activity. The dataset is available on Kaggle:
https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset

- We will take this semi-structured data, embed it, store it in a vector database, and use it to enable user analysis.
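
Each row of this dataset pairs free text with a few structured fields, which is what makes it semi-structured. The split between the text that gets embedded and the metadata that is stored alongside it can be sketched with the standard library alone. The two rows below are made-up placeholders, not records from the AG News file, and `rows_to_records` is a hypothetical helper name; the column names `title`, `description`, and `label` are the ones this notebook reads later.

```python
import csv
import io

# Hypothetical sample rows (placeholders, not taken from the AG News file).
SAMPLE_CSV = """title,description,label
Example headline A,Short summary of a sports story.,Sports
Example headline B,Short summary of a business story.,Business
"""

def rows_to_records(csv_text):
    """Split each CSV row into free text (to embed) and structured metadata."""
    records = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        records.append({
            "id": str(index),
            "text": row["description"],          # unstructured part: gets embedded
            "metadata": {                        # structured part: stored as-is
                "title": row["title"],
                "label": row["label"],
            },
        })
    return records

records = rows_to_records(SAMPLE_CSV)
for record in records:
    print(record["id"], record["metadata"]["label"], "-", record["text"])
```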

For OpenAI, make sure you have an `OPENAI_API_KEY`, either stored as a Colab secret or ready to enter at the prompt.

In [None]:
# %%capture --no-stderr
!pip install wget --quiet
!pip install pandas --quiet
!pip install chromadb --quiet

In [None]:
!pip install --quiet -U langchain_openai langchain_core langchain_community langchain_ollama langchain_chroma tavily-python
!pip install --quiet sentence-transformers

In [None]:
# from langchain_ollama import OllamaEmbeddings
# from langchain_community.vectorstores import Chroma

import os, getpass
from google.colab import userdata   #For Secret Key
from langchain_chroma import Chroma
from langchain_core.documents import Document  # Document class (page_content + metadata)
from tqdm import tqdm  # For showing Progress bar during longer iterations

import wget                         # To download data file from the OpenAI Cookbook repository
import pandas as pd                 # DataFrame for easy data manipulation

In [None]:
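def _set_OpenAIKey(var: str, env: int):
    """Load an API key into os.environ and return it.

    env == 0: read the key from Colab secrets (userdata).
    env != 0: prompt for the key interactively.
    """
    if not env:
        key = userdata.get(var)
    else:
        key = getpass.getpass(f"{var}: ")
    os.environ[var] = key
    return key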

In [None]:
OPENAI_API_KEY = _set_OpenAIKey("OPENAI_API_KEY",0) #0 for reading from userdata

In [None]:
# DATA_FILE_PATH_URL = "https://github.com/openai/openai-cookbook/blob/main/examples/data/AG_news_samples.csv"
DATA_FILE_PATH_URL = "https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv"
DATA_FILE_NAME = "AG_news_samples.csv"

In [None]:
if not os.path.exists(DATA_FILE_NAME):
    wget.download(DATA_FILE_PATH_URL, DATA_FILE_NAME)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

File downloaded successfully.


## STEP2 : Load the Data and the Embedding Model

In [None]:
df = pd.read_csv(DATA_FILE_NAME)  # Same relative path the download cell wrote to
# data = df.to_dict(orient='records')  #List of Dictionary values - name:value pairs
# data[0:2]
df = df[0:20].copy()  # Keep only the first 20 rows for a quick demo

In [None]:
#!pip install --quiet sentence-transformers

In [None]:
# embedding_model = OllamaEmbeddings(model="nomic-embed-text")

# Below is a smaller model if you want to use one
# all-MiniLM-L6-v2
# embedding_model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
from langchain_openai import OpenAIEmbeddings

# EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2' # "text-embedding-3-large"
EMBEDDING_MODEL_NAME = "text-embedding-3-small"  # Use a valid OpenAI embedding model name
embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

## STEP3 : Setup a Database (vector) to Store our Data

In [None]:
COLLECTION_NAME = "AG_news"
# collection = client.create_collection(name="ag_news")
from langchain_chroma import Chroma

vector_db = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embedding_model,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

In [None]:
from tqdm import tqdm

In [None]:
def fn_embed_with_chroma_v01(df, embedding_model):
    documents_to_add = []

    # Process each row in the DataFrame with a progress bar
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):

        description = row['description']
        # Inspection only: embed_documents expects a list of strings, so wrap
        # the text in a list and take the single resulting vector.
        embedding = embedding_model.embed_documents([description])[0]
        print(description)
        print(embedding[0:20])
        document = Document(
            page_content=description,  # Text content to embed
            metadata={'title': row['title'], 'label': row['label']},
            id=str(index),
        )

        # Collect the document; it is added to the store after the loop
        documents_to_add.append(document)

    # Add all documents in one call; vector_db embeds them itself
    # using the embedding_function it was created with
    vector_db.add_documents(documents=documents_to_add)

    return vector_db

In [None]:
def fn_embed_with_chroma(df, embedding_model):
    embeddings = []
    documents_to_add = []

    # Process each row in the DataFrame with a progress bar
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):

        document = Document(
            page_content=row['description'],  # Text content to embed
            metadata={'title': row['title'], 'label': row['label']},
            id=str(index),
        )
        # Embed explicitly so the (document, vector) pairs can be returned
        embedding = embedding_model.embed_documents([document.page_content])[0]
        embeddings.append((document, embedding))

        # Append the Document itself, not the list of (document, vector) pairs
        documents_to_add.append(document)

    # Add documents to the vector store using add_documents
    vector_db.add_documents(documents=documents_to_add)

    return embeddings

In [None]:
# document_embeddings  = fn_embed_with_chroma(df, embedding_model)
fn_embed_with_chroma_v01(df, embedding_model)
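
The notebook keeps only the first 20 rows, so a single `add_documents` call is enough. For a larger slice, one option is to add documents in fixed-size batches so that each request stays small. A minimal sketch, assuming a hypothetical helper named `batched` and an arbitrary batch size; plain strings stand in for the `Document` objects:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in list; in the notebook this would be the list of Document objects.
pending = [f"doc-{i}" for i in range(7)]

for batch in batched(pending, batch_size=3):
    # With the real store this line would be:
    #   vector_db.add_documents(documents=batch)
    print(len(batch), batch)
```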

In [None]:
# Search for documents similar to the query "climate change" and get top 3 results:
QUERY = "climate change"
results = vector_db.similarity_search(QUERY, k=3)

# Print the page content of the most similar document:
print(results[0].page_content)


BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases.
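
Conceptually, a similarity search embeds the query, scores every stored vector against the query vector, and returns the `k` best matches. The sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. The vectors, document names, and helper names are illustrative assumptions, and the index and distance metric that Chroma actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=3):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (made up for illustration; real ones have far more dimensions).
toy_store = {
    "climate_story": [0.9, 0.1, 0.0],
    "sports_story": [0.0, 0.2, 0.9],
    "tech_story": [0.1, 0.9, 0.1],
}
toy_query = [0.8, 0.2, 0.1]  # Pretend embedding of the query text

for doc_id, score in top_k(toy_query, toy_store, k=2):
    print(f"{doc_id}: {score:.3f}")
```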


In [None]:
# Search for documents similar to the query "Technology Trends" and get top 3 results:
QUERY = "Technology Trends"
results = vector_db.similarity_search(QUERY, k=3)

# Print the page content of the most similar document:
print(results[0].page_content)

Any product, any shape, any size -- manufactured on your desktop! The future is the fabricator. By Bruce Sterling from Wired magazine.


In [None]:
# Search for documents similar to the query "Articles about India" and get top 3 results:
QUERY = "Articles about India"
results = vector_db.similarity_search(QUERY, k=3)

# Print the page content of the most similar document:
print(results[0].page_content)

AFP - Hosts India braced themselves for a harrowing chase on a wearing wicket in the first Test after Australia declined to enforce the follow-on here.
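
The top hit for "Articles about India" is a cricket report, which mentions India but may not be what an analyst is looking for. One option is to narrow the results using the metadata stored with each document. The sketch below post-filters a ranked result list by `label`. `FakeDoc` is a hypothetical stand-in for the objects returned by `similarity_search`, assuming only that each result exposes `page_content` and a `metadata` dict; the sample texts and label values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a search result (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_label(results, label):
    """Keep only results whose metadata 'label' equals `label`, preserving rank order."""
    return [doc for doc in results if doc.metadata.get("label") == label]

# Made-up results in ranked order (not real notebook output).
ranked = [
    FakeDoc("Cricket report mentioning India.", {"label": "Sports"}),
    FakeDoc("Trade talks involving India.", {"label": "Business"}),
    FakeDoc("Another match summary.", {"label": "Sports"}),
]

for doc in filter_by_label(ranked, "Business"):
    print(doc.page_content)
```

With the real store, the same function could be applied to the list returned by `vector_db.similarity_search(QUERY, k=10)`, requesting more results than needed so that some survive the filter.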
