This notebook is adapted from Cohere's notebook on using a [RAG enabled Chatbot](https://github.com/cohere-ai/notebooks/blob/main/notebooks/llmu/RAG_with_Chat_Embed_and_Rerank.ipynb?ref=cohere-ai.ghost.io) for the WiDS GenAI workshop.

We will use Cohere's streaming chatbot, then add documents and receive citations showing which documents were used to answer the query.

In [1]:
import cohere
import config
import os
os.chdir("..")

co = cohere.Client(config.COHERE_KEY) 


In [2]:
pwd

'C:\\Users\\cwithrow\\repos\\ai_demo'

## Quick Example

In [3]:
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."}
]

In [4]:
# Get the user message
message = "What are the tallest living penguins?"


# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents=documents)

# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in cited_documents:
        print(document)

The tallest living penguins are Emperor penguins. They are native to Antarctica.

CITATIONS:
start=32 end=49 text='Emperor penguins.' document_ids=['doc_0']
start=59 end=80 text='native to Antarctica.' document_ids=['doc_1']

DOCUMENTS:
{'id': 'doc_0', 'text': 'Emperor penguins are the tallest.', 'title': 'Tall penguins'}
{'id': 'doc_1', 'text': 'Emperor penguins only live in Antarctica.', 'title': 'Penguin habitats'}
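
Each citation printed above carries `start` and `end` character offsets into the generated text, plus the ids of the supporting documents. One way to make that link visible is to splice the document ids back into the answer. This is a minimal sketch and not part of the Cohere SDK: the `Citation` dataclass and the `annotate` helper are hypothetical stand-ins that only mirror the fields shown in the output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """Hypothetical stand-in mirroring the citation fields printed above."""
    start: int
    end: int
    text: str
    document_ids: List[str]


def annotate(text: str, citations: List[Citation]) -> str:
    """Insert [doc ids] after each cited span, working right to left
    so that earlier offsets stay valid while the string grows."""
    annotated = text
    for c in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = " [" + ", ".join(c.document_ids) + "]"
        annotated = annotated[:c.end] + marker + annotated[c.end:]
    return annotated


answer = "The tallest living penguins are Emperor penguins. They are native to Antarctica."
example_citations = [
    Citation(32, 49, "Emperor penguins.", ["doc_0"]),
    Citation(59, 80, "native to Antarctica.", ["doc_1"]),
]
print(annotate(answer, example_citations))
```

With the real stream you would collect the text chunks into one string first, then pass the collected citation objects, assuming they expose the same four attributes.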


## Vitamin studies example
Document chunking can be a project of its own, with many possible strategies. To keep this demo simple, we will use text stored in a CSV file from the National Institutes of Health's (NIH) public database on vitamin and supplement studies. Other use cases for text in a CSV file include question-and-answer data from past customer interactions, a personal health journal, or scraped forum data.

**Note:** The following example is used primarily because the data is public and it is a good model of how RAG can improve generation. A more compelling use case is private documents, such as internal policy docs, protected health information, or customer service chat histories (using a private LLM), which models generally cannot access.

### Prep data

Data exported from the NIH CARDS database by searching for various vitamins and supplements, restricted to human studies: https://cards.od.nih.gov/Application/Search

In [5]:
import pandas as pd
pd.set_option("max_colwidth", 300)

In [6]:
studies_df=pd.read_csv('./data/nih_vitamin_studies.csv').drop('Unnamed: 13',axis=1)

In [7]:
studies_df.head(2)

Unnamed: 0,Type,Activity Code,Project ID,Project Title,SubProject ID,Project End Date,Investigator Names,Country,Total Dollars Amount,Fiscal Year,Funding IC,Project Abstract,Public Health Relevance
0,5,K23,DK084115,Investigations into the Glutamine-Citruline-Arginine Pathway in Sepsis,,31-Dec-2015,"Kao, Christina",UNITED STATES,171885,2014,DK,"DESCRIPTION (provided by applicant): Dr. Christina Kao is an assistant professor in the Section of Pulmonary, Critical Care, and Sleep Medicine at Baylor College of Medicine (BCM). Her short-term goal is to develop the knowledge and skills to conduct metabolic research using stable isotope ...","PUBLIC HEALTH RELEVANCE: Two compounds, arginine and glutamine, help our body fight severe infections. This research will help to better define the relationship between these two compounds and will determine if supplying more of one (glutamine) will in turn lead to increases in the other (argini..."
1,5,R01,HD072120,Early Childhood Development for the Poor: Impacting at Scale,,31-May-2018,"Meghir, Konstantinos",UNITED STATES,427218,2014,HD,"DESCRIPTION (provided by applicant): Neurobiological science has established that the first three years of life lay the basis for lifelong outcomes. But during this life-cycle stage children living in poverty are often vulnerable to negative influences including malnutrition, illnesses, an ...","PUBLIC HEALTH RELEVANCE: By identifying cost-effective and scalable early-years interventions, our research has the potential to revolutionize early childhood development (ECD) policies and contribute to breaking the intergenerational cycle of poverty. If the interventions we propose to test pr..."


In [8]:
studies_df=studies_df.fillna('')

In [9]:
# Limit to key data columns and keep Project ID and Investigator Names for later citation reference
studies_df=studies_df[['Project Title','Project ID','Investigator Names','Project Abstract']]

In [10]:
# Convert to list of dictionaries for Cohere API
documents=studies_df.to_dict('records')
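
To my understanding, the chat endpoint expects each document to be a dictionary of string keys and string values, which is why the missing values were filled with empty strings before the conversion. The following is a minimal sketch of the same cleanup done on plain records, in case your data has numeric or missing fields. The `to_string_records` helper and the sample record are hypothetical and not part of pandas or the Cohere SDK.

```python
import math
from typing import Any, Dict, List


def to_string_records(records: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return copies of the records with every value converted to a string.
    None and NaN become empty strings, similar to fillna('')."""
    cleaned = []
    for record in records:
        row = {}
        for key, value in record.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            row[str(key)] = "" if value is None or is_nan else str(value)
        cleaned.append(row)
    return cleaned


# Hypothetical record with a numeric ID and a missing abstract
sample = [{"Project Title": "Example study", "Project ID": 12345, "Project Abstract": float("nan")}]
print(to_string_records(sample))
```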

In [11]:
# Sample
documents[15]

{'Project Title': 'Biological and Environmental modifiers of Vitamin D3 and Prostate Cancer Risk',
 'Project ID': 'MD007105',
 'Investigator Names': 'KITTLES, RICK Antonius',
 'Project Abstract': ' ABSTRACT Disparities in prostate cancer (Pca) are caused by complex interactions of genetic susceptibility, individual risk factors, and environmental factors. Pca is the second leading cause of death among all men; however African American (AA) men have the highest mortality rate of Pca of any racial/ethnic group in the U.S. This difference in mortality accounts for 44% of the overall cancer mortality disparity between AA and European-American (EA) men. Thus, there is a critical need to explore the etiologic pathways that contribute to this disparity. Unfortunately, the only well-established risk factors (age, race and family history) for Pca are non-modifiable. However recent studies have found low levels of vitamin D have been associated with increased Pca risk, and treatment with vitamin

### Question without documents
First let's ask some questions without supplying any additional context.

In [12]:
# Get the user message
message = "What is the research on vitamin D and cancer?"

# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",)

In [13]:
for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)


Vitamin D is a fat-soluble vitamin that is essential for maintaining bone health and has also been studied for its potential role in cancer prevention and treatment. Here is an overview of the current research on vitamin D and cancer:

**1. Vitamin D and Cancer Prevention:**

   - Some observational studies have suggested an association between low levels of vitamin D and an increased risk of certain types of cancer, including colorectal, breast, and prostate cancer. However, it is important to note that these studies do not prove causation.
   - A 2019 review of randomized controlled trials found that vitamin D supplementation did not significantly reduce the risk of developing cancer, except for a possible small reduction in the risk of lung cancer.
   - More research is needed to understand the potential role of vitamin D in cancer prevention, including the optimal dose and the potential benefits for specific populations.

**2. Vitamin D and Cancer Treatment:**

   - In vitro and an

### Send bulk documents directly with question
Vitamin D deficiency is a pretty well studied condition, so the results above aren't bad, but they do not provide sources. Let's directly send the same question along with some documents as context.

In [14]:
# Get the user message
message = "What is the research on vitamin D and cancer?"

# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents=documents[:20] # doesn't handle more than this many
                         )

In [15]:
# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in cited_documents:
        print(document)

There are several research projects that look into the relationship between vitamin D and cancer. Here are some of them:
- The VITamin D and OmegA-3 TriaL (VITAL) is an NIH-supported, large, randomized, double-blind, placebo-controlled, 2x2 factorial trial that will test 2000 IU/day of vitamin D (as vitamin D3) and 1 g/day of marine I-3 FA (eicosapentaenoic acid [EPA] + docosahexaenoic acid [DHA]) supplements on incident cardiovascular disease and cancer in 20,000 multiethnic men and women with 5 years of treatment and follow-up.
- A project titled "Vitamin D, Steroids, and Asthma in African American Youth" looks at the contribution of vitamin D to disparities in the chronic control and acute severity of asthma in urban African American youth.
- A project titled "Biological and Environmental modifiers of Vitamin D3 and Prostate Cancer Risk" explores the effects of serum Vitamin D, UVR exposure, skin color, age, BMI, and genes involved in Vitamin D synthesis, metabolism, and signaling o

In [16]:
studies_df[studies_df['Project ID']=='HD083113-02']

Unnamed: 0,Project Title,Project ID,Investigator Names,Project Abstract
263,Trial of Vitamin D in Maternal HIV Progression and Child Health,HD083113-02,"Fawzi, Wafaie W; Sudfeld, Christopher",DESCRIPTION (provided by applicant): The overall project goal is to investigate a vitamin D3 (cholecalciferol) as a simple and low cost intervention to prolong and improve quality of life for HIV-infected pregnant women and their children in resource limited settings. In order to meet these...


**Problem!** There is a limit on how much you can send as context. The limit varies by model, so it is better to first find the most relevant documents and send only those.
Embedding models turn document chunks or passages into vectors, which lets you run similarity searches to find the relevant documents in the (hopefully large) dataset you embed.
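
To make the idea of a similarity search concrete, here is a minimal sketch that ranks a few documents against a query by cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from an embedding model and have many more dimensions. The `cosine` and `top_k` helpers are hypothetical names, not library functions.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents whose vectors are most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for this example only
toy_vectors = {
    "vitamin D and prostate cancer": [0.9, 0.8, 0.1],
    "vitamin D and asthma": [0.9, 0.1, 0.7],
    "early childhood development": [0.0, 0.1, 0.2],
}
toy_query = [1.0, 1.0, 0.0]

for name, score in top_k(toy_query, toy_vectors):
    print(f"{score:.3f}  {name}")
```

A brute-force scan like this is fine for a handful of vectors. For larger collections, an approximate nearest-neighbor index avoids comparing the query against every document.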


### Embed documents
Modified from Cohere's intro, which uses the unstructured library: https://cohere.com/blog/rag-chatbot#embed-the-document-chunks

Other tools you could use include Weaviate or pgvector for storing and searching embeddings, and Hugging Face models for reranking.



In [17]:
import hnswlib
from typing import List, Dict


In [18]:
class Vectorstore:
    """
    A class representing a collection of documents indexed into a vectorstore.

    Parameters:
    document_list (list): A list of document collections. Each collection is a list of dictionaries with 'Project Title', 'Project Abstract', 'Project ID', and 'Investigator Names' keys.

    Attributes:
    document_list (list): The document collections passed in.
    docs (list): A list of dictionaries representing the loaded documents, with 'title', 'text', 'project_id', and 'authors' keys.
    docs_embs (list): A list of the associated embeddings for the documents.
    docs_len (int): The number of documents in the collection.
    idx (hnswlib.Index): The index used for document retrieval.
    retrieve_top_k (int): The number of documents returned by dense retrieval.
    rerank_top_k (int): The number of documents kept after reranking.

    Methods:
    load_text(): Loads the records from the document collections into docs.
    embed(): Embeds the document text using the Cohere API.
    index(): Indexes the document embeddings for efficient retrieval.
    retrieve(): Retrieves and reranks documents based on the given query.
    """

    def __init__(self, document_list: List[List[Dict[str, str]]]):
        self.document_list = document_list
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_text()
        self.embed()
        self.index()

    def load_text(self) -> None:
        """
        Loads each record from the document collections into self.docs. No chunking is done; each abstract is treated as a single chunk.
        """
        print("Loading documents...")

        for doc_data in self.document_list:
            
            for doc in doc_data:
                self.docs.append(
                    {
                        "title": doc["Project Title"],
                        "text": doc['Project Abstract'],
                        "project_id": doc['Project ID'],
                        "authors": doc['Investigator Names'], 
                    }
                )
    
    def embed(self) -> None:
        """
        Embeds the document text using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 50
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'project_id' keys.
        """

        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "project_id": self.docs[doc_id]["project_id"],
                }
            )

        return docs_retrieved
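
The `embed()` method above sends the texts to the API in slices of 50 instead of all at once. The same batching pattern can be pulled out into a small generator, sketched below. The `batched` helper is a hypothetical name introduced here, and the sample texts are placeholders.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder texts standing in for project abstracts
texts = [f"abstract {i}" for i in range(120)]
print([len(batch) for batch in batched(texts, 50)])
```

Slicing past the end of a list is safe in Python, so the last batch is simply shorter and no explicit `min()` is needed.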

In [19]:
# embed ALL the documents
vectorstore = Vectorstore([documents])

Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 922 document chunks.


In [20]:
# Example
vectorstore.retrieve("vitamin d and cancer")

[{'title': 'Vitamin D and Follicular Lymphoma',
  'text': 'Program Director/Principal Investigator (Last, First, Middle): Friedberg, Jonathan, W\nABSTRACT\nIn addition to the primary effects on calcium homeostasis, Vitamin D has important effects on both\ninnate and adaptive immunity in humans. However, the degree to which these pleotropic effects of\nVitamin D influences specific immune responses to infections and cancer in humans is not known.\nIndolent B-cell lymphoma represents a unique model system to define the effect of Vitamin D on the\nimmune response to cancer. Gene expression profiling studies demonstrate the importance of\nimmune-based signatures in the lymph node microenvironment on prognosis in follicular lymphoma. A\nrecently published study identified that immune-infiltration measured by PD-L2 expression in the\nmicroenvironment of follicular lymphoma is strongly predictive of outcome, where low immune\ninfiltration was associated with increased risk of early relapse af

### Pull it together

In [21]:
message = "What is the research on vitamin D and cancer?"
embedded_documents= vectorstore.retrieve(message)

# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents= embedded_documents
                          
                         )

In [22]:
# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in cited_documents:
        print(document)

There is research to suggest that vitamin D has anti-neoplastic properties, and that individuals with higher plasma 25-hydroxyvitamin D [25(OH)D] levels have a lower risk of colorectal cancer (CRC) and improved survival rates. However, it is unclear whether these findings reflect a true causal relationship. Vitamin D also has important effects on both innate and adaptive immunity in humans, and it may directly influence the immune response to malignancy.

CITATIONS:
start=48 end=74 text='anti-neoplastic properties' document_ids=['doc_1', 'doc_2']
start=102 end=152 text='higher plasma 25-hydroxyvitamin D [25(OH)D] levels' document_ids=['doc_1', 'doc_2']
start=160 end=197 text='lower risk of colorectal cancer (CRC)' document_ids=['doc_1', 'doc_2']
start=202 end=226 text='improved survival rates.' document_ids=['doc_1', 'doc_2']
start=242 end=308 text='unclear whether these findings reflect a true causal relationship.' document_ids=['doc_1', 'doc_2']
start=328 end=392 text='important effe
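
The ids in the citations (`doc_1`, `doc_2`) only make sense next to the document list, so it can help to map them back to the project IDs we kept for reference. The following is a minimal sketch using plain dictionaries. The titles and project IDs are placeholders, not real NIH records, and `claims_by_project` is a hypothetical helper. It also assumes the returned documents echo the `project_id` field we sent. With the SDK objects you would read `citation.text` and `citation.document_ids` as attributes instead of dictionary keys.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical values shaped like the stream-end documents and citations;
# the titles and project IDs are placeholders, not real NIH records.
cited_docs = [
    {"id": "doc_1", "title": "Placeholder study A", "project_id": "EXAMPLE-001"},
    {"id": "doc_2", "title": "Placeholder study B", "project_id": "EXAMPLE-002"},
]
cites = [
    {"text": "anti-neoplastic properties", "document_ids": ["doc_1", "doc_2"]},
    {"text": "improved survival rates.", "document_ids": ["doc_1"]},
]


def claims_by_project(citations: List[dict], documents: List[dict]) -> Dict[str, List[str]]:
    """Group cited text spans under the project ID of each supporting document."""
    project_for = {doc["id"]: doc["project_id"] for doc in documents}
    grouped = defaultdict(list)
    for citation in citations:
        for doc_id in citation["document_ids"]:
            # Fall back to "unknown" if a cited id is missing from the document list
            grouped[project_for.get(doc_id, "unknown")].append(citation["text"])
    return dict(grouped)


for project_id, claims in claims_by_project(cites, cited_docs).items():
    print(project_id, "->", claims)
```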

## Do more with the full tutorial!
I encourage you to work through Cohere's full tutorial, which also uses reranking: https://cohere.com/blog/rag-chatbot

Reranking takes the retrieved document chunks and re-scores their relevance based on context and domain. Whereas retrieval may focus on matches specific to the words in the query, a reranker tries to focus on the general subject of the query and ranks the retrieval results by relevance to that subject.
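
The two-stage shape (a broad, cheap retrieval followed by a more careful rerank of the shortlist) can be sketched without any API calls. Everything below is a toy: `word_overlap` and `overlap_ratio` are hypothetical scoring functions standing in for the embedding search and the rerank model, and the documents are invented. The default cutoffs of 10 and 3 mirror the values used in the `Vectorstore` class.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    cheap_score: Callable[[str, str], float],
    careful_score: Callable[[str, str], float],
    retrieve_top_k: int = 10,
    rerank_top_k: int = 3,
) -> List[str]:
    """Shortlist with the cheap scorer, then reorder the shortlist with the careful one."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_top_k]
    return sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)[:rerank_top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy first stage: count the words shared by the query and the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy second stage: shared words as a fraction of the document's words."""
    doc_words = set(doc.lower().split())
    return len(set(query.lower().split()) & doc_words) / max(len(doc_words), 1)


toy_docs = [
    "vitamin d and colorectal cancer survival",
    "vitamin d and asthma in urban youth with a long list of unrelated secondary outcomes",
    "cancer screening costs",
    "early childhood development",
]
results = retrieve_then_rerank(
    "vitamin d and cancer", toy_docs, word_overlap, overlap_ratio,
    retrieve_top_k=3, rerank_top_k=2,
)
print(results)
```

In this toy run the asthma document makes the shortlist because it shares several words with the query, but the second stage drops it in favor of the shorter, more focused cancer document.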
