# ColBERT in RAGStack with Astra

This notebook shows how to use RAGStack to:

1. Create ColBERT embeddings
2. Index the embeddings in Astra
3. Retrieve with RAGStack and Astra
4. Use the LangChain ColBERT plugin

ColBERT paper: https://arxiv.org/abs/2004.12832

## Setup
Install RAGStack-ColBERT:

In [None]:
!pip install ragstack-ai-colbert

Prepare the documents. The sample texts below are already chunked: each dictionary value is used as one passage.

In [None]:
arctic_botany_dict = {
    "Introduction to Arctic Botany": "Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.",  # noqa: E501
    "Arctic Plant Adaptations": "Plants in the Arctic have developed unique adaptations to endure the extreme climate. Perennial growth, antifreeze proteins, and a short growth cycle are among the evolutionary solutions. These adaptations not only allow the plants to survive but also to reproduce in short summer months. Arctic plants often have small, dark leaves to absorb maximum sunlight, and some species grow in cushion or mat forms to resist cold winds. Understanding these adaptations provides insights into the resilience of Arctic flora.",  # noqa: E501
    "The Tundra Biome": "The Arctic tundra is a vast, treeless biome where the subsoil is permanently frozen. Here, the vegetation is predominantly composed of dwarf shrubs, grasses, mosses, and lichens. The tundra supports a surprisingly rich biodiversity, adapted to its cold, dry, and windy conditions. The biome plays a crucial role in the Earth's climate system, acting as a carbon sink. However, it's sensitive to climate change, with thawing permafrost and shifting vegetation patterns.",  # noqa: E501
    "Arctic Plant Biodiversity": "Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.",  # noqa: E501
    "Climate Change and Arctic Flora": "Climate change poses a significant threat to Arctic botany, with rising temperatures, melting permafrost, and changing precipitation patterns. These changes can lead to shifts in plant distribution, phenology, and the composition of the Arctic flora. Some species may thrive, while others could face extinction. This dynamic is critical to understanding future Arctic ecosystems and their global impact, including feedback loops that may exacerbate global warming.",  # noqa: E501
    "Research and Conservation in the Arctic": "Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.",  # noqa: E501
    "Traditional Knowledge and Arctic Botany": "Indigenous peoples of the Arctic have a deep connection with the land and its plant life. Traditional knowledge, passed down through generations, includes the uses of plants for nutrition, healing, and materials. This body of knowledge is invaluable for both conservation and understanding the ecological relationships in Arctic ecosystems. Integrating traditional knowledge with scientific research enriches our comprehension of Arctic botany and enhances conservation strategies.",  # noqa: E501
    "Future Directions in Arctic Botanical Studies": "The future of Arctic botany lies in interdisciplinary research, combining traditional knowledge with modern scientific techniques. As the Arctic undergoes rapid changes, understanding the ecological, cultural, and climatic dimensions of Arctic flora becomes increasingly important. Future research will need to address the challenges of climate change, explore the potential for Arctic plants in biotechnology, and continue to conserve this unique biome. The resilience of Arctic flora offers lessons in adaptation and survival relevant to global challenges.",  # noqa: E501
}
arctic_botany_texts = list(arctic_botany_dict.values())
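
The sample passages above are short enough to index as they are. For longer documents, a chunking step would come first. The following is a minimal sketch of one common approach (overlapping word windows). `chunk_text`, `max_words`, and `overlap` are hypothetical names for illustration and are not part of RAGStack.

```python
def chunk_text(text: str, max_words: int = 60, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (hypothetical helper, not a RAGStack API)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The resulting list of strings could then be passed wherever `arctic_botany_texts` is used.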

## Step 1. Set up the ColBERT Astra configuration

In [None]:
import os
from getpass import getpass

os.environ["ASTRA_DB_ID"] = input("Enter your Astra DB ID: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
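
The prompts above suit an interactive session. When the notebook runs unattended, the two variables would have to be set beforehand, and a small check can fail early with a readable message. This is a sketch only; `require_env` is a hypothetical helper, not something RAGStack provides.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a clear error (hypothetical helper)."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

For example, `require_env("ASTRA_DB_APPLICATION_TOKEN")` could replace the direct `os.environ[...]` lookup when building the database connection.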

In [None]:
import os

from ragstack_colbert import (
    CassandraDatabase,
    ColbertEmbeddingModel,
    ColbertVectorStore,
)

database = CassandraDatabase.from_astra(
    astra_token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    database_id=os.environ["ASTRA_DB_ID"],
)

embedding_model = ColbertEmbeddingModel()

vector_store = ColbertVectorStore(
    database=database,
    embedding_model=embedding_model,
)

## Step 2. Create embeddings and ingest into Astra in a single line

This step connects to Astra, including table and index creation, then embeds the texts and stores them.

In [None]:
results = vector_store.add_texts(texts=arctic_botany_texts, doc_id="arctic_botany")

## Step 3. Retrieval

Create a RAGStack retriever and start asking questions against the indexed embeddings. The library handles:
* Embedding the query tokens
* Generating candidate documents using Astra ANN search
* Max similarity (MaxSim) scoring
* Ranking
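
To make the scoring and ranking steps concrete, here is a minimal pure-Python sketch of ColBERT-style MaxSim as described in the paper: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. It is an illustration under simplifying assumptions (tiny hand-made vectors, dot-product similarity, candidates already fetched), not the library's implementation. `maxsim_score` and `rank_by_maxsim` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_vecs:
        return 0.0  # a document with no token embeddings cannot match
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_by_maxsim(
    query_vecs: Sequence[Vector],
    candidates: Dict[str, Sequence[Vector]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the top-k (id, score) pairs, best first."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: two query token vectors and three candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "b": [[1.0, 0.0]],              # matches only the first
    "c": [[0.0, 0.5]],              # weak match on the second
}
top = rank_by_maxsim(query, candidates, k=2)
print(top)  # [('a', 2.0), ('b', 1.0)]
```

In the real pipeline the candidate set would come from the ANN search rather than a hand-written dictionary.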

In [None]:
import logging

import nest_asyncio

nest_asyncio.apply()

logging.getLogger("cassandra").setLevel(logging.ERROR)  # workaround to suppress logs
retriever = vector_store.as_retriever()

answers = retriever.text_search("What is Arctic botany?", k=2)
for rank, (answer, score) in enumerate(answers):
    print(f"Rank: {rank} Score: {score} Text: {answer.text}\n")

## Step 4. LangChain retriever

In [None]:
!pip install "ragstack-ai-langchain[colbert]"

In [None]:
from ragstack_langchain.colbert import ColbertVectorStore as LangchainColbertVectorStore

lc_vector_store = LangchainColbertVectorStore(
    database=database,
    embedding_model=embedding_model,
)

docs = lc_vector_store.similarity_search(
    "what kind fish lives shallow coral reefs atlantic, indian ocean, "
    "red sea, gulf of mexico, pacific, and arctic ocean"
)
print(f"first answer: {docs[0].page_content}")