# Amazon AWS S3 Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://www.llamaindex.ai/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using documents stored in an Amazon AWS S3 bucket.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in your home directory in the `~/.cohere.key` file.
- You will also need an Amazon AWS account, with some documents stored in an S3 bucket.
- Store your AWS credentials in a text file at `~/.aws/credentials`. This file must use the following format:
  
  ```
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY
    aws_secret_access_key = YOUR_SECRET_KEY
  ```

## Set up the RAG workflow environment

In [1]:
import os
from pathlib import Path
import sys

from llama_index import download_loader, ServiceContext, VectorStoreIndex
from llama_index.embeddings.cohereai import CohereEmbedding
from llama_index.llms import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [2]:
def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text through get_content()
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [3]:
# Set up the environment
try:
    # read_text() opens and closes the key file for us
    os.environ["COHERE_API_KEY"] = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["CO_API_KEY"] = os.environ["COHERE_API_KEY"]
except Exception:
    sys.exit("Unable to read your Cohere API key. Make sure this is stored in a text file in your home directory at ~/.cohere.key\n")

# Make sure that the AWS credentials are stored in ~/.aws/credentials
# (an explicit check, since assert statements are skipped under `python -O`)
aws_credentials_file = Path.home() / ".aws" / "credentials"
if not aws_credentials_file.exists():
    sys.exit(f"Unable to find your AWS credentials file at {aws_credentials_file}. Make sure this file exists and contains your AWS credentials.\n")

# Make sure the AWS credentials are stored in the correct format
try:
    aws_credentials = open(aws_credentials_file, "r").read().strip()
    assert aws_credentials.startswith("[default]")
    assert "aws_access_key_id" in aws_credentials
    assert "aws_secret_access_key" in aws_credentials
except Exception:
    sys.exit(f"""Unable to load AWS credentials. Make sure your ~/.aws/credentials file is in the following format:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY\n""")
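
The string checks above accept any file that merely mentions the two key names somewhere. A stricter check can parse the file with Python's standard `configparser` module. The following is a minimal sketch; the helper name `has_default_aws_credentials` is hypothetical and not part of the original notebook.

```python
import configparser


def has_default_aws_credentials(text: str) -> bool:
    """Return True if `text` parses as an INI file whose [default] section
    defines non-empty aws_access_key_id and aws_secret_access_key values."""
    # Disable interpolation so a '%' inside a key is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.Error:
        return False  # not valid INI syntax (e.g. no section header)
    if not parser.has_section("default"):
        return False
    section = parser["default"]
    return all(
        section.get(key, "").strip()
        for key in ("aws_access_key_id", "aws_secret_access_key")
    )
```

With the variables from the cell above, this could be called as `has_default_aws_credentials(aws_credentials_file.read_text())`.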

## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want a highly domain-specific question, so let's ask about an obscure research project that the model is unlikely to know:

"*Describe the goals of the OpenNF project.*"

In [4]:
query = "Describe the goals of the OpenNF project."

## Send the generation request to Cohere

In [5]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])
result = llm.complete(query)
print(result)

unknown field: parameter model is not a valid field


 The OpenNF project is a collaborative initiative aimed at promoting the development and adoption of Natural Language Processing models (NLP) that are open source, transparent, and accessible to a broader range of users. 

Specifically, the project has the following goals: 

1. **Inclusivity:** The project aims to make state-of-the-art NLP tools and datasets accessible to researchers, developers, and organizations who may not have the resources to afford proprietary models and subscriptions offered by large tech companies. The project is particularly keen on ensuring that underserved languages and dialects benefit from advances in NLP. 

2. **Transparency and Trustworthiness:** OpenNF proponents want to address concerns surrounding the lack of transparency in many of today's NLP models. They believe that openness in terms of data collection, model training, and algorithmic decision-making is crucial for building confidence, addressing biases, and ensuring ethical use of these technolog

Cohere will occasionally get this right *(TODO: find a better example!)* but usually gets it wrong. The correct answer is that OpenNF is an NFV+SDN networking project that focuses on state transfer in virtualized network environments.

## Ingestion: Retrieve some documents from an Amazon S3 bucket

I've added a few PDF documents related to the OpenNF project to an Amazon S3 bucket:

![aws-s3-snapshot](imgs/aws-s3-snapshot.png)

### Load these documents using S3Reader

Fortunately, there is a simple S3 utility available via [LlamaHub](https://www.llamahub.ai/), a registry of open-source data connectors that you can easily plug into any LlamaIndex application.

In [6]:
S3Reader = download_loader("S3Reader")
loader = S3Reader(bucket='vector-rag-bootcamp-v2')
documents = loader.load_data()

## Define Embeddings Model

In [7]:
embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    input_type="search_query"
)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm
)

[nltk_data] Downloading package punkt to
[nltk_data]     /tmp/JMzMIxqc706ehpGd/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Embeddings Store, Retrieval and Reranking

In [8]:
# Set up the base vector store retriever
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

# Retrieve the most relevant context from the vector store based on the query
search_query_retriever = index.as_retriever(service_context=service_context)
search_query_retrieved_nodes = search_query_retriever.retrieve(query)

# Use a reranker to identify the closest match
reranker = CohereRerank()
query_engine = index.as_query_engine(
    node_postprocessors = [reranker]
)

Parsing nodes:   0%|          | 0/26 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/48 [00:00<?, ?it/s]
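
To make the two stages in the cell above concrete, here is a toy sketch of retrieve-then-rerank in plain Python. It is not how LlamaIndex or Cohere implement these steps: the three-dimensional vectors are made up for illustration, and `overlap_score` is a hypothetical stand-in for a real reranking model such as the one behind `CohereRerank`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, top_k=2):
    """Stage 1: return the top_k texts whose embeddings are closest to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def overlap_score(query, text):
    """Hypothetical reranker: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query, texts, top_n=1):
    """Stage 2: reorder the retrieved candidates with the (toy) reranker."""
    return sorted(texts, key=lambda t: overlap_score(query, t), reverse=True)[:top_n]


# Made-up (text, embedding) pairs standing in for a vector store
store = [
    ("OpenNF coordinates state transfer between network functions", [0.9, 0.1, 0.0]),
    ("OpenNF project goals and milestones", [0.8, 0.3, 0.1]),
    ("A recipe for sourdough bread", [0.0, 0.1, 0.9]),
]
query_text = "goals of the OpenNF project"
query_vec = [0.85, 0.2, 0.05]  # made-up embedding of the query

candidates = retrieve(query_vec, store, top_k=2)
best = rerank(query_text, candidates, top_n=1)
print(best)
```

The vector search narrows the store down to a handful of candidates cheaply, and the reranker then scores only those candidates against the query text, which is why the two stages are usually combined.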

## Lastly, send the augmented request to Cohere

In [9]:
result = query_engine.query(query)
print(result)

The goals of the OpenNF project are to provide critical mechanisms to the SDN+NFV landscape and deliver a high-calibre solution to problems that are rapidly becoming evident. The project is aimed at developing and commercializing the OpenNF technology through various milestones and patents filed through WARF.
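
To see which passages the answer was grounded in, you can inspect the nodes attached to the response. The sketch below assumes that the response object exposes a `source_nodes` list whose items carry a `score` attribute and a `node` with a `get_content()` method, as LlamaIndex responses do at the time of writing; the helper name `summarize_sources` is hypothetical.

```python
def summarize_sources(source_nodes, max_chars=200):
    """Return one summary line per retrieved node: rank, score, and a text preview."""
    lines = []
    for rank, item in enumerate(source_nodes, start=1):
        score = item.score
        # Collapse newlines so each preview fits on a single line
        text = item.node.get_content().strip().replace("\n", " ")
        score_str = "n/a" if score is None else f"{score:.3f}"
        lines.append(f"[{rank}] score={score_str} {text[:max_chars]}")
    return lines


# Usage with the query result above (assumes `result.source_nodes` exists):
# print("\n".join(summarize_sources(result.source_nodes)))
```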
