# Code-along 2023-12-12 Retrieval Augmented Generation

## Conceptual Overview of RAG
Also available at [https://whimsical.com/rag-diagram-Ed1YXhRAVd19rLiuk15FWG](https://whimsical.com/rag-diagram-Ed1YXhRAVd19rLiuk15FWG)
![Screenshot 2023-12-04 at 3.43.11 PM](images/Screenshot%202023-12-04%20at%203.43.11%20PM.png)

In [1]:
!pip install -U newsapi-python llama-index huggingface_hub[inference]

Defaulting to user installation because normal site-packages is not writeable
Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Collecting llama-index
  Downloading llama_index-0.9.14.post2-py3-none-any.whl.metadata (8.2 kB)
Collecting huggingface_hub[inference]
  Downloading huggingface_hub-0.19.4-py3-none-any.whl.metadata (14 kB)
Collecting SQLAlchemy>=1.4.49 (from SQLAlchemy[asyncio]>=1.4.49->llama-index)
  Downloading SQLAlchemy-2.0.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.6 (from llama-index)
  Downloading aiohttp-3.9.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting beautifulsoup4<5.0.0,>=4.12.2 (from llama-index)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)

## Get Data To Test With

We will use the [News API](https://newsapi.org/). A free developer account lets you retrieve small amounts of data, which is all this project needs.

To start, we will build a RAG system based on November 2023 stories about the open source Llama2 model. You can change the parameters below and re-run this code to build a knowledge base on other topics or time periods.

In [2]:
import json
from newsapi import NewsApiClient
import os

news_api_key = os.getenv('NEWS_API_KEY')
newsapi = NewsApiClient(api_key=news_api_key)
get_articles_now = False
articles_path = './saved_articles.json'

if get_articles_now:
    try:
        all_articles = newsapi.get_everything(q='Llama2',
                                              from_param='2023-11-01',
                                              to='2023-11-30',
                                              language='en',
                                              sort_by='publishedAt',
                                              page_size=100)
        with open(articles_path, 'w') as f:
            json.dump(all_articles, f)
    except Exception as e:
        print("An error occurred:", str(e))
else:
    with open(articles_path, 'r') as f:
        all_articles = json.load(f)

print(f'Retrieved {len(all_articles["articles"])} articles')

Retrieved 100 articles
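
One thing to watch in the cell above: if `get_articles_now` is `True` and the API call fails, the `except` branch only prints the error, so `all_articles` is never defined and the final `print` raises a `NameError`. A minimal sketch of a load-or-fetch pattern that avoids this is shown below. The names `load_or_fetch` and `fake_fetch` are hypothetical, and `fake_fetch` is a stand-in for the real News API call.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch, refresh=False):
    """Return cached JSON from `path`, calling `fetch()` only when needed."""
    if not refresh and os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    # Any exception from fetch() propagates instead of being swallowed
    data = fetch()
    with open(path, "w") as f:
        json.dump(data, f)
    return data


# Hypothetical stand-in for the News API call; the payload only mimics
# the shape of the response used in this notebook.
def fake_fetch():
    return {"status": "ok", "articles": [{"title": "example"}]}


demo_path = os.path.join(tempfile.mkdtemp(), "saved_articles.json")
first = load_or_fetch(demo_path, fake_fetch)   # cache miss: calls fake_fetch
second = load_or_fetch(demo_path, fake_fetch)  # cache hit: reads the file
print(f'Retrieved {len(second["articles"])} articles')
```

In the notebook, `fetch` could be a small function that wraps the `newsapi.get_everything(...)` call with the same arguments as above.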


Inspect a single item in `all_articles['articles']` to see the format.

In [3]:
all_articles['articles'][1]

{'source': {'id': None, 'name': 'Biztoc.com'},
 'author': 'medium.datadriveninvestor.com',
 'title': 'How to choose between ChatGPT and Open Source LLMs in Finance',
 'description': 'How to choose between ChatGPT and Open Source LLMs in Finance Many consulting companies present LLMs and GenAI products to CEOs, CFOs, COOs, and CTOs. While these products may seem appealing, companies should remember the distinction between using ChatGPT wit…',
 'url': 'https://biztoc.com/x/71d1b51b76f35b0e',
 'urlToImage': 'https://c.biztoc.com/p/71d1b51b76f35b0e/s.webp',
 'publishedAt': '2023-12-03T10:52:23Z',
 'content': 'How to choose between ChatGPT and Open Source LLMs in FinanceMany consulting companies present LLMs and GenAI products to CEOs, CFOs, COOs, and CTOs. While these products may seem appealing, companie… [+227 chars]'}

It's not ideal that the articles are truncated. That's a limitation of the free account, but we will still be able to work with this.
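
To see how much text is missing from each article, you can parse the trailing marker. The sketch below assumes the marker always looks like the one in the record above (`… [+227 chars]`); the pattern is inferred from that single sample, not from the News API documentation, and `split_truncation` is a hypothetical helper name.

```python
import re

# Matches a trailing marker such as "… [+227 chars]" (format assumed from the sample)
TRUNCATION_RE = re.compile(r"…?\s*\[\+(\d+) chars\]\s*$")


def split_truncation(content):
    """Return (visible_text, hidden_char_count) for one article's content."""
    match = TRUNCATION_RE.search(content)
    if match is None:
        return content, 0
    return content[:match.start()].rstrip(), int(match.group(1))


sample = "While these products may seem appealing, companie… [+227 chars]"
text, hidden = split_truncation(sample)
print(text)
print(hidden)
```

Stripping the marker before indexing is optional; it mainly keeps the `[+N chars]` suffix out of the text that gets embedded.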

Now look at a sample of article titles for an overview of our content.

In [4]:
titles = [i['title'] for i in all_articles['articles']]
titles[:20]

['Researchers scanned public repos and found 1,681 exposed Hugging Face API tokens belonging to Meta, Microsoft, Google, and others, many with write permissions',
 'How to choose between ChatGPT and Open Source LLMs in Finance',
 "ChatGPT was treated like the 'second coming of the messiah' and its impact was a big surprise, says Meta's AI chief",
 'Show HN: AI That Studies for You',
 "ChatGPT was treated like the 'second coming of the messiah' and its impact was a big surprise, says Meta's AI chief",
 'Ambarella : Q3 FY2024 Earnings Call Transcript',
 'Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements',
 'Alibaba releases 72B LLM with 32k context length',
 'Revolutionizing Business Solutions with SAP BTP: A New Era of LLM Agnosticism',
 'Use Ollama LLM Models Locally with Laravel',
 'Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services',
 'AWS unveils new tools and services for ‘supernova’ of generative

Make a list of the `content` from each article. We will use it to populate the vector DB.

In [5]:
articles_text = [i['content'] for i in all_articles['articles']]
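
The title sample above contains the same headline twice, so the knowledge base is likely to hold duplicate text as well. The sketch below is one possible cleaning step, not part of the original notebook: it drops repeated `content` strings and, as an assumption, also skips records whose `content` is missing or empty. `clean_texts` and `sample_articles` are hypothetical names.

```python
def clean_texts(articles):
    """Collect non-empty, de-duplicated `content` strings, keeping order."""
    seen = set()
    texts = []
    for article in articles:
        content = article.get("content")
        # Skip missing/empty content and anything already collected
        if not content or content in seen:
            continue
        seen.add(content)
        texts.append(content)
    return texts


# Toy records that mimic the shape of all_articles['articles']
sample_articles = [
    {"title": "A", "content": "first story"},
    {"title": "B", "content": None},
    {"title": "A (syndicated)", "content": "first story"},
    {"title": "C", "content": "second story"},
]
print(clean_texts(sample_articles))
```

In the notebook this would be called as `clean_texts(all_articles['articles'])` in place of the list comprehension.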

## Build Basic RAG system

In [6]:
from llama_index import VectorStoreIndex, ServiceContext, Document
from openai import OpenAI

# Create an OpenAI client. It is not used directly below; llama_index
# reads OPENAI_API_KEY from the environment for its default models.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# Wrap each article's text in a Document so it can be indexed
documents = [Document(text=t) for t in articles_text]
# Create the vector store index that we use to find relevant documents
index = VectorStoreIndex.from_documents(documents)
# Build the query engine: it retrieves the 2 most similar chunks
# and passes them to the LLM to answer the question
query_engine = index.as_query_engine(similarity_top_k=2)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
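
Conceptually, `similarity_top_k=2` asks the index for the two stored chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea only; it is not how llama_index is implemented. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors in place of real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": documents 0 and 1 point in nearly the same direction
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```

The retrieved chunks are then placed in the prompt, which is the "augmented" part of retrieval augmented generation.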


### Test Our Query Engine

Test a query with `query_engine.query`

In [7]:
query_engine.query("What AWS service can be used to deploy Llama2 models?")

Response(response='Amazon Bedrock', source_nodes=[NodeWithScore(node=TextNode(id_='a14a86b3-6120-454a-97e1-ad82daf9ac86', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='626e19fd-6e9a-433b-aa4e-eff7d252fe96', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='49e4eba93e87faf4d534de4afc098521ee66886bfad74addecc907441b82e8a9'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='5b082e66-75c9-443e-bb06-8f93651b6ebc', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='22842fb263360082eb2f867ff62cc4cd19fadbfc30797f7c71f78e06ee2abd5c'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='5811e5a8-5ace-46cc-9b31-3759d210d711', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='b2396cf0076d8401589f67edc7fa8eb370ec4bc1941dff832c0af8f680f26f84')}, hash='49e4eba93e87faf4d534de4afc098521ee66886bfad74addecc907441b82e8a9', text='Today, were announcing the availabi

Create a convenience function that returns only the most relevant information: the response and the text that informed it.

In [8]:
def search(query):
    response = query_engine.query(query)
    output = {'response': response.response,
              'retrieved_nodes': [p.text for p in response.source_nodes]
    }
    return output

Test our new function

In [9]:
search("What AWS service can be used to deploy Llama2 models?")

{'response': 'Amazon Bedrock',
 'retrieved_nodes': ['Today, were announcing the availability of Metas Llama 2 Chat 13B large language model (LLM) on Amazon Bedrock. With this launch, Amazon Bedrock becomes the first public cloud service to offer a full… [+8111 chars]',
  'Run Large-Language Models (LLMs) directly in your browser ! \r\nLearn More: API Reference\r\nDeveloped By: RDS \r\nThis web demo enables you to run LLM models from Hugging Face (GGUF/GGML/tiny-llama2/starc… [+34 chars]']}
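
Notice that the second retrieved node is only loosely related to the question. Including each node's similarity score makes this easier to spot. The sketch below is a hypothetical variant of `search` that takes a response object directly. It relies on the `score` field that `NodeWithScore` exposes, and it is demonstrated on stand-in objects with made-up scores rather than on a live query.

```python
from types import SimpleNamespace


def summarize(response, max_chars=80):
    """Return the answer plus a score and short preview for each source node."""
    return {
        "response": response.response,
        "retrieved_nodes": [
            {"score": n.score, "preview": n.text[:max_chars]}
            for n in response.source_nodes
        ],
    }


# Stand-in for a real Response; the scores here are invented for illustration
fake_response = SimpleNamespace(
    response="Amazon Bedrock",
    source_nodes=[
        SimpleNamespace(score=0.87, text="Today, were announcing the availability of Metas Llama 2 Chat 13B"),
        SimpleNamespace(score=0.81, text="Run Large-Language Models (LLMs) directly in your browser !"),
    ],
)
summary = summarize(fake_response, max_chars=30)
print(summary["retrieved_nodes"][0]["preview"])
```

With a real query engine you would call `summarize(query_engine.query(...))`.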

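### Use Llama2 instead of OpenAI models

Note: the model named below, `HuggingFaceH4/zephyr-7b-beta`, is a fine-tune of Mistral 7B, not Llama2. To use Llama2 itself, change `model_name` to a Llama2 model that your Hugging Face account can access.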

In [10]:
from llama_index.llms import HuggingFaceInferenceAPI
from llama_index import ServiceContext, VectorStoreIndex

model_name = 'HuggingFaceH4/zephyr-7b-beta'

# Create an LLM object for a model served on the Hugging Face Inference API
hf_model = HuggingFaceInferenceAPI(model_name=model_name)

# Or, if you have an HF token, use
# HF_TOKEN = os.getenv('HF_TOKEN')
# hf_model = HuggingFaceInferenceAPI(
#     model_name=model_name, token=HF_TOKEN
# )

In [11]:
# Create a service_context object with llm=hf_model.
# The service_context gives a lot of the general configuration parameters
service_context = ServiceContext.from_defaults(llm=hf_model)

# Create VectorStoreIndex using same command we used before, but pass in the service_context=service_context.
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Create the query engine
query_engine = index.as_query_engine(similarity_top_k=2)

Test it

In [12]:
query_engine.query("What AWS service can be used to deploy Llama2 models?")

Response(response='\n\nThe AWS service that can be used to deploy Llama2 models is Amazon Bedrock. This was announced in a recent announcement by Meta, making Amazon Bedrock the first public cloud service to offer a full-featured deployment option for Llama2 models.', source_nodes=[NodeWithScore(node=TextNode(id_='190f06ab-70bb-4fba-9551-ff47d0b02a71', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='626e19fd-6e9a-433b-aa4e-eff7d252fe96', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='49e4eba93e87faf4d534de4afc098521ee66886bfad74addecc907441b82e8a9'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='e107961d-9c3f-40af-b73d-be325c36af63', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='22842fb263360082eb2f867ff62cc4cd19fadbfc30797f7c71f78e06ee2abd5c'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='8d9a620a-202f-4c11-bf1d-98ad0dc12086', nod