Note: Responses from local models can be quite slow, especially with 8-bit quantization.

With 4-bit quantization, `HuggingFaceH4/zephyr-7b-alpha` used about 8GB of VRAM. System RAM spiked to 14GB while the model was loading, then settled at around 5GB. I used a T4 instance for this notebook.
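
As a rough sanity check on these numbers, the sketch below estimates the memory needed for the model weights alone at different precisions. It assumes a round figure of 7 billion parameters (an approximation, not a measured value), and the helper name `weight_memory_gb` is made up for illustration. Measured usage will be higher than the weights-only figure, since it presumably also covers activations, the generation cache, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only estimate in GB (1 GB = 1e9 bytes); ignores all runtime overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7e9 parameters for a "7B" model.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(7e9, bits):.1f} GB")
```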

In [1]:
!pip install llama-index transformers accelerate bitsandbytes

Collecting llama-index
  Downloading llama_index-0.8.48-py3-none-any.whl (761 kB)
Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting langchain>=0.0.303 (from llama

## Setup

### Data

In [2]:
from llama_index.readers import BeautifulSoupWebReader

url = "https://www.theverge.com/2023/9/29/23895675/ai-bot-social-network-openai-meta-chatbots"

documents = BeautifulSoupWebReader().load_data([url])

### LLM

This should run on a T4 instance on the free tier.

In [3]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)
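
To see what the prompt formatting above produces, here is a small self-contained sketch that runs the same logic without loading the model. The `Message` dataclass is a hypothetical stand-in for a chat message object with `role` and `content` attributes, and the three role branches are folded into one check. For a single user message, the result matches the `query_wrapper_prompt` template with `{query_str}` filled in.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a chat message with `role` and `content`."""

    role: str
    content: str


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        # Each known role becomes a "<|role|>" header followed by the content and "</s>".
        if message.role in ("system", "user", "assistant"):
            prompt += f"<|{message.role}|>\n{message.content}</s>\n"

    # Ensure the prompt starts with a system block, inserting a blank one if needed.
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # End with an open assistant header so the model writes the reply.
    return prompt + "<|assistant|>\n"


prompt = messages_to_prompt([Message("user", "Hello!")])
print(prompt)
```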


In [4]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5")

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Index Setup

In [5]:
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [6]:
from llama_index import SummaryIndex

summary_index = SummaryIndex.from_documents(documents, service_context=service_context)

### Helpful Imports / Logging

In [7]:
from llama_index.response.notebook_utils import display_response

In [8]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [9]:
query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("How do OpenAI and Meta differ on AI tools?")

display_response(response)



**`Final Response:`** OpenAI and Meta both have AI tools, but they differ in their focus and intended use. OpenAI presents its products as productivity tools, while Meta is building LLMs for entertainment purposes. OpenAI's latest updates for ChatGPT, which include a voice feature and the ability to upload images, have made the tool more useful for a variety of tasks. On the other hand, Meta has unveiled 28 personality-driven chatbots, featuring celebrities such as Charli D’Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton, for use in Meta's messaging apps. While both companies are using generative AI and voices, their intended use and focus differ.

### Refine

In [10]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("How do OpenAI and Meta differ on AI tools?")

display_response(response)

**`Final Response:`** OpenAI and Meta differ in their approach to AI tools. While OpenAI presents its products as productivity tools, Meta is focused on entertainment and is building LLMs for its messaging apps. OpenAI's latest updates for ChatGPT include a voice feature and the ability to upload images, while Meta has revealed 28 personality-driven chatbots for its messaging apps. Both companies are using generative AI and voices, but OpenAI's focus is on productivity, while Meta's is on entertainment. However, Meta's recent efforts to integrate AI characters into its social networking platforms suggest a shift towards a partially synthetic social network, which may have implications for the future of social media.

### Tree Summarize

In [11]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("How do OpenAI and Meta differ on AI tools?")

display_response(response)

**`Final Response:`** OpenAI and Meta both have AI tools, but they differ in their focus and intended use. OpenAI presents its products as productivity tools, while Meta is building LLMs for entertainment purposes. OpenAI's ChatGPT has added a voice feature, which gives it a hint of personality and makes it more useful as a mobile app. On the other hand, Meta has unveiled 28 personality-driven chatbots for its messaging apps, which are based on celebrity voices. While both companies are using generative AI and voices, their applications and target audiences differ.
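
The three response modes used above differ mainly in how retrieved text chunks are fed to the LLM. The sketch below is a simplified illustration of that idea, not LlamaIndex's actual implementation: `fake_llm`, `count_calls`, the character-based `max_chars` budget, and the `fan_in` parameter are all assumptions made for the example. It counts how many LLM calls each strategy makes for the same four chunks.

```python
def refine(chunks, query, llm):
    """One LLM call per chunk; each call sees the answer produced so far."""
    answer = ""
    for chunk in chunks:
        answer = llm(query, chunk, answer)
    return answer


def compact(chunks, query, llm, max_chars):
    """Pack chunks into as few prompts as fit the budget, then refine over the packs."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return refine(packed, query, llm)


def tree_summarize(chunks, query, llm, fan_in=2):
    """Summarize groups of `fan_in` texts, level by level, until one answer remains."""
    if not chunks or fan_in < 2:
        raise ValueError("need at least one chunk and fan_in >= 2")
    level = list(chunks)
    while True:
        level = [
            llm(query, "\n".join(level[i : i + fan_in]), "")
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


calls = []


def fake_llm(query, context, existing_answer):
    """Stub that records each call instead of generating text."""
    calls.append(context)
    return f"answer#{len(calls)}"


chunks = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]


def count_calls(strategy, **kwargs):
    calls.clear()
    strategy(chunks, "q", fake_llm, **kwargs)
    return len(calls)


print("refine:", count_calls(refine))
print("compact:", count_calls(compact, max_chars=25))
print("tree_summarize:", count_calls(tree_summarize))
```

With a slow local model, the number of LLM calls is a reasonable first proxy for latency, which is one reason a packing strategy such as `compact` is a sensible default.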

## Router Query Engine

In [12]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

### Single Selector

In [13]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=False
)

response = query_engine.query("What was mentioned about Meta?")

display_response(response)

**`Final Response:`** Meta is building LLMs and personality-driven chatbots to be used in their messaging apps. They have unveiled 28 personality-driven chatbots, including characters voiced by celebrities such as Charli D’Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton. The article suggests that this technology is new enough that celebrities are not yet entrusting their entire personas to Meta for safekeeping, but they are giving people a taste of what it's like to talk to AI versions of themselves. The potential for fans to spend hours talking to digital versions of celebrities is also mentioned.

### Multi Selector

In [14]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=True,
)

response = query_engine.query("What was mentioned about Meta? Summarize with any other companies mentioned in the entire document.")

display_response(response)

**`Final Response:`** In the given context, Meta is building LLMs and personality-driven chatbots to be used in their messaging apps. The company unveiled 28 personality-driven chatbots to be used in Meta’s messaging apps, and celebrities including Charli D’Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton lent their voices to their effort. The article also mentions OpenAI, which is building LLMs and has added a voice feature to ChatGPT, allowing users to interact with it via voice. Other companies mentioned in the document include Google, Alexa, and the Google assistant.

## SubQuestion Query Engine

In [15]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

In [16]:
import nest_asyncio
nest_asyncio.apply()

In [17]:
from llama_index.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    verbose=True,
)

response = query_engine.query("What was mentioned about Meta? How Does it differ from how OpenAI is talked about?")

display_response(response)

Generated 5 sub questions.
[vector_search] Q: What was mentioned about Meta in a recent news article?
[vector_search] A: In a recent news article, it was mentioned that Meta is building LLMs (large language models) and has found its own uses for generative AI and voices. The company unveiled 28 personality-driven chatbots to be used in Meta's messaging apps, featuring celebrities including Charli D’Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton. The article suggests that this technology is new enough that celebrities aren't yet entrusting their entire personas to Meta for safekeeping, but the potential for fans to interact with digital versions of celebrities is significant. The article also notes that feeds that were once defined by the connections they enabled between human beings may have become partially synthetic social networks.
[summary] Q: Summarize the entire news a

**`Final Response:`** In recent news articles, it was mentioned that Meta is building LLMs (large language models) and has found its own uses for generative AI and voices. The company unveiled 28 personality-driven chatbots to be used in Meta's messaging apps, featuring celebrities including Charli D’Amelio, Dwyane Wade, Kendall Jenner, MrBeast, Snoop Dogg, Tom Brady, and Paris Hilton. The focus of Meta's efforts is on entertainment, while OpenAI's latest updates for ChatGPT, including the addition of voice and image capabilities, have made the tool more useful for tasks like coaching, tutoring, or therapy. While both companies are building synthetic companions, OpenAI's focus is on productivity, while Meta's is on entertainment.

## SQL Query Engine

Here, we download and use a sample SQLite database with 11 tables containing information about music, playlists, and customers. For this test, we restrict the query engine to a few of these tables.

In [18]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [19]:
!curl https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip -o /content/chinook.zip
!unzip /content/chinook.zip

Archive:  /content/chinook.zip
  inflating: chinook.db              


In [20]:
from sqlalchemy import create_engine

engine = create_engine("sqlite:////content/chinook.db")

In [21]:
from llama_index import SQLDatabase

sql_database = SQLDatabase(engine)

In [22]:
from llama_index.indices.struct_store import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["albums", "tracks", "artists"],
    service_context=service_context
)

In [23]:
response = query_engine.query("What are some albums? Limit to 5.")

display_response(response)

**`Final Response:`** Based on the query results, some albums are:
1. Koyaanisqatsi (Soundtrack from the Motion Picture) by Philip Glass Ensemble (AlbumId: 347)
2. Mozart: Chamber Music by Nash Ensemble (AlbumId: 346)
3. Monteverdi: L'Orfeo by C. Monteverdi, Nigel Rogers - Chiaroscuro; London Baroque; London Cornett & Sackbu (AlbumId: 345)
4. Schubert: The Late String Quartets & String Quintet (3 CD's) by Emerson String Quartet (AlbumId: 344)
5. Respighi:Pines of Rome by Eugene Ormandy (AlbumId: 343)

In [24]:
response = query_engine.query("What are some artists? Limit it to 5.")

display_response(response)

**`Final Response:`** Some artists, limited to 5, are: A Cor Do Som, AC/DC, Aaron Copland & London Symphony Orchestra, Aaron Goldberg, and Academy of St. Martin in the Fields & Sir Neville Marriner.

This last query requires a more complex join across the `tracks`, `albums`, and `artists` tables.

In [25]:
response = query_engine.query("What are some tracks from the artist AC/DC? Limit it to 3")

display_response(response)

**`Final Response:`** Some tracks from the artist AC/DC that we'll be discussing today are "Bad Boy Boogie," "Breaking The Rules," and "C.O.D." These songs are from different albums, but they all showcase the iconic sound of AC/DC.

In [26]:
print(response.metadata['sql_query'])

SELECT tracks.Name
FROM tracks
INNER JOIN albums ON tracks.AlbumId = albums.AlbumId
INNER JOIN artists ON albums.ArtistId = artists.ArtistId
WHERE artists.Name = 'AC/DC'
GROUP BY tracks.Name
ORDER BY tracks.Name ASC
LIMIT 3;
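
To check that the generated SQL behaves as intended, it can be run directly against a small database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database whose tables mimic only the columns the query touches. The schema is a simplified assumption and the rows are illustrative placeholders (three track names are taken from the response above), not the real Chinook data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in schema: only the columns used by the generated query.
conn.executescript(
    """
    CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT, AlbumId INTEGER);

    INSERT INTO artists VALUES (1, 'AC/DC'), (2, 'Aaron Goldberg');
    INSERT INTO albums VALUES (1, 'Album A', 1), (2, 'Album B', 1), (3, 'Album C', 2);
    INSERT INTO tracks VALUES
        (1, 'C.O.D.', 1),
        (2, 'Bad Boy Boogie', 2),
        (3, 'Breaking The Rules', 1),
        (4, 'Placeholder Track', 2),
        (5, 'Other Artist Track', 3);
    """
)

# The query exactly as printed from response.metadata['sql_query'].
sql_query = """
SELECT tracks.Name
FROM tracks
INNER JOIN albums ON tracks.AlbumId = albums.AlbumId
INNER JOIN artists ON albums.ArtistId = artists.ArtistId
WHERE artists.Name = 'AC/DC'
GROUP BY tracks.Name
ORDER BY tracks.Name ASC
LIMIT 3;
"""

rows = [row[0] for row in conn.execute(sql_query)]
print(rows)
```

Against the real database, the same check could be run by connecting to `/content/chinook.db` instead of `:memory:`.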


## Programs

Depending on the LLM, you will have to test with either `OpenAIPydanticProgram` or `LLMTextCompletionProgram`.

In [27]:
from typing import List
from pydantic import BaseModel

from llama_index.program import OpenAIPydanticProgram, LLMTextCompletionProgram

class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]

In [28]:
from llama_index.output_parsers import PydanticOutputParser

prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = LLMTextCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(Album),
    prompt_template_str=prompt_template_str,
    llm=llm,
    verbose=True,
)

In [29]:
output = program(movie_name="The Shining")

In [30]:
print(output)

name='The Shining Soundtrack' artist='Wendy Carlos' songs=[Song(title='Main Title', length_seconds=2), Song(title='The Shining', length_seconds=10), Song(title='The Maze', length_seconds=12), Song(title='The Redrum', length_seconds=10), Song(title='The Maze (Reprise)', length_seconds=6), Song(title='The Shining (End Title)', length_seconds=10)]


## Data Agent

Similar to programs, OpenAI LLMs will use `OpenAIAgent`, while other LLMs will use `ReActAgent`.

In [31]:
from llama_index.agent import OpenAIAgent, ReActAgent

agent = ReActAgent.from_tools(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True
)

Some tool inputs are hallucinated (for example, the `num_beams` argument in the traces below), which causes issues with the responses. A better system prompt or better tool descriptions would likely help.

In [32]:
response = agent.chat("Hello!")
print(response)

Thought: I am designed to help with a variety of tasks.
Action: vector_search
Action Input: {'text': 'Hello!', 'num_beams': 5}
Observation: This query is not related to the given context information. The query provided is a hypothetical example of a query that could be made to a language model, and does not have any relevance to the article discussed.
Response: The input provided is not related to the given context information.
The input provided is not related to the given context information.
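
The trace above shows the model inventing a `num_beams` argument for the tool call. Besides improving the prompt or tool descriptions, one possible mitigation is to filter the parsed action input before it reaches the tool. The sketch below is not part of LlamaIndex: `sanitize_action_input` and the `allowed_keys` set are hypothetical names chosen for illustration.

```python
import ast


def sanitize_action_input(raw: str, allowed_keys: set) -> dict:
    """Parse a dict literal from the agent trace and drop keys the tool does not expect."""
    parsed = ast.literal_eval(raw)  # safely evaluates Python literals only
    if not isinstance(parsed, dict):
        raise ValueError("Action Input must be a dict literal")
    return {key: value for key, value in parsed.items() if key in allowed_keys}


raw = "{'text': 'Hello!', 'num_beams': 5}"
# Assumption: the tool accepts a single 'text' argument.
clean = sanitize_action_input(raw, allowed_keys={"text"})
print(clean)
```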


In [33]:
response = agent.chat("What was mentioned about Meta? How Does it differ from how OpenAI is talked about?")
print(response)

Thought: I need to use a tool to help me answer the question.
Action: vector_search
Action Input: {'text': 'Meta and OpenAI', 'num_beams': 5}
Observation: In the given context, the query "Meta and OpenAI" refers to two companies that are building and utilizing generative AI and voices. OpenAI is presenting its products as productivity tools, while Meta is using generative AI and voices for entertainment purposes. Both companies are building LLMs and have revealed their own uses for generative AI and voices. Meta has unveiled 28 personality-driven chatbots to be used in its messaging apps, while OpenAI has added a voice to ChatGPT, which allows for more dynamic and engaging interactions. The potential for synthetic companions and companionship through AI is discussed, and the rise of a new era in the consumer internet is suggested.
Response: According to the given context, while both Meta and OpenAI are utilizing generative AI and voices, t