# <font color="#76b900"> Trying Out LlamaIndex </font>

**This notebook is there for those who REALLY need to know about LlamaIndex NOW!** The instructor may or may not want to make reference to this notebook to show off how llama-index is roughly-structured, but the details are out of scope for this course. This notebook uses the well-known [Dive into Deep Learning](https://d2l.ai/) book and creates an index over the PDF. From this, it is able to use the index as a query engine to inject responses directly into the model's context, effectively executing on retrieval-augmented generation. This is a lightweight specification with few of the special tricks you might want to implement in practice, but it's a good starting point when you're specifically dealing with the Llama-2 model. 

In [1]:
!wget https://d2l.ai/d2l-en.pdf

--2024-03-06 06:33:53--  https://d2l.ai/d2l-en.pdf
Resolving d2l.ai (d2l.ai)... 18.245.113.100, 18.245.113.91, 18.245.113.51, ...
Connecting to d2l.ai (d2l.ai)|18.245.113.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44685994 (43M) [application/pdf]
Saving to: ‘d2l-en.pdf’


2024-03-06 06:33:54 (70.3 MB/s) - ‘d2l-en.pdf’ saved [44685994/44685994]



In [None]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
import torch
from transformers import pipeline
from typing import Optional, List, Mapping, Any

from llama_index import (
    ServiceContext, 
    SimpleDirectoryReader,
    SummaryIndex
)
from llama_index.callbacks import CallbackManager
from llama_index.llms import (
    CustomLLM, 
    CompletionResponse, 
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback

ModuleNotFoundError: No module named 'openai.openai_object'

In [None]:
llama_pipe = pipeline("text-generation", model="TheBloke/Llama-2-70B-chat-GPTQ", device_map="auto")

In [128]:
from llama_index.prompts import PromptTemplate

system_prompt = """[INST]<<SYS>>
You are a helpful, respectful and honest AI assistant. Always answer as helpfully as possible, while being safe. 
Please be brief and efficient unless asked to elaborate, and follow the conversation flow.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something incorrect. 
If you don't know the answer to a question, please don't share false information. 
If the user asks for a format to output, please follow it as closely as possible. 
<</SYS>>"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("{query_str}[/INST]")

import torch
from llama_index.llms import HuggingFaceLLM
llm = HuggingFaceLLM(
    context_window=4096, 
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    model=llama_pipe.model,
    tokenizer=llama_pipe.tokenizer,
    # tokenizer_name="TheBloke/Llama-2-70B-chat-GPTQ",
    # model_name="TheBloke/Llama-2-70B-chat-GPTQ",
    device_map="auto",
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    model_kwargs={"torch_dtype": torch.float16}
)

from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from llama_index import ServiceContext

embed_model = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en-v1.5")

service_context = ServiceContext.from_defaults(
    chunk_size  = 2048, 
    llm         = llm,
    embed_model = embed_model
)

In [None]:
from llama_index.llms import ChatMessage  # We'll use this later

service_context.llm.chat([ChatMessage(role="user", content="Hello World")])

In [130]:
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('./d2l-en.pdf'))

In [132]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [133]:
from auto_gptq import exllama_set_max_input_length
service_context.llm._model = exllama_set_max_input_length(service_context.llm._model, 4096)

In [None]:
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("Who is the author, what did they do growing up, and how did it help him write this book?")
print(response)

In [None]:
response = query_engine.query("Please explain what topics are covered in the book?")
print(response)

In [None]:
response = query_engine.query("Please explain how language models reason about inputs, and list some examples?")
print(response)

In [None]:
service_context.llm.chat([ChatMessage(role="user", content="Please explain how language models reason about inputs, and list some examples?")])

In [None]:
## Please Run When You're Done!
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>