## Introduction to 🗂️ LlamaIndex 🦙

#### What does LlamaIndex do?

ChatGPT is trained on huge amounts of data. But what if you want it to work with your private data? There are three ways to achieve this.


1.   Train an open-source LLM such as Llama on your data. This is a complex, time-consuming process that does not scale well.
2.   Pass all of your documents to the LLM as part of the prompt. This is limited by the size of the context window.
3.   Fetch only the relevant documents and pass them as input to your LLM.

LlamaIndex uses the third method, and we will see how it works with the help of an example. Some of the concepts introduced here, such as the index, will be covered in more detail in upcoming lessons.
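
To make the third method concrete, here is a minimal sketch of "fetch only the relevant documents". It is a toy, not how LlamaIndex does it: it assumes simple word-count vectors and cosine similarity in place of learned embeddings, and the names `tokenize`, `cosine_similarity`, and `retrieve` are hypothetical helpers made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query (toy stand-in for embeddings)."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    # Sort by score only, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Made-up example documents, not taken from the dataset used in this lesson.
docs = [
    "NATO is a military alliance of North American and European countries.",
    "The recipe needs two cups of flour and one egg.",
    "The bus to the airport leaves every thirty minutes.",
]
top = retrieve("What is NATO?", docs, top_k=1)
print(top[0])
```

Only the retrieved text would then be passed to the LLM, which keeps the prompt well inside the context window.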

#### Install LlamaIndex and dependencies

##### Installation from Pip

In [1]:
!pip install llama_index

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##### Installation from Source

Clone the repository with `git clone https://github.com/jerryjliu/llama_index.git`, then run one of the following from the repository root:

`pip install -e .` if you want to do an editable install (you can modify source files) of just the package itself.

`pip install -r requirements.txt` if you want to install optional dependencies + dependencies used for development (e.g. unit testing).

#### Train a chatbot using LlamaIndex

##### Download the data

We use the State of the Union text document as the private data for our chatbot.

In [26]:
import os
Root = "/content/drive/MyDrive/LLAMA/Data"
os.chdir(Root)

In [20]:
!wget https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt

--2023-06-15 11:47:05--  https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39027 (38K) [text/plain]
Saving to: ‘state_of_the_union.txt.2’


2023-06-15 11:47:05 (8.08 MB/s) - ‘state_of_the_union.txt.2’ saved [39027/39027]



##### Train the chatbot using LlamaIndex
Now comes the main part. We will use LlamaIndex to build an index over our private data so that the LLM can answer questions about it. Some of the concepts mentioned here, such as `VectorStoreIndex`, will be explained in more detail in upcoming lessons.

We use the `SimpleDirectoryReader` from LlamaIndex to read the file downloaded above. This reader reads every file in a directory and converts the contents into documents that can be indexed.
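
As a rough picture of what "read every file in a directory and convert it into documents" means, here is a minimal sketch using only the standard library. It assumes plain UTF-8 text files and represents each document as a plain dictionary; `load_directory` is a hypothetical helper written for this illustration, and the real `SimpleDirectoryReader` handles many more file types and produces its own document objects.

```python
import tempfile
from pathlib import Path


def load_directory(directory):
    """Read every regular file directly inside `directory` into a simple document dict."""
    documents = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            documents.append(
                {"file_name": path.name, "text": path.read_text(encoding="utf-8")}
            )
    return documents


# Demonstrate on a throwaway directory with two small, made-up text files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    loaded = load_directory(tmp)

print([doc["file_name"] for doc in loaded])
```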

Put your OpenAI API key in place of `"YOUR KEY"`.

In [35]:
import os
# Replace the placeholder with your own key; never share or commit a real key.
os.environ['OPENAI_API_KEY'] = 'YOUR KEY'


In [36]:
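from llama_index import VectorStoreIndex, SimpleDirectoryReader
import openai

# Reuse the key set in the previous cell.
openai.api_key = os.getenv('OPENAI_API_KEY')
# Read every file in the data directory into LlamaIndex documents.
documents = SimpleDirectoryReader('/content/drive/MyDrive/LLAMA/Data').load_data()
# Build a vector index over the loaded documents.
index = VectorStoreIndex.from_documents(documents)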

Now we create a LlamaIndex interface called a query engine to query our documents. Again, this will be explained in more detail in upcoming lessons.

With this, you can now query your data in natural language.

In [37]:
query_engine = index.as_query_engine()
response = query_engine.query("What is NATO?")
print(response)


NATO (the North Atlantic Treaty Organization) is an intergovernmental military alliance between 29 North American and European countries. It was created to secure peace and stability in Europe after World War 2. The United States is a member of NATO, along with 28 other nations.
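
Conceptually, answering a query like this means placing the retrieved text in a prompt next to the question. The following is a minimal sketch of that assembly step under stated assumptions: `build_prompt` is a hypothetical helper, the prompt wording is made up for illustration and is not LlamaIndex's actual template, and the context budget is counted in characters rather than tokens.

```python
def build_prompt(question, context_chunks, max_context_chars=2000):
    """Join retrieved chunks into a prompt, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for chunk in context_chunks:
        if used + len(chunk) > max_context_chars:
            break  # The next chunk would overflow the (assumed) context budget.
        selected.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(selected)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# The second chunk is deliberately too large to fit in the budget.
prompt = build_prompt(
    "What is NATO?",
    ["NATO is a military alliance.", "x" * 5000],
)
print(prompt)
```

The resulting string is what would be sent to the LLM; the oversized chunk is left out, which is the point of retrieving only what is relevant.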
