* LlamaIndex is a data framework for applications built on Large Language Models (LLMs).
* LlamaIndex lets you ingest data from APIs, databases, PDFs, and more via flexible data connectors. 
* This data is indexed into intermediate representations optimized for LLMs. 
* LlamaIndex then allows natural language querying and conversation with your data via query engines, chat interfaces, and LLM-powered data agents. 
* It lets your LLMs access and interpret private data at scale without retraining the model on the new data.

### How LlamaIndex works
* LlamaIndex supports Retrieval Augmented Generation (RAG), which combines a large language model with a private knowledge base.
* A RAG pipeline generally consists of two stages: the indexing stage and the querying stage.

### Indexing stage
* During the indexing stage, LlamaIndex indexes private data into a vector index. This step creates a searchable knowledge base specific to your domain. You can input text documents, database records, knowledge graphs, and other data types.
* Indexing converts the data into numerical vectors (embeddings) that capture its semantic meaning, which enables quick similarity searches across the content.
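
The idea of turning text into vectors and searching by similarity can be sketched in plain Python. This is only a toy illustration: the bag-of-words `embed` function, the sample `chunks`, and the `most_similar` helper are assumptions made for the example, and LlamaIndex itself relies on learned embedding models rather than word counts.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a learned model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document chunks standing in for private data.
chunks = [
    "the invoice is due in thirty days",
    "our office is closed on public holidays",
    "refunds are processed within five business days",
]

# Indexing stage: embed every chunk once and keep the vectors.
vector_index = [(chunk, embed(chunk)) for chunk in chunks]


def most_similar(query: str) -> str:
    """Return the stored chunk whose vector is closest to the query vector."""
    query_vector = embed(query)
    best = max(vector_index, key=lambda item: cosine_similarity(query_vector, item[1]))
    return best[0]


print(most_similar("when are refunds processed"))
```

Because the vectors are computed once at indexing time, each query only needs to embed the question and compare it against the stored vectors.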

### Querying stage
* During the querying stage, the RAG pipeline searches for the information most relevant to the user's query. This information is then passed to the LLM, along with the query, to generate an accurate response.
* This process gives the LLM access to up-to-date information that may not have been part of its training data.
* The main challenge during this stage is retrieving, organizing, and reasoning over potentially multiple knowledge bases.
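
The querying stage can be sketched as two steps: retrieve the most relevant passages, then place them in the prompt next to the question. The sketch below is an assumption-laden toy: `retrieve` ranks by shared words instead of vector similarity, and `build_prompt` and its prompt wording are hypothetical, not the template LlamaIndex uses.

```python
def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by how many query words they share (toy stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Combine the retrieved context and the user's question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical knowledge base entries.
knowledge_base = [
    "The warranty covers parts for two years.",
    "Support is available on weekdays.",
    "The warranty does not cover water damage.",
]

question = "What does the warranty cover?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
```

Only the retrieved passages reach the LLM, which is why retrieval quality largely determines answer quality.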

In [None]:
#pip install llama-index

In [None]:
# By default, LlamaIndex uses the OpenAI GPT-3 text-davinci-003 model. To use this model, you must set the OPENAI_API_KEY environment variable.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "INSERT OPENAI KEY"

In [1]:
# OpenAI API
import os

import openai

# Create an API key: https://platform.openai.com/account/api-keys
# Store the API key in an environment variable: https://networkdirection.net/python/resources/env-variable/
# Use the same name as the OPENAI_API_KEY variable set above; names are case-sensitive on Linux and macOS.
openai.api_key = os.environ.get("OPENAI_API_KEY")
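
`os.environ.get` returns `None` without complaint when a variable is missing or misspelled, so a wrong name only surfaces later as an authentication error. A small guard can fail early instead. This is a minimal sketch: `require_env` and the `DEMO_API_KEY` name are hypothetical and exist only for this example.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Placeholder value for demonstration only; never hard-code a real key.
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

The same helper could be used for any key the notebook reads from the environment.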

#### Adding Personal Data to LLMs using LlamaIndex

In [None]:
# Use LlamaIndex to create a resume reader.

In [None]:
#pip install pypdf

In [None]:
# Loading data and creating the index

In [15]:
from llama_index import TreeIndex, SimpleDirectoryReader

resume = SimpleDirectoryReader("Private-Data").load_data()
new_index = TreeIndex.from_documents(resume)

In [16]:
resume

[Document(id_='6abde3e7-7404-455c-b3f1-66cae5441bfb', embedding=None, metadata={'page_label': '1', 'file_name': 'abid_resume.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c13913cd2ba19aa99d4b8f2adebdcd0cb9af070d57e27da5975051c203cb80ba', text="\xa0 \xa0\nContact\nwww.linkedin.com/in/1abidaliawan\n(LinkedIn)\nabid.work  (Portfolio)\nmuckrack.com/abidaliawan\n(Portfolio)\nabidaliawan.com  (Personal)\nTop Skills\nMLOps\nSearch Engine Optimization (SEO)\nData Science Editor \nLanguages\nSpanish  (Elementary)\nSwedish  (Elementary)\nUrdu  (Native or Bilingual)\nItalian  (Elementary)\nEnglish  (Full Professional)\nCertifications\nData Scientist Professional\nHonors-Awards\nDeepnote publishing competition\nMVP\n10 Rank in WiDS Datathon 2021\n2nd Position in Deepnote\nCompetition\n100th percentile in Python\nProgramming\n100th percentile on the Machine\nLearning Fundamentals\nPublications\nDREEM-ME: Distributed Regional\nEnergy Efficient Multi-h

In [17]:
new_index

<llama_index.indices.tree.base.TreeIndex at 0x1c3081a43d0>

In [None]:
# Running a query

In [None]:
# Response from the OpenAI GPT-3 text-davinci-003 model.

In [18]:
query_engine = new_index.as_query_engine()
response = query_engine.query("When did Abid graduate?")
print(response)

Abid graduated in February 2014.


In [19]:
response = query_engine.query("What is the name of the certification that Abid received?")
print(response)

Data Scientist Professional


* Creating an index is time-consuming. We can avoid re-creating it by saving the storage context. By default, the following command saves the index store in the ./storage directory.

In [None]:
# Saving and loading the context

In [21]:
new_index.storage_context.persist()

In [None]:
# Load the storage context and re-create the index from it.

In [22]:
from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

In [23]:
storage_context 

StorageContext(docstore=<llama_index.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x000001C3081A08D0>, index_store=<llama_index.storage.index_store.simple_index_store.SimpleIndexStore object at 0x000001C30818F450>, vector_store=<llama_index.vector_stores.simple.SimpleVectorStore object at 0x000001C30818CED0>, graph_store=<llama_index.graph_stores.simple.SimpleGraphStore object at 0x000001C37F289450>)

In [24]:
index

<llama_index.indices.tree.base.TreeIndex at 0x1c37fc086d0>

In [25]:
query_engine = index.as_query_engine()
response = query_engine.query("What is Abid's job title?")
print(response)

Abid's job title is Data Scientist Professional.
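
The save and load steps above are often combined into a "load if saved, otherwise build" pattern. The sketch below shows that pattern with stand-in callables so it runs without LlamaIndex; `load_or_build`, `demo_build`, and `demo_load` are hypothetical names. In the notebook, the build callable would wrap `TreeIndex.from_documents` followed by `persist()`, and the load callable would wrap `StorageContext.from_defaults` and `load_index_from_storage`.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(persist_dir: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Load a saved index if persist_dir has files in it; otherwise build a new one."""
    directory = Path(persist_dir)
    if directory.is_dir() and any(directory.iterdir()):
        return load(persist_dir)
    return build(persist_dir)


def demo_build(persist_dir: str) -> str:
    """Stand-in for building and persisting an index: writes one file."""
    Path(persist_dir).mkdir(parents=True, exist_ok=True)
    Path(persist_dir, "index.txt").write_text("toy index", encoding="utf-8")
    return "built"


def demo_load(persist_dir: str) -> str:
    """Stand-in for loading a persisted index."""
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    storage = str(Path(tmp) / "storage")
    first = load_or_build(storage, demo_load, demo_build)   # nothing saved yet
    second = load_or_build(storage, demo_load, demo_build)  # files now exist
    print(first, second)
```

The expensive build step then runs only on the first call for a given storage directory.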


## Chatbot

In [26]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat("What is the job title of Abid in 2021?")
print(response)

The job title of Abid in 2021 is Data Science Consultant.


In [27]:
response = chat_engine.chat("What else did he do during that time?")
print(response)

During that time, Abid also worked as a Data Science Consultant at Guidepoint, wrote for Towards Data Science and Towards AI, worked as a Data Science Copywriter for Machine Learning Mastery, served as an Ambassador for Deepnote, and worked as a Technical Writer for Start It Up.
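
The follow-up question works because a chat engine keeps the earlier turns, so "he" and "that time" can be resolved from context. A rough sketch of that bookkeeping follows. `ChatSession` is a hypothetical class written for this example and is not how LlamaIndex implements its chat engine.

```python
from dataclasses import dataclass, field


@dataclass
class ChatSession:
    """Keeps prior turns so each new prompt carries the conversation so far."""

    history: list[tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, user_message: str) -> str:
        """Prepend the earlier turns to the new user message."""
        lines = [f"{role}: {text}" for role, text in self.history]
        lines.append(f"user: {user_message}")
        return "\n".join(lines)

    def record(self, user_message: str, assistant_reply: str) -> None:
        """Store a completed exchange for use in later prompts."""
        self.history.append(("user", user_message))
        self.history.append(("assistant", assistant_reply))


session = ChatSession()
session.record(
    "What is the job title of Abid in 2021?",
    "The job title of Abid in 2021 is Data Science Consultant.",
)
follow_up = session.build_prompt("What else did he do during that time?")
print(follow_up)
```

A plain query engine, by contrast, treats every question independently and would have no referent for "he".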


### Building Wiki Text to Speech with LlamaIndex

In [None]:
# The next project is an application that answers questions using Wikipedia content and converts the answers into speech.

In [None]:
# Fetch the Wikipedia page text through the MediaWiki API

In [30]:
from pathlib import Path

import requests

response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "titles": "Italy",
        "prop": "extracts",
        # 'exintro': True,
        "explaintext": True,
    },
).json()
page = next(iter(response["query"]["pages"].values()))
italy_text = page["extract"]

data_path = Path("data")

# Create the data directory if it does not exist yet.
data_path.mkdir(exist_ok=True)

# with open("data/italy_text.txt", "w") as fp:
#     fp.write(italy_text)

In [32]:
with open("data/italy_text.txt", "w", encoding="utf-8") as fp:
    fp.write(italy_text)
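
The request above assumes the page exists and that the response contains an `extract` field. A small parsing helper can make that assumption explicit. This is a sketch: `extract_page_text` is a hypothetical helper, the sample payloads are hand-written rather than real API output, and the `"missing"` marker for nonexistent pages is an assumption about the response shape.

```python
def extract_page_text(payload: dict) -> str:
    """Return the plain-text extract of the first page in a MediaWiki query response."""
    pages = payload.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError("Response contains no pages")
    page = next(iter(pages.values()))
    if "missing" in page or "extract" not in page:
        raise ValueError(f"No extract available for page {page.get('title')!r}")
    return page["extract"]


# Hand-written sample payloads mirroring the shape used in the request above.
found = {"query": {"pages": {"1": {"title": "Italy", "extract": "Italy is a country."}}}}
missing = {"query": {"pages": {"-1": {"title": "Nosuchpage", "missing": ""}}}}

print(extract_page_text(found))
```

Failing here keeps an empty or partial file from being written and indexed later.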

In [None]:
# Convert text to speech using the ElevenLabs API.
# pip install elevenlabs

* SimpleDirectoryReader loads the TXT file from the data directory, and VectorStoreIndex converts it into a vector store index.

In [33]:
from IPython.display import Audio, Markdown, display
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.tts import ElevenLabsTTS

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

In [34]:
documents

[Document(id_='8a75dcdd-849f-498d-ad4b-7efe03c0ac9a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='7ec69609d2c5683f48a4239f6e05692eafa2486e209c5ed3e7e24a44c456b003', text='Italy (Italian: Italia [iˈtaːlja] ), officially the Italian Republic or the Republic of Italy, is a country in Southern and Western Europe. Located in the middle of the Mediterranean Sea, it consists of a peninsula delimited by the Alps and surrounded by several islands.\nItaly shares land borders with France, Switzerland, Austria, Slovenia and the enclaved microstates of Vatican City and San Marino. It has a territorial exclave in Switzerland (Campione) and an archipelago in the African Plate (Pelagie Islands). Italy covers an area of 301,340 km2 (116,350 sq mi), with a population of about 60 million; it is the tenth-largest country by land area in the European continent and the third-most populous member state of the European Union. Its capital

In [35]:
index

<llama_index.indices.vector_store.base.VectorStoreIndex at 0x1c308c8e810>

In [None]:
# Query

In [36]:
query = "Tell me an interesting fact about the country?"
query_engine = index.as_query_engine()
response = query_engine.query(query)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{response}</p>"))

<b>Tell me an interesting fact about the country?</b>

<p>Italy is known for having the largest number of World Heritage Sites in the world, with a total of 58 sites.</p>

In [None]:
# Text to speech

In [38]:
import os

# Read the ElevenLabs API key from an environment variable.
elevenlabs_key = os.environ.get("ElevenLabs_Key")
tts = ElevenLabsTTS(api_key=elevenlabs_key)

* Pass the response to the generate_audio function to generate natural-sounding speech. To listen to the audio, use the Audio class from IPython.display.

In [39]:
audio = tts.generate_audio(str(response))
Audio(audio)

### End