From [Christopher GS Blog](https://christophergs.com/blog/ai-engineering-retrieval-augmented-generation-rag-llama-index)

In [1]:
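# Get the model. Use the /resolve/ URL (the /blob/ URL serves the HTML file page, not the weights);
# -L follows redirects, -O saves under the remote file name.

!curl -L -O https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf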


'curl' is not recognized as an internal or external command,
operable program or batch file.
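
The cell above failed because `curl` was not on the PATH of this Windows machine. A minimal sketch of a pure-Python alternative, using only the standard library. The helper names `blob_to_resolve_url` and `download_model` are hypothetical, and the sketch assumes the Hugging Face convention that swapping `/blob/` for `/resolve/` in a file URL yields the raw file. The download call is left commented out because the file is many gigabytes.

```python
import urllib.request
from pathlib import Path


def blob_to_resolve_url(url: str) -> str:
    """Turn a Hugging Face file-page URL (/blob/) into a direct-download URL (/resolve/)."""
    return url.replace("/blob/", "/resolve/", 1)


def download_model(url: str, dest_dir: str = ".") -> Path:
    """Download the file behind `url` into `dest_dir` unless it is already there."""
    resolved = blob_to_resolve_url(url)
    dest = Path(dest_dir) / resolved.rsplit("/", 1)[-1]
    if not dest.exists():
        # urlretrieve follows HTTP redirects and writes the response to `dest`
        urllib.request.urlretrieve(resolved, dest)
    return dest


MODEL_URL = (
    "https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/"
    "blob/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf"
)
print(blob_to_resolve_url(MODEL_URL))
# model_path = download_model(MODEL_URL)  # uncomment to actually download
```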


In [1]:
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

llm = LlamaCPP(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
    context_window=32000,
    max_new_tokens=1024,
    #model_kwargs={'n_gpu_layers': 1},
    verbose=True
)

embedding_model = HuggingFaceEmbedding(model_name="WhereIsAI/UAE-Large-V1")  # https://huggingface.co/WhereIsAI/UAE-Large-V1

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:            

In [2]:
from transformers import AutoTokenizer
from llama_index.core import Settings

Settings.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

In [3]:
import logging
import sys
from llama_index.core import set_global_handler

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
set_global_handler("simple")
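
Every DEBUG line in the outputs further down shows up twice, once as `DEBUG:<logger>:<message>` and once as the bare message. That is consistent with the root logger having two stdout handlers: the one `logging.basicConfig` installs, plus the extra `StreamHandler` added right after it. A small standalone sketch that reproduces the effect on an in-memory stream (the function name `emitted_lines` and the logger name `demo` are made up for the illustration):

```python
import io
import logging


def emitted_lines(extra_handler: bool) -> list:
    """Log one DEBUG record and return the lines written to the stream."""
    stream = io.StringIO()
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    root.handlers.clear()  # basicConfig only acts when root has no handlers
    try:
        # Installs one handler with the default "LEVEL:name:message" format
        logging.basicConfig(stream=stream, level=logging.DEBUG)
        if extra_handler:
            # A second handler without a formatter prints the bare message
            root.addHandler(logging.StreamHandler(stream=stream))
        logging.getLogger("demo").debug("Adding chunk")
        return stream.getvalue().splitlines()
    finally:
        # Restore whatever logging setup was active before the demo
        root.handlers[:] = saved_handlers
        root.setLevel(saved_level)


print(emitted_lines(extra_handler=False))
print(emitted_lines(extra_handler=True))
```

Dropping the `addHandler` line from the cell above would be one way to get each message only once.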

In [4]:
from llama_index.core import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding_model,
    system_prompt='You are a bot that answers questions about podcast transcripts'
)

  service_context = ServiceContext.from_defaults(


In [5]:
from pathlib import Path
import glob
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [6]:
transcript_directory = '.'
# Find transcript44.txt anywhere under the directory (recursively) using the Python standard library glob module
transcript_files = glob.glob(str(Path(transcript_directory) / '**/transcript44.txt'), recursive=True)

documents = SimpleDirectoryReader(input_files=transcript_files).load_data()

# We pass in the service context we instantiated earlier (powered by our open-source LLM)
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Hey everyone, welcome to the Latent Space Podca...
> Adding chunk: Hey everyone, welcome to the Latent Space Podca...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: But of course there's data and all this sort of...
> Adding chunk: But of course there's data and all this sort of...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: I actually only have access to gaming GPUs.
Sho...
> Adding chunk: I actually only have access to gaming GPUs.
Sho...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: How do you see that changing, now that a lot of...
> Adding chunk: How do you see that changing, now that a lot of...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Like people train like 2 million batch sizes is...
> Adding chunk: Like people train like 2 million batch sizes is...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: I would argue 30 milliseconds per toke

Generating embeddings:   0%|          | 0/17 [00:00<?, ?it/s]

In [7]:
# Baseline: ask LlamaCPP directly, without any retrieval
response = llm.complete("According to Dylan Patel, what don't people understand about the semiconductor supply chain?")
print(response.text)


llama_print_timings:        load time =  103084.24 ms
llama_print_timings:      sample time =      80.36 ms /   249 runs   (    0.32 ms per token,  3098.36 tokens per second)
llama_print_timings: prompt eval time =  103083.17 ms /    22 tokens ( 4685.60 ms per token,     0.21 tokens per second)
llama_print_timings:        eval time =  637223.07 ms /   248 runs   ( 2569.45 ms per token,     0.39 tokens per second)
llama_print_timings:       total time =  744039.33 ms /   270 tokens


** Prompt: **
According to Dylan Patel, what don't people understand about the semiconductor supply chain?
**************************************************
** Completion: **


Dylan Patel: I think one of the things that people don't understand is how long it takes for a new fab to be built. It can take anywhere from two to four years just to build the fab and then another six months to a year to get it fully qualified and up and running at full capacity. So when you see these headlines saying there's going to be a shortage in 2023 or 2024, it's not because there's not enough capacity being added. It's because it takes so long for that capacity to come online. And then once it does come online, it's not like you can just flip a switch and suddenly have all this capacity available. It has to be ramped up slowly over time. So when people see these headlines and they think, "Oh, there's going to be a shortage in 2023 or 2024," they might think, "Well, why don't we just build more fabs?" 
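
To compare runs, the `llama_print_timings` lines can be turned into numbers. The sketch below is a hypothetical helper (`parse_timings` is not part of llama.cpp or LlamaIndex) and assumes the log format shown in this notebook. It recomputes tokens per second from the raw milliseconds and counts.

```python
import re

# Matches e.g. "llama_print_timings:  eval time =  637223.07 ms /   248 runs"
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<label>[a-z ]+?)\s+time\s*=\s*(?P<ms>[\d.]+)\s*ms"
    r"(?:\s*/\s*(?P<count>\d+)\s+(?:runs|tokens))?"
)


def parse_timings(log_text: str) -> dict:
    """Map each timing label to its milliseconds, token count and tokens/s."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count")) if match.group("count") else None
        rate = count / (ms / 1000.0) if count and ms > 0 else None
        timings[match.group("label")] = {
            "ms": ms,
            "count": count,
            "tokens_per_s": rate,
        }
    return timings


# Lines copied from the output above
LOG = """
llama_print_timings:        load time =  103084.24 ms
llama_print_timings: prompt eval time =  103083.17 ms /    22 tokens ( 4685.60 ms per token,     0.21 tokens per second)
llama_print_timings:        eval time =  637223.07 ms /   248 runs   ( 2569.45 ms per token,     0.39 tokens per second)
llama_print_timings:       total time =  744039.33 ms /   270 tokens
"""

for label, values in parse_timings(LOG).items():
    print(label, values)
```

For this run the total is about 744 seconds for 270 tokens, so generation on CPU is the bottleneck here rather than retrieval.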

In [9]:
# Ask the RAG query engine the same question. Is it any better?
query_engine = index.as_query_engine()
result = query_engine.query("According to Dylan Patel, what don't people understand about the semiconductor supply chain?")

DEBUG:llama_index.core.indices.utils:> Top 2 nodes:
> [Node 718659e3-fd6d-4d66-a9cd-e2276df548bf] [Similarity score:             0.653151] I'm a fellow writer on Substack.
You are obviously managing your consulting business while you're...
> [Node 6df9d824-3260-47eb-bc18-49e4066f81c6] [Similarity score:             0.620988] Maybe the SemiAnalysis analyst point of view is, is it feasible to build this capacity up in the ...
> Top 2 nodes:
> [Node 718659e3-fd6d-4d66-a9cd-e2276df548bf] [Similarity score:             0.653151] I'm a fellow writer on Substack.
You are obviously managing your consulting business while you're...
> [Node 6df9d824-3260-47eb-bc18-49e4066f81c6] [Similarity score:             0.620988] Maybe the SemiAnalysis analyst point of view is, is it feasible to build this capacity up in the ...


Llama.generate: prefix-match hit

llama_print_timings:        load time =  103084.24 ms
llama_print_timings:      sample time =      50.30 ms /   127 runs   (    0.40 ms per token,  2524.75 tokens per second)
llama_print_timings: prompt eval time =  577741.87 ms /  1029 tokens (  561.46 ms per token,     1.78 tokens per second)
llama_print_timings:        eval time =  364217.23 ms /   126 runs   ( 2890.61 ms per token,     0.35 tokens per second)
llama_print_timings:       total time =  943574.16 ms /  1155 tokens


** Prompt: **
You are a bot that answers questions about podcast transcripts

Context information is below.
---------------------
file_path: transcript\transcript44.txt

I'm a fellow writer on Substack.
You are obviously managing your consulting business while you're also publishing these amazing posts.
What's your writing process?
How do you source info?
When do you sit down and go, like, here's the theme for the week?
Do you have a pipeline going out?
Just anything you can describe.
I'm thankful for my teammates because they are actually awesome, and they're much more directed, focused, working on one thing.
Or not one thing, but a number of things, like someone who's an expert on X and Y and Z and the semiconductor supply chain.
So that really helps with that side of the business.
I, most of the times, only write when I'm very excited, or it's like, hey, we should work on this and we should write about this.
One of the most recent posts we did was we explained the manufacturing proc
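
The "Top 2 nodes" log above lists a similarity score per retrieved chunk. A minimal sketch of what such a retrieval step does, assuming the score is the cosine similarity between the query embedding and each chunk embedding (the usual default for an in-memory vector index). The vectors and node ids are toy values for illustration, not embeddings from the transcript.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs: dict, k: int = 2) -> list:
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; real ones have hundreds of dimensions
nodes = {"node-a": [1.0, 0.0], "node-b": [1.0, 1.0], "node-c": [0.0, 1.0]}
for node_id, score in top_k([1.0, 0.0], nodes, k=2):
    print(node_id, round(score, 6))
```

Only the top-scoring chunks are pasted into the prompt, which is why the prompt printed above starts with the Substack passage rather than the opening of the transcript.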

In [10]:
result.response

"\nDylan Patel of SemiAnalysis believes that people don't understand the complexity and fragmentation of the semiconductor supply chain beyond the location of the fabs where the chips are produced. He emphasizes the importance of other companies in the supply chain, such as the Austrian companies with high market share and specific technologies required for chip production, and the Japanese chemical companies involved in the process. He also mentions the challenge of building artificial intelligence to use multiple data centers with lower bandwidth between pools of resources, and the complexity of rebuilding the semiconductor supply chain in a few years."

In [12]:
# Try the same query with Chroma as the vector store

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")

# set up ChromaVectorStore and load in data
vector_store2 = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context2 = StorageContext.from_defaults(vector_store=vector_store2)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context2, embed_model=embedding_model, service_context=service_context
)

# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("According to Dylan Patel, what don't people understand about the semiconductor supply chain?")

DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Hey everyone, welcome to the Latent Space Podca...
> Adding chunk: Hey everyone, welcome to the Latent Space Podca...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: But of course there's data and all this sort of...
> Adding chunk: But of course there's data and all this sort of...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: I actually only have access to gaming GPUs.
Sho...
> Adding chunk: I actually only have access to gaming GPUs.
Sho...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: How do you see that changing, now that a lot of...
> Adding chunk: How do you see that changing, now that a lot of...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Like people train like 2 million batch sizes is...
> Adding chunk: Like people train like 2 million batch sizes is...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: I would argue 30 milliseconds per toke

Llama.generate: prefix-match hit


INFO:backoff:Backing off send_request(...) for 1.3s (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000024F3276CB90>, 'Connection to us-api.i.posthog.com timed out. (connect timeout=15)')))
Backing off send_request(...) for 1.3s (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000024F3276CB90>, 'Connection to us-api.i.posthog.com timed out. (connect timeout=15)')))
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): us-api.i.posthog.com:443
Starting new HTTPS connection (4): us-api.i.posthog.com:443
ERROR:backoff:Giving up send_request(...) after 4 tries (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=4


llama_print_timings:        load time =  103084.24 ms
llama_print_timings:      sample time =     113.11 ms /   138 runs   (    0.82 ms per token,  1220.06 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  223537.33 ms /   138 runs   ( 1619.84 ms per token,     0.62 tokens per second)
llama_print_timings:       total time =  230992.71 ms /   139 tokens


** Prompt: **
You are a bot that answers questions about podcast transcripts

Context information is below.
---------------------
file_path: transcript\transcript44.txt

I'm a fellow writer on Substack.
You are obviously managing your consulting business while you're also publishing these amazing posts.
What's your writing process?
How do you source info?
When do you sit down and go, like, here's the theme for the week?
Do you have a pipeline going out?
Just anything you can describe.
I'm thankful for my teammates because they are actually awesome, and they're much more directed, focused, working on one thing.
Or not one thing, but a number of things, like someone who's an expert on X and Y and Z and the semiconductor supply chain.
So that really helps with that side of the business.
I, most of the times, only write when I'm very excited, or it's like, hey, we should work on this and we should write about this.
One of the most recent posts we did was we explained the manufacturing proc

In [13]:
%%time

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex, download_loader

PDFReader = download_loader('PDFReader')
loader = PDFReader()

documents = loader.load_data(file=Path("./pdfs/Trial-by-Sorcery.pdf"))

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding_model,
    system_prompt='You are a bot that answers questions about the book.'
)

# create client and a fresh collection: delete any "quickstart" collection
# left over from the earlier cell so a rerun starts from an empty store
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("quickstart")
chroma_client.delete_collection("quickstart")
chroma_collection = chroma_client.get_or_create_collection("quickstart")

# set up ChromaVectorStore and load in data
vector_store2 = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context2 = StorageContext.from_defaults(vector_store=vector_store2)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context2, embed_model=embedding_model, service_context=service_context
)

# Query Data
query_engine = index.as_query_engine()



DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: 
> Adding chunk: 
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Trial
by
Sorcery
	
Dragon	Riders	of	Osnen	Book	...
> Adding chunk: Trial
by
Sorcery
	
Dragon	Riders	of	Osnen	Book	...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: Trial	by	Sorcery	©	2020	by	Richard	Fierce
	
	
T...
> Adding chunk: Trial	by	Sorcery	©	2020	by	Richard	Fierce
	
	
T...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: TABLE	OF	CONTENTS
Chapter	1
Chapter	2
Chapter	3...
> Adding chunk: TABLE	OF	CONTENTS
Chapter	1
Chapter	2
Chapter	3...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: 1
	
I	marveled	at	the	vastness	of	the	Citadel.
...
> Adding chunk: 1
	
I	marveled	at	the	vastness	of	the	Citadel.
...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: least	a	hundred	people	waiting	to	get	into	the	...
> Adding chunk: least	a	hundred	people	waiting	to	get	into	the	...
DEBUG:llama_index.co



DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: swing,	which	threw	his	motion	off.	The	blade	sw...
> Adding chunk: swing,	which	threw	his	motion	off.	The	blade	sw...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: 7
	
When	I	awoke,	I	had	no	idea	where	I	was.
My...
> Adding chunk: 7
	
When	I	awoke,	I	had	no	idea	where	I	was.
My...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: “I	was	browsing	the	vendors.”
“How	did	you	end	...
> Adding chunk: “I	was	browsing	the	vendors.”
“How	did	you	end	...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: “I	have	to.	This	is	my	only	chance.”	I	couldn’t...
> Adding chunk: “I	have	to.	This	is	my	only	chance.”	I	couldn’t...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: have	to	ask.	Did	you	use	magic	against	those	gu...
> Adding chunk: have	to	ask.	Did	you	use	magic	against	those	gu...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: my	body	was	sore.	By	the	time	I	reache

In [14]:
%%time
response = query_engine.query("What is a Dragon Guard?")

DEBUG:llama_index.vector_stores.chroma.base:> Top 1 nodes:
> Top 1 nodes:
DEBUG:llama_index.vector_stores.chroma.base:> [Node 052103da-46a0-4e66-a7ab-a0f001ddfe30] [Similarity score: 0.5267347034731992] 2
	
The	entrance	to	the	Citadel	was	much	more	heavily	guarded	than	the	city
gates.	And	these	guar...
> [Node 052103da-46a0-4e66-a7ab-a0f001ddfe30] [Similarity score: 0.5267347034731992] 2
	
The	entrance	to	the	Citadel	was	much	more	heavily	guarded	than	the	city
gates.	And	these	guar...
DEBUG:llama_index.vector_stores.chroma.base:> [Node b219b40b-73bb-4197-93a5-34f27b6c5be1] [Similarity score: 0.46287886124050837] 1
	
I	marveled	at	the	vastness	of	the	Citadel.
It	was	home	to	the	Dragon	Guard,	the	greatest	warr...
> [Node b219b40b-73bb-4197-93a5-34f27b6c5be1] [Similarity score: 0.46287886124050837] 1
	
I	marveled	at	the	vastness	of	the	Citadel.
It	was	home	to	the	Dragon	Guard,	the	greatest	warr...
DEBUG:llama_index.core.indices.utils:> Top 2 nodes:
> [Node 052103da-46a0-4e66-a7ab-a0f001dd

Llama.generate: prefix-match hit


INFO:backoff:Backing off send_request(...) for 0.4s (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000024F26916D50>, 'Connection to us-api.i.posthog.com timed out. (connect timeout=15)')))
Backing off send_request(...) for 0.4s (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x0000024F26916D50>, 'Connection to us-api.i.posthog.com timed out. (connect timeout=15)')))
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (18): us-api.i.posthog.com:443
Starting new HTTPS connection (18): us-api.i.posthog.com:443
INFO:backoff:Backing off send_request(...) for 0.6s (requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='us-api.i.posthog.com', port=443


llama_print_timings:        load time =  103084.24 ms
llama_print_timings:      sample time =      17.15 ms /    73 runs   (    0.23 ms per token,  4256.81 tokens per second)
llama_print_timings: prompt eval time =  378241.80 ms /  1882 tokens (  200.98 ms per token,     4.98 tokens per second)
llama_print_timings:        eval time =   26945.30 ms /    72 runs   (  374.24 ms per token,     2.67 tokens per second)
llama_print_timings:       total time =  405725.66 ms /  1954 tokens


** Prompt: **
You are a bot that answers questions about podcast transcripts

Context information is below.
---------------------
page_label: 11
file_name: pdfs\Trial-by-Sorcery.pdf

2
	
The	entrance	to	the	Citadel	was	much	more	heavily	guarded	than	the	city
gates.	And	these	guards	weren’t	the	city	guard,	either.	They	were	Dragon
Guards.	Their	armor	was	decorated	to	look	like	dragon	scales,	but	it	was
versatile	and	practical	for	battle.	Behind	the	assemblage	of	guards	was	a	long
wooden	table	that	had	weapons	scattered	haphazardly	on	its	surface.
As	I	drew	nearer,	there	was	a	
whooshing
	sound	that	echoed	off	the	massive
walls	and	made	the	items	on	the	table	clatter.	The	guards	seemed
unperturbed	by	the	noise,	but	I	was	trying	to	figure	out	what	it	was	and
where	it	was	coming	from.	Suddenly,	a	massive	blue	dragon	swooped	down
from	the	sky	and	landed	in	the	courtyard.
I	held	my	breath	in	awe	as	I	stared	at	the	powerful	beast.	It	was	easily
thirty	feet	long	from	its	nose	to	its	tail.	The	

In [3]:
%%time

import glob
from pathlib import Path

import chromadb
from llama_index.core import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    download_loader,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.vector_stores.chroma import ChromaVectorStore

PDFReader = download_loader('PDFReader')
loader = PDFReader()

llm = LlamaCPP(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
    context_window=32000,
    max_new_tokens=768,
    # model_kwargs={'n_gpu_layers': 1},
    verbose=True
)
embedding_model = HuggingFaceEmbedding(model_name="intfloat/e5-base-v2") 
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding_model,
    system_prompt='You are a bot that answers questions about the book.'
)

documents = loader.load_data(file=Path("./pdfs/Trial-by-Sorcery.pdf"))

# create client and a fresh collection: delete any existing "quickstart2"
# collection so a rerun of this cell starts from an empty store
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("quickstart2")
chroma_client.delete_collection("quickstart2")
chroma_collection = chroma_client.get_or_create_collection("quickstart2")

# set up ChromaVectorStore and load in data
vector_store2 = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context2 = StorageContext.from_defaults(vector_store=vector_store2)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context2, embed_model=embedding_model, service_context=service_context
)

# Query Data
query_engine = index.as_query_engine()

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:            

CPU times: total: 6min 45s
Wall time: 2min 30s


In [8]:
%%time
#response = query_engine.query("query: What is a Dragon Guard?")
response = query_engine.query("query: Who is the author of the book?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =  144827.68 ms
llama_print_timings:      sample time =       5.41 ms /    15 runs   (    0.36 ms per token,  2773.67 tokens per second)
llama_print_timings: prompt eval time =   90902.54 ms /   181 tokens (  502.22 ms per token,     1.99 tokens per second)
llama_print_timings:        eval time =   31097.30 ms /    14 runs   ( 2221.24 ms per token,     0.45 tokens per second)
llama_print_timings:       total time =  122168.07 ms /   195 tokens


CPU times: total: 5min 21s
Wall time: 2min 3s
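
The `query: ` prefix in the question above matches the input convention of the E5 embedding models (here `intfloat/e5-base-v2`), which are trained with `query: ` in front of search queries and `passage: ` in front of the texts being searched. A minimal sketch of helpers that apply these prefixes. The function names are hypothetical, and whether the indexed chunks in this notebook receive a `passage: ` prefix depends on how the embedding wrapper is configured, which is not shown here.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_e5_query(text: str) -> str:
    """Prefix a search query for an E5 model (no-op if already prefixed)."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def as_e5_passage(text: str) -> str:
    """Prefix a document chunk for an E5 model (no-op if already prefixed)."""
    return text if text.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + text


print(as_e5_query("Who is the author of the book?"))
print(as_e5_passage("I marveled at the vastness of the Citadel."))
```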


In [7]:
print(response)


The context information provided does not contain any details about what a Dragon Guard is.
