<a href="https://colab.research.google.com/github/ganesh3/llm-work/blob/main/RAG_Chatbot_for_AWS_Case_Studies_%26_Blogs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kaggle llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface
!kaggle datasets download -d harshsinghal/aws-case-studies-and-blogs
!unzip /content/aws-case-studies-and-blogs.zip -d /content/aws-case-studies-and-blogs/

Collecting llama-index
  Downloading llama_index-0.12.0-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-llms-huggingface
  Downloading llama_index_llms_huggingface-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.4.0-py3-none-any.whl.metadata (767 bytes)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.0-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.0 (from llama-index)
  Downloading llama_index_core-0.12.1-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.0-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from ll

###### The AWS Case Studies and Blogs dataset consists only of text files, so we use SimpleDirectoryReader from llama_index.core to read and parse every text document in the data folder.

###### Next, we index the documents: each one is converted to vector embeddings, and an index structure is built over them so that relevant passages can be retrieved efficiently. For the embeddings, we load a bge-base model through the HuggingFaceEmbedding class from llama_index.embeddings.huggingface.

###### Lastly, we use VectorStoreIndex from llama_index.core to build a vector store over the embedded documents. Once the index is built, we persist it to disk so that later runs can load it instead of re-indexing.
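
To make the retrieval idea concrete, here is a minimal sketch of embedding-based lookup in plain Python. It is not how VectorStoreIndex is implemented internally; the toy three-dimensional vectors, the document names, and the helper functions are all hypothetical, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real bge-base vectors have 768 dimensions.
toy_index = {
    "sagemaker_case_study": [0.9, 0.1, 0.0],
    "s3_blog_post": [0.1, 0.9, 0.0],
    "lambda_blog_post": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a SageMaker question

results = top_k(query, toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The `k` argument here plays the same role as `similarity_top_k` in the query engine used later in the notebook.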


In [2]:
import os

In [3]:
from llama_index.core import (VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage, Settings)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [4]:
# bge-base embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/777 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [5]:
# check if storage already exists
PERSIST_DIR = "./aws-storage-index"

In [6]:
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("/content/aws-case-studies-and-blogs/").load_data()
    print("Total documents : ", len(documents))
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

Total documents :  347


In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [9]:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16, low_cpu_mem_usage=True)

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [10]:
tokenizer.save_pretrained("tinyllama-tokenizer")
model.save_pretrained("tinyllama-model", max_shard_size="1000MB")
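
As a rough sanity check on the sizes involved, the weight footprint can be estimated from the parameter count and the bytes per parameter. The sketch below assumes about 1.1 billion parameters (taken from the "1.1B" in the model name) and treats "1000MB" as 1 GB; the helper names are hypothetical, and the real shard count depends on where tensor boundaries fall.

```python
import math

def weights_size_gb(n_params, bytes_per_param):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_shards(total_gb, max_shard_gb):
    """Lower bound on the number of shard files for a given shard size cap."""
    return math.ceil(total_gb / max_shard_gb)

n_params = 1.1e9  # assumption: read off the "1.1B" in the model name

fp16_gb = weights_size_gb(n_params, 2)  # float16 stores 2 bytes per parameter
fp32_gb = weights_size_gb(n_params, 4)  # float32 stores 4 bytes per parameter
shards = min_shards(fp16_gb, 1.0)       # max_shard_size="1000MB" is about 1 GB

print(f"float16: {fp16_gb:.1f} GB, float32: {fp32_gb:.1f} GB, shards >= {shards}")
```

The float16 estimate is in line with the 2.20G `model.safetensors` download shown above, and the shard estimate with the three checkpoint shards reported when the saved model is reloaded later in the notebook.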

In [12]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # The model object was already loaded in float16 above, so no dtype is passed here.
    # Pass either `device` or `device_map`, not both; `device` was computed in an earlier cell.
    device=device,
)


In [13]:
prompt = "What are different use cases of Amazon Sagemaker?"

In [14]:
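sequences = pipe(
    prompt,
    max_new_tokens=250,      # cap on the number of generated tokens
    do_sample=True,          # sample instead of greedy decoding
    top_k=10,                # sample only from the 10 most likely next tokens
    return_full_text=False,  # return only the completion, not the prompt
)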
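
The `top_k=10` setting restricts sampling to the ten most likely next tokens. Below is a minimal sketch of that filtering step on toy logits. It illustrates the idea only and is not the transformers implementation; the function names and the logit values are hypothetical.

```python
import math
import random

def top_k_probs(logits, k):
    """Softmax over the k largest logits; every other token gets probability 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def sample_top_k(logits, k, rng):
    """Draw one token id from the top-k filtered distribution."""
    probs = top_k_probs(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # one made-up score per vocabulary entry
probs = top_k_probs(toy_logits, k=2)
rng = random.Random(0)                    # seeded so the sketch is repeatable
token_id = sample_top_k(toy_logits, k=2, rng=rng)

print(probs)
print(token_id)
```

With `k=2`, only the first two entries can ever be sampled, however many times the draw is repeated.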

In [15]:
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: 
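
One possible reason for the empty result is that the raw question was passed without the chat markup a chat-tuned model expects; this is an assumption, and other causes are possible. The sketch below builds a prompt in the `<|user|>` / `<|assistant|>` layout described on the TinyLlama-Chat model card. The helper name is hypothetical.

```python
def build_chat_prompt(user_message, system_message=None):
    """Wrap a message in the <|role|> ... </s> layout (assumed from the model card)."""
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}</s>\n")
    parts.append(f"<|user|>\n{user_message}</s>\n")
    parts.append("<|assistant|>\n")  # generation is expected to continue from here
    return "".join(parts)

chat_prompt = build_chat_prompt("What are different use cases of Amazon Sagemaker?")
print(chat_prompt)
```

Inside the notebook, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the more robust option, since it uses the template shipped with the tokenizer rather than a hand-written one.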


In [16]:
from llama_index.llms.huggingface import HuggingFaceLLM
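Settings.llm = HuggingFaceLLM(
    context_window=2048,   # maximum tokens (prompt + completion) the model can attend to
    max_new_tokens=256,    # cap on tokens generated per answer
    # Greedy decoding: with do_sample=False, temperature has no effect, so it is omitted.
    generate_kwargs={"do_sample": False},
    tokenizer_name="tinyllama-tokenizer",  # local directory saved above
    model_name="tinyllama-model",          # local directory saved above
    tokenizer_kwargs={"max_length": 512},
    model_kwargs={"torch_dtype": torch.float16}
)
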
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [17]:
import warnings
warnings.simplefilter("ignore")

from IPython.display import HTML, display

In [18]:
def set_css():
    display(HTML('''
        <style>
            pre {
                white-space: pre-wrap;
            }
        </style>
    '''))

get_ipython().events.register('pre_run_cell', set_css)

In [19]:
# Query the index. similarity_top_k is the number of top-matching chunks (nodes)
# the engine retrieves from the index as context for the LLM.
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What are different use cases of Amazon Sagemaker?")
print(response)

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.



1. Improving Performance and Standardizing Deployment of ML Models Using Amazon SageMaker
2. Using SageMaker to Implement Improved Revenue Management Solution for Visualfabriq in Under 2 Months
3. Standardized deployment of ML models using SageMaker
4. Improving Model Response Times by 200 percent and deploying a scalable solution that requires less manual intervention and facilitates faster onboarding for new customers.
5. Using SageMaker to improve response times by 200 percent and deployed a scalable solution that requires less manual intervention and facilitates faster onboarding for new customers.
6. Improving Revenue Management Solution for Visualfabriq in Under 2 Months
7. Deploying a scalable solution that requires less manual intervention and facilitates faster onboarding for new customers.
8. Using Amazon SageMaker to Improve Response Times by 200 percent and deployed a scalable solution that requires less manual intervention and facilitates faster onboarding for new custome
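
The length warning in the output above most likely means that the prompt (question plus retrieved context) together with `max_new_tokens` exceeded the 2048-token window. A minimal sketch of that budget arithmetic follows. The helper names and the example prompt lengths are hypothetical; the 2048 and 256 values come from the HuggingFaceLLM settings.

```python
def prompt_budget(context_window, max_new_tokens):
    """Tokens left for the prompt once room for the answer is reserved."""
    return context_window - max_new_tokens

def fits_context(prompt_tokens, context_window, max_new_tokens):
    """True if the prompt plus the requested completion fits in the window."""
    return prompt_tokens + max_new_tokens <= context_window

CONTEXT_WINDOW = 2048  # from the HuggingFaceLLM settings
MAX_NEW_TOKENS = 256   # from the HuggingFaceLLM settings

budget = prompt_budget(CONTEXT_WINDOW, MAX_NEW_TOKENS)
short_ok = fits_context(1500, CONTEXT_WINDOW, MAX_NEW_TOKENS)  # hypothetical prompt length
long_ok = fits_context(1900, CONTEXT_WINDOW, MAX_NEW_TOKENS)   # hypothetical prompt length

print(budget, short_ok, long_ok)
```

If the prompt is too long, lowering `similarity_top_k` or using smaller chunks at indexing time are two ways to shrink the retrieved context.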

In [20]:
!pip freeze > ./requirements.txt

In [None]:
# Build a chat engine on top of the index and start from an empty history
chat_engine = index.as_chat_engine()
chat_engine.reset()
print("Your bot is ready to talk! Type your messages here or send 'stop'")
while True:
    query = input("<|user|>\n")
    if query == 'stop':
        break
    response = chat_engine.chat(query)
    print("\n", response, "\n")

Your bot is ready to talk! Type your messages here or send 'stop'
<|user|>
What are the different use cases of AWS Sagemaker?

 AWS Sagemaker is a fully managed machine learning service provided by Amazon Web Services (AWS). Here are some of the common use cases of AWS Sagemaker:

1. Model training and deployment: AWS Sagemaker provides a platform for model training and deployment. It allows you to train models on your own data, and then deploy them to production.

2. Image classification: AWS Sagemaker provides a platform for image classification tasks. It allows you to train models on your own data, and then deploy them to production.

3. Natural language processing: AWS Sagemaker provides a platform for natural language processing tasks. It allows you to train models on your own data, and then deploy them to production.

4. Time series forecasting: AWS Sagemaker provides a platform for time series forecasting tasks. It allows you to train models on your own data, and then deploy the