**Coursebook: Using LLM for Web Scraping and Prompt Design**

- Part 4 of using LLM for web scraping and prompt design
- Course Length: 9 hours
- Last Updated: July 2023
---

Developed by Algoritma's Research and Development division

## Background

This coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). It is intended for a restricted audience only, i.e. the individuals and organizations that received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

# Using LLM for Web Scraping and Prompt Design

- **Leveraging Large Language Models (LLM) for Web Scraping and Information Extraction**
   - Using LLM for web scraping
   - Introduction to the steps involved in connecting website URLs with LLM
   - Introduction to LlamaIndex and its usage in web scraping
   - Demonstration of using LangChain and OpenAI to build a Question-Answering System with website data

- **Optimizing Large Language Models (LLM) for Local Deployment and Performance Enhancement**
   - Building a locally hosted (offline) LLM using LlamaIndex and OPT (Instruction-Tuning LLM)
   - Demonstration of using LlamaIndex and OPT to host LLM locally
   - Designing effective prompts for LLM
   - Using LangChain's Caching to enhance LLM performance
   - Demonstration of prompt design and caching to maximize LLM usage

- **Ethical Considerations and Future Implications of Generative AI with Large Language Models (LLM)**
   - A language for LLM prompt design: Guidance
   - Understanding the ethical considerations of Generative AI
   - Impact on privacy, bias, and misinformation
   - Responsible use of Large Language Models in society
   - Discussion on the future of Generative AI and its potential impact

## Leveraging Large Language Models (LLM) for Web Scraping and Information Extraction

### Using LangChain to Scrape Information from a Website

Previously, we loaded text and PDF files as `Document` objects using a loader. Now we scrape information from a website and feed it to the LLM using `WebBaseLoader` from `langchain`.

[Docs WebBaseLoader](https://python.langchain.com/docs/integrations/document_loaders/web_base)

In [44]:
from langchain.document_loaders import WebBaseLoader

# load FAQ blog about Algoritma
loader = WebBaseLoader("https://blog.algorit.ma/faq-bootcamp-algoritma-data-science/")
docs = loader.load()

`docs` is a list of documents. Each document has two components:
- `page_content`: the text content of the page
- `metadata`: details about the page, such as its source, language and title

In [48]:
# Only 1 document is returned because we scraped a single web page
len(docs)

1

In [50]:
# see the metadata
docs[0].metadata

{'source': 'https://blog.algorit.ma/faq-bootcamp-algoritma-data-science/',
 'title': 'FAQ Bootcamp Algoritma Data Science School',
 'description': 'Temukan jawaban dari beberapa pertanyaan yang sering ditanyakan seputar Bootcamp Data Science di Algoritma Data Science School.',
 'language': 'id'}

In [51]:
# the length of the page_content
len(docs[0].page_content)

10053

In [49]:
# see the first 100 characters of page_content
docs[0].page_content[:100]

'\n\n\n\n\nFAQ Bootcamp Algoritma Data Science School\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

Notice the long runs of `\n`. Before we embed this text, we need to clean it and divide it into chunks using `CharacterTextSplitter`.

In [53]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)

algo_texts = text_splitter.split_documents(docs)

Created a chunk of size 199, which is longer than the specified 100
Created a chunk of size 7513, which is longer than the specified 100
Created a chunk of size 236, which is longer than the specified 100
Created a chunk of size 179, which is longer than the specified 100
Created a chunk of size 107, which is longer than the specified 100
Created a chunk of size 134, which is longer than the specified 100
Created a chunk of size 226, which is longer than the specified 100
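
The warnings above appear because `CharacterTextSplitter` cuts the text only at a separator (`"\n\n"` by default), so a piece that contains no separator stays whole even when it is longer than `chunk_size`. The following is a minimal pure-Python sketch of that behavior, not LangChain's actual implementation; the function name `split_on_separator` and the sample text are made up for illustration.

```python
def split_on_separator(text, separator="\n\n", chunk_size=100):
    """Greedy separator-based chunking (simplified illustration)."""
    # Split on the separator and drop pieces that are empty or whitespace only.
    pieces = [p.strip() for p in text.split(separator)]
    pieces = [p for p in pieces if p]

    chunks, current = [], ""
    for piece in pieces:
        # Try to merge the next piece into the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The merged text would be too long: close the current chunk.
            if current:
                chunks.append(current)
            # A single piece is never cut, even if it exceeds chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Short title\n\n\n\n\n\nSkip to content\n\n" + "x" * 150
chunks = split_on_separator(sample, chunk_size=100)
print([len(c) for c in chunks])
```

The two short pieces are merged into one chunk, while the 150-character run without any separator becomes a single oversized chunk. This is the same situation the warnings report for the FAQ page.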


In [54]:
# see how many chunks were created
len(algo_texts)

22

In [55]:
# See the first chunk
# A few \n remain, but far fewer than in docs[0].page_content[:100]
algo_texts[0]

Document(page_content='FAQ Bootcamp Algoritma Data Science School\n\n\n \n\n\nSkip to content\n\nAbout\n\n \nSearch', metadata={'source': 'https://blog.algorit.ma/faq-bootcamp-algoritma-data-science/', 'title': 'FAQ Bootcamp Algoritma Data Science School', 'description': 'Temukan jawaban dari beberapa pertanyaan yang sering ditanyakan seputar Bootcamp Data Science di Algoritma Data Science School.', 'language': 'id'})

Now that the text is divided into chunks, we can embed it, store it in Chroma (a vector database) and create a QnA chain just like we did in module 3.

In [56]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

In [57]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Name the chroma collection as "algo-faq"
algo_db = Chroma.from_documents(algo_texts, embedding_function, collection_name="algo-faq")
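
Conceptually, a vector store keeps one embedding vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The sketch below illustrates the idea with hand-made 3-dimensional vectors standing in for real embeddings. Ranking by cosine similarity is an assumption made for illustration; Chroma's actual index and distance metric may differ. The texts, vectors and function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, made-up embedding) pairs
toy_store = [
    ("Learning methods: onsite and online", [0.9, 0.1, 0.0]),
    ("Payment options and financing partners", [0.0, 0.2, 0.9]),
    ("Who can join the program", [0.1, 0.9, 0.1]),
]
toy_query = [0.8, 0.3, 0.0]  # made-up embedding of a user question

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

In the real pipeline, `SentenceTransformerEmbeddings` produces the vectors and Chroma performs the nearest-neighbor search.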

In [58]:
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain import OpenAI

# Load OpenAI token from env
load_dotenv()
# create llm using OpenAI
llm = OpenAI()

# Create the QnA Chain
algo_faq = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=algo_db.as_retriever()
)
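
With `chain_type="stuff"`, the chunks returned by the retriever are concatenated ("stuffed") into a single prompt together with the question, and that one prompt is sent to the LLM. A rough sketch of the idea follows. The prompt wording and the helper `build_stuff_prompt` are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Put all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Why study in Algoritma Data Science?",
    ["chunk A text", "chunk B text"],  # placeholders for retrieved chunks
)
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, the prompt grows with the number and length of the chunks. This is worth keeping in mind when one chunk is several thousand characters long, as in the splitter output above.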

This is an example of a bot/QnA system that users can query in different languages, getting answers that fit the context of the question based on the company's own QnA information (here, Algoritma's).

In [42]:
# Asking a question
algo_faq("Why study in Algoritma Data Science?")

{'query': 'Why study in Algoritma Data Science?',
 'result': ' Algoritma Data Science School provides a comprehensive curriculum for students to learn data science, with benefits such as lifetime learning, full course material, portfolio projects, teaching assistant, mentoring sessions, career preparation services, and quality facilities.'}

In [43]:
algo_faq("Metode belajar di Algoritma?")

{'query': 'Metode belajar di Algoritma?',
 'result': ' Metode belajar di Algoritma Data Science School terbagi dua, yaitu Onsite dan Online. Onsite adalah Belajar dengan metode offline di kelas Algoritma yang terletak di Menara Menara Standard Chartered, Lantai 31. Online adalah belajar dari mana saja Anda mau dengan metode online interaktif.'}

In [59]:
algo_faq("Apakah program di Algoritma dapat diikuti orang awam?")

{'query': 'Apakah program di Algoritma dapat diikuti orang awam?',
 'result': ' Ya, program di Algoritma Data Science School dapat diikuti oleh orang awam tanpa latar belakang pendidikan IT sebelumnya. Contohnya seperti Reyna Cheryl Sondakh dan Samuel Gema yang telah berhasil menyelesaikan program dengan background non-IT dan kini telah menjadi data scientist atau data analyst di perusahaan ternama di Indonesia.'}

### Using LlamaIndex to Create a QnA System from Web-Based Information

The goal of LlamaIndex is to enhance document management through advanced technology, providing an intuitive and efficient way to search and summarize documents using LLMs and innovative indexing techniques. [source1](https://medium.com/badal-io/exploring-langchain-and-llamaindex-to-achieve-standardization-and-interoperability-in-large-2b5f3fabc360) [source2](https://gpt-index.readthedocs.io/en/latest/index.html#why-llamaindex).



In [61]:
# Using Trafilatura to Scrape information from web
from llama_index import TrafilaturaWebReader

In [63]:
documents = TrafilaturaWebReader().load_data(["https://blog.algorit.ma/faq-bootcamp-algoritma-data-science/"])

len(documents)

1

In [65]:
documents

[Document(id_='21fbf7e9-3972-4eda-bb17-7cb62785682a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='69b226edcf37ea64cfc32e7e9a0eba2c4bf7680cb0f5d00d90eba8757072e169', text='Frequently Asked Questions (FAQ) Bootcamp Algoritma Data Science School\nTemukan jawaban dari beberapa pertanyaan yang sering ditanyakan seputar Bootcamp Data Science di Algoritma Data Science School.\nTable of Contents\nKenapa Belajar di Algoritma Data Science School?\n- Lifetime Learning\nSeluruh Alumni Algoritma (Academy Regular dan Full-Stack) mendapatkan akses mengikuti workshop yang diadakan oleh Tim Algoritma secara gratis seumur hidup!\n- Materi Lengkap\nMateri Kursus (PDF & HTML), dataset untuk latihan, catatan referensi, dan worksheet (Notebook R atau Notebook Jupyter) dapat diakses dengan mudah melalui akun Learning Management System.\n- Portfolio Project\nProyek Data Science yang Anda buat sesuai dengan perkembangan industri dan kasus

In [68]:
# see the first 200 characters of the text
# cleaner than the output of LangChain's WebBaseLoader
documents[0].text[:200]

'Frequently Asked Questions (FAQ) Bootcamp Algoritma Data Science School\nTemukan jawaban dari beberapa pertanyaan yang sering ditanyakan seputar Bootcamp Data Science di Algoritma Data Science School.\n'

After that, we create a vector index (a data structure for efficient retrieval of items from a vector store) with `VectorStoreIndex` and store it in Chroma.

[VectorStoreIndex](https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/index_guide.html#vector-store-index)

In [74]:
from llama_index import GPTVectorStoreIndex
import chromadb

In [75]:
# Create empty collection
chroma_client = chromadb.Client()
algo_collection = chroma_client.create_collection('algo-faq')

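In [79]:
from llama_index import StorageContext
from llama_index.vector_stores import ChromaVectorStore

# Load OpenAI token from env
load_dotenv()
# Assumption: the index only writes to Chroma when the collection is wrapped in a
# ChromaVectorStore and passed via a StorageContext; a bare `chroma_collection`
# keyword given to `from_documents` appears to be ignored
vector_store = ChromaVectorStore(chroma_collection=algo_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create Vector Index and store it to 'index' object
index = GPTVectorStoreIndex.from_documents(documents,
                                           storage_context=storage_context)
# Create query engine from index
query_engine = index.as_query_engine()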

In [80]:
# Query/ask question from query_engine
response = query_engine.query("Apakah program di Algoritma dapat diikuti orang awam?")

response

Response(response='\nYa, program di Algoritma dapat diikuti oleh orang awam. Algoritma tidak membatasi usia untuk belajar data science bersama Algoritma, umumnya yang mengikuti program ini berusia 17 - 55 tahun.', source_nodes=[NodeWithScore(node=TextNode(id_='a8b1ac9d-bad1-491a-b233-9a9169d2deb4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='21fbf7e9-3972-4eda-bb17-7cb62785682a', node_type=None, metadata={}, hash='69b226edcf37ea64cfc32e7e9a0eba2c4bf7680cb0f5d00d90eba8757072e169'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='8dd47171-f2a9-4aab-b9d4-c1701a86f662', node_type=None, metadata={}, hash='c5a3f1870b9d87f2ed3ec9d5f246d4d51f912288ec31cd9978caed1cda07a20f')}, hash='8c3a398efe18675002915bebedb4637136a3b312e200da01f35699ac2b835206', text='hingga meraih mimpi yang Anda inginkan. Sehingga Algoritma pun tidak membatasi usia untuk belajar data science b

In [82]:
print(f"Answer: {response}")

Answer: 
Ya, program di Algoritma dapat diikuti oleh orang awam. Algoritma tidak membatasi usia untuk belajar data science bersama Algoritma, umumnya yang mengikuti program ini berusia 17 - 55 tahun.


### Wrap Together Into Function

In [83]:
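# Create an empty Chroma collection to hold the embeddings
def create_embedding_store(name):
    chroma_client = chromadb.Client()
    return chroma_client.create_collection(name)

# Query pages and print the answer to each question
def query_pages(collection, urls, questions):
    # Imported here so the function is self-contained
    from llama_index import StorageContext
    from llama_index.vector_stores import ChromaVectorStore

    # Get content from webpages
    docs = TrafilaturaWebReader().load_data(urls)
    # Assumption: the collection must be wrapped in a ChromaVectorStore and passed
    # via a StorageContext for the index to actually store its vectors in Chroma
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Create the vector store index
    index = GPTVectorStoreIndex.from_documents(docs, storage_context=storage_context)
    # Create query engine
    query_engine = index.as_query_engine()
    # Iterate over the questions, query and print each answer
    for question in questions:
        print(f"Question: {question}")
        print(f"Answer: {query_engine.query(question)} \n\n")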

Example: suppose we have a product and want to know the characteristics and opinions of the reviewers who use it.

In [None]:
# Get product review for 2 pages
url_list_erha = ["https://reviews.femaledaily.com/products/cleanser/facial-wash/erha/erha-1-facial-wash?cat=&cat_id=0&age_range=&skin_type=&skin_tone=&skin_undertone=&hair_texture=&hair_type=&order=honest&page=1",
                 "https://reviews.femaledaily.com/products/cleanser/facial-wash/erha/erha-1-facial-wash?cat=&cat_id=0&age_range=&skin_type=&skin_tone=&skin_undertone=&hair_texture=&hair_type=&order=honest&page=2"]

questions_erha = [
    "How long the reviewer use Erha 1?",
    "What is the sentiment review of this product and what its about?",
    "Is there any suggested improvement from reviewer?",
    "What is the reviewer's favorite thing about this product?" 
]

collection_erha = create_embedding_store("erha1")

In [17]:
# query the question and generate answer
query_pages(
    collection_erha,
    url_list_erha,
    questions_erha
)

Question: How long the reviewer use Erha 1?
Answer: 
More than 1 year 


Question: What is the sentiment review of this product and what its about?
Answer: 
The sentiment review of this product is generally positive. Customers have found that the product is gentle on the skin, has a clear texture, and has a faint chemical smell. It lathers well and leaves the skin feeling soft and not dry. The packaging is compact and travel friendly, and the price is affordable. Customers have also found that the product lasts a long time and is easy to repurchase. 


Question: Is there any suggested improvement from reviewer?
Answer: 
Yes, there are suggested improvements from the reviewers. For example, Zahradestriana recommends using a facial wash with AHA, DMAE, and Aloe Vera Extract for normal and dry skin. Novitawd suggests finding a facial wash that is free of SLS. Windiiw_ recommends using a facial wash with Niacinamide, AHA, DMAE, and Aloe Vera Extract for anti-aging and moisturizing benefits
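
`query_pages` prints its answers, which is convenient in a notebook but awkward when the answers need further processing (saving to a file, comparing products, and so on). The sketch below shows one possible variation that returns the answers as a dictionary instead. `answer_questions` and `EchoEngine` are hypothetical names; the only assumption is that the engine exposes a `.query(question)` method, as `query_engine` does above. The stub engine is there so the example runs without an API key.

```python
def answer_questions(query_engine, questions):
    """Return a {question: answer} dict instead of printing each answer."""
    answers = {}
    for question in questions:
        # str() mirrors what the f-string in query_pages does with the response
        answers[question] = str(query_engine.query(question)).strip()
    return answers


class EchoEngine:
    """Hypothetical stand-in for a real query engine."""

    def query(self, question):
        return f"\n(stub answer for: {question})"


answers = answer_questions(EchoEngine(), ["How long the reviewer use Erha 1?"])
print(answers)
```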

## Optimizing Large Language Models (LLM)

### LangChain Caching

LangChain provides an optional caching layer for LLMs. This is useful for two reasons:

- It can save you money by reducing the number of API calls you make to the LLM provider, if you're often requesting the same completion multiple times.
- It can speed up your application by reducing the number of API calls you make to the LLM provider.

[source](https://python.langchain.com/docs/modules/model_io/models/llms/llm_caching)
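
The idea behind such a caching layer can be sketched in a few lines of plain Python: remember the response for each prompt, and only call the model when the prompt has not been seen before. This is a simplified illustration, not LangChain's implementation. `CountingLLM` and `CachedLLM` are hypothetical classes, and the cache here is keyed on the exact prompt text only, whereas LangChain's cache, as far as we understand it, also takes the model settings into account.

```python
class CountingLLM:
    """Stand-in for a paid LLM API; counts how many real calls are made."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer to: {prompt}"


class CachedLLM:
    """Wraps an LLM and reuses stored responses for repeated prompts."""

    def __init__(self, llm):
        self.llm = llm
        self.cache = {}  # prompt -> response, kept in memory only

    def __call__(self, prompt):
        if prompt not in self.cache:
            # Cache miss: pay for one real call and remember the result.
            self.cache[prompt] = self.llm(prompt)
        return self.cache[prompt]


fake_llm = CountingLLM()
cached_llm = CachedLLM(fake_llm)

first = cached_llm("Metode belajar di Algoritma?")
second = cached_llm("Metode belajar di Algoritma?")  # served from the cache
print(fake_llm.calls)
```

Only an exactly repeated prompt is a cache hit; a rephrased question is a new key and triggers a new call.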

First, let's track how many tokens our query uses and how much it costs. We can track token usage using `get_openai_callback()`.

[source](https://python.langchain.com/docs/modules/model_io/models/llms/token_usage_tracking)

In [85]:
from langchain.callbacks import get_openai_callback

In [91]:
with get_openai_callback() as cb:
    result = algo_faq("Apakah program di Algoritma dapat diikuti orang awam?")
    print(f"Question: {result['query']}\nAnswer:\n{result['result']} \n")
    print("---- cb -----")
    print(cb)

Question: Apakah program di Algoritma dapat diikuti orang awam?
Answer:
 Ya, program di Algoritma dapat diikuti orang awam yang tidak memiliki latar belakang pendidikan IT sebelumnya. Algoritma akan mengajarkan dasar-dasar data science, seperti dasar-dasar pemrograman dan statistik praktis, sehingga program ini terbuka untuk siapa saja yang tidak memiliki latar belakang IT. Contohnya seperti Reyna Cheryl Sondakh dan Samuel Gema. 

---- cb -----
Tokens Used: 2910
	Prompt Tokens: 2786
	Completion Tokens: 124
Successful Requests: 1
Total Cost (USD): $0.0582
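
The reported cost is consistent with a flat price per token: 2,910 tokens at USD 0.02 per 1,000 tokens gives USD 0.0582. The rate below is inferred from the callback output above rather than taken from a price list, and real pricing depends on the model and changes over time, so treat `PRICE_PER_1K_TOKENS_USD` as an assumption.

```python
PRICE_PER_1K_TOKENS_USD = 0.02  # assumption inferred from the callback output


def estimate_cost(prompt_tokens, completion_tokens,
                  price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Estimate the cost of one request under a flat per-token price."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * price_per_1k


cost = estimate_cost(prompt_tokens=2786, completion_tokens=124)
print(f"${cost:.4f}")
```

Almost all of the tokens are prompt tokens, because the retrieved context is sent along with every question.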


Caching helps us reduce the cost of repeated queries. LangChain caching can be implemented in various ways, including:
- In Memory Cache
- SQLite Cache

#### In Memory Cache

In [92]:
# In memory cache
import langchain
from langchain.cache import InMemoryCache
langchain.llm_cache = InMemoryCache()

In [93]:
# Let's see caching impact
with get_openai_callback() as cb:
    result = algo_faq("Apakah program di Algoritma dapat diikuti orang awam?")
    print(f"Question: {result['query']}\nAnswer:\n{result['result']} \n")
    print("---- cb -----")
    print(cb)

Question: Apakah program di Algoritma dapat diikuti orang awam?
Answer:
 Ya, program di Algoritma dapat diikuti orang awam yang tidak memiliki latar belakang pendidikan IT sebelumnya. Kursus dasar-dasar data science, seperti dasar-dasar pemrograman dan statistik praktis, diajarkan sehingga sangat memungkinkan program ini terbuka untuk siapa saja yang tidak memiliki latar belakang IT. Contohnya seperti Reyna Cheryl Sondakh dan Samuel Gema. 

---- cb -----
Tokens Used: 2914
	Prompt Tokens: 2786
	Completion Tokens: 128
Successful Requests: 1
Total Cost (USD): $0.05828


In [94]:
with get_openai_callback() as cb2:
    result2 = algo_faq("Apakah program di Algoritma dapat diikuti orang awam?")
    print(f"Question: {result2['query']}\nAnswer:\n{result2['result']} \n")
    print("---- cb -----")
    print(cb2)

Question: Apakah program di Algoritma dapat diikuti orang awam?
Answer:
 Ya, program di Algoritma dapat diikuti orang awam yang tidak memiliki latar belakang pendidikan IT sebelumnya. Kursus dasar-dasar data science, seperti dasar-dasar pemrograman dan statistik praktis, diajarkan sehingga sangat memungkinkan program ini terbuka untuk siapa saja yang tidak memiliki latar belakang IT. Contohnya seperti Reyna Cheryl Sondakh dan Samuel Gema. 

---- cb -----
Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0


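The first call after enabling the cache still goes to the API and fills the cache. The second, identical query is answered from the cache, so it uses no tokens, costs nothing, and returns the same result as before.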

#### SQLite Cache

In [98]:
# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
langchain.llm_cache = SQLiteCache(database_path=".langchain.db")

In [99]:
with get_openai_callback() as cb:
    result3 = algo_faq("Apakah ada alternatif pembayaran selain cash dan juga credit untuk menjadi student Algoritma?")
    print(f"Question: {result3['query']}\nAnswer:\n{result3['result']} \n")
    print("---- cb -----")
    print(cb)

Question: Apakah ada alternatif pembayaran selain cash dan juga credit untuk menjadi student Algoritma?
Answer:
 Ya, ada tiga mitra pembiayaan yang dapat menawarkan paket pembayaran hingga 18 bulan kepada siswa yang memenuhi syarat. Ketiga platform tersebut adalah Danacita, Edufund, Koinworks. 

---- cb -----
Tokens Used: 2865
	Prompt Tokens: 2795
	Completion Tokens: 70
Successful Requests: 1
Total Cost (USD): $0.0573


In [100]:
with get_openai_callback() as cb2:
    result4 = algo_faq("Apakah ada alternatif pembayaran selain cash dan juga credit untuk menjadi student Algoritma?")
    print(f"Question: {result4['query']}\nAnswer:\n{result4['result']} \n")
    print("---- cb -----")
    print(cb2)

Question: Apakah ada alternatif pembayaran selain cash dan juga credit untuk menjadi student Algoritma?
Answer:
 Ya, ada tiga mitra pembiayaan yang dapat menawarkan paket pembayaran hingga 18 bulan kepada siswa yang memenuhi syarat. Ketiga platform tersebut adalah Danacita, Edufund, Koinworks. 

---- cb -----
Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0
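
The practical difference from the in-memory cache is persistence: an in-memory cache disappears when the Python process ends, while a SQLite cache lives in a file and can be reused after a restart. The sketch below illustrates this with the standard-library `sqlite3` module. The class `SQLitePromptCache`, its table layout and the file name are hypothetical and do not reflect the schema LangChain uses in `.langchain.db`.

```python
import os
import sqlite3
import tempfile


class SQLitePromptCache:
    """Minimal file-backed prompt -> response cache (illustration only)."""

    def __init__(self, database_path):
        self.conn = sqlite3.connect(database_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_cache "
            "(prompt TEXT PRIMARY KEY, response TEXT)"
        )
        self.conn.commit()

    def lookup(self, prompt):
        """Return the cached response, or None on a cache miss."""
        row = self.conn.execute(
            "SELECT response FROM llm_cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt, response):
        """Store or overwrite the response for a prompt."""
        self.conn.execute(
            "INSERT OR REPLACE INTO llm_cache (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()

    def close(self):
        self.conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "demo_cache.db")

cache = SQLitePromptCache(db_path)
cache.update("What payment options exist?", "cached answer")
cache.close()

# Reopen the same file, as a restarted notebook kernel would
reopened = SQLitePromptCache(db_path)
hit = reopened.lookup("What payment options exist?")
miss = reopened.lookup("A question never asked before")
reopened.close()
print(hit, miss)
```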


## Ethical Considerations

### A Language for LLM Prompt Design: Guidance