
The workflow can be described as follows:

1. The user poses a question.
2. A Google search is performed using the question.
3. The top-k search results, or the most relevant webpages, are downloaded.
4. Raw HTML data is transformed into a usable format by LangChain.
5. All documents are split into 1,000-character chunks.
6. Embeddings are computed for each document chunk and stored in a vector store (chromadb).
7. LangChain builds a prompt from the user's question (step 1) and the scraped web data.
8. An OpenAI model is queried to generate an answer.
9. The documents that contributed to the answer are identified and returned as references.

## Querying Google and scraping websites

First we need to install the required dependencies.

In [7]:
!pip3 install -U readabilipy langchain openai bs4 requests chromadb tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting readabilipy
  Downloading readabilipy-0.2.0-py3-none-any.whl (4.3 MB)
Collecting langchain
  Downloading langchain-0.0.192-py3-none-any.whl (989 kB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting requests
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)

In [8]:
import requests # Required to make HTTP requests
from bs4 import BeautifulSoup # Required to parse HTML
import numpy as np # Required to dedupe sites
from urllib.parse import unquote # Required to unquote URLs

In [9]:
query = 'tell me about legitt.xyz' # The query to search Google for and ask the AI about

In [10]:
response = requests.get("https://www.google.com/search", params={"q": query}) # Make the request; requests URL-encodes the query
soup = BeautifulSoup(response.text, "html.parser") # Parse the HTML
links = soup.find_all("a") # Find all the links in the HTML

In [11]:
# Hard-coded URLs for this run. The commented-out loop below builds the list from `links` instead, keeping only the links whose href starts with "/url?q="
urls = ['https://legitt.xyz/',
    'https://legitt.xyz/about-us',
    'https://legitt.xyz/blog',
    'https://legitt.xyz/pricing',
    'https://legitt.xyz/product-tour',
    'https://legitt.xyz/smart-contract',]

# urls = []
# for l in [link for link in links if link["href"].startswith("/url?q=")]:
#     # get the url
#     url = l["href"]
#     # remove the "/url?q=" part
#     url = url.replace("/url?q=", "")
#     # remove the part after the "&sa=..."
#     url = unquote(url.split("&sa=")[0])
#     # special case for google scholar
#     if url.startswith("https://scholar.google.com/scholar_url?url=http"):
#         url = url.replace("https://scholar.google.com/scholar_url?url=", "").split("&")[0]
#     elif 'google.com/' in url: # skip google links
#         continue
#     if url.endswith('.pdf'): # skip pdf links
#         continue
#     if '#' in url: # remove anchors (e.g. wikipedia.com/bob#history and wikipedia.com/bob#genetics are the same page)
#         url = url.split('#')[0]
#     # print the url
#     urls.append(url)

# Use numpy to dedupe the list of urls after removing anchors
urls = list(np.unique(urls))
urls

['https://legitt.xyz/',
 'https://legitt.xyz/about-us',
 'https://legitt.xyz/blog',
 'https://legitt.xyz/pricing',
 'https://legitt.xyz/product-tour',
 'https://legitt.xyz/smart-contract']
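
The commented-out loop above cleans Google's redirect links with string replacements. As a rough alternative sketch, the same idea can be expressed with the standard library's URL parser. The helper names `extract_target_url` and `dedupe_keep_order` and the `example.com` links are made up for illustration, and this version leaves out the Google Scholar and PDF special cases.

```python
from urllib.parse import parse_qs, urldefrag, urlsplit


def extract_target_url(href):
    """Return the destination of a Google "/url?q=..." redirect link, or None."""
    if not href.startswith("/url?"):
        return None
    # parse_qs decodes percent-escapes and separates q from sa, ved, etc.
    params = parse_qs(urlsplit(href).query)
    targets = params.get("q")
    if not targets:
        return None
    # urldefrag drops the "#anchor" part so anchored links collapse to one page
    url, _fragment = urldefrag(targets[0])
    return url


def dedupe_keep_order(urls):
    """Remove duplicates while keeping first-seen order (np.unique sorts instead)."""
    return list(dict.fromkeys(urls))


# Hypothetical hrefs in the shape Google's result page uses
hrefs = [
    "/url?q=https://example.com/page%3Fid%3D1&sa=U&ved=abc",
    "/url?q=https://example.com/wiki/Bob%23history&sa=U",
    "/url?q=https://example.com/wiki/Bob%23genetics&sa=U",
    "/search?q=unrelated",
]
cleaned = [u for u in map(extract_target_url, hrefs) if u is not None]
print(dedupe_keep_order(cleaned))
```

The two `Bob` links differ only in their anchor, so they collapse into a single entry, and the result keeps the order in which Google returned the links.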

In [12]:
from readabilipy import simple_json_from_html_string # Required to parse HTML to pure text
from langchain.schema import Document # Required to create a Document object

In [13]:
from typing import Optional # Required for the Optional[Document] return type

def scrape_and_parse(url: str) -> Optional[Document]:
    """Scrape a webpage and parse it into a Document object, or return None if nothing was parsed"""
    req = requests.get(url)
    article = simple_json_from_html_string(req.text, use_readability=True)
    # The following line seems to work with the package versions on my local machine, but not on Google Colab
    # return Document(page_content=article['plain_text'][0]['text'], metadata={'source': url, 'page_title': article['title']})
    if article is not None:
        # Join the text of every extracted paragraph into one page body
        return Document(page_content='\n\n'.join([a['text'] for a in article['plain_text']]), metadata={'source': url, 'page_title': article['title']})
    else:
        # Handle the case when article is None
        return None  # Or raise an exception, depending on your requirements

    

In [14]:
from langchain import schema

In [15]:
# It's possible to optimize this by using asyncio
try:
  documents = [scrape_and_parse(url) for url in urls] # Scrape and parse all the urls
except TypeError:
  print("documents not created!!!\n")
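
The comment above mentions asyncio as a possible optimization. Here is a minimal sketch of what that could look like, assuming Python 3.9+ for `asyncio.to_thread`. The names `scrape_all`, `fake_scrape`, and `max_concurrency` are hypothetical, and `fake_scrape` stands in for a real scraper so the example runs without network access.

```python
import asyncio


async def scrape_all(urls, scrape_fn, max_concurrency=5):
    """Run a blocking scrape function over all urls concurrently, keeping input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url):
        async with semaphore:
            try:
                # asyncio.to_thread runs the blocking call in a worker thread
                return await asyncio.to_thread(scrape_fn, url)
            except Exception as exc:
                # One failing page should not discard the other results
                print(f"skipping {url}: {exc!r}")
                return None

    results = await asyncio.gather(*(scrape_one(u) for u in urls))
    return [r for r in results if r is not None]


def fake_scrape(url):
    """Stand-in scraper (an assumption for this demo): fails on 'broken' urls."""
    if "broken" in url:
        raise ValueError("could not parse page")
    return f"document for {url}"


demo_urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
demo_documents = asyncio.run(scrape_all(demo_urls, fake_scrape))
print(demo_documents)
```

Inside a notebook an event loop is usually already running, so `asyncio.run` may raise a `RuntimeError` there. In that case `await scrape_all(urls, scrape_and_parse)` directly in a cell should do the same job. Handling errors per URL also means one bad page no longer prevents `documents` from being created.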

In [16]:
documents

[Document(page_content="How it works.Programmed to Perform.Legitt lets you create contracts that can be programmed to monitor data streams and trigger event-based actions.Watch VideoBe in Control.Search any document you are looking for with just a few keywords.Track all your contracts in a single visual dashboard with their current statuses.Create contracts by selecting from 100s of templates, or use AI Contract builder to create contracts from scratch.Upload existing contracts and convert them into smart contracts on Legitt platform.Check all your messages and stey updated.Track and manage overdue actions to minimise the impact of fines and penaltiesAlways be updated on the next action on your contracts. Never miss a deadline and avoid penalties and finesAll your documents, intelligently sorted with status and chronological order.We have customized notification by which you never miss an update or a deadline. Always keep track of important events and avoid penalties and fines.Write:Al

In [18]:
type(documents)
len(documents)

6

## Splitting documents into chunks


In [19]:
from langchain.text_splitter import CharacterTextSplitter

In [20]:
text_splitter = CharacterTextSplitter(separator=' ', chunk_size=1000, chunk_overlap=200)

In [21]:
texts = text_splitter.split_documents(documents)

In [22]:
len(texts)

235
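
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sliding-window sketch. It is not LangChain's implementation: as far as I understand, `CharacterTextSplitter` first splits on the separator and then merges pieces up to the size limit, so its chunks break at spaces rather than at fixed offsets. The function name `split_with_overlap` is made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2,500-character text, used only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.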

## Computing embeddings of chunks and storage in a vector store


In [23]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [24]:

OPENAI_API_KEY = "sk-****"

In [25]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [26]:
from langchain.vectorstores import Chroma

In [27]:
docsearch = Chroma.from_documents(texts, embeddings)

AuthenticationError: ignored
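
Once the chunks are embedded, answering a question starts with finding the stored vectors closest to the question's embedding. The sketch below shows that idea with cosine similarity on tiny hand-written vectors. It is only an illustration: real OpenAI embeddings have many more dimensions, and the distance metric Chroma actually uses depends on its configuration. The names `cosine_similarity` and `top_k_by_cosine` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vector, chunk_vectors, k=2):
    """Return the indices of the k chunk vectors most similar to the query, best first."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    ranked = sorted(range(len(chunk_vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings", one per chunk (made up for this demo)
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.6, 0.0]
print(top_k_by_cosine(toy_query, toy_chunk_vectors))  # [1, 0]
```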

## Configuring the model and asking questions

We can now pick which model to use and start asking questions!

In [None]:
from langchain.llms import OpenAIChat # Required to create a Language Model

In [None]:
# Pick an OpenAI model
llm = OpenAIChat(model_name='gpt-3.5-turbo', openai_api_key=OPENAI_API_KEY)



In [None]:
from langchain import VectorDBQA # Required to create a Question-Answer object using a vector store

In [None]:
import pprint # Required to pretty print the results

In [None]:
# Stuff all the information into a single prompt (see https://docs.langchain.com/docs/components/chains/index_related_chains#stuffing)
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "Who were the main players in the race to complete the human genome? And what were their approaches? Give as much detail as possible."
result = qa({"query": query})
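
Conceptually, the "stuff" chain type places all retrieved chunks into one prompt ahead of the question. A rough sketch of that idea follows. The function `build_stuffed_prompt`, the prompt wording, and the sample chunks are assumptions for illustration, not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, standing in for the vector store results
retrieved_chunks = [
    "  First retrieved chunk of scraped text.  ",
    "Second retrieved chunk of scraped text.",
]
prompt = build_stuffed_prompt("What does the page describe?", retrieved_chunks)
print(prompt)
```

Because everything goes into a single prompt, this approach only works while the retrieved chunks fit within the model's context window.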



In [None]:
pprint.pprint(result)

{'query': 'Who were the main players in the race to complete the human genome? '
          'And what were their approaches? Give as much detail as possible.',
 'result': 'The main players in the race to complete the human genome were the '
           'publicly funded Human Genome Project (HGP) and the privately '
           'funded Celera Corporation, led by J. Craig Venter. Their '
           'approaches differed in that the HGP was a large, collaborative '
           'international effort, while Celera focused on creating a '
           'proprietary database using advanced sequencing technology. The '
           'competition arose from the prospect of gaining control over '
           'potential patents on the genome sequence, which was considered '
           'valuable. However, the rivalry ended when Celera and the HGP '
           'joined forces, thus speeding completion of the rough draft '
           'sequence of the human genome. Collaborative efforts continued for '
          

In [None]:
[a.metadata['source'] for a in result['source_documents']] # List the source URL of each retrieved document

['https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project',
 'https://www.britannica.com/event/Human-Genome-Project',
 'https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml',
 'https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project']
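
The same page can appear more than once in this list when several of its chunks were retrieved. As a small sketch, the sources can be collapsed into a count per page. The helper name `count_sources` is made up, and the input is the list printed above.

```python
from collections import Counter


def count_sources(sources):
    """Return (url, number_of_chunks) pairs, most frequently retrieved page first."""
    return Counter(sources).most_common()


retrieved_sources = [
    'https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project',
    'https://www.britannica.com/event/Human-Genome-Project',
    'https://web.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml',
    'https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project',
]
for url, chunk_count in count_sources(retrieved_sources):
    print(chunk_count, url)
```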

In [None]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "How were the donor participants recruited for the human genome project? Summarize in three sentences."
result = qa({"query": query})

In [None]:
pprint.pprint(result)

{'query': 'How were the donor participants recruited for the human genome '
          'project? Summarize in three sentences.',
 'result': 'The International Human Genome Sequencing Consortium collected '
           'blood or sperm samples from many donors, with their identities '
           'protected to maintain anonymity. Only a few samples were used for '
           'DNA resources, and most of the sequence generated by the public '
           'HGP came from a single anonymous male donor from Buffalo, New '
           'York. Volunteers were recruited through a process of informed '
           'consent, with a 1997 newspaper advertisement from Buffalo seeking '
           'participants.',
 'source_documents': [Document(page_content='of the joint publications, press releases announced that the project had been completed by both groups. Improved drafts were announced in 2003 and 2005, filling in to approximately 92% of the sequence currently. Genome donors[edit] In the International Hu

In [None]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "What happened to Craig Venter's company following the completion of the human genome project? Give as much detail as possible."
result = qa({"query": query})

In [None]:
pprint.pprint(result)

{'query': "What happened to Craig Venter's company following the completion of "
          'the human genome project? Give as much detail as possible.',
 'result': 'Following the completion of the human genome project, Craig '
           "Venter's company, Celera Genomics, faced a decision on what type "
           'of company it would become. It added sequences from three '
           'different mouse strains to its database and briefly ventured into '
           'proteomics. However, Venter resigned as CEO in January 2002, and '
           'the company decided to focus on drug discovery rather than '
           'information. Despite being timed to coincide with the celebrations '
           'of the 50th anniversary of the Watson-Crick discovery of the '
           'double-helical structure of DNA, there was less fanfare '
           'surrounding the official date of completion of the HGP in April '
           '2003. Celera remained a threat, as the validity of the WGS '
           's

In [None]:
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)
query = "How come the project finished earlier then expected? Give as much detail as possible."
result = qa({"query": query})

In [None]:
pprint.pprint(result)

{'query': 'How come the project finished earlier then expected? Give as much '
          'detail as possible.',
 'result': 'The project finished earlier than expected due to a deliberate '
           'focus on technology development, improved sequencing technologies, '
           'and a change in approach to the finishing process. The original '
           "completion date was set for 2005, but the project's goals and "
           'related strategic plans were updated periodically throughout the '
           'project. The final completion date was moved forward to 2003 with '
           'a plan for a "working draft" of the human genome sequence by '
           'December 2001. The project ended up costing less than expected, at '
           "around $2.7 billion. Many of the project's achievements were "
           'beyond what scientists thought possible in 1988.',
 'source_documents': [Document(page_content='original goals for the Human Genome Project in 1988, which included sequencing