# Text Parsing & Web Scraping Using LangChain

## Lab Description:

In this lab, participants will learn how to perform web scraping using LangChain and process the extracted content with an LLM. We start by fetching text from a single webpage and generating a structured response. Then, we extend the process to iteratively scrape all links within a webpage. We then demonstrate how to extract and process text from a PDF embedded in a webpage. Finally, we explore the LangChain Wikipedia API wrapper to efficiently retrieve structured data from Wikipedia.

## Lab Objectives:
### After Completing the Lab, Participants will be able to:

- Extract text from a single webpage and process it using an LLM to generate a structured and readable summary.
- Iteratively scrape all links within a webpage and extract their content dynamically.
- Scrape content from a PDF embedded in a webpage, demonstrating how to handle different document formats.
- Use LangChain’s Wikipedia API wrapper to extract structured information from Wikipedia pages efficiently.

## What is Web Scraping ?

Web scraping is the automated process of extracting data from websites using scripts or bots. It typically involves fetching a webpage’s HTML and parsing specific content, such as text, images, or links, to gather desired information. Web scraping is an important process for collection of large text data for many applications like LLM Training.



## Importing the Libraries

In [1]:
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from IPython.display import Markdown, display
import nest_asyncio
import os

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Loading the LLM

We use **LLaMA 3.1:8b** model as the LLM for this lab.

In [2]:
# Load the LLaMA 3.1 model with 8 billion parameters using the Ollama library
llm = Ollama(model='llama3.1:8b', base_url="http://10.79.253.112:11434")

# Apply nest_asyncio to allow asynchronous tasks to run in the notebook environment
nest_asyncio.apply()


  llm = Ollama(model='llama3.1:8b', base_url="http://10.79.253.112:11434")


## Scraping from a single web page

Scraping from a single webpage is a straightforward task. We simply provide the links that need to be scraped to LangChain's `AsyncChromiumLoader`.
Chromium is one of the browsers supported by Playwright, a library used for browser automation. AsyncChromiumLoader uses an instance of Chromium in headless mode (which means the browser runs without displaying its graphical user interface). Essentially, it allows us to load webpages without needing to open a browser.

BeautifulSoup is a library used to parse HTML content. It converts raw HTML into a structured tree, making navigation and manipulation easier.

Once we obtain the HTML content of a webpage using `AsyncChromiumLoader`, we can pass it to the `BeautifulSoupTransformer`. Since BeautifulSoup simplifies working with HTML, it provides functionalities for extracting specific tags (such as `<span>`) from the document. We use the `tags_to_extract` parameter for this purpose.

`docs_transformed` is a list of Document objects with associated metadata and page_content. Since we are primarily interested in the `page_content`, we can extract only that and store it in the document variable.

<div style="text-align: center;">
    <img src="flow.png" alt="flow" width="780" height="620">
</div>

In [3]:
# Load the CNN webpage in headless mode using AsyncChromiumLoader
loader = AsyncChromiumLoader(["https://edition.cnn.com/"], headless=True)

# Load the HTML content from the page
html = loader.load()

# Initialize BeautifulSoupTransformer for HTML processing
bs_transformer = BeautifulSoupTransformer()

# Extract text from <span> tags in the HTML document
docs_transformed = bs_transformer.transform_documents(html, tags_to_extract=["span"])

# Get the first 1000 characters of the extracted content
document = docs_transformed[0].page_content[0:1000]


USER_AGENT environment variable not set, consider setting it to identify your requests.


In [4]:
document

'No   More Your CNN account Sign in to your CNN account Your CNN account Sign in to your CNN account  Follow CNN US-China trade talks Superbug threat Israeli airstrike on central Gaza school compound Trump-Carney Tesla sales plunge T. rex ancestors Frequent flyer millionaire  • Live Updates Live Updates Live Updates Pakistan says it downed 5 Indian fighter jets. India hasn’t confirmed losses, says it struck ‘terrorist infrastructure’ after Kashmir tourist massacre. Full report India launches strikes in Pakistan, Islamabad claims 5 Indian jets downed Military escalation between India and Pakistan spirals. Here’s what we know In pictures Scenes from Kashmir and Pakistan Indian PM Modi’s right-hand man ’proud of our armed forces‘ Video India conducts deepest strikes into Pakistan in more than 50 years 2:23 Trump’s team finally meeting with China. The future of the global economy is riding on its success India just agreed a massive trade deal – but it’s not with the US First boats carrying

Now that we have the text content, its time to pass it to the LLM. The LLM can perform various tasks on the data. We can prompt the LLM to extract headlines from a news website and ask it to summarize it for us. 

In [5]:
#Prompt / Instructions for the LLM
prompt = PromptTemplate.from_template(
    """You are provided with the HTML content {content} of a news article webpage, which has been transformed using BeautifulSoup. The content includes multiple tech articles, each consisting of a headline and a corresponding body of text. Your task is to:

        1. Identify the article headlines: Locate the headlines of the articles from the structured HTML.
        2. Summarize the articles: After identifying a headline, find the corresponding body of the article that follows it, and summarize the key information from the article.
        3. Respond in a structured format: Present the results in an organized and easy-to-read format with clear distinctions between each article.
        
        Output Requirements:

        For each article, follow this structure:

        Headline: Extract the headline of the article.
        Summary: Summarize the corresponding article that follows the headline. Keep the summary concise, highlighting the most important points, including any key events, dates, or important figures mentioned.
        
        Ensure the response is easy to understand, using short paragraphs or bullet points when needed.
        
        Ignore any advertisements, sidebars, or unrelated content that may also be present on the page.

        Ignore all the headers and other html elements, only focus on meaningful text. 
        
        You should only extract content that is directly related to the news articles (headlines and the following article text).
        
        Keep your language simple and easy to follow to ensure clarity."""
)

#Building the chain
chain = prompt | llm | StrOutputParser()

#Getting the response from the LLM
response = chain.invoke({"content":document})

#Displaying the response
display(Markdown(response))

Here are the results of extracting the headlines and summarizing the corresponding articles:

**Article 1: US-China trade talks**

* **Headline:** Trump’s team finally meeting with China. The future of the global economy is riding on its success
* **Summary:** The US and China are holding crucial trade talks, which will determine the fate of the global economy. This development comes after India and Pakistan's military escalation.

**Article 2: US-India-Pakistan Conflict**

* **Headline:** Pakistan says it downed 5 Indian fighter jets. India hasn’t confirmed losses, says it struck ‘terrorist infrastructure’ after Kashmir tourist massacre.
* **Summary:**
	+ Pakistan claims to have shot down 5 Indian fighter jets in a military escalation between the two nations.
	+ India has not confirmed any losses but claims to have targeted "terrorist infrastructure" following a recent attack on tourists in Kashmir.
	+ The conflict is ongoing, with both countries accusing each other of aggression.

**Article 3: Tesla Sales**

* **Headline:** Trump-Carney Tesla sales plunge
* **Summary:** Due to the ongoing trade talks between the US and China, Tesla's sales have reportedly declined. However, it is unclear how this decline will affect the company in the long term.

**Article 4: Superbug Threat**

* **No corresponding headline found**
	+ (Assuming there was a mistake in parsing the HTML content or that the relevant article was not provided)

**Article 5: Israeli Airstrike**

* **Headline:** Israeli airstrike on central Gaza school compound
* **Summary:** Israel carried out an airstrike on a school compound in central Gaza, sparking concerns over civilian casualties.

**Article 6: India Launches Strikes**

* **Headline:** India launches strikes in Pakistan, Islamabad claims 5 Indian jets downed
* **Summary:**
	+ India has launched military strikes against targets in Pakistan.
	+ Pakistan claims to have shot down 5 Indian fighter jets during the operation.
	+ The conflict between the two nations continues to escalate.

**Article 7: Frequent Flyer Millionaire**

* **No corresponding headline found**
	+ (Assuming there was a mistake in parsing the HTML content or that the relevant article was not provided)

Note: Articles 4 and 7 do not have corresponding headlines, which might indicate an error in parsing the HTML content.

The model fetches the important headlines, and summarizes it for us. All the data is scraped from the news website !

## Scraping by iteratively fetching all the links in a given webpage

So the expected question is, what if there are multiple links in a single webpage ? Apparently 90% of all websites today will have hyperlinks inside them. What if we want to fetch data from all these links ? or say, some of these links that are of interest to us ?. 

We can do this with BeautifulSoup and a little HTML knowledge. 



## Fetching all the links within a website

First we get the content of the website from which we need to fetch all the links. This can be done using python's built in requests library. Then we use BeautifulSoup to parse the HTML content.

Once we have the parsed data, we find all the <a href> tags in the document. In HTML, The <a href> tag specifies the URL or location that the link points to. 

Now that we have all the <a href> tags, we can start to create URLS from them. 

If the text inside <a href> tag starts with "https://" we can understand that it is a completelty new URL and needs no further processing, we can add it to our url list. 

What if it starts with a "/" ? In this case we will have to append this text to our base url. 

Let us see an example:

Suppose we have a dummy website. https://www.dummy.com

It has an <a href> tag that looks like this: <a href = "/dummy_example/example.pdf">..</a>

So we will have to add the "/dummy_example/example.pdf" to our base url, that is, "https://www.dummy.com", to get something like "https://www.dummy.com/dummy_example/example.pdf". 

It is then this url that we add to the url list.

1. **Imports and Initialization**:
   - `requests` is used to handle HTTP requests, and `BeautifulSoup` from `bs4` is used for parsing HTML content.
   - An empty list `url` is initialized to store the extracted URLs.

2. **Fetching and Parsing HTML**:
   - The base URL of the website is defined as `site`.
   - An HTTP GET request fetches the HTML content, which is then parsed using `BeautifulSoup` with the `"html.parser"` option for efficient HTML parsing.

3. **Extracting Links**:
   - A loop iterates over each anchor tag (`<a>`) found on the page.
   - For each anchor, the `href` attribute is checked to identify the URL format:
     - If it starts with `"https://"`, it is considered an absolute URL and added directly to the `url` list.
     - If it starts with `"/"`, it is treated as a relative URL and appended to the base site URL to form a complete link.
     - Other formats are handled by appending them to the base URL with appropriate formatting.
   
This process collects all valid URLs, both absolute and relative, and stores them in the `url` list for further use.


In [6]:
import requests
from bs4 import BeautifulSoup

#initialize the url list
url = []   

#the base url
site = "https://docs.ezmeral.hpe.com/unified-analytics/15/"

#get the content from the base url
r = requests.get(site)
#parsing using beautifulsoup
s = BeautifulSoup(r.text, "html.parser")

#iteratively fetch all the urls.
for i in s.find_all("a"):
    href = i.attrs.get('href')

    if href.startswith('https://'):
        url.append(href)
    elif href.startswith('/'):
        if site.endswith('/'):
            site = site[:-1]
        url.append(site + href)
        
    else:
        if site.endswith('/'):
            url.append(site + href)
        else:
            url.append(site + '/' + href)    


url



['https://docs.ezmeral.hpe.com/unified-analytics/15/#content',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/index.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/GetStarted/get-started.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/ManageClusters/administration.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/Security/security.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/DataEngineering/data-engineering.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/DataAnalytics/data-analytics.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/DataScience/data-science.html',
 'https://partner.hpe.com/',
 'https://support.hpe.com/',
 'https://developer.hpe.com/',
 'https://community.hpe.com/t5/HPE-Ezmeral-Software-platform/bd-p/ezmeral-software-platform/',
 'https://www.hpe.com/psnow/doc/a00109800enw',
 'https://www.hpe.com/us/en/legal/privacy.html',
 'https://docs.ezmeral.hpe.com/unified-analytics/15/./glossary/glossary.html']

1. **Loading HTML Asynchronously**:
   - `AsyncHtmlLoader` is used to load HTML content from the specified `url`.
   - The `load()` method is called to retrieve the HTML content, storing it in the `html` variable.

2. **Transforming HTML Content with BeautifulSoup**:
   - A `BeautifulSoupTransformer` instance, `bs_transformer`, is created to handle HTML transformation.
   - `transform_documents()` processes the `html` content, using BeautifulSoup to extract specified HTML tags.
   - The `tags_to_extract` argument defines which HTML tags to extract—in this case, all `<div>` tags.

This setup allows for asynchronous HTML loading and selective extraction of specified elements, making it efficient for targeted data retrieval.


In [7]:
loader = AsyncHtmlLoader(url[0:8])
html = loader.load()

bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(html, tags_to_extract=["div"])


Fetching pages: 100%|##########| 8/8 [00:00<00:00, 18.61it/s]


In [12]:
prompt = PromptTemplate.from_template(
    """You are a helpful chat assistant. Given a chunk of text which is scraped text from an HTML document {doc} transformed using beautiful soup.
       Identify headings of key topics from the text and summarize them in a structured format. Avoid any ads or other HTML content that might
       be present. Only focus on relevant content that might be present."""
)

chain = prompt | llm | StrOutputParser()
doc = docs_transformed[1].page_content

response = chain.invoke({"doc" : doc})

display(Markdown(response))


**HPE Ezmeral Unified Analytics Software Documentation**

**Table of Contents**

1. **Installation and Service Activation**
	* Upgrading HPE Ezmeral Unified Analytics Software
2. **Cluster Management**
	* Expanding the Cluster
	* Importing Frameworks and Managing the Application Lifecycle
3. **Connectivity and Storage**
	* Connecting to External S3 Object Stores
	* Connecting to HPE Ezmeral Data Fabric
	* Connecting to HPE GreenLake for File Storage
4. **Configuration and Troubleshooting**
	* Configuring Endpoints
	* GPU Support
	* GPU Resource Management
	* Troubleshooting
5. **Product Information**
	* Product Version and Lifecycle Support
	* Support Matrix
6. **Release Notes**
	* Release Notes (1.5.0)
	* Release Notes (1.5.2)
7. **Licensing and Support**
	* Term Licensing
8. **Additional Resources**
	* Partners
	* Support
	* Dev-Hub
	* Community
	* Training
	* ALA
	* Privacy Policy
	* Glossary

## Scraping From PDF document Inside a Website

This code downloads a PDF document from a specified URL, extracts the text content from it, and then uses a language model to generate a summary. Here’s a step-by-step breakdown:

1. **Import Libraries**: 
   - `requests` for handling the download of the PDF from the URL.
   - `PyPDF2` for reading and extracting text from the PDF.

2. **Download PDF**:
   - Define the PDF URL (`site`), then use `requests.get` to download it.
   - Save the content to a local file named "doc.pdf" in binary write mode (`wb`).

3. **Extract Text from PDF**:
   - Open "doc.pdf" in binary read mode (`rb`), create a `PdfReader` object to read the file, and initialize an empty string `text`.
   - Loop through each page of the PDF, extract the text, and concatenate it to `text`.

4. **Prompt Creation and Model Interaction**:
   - Create a prompt template with `PromptTemplate.from_template` to generate a summarization request, inserting the extracted text (`text`) into the `{doc}` placeholder.
   - Pass the prompt through a chain of processes, where `llm` represents the language model and `StrOutputParser` parses the model's output.
   - The final output, `response`, is a summarized paragraph based on the content extracted from the PDF.

5. **Display the Summary**:
   - Use `display(Markdown(response))` to show the summary in a formatted Markdown output cell.

In [6]:
import requests
import PyPDF2

site = "https://arxiv.org/pdf/1706.03762"

r = requests.get(site)

with open("doc.pdf", "wb") as file:
    file.write(r.content)

with open("doc.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        text += page.extract_text()

prompt = PromptTemplate.from_template(
    """Given a text chunk extracted from a pdf document {doc}, summarize the content of the pdf into a single paragraph"""
)

chain = prompt | llm | StrOutputParser()

response = chain.invoke({"doc":text})

display(Markdown(response))


The provided PDF discusses various papers and research on neural machine translation systems, specifically focusing on Google's system and its ability to bridge the gap between human and machine translation. The document references several studies, including "Deep recurrent models with fast-forward connections for neural machine translation" by Jie Zhou et al. and "Fast and accurate shift-reduce constituent parsing" by Muhua Zhu et al. It also presents visualizations of attention mechanisms in a neural machine translation system, highlighting the ability of some attention heads to follow long-distance dependencies and perform anaphora resolution (i.e., resolving pronouns to their corresponding antecedents). The figures show the attentions for different words or groups of words, demonstrating how the model's attention mechanism can be used to understand complex sentence structures.

## Wikipedia API Wrapper

To use this, ensure the `wikipedia` Python package is installed. This wrapper utilizes the Wikipedia API to perform searches and retrieve page summaries, typically returning summaries of the `top-k` results. It also restricts document content by setting a maximum character limit (`doc_content_chars_max`).

In [13]:
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper

wiki = WikipediaAPIWrapper(top_k_results=2, doc_content_char_max=500)

print(wiki.run(query="hp enterprises"))

Page: Hewlett Packard Enterprise
Summary: The Hewlett Packard Enterprise Company (HPE) is an American multinational information technology company based in Spring, Texas.
HPE was founded on November 1, 2015, in Palo Alto, California, as part of the splitting of the Hewlett-Packard company. It is a business-focused organization which works in servers, storage, networking, containerization software and consulting and support.
The split was structured so that the former Hewlett-Packard Company would change its name to HP Inc. and spin off Hewlett Packard Enterprise as a newly created company. HP Inc. retained the old HP's personal computer and printing business, as well as its stock-price history and original NYSE ticker symbol for Hewlett-Packard; Enterprise trades under its own ticker symbol: HPE. At the time of the spin-off, HPE's revenue was slightly less than that of HP Inc.
In 2017, HPE spun off its Enterprise Services business and merged it with Computer Sciences Corporation to bec

<div style="text-align: left;">
    <img src="logo.png" alt="flow" width="150" height="100">
</div>