# bsoup 
- https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-findall/
- https://medium.com/ds3ucsd/web-scraping-in-15-steps-c8e295d53a9e


In [1]:
from bs4 import BeautifulSoup

In [2]:
html_doc = """
<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Link 1</a></li>
            <li><a href="http://scrapy.org">Link 2</a></li>
        </ul>
    </body>
</html>
"""

In [3]:
html_doc

'\n<html>\n    <body>\n        <h1>Hello, BeautifulSoup!</h1>\n        <ul>\n            <li><a href="http://example.com">Link 1</a></li>\n            <li><a href="http://scrapy.org">Link 2</a></li>\n        </ul>\n    </body>\n</html>\n'

In [4]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [6]:
## Find All <a> Tags
print(soup.find_all('a'))

[<a href="http://example.com">Link 1</a>, <a href="http://scrapy.org">Link 2</a>]


- .find_all() returns a list-like ResultSet of elements, so loop over it to extract the data you want from each element:

In [7]:
element_list = soup.find_all('a')
for element in element_list:
    print(element.get_text())

Link 1
Link 2
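
The same loop can collect more than one value per element. A minimal sketch, reusing the two links from the sample document; the names `demo_soup` and `links` are illustrative only:

```python
from bs4 import BeautifulSoup

demo_html = """
<ul>
    <li><a href="http://example.com">Link 1</a></li>
    <li><a href="http://scrapy.org">Link 2</a></li>
</ul>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Build (text, href) pairs from every <a> tag in the ResultSet.
links = [(a.get_text(), a.get("href")) for a in demo_soup.find_all("a")]
print(links)
```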


In [8]:
# To limit the number of results .find_all() returns, use the limit parameter:

In [10]:
soup.find_all('a', limit=1)

[<a href="http://example.com">Link 1</a>]

## FindAll By Class And Ids
- The .find_all() method can find elements by class name, id, or any other attribute (via the attrs parameter).
- The examples below use placeholder values; the sample document contains no <p> tags, so each query returns an empty list:
- tag type = p
- class name = class_name
- id = id_name
- attrs = aria-hidden="true"

In [11]:
## <p> Tag + Class Name
soup.find_all('p', class_='class_name')

[]

In [12]:
## <p> Tag + Id
soup.find_all('p', id='id_name')

[]

In [13]:
## <p> Tag + Any Attribute
soup.find_all('p', attrs={"aria-hidden": "true"})

[]
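
The three queries above come back empty because the sample document has no <p> tags. A minimal sketch with hypothetical markup (invented here so that each query has something to match):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> per query, plus one that should never match.
sample_html = """
<p class="class_name">By class</p>
<p id="id_name">By id</p>
<p aria-hidden="true">By attribute</p>
<p>No attributes</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_class = sample_soup.find_all("p", class_="class_name")
by_id = sample_soup.find_all("p", id="id_name")
by_attr = sample_soup.find_all("p", attrs={"aria-hidden": "true"})

print([p.get_text() for p in by_class])  # ['By class']
print([p.get_text() for p in by_id])     # ['By id']
print([p.get_text() for p in by_attr])   # ['By attribute']
```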

## Find by Text
-  Text = Link 1

In [14]:
## Strings that exactly match 'Link 1'
soup.find_all(string="Link 1")

['Link 1']

In [15]:
import re

## Strings that contain 'Link'
soup.find_all(string=re.compile("Link"))

['Link 1', 'Link 2']
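
Searching with string= returns the matching text nodes, not the tags that contain them. A short sketch, assuming you want the enclosing <a> tag: each result's .parent gives it back. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

text_html = (
    '<ul><li><a href="http://example.com">Link 1</a></li>'
    '<li><a href="http://scrapy.org">Link 2</a></li></ul>'
)
text_soup = BeautifulSoup(text_html, "html.parser")

# Each match is a text node; .parent is the tag wrapping that text.
matches = text_soup.find_all(string=re.compile("Link"))
hrefs = [match.parent["href"] for match in matches]
print(hrefs)  # ['http://example.com', 'http://scrapy.org']
```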

## Multiple Criteria
- To match elements on several attributes at once, pass them all in the attrs dictionary; an element must satisfy every entry to match:

In [16]:
## <p> Tag + Class Name & Id
soup.find_all('p', attrs={"class": "class_name", "id": "id_name"})

[]
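
A small sketch of the same idea against hypothetical markup, showing that only the element satisfying every condition is returned. Passing the attributes as keyword arguments is an equivalent spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <p> has both the class and the id.
multi_html = """
<p class="class_name" id="id_name">Both</p>
<p class="class_name">Class only</p>
<p id="id_name_2">Other id</p>
"""
multi_soup = BeautifulSoup(multi_html, "html.parser")

via_attrs = multi_soup.find_all("p", attrs={"class": "class_name", "id": "id_name"})
via_kwargs = multi_soup.find_all("p", class_="class_name", id="id_name")

print([p.get_text() for p in via_attrs])   # ['Both']
print([p.get_text() for p in via_kwargs])  # ['Both']
```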

## FindAll Using Regex


In [17]:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body


In [18]:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
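
The pattern is applied with a search, so an unanchored pattern matches anywhere in the tag name. A regex can also filter attribute values. A minimal sketch with illustrative variable names:

```python
import re
from bs4 import BeautifulSoup

regex_html = (
    '<h1>Hello</h1><ul><li><a href="http://example.com">Link 1</a></li>'
    '<li><a href="http://scrapy.org">Link 2</a></li></ul>'
)
regex_soup = BeautifulSoup(regex_html, "html.parser")

# Unanchored: any tag name containing "l".
contains_l = [tag.name for tag in regex_soup.find_all(re.compile("l"))]
# Anchored: only tag names starting with "l".
starts_with_l = [tag.name for tag in regex_soup.find_all(re.compile("^l"))]
# Regex on an attribute value instead of the tag name.
scrapy_links = regex_soup.find_all("a", href=re.compile(r"scrapy\.org"))

print(contains_l)     # ['ul', 'li', 'li']
print(starts_with_l)  # ['li', 'li']
print([a.get_text() for a in scrapy_links])  # ['Link 2']
```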


## FindAll Using Custom Functions

In [19]:
def custom_selector(tag):
	# Return "span" tags with a class name of "target_span"
	return tag.name == "span" and tag.has_attr("class") and "target_span" in tag.get("class")

soup.find_all(custom_selector)


[]
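
The selector above returns nothing because the sample document has no <span> tags. A sketch of the same function against hypothetical markup that does contain a matching span:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <span> carries the target class.
span_html = (
    '<div><span class="target_span extra">Match</span>'
    '<span class="other">Skip</span><span>No class</span></div>'
)
span_soup = BeautifulSoup(span_html, "html.parser")

def span_selector(tag):
    # "class" is multi-valued, so tag.get returns a list of class names.
    return tag.name == "span" and "target_span" in tag.get("class", [])

found = span_soup.find_all(span_selector)
print([tag.get_text() for tag in found])  # ['Match']
```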

## Example2
- https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-web-scraping/

In [20]:
html_doc2 = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The first paragraph</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [23]:
soup2 = BeautifulSoup(html_doc2, 'html.parser')


In [24]:
print(soup2.find('title'))

<title>The Dormouse's story</title>


In [26]:
print(soup2.find('title').get_text())

The Dormouse's story


In [27]:
print(soup2.find('p'))

<p class="title"><b>The first paragraph</b></p>


In [28]:
print(soup2.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [122]:
h1b = soup2.find_all(["p", "b"])
for element in h1b:
  print(element)

<p class="title"><b>The first paragraph</b></p>
<b>The first paragraph</b>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>


## From websites
- Pass response.content rather than response.text to BeautifulSoup, since response.text can sometimes lead to character encoding issues.
- The .content attribute holds the raw bytes, so BeautifulSoup can detect the encoding itself (for example from a <meta charset> declaration) instead of relying on the encoding Requests chose when it built .text.
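
A minimal offline sketch of that behaviour. The bytes below stand in for response.content and are made up for illustration; no request is made.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: a page encoded as ISO-8859-1 that declares it.
raw_bytes = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>café</p></body></html>"
).encode("iso-8859-1")

bytes_soup = BeautifulSoup(raw_bytes, "html.parser")

print(bytes_soup.p.get_text())       # café
print(bytes_soup.original_encoding)  # the encoding BeautifulSoup settled on
```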

In [31]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup3 = BeautifulSoup(response.content, 'html.parser')

## Getting HTML Data From File

In [36]:
from bs4 import BeautifulSoup

with open("index1.html") as fp:
    soup4 = BeautifulSoup(fp, 'html.parser')

In [37]:
print(soup4.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [38]:
print(soup4.find('title'))

<title>The Dormouse's story</title>


## Querying With Python Object Attributes
- BeautifulSoup converts the HTML into a tree of Python objects, so we can reach into that tree with attribute access (soup.h1.a) and read a tag's attributes with dictionary-style indexing (tag['href']).
- Here are some examples of querying the DOM tree of QuotesToScrape.com with object attributes:

In [39]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup5 = BeautifulSoup(response.content, 'html.parser')

In [41]:
print(soup5.find('title'))

<title>Quotes to Scrape</title>


In [42]:
print(soup5.h1)

<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>


In [43]:
print(soup5.h2)

<h2>Top Ten tags</h2>


In [44]:
print(soup5.h2.text)

Top Ten tags


In [46]:
print(soup5.h1.a)

<a href="/" style="text-decoration: none">Quotes to Scrape</a>


In [47]:
print(soup5.h1.a.string)

Quotes to Scrape


In [48]:
print(soup5.h1.a['href'])

/


- This approach works, but it has two limitations:
    - It only returns the first element that matches.
    - It can't express complex queries, such as finding all div tags where class='quotes'.
- As a result, it is recommended to use BeautifulSoup's .find() and .find_all() methods, or CSS selectors via .select().
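
A short sketch of the difference, using hypothetical markup: attribute access stops at the first <div>, while .find_all() can filter by class and return every match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three <div> tags, two of which share a class.
nav_html = (
    '<div class="quote">First</div>'
    '<div class="other">Second</div>'
    '<div class="quote">Third</div>'
)
nav_soup = BeautifulSoup(nav_html, "html.parser")

first_only = nav_soup.div.get_text()
all_quotes = [d.get_text() for d in nav_soup.find_all("div", class_="quote")]

print(first_only)  # First
print(all_quotes)  # ['First', 'Third']
```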

In [50]:
print(soup5.find('h1'))

<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>


In [51]:
print(soup5.find('h1').get_text())


Quotes to Scrape



In [52]:
print(soup5.find('h1').find('a').get('href'))

/


- That all looks similar to querying with object attributes; however, .find() also supports more complex queries, such as searching by class, id, and other element attributes.
- With .find() you can write queries where two or more conditions must be satisfied:

In [53]:
## <p> Tag + Class Name
soup5.find('p', class_='class_name')

## Querying With CSS Selectors
- BeautifulSoup provides a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements.
- The SoupSieve documentation lists all the currently supported CSS selectors; here are some of the most commonly used:
    - .classes
    - #ids
    - [attributes=value]
    - parent child
    - parent > child
    - sibling ~ sibling
    - sibling + sibling
    - :not(element.class, element2.class)
    - :is(element.class, element2.class)
    - parent:has(> child)


In [55]:
response = requests.get('https://quotes.toscrape.com/')
soup5 = BeautifulSoup(response.content, 'html.parser')

In [56]:
print(soup5.select('h1 a')[0].get_text())

Quotes to Scrape


In [57]:
## Find All Quotes
print(soup5.select('span.text'))

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</spa

- .select() Returns List
- The .select() method returns a list of elements, so when only looking for 1 element you need to take the first element ([0]) from the list.

In [70]:
print(soup5.select('h2')[0].get_text())
#select always returns a list, so [0]

Top Ten tags
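
When only one element is needed, .select_one() is an alternative to indexing with [0]: it returns the first match, or None when nothing matches, so there is no IndexError to guard against. A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

one_soup = BeautifulSoup("<h2>Top Ten tags</h2><h2>Second</h2>", "html.parser")

first_h2 = one_soup.select_one("h2")  # first matching element
missing = one_soup.select_one("h3")   # None, because no <h3> exists

print(first_h2.get_text())  # Top Ten tags
print(missing)              # None
```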


## Aria Labels

In [71]:
import requests
from bs4 import BeautifulSoup

In [88]:
url = 'https://mjl.clarivate.com/search-results'

In [89]:
response = requests.get(url)
soup6 = BeautifulSoup(response.text, "html.parser")

In [96]:
response.text[0:100]

'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n  <base href="/"/>\n  <meta charset="UTF-8"/>\n  <meta name="'

In [93]:
soup6.find('a')

In [97]:
a_tags = soup6.select('a[aria-label]')

In [98]:
len(a_tags)

0

In [99]:
for tag in a_tags:
    print(tag.text.strip())

In [100]:
a_tags = soup6.find_all('a', attrs={"aria-label": True})

In [101]:
a_tags = soup6.find_all(lambda tag: tag.name == "a" and "aria-label" in tag.attrs)

In [110]:
html = '''
<div id ="119">
<span class="span" aria-label="4 people reacted to this post" role="button"></span>
</div>
'''

In [111]:
soup8 = BeautifulSoup(html, 'html.parser')

In [112]:
#find Span
f = soup8.find('span')
f

<span aria-label="4 people reacted to this post" class="span" role="button"></span>

In [113]:
#Get aria-label attribute of Span
print(f['aria-label'])

4 people reacted to this post


In [114]:
print(f['role'])

button


In [118]:
soup8b=BeautifulSoup(html,"lxml")
soup8b.find("span")['aria-label']

'4 people reacted to this post'

In [119]:
soup8b.find("div", attrs ={"id":"119"})

<div id="119">
<span aria-label="4 people reacted to this post" class="span" role="button"></span>
</div>

In [120]:
soup8b.find("div", attrs ={"id":"119"}).find("span")['aria-label']

'4 people reacted to this post'
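
The live request earlier in this section matched no <a aria-label> tags (the length was 0). One possible reason, which is an assumption and not verified here, is that the page builds its content with JavaScript, which requests does not run. The queries themselves can be checked offline against hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one labelled link, one plain link, one labelled span.
aria_html = (
    '<a href="/a" aria-label="Open menu">Menu</a>'
    '<a href="/b">Plain</a>'
    '<span aria-label="4 people reacted to this post" role="button"></span>'
)
aria_soup = BeautifulSoup(aria_html, "html.parser")

via_css = aria_soup.select("a[aria-label]")
via_attrs = aria_soup.find_all("a", attrs={"aria-label": True})
all_labels = [tag["aria-label"] for tag in aria_soup.select("[aria-label]")]

print([a.get_text() for a in via_css])    # ['Menu']
print([a.get_text() for a in via_attrs])  # ['Menu']
print(all_labels)  # ['Open menu', '4 people reacted to this post']
```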

## XPath in BS
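- BeautifulSoup itself does not support XPath expressions.
- Use lxml for XPath: convert the soup (or part of it) to a string, parse it with lxml's etree, and run the XPath query on that.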

In [123]:
import requests
from bs4 import BeautifulSoup
from lxml import etree

In [124]:
response = requests.get("https://www.scrapingbee.com/blog/")
soup9 = BeautifulSoup(response.content, 'html.parser')
body = soup9.find("body")

In [125]:
dom = etree.HTML(str(body)) # Parse the HTML content of the page
xpath_str = '//*[@id="content"]/section/div/div[1]/h1' # The XPath expression for the blog's title
print(dom.xpath(xpath_str)[0].text)

The ScrapingBee Blog
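
The same round trip can be tried offline. A minimal sketch, assuming lxml is installed; the markup and the XPath expression are made up for illustration and do not reflect the real page structure.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup standing in for a downloaded page.
xp_html = (
    '<html><body><div id="content"><section>'
    "<h1>Blog title</h1></section></div></body></html>"
)
xp_soup = BeautifulSoup(xp_html, "html.parser")

# Serialise the soup back to a string, then let lxml parse it for XPath.
xp_dom = etree.HTML(str(xp_soup))
titles = xp_dom.xpath('//*[@id="content"]/section/h1/text()')
print(titles)  # ['Blog title']
```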


In [134]:
#All Links
links = soup9.find_all("a") # Find all elements with the tag <a>
for link in links:
  print("Link:", link.get("href"), "Text:", link.string)

Link: / Text: None
Link: https://app.scrapingbee.com/account/login Text: Login
Link: https://app.scrapingbee.com/account/register Text: Sign Up
Link: /#pricing Text: Pricing
Link: /#faq Text: FAQ
Link: /blog/ Text: Blog
Link: # Text: Other Features
Link: /features/ai-web-scraping-api/ Text: AI Web Scraping
Link: /features/screenshot/ Text: Screenshots
Link: /features/google/ Text: Google search API
Link: /features/data-extraction/ Text: Data extraction
Link: /features/javascript-scenario/ Text: JavaScript scenario
Link: /features/make/ Text: No code web scraping
Link: # Text: Developers
Link: /tutorials Text: Tutorials
Link: /documentation/ Text: Documentation
Link: https://help.scrapingbee.com/en/ Text: Knowledge Base
Link: /blog/web-scraping-without-getting-blocked/ Text: None
Link: /blog/web-scraping-101-with-python/ Text: None
Link: /blog/web-scraping-javascript/ Text: None
Link: /blog/web-scraping-r/ Text: None
Link: /blog/web-scraping-c++/ Text: None
Link: /blog/web-scraping-csha

## Sibling in BS
- find_previous_sibling to find the single previous sibling
- find_next_sibling to find the single next sibling
- find_next_siblings to find all the next siblings
- find_previous_siblings to find all previous siblings
- find_all_next and find_all_previous are broader: they return every matching element after or before the current one in document order, not only siblings. In the flat example below they would give the same result as the sibling methods.

In [126]:
html_content = '''
<p>First paragraph</p>
<p>Second Paragraph</p>
<p id="main">Main Paragraph</p>
<p>Fourth Paragraph</p>
<p>Fifth Pragaraph</p>
'''

In [127]:
soup10 = BeautifulSoup(html_content, 'html.parser')

In [130]:
main_element = soup10.find("p", attrs={"id": "main"})
print(main_element)

<p id="main">Main Paragraph</p>


In [129]:
# Find the previous sibling:
print(main_element.find_previous_sibling())

<p>Second Paragraph</p>


In [131]:
# Find the next sibling:
print(main_element.find_next_sibling())

# Find all next siblings:
print(main_element.find_next_siblings())

# Find all previous siblings (closest first):
print(main_element.find_previous_siblings())


<p>Fourth Paragraph</p>
[<p>Fourth Paragraph</p>, <p>Fifth Pragaraph</p>]
[<p>Second Paragraph</p>, <p>First paragraph</p>]
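
With flat markup like the example above, find_all_next and find_next_siblings happen to return the same elements. A sketch with hypothetical nested markup shows where they differ: find_all_next walks everything that follows in document order, including descendants and elements outside the parent.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup: the main <p> has a child and a sibling,
# and another <p> sits outside the enclosing <div>.
sib_html = """
<div>
<p>First</p>
<p id="main">Main <b>bold</b></p>
<p>Third</p>
</div>
<p>Outside</p>
"""
sib_soup = BeautifulSoup(sib_html, "html.parser")
main = sib_soup.find("p", id="main")

siblings_after = [p.get_text() for p in main.find_next_siblings("p")]
everything_after = [p.get_text() for p in main.find_all_next("p")]
all_tag_names = [tag.name for tag in main.find_all_next()]

print(siblings_after)    # ['Third']
print(everything_after)  # ['Third', 'Outside']
print(all_tag_names)     # ['b', 'p', 'p']
```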



## Tables
-  We can parse a table's content with BeautifulSoup by finding all <tr> elements, and finding their <td> or <th> children.

In [135]:
response = requests.get("https://demo.scrapingbee.com/table_content.html")
soup11 = BeautifulSoup(response.content, 'html.parser')

In [138]:
data = []
table = soup11.find('table')
table_body = table.find('tbody')
#table_body

In [139]:
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all(['td', 'th'])
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

In [140]:
print(data)

[['SYMBOL', 'NAME', 'PRICE', 'CHANGE', '%CHANGE'], ['AMD', 'Advanced Micro Devices Inc', '89.48', '-5.34', '-5.63'], ['ADBE', 'Adobe Inc.', '378.07', '-15.76', '-4'], ['ABNB', 'Airbnb Inc', '99.91', '-9.01', '-8.27'], ['ALGN', 'Align Technology Inc', '247.75', '-9.3', '-3.62'], ['AMZN', 'Amazon.com Inc', '103.87', '-5.78', '-5.27'], ['AMGN', 'Amgen Inc', '237.7', '-2.31', '-0.96'], ['AEP', 'American Electric Power Company Inc', '95.24', '-3.02', '-3.07'], ['ADI', 'Analog Devices Inc', '150.32', '-6.6', '-4.21'], ['ANSS', 'ANSYS Inc', '232.21', '-9.66', '-3.99'], ['AAPL', 'Apple Inc', '133.98', '-3.15', '-2.3'], ['AMAT', 'Applied Materials Inc', '96.48', '-5.4', '-5.3'], ['ASML', 'ASML Holding NV', '497.48', '-24.05', '-4.61'], ['TEAM', 'Atlassian Corporation PLC', '167.46', '-16.35', '-8.9'], ['ADSK', 'Autodesk Inc', '175.43', '-11.65', '-6.23'], ['ATVI', 'Activision Blizzard Inc', '75.39', '-1.09', '-1.43'], ['ADP', 'Automatic Data Processing Inc', '208', '-3.62', '-1.71'], ['AZN', 'A

In [141]:
import pandas as pd

In [143]:
df = pd.DataFrame(data[1:], columns=data[0])
df.head()

| | SYMBOL | NAME | PRICE | CHANGE | %CHANGE |
|---|---|---|---|---|---|
| 0 | AMD | Advanced Micro Devices Inc | 89.48 | -5.34 | -5.63 |
| 1 | ADBE | Adobe Inc. | 378.07 | -15.76 | -4.0 |
| 2 | ABNB | Airbnb Inc | 99.91 | -9.01 | -8.27 |
| 3 | ALGN | Align Technology Inc | 247.75 | -9.3 | -3.62 |
| 4 | AMZN | Amazon.com Inc | 103.87 | -5.78 | -5.27 |
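
The row-by-row approach in this section can also be tried without a network request. A minimal sketch on a hypothetical two-row table (values borrowed from the output above), pairing each <td> with its <th> header to build one dictionary per row:

```python
from bs4 import BeautifulSoup

# Hypothetical table; not the markup of the demo page.
table_html = """
<table>
  <thead><tr><th>SYMBOL</th><th>PRICE</th></tr></thead>
  <tbody>
    <tr><td>AMD</td><td>89.48</td></tr>
    <tr><td>ADBE</td><td>378.07</td></tr>
  </tbody>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")
table_rows = table_soup.find("table").find_all("tr")

# First row holds the headers; the remaining rows hold the data cells.
header = [cell.get_text(strip=True) for cell in table_rows[0].find_all("th")]
records = [
    dict(zip(header, (cell.get_text(strip=True) for cell in row.find_all("td"))))
    for row in table_rows[1:]
]
print(records)
```

Cell values come back as strings; convert them to numbers separately if you need arithmetic on them.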
