# BeautifulSoup Tutorial

In [2]:
# importing and getting the URL

from bs4 import BeautifulSoup
import requests
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title.text)

Online Tutorials Library


In [None]:
# Print the href of every link (<a> tag) found on the web page

for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
# Pass the document either as an open file handle or as a string.

from bs4 import BeautifulSoup
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")

In [12]:
# Fragments are wrapped into a full HTML document, and a bare "&" is converted to the "&amp;" entity on output

import bs4
html = '''<b>tutorialspoint</b>, <i>&web scraping &data science;</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)

<html><body><b>tutorialspoint</b>, <i>&amp;web scraping &amp;data science;</i></body></html>


##### Beautiful Soup converts a complex HTML page into a tree of Python objects. The four main kinds are:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment


### 1. Tag Objects
Example: `<input>, <h1>, <div>, etc.`

In [19]:
# Check the type and name of a tag

soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
tag = soup.html
print(f"The type of tag is-> {type(tag)} \nThe name of the tag is-> {tag.name}")

The type of tag is-> <class 'bs4.element.Tag'> 
The name of the tag is-> html


In [20]:
# If we change the tag name, the change is reflected in the HTML markup generated by Beautiful Soup

tag.name = "Strong"
print(tag)
print(tag.name)

<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>
Strong


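<b>`Attributes (tag.attrs):`</b> A tag can carry any number of attributes, each a name with a value. `tag.attrs` exposes them as a dictionary, and individual attributes can be read or set by indexing the tag like a dictionary.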

In [21]:
# To know the class of a tag

tutorialsP = BeautifulSoup("<div class='tutorialsP'></div>",'lxml')
tag2 = tutorialsP.div
tag2['class']

['tutorialsP']

In [24]:
# We can do all kinds of modifications to our tag’s attributes (add/remove/modify).

tag2['class'] = 'Online-Learning'
tag2['style'] = '2007'
print(tag2)

del tag2['style']
print(tag2)

print(tag2['class'])

# print(tag2['style'])  # raises KeyError because the attribute was deleted above

<div class="Online-Learning" style="2007"></div>
<div class="Online-Learning"></div>
Online-Learning
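
Because indexing a missing attribute raises `KeyError`, it can help to read attributes defensively. The following is a minimal sketch of that idea; the markup and the `"no-style"` default are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup('<div class="tutorialsP" id="main"></div>', "html.parser")
tag = soup.div

# .attrs is a dictionary mapping attribute names to values
print(tag.attrs)

# .get() returns None (or a supplied default) instead of raising KeyError
style = tag.get("style")
fallback = tag.get("style", "no-style")

# has_attr() checks for an attribute without reading it
has_id = tag.has_attr("id")
print(style, fallback, has_id)
```

Using `.get()` mirrors how a missing key is handled in an ordinary Python dictionary.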


<b>`Multi-valued attributes`</b>

In [26]:
# A multi-valued attribute (here 'class') is returned as a list of its values.

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
css_soup.p['class']

['body', 'bold']

In [29]:
# 'id' is not a multi-valued attribute, so its value stays a single string even if it contains a space.

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print(id_soup.p['id'])
type(id_soup.p['id'])

body bold


str
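
When code needs to treat every attribute uniformly, `get_attribute_list()` always returns a list, whether or not the attribute is multi-valued. A small sketch, using made-up markup that puts the same value in both `class` and `id`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same value in a multi-valued and a single-valued attribute
soup = BeautifulSoup('<p class="body bold" id="body bold"></p>', "html.parser")
p = soup.p

class_value = p["class"]               # split on whitespace into a list
id_value = p["id"]                     # kept as one string
id_list = p.get_attribute_list("id")   # always a list, here with one item

print(class_value, id_value, id_list)
```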

### 2. Navigable String
A `NavigableString` holds the text written inside a tag.

In [3]:
# To access the text inside a tag, use ".string".

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'lxml')
soup.string

'Hello, Tutorialspoint!'

In [4]:
# the Data-Type 
type(soup.string)

bs4.element.NavigableString

In [5]:
# You can replace the string with another one using replace_with(), but you can't edit the existing string in place.

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'lxml')
soup.string.replace_with("Online Learning!")
soup

<html><body><h2 id="message">Online Learning!</h2></body></html>
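
A `NavigableString` is a subclass of Python's `str`, so the usual string methods work on it. The sketch below also converts it with `str()` to get a plain string that no longer refers to the parse tree; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", "html.parser")
text = soup.h2.string

# NavigableString behaves like str but stays linked to the parse tree
is_str = isinstance(text, str)

# str() gives a plain Python string detached from the tree
plain = str(text)

# Ordinary string methods work and return new values; the tree is unchanged
upper = text.upper()
print(is_str, type(plain).__name__, upper)
```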

### 3. Beautiful Soup
The `BeautifulSoup` object is created when we parse a web resource. It represents the complete document we are trying to scrape, and for most purposes it can be treated like a `Tag`.

In [6]:
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'lxml')
type(soup)

bs4.BeautifulSoup

In [7]:
soup.name

'[document]'

### 4. Comments
The `Comment` object represents a comment in the web document. It is just a special type of `NavigableString`.

In [8]:
soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>', 'lxml')
comment = soup.p.string
type(comment)

bs4.element.Comment

In [10]:
# prettify() indents the markup so that it is easier to read.

print(soup.p.prettify())

<p>
 <!-- Everything inside it is COMMENTS -->
</p>
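
Since comments are strings of a distinct type, they can be collected with an `isinstance` check. The following sketch assumes a small made-up document and uses the `string=` filter with a function:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup containing two comments
html = "<p>Visible text<!-- hidden note --></p><div><!-- another --></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the strings that are Comment objects
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Comment is a str subclass, so strip() works directly
cleaned = [c.strip() for c in comments]
print(cleaned)
```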


## Navigation By Tags

### 1. Going down

In [12]:
html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="link2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="link3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="link4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

In [13]:
soup.head

<head><title>Tutorials Point</title></head>

In [14]:
soup.title

<title>Tutorials Point</title>

In [15]:
soup.body.b

<b>The Biggest Online Tutorials Library, It's all Free</b>

In [16]:
# Using a tag name as an attribute will give you only the first tag by that name.

soup.a

<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>

In [18]:
# To find all the tags with a given name, use the "find_all()" method.

soup.find_all("a")

[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
 <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
 <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
 <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
 <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]

## .contents

In [33]:
Htag = soup.head
Htag.contents

[<title>Tutorials Point</title>]

In [23]:
Ttag = Htag.contents[0]
Ttag

<title>Tutorials Point</title>

In [24]:
Ttag.contents

['Tutorials Point']

In [27]:
len(soup.contents)

2

In [28]:
soup.contents[1].name

'html'

In [29]:
# A string does not have .contents, because it can’t contain anything.

text = Ttag.contents[0]
text.contents

AttributeError: 'NavigableString' object has no attribute 'contents'

## .children
Instead of getting them as a list, use the <b>`.children`</b> generator to iterate over a tag's direct children.

In [31]:
for child in Ttag.children:
    print(child)

Tutorials Point


## .descendants
The <b>`.descendants`</b> attribute lets you iterate over all of a tag's children recursively: its direct children, the children of its direct children, and so on.

In [35]:
for child in Htag.descendants:
    print(child)

<title>Tutorials Point</title>
Tutorials Point


In [37]:
len(list(soup.children))

2

In [40]:
len(list(soup.descendants))

33
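
The gap between the two counts is easier to see on a tiny document. This sketch uses made-up markup in which a `<div>` has two direct children but six descendants, because strings and nested tags are counted too:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> children, one of which contains a nested <b>
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

children = list(div.children)        # only the two <p> tags
descendants = list(div.descendants)  # <p>, 'one ', <b>, 'two', <p>, 'three'

print(len(children), len(descendants))
for node in descendants:
    print(repr(node))
```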

## .strings and .stripped_strings
If there’s more than one thing inside a tag, you can still look at just the strings. 

In [42]:
Htag.string # without "s"

'Tutorials Point'

In [41]:
for string in soup.strings: # with "s"
    print(repr(string))

'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are:\n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\nas per online survey.'
'\n'
'Programming Languages'
'\n'


In [45]:
# Use "stripped_strings" to skip whitespace-only strings and strip leading and trailing whitespace from the rest.

for string in soup.stripped_strings:
    print(repr(string))

'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
';\nas per online survey.'
'Programming Languages'
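
When the goal is a single block of text rather than individual strings, `get_text()` with a separator and `strip=True` joins the stripped strings for you. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with uneven whitespace around the text
soup = BeautifulSoup("<p> Top languages: <a>Java</a>,\n<a>Python</a> </p>", "html.parser")

# The individual strings, stripped, with whitespace-only strings skipped
pieces = list(soup.stripped_strings)

# The same strings joined with a single space
joined = soup.get_text(" ", strip=True)

print(pieces)
print(joined)
```

Note that the separator is inserted between every pair of strings, so punctuation that sat in its own string ends up surrounded by spaces.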


### 2. Going up

## .parent

In [48]:
Ttag = soup.title
print(Ttag)
print()
print(Ttag.parent)

<title>Tutorials Point</title>

<head><title>Tutorials Point</title></head>


In [49]:
print(Ttag.string)
Ttag.string.parent

Tutorials Point


<title>Tutorials Point</title>

In [51]:
# The document itself has no parent, so its .parent is None
print(soup.parent)

None


## .parents
To iterate over all of an element's parents, use the <b>`.parents`</b> attribute.

In [53]:
link = soup.a
print(link)

for parent in link.parents:
    if parent is None:
        print("End",parent)
    else:
        print(parent.name)

<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
p
body
html
[document]
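
One practical use of `.parents` is building a readable path to an element. The sketch below is an illustration on made-up markup; the `" > "` separator is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single link nested three levels deep
soup = BeautifulSoup("<html><body><p><a href='#'>Java</a></p></body></html>", "html.parser")
link = soup.a

# .parents yields the nearest parent first and the document object last
names = [parent.name for parent in link.parents]

# Drop the '[document]' entry and reverse to read from the root downwards
path = " > ".join(reversed(names[:-1]))

print(names)
print(path)
```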


### 3. Going sideways

In [54]:
sibling_soup = BeautifulSoup("<a><b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c></a>", 'lxml')
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    TutorialsPoint
   </b>
   <c>
    <strong>
     The Biggest Online Tutorials Library, It's all Free
    </strong>
   </c>
  </a>
 </body>
</html>


In the document above, the `<b>` and `<c>` tags are at the same level: both are direct children of the same `<a>` tag, so they are siblings.

## .next_sibling and .previous_sibling
Use <b>`.next_sibling`</b> and <b>`.previous_sibling`</b> to navigate between page elements that are on the same level of the parse tree:

In [55]:
sibling_soup.b.next_sibling

<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>

In [56]:
sibling_soup.c.previous_sibling

<b>TutorialsPoint</b>

In [57]:
# If no sibling is found

print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.next_sibling)

None
None


## .next_siblings and .previous_siblings
To iterate over a tag's siblings, use <b>`.next_siblings`</b> and <b>`.previous_siblings`</b>.

In [58]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
' and\n'
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
';\nas per online survey.'


In [60]:
# Here soup.find(id="link3") locates the tag with that id, and the loop walks over its previous siblings.

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
'Top 5 most used Programming Languages are:\n'
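
As the output shows, the text between tags counts as a sibling too. If only the tags are of interest, one option is to filter the iterator by type. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: three links separated by plain text
html = '<p>Languages: <a id="link1">Java</a>, <a id="link2">C</a> and <a id="link3">Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

# Every following sibling, strings included
all_siblings = list(first.next_siblings)

# Only the siblings that are tags
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(len(all_siblings))
print([t["id"] for t in tag_siblings])
```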


### 4. Going back and forth

## .next_element and .previous_element
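The `.next_element` attribute of a tag or string points to whatever was parsed immediately afterwards. It can look similar to `.next_sibling`, but the two are not the same: for a tag, `.next_element` is usually its first child (such as the string inside it), whereas `.next_sibling` is the next node at the same level.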

In [62]:
soup


<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a> and
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
</body></html>

In [61]:
# Here is what "next_sibling" will return.

last_a_tag = soup.find("a", id="link5")
print(last_a_tag)
last_a_tag.next_sibling

<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>


';\nas per online survey.'

In [63]:
# Here is what "next_element" will return.

last_a_tag.next_element

'C'

The `.previous_element` attribute is the exact opposite of `.next_element`: it points to whatever element was parsed immediately before this one.

In [64]:
last_a_tag.previous_element

' and\n'

In [65]:
last_a_tag.previous_element.next_element

<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>

## .next_elements and .previous_elements
These iterators walk forward or backward through the document from a given element, in the order in which it was parsed.

In [67]:
for element in last_a_tag.next_elements:
    print(repr(element))

'C'
';\nas per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'


## Searching the tree using 'find()' and 'find_all()'
Below are the different types of filters.

### 1. A string
Pass a string to a search method and Beautiful Soup matches it against that exact tag name.

In [68]:
markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')
markup.find_all('p')

[<p>Top Three</p>, <p></p>, <p><b>Java, Python, Cplusplus</b></p>]

### 2. Regular Expression
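Pass a compiled regular expression and Beautiful Soup matches it against tag names using its `search()` method. The pattern `^p` finds every tag whose name starts with "p".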

In [69]:
import re
markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>', 'lxml')
markup.find_all(re.compile('^p'))

[<p>Top Three</p>,
 <p></p>,
 <pre>Programming Languages are:</pre>,
 <p><b>Java, Python, Cplusplus</b></p>]
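
Because the pattern is applied with `search()`, an unanchored expression matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern on a small made-up document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with a handful of differently named tags
html = "<html><head><title>T</title></head><body><p>x</p><pre>y</pre></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only names that begin with "p"
starts_with_p = [t.name for t in soup.find_all(re.compile("^p"))]

# Unanchored: any name that contains a "t"
contains_t = [t.name for t in soup.find_all(re.compile("t"))]

print(starts_with_p)
print(contains_t)
```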

### 3. List
You can pass multiple tags to find by providing a list. 

In [70]:
markup.find_all(['pre', 'b'])

[<pre>Programming Languages are:</pre>, <b>Java, Python, Cplusplus</b>]

### 4. True
True will return all tags that it can find, but no strings on their own.

In [71]:
markup.find_all(True)

[<html><body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body></html>,
 <body><p>Top Three</p><p></p><pre>Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p></body>,
 <p>Top Three</p>,
 <p></p>,
 <pre>Programming Languages are:</pre>,
 <p><b>Java, Python, Cplusplus</b></p>,
 <b>Java, Python, Cplusplus</b>]

## find_all()
`find_all(name, attrs, recursive, string, limit, **kwargs)`
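
The parameters in this signature can be tried out offline. The following sketch uses a small made-up document (the ids, classes and hrefs are illustrative) and exercises `name`, `attrs`, `recursive`, `string`, `limit` and a keyword filter in turn:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used to exercise each find_all() parameter
html = """
<div id="langs">
  <a class="prog" href="/java">Java</a>
  <a class="prog" href="/c">C</a>
  <a class="other" href="/about">About</a>
  <p><a class="prog" href="/python">Python</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="langs")

# Keyword filter: class_ is used because "class" is a reserved word in Python
by_class = div.find_all("a", class_="prog")

# attrs: a dictionary of attribute filters
by_attrs = div.find_all("a", attrs={"href": "/c"})

# recursive=False: only direct children of the div are considered
direct = div.find_all("a", recursive=False)

# limit: stop after this many matches
limited = div.find_all("a", limit=2)

# string: match on text instead of tags
by_string = div.find_all(string="Python")

print([a.text for a in by_class])
print([a.text for a in direct])
```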

In [72]:
url="https://www.imdb.com/chart/top/?ref_=nv_mv_250"
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')

In [73]:
print(soup.find('title'))

<title>Top 250 Movies - IMDb</title>


In [74]:
for heading in soup.find_all('h3'):
   print(heading.text)

IMDb Charts
You Have Seen
Top Rated Movies by Genre
Recently Viewed


<b>All the filters that work with `find_all()` also work with `find()` and the other search methods, such as `find_parents()` or `find_next_siblings()`.</b>

## find()
`find(name, attrs, recursive, string, **kwargs)`

In [75]:
soup.find_all('title',limit=1)

[<title>Top 250 Movies - IMDb</title>]

In [77]:
soup.find('title')

<title>Top 250 Movies - IMDb</title>

In [80]:
print(soup.find_all('h2'))

[]


In [81]:
print(soup.find('h2'))

None
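
The difference matters in practice: an empty list can be looped over safely, but calling a method on `None` raises `AttributeError`. A minimal sketch of guarding a `find()` result, with a made-up document and placeholder text:

```python
from bs4 import BeautifulSoup

# Hypothetical document that has a <title> but no <h2>
soup = BeautifulSoup("<title>Top 250 Movies</title>", "html.parser")

missing_list = soup.find_all("h2")  # empty list when nothing matches
missing_one = soup.find("h2")       # None when nothing matches

# Check for None before using the result of find()
heading = soup.find("h2")
heading_text = heading.get_text(strip=True) if heading is not None else "(no h2 found)"

print(missing_list, missing_one, heading_text)
```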


## find_parents() and find_parent()
Syntax: `find_parents(name, attrs, string, limit, **kwargs)`

Syntax: `find_parent(name, attrs, string, **kwargs)`

In [82]:
a_string = soup.find(string="The Godfather")
a_string

'The Godfather'

In [83]:
a_string.find_parents('a')

[<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>]

In [84]:
a_string.find_parent('a')

<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>

In [85]:
a_string.find_parent('tr')

<tr>
<td class="posterColumn">
<span data-value="2" name="rk"></span>
<span data-value="9.156531607201476" name="ir"></span>
<span data-value="6.93792E10" name="us"></span>
<span data-value="1828964" name="nv"></span>
<span data-value="-1.843468392798524" name="ur"></span>
<a href="/title/tt0068646/"> <img alt="The Godfather" height="67" src="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY67_CR1,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
<span class="secondaryInfo">(1972)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.2 based on 1,828,964 user ratings">9.2</strong>
</td>
<td class="ratingColumn">
<div class="seen-widget seen-widget-tt0068646 pending" data-titleid="tt0068646">
<div class="boundary">
<div class="popover">
<span class="delete"> </span

In [None]:
Syntaxes:
    
find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

Where,

find_next_siblings() and find_next_sibling() search the siblings that come after the current element; the first returns every match, the second only the first match.

find_previous_siblings() and find_previous_sibling() search the siblings that come before the current element.

find_all_next() and find_next() search all the tags and strings that come after the current element in the document.

find_all_previous() and find_previous() search all the tags and strings that come before the current element in the document.
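
These methods can be tried on a small offline document. The sketch below uses made-up markup and starts every search from the middle list item:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list between two paragraphs
html = '<p>Intro</p><ul><li id="a">Java</li><li id="b">C</li><li id="c">Python</li></ul><p>Outro</p>'
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="b")

# Siblings only: nodes that share the same parent <ul>
next_one = middle.find_next_sibling("li")
prev_all = middle.find_previous_siblings("li")

# Whole document: anything parsed after or before the current element
after = middle.find_all_next("p")
before = middle.find_previous("p")

print(next_one["id"], [li["id"] for li in prev_all])
print([p.text for p in after], before.text)
```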

## CSS selectors

In [86]:
soup.select('title')

[<title>Top 250 Movies - IMDb</title>,
 <title>Top 250 Movies</title>,
 <title>IMDb, an Amazon company</title>]

In [87]:
soup.select("p:nth-of-type(1)")

[<p>The Top Rated Movie list only includes feature films.</p>,
 <p class="imdb-footer__copyright footer__copyright">© 1990-<!-- -->2022<!-- --> by IMDb.com, Inc.</p>]

In [89]:
len(soup.select("p:nth-of-type(1)"))

2

In [90]:
len(soup.select("a"))

592

In [91]:
len(soup.select("p"))

2

In [92]:
soup.select("html head title")

[<title>Top 250 Movies - IMDb</title>, <title>Top 250 Movies</title>]

In [93]:
soup.select("head > title")

[<title>Top 250 Movies - IMDb</title>]

In [95]:
# Select every <li> element that is the tenth <li> among its siblings

soup.select("li:nth-of-type(10)")

[<li class="subnav_item_main">
 <a href="/search/title?genres=film_noir&amp;sort=user_rating,desc&amp;title_type=feature&amp;num_votes=25000,"> Film-Noir
 </a> </li>]
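
Since the live page changes over time, the same selectors can be tried on a fixed document. The following is a sketch on made-up markup that contrasts the descendant combinator (a space) with the child combinator (`>`), and shows `select_one()` for a single match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a second <title> nested inside an <svg>
html = """
<html><head><title>Page title</title></head>
<body>
<svg><title>Icon title</title></svg>
<ul>
  <li class="item">one</li>
  <li class="item special">two</li>
  <li class="item">three</li>
</ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

all_titles = soup.select("title")          # every <title> in the document
head_titles = soup.select("head > title")  # only direct children of <head>
body_titles = soup.select("body title")    # any <title> nested inside <body>

second_li = soup.select("li:nth-of-type(2)")  # the second <li> among its siblings
special = soup.select_one("li.special")       # first match, or None

print(len(all_titles), len(head_titles))
print(second_li[0].text, special.text)
```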