# What is bs4?
Beautiful Soup (imported as `bs4`) is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the parsed page, which you can navigate and search to extract data, making it useful for web scraping.

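## Install bs4

The package is published on PyPI as `beautifulsoup4`; install it with `pip install beautifulsoup4`.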


## Import bs4

In [2]:
from bs4 import BeautifulSoup

## Make a soup object out of a website
"html.parser" is Python's built-in parser and one of several parsers we could use. Others, such as "lxml" and "html5lib", must be installed separately and differ in speed and in how they handle malformed markup.


In [2]:
import requests

# The HTTP request ('URL' is a placeholder for the page address)
webpage = requests.get('URL')

# Turn the page content into a soup object; the parser name is passed here
soup = BeautifulSoup(webpage.content, 'html.parser')
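
To make the parser choice concrete, here is a minimal sketch that feeds the same malformed snippet to each parser named above. The markup string and the `results` dictionary are illustrative assumptions, and `lxml` and `html5lib` are only tried if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: a closing </p> with no opening <p>.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
# html.parser produces <a></a>; the other parsers may repair the markup differently.
```

Because each parser repairs broken markup in its own way, it is usually safer to name the parser explicitly so the result does not depend on what is installed.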


## Object Types

soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', 'html.parser')

print(soup.div)         # the first <div> tag on the page
print(soup.div.name)    # div
print(soup.div.attrs)   # {'id': 'example'}
print(soup.div.string)  # An example div (a NavigableString: the text inside the tag)

In [26]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [27]:
soup = BeautifulSoup(html_doc, 'html.parser')

### Ways to navigate the parsed document

In [5]:
soup.title

<title>The Dormouse's story</title>

In [6]:
soup.title.name

'title'

In [7]:
soup.title.string

"The Dormouse's story"

In [8]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [9]:
soup.p['class']

['title']

In [10]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [11]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [12]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [13]:
for link in soup.find_all('a'):   # loop over every <a> tag
    print(link.get('href'))       # print the URL stored in its href attribute

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
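
The loop above uses `link.get('href')` rather than `link['href']`. The following sketch, using a made-up one-tag document and hypothetical variable names, shows the difference: dictionary-style access raises `KeyError` for a missing attribute, while `.get()` returns `None` or a default.

```python
from bs4 import BeautifulSoup

# A single link without an href attribute (illustrative markup).
demo = BeautifulSoup('<a class="sister external" id="link1">Elsie</a>', 'html.parser')
link = demo.a

# .get() returns None (or the supplied default) when the attribute is absent.
missing = link.get('href')
fallback = link.get('href', '#')

# Dictionary-style access raises KeyError instead.
try:
    link['href']
    raised = False
except KeyError:
    raised = True

# Multi-valued attributes such as class come back as a list.
classes = link['class']

print(missing, fallback, raised, classes)
```

When scraping real pages, where some links lack an `href`, `.get()` avoids having to wrap every lookup in a `try` block.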


In [14]:
print(soup.get_text())   # print all the text in the document, without tags

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
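
`get_text()` also accepts a separator and a `strip` flag, which helps when the raw text contains a lot of stray whitespace. A small sketch on an assumed one-line document:

```python
from bs4 import BeautifulSoup

# Illustrative markup with text split across nested tags.
demo = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

# Default: text nodes are concatenated exactly as they appear.
plain = demo.get_text()

# With a separator and strip=True, each text node is stripped and then joined.
piped = demo.get_text('|', strip=True)

print(repr(plain))   # 'Hello world!'
print(repr(piped))   # 'Hello|world|!'
```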



### Common objects to deal with: Tag, NavigableString, BeautifulSoup

In [17]:
soup = BeautifulSoup('<b class="Boy">Smart</b>', 'html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

In [18]:
tag.name              # every tag has a name

'b'

In [19]:
tag = BeautifulSoup('<b id="Boy">Smart</b>', 'html.parser').b
tag['id']             # a tag's attributes can be read like dictionary keys

'Boy'

In [20]:
tag.attrs             # all attributes as a dictionary

{'id': 'Boy'}

### NavigableString
A `NavigableString` is the text contained inside a tag.

In [21]:
soup = BeautifulSoup('<b id="Boy">Smart</b>', 'html.parser')
tag = soup.b
tag.string

'Smart'

In [23]:
tag.string.replace_with("quite bad")
tag

<b id="Boy">quite bad</b>
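
As a quick check of how the three object types relate, the sketch below inspects them with `isinstance`. The variable names are illustrative; the markup is the same one-tag example used above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

demo = BeautifulSoup('<b id="Boy">Smart</b>', 'html.parser')
b_tag = demo.b
text = b_tag.string

# The BeautifulSoup object represents the whole document and behaves like a Tag.
soup_is_tag = isinstance(demo, Tag)

# Elements are Tag objects; the text inside them is a NavigableString.
tag_is_tag = isinstance(b_tag, Tag)
text_is_navigable = isinstance(text, NavigableString)

# NavigableString subclasses str, so ordinary string methods work on it.
text_is_str = isinstance(text, str)
upper = text.upper()

print(soup_is_tag, tag_is_tag, text_is_navigable, text_is_str, upper)
```

Calling `str(text)` gives a plain Python string if you want to keep the text without a reference to the parse tree.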

### .contents and .children

In [29]:
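# Rebuild the soup from html_doc, because `soup` was reassigned in the cells above
soup = BeautifulSoup(html_doc, 'html.parser')
head_tag = soup.head
print(head_tag)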

print(head_tag.contents)


title_tag = head_tag.contents[0]

print(title_tag)

print(title_tag.contents)

<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
["The Dormouse's story"]


Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

In [33]:
for child in title_tag.children:
    print(child)

The Dormouse's story
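
The sketch below contrasts `.contents` with `.children` on a small assumed list, and goes one step beyond the text above by also showing `.descendants`, which walks nested tags and their strings as well as the direct children.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a list with two items.
demo = BeautifulSoup('<ul><li>a</li><li>b</li></ul>', 'html.parser')
ul = demo.ul

# .contents is a plain list of the direct children.
contents_is_list = isinstance(ul.contents, list)

# .children iterates over the same direct children without building a list first.
children = [str(child) for child in ul.children]

# .descendants also visits the children of children (here, the text in each <li>).
descendants = [str(node) for node in ul.descendants]

print(children)      # ['<li>a</li>', '<li>b</li>']
print(descendants)   # ['<li>a</li>', 'a', '<li>b</li>', 'b']
```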


### .strings and .stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. The .strings generator yields every piece of text in the document, in order, including the whitespace-only strings such as '\n' that sit between tags:

In [34]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

In [37]:
for string in soup.stripped_strings:
    print(repr(string))
# str() and repr() both return a string representation of an object; repr() is used here so that quotes and escapes such as \n stay visible.
# Strings consisting entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed.

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


### .find_all()

In [49]:
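# All tags with a given name
print(soup.find_all("title"))

# A string as the second argument is matched against the CSS class
print(soup.find_all("p", "title"))

print(soup.find_all("a"))

# Attributes can also be given as a dictionary
print(soup.find_all("a", {'class': 'sister'}))

# Keyword arguments filter on attributes of that name
print(soup.find_all(id="link2"))
print(soup.find_all(attrs={"id": "link1"}))

# limit stops the search after that many matches
print(soup.find_all("a", limit=1))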

[<title>The Dormouse's story</title>]
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
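
Besides tag names and attribute values, `find_all()` accepts other kinds of filters. The sketch below, on a shortened assumed version of the example document, shows the `class_` keyword, a list of names, a regular expression, and a function.

```python
import re
from bs4 import BeautifulSoup

# Shortened illustrative markup based on the example document.
markup = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<b>bold</b>
</p>'''
demo = BeautifulSoup(markup, 'html.parser')

# class_ (with a trailing underscore) searches by CSS class.
by_class = demo.find_all(class_='sister')

# A list matches any of the given tag names.
by_list = demo.find_all(['a', 'b'])

# A compiled regular expression is matched against the attribute value.
by_regex = demo.find_all(href=re.compile('elsie'))

# A function receives each tag and returns True for the ones to keep.
by_func = demo.find_all(lambda t: t.name == 'a' and t.get('id', '').endswith('2'))

print(len(by_class), len(by_list), by_regex[0]['id'], by_func[0].string)
```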


### .find()
find_all() returns a list of every match, while find() returns only the first match, or None if nothing matches.

In [51]:
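# The first tag with a given name
print(soup.find("title"))

# A string as the second argument is matched against the CSS class
print(soup.find("p", "title"))

print(soup.find("a"))

# Attributes can also be given as a dictionary
print(soup.find("a", {'class': 'sister'}))

# Keyword arguments filter on attributes of that name
print(soup.find(id="link2"))
print(soup.find(attrs={"id": "link1"}))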

<title>The Dormouse's story</title>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
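
Because `find()` returns `None` when nothing matches, it is worth checking the result before using it. A minimal sketch with assumed markup and hypothetical variable names:

```python
from bs4 import BeautifulSoup

# Illustrative markup that contains a paragraph but no table.
demo = BeautifulSoup('<p class="title">Hello</p>', 'html.parser')

# No match: find() gives None, find_all() gives an empty list.
missing = demo.find('table')
empty = demo.find_all('table')

# Guard against None before calling methods on the result.
heading = demo.find('p', 'title')
text = heading.get_text() if heading is not None else ''

# find() finds the same tag as the single result of find_all() with limit=1.
first_via_limit = demo.find_all('p', limit=1)

print(missing, empty, text, first_via_limit)
```

Skipping the check leads to errors such as `AttributeError: 'NoneType' object has no attribute 'get_text'` when a page does not contain the expected tag.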
