# Line numbers

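The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as **Tag.sourceline** (line number) and **Tag.sourcepos** (position of the start tag within a line). The two parsers count slightly differently: for html.parser the numbers give the position of the start tag's opening less-than sign, while for html5lib they give the position of its closing greater-than sign. This feature was added in Beautiful Soup 4.8.1 and is not supported by the lxml-based parsers; where it is unavailable both attributes are None, which is probably why the outputs recorded below show None.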

In [2]:
from bs4 import BeautifulSoup
markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
soup = BeautifulSoup(markup, 'html.parser')
for tag in soup.find_all('p'):
    print(tag.sourceline, tag.sourcepos, tag.string)

None None Paragraph 1
None None Paragraph 2


In [4]:
soup = BeautifulSoup(markup, 'html5lib')
for tag in soup.find_all('p'):
    print(tag.sourceline, tag.sourcepos, tag.string)

None None Paragraph 1
None None Paragraph 2
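
As a small illustration of how these attributes might be used, the sketch below collects a (name, line, column) triple for every tag, then parses the same markup again with store_line_numbers=False, the constructor argument that switches tracking off. It assumes Beautiful Soup 4.8.1 or newer and the html.parser backend; the helper name tag_locations is made up for this example.

```python
from bs4 import BeautifulSoup

markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"


def tag_locations(soup):
    """Return (tag name, line, column) for every tag in the soup.

    Hypothetical helper for illustration; relies on Tag.sourceline and
    Tag.sourcepos, which require Beautiful Soup 4.8.1 or newer.
    """
    return [(tag.name, tag.sourceline, tag.sourcepos) for tag in soup.find_all(True)]


# Line tracking is on by default for html.parser.
tracked = BeautifulSoup(markup, "html.parser")
for name, line, column in tag_locations(tracked):
    print(f"<{name}> starts at line {line}, column {column}")

# Passing store_line_numbers=False turns tracking off, so both values are None.
untracked = BeautifulSoup(markup, "html.parser", store_line_numbers=False)
print(tag_locations(untracked))
```

Switching tracking off can be useful if you want to read sourceline and sourcepos as ordinary child-tag lookups instead.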


# Comparing objects for equality

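Beautiful Soup considers two NavigableString or Tag objects equal when they represent the same HTML or XML markup. In the example below, the two b tags are treated as equal even though they sit in different parts of the tree, because both look like the same markup.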

In [6]:
markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)

True


In [7]:
print(first_b.previous_element==second_b.previous_element)

False
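
To check whether two variables refer to exactly the same object rather than to equal markup, Python's `is` operator can be used. The following is a minimal sketch along those lines; the variable names same_markup and same_object are chosen only for this example.

```python
from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")

# == compares the markup each tag represents.
same_markup = first_b == second_b

# `is` compares object identity, so two separate tags are never the same object.
same_object = first_b is second_b

print(same_markup, same_object)
```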


# Copying BeautifulSoup objects

We can use copy.copy() to create a copy of any Tag or NavigableString. The copy is considered equal to the original, since it represents the same markup, but it’s not the same object.

In [8]:
import copy
p_copy = copy.copy(soup.p)
p_copy

<p>I want <b>pizza</b> and more <b>pizza</b>!</p>
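
A copy is also detached from the original tree, so it has no parent and can be edited freely. The sketch below illustrates this under the assumption that the html.parser backend is used; replacing "pizza" with "pasta" is just an arbitrary edit made for the demonstration.

```python
import copy

from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
p_copy = copy.copy(soup.p)

print(p_copy == soup.p)   # compares markup
print(p_copy is soup.p)   # compares identity
print(p_copy.parent)      # the copy is not attached to any tree

# Editing the copy leaves the original document untouched.
p_copy.b.string = "pasta"
print(p_copy)
print(soup.p)
```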

# Advanced parser customization

## Parsing only a part of document

Suppose we only want to use Beautiful Soup to look at a document’s **a tags**. It’s a waste of time and memory to parse the entire document and then go over it again looking for a tags; it would be much faster to ignore everything that isn’t an a tag in the first place. The **SoupStrainer** class lets you choose which parts of an incoming document are parsed: create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument.


**Note:** This feature does not work with the **html5lib parser**, because html5lib always parses the whole document, no matter what.

In [10]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""



In [12]:
from bs4 import SoupStrainer
soup = BeautifulSoup(html_doc, 'html.parser')
only_a_tags = SoupStrainer('a')
print(BeautifulSoup(html_doc, 'html.parser', parse_only=only_a_tags).prettify())


<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
</a>


In [15]:
only_tags_with_id_link2 = SoupStrainer(id='link2')
print(BeautifulSoup(html_doc, 'html.parser', parse_only=only_tags_with_id_link2).prettify())

<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
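
A strained soup can be searched like any other, but it only contains what the strainer let through. The sketch below, which uses a shortened version of the sample document and assumes the html.parser backend, pulls out the link targets and shows that lookups for skipped tags come back empty; the names links_only and hrefs are chosen only for this example.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# Only a tags are turned into objects; everything else is skipped.
only_a_tags = SoupStrainer("a")
links_only = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

hrefs = [a["href"] for a in links_only.find_all("a")]
print(hrefs)

# Tags outside the strainer were never parsed, so lookups for them find nothing.
print(links_only.find("p"))
print(links_only.title)
```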


# Customizing multi-valued attributes

By default, Beautiful Soup parses attributes that HTML defines as multi-valued, such as class, into a list of strings, while attributes such as id stay a single string. Passing multi_valued_attributes=None to the BeautifulSoup constructor turns this behaviour off.

In [23]:
soup = BeautifulSoup('<p class="cl1 cl2" id="id1 id2">hello</p>', 'html.parser')
soup.p['class']

['cl1', 'cl2']

In [None]:
# We can turn off multi-valued attribute handling with multi_valued_attributes=None
soup = BeautifulSoup('<p class="cl1 cl2" id="id1 id2">hello</p>', 'html.parser', multi_valued_attributes=None)
soup.p['class']
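
The difference between the two modes can be seen side by side in the following sketch. It assumes the html.parser backend and a Beautiful Soup version that accepts multi_valued_attributes (4.8.0 or newer); the names default_soup and plain_soup are chosen only for this example.

```python
from bs4 import BeautifulSoup

markup = '<p class="cl1 cl2" id="id1 id2">hello</p>'

# Default behaviour: class is split into a list, id stays one string.
default_soup = BeautifulSoup(markup, "html.parser")
print(default_soup.p["class"])
print(default_soup.p["id"])

# With multi_valued_attributes=None, class is kept as a single string.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
print(plain_soup.p["class"])

# get_attribute_list() always returns a list, whichever mode was used.
print(plain_soup.p.get_attribute_list("class"))
```

Code that has to handle both modes may find get_attribute_list() convenient, since it never returns a bare string.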