# Searching the tree

In [2]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

## a. Filters

Here are examples of the different filters you can pass to search methods such as **find_all()** and **find()**. These filters show up again and again throughout the search API. You can use them to filter on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
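To make the "again and again" point concrete, here is a minimal sketch that reuses one regular-expression filter in four places: on the tag name, on two attributes, and on string content. The three-tag `snippet` document and the variable names are made up for illustration and are not part of the tutorial's `html_doc`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical three-tag document, invented for this sketch.
snippet = '<title>Home</title><p class="title">A title line</p><a id="title">x</a>'
demo = BeautifulSoup(snippet, "html.parser")

pattern = re.compile("title")

by_name = demo.find_all(pattern)               # filter on the tag name
by_class = demo.find_all(class_=pattern)       # filter on the class attribute
by_id = demo.find_all(id=pattern)              # filter on the id attribute
by_string = demo.find_all(string=pattern)      # filter on the text of a string
combined = demo.find_all("p", class_=pattern)  # tag name and attribute together

print([tag.name for tag in by_name])
print([tag.name for tag in by_class])
print([tag.name for tag in by_id])
print(by_string)
print([tag.name for tag in combined])
```

The same `pattern` object is accepted in every position, which is why the filter types below are worth learning once.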

### 1.  string

In [3]:
#passing a string makes find_all() match tags with exactly that name
soup.find_all('b')

[<b>The Dormouse's story</b>]

In [4]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### 2.  list

In [5]:
#if we pass a list, find_all() returns every tag whose name matches any item in the list
soup.find_all(['a','b'])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [6]:
#passing True matches every tag in the document, but none of the text strings
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p


### 3. Regular expression

In [7]:
#tags whose names start with the letter 'b'
import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)


body
b


In [8]:
#tags whose names contain the letter 't'
for tag in soup.find_all(re.compile('t')):
    print(tag.name)

html
title


### 4. Function

In [9]:
#returns True for tags that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)


[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [10]:
#a function used as an attribute filter receives the attribute's value (or None), not the tag
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## b. Search methods

### 1. find_all()
**Signature:** find_all(name, attrs, recursive, string, limit, \*\*kwargs)

**name argument**

In [11]:
#pass a tag name to get every tag with that name
soup.find_all('title')

[<title>The Dormouse's story</title>]

In [12]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

**attrs argument**

In [13]:
#attrs takes a dictionary that maps attribute names to the values to match
name_soup=BeautifulSoup("<form email='name'></form>", 'html.parser')
name_soup.find_all(attrs={'email':'name'})

[<form email="name"></form>]

In [14]:
#data-foo is not a valid Python keyword argument name, so it has to go in attrs
data_soup=BeautifulSoup("<div data-foo='greg'></div>", 'html.parser')
data_soup.find_all(attrs={'data-foo':'greg'})

[<div data-foo="greg"></div>]

*Searching by CSS class*

It’s very useful to search for tags that have a certain CSS class, but the name of the attribute, “class”, is a reserved word in Python, so using class as a keyword argument raises a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument **class_**.

In [15]:
soup.find_all(class_='sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [16]:
soup.find_all(class_=re.compile('tle'))

[<p class="title"><b>The Dormouse's story</b></p>]

In [17]:
#we can pass both a tag name and a CSS class; class_ matches any one of the tag's classes
css_soup=BeautifulSoup("<p class='body strikeout'></p>", 'html.parser')
css_soup.find_all('p',class_='body')

[<p class="body strikeout"></p>]
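Because a tag can carry several CSS classes, it helps to see how class_ treats them. The following is a small sketch of the behaviour as described in the Beautiful Soup documentation, using the same two-class tag as the cell above; the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Either individual class matches.
one_class = css_soup.find_all("p", class_="strikeout")

# The exact attribute string matches too.
exact = css_soup.find_all("p", class_="body strikeout")

# The same classes in a different order do not match as a string.
swapped = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes, in any order.
both = css_soup.select("p.strikeout.body")

# attrs is an alternative to class_ for the same search.
via_attrs = css_soup.find_all("p", attrs={"class": "body"})

print(len(one_class), len(exact), len(swapped), len(both), len(via_attrs))
```

When a tag must have two or more classes regardless of order, the selector form is usually the safer choice.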

**keywords argument**

In [18]:
#we can filter on several attributes at once
soup.find_all(href=re.compile('elsie'),id='link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

**string argument**

In [27]:
soup.find_all(string='Elsie')

['Elsie']

In [23]:
soup.find_all(string=['Elsie','Lacie','Tillie'])

['Elsie', 'Lacie', 'Tillie']

In [24]:
soup.find_all(string=re.compile('bottom'))

[';\nand they lived at the bottom of a well.']

In [25]:
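#combined with a tag name, string returns the tags whose .string matches (text= is the older name for this argument)
soup.find_all('a',string='Elsie')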

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
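The searches above return two different kinds of object, which is easy to miss in the printed output. The sketch below, on a short made-up paragraph, shows that a string-only search yields NavigableString objects, while adding a tag name yields the enclosing tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical one-paragraph document for this sketch.
demo = BeautifulSoup(
    '<p>Names: <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>',
    "html.parser",
)

strings = demo.find_all(string="Elsie")   # text nodes
tags = demo.find_all("a", string="Elsie")  # tags whose .string is "Elsie"

print(type(strings[0]).__name__)
print(type(tags[0]).__name__)

# A matched string can be traced back to its tag through .parent.
print(strings[0].parent is tags[0])
```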

**limit argument**

In [29]:
#limits the number of results to 2
soup.find_all('a',limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

**recursive argument**

By default, find_all() examines all descendants of a tag: its children, its children’s children, and so on. To consider only direct children, pass recursive=False. In this document, the *title tag* is beneath the *html tag*, but not directly beneath it: the *head tag* is in the way. Beautiful Soup finds the *title tag* when it’s allowed to look at all descendants of the *html tag*, but when recursive=False restricts it to the *html tag’s* immediate children, it finds nothing.

In [30]:
soup.html.find_all('title')

[<title>The Dormouse's story</title>]

In [31]:
soup.html.find_all('title',recursive=False)

[]
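A related use of recursive=False is listing only the direct children of a tag. This is a minimal sketch on a tiny made-up document, not the tutorial's `html_doc`.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document for this sketch.
demo = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>x</p></body></html>",
    "html.parser",
)

# All descendants of <html>, in document order.
descendants = [tag.name for tag in demo.html.find_all(True)]

# Only the immediate children of <html>.
children = [tag.name for tag in demo.html.find_all(True, recursive=False)]

# <title> is a direct child of <head>, so this search succeeds.
direct_title = demo.head.find_all("title", recursive=False)

print(descendants)
print(children)
print(direct_title)
```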

### 2. find()

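The only difference between **find_all()** and **find()** is the return value: find_all() returns a list of every match, while find() returns just the first match. Calling find() is therefore equivalent to calling find_all() with limit=1 and taking the single item out of the resulting list.

If find_all() can’t find anything, it returns an **empty list**. If find() can’t find anything, it returns **None**.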

In [35]:
print(soup.find('notag'))

None
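The None return value matters in practice, because calling a method on it raises an AttributeError. Here is a short sketch, on a made-up two-link document, of the equivalence with limit=1 and of one way to guard against a missing tag.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document for this sketch.
demo = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser"
)

first = demo.find("a")
same = demo.find_all("a", limit=1)[0]
print(first is same)  # both refer to the same Tag object

missing = demo.find("table")       # no such tag, so this is None
no_matches = demo.find_all("table")  # no such tag, so this is empty

# Guard before using the result of find().
text = missing.get_text() if missing is not None else ""
print(repr(text), len(no_matches))
```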


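### 3. find_parents() and find_parent()

### 4. find_next_siblings() and find_next_sibling()

### 5. find_previous_siblings() and find_previous_sibling()

### 6. find_all_next() and find_next()

### 7. find_all_previous() and find_previous()

Methods 3 to 7 also search the tree. They accept the same filters as find_all() and find(), but each pair looks at a different part of the document: a tag's parents, the siblings after it, the siblings before it, everything that comes after it, and everything that comes before it, respectively. As with find_all() and find(), the first method in each pair returns every match and the second returns only the first match.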
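As a quick illustration of these methods, the sketch below starts from the middle link and searches in each direction. It uses a trimmed copy of the three sisters document (attributes and text shortened for brevity), so the exact strings differ slightly from `html_doc`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the tutorial document, shortened for this sketch.
doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
demo = BeautifulSoup(doc, "html.parser")

link2 = demo.find("a", id="link2")

parent_p = link2.find_parent("p")            # nearest enclosing <p>
ancestors_p = link2.find_parents("p")        # every enclosing <p>
later = link2.find_next_siblings("a")        # <a> siblings after link2
earlier = link2.find_previous_siblings("a")  # <a> siblings before link2
next_p = link2.find_next("p")                # first <p> later in the document
prev_title = link2.find_previous("title")    # first <title> earlier in the document

print(parent_p["class"], len(ancestors_p))
print([a["id"] for a in later], [a["id"] for a in earlier])
print(next_p.get_text(), "|", prev_title.get_text())
```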

## c. CSS selectors

A BeautifulSoup object has a **.select()** method, which uses the *Soup Sieve* package to run a CSS selector against a parsed document and return all the matching elements. A Tag has the same method, which runs a CSS selector against the contents of that single tag.

In [37]:
soup.select('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [41]:
soup.select('title')

[<title>The Dormouse's story</title>]

In [42]:
#finding tags beneath other tags
soup.select('html head title')

[<title>The Dormouse's story</title>]

In [43]:
#finding tag directly beneath other tag
soup.select('head > title')

[<title>The Dormouse's story</title>]

In [44]:
soup.select('p > a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [45]:
soup.select('body > a')

[]

In [48]:
#finding the later siblings of #link1 that have the class sister
soup.select('#link1 ~ .sister')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [50]:
#finding the sibling immediately after #link2, if it has the class sister
soup.select('#link2 + .sister')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [51]:
#finding tags by CSS class
soup.select('.sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [53]:
#finding tags by id
soup.select('#link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [54]:
soup.select('a#link3')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [56]:
#finding tags that match any selector in a comma-separated list
soup.select('#link1, #link2')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [57]:
#test for existence of attribute
soup.select('a[href]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [61]:
#[a] tests for an attribute named a, not for a child tag, so nothing matches
soup.select('p[a]')

[]
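All of the cells above call select() on the whole document. As noted at the start of this section, a Tag has the same method, which limits the search to that tag's contents. The sketch below uses a made-up two-paragraph document to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document for this sketch.
demo = BeautifulSoup(
    '<p class="title"><b>Heading</b></p>'
    '<p class="story"><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>',
    "html.parser",
)

story = demo.find("p", class_="story")
title = demo.find("p", class_="title")

in_story = story.select("a")  # links inside the story paragraph only
in_title = title.select("a")  # the title paragraph contains no links
in_doc = demo.select("a")     # links anywhere in the document

print([a["id"] for a in in_story], len(in_title), len(in_doc))
```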

**select_one()**

select_one() works like select(), but returns only the first tag that matches the selector.

In [62]:
soup.select_one('a')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
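Like find(), select_one() returns None rather than an empty list when nothing matches. A minimal sketch on a made-up two-link document:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document for this sketch.
demo = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser"
)

first = demo.select_one("a")        # first match only
all_links = demo.select("a")        # every match
missing = demo.select_one("table")  # no match, so this is None

print(first["id"], len(all_links), missing)
```

Checking for None before using the result avoids an AttributeError when the selector matches nothing.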