# Searching the tree

In [1]:
import re
from bs4 import BeautifulSoup

Upload the "harry_potter.html" and "harry_potter.jpg" files to the working directory before running the cells below.

In [2]:
with open('harry_potter.html') as html_code:
    soup = BeautifulSoup(html_code, 'lxml')

In [3]:
print(soup.prettify())

<html>
 <head>
  <title>
   Harry Potter
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Harry Potter
   </b>
  </p>
  <img alt="front_cover" height="300" src="harry_potter.jpg" width="300"/>
  <p class="fictional fantasy">
   Harry Potter is a series of seven fantasy novels written by British author
   <a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">
    J K Rowling
   </a>
   . 
        The novels chronicle the lives of a young wizard, Harry Potter, and his friends
   <a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">
    Hermione Granger
   </a>
   and
   <a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">
    Ron Weasley
   </a>
   .
  </p>
  <p class="main story">
   The main story concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and Mu

## Filters

Filters let us select the elements whose contents we want to extract.

We pass these filters to the find( ) and findAll( ) methods to pull exactly the data we need from a web page. There are four types of filter:
1. String
2. Regular expression
3. List
4. Function

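As a quick orientation, the four filter types can be tried side by side on a small document. This is only a sketch: `SAMPLE_HTML` is a made-up stand-in for harry_potter.html, `has_src` is a hypothetical helper, and the built-in "html.parser" is used so that nothing beyond bs4 is needed. `find_all` is the current spelling of `findAll`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for harry_potter.html, kept small on purpose.
SAMPLE_HTML = """
<html><body>
<p class="title"><b>Harry Potter</b></p>
<img src="harry_potter.jpg" alt="front_cover"/>
<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling">J K Rowling</a>
</body></html>
"""

sample = BeautifulSoup(SAMPLE_HTML, "html.parser")


def has_src(tag):
    """Function filter: keep tags that carry a src attribute."""
    return tag.has_attr("src")


by_string = [t.name for t in sample.find_all("p")]                # exact tag name
by_regex = [t.name for t in sample.find_all(re.compile("^b"))]    # names starting with "b"
by_list = [t.name for t in sample.find_all(["a", "img"])]         # any name in the list
by_function = [t.name for t in sample.find_all(has_src)]          # custom predicate

print(by_string)    # ['p']
print(by_regex)     # ['body', 'b']
print(by_list)      # ['img', 'a']
print(by_function)  # ['img']
```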
### String

A string is the simplest filter. We pass a string to the method, and BeautifulSoup matches it against tag names and returns the tags with exactly that name.

In [4]:
soup.p

<p class="title"><b>Harry Potter</b></p>

In [5]:
soup.find('p')

<p class="title"><b>Harry Potter</b></p>

Here we pass the string 'p' to findAll( ), which finds all "p" tags in the web page.

In [6]:
soup.findAll('p')

[<p class="title"><b>Harry Potter</b></p>,
 <p class="fictional fantasy">
         Harry Potter is a series of seven fantasy novels written by British author
         <a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>. 
         The novels chronicle the lives of a young wizard, Harry Potter, and his friends
         <a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a> and 
         <a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>.
 	</p>,
 <p class="main story">
 		The main story concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and Muggles (non-magical people).
 	</p>]

Here we pass the string 'title' to find( ), which returns the first "title" tag in the web page.

In [7]:
soup.find('title')

<title> Harry Potter </title>

Here we pass the string 'title' to findAll( ), which returns a list of all "title" tags in the web page.

In [8]:
soup.findAll('title')

[<title> Harry Potter </title>]

If the web page contains no tag with that name, find( ) returns "None".

In [9]:
print(soup.find('span'))

None

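Because find( ) can return None, calling a method on its result without checking can raise an AttributeError. The sketch below shows one way to guard against that; `text_or_default` is a hypothetical helper written for this illustration, not part of BeautifulSoup, and the HTML string is made up.

```python
from bs4 import BeautifulSoup

sample = BeautifulSoup("<p class='title'><b>Harry Potter</b></p>", "html.parser")

missing = sample.find("span")          # no <span> in the document
all_missing = sample.find_all("span")  # the "find all" variant returns an empty list


def text_or_default(soup, name, default=""):
    """Return the text of the first tag called `name`, or `default` if there is none."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default


print(missing)                                        # None
print(len(all_missing))                               # 0
print(text_or_default(sample, "b"))                   # Harry Potter
print(text_or_default(sample, "span", default="n/a")) # n/a
```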

In [10]:
soup.findAll(src = 'harry_potter.jpg')

[<img alt="front_cover" height="300" src="harry_potter.jpg" width="300"/>]

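The src = 'harry_potter.jpg' call above filters on an attribute value rather than a tag name. A short sketch of the same idea with other attributes follows; the HTML and the data-house attribute are invented for the example. `class_` is spelled with a trailing underscore because `class` is a Python keyword, and attribute names that are not valid keyword arguments go through the `attrs` dictionary.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html = """
<img src="harry_potter.jpg" alt="front_cover"/>
<a class="character" id="link2" href="#hermione">Hermione Granger</a>
<a class="character" id="link3" href="#ron" data-house="gryffindor">Ron Weasley</a>
"""
sample = BeautifulSoup(html, "html.parser")

by_src = sample.find_all(src="harry_potter.jpg")                # keyword = attribute name
by_id = sample.find_all(id="link2")
by_class = sample.find_all("a", class_="character")             # tag name plus CSS class
by_data = sample.find_all(attrs={"data-house": "gryffindor"})   # hyphenated attribute name

print([t.name for t in by_src])        # ['img']
print(by_id[0].get_text())             # Hermione Granger
print(len(by_class))                   # 2
print(by_data[0]["id"])                # link3
```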
### Regular expression

If we pass a regular expression object to a method, BeautifulSoup matches tag names against it using the pattern's "search( )" method.

This code finds the first tag whose name starts with "a".

In [11]:
soup.find(re.compile("^a"))

<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>

This code finds all tags whose names start with "a".

In [13]:
for tag in soup.findAll(re.compile("^a")):
    print(tag)


<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>
<a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a>
<a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>
<a href="https://en.wikipedia.org/wiki/Harry_Potter">wikipedia.</a>

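One consequence of search-style matching is that "^a" selects every tag whose name begins with "a", not only anchor tags. The sketch below uses a made-up fragment containing "address" and "abbr" tags to show the difference between a prefix pattern, an exact pattern, and an unanchored pattern.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment with several tag names that start with "a".
html = "<address>Privet Drive</address><abbr>HP</abbr><a href='#'>link</a><b>bold</b>"
sample = BeautifulSoup(html, "html.parser")

starts_with_a = [t.name for t in sample.find_all(re.compile("^a"))]  # prefix match
exactly_a = [t.name for t in sample.find_all(re.compile("^a$"))]     # whole-name match
contains_b = [t.name for t in sample.find_all(re.compile("b"))]      # "b" anywhere in the name

print(starts_with_a)  # ['address', 'abbr', 'a']
print(exactly_a)      # ['a']
print(contains_b)     # ['abbr', 'b']
```

When only one exact tag name is wanted, the plain string filter ('a') is simpler than an anchored pattern.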

We can also filter on attribute values with a regular expression.

In [14]:
for tag in soup.findAll('a', attrs = {'class': re.compile('^char')}):
    print(tag)

<a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a>
<a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>


This code finds the first "a" (anchor) tag that has a "class" attribute whose value contains "t" anywhere, not only at the start.

In [15]:
soup.find('a', attrs = {'class': re.compile('t')})

<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>

In [16]:
for tag in soup.findAll('a', attrs = {'class': re.compile('r')}):
    print(tag)

<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>
<a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a>
<a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>

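The same kind of pattern also works when a tag has several classes, as in class="fictional fantasy": each class is tested on its own. A minimal sketch, using an invented fragment modelled on the page above:

```python
import re
from bs4 import BeautifulSoup

# Invented fragment modelled on the Harry Potter page.
html = """
<p class="fictional fantasy">Series overview</p>
<p class="main story">Plot summary</p>
<a class="author" href="#rowling">J K Rowling</a>
"""
sample = BeautifulSoup(html, "html.parser")

# "^fan" matches the individual class "fantasy", even though it is listed second.
fantasy_like = sample.find_all("p", class_=re.compile("^fan"))
# The attrs dictionary form used in the cells above behaves the same way.
story_like = sample.find_all("p", attrs={"class": re.compile("^sto")})
# An unanchored pattern matches any class that contains the letter "t".
any_t = [tag.name for tag in sample.find_all(class_=re.compile("t"))]

print(fantasy_like[0].get_text())  # Series overview
print(story_like[0].get_text())    # Plot summary
print(any_t)                       # ['p', 'p', 'a']
```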

### List

If we pass a list of strings to a method, BeautifulSoup matches tags against any item in the list. The results come back in document order, not in the order of the items in the list.

This code will find all "img" tags and "a" tags.

In [17]:
soup.findAll(['a','img'])

[<img alt="front_cover" height="300" src="harry_potter.jpg" width="300"/>,
 <a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>,
 <a href="https://en.wikipedia.org/wiki/Harry_Potter">wikipedia.</a>]

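To see that the order of the list does not affect the order of the results, the sketch below runs the same search with the items swapped. The HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented fragment: an image followed by two links, with other tags in between.
html = "<p>intro</p><img src='cover.jpg'/><a href='#one'>one</a><b>bold</b><a href='#two'>two</a>"
sample = BeautifulSoup(html, "html.parser")

order_a_img = [t.name for t in sample.find_all(["a", "img"])]
order_img_a = [t.name for t in sample.find_all(["img", "a"])]

# Both searches return the tags in the order they appear in the document.
print(order_a_img)  # ['img', 'a', 'a']
print(order_img_a)  # ['img', 'a', 'a']
```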
If we want to find every tag, we can simply pass "True" to the method, which matches all tags (but not the text strings between them).

In [18]:
for tag in soup.findAll(True):
    print(tag.name)

html
head
title
body
p
b
img
p
a
a
a
p
i
a

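"True" can also be used as an attribute value, where it means "any tag that has this attribute at all", and the `limit` argument caps the number of results. A small sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Invented fragment: one link with an href, one <a> without.
html = "<p id='intro'>intro <a href='#one'>one</a> <a>no link</a></p><img src='x.jpg'/>"
sample = BeautifulSoup(html, "html.parser")

all_names = [t.name for t in sample.find_all(True)]           # every tag
with_href = sample.find_all(href=True)                        # tags that have an href
first_two = [t.name for t in sample.find_all(True, limit=2)]  # stop after two matches

print(all_names)                 # ['p', 'a', 'a', 'img']
print(with_href[0].get_text())   # one
print(first_two)                 # ['p', 'a']
```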

### Function

If the three filters above (string, regular expression, list) do not meet our needs, we can write our own function to filter the data from the web page and pass that function to the method. The function receives a tag and returns True if the tag should be kept.

This code finds all tags that have a "src" attribute but no "href" attribute.

In [19]:
def has_src_but_no_href(tag):
    return tag.has_attr('src') and not tag.has_attr('href')

In [20]:
soup.findAll(has_src_but_no_href)

[<img alt="front_cover" height="300" src="harry_potter.jpg" width="300"/>]

In [21]:
def has_wikipedia_source(href):
    # href is None for tags without an href attribute, so check it first
    return href and re.search("wikipedia", href)

In [22]:
soup.findAll(href = has_wikipedia_source)

[<a class="author" href="https://en.wikipedia.org/wiki/J._K._Rowling" id="link1">J K Rowling</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Hermione_Granger" id="link2">Hermione Granger</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Ron_Weasley" id="link3">Ron Weasley</a>,
 <a href="https://en.wikipedia.org/wiki/Harry_Potter">wikipedia.</a>]

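The two cells above use functions in two different ways: passed directly, the function receives a whole tag; passed as an attribute value, it receives only that attribute's value. The sketch below puts both forms side by side. The helper names `is_external_link` and `mentions_wikipedia` and the HTML fragment are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Invented fragment with an external link, an in-page link, and a bare <a>.
html = (
    "<a href='https://en.wikipedia.org/wiki/Harry_Potter'>wiki</a>"
    "<a href='#top'>top</a>"
    "<a>bare</a>"
    "<img src='a.jpg'/>"
)
sample = BeautifulSoup(html, "html.parser")


def is_external_link(tag):
    """Tag-level filter: an <a> tag whose href starts with http."""
    return tag.name == "a" and tag.get("href", "").startswith("http")


def mentions_wikipedia(href):
    """Attribute-level filter: receives the href value; guard against None."""
    return href is not None and "wikipedia" in href


external = sample.find_all(is_external_link)
wikipedia = sample.find_all(href=mentions_wikipedia)

print([t.get_text() for t in external])   # ['wiki']
print([t.get_text() for t in wikipedia])  # ['wiki']
```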
# Navigating Trees

The findAll function finds tags based on their names and attributes.

But what if you need to find a tag based on its location in a document?



### Children

If you want to find only the descendants that are direct children, you can use the .children attribute:

In [24]:
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(html, 'lxml')


In [25]:
for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)

# This prints all the rows of the table, including the header.



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>

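To make the difference between children and descendants concrete, the sketch below compares .children with .descendants on a cut-down, invented version of the gift table. The `isinstance` check skips text nodes, which both iterators also yield (this is why the printed output above contains blank lines).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down, invented version of the gift table.
html = (
    "<table id='giftList'>"
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr id='gift1'><td>Basket</td><td>$15.00</td></tr>"
    "</table>"
)
sample = BeautifulSoup(html, "html.parser")
table = sample.find("table", {"id": "giftList"})

# Direct children only: the two rows.
child_names = [c.name for c in table.children if isinstance(c, Tag)]
# All descendants: the rows and every cell nested inside them.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)       # ['tr', 'tr']
print(descendant_names)  # ['tr', 'th', 'th', 'tr', 'td', 'td']
```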

### Siblings

If we want to skip the header row, we can use next_siblings:

In [28]:
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)

'''
The output of this code is to print all rows of products from the product table, except for
the first title row. However, the title row gets skipped for two reasons:

First, objects cannot be siblings with themselves. 
Any time you get siblings of an object, the object itself will not be included 
in the list. 

Second, this function calls next siblings only. If we were to select a
row in the middle of the list, for example, and call next_siblings on it, only the
subsequent (next) siblings would be returned
'''



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr


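The second point, that next_siblings only looks forward, can be checked by starting from a row in the middle of a table. The sketch below uses an invented four-row table written without whitespace between rows, so every sibling is a tag; previous_siblings is the matching attribute for looking backward.

```python
from bs4 import BeautifulSoup

# Invented table, written on one line so there are no whitespace text nodes.
html = (
    "<table id='giftList'>"
    "<tr id='header'></tr><tr id='gift1'></tr>"
    "<tr id='gift2'></tr><tr id='gift3'></tr>"
    "</table>"
)
sample = BeautifulSoup(html, "html.parser")
middle = sample.find("tr", id="gift2")

# Only rows after gift2; gift2 itself is never included.
after = [s["id"] for s in middle.next_siblings]
# Rows before gift2, nearest first.
before = [s["id"] for s in middle.previous_siblings]

print(after)   # ['gift3']
print(before)  # ['gift1', 'header']
```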
### Parents

In [29]:
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())


$15.00

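The one-line expression above chains three steps: find the image, move up to the cell that contains it, then move sideways to the previous cell. The sketch below spells the steps out on a single invented row modelled on the gift table; the intermediate variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# One invented row modelled on the gift table.
html = (
    "<table><tr id='gift1'>"
    "<td>Vegetable Basket</td><td>$15.00</td>"
    "<td><img src='../img/gifts/img1.jpg'/></td>"
    "</tr></table>"
)
sample = BeautifulSoup(html, "html.parser")

img = sample.find("img", {"src": "../img/gifts/img1.jpg"})  # 1. select the image
cell = img.parent                                           # 2. the <td> wrapping the image
price_cell = cell.previous_sibling                          # 3. the <td> just before it
price = price_cell.get_text().strip()
row_id = cell.parent["id"]                                  # .parent again reaches the <tr>

print(price)   # $15.00
print(row_id)  # gift1
```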


Check out: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree for more navigation methods.

