In [1]:
import re
from bs4 import BeautifulSoup

In [2]:
with open('files/TomJerry_WithImages.html') as html_code:
    soup = BeautifulSoup(html_code, 'lxml')

In [3]:
print(soup.prettify())

<html>
 <head>
  <title>
   The story of Tom and Jerry
  </title>
 </head>
 <body class="container">
  <h1>
   Tom and Jerry
  </h1>
  &gt;
  <img alt="cartoon_image" height="300" src="TomAndJerry.jpg" width="300"/>
  <p class="comedy animated series">
   Tom and Jerry is an American animated series of comedy short films created by
   <a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">
    William Hanna
   </a>
   and
   <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">
    Joseph Barbera
   </a>
   . 
        It centers on a rivalry between the title characters
   <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">
    Tom
   </a>
   , a cat, and
   <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">
    Jerry
   </a>
   , a mouse.
  </p>
  <div>
   <img alt="creator_image" height="300" name="William_Hanna" src="https://upload.wikimedia.org/wikipedia/commons/d/d2/Will

### Advanced use of find() and find_all()

In [4]:
soup.find('a')

<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a>

`find()` returns only the first match. Here it looks up the first `img` element by its `<img>` tag.

In [5]:
soup.find('img')

<img alt="cartoon_image" height="300" src="TomAndJerry.jpg" width="300"/>

`find_all()` returns every matching element as a list. Here it collects all `a` elements by their `<a>` tag.

In [6]:
soup.find_all('a')

[<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a>,
 <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">Joseph Barbera</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">Jerry</a>]

Calling the soup object directly is a shortcut for `find_all()`.

In [7]:
soup('a')

[<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a>,
 <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">Joseph Barbera</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">Jerry</a>]

Here `find_all()` returns every matching tag. This file is small, but on a large document collecting all matches can be slow, and sometimes only the first few matches are needed. The `limit` argument stops the search once the given number of results has been found.

In [8]:
soup.find_all('a', limit = 3)

[<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a>,
 <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">Joseph Barbera</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>]
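
As a small sketch of how `limit` relates to `find()`: `find('a')` behaves like `find_all('a', limit=1)`, except that it returns the tag itself (or `None`) rather than a list. The HTML string and the name `demo` below are made up for illustration, and `html.parser` is used instead of `lxml` only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, not the notebook's file.
html = '<p><a id="link1">William Hanna</a><a id="link2">Joseph Barbera</a></p>'
demo = BeautifulSoup(html, 'html.parser')

first_as_list = demo.find_all('a', limit=1)   # list with at most one tag
first_as_tag = demo.find('a')                 # the tag itself

missing_list = demo.find_all('img', limit=1)  # no match: empty list
missing_tag = demo.find('img')                # no match: None

print(first_as_list, first_as_tag)
print(missing_list, missing_tag)
```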

By default `find_all()` searches all descendants of a tag: its children, their children, and so on. To search only the direct children, pass `recursive=False`.
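
A minimal sketch of the difference, assuming a made-up snippet in which one `<a>` is a direct child of a `<div>` and another is nested inside a `<p>`. The names `demo`, `box`, and the `id` values are hypothetical, and `html.parser` is used so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <a> sits directly under <div>,
# the other is nested one level deeper inside <p>.
html = '<div id="box"><a id="direct">one</a><p><a id="nested">two</a></p></div>'
demo = BeautifulSoup(html, 'html.parser')
box = demo.find('div')

# Default (recursive=True): every descendant of <div> is searched.
all_links = box.find_all('a')

# recursive=False: only the direct children of <div> are searched.
direct_links = box.find_all('a', recursive=False)

print([a['id'] for a in all_links])     # ['direct', 'nested']
print([a['id'] for a in direct_links])  # ['direct']
```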

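#### Only for class attributes

When `attrs` is given a value that is not a dictionary, Beautiful Soup treats it as a filter on the `class` attribute.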

In [10]:
soup.find_all(attrs={'character'})

[<a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>,
 <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">Jerry</a>]

In [11]:
soup.find_all(attrs={re.compile('^ani')})

[<p class="comedy animated series">
         Tom and Jerry is an American animated series of comedy short films created by 
         <a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a> and  
         <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">Joseph Barbera</a>. 
         It centers on a rivalry between the title characters
         <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>, a cat, and 
         <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">Jerry</a>, a mouse.</p>]
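
The same class searches are more commonly written with the `class_` keyword or with an explicit `attrs` dictionary. The sketch below uses a shortened, made-up version of the paragraph above (the `href` attributes are dropped) and `html.parser` instead of `lxml`. The variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the notebook's paragraph.
html = ('<p class="comedy animated series">'
        '<a class="creator" id="link1">William Hanna</a>'
        '<a class="character" id="link3">Tom</a>'
        '<a class="character" id="link4">Jerry</a></p>')
demo = BeautifulSoup(html, 'html.parser')

# class_ has a trailing underscore because "class" is a Python keyword.
characters = demo.find_all(class_='character')

# Equivalent search with an explicit attribute dictionary.
same_with_dict = demo.find_all(attrs={'class': 'character'})

# A regex is tested against each individual class, so '^ani'
# matches the "animated" class of the <p> element.
animated = demo.find_all(class_=re.compile('^ani'))

print([a['id'] for a in characters])
print([t.name for t in animated])
```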

In [99]:
soup.find_all(string="Tom")

['Tom']

In [100]:
soup.find_all(string="Jerry")

['Jerry']

In [101]:
soup.find_all("a", string="Tom")

[<a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>]

In [102]:
soup.find_all("a", string="Jerry")

[<a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">Jerry</a>]

In [110]:
soup.find_all(['a', 'p'], string=re.compile('cat'))

[<p class="comedy story">
             The series features comic fights between an iconic pair of adversaries, 
             a house cat (Tom) and a mouse (Jerry). The plots of each short usually center on Tom's 
             numerous attempts to capture Jerry and the mayhem and destruction that follows. 
             Tom rarely succeeds in catching Jerry, mainly because of Jerry's cleverness, 
             cunning abilities, and luck. 
         </p>]
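
Only one `p` comes back above even though the word "cat" also appears in the first paragraph. When a tag name is combined with `string`, the filter is tested against each tag's `.string`, and `.string` is `None` for a tag that mixes text with child tags. The sketch below illustrates this on a made-up two-paragraph snippet; the `id` values and variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> mixes text and <a> children,
# the second <p> contains a single piece of text.
html = ('<p id="mixed">It centers on <a id="link3">Tom</a>, a cat, and '
        '<a id="link4">Jerry</a>, a mouse.</p>'
        '<p id="plain">a house cat (Tom) and a mouse (Jerry)</p>')
demo = BeautifulSoup(html, 'html.parser')

mixed = demo.find(id='mixed')
plain = demo.find(id='plain')
print(mixed.string)  # None: more than one child, so no single string
print(plain.string)

# Tag name + string: compares the regex with each tag's .string.
tag_matches = demo.find_all(['a', 'p'], string=re.compile('cat'))

# string alone: returns the matching text nodes themselves.
text_matches = demo.find_all(string=re.compile('cat'))

print([t['id'] for t in tag_matches])
print(len(text_matches))
```

Searching with `string` alone is the way to reach text inside mixed-content tags; the parent tag is then available through each result's `.parent`.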

In [111]:
def is_the_only_string_within_a_tag(s):
    """Return True if s is the sole child of its parent tag."""
    return s == s.parent.string

In [112]:
soup.find_all(string=is_the_only_string_within_a_tag)

[' The story of Tom and Jerry ',
 'Tom and Jerry',
 'William Hanna',
 'Joseph Barbera',
 'Tom',
 'Jerry',
 "\n            The series features comic fights between an iconic pair of adversaries, \n            a house cat (Tom) and a mouse (Jerry). The plots of each short usually center on Tom's \n            numerous attempts to capture Jerry and the mayhem and destruction that follows. \n            Tom rarely succeeds in catching Jerry, mainly because of Jerry's cleverness, \n            cunning abilities, and luck. \n        ",
 'Tom and Jerry show is a full length comedy show']