https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/

- **urllib.request:** A Python module for fetching URLs. It defines functions and classes that handle URL actions such as basic and digest authentication, redirections, and cookies. For more detail, refer to its documentation page.
- **BeautifulSoup:** A library for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can apply filters to narrow down what is extracted. Here we use the latest version, BeautifulSoup 4. Installation instructions are on its documentation page.
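
As a minimal sketch of the "classes to help with URL actions" mentioned above, `urllib.request.Request` can carry headers alongside the URL before anything is sent. The `USER_AGENT` value and the `build_request` helper are hypothetical names used only for illustration; nothing here contacts the network.

```python
from urllib.request import Request

# Hypothetical identification string; not part of the original notebook.
USER_AGENT = "example-scraper/0.1"


def build_request(url):
    """Return a Request object carrying a User-Agent header (not yet sent)."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("http://victorianhumour.com/jokedb/?offset=40&paging=40")
print(req.full_url)                  # the URL the request would fetch
print(req.get_method())              # GET, because no data was attached
print(req.get_header("User-agent"))  # header names are stored capitalised
```

Passing `req` to `urlopen(req, timeout=10)` would send it; the timeout value is likewise an assumption, chosen so a slow server cannot hang the notebook indefinitely.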

In [1]:
#import the library used to query a website
from urllib.request import urlopen
html = urlopen("http://victorianhumour.com/jokedb/?offset=40&paging=40")
print(html)

<http.client.HTTPResponse object at 0x106680588>


In [2]:
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup

In [3]:
soup = BeautifulSoup(html, "lxml")

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Victorian Humour Database
  </title>
  <!-- Bootstrap -->
  <link href="/jokedb/static/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>
  <link href="/jokedb/static/css/base.css" rel="stylesheet" type="text/css"/>
  <!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
  <!--[if lt IE 9]>
      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
 </head>
 <body>
  <div class="navbar navbar-inverse navbar-fixed-top" role="navigation">
   <div class="container">
    <div class="navbar-header">
     <button class="navbar-toggle collapsed" data-target=".navbar-collapse" data-toggle="collapse" type="button">
      <span class="sr-onl

# HTML Tags

In [5]:
# soup.<tag>: Return content between opening and closing tag including tag.
soup.title

<title>Victorian Humour Database</title>

In [6]:
# soup.<tag> works for any tag name, here the page's <t> tag; only the first match is returned
soup.t

<t>Modern Fashions.</t>

In [7]:
# soup.<tag>.string: Return string within given tag
soup.title.string

'Victorian Humour Database'
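
A small sketch of how `.string` behaves when a tag has more than one child, which is a common source of unexpected `None` values. The HTML snippet is made up for illustration, and `"html.parser"` (the standard-library parser) is swapped in for `"lxml"` so the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the scraped page.
snippet = (
    "<html><head><title>Victorian Humour Database</title></head>"
    "<body><p>One <b>bold</b> word</p></body></html>"
)
demo = BeautifulSoup(snippet, "html.parser")

title_text = demo.title.string   # the tag has exactly one string child
mixed_string = demo.p.string     # None: <p> holds text plus a nested <b>
mixed_text = demo.p.get_text()   # concatenates all text below the tag

print(title_text)
print(mixed_string)
print(mixed_text)
```

When a tag may contain nested markup, `get_text()` is usually the safer choice.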

# Links

In [8]:
# soup.a returns only the first <a> tag on the page; use find_all("a") to get every link
soup.a

<a class="navbar-brand" href="/jokedb/">Victorian Humour Joke database</a>

In [9]:
soup.find_all("a")

[<a class="navbar-brand" href="/jokedb/">Victorian Humour Joke database</a>,
 <a href="/jokedb/">Home</a>,
 <a href="/jokedb/works">Works</a>,
 <a href="/jokedb/login">log in</a>,
 <a href="http://victorianhumour.com/o/">Transcription gathering</a>,
 <a href="http://victorianhumour.tumblr.com">victorianhumour.tumblr.com</a>,
 <a href="/?offset=0&amp;paging=40"><button class="btn btn-sm btn-primary">Prev</button></a>,
 <a href="/?offset=80&amp;paging=40"><button class="btn btn-sm btn-primary">Next</button></a>,
 <a href="/jokedb/joke/41">41</a>,
 <a href="/jokedb/joke/42">42</a>,
 <a href="/jokedb/joke/43">43</a>,
 <a href="/jokedb/joke/44">44</a>,
 <a href="/jokedb/joke/45">45</a>,
 <a href="/jokedb/joke/46">46</a>,
 <a href="/jokedb/joke/47">47</a>,
 <a href="/jokedb/joke/48">48</a>,
 <a href="/jokedb/joke/49">49</a>,
 <a href="/jokedb/joke/50">50</a>,
 <a href="/jokedb/joke/51">51</a>,
 <a href="/jokedb/joke/52">52</a>,
 <a href="/jokedb/joke/53">53</a>,
 <a href="/jokedb/joke/54">54

In [10]:
# Iterate over each tag to return just the link 
all_links = soup.find_all("a")

for link in all_links:
    print(link.get("href"))

/jokedb/
/jokedb/
/jokedb/works
/jokedb/login
http://victorianhumour.com/o/
http://victorianhumour.tumblr.com
/?offset=0&paging=40
/?offset=80&paging=40
/jokedb/joke/41
/jokedb/joke/42
/jokedb/joke/43
/jokedb/joke/44
/jokedb/joke/45
/jokedb/joke/46
/jokedb/joke/47
/jokedb/joke/48
/jokedb/joke/49
/jokedb/joke/50
/jokedb/joke/51
/jokedb/joke/52
/jokedb/joke/53
/jokedb/joke/54
/jokedb/joke/55
/jokedb/joke/56
/jokedb/joke/57
/jokedb/joke/58
/jokedb/joke/59
/jokedb/joke/60
/jokedb/joke/61
/jokedb/joke/62
/jokedb/joke/63
/jokedb/joke/64
/jokedb/joke/65
/jokedb/joke/66
/jokedb/joke/67
/jokedb/joke/68
/jokedb/joke/69
/jokedb/joke/70
/jokedb/joke/71
/jokedb/joke/72
/jokedb/joke/73
/jokedb/joke/74
/jokedb/joke/75
/jokedb/joke/76
/jokedb/joke/77
/jokedb/joke/78
/jokedb/joke/79
/jokedb/joke/80
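
Most of the `href` values above are relative to the site root, so they cannot be fetched as they stand. One way to resolve them, sketched here with a few links copied from the output plus one made-up `<a>` without an `href`, is `urllib.parse.urljoin` against the page URL. `"html.parser"` is used instead of `"lxml"` only to keep the sketch dependency-free.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://victorianhumour.com/jokedb/?offset=40&paging=40"

# Three links from the output above, plus a hypothetical anchor with no href.
snippet = """
<a href="/jokedb/works">Works</a>
<a href="http://victorianhumour.tumblr.com">victorianhumour.tumblr.com</a>
<a href="/jokedb/joke/41">41</a>
<a>no href here</a>
"""
demo = BeautifulSoup(snippet, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
absolute_links = [
    urljoin(page_url, a["href"]) for a in demo.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged.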


# Divs

In [11]:
# Find the first <div class="container"> and collect every <div> nested inside it
container = soup.find('div', class_='container').find_all('div')
container

[<div class="navbar-header">
 <button class="navbar-toggle collapsed" data-target=".navbar-collapse" data-toggle="collapse" type="button">
 <span class="sr-only">Toggle navigation</span>
 <span class="icon-bar"></span>
 <span class="icon-bar"></span>
 <span class="icon-bar"></span>
 <span class="icon-bar"></span>
 </button>
 <a class="navbar-brand" href="/jokedb/">Victorian Humour Joke database</a>
 </div>, <div class="collapse navbar-collapse">
 <ul class="nav navbar-nav">
 <li class="active"><a href="/jokedb/">Home</a></li>
 <li class="active"><a href="/jokedb/works">Works</a></li>
 <li><a href="/jokedb/login">log in</a></li>
 <li><a href="http://victorianhumour.com/o/">Transcription gathering</a></li>
 <li><form action="/jokedb/search" class="navbar-form pull-right" method="POST">
 <input class="search-query span2" name="q"/>
 <button class="btn btn-sm btn-primary" type="submit">Search</button>
 </form>
 </li>
 </ul>
 </div>]
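
The difference between `find` (one tag) and `find_all` (a list of tags) matters when chaining calls: further searches can be run on a single tag, but not on the list as a whole. A minimal sketch using a trimmed copy of the navbar markup above, parsed with `"html.parser"` rather than `"lxml"` to avoid the extra dependency:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the navbar markup shown in the output above.
snippet = """
<div class="container">
  <div class="collapse navbar-collapse">
    <ul class="nav navbar-nav">
      <li class="active"><a href="/jokedb/">Home</a></li>
      <li class="active"><a href="/jokedb/works">Works</a></li>
      <li><a href="/jokedb/login">log in</a></li>
    </ul>
  </div>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# find() returns one Tag, so find_all() can be called on it directly.
nav = demo.find("div", class_="container")
labels = [li.a.string for li in nav.find_all("li")]

# class_ matches any one of a tag's classes, so "active" works as a filter.
active = [li.a.string for li in nav.find_all("li", class_="active")]

print(labels)
print(active)
```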

In [12]:
# a = []

# for row in container.findAll("li"):
#     print(row)

In [13]:
soup.t.string

'Modern Fashions.'

In [35]:
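# find_all returns a ResultSet (a list of tags). Calling .string on the
# ResultSet itself, rather than on each element, raises the error shown below.
titles = soup.find_all('j')
titles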

AttributeError: 'ResultSet' object has no attribute 'string'
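
The error above is what happens when `.string` is requested from the whole `ResultSet`. A sketch of the usual fix is to loop over the results and read each tag individually. The `<j>`/`<t>` nesting and the second title below are assumptions made for illustration; only `<t>Modern Fashions.</t>` appears in the real output.

```python
from bs4 import BeautifulSoup

# Assumed structure: each <j> wraps a <t> title. The second entry is made up.
snippet = (
    "<j><t>Modern Fashions.</t></j>"
    "<j><t>A Second Title.</t></j>"
    "<j>entry without a title</j>"
)
demo = BeautifulSoup(snippet, "html.parser")

# Read .string from each element, skipping entries that have no <t> child.
titles = [j.t.string for j in demo.find_all("j") if j.t is not None]

print(titles)
```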

In [15]:
#from bs4 import BeautifulSoup
from bs4 import Comment

In [16]:
# soup = BeautifulSoup("""1<!--The loneliest number-->
#                         <a>2<!--Can be as bad as one--><b>3""", "lxml")

# comments = soup.findAll(text=lambda text:isinstance(text, Comment))
# [comment.extract() for comment in comments]
# print(soup)
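
A runnable variant of the commented-out cell above, offered as a sketch: it finds every HTML comment and removes it from the tree. It uses the `string=` keyword (the newer spelling of `text=`) and `"html.parser"` instead of `"lxml"`; the markup is a compact version of the one in the cell.

```python
from bs4 import BeautifulSoup, Comment

markup = "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3"
demo = BeautifulSoup(markup, "html.parser")

# Comment is a string subclass, so an isinstance check picks comments out.
comments = demo.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() detaches each comment from the parse tree.
for c in comments:
    c.extract()

remaining = demo.find_all(string=lambda s: isinstance(s, Comment))

print(comment_texts)
print(demo)
```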

In [17]:
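# next_sibling is the node immediately after the first <j> tag (None if nothing follows it)
sibling = soup.j.next_sibling
print(sibling)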

None
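
`next_sibling` returns whatever node comes next, which may be a whitespace string rather than a tag, or `None` when nothing follows. A small sketch on made-up markup, contrasting it with `find_next_sibling`, which skips ahead to the next matching tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs separated by a newline.
demo = BeautifulSoup("<p>first</p>\n<p>second</p>", "html.parser")
first = demo.p

raw_sibling = first.next_sibling            # the "\n" text node between the tags
tag_sibling = first.find_next_sibling("p")  # the next <p> tag, skipping text
last_sibling = tag_sibling.next_sibling     # None: nothing follows the last <p>

print(repr(raw_sibling))
print(tag_sibling.string)
print(last_sibling)
```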


In [34]:
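# Iterate over every <j> tag on the page. The loop body is still a placeholder;
# the commented-out lines are earlier experiments kept for reference.
for row in soup.find_all('j'):
    # print(row.t.string)
    # tag = Tag(soup, "newTag", "t")
    # row.find(text="</t>").replaceWith("Hooray!")
    # print(row.contents)
    # for i in row.contents:
    #     print(i)
    # for child in row.children:
    #     print(child)
    pass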
