# Web Crawling
![Section Title: Web Crawling](title_pict/web_craw2.png)

Web crawling involves browsing and extracting data from websites. The basic steps of web crawling are:
- Sending a request for information to a website
- Retrieving the content on the website
- Parsing the retrieved data to extract useful information
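
As a rough sketch, the three steps above can be written as two small functions. The helper names `fetch_html` and `extract_title` are hypothetical, and the hand-written `sample_html` string is an assumption that stands in for a downloaded page so the parsing step can be tried without network access.

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1 and 2: send a request and retrieve the page content as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


def extract_title(html):
    """Step 3: parse the HTML and pull out one piece of information."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.text


# Hand-written page used only for illustration (no network access needed).
sample_html = '<html><head><title>Quotes to Scrape</title></head><body></body></html>'
page_title = extract_title(sample_html)
print(page_title)
```

With network access, `extract_title(fetch_html('http://quotes.toscrape.com/'))` would run all three steps against the practice site.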


In this chapter, we will use Python's [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) library to extract data from HTML and XML files.
- Extensible Markup Language (XML) is a markup language designed to store, transmit, and reconstruct data.

We will work with examples using the website [http://quotes.toscrape.com/](http://quotes.toscrape.com/), which offers a collection of quotes, authors, and tags for practicing web scraping and crawling. 
- Through web crawling, we will extract the quotes, authors, and tags.


## BeautifulSoup

The *BeautifulSoup* constructor processes the raw HTML content retrieved from the website.
- It converts the raw HTML data (a string or bytes) into a structured object for easy navigation and search.
- It organizes the content into a tree-like structure for efficient parsing and manipulation.
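
To make the tree-like structure concrete, here is a small sketch that walks down and up a parsed document. The HTML string is hand-written for illustration and is not taken from the practice website.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration.
html = """
<html>
  <body>
    <div class="quote">
      <span class="text">A short quote.</span>
      <small class="author">Some Author</small>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Attribute-style access walks down the tree to the first matching element.
first_div = soup.body.div
span = first_div.span

# Every element knows its parent, so the tree can also be walked upward.
parent_name = span.parent.name

# Direct tag children of the <div>, skipping the whitespace between them.
child_names = [child.name for child in first_div.find_all(recursive=False)]

print(parent_name)
print(child_names)
```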

### urllib Library

If you use the following code for the URL http://quotes.toscrape.com/ mentioned above, you will encounter an error because:
- The response from the URL is in HTML format, not JSON.
- The *json.loads()* function is designed to parse JSON data, not HTML.


``` python
import json
import urllib.request

url = 'http://quotes.toscrape.com/'

response = urllib.request.urlopen(url)
# Raises json.JSONDecodeError because the response body is HTML, not JSON
data = json.loads(response.read().decode())
```
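
The same failure can be reproduced without a network call. The sketch below feeds a hand-written HTML string (an assumption standing in for the downloaded page) to *json.loads()* and catches the resulting exception.

```python
import json

# Hand-written HTML snippet standing in for the downloaded page.
html_text = '<!DOCTYPE html><html><head><title>Quotes to Scrape</title></head></html>'

try:
    json.loads(html_text)
    parsed_as_json = True
except json.JSONDecodeError as error:
    # HTML starts with '<', which cannot begin a JSON value.
    parsed_as_json = False
    print('Not JSON:', error)
```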

Instead of attempting to load the HTML content as JSON, we can use the BeautifulSoup library to parse and process it.
- This will return a BeautifulSoup object which is a data structure representing a parsed HTML or XML document.

In the following code:
- response.read().decode() reads the raw HTML content from the website and decodes it into a string.
- This string contains the HTML content of the webpage.


In [64]:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response.read().decode(), 'html.parser')

The second argument *html.parser* specifies the parser to process the HTML.
- It is Python's built-in HTML parser.
- It reads and understands the HTML content.

In [68]:
type(soup)

bs4.BeautifulSoup

In [71]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

#### Quote

The *find_all()* method is used to retrieve specific information from the HTML content. In the following code:  
- **span**: Refers to the *<span>* HTML element, commonly used to group and style content within a webpage.  
    - The code specifically searches for all *<span>* elements present in the HTML content.  
- **text**: Represents the *class* attribute of the targeted *<span>* elements.  
    - The argument *class_='text'* ensures that only *<span>* elements with the class name *'text'* are included in the search.  
    - The *'text'* class likely represents quotes within the webpage.  

In [84]:
quotes = soup.find_all('span', class_='text')
quotes

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

Each element in the quotes list is a Tag object, which represents an HTML element. 
- This Tag object has a text attribute that can be used to access the textual content contained within the tag.

In [96]:
quotes[0]

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>

In [88]:
type(quotes[0])

bs4.element.Tag

In [92]:
quotes[0].text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
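
To see the effect of the *class_* filter in isolation, the sketch below compares a search for every *span* element with a search restricted to the class *'text'*. The HTML is hand-written for illustration and only imitates the structure of the practice page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML; only two of the three spans carry class="text".
html = """
<div>
  <span class="text">First quote</span>
  <span>by <small class="author">Author One</small></span>
  <span class="text">Second quote</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

all_spans = soup.find_all('span')                  # every <span> element
text_spans = soup.find_all('span', class_='text')  # only class="text"

quote_texts = [span.text for span in text_spans]
print(len(all_spans), len(text_spans))
print(quote_texts)
```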

#### Author

- **small**: Refers to the *<small>* HTML element, commonly used to style or format text as smaller or less prominent within a webpage.  
    - The code specifically searches for all *<small>* elements present in the HTML content.  
- **author**: Represents the *class* attribute of the targeted *<small>* elements.  
    - The argument *class_='author'* ensures that only *<small>* elements with the class name *'author'* are included in the search.  
    - The *'author'* class represents the authors of quotes.

In [104]:
authors = soup.find_all('small', class_='author')
authors

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

Each element in the authors list is a Tag object, which represents an HTML element. 
- This Tag object has a text attribute that can be used to access the textual content contained within the tag.

In [106]:
authors[0]

<small class="author" itemprop="author">Albert Einstein</small>

In [108]:
type(authors[0])

bs4.element.Tag

In [110]:
authors[0].text

'Albert Einstein'
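
Because both lists are collected in document order, quotes and authors can be paired by position. The sketch below assumes that every quote has exactly one author, so that the two lists have the same length; the HTML is hand-written for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating two quote blocks.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <span>by <small class="author">Author Two</small></span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

# zip() pairs the i-th quote with the i-th author.
pairs = [(quote.text, author.text) for quote, author in zip(quotes, authors)]
print(pairs)
```

If a page could contain a quote without an author, this positional pairing would silently misalign, which is one reason to search inside each quote's container instead.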

#### Tag

- **a**: Refers to the *<a>* HTML element, commonly used to create hyperlinks that navigate to other web pages or sections within the same page.  
    - The code specifically searches for all *<a>* elements present in the HTML content.  
- **tag**: Represents the *class* attribute of the targeted *<a>* elements.  
    - The argument *class_='tag'* ensures that only *<a>* elements with the class name *'tag'* are included in the search.  
    - The *'tag'* class represents the tags of quotes.
    - Tags typically describe the themes or categories of the quotes.

In [113]:
tags = soup.find_all('a', class_='tag')
tags

[<a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href="/tag/books/page/1/">books</a>,
 <a class="tag" href="/tag/classic/page/1/">classic</a>,
 <a class="tag" href="/tag/humor/page/1/">humor</a>,
 <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 

In [115]:
tags[0]

<a class="tag" href="/tag/change/page/1/">change</a>

In [117]:
type(tags[0])

bs4.element.Tag

In [119]:
tags[0].text

'change'
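
Besides its text, a Tag object also exposes its HTML attributes through dictionary-style access. The following sketch reads the *href* and *class* attributes of a single hand-written *a* element.

```python
from bs4 import BeautifulSoup

# One hand-written link in the same shape as the tag links on the page.
html = '<a class="tag" href="/tag/change/page/1/">change</a>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('a', class_='tag')

link = tag['href']          # raises KeyError if the attribute is missing
missing = tag.get('title')  # get() returns None for a missing attribute
classes = tag['class']      # class is multi-valued, so this is a list

print(link)
print(missing)
print(classes)
```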

### requests Library

As an alternative, we can use the *requests* library instead of *urllib.request*.

In [130]:
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)

The response status can be checked using the *status_code* attribute. Some common values are:
- 200: Success – No issue reaching the website.
- 404: Not Found – Resource not found.
- 500: Internal Server Error – Server issue.
- 403: Forbidden – Access denied.
- 400: Bad Request – Invalid request.

In [132]:
response.status_code

200
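
The meaning of a status code can also be looked up programmatically. This sketch uses the standard library's `http.HTTPStatus` enum, which is not part of the original example, to print the official phrase for each code listed above.

```python
from http import HTTPStatus

codes = [200, 400, 403, 404, 500]

# Map each numeric code to its standard reason phrase.
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    # Codes from 400 upward signal a client or server error.
    kind = 'error' if code >= 400 else 'success'
    print(code, phrase, kind)
```

When using *requests*, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which avoids parsing an error page by mistake.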

*response.content* is a bytes object that contains the HTML content of the website.

In [146]:
response.content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinki

The *BeautifulSoup* constructor processes the raw HTML content in *response.content* into a structured object.

In [153]:
soup = BeautifulSoup(response.content, 'html.parser')
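
The sketch below illustrates that bytes can be passed directly to *BeautifulSoup*, which decodes them while parsing. The `raw` value is hand-written to imitate *response.content*, including the UTF-8 byte sequences for the curly quotation marks seen in the output above.

```python
from bs4 import BeautifulSoup

# Hand-written bytes imitating response.content; \xe2\x80\x9c and \xe2\x80\x9d
# are the UTF-8 encodings of the opening and closing curly quotation marks.
raw = (b'<html><head><meta charset="UTF-8"></head><body>'
       b'<span class="text">\xe2\x80\x9cHello\xe2\x80\x9d</span>'
       b'</body></html>')

soup = BeautifulSoup(raw, 'html.parser')
quote_text = soup.find('span', class_='text').text

print(type(raw).__name__)         # bytes
print(type(quote_text).__name__)  # str
print(quote_text)
```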

In [155]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

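Now, we will parse the quotes, authors, and tags using a different approach. Instead of searching the whole page for each kind of element, we first collect the container of each quote and then search inside it.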

- **div**: Refers to the `<div>` HTML element, commonly used to group and organize content within a webpage.
    - The code specifically searches for all `<div>` elements present in the HTML content.  
- **quote**: Represents the *class* attribute of the targeted `<div>` elements.  
    - The argument *class_='quote'* ensures that only `<div>` elements with the class name *'quote'* are included in the search.  

In [202]:
quotes = soup.find_all('div', class_='quote')

In [200]:
quotes

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

In [161]:
quotes[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

#### Quote

In [216]:
quotes[0].find('span')

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>

In [219]:
quotes[0].find('span').text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

#### Author

In [222]:
quotes[0].find('small')

<small class="author" itemprop="author">Albert Einstein</small>

In [224]:
quotes[0].find('small').text

'Albert Einstein'

#### Tag

In [227]:
quotes[0].find_all('a')

[<a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>]

In [237]:
quotes[0].find_all('a')[0].text

'(about)'

We can extract all tags using a list comprehension:

In [233]:
tags = [tag.text for tag in quotes[0].find_all('a', class_='tag')]
tags

['change', 'deep-thoughts', 'thinking', 'world']
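
Putting the three lookups together, each quote container can be turned into one record. The sketch below does this on hand-written HTML that imitates the structure of the practice page; the dictionary keys `text`, `author`, and `tags` are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating two quote containers.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <span>by <small class="author">Author One</small>
  <a href="/author/Author-One">(about)</a></span>
  <div class="tags">
    <a class="tag" href="/tag/life/page/1/">life</a>
    <a class="tag" href="/tag/love/page/1/">love</a>
  </div>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <span>by <small class="author">Author Two</small>
  <a href="/author/Author-Two">(about)</a></span>
  <div class="tags">
    <a class="tag" href="/tag/humor/page/1/">humor</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

records = []
for quote in soup.find_all('div', class_='quote'):
    # Each search is limited to the current quote container.
    records.append({
        'text': quote.find('span', class_='text').text,
        'author': quote.find('small', class_='author').text,
        'tags': [a.text for a in quote.find_all('a', class_='tag')],
    })

for record in records:
    print(record)
```

Searching inside each container keeps a quote, its author, and its tags together, even when the number of tags differs from quote to quote.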

## Regular Expressions

In this section, we will extract all links in the *href* attributes that start with */tag*.
- The *href* attribute contains links to other websites or specific sections within a website.
- We will use regular expressions to accomplish this task.

First, we create a *BeautifulSoup* object using the HTML content.

In [306]:
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

Next, we collect all `<a>` HTML elements:

In [311]:
a_list = soup.find_all('a')
a_list

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href

The element at index 10 of the list above is:

In [286]:
a_list[10]

<a href="/author/Albert-Einstein">(about)</a>

It can be converted into a string using the *decode()* method.

In [288]:
a_list[10].decode()

'<a href="/author/Albert-Einstein">(about)</a>'

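In the loop that follows, each element of *a_list* is converted to a string and searched with a regular expression.
- The regular expression `href="(/tag[^"]+)"` is applied to each string. It works as follows:
    - It looks for the literal text `href="` in the HTML content.
    - The parentheses capture the part that follows, which must start with `/tag`.
    - `[^"]+` continues capturing characters until the closing quotation mark `"` is reached, so the captured link does not include the quotation marks.
    - *re.findall()* returns the captured links as a list, which is empty for elements whose *href* does not start with */tag*.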
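
The behaviour of this pattern can be checked on two hand-written strings before it is applied to the whole list. The variable names below are chosen for illustration.

```python
import re

# Raw string, so backslashes (if any were added later) are not interpreted.
pattern = r'href="(/tag[^"]+)"'

tag_link = '<a class="tag" href="/tag/change/page/1/">change</a>'
author_link = '<a href="/author/Albert-Einstein">(about)</a>'

tag_matches = re.findall(pattern, tag_link)        # href starts with /tag
author_matches = re.findall(pattern, author_link)  # href starts with /author

print(tag_matches)
print(author_matches)
```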

In [304]:
import re

# Capture every href value that starts with /tag
pattern = r'href="(/tag[^"]+)"'
href_list = []
for a_item in a_list:
    href_list += re.findall(pattern, a_item.decode())
href_list

['/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/']
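
As a possible alternative to converting each element to a string, *find_all()* also accepts a compiled regular expression as an attribute filter. The sketch below is a different technique from the loop above, shown on hand-written HTML; it keeps only the *a* elements whose *href* starts with */tag* and then reads the attribute directly.

```python
import re
from bs4 import BeautifulSoup

# Hand-written links imitating the mix of link types on the page.
html = """
<a href="/login">Login</a>
<a href="/author/Albert-Einstein">(about)</a>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/love/">love</a>
"""

soup = BeautifulSoup(html, 'html.parser')

# The pattern is matched against each element's href value.
tag_links = soup.find_all('a', href=re.compile(r'^/tag'))
href_list = [a['href'] for a in tag_links]

print(href_list)
```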