# Web Scraping with BeautifulSoup: Error handling

- **Installing the BeautifulSoup Python library**
  
  BeautifulSoup is a Python library for extracting data from HTML and XML files. It is widely used in web scraping to pull information out of web pages, and it builds a parse tree from the HTML document. That tree makes page content far easier to navigate than the raw text we've seen with Requests. To build it, BeautifulSoup relies on a parser, a software library that reads the HTML code and turns it into a tree-like representation similar to a DOM (Document Object Model) tree, which makes it easy to navigate to and extract specific data.

  There are several parsers available, each with its own advantages and disadvantages. The three most commonly used parsers are:
  - **`html.parser`**: Python's built-in parser, part of the standard library, so no additional installation is required. It is simple to use for basic parsing tasks and reasonably fast, although slower than `lxml`. It is also less tolerant of malformed markup than `html5lib`, and may have difficulty parsing complex or broken `HTML` files. To use it, pass `'html.parser'` as the second argument when creating the `BeautifulSoup` object.
  - **`html5lib`**: the slowest of the three parsers. It parses pages exhaustively, the same way a web browser does, in order to produce valid `HTML5`, and this comprehensive design also gives it a larger memory footprint. It is the most complete and most lenient option, but the least performant, and it must be installed separately.
  - **`lxml`**: a faster and more syntax-tolerant parser than `html.parser`. It can parse `XML` files in addition to `HTML` files. However, it must be installed separately. This is the parser we recommend you use.

  Once we have created an instance of the `BeautifulSoup` class from the contents of our file (the instance is conventionally named `soup`), we can use its methods to extract data from the page.

  <img src="https://assets-datascientest.s3.eu-west-1.amazonaws.com/133_webscraping/soup.png"  width="600" height="300">
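
The parser is chosen when the `BeautifulSoup` object is constructed. The following is a minimal sketch of that step, assuming a small inline HTML string (invented here for illustration) instead of a downloaded page, and using `html.parser` so that nothing beyond `beautifulsoup4` is required:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" or "html5lib" can be passed the same way once they are installed.
soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())  # text of the first <p>, including nested tags
print(soup.p["class"])    # multi-valued attributes such as class come back as a list
```

Switching parsers only changes the second argument; the navigation calls on `soup` stay the same.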

In [1]:
%pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Obtaining dependency information for soupsieve>1.2 from https://files.pythonhosted.org/packages/4c/f3/038b302fdfbe3be7da016777069f26ceefe11a681055ea1f7817546508e3/soupsieve-2.5-py3-none-any.whl.metadata
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.5
Note: you may need to restart the kernel to


[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip


- **Installing an additional parser**

As you probably know by now, BeautifulSoup uses a parser to analyze documents. Python ships with a built-in parser called `html.parser`. However, other parsers can be used for specific tasks, such as `lxml`, which is commonly recommended. To install `lxml` on your own machine, use the following command:

In [2]:
%pip install lxml

Collecting lxml
  Obtaining dependency information for lxml from https://files.pythonhosted.org/packages/5f/df/6d15cc415e04724ba4c141051cf43709e09bbcdd9868a6c2e7a7073ef498/lxml-4.9.4-cp312-cp312-win_amd64.whl.metadata
  Downloading lxml-4.9.4-cp312-cp312-win_amd64.whl.metadata (3.8 kB)
Downloading lxml-4.9.4-cp312-cp312-win_amd64.whl (3.8 MB)


In [5]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.





**BeautifulSoup's functions**

To extract data from an `HTML` file, we first need to target specific `HTML` elements. To do this, we can use the functions of the `BeautifulSoup` library to select elements in the `HTML` source code.

**The find() function:**
The `find()` function searches for the first occurrence of an `HTML` element matching the specified criteria. It takes as parameters the name of the `HTML` tag and, if required, a dictionary of attributes. For example, if you want to find the first paragraph (`<p>` tag) on the page with the class "my-class", you can use the following call: `soup.find('p', class_="my-class")`

<div class="alert alert-warning">
<i class="fa fa-info-circle"></i> 
    You may notice that we use <code>class_</code> instead of <code>class</code> in <code>BeautifulSoup</code>. <code>class</code> is a reserved keyword in Python, used to define new classes. Reserved keywords cannot be used as variable, function or attribute names, so <code>class_</code> avoids any conflict or ambiguity with the Python interpreter.
</div>

> Occasionally, you may come across a different syntax for the `find()` function, using the '.' accessor. Writing `soup.tag` is shorthand for `soup.find('tag')` and returns the first matching tag; which form to use depends on personal preference and coding style.
> Thus, the following two expressions are equivalent:
>
>`soup.div.div.a`
>
>`soup.find('div').find('div').find('a')`
>
>> Both return the first link (`a`) nested inside two `div` elements of the `soup` document.
> 
> In addition, `find()` accepts several optional arguments to narrow the search. We can filter on the class name, the unique id or any other attribute, either as keyword arguments or through the `attrs` parameter, a dictionary that can hold one or more attributes.

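**(b)** Run the following code cells. The first two fetch a live page and read its title; the concrete example of using the `find()` function with different optional arguments is the cell under **Example** (In [11]) further down.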

In [8]:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Alan_Turing'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.title.text
print(title)

Alan Turing - Wikipedia


In [9]:
title = soup.title.text
title

'Alan Turing - Wikipedia'
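
The request above assumes the page is reachable and the server answers successfully. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the original notebook, and `html.parser` is used to avoid depending on `lxml`:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a BeautifulSoup object for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as error:
        # Base class for connection errors, timeouts, invalid URLs and HTTP errors
        print(f"Request failed: {error}")
        return None
    return BeautifulSoup(response.content, "html.parser")


# A malformed URL triggers the failure path without touching the network.
soup_or_none = fetch_soup("not-a-valid-url")
print(soup_or_none)
```

Called with a valid address such as the Wikipedia URL used above, the same helper would return a parsed document instead of `None`, so the caller only needs a single `is None` check.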

In [12]:
divs = soup.find_all('div', class_ ='content')
for each_div in divs:
    print(each_div.text)

 Initial content 
 A second English content 


In [13]:
first_title = soup.find('h1', id="first")
second_title = first_title.find_next('h1') # We search for the next level 1 title (tag h1) after first_title
print(first_title.text)
print(second_title.text, ", is the next level 1 title after first_title")

 Title 1 
 Title 2  , is the next level 1 title after first_title


In [14]:
first_item = soup.find('li')
second_item = first_item.find_next_sibling('li')
third_item = second_item.find_next_sibling('li') # equivalent to first_item.find_next_sibling('li').find_next_sibling('li')

print(first_item.text)
print(second_item.text, ", is the first 'sibling element' after first_item")
print(third_item.text, ", is the first 'sibling element' after second_item")

 Element 1 
 Element 2  , is the first 'sibling element' after first_item
 Element 3  , is the first 'sibling element' after second_item
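
Chaining `find_next_sibling()` as above works only while a next sibling exists: after the last `<li>` the method returns `None`, and reading `.text` on it would raise an `AttributeError`. A small sketch of a loop that stops at that point, assuming a reduced copy of the list (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<ul><li>Element 1</li><li>Element 2</li><li>Element 3</li></ul>"
items_soup = BeautifulSoup(html, "html.parser")

# Walk the <li> siblings until find_next_sibling() returns None.
texts = []
item = items_soup.find("li")
while item is not None:
    texts.append(item.get_text())
    item = item.find_next_sibling("li")

print(texts)  # ['Element 1', 'Element 2', 'Element 3']
```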


In [15]:
elements_css_class = soup.select('.chip')     # all HTML elements whose class ('.' in CSS) is "chip"
elements_css_id = soup.select('#second')      # all HTML elements whose id ('#' in CSS) is "second"
elements_parent_children = soup.select_one('div > p') # the first paragraph that is a direct child of a div


for element in elements_css_class: # iterates over the list of tags with the 'chip' class
    print(" Element .chip class: ", element.text)
    
print(elements_css_id[0].text)      # We access the 1st element of the returned list
print(elements_parent_children.text)  

 Element .chip class:   Element 1 
 Element .chip class:   Element 2 
 Element .chip class:   Element 3 
 Title 2 
 A new paragraph 


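**Example:** `find()` with different optional arguments. This cell (In [11]) defines the sample `soup` queried by cells In [12] to In [15] above, so run it before them.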

In [11]:
from bs4 import BeautifulSoup as bs

code_source = '''
<html>
  <body>
    <h1 id="first"> Title 1 </h1>
    <div id="main-content"> Unique main content </div>
    <div class="content"> Initial content </div>
    <div class="content" data-lang="en"> A second English content </div>
    <h1 id="second"> Title 2 </h1>
    <ul id="lists">
        <li class="chip"> Element 1 </li>
        <li class="chip"> Element 2 </li>
        <li class="chip"> Element 3 </li>
    </ul>
    <div> 
        <p class="paragraph"> A new paragraph </p>
    </div>
  </body>
</html>
'''

soup = bs(code_source, 'html.parser')
element_by_id = soup.find('div', id= 'main-content')
element_by_class = soup.find('div', class_= 'content')
element_by_attrs = soup.find('div', attrs={'class': 'content', 'data-lang': 'en'})

print("element_by_id : ",element_by_id.text)
print("element_by_class : ",element_by_class.text)
print("element_by_attrs : ",element_by_attrs.text)

element_by_id :   Unique main content 
element_by_class :   Initial content 
element_by_attrs :   A second English content 
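
Two points from this section can be checked in a few lines: the `.` accessor reaches the same tag as chained `find()` calls, and `find()` returns `None` when nothing matches, so `.text` should only be read after a check. The sketch below assumes a tiny document invented for illustration, with hypothetical variable names:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <div id="inner"><a href="/first">First link</a></div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Attribute access and chained find() calls reach the same tag object.
assert demo_soup.div.div.a is demo_soup.find("div").find("div").find("a")

# find() returns None when nothing matches, so guard before reading the text.
missing = demo_soup.find("h1", id="does-not-exist")
title_text = missing.get_text(strip=True) if missing is not None else "(no title found)"
print(title_text)  # (no title found)
```

Without the guard, `missing.text` would raise `AttributeError: 'NoneType' object has no attribute 'text'`, which is one of the most common failures when a page layout changes.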


### Comprehensive list of Beautiful Soup methods for web scraping:

**1. find():**

The `find()` method finds the first instance of a specific element or string within the parsed HTML document. It returns `None` if nothing matches.

**Syntax:**

```python
soup.find(name, attrs, recursive, string, **kwargs)
```

**Parameters:**

* `name`: The tag name of the element you want to find.
* `attrs`: A dictionary of attributes and their values to match.
* `recursive`: A Boolean value that indicates whether to search all descendants (`True`, the default) or only direct children (`False`).
* `string`: The text content of the element you want to find. Older code uses `text`, a deprecated alias of `string`.
* `**kwargs`: Attribute filters passed as keyword arguments, such as `id='first'` or `class_='content'`.

**Examples:**

```python
import re

# Find the first `<a>` element whose `href` attribute contains "google.com"
# (a plain string would only match an href exactly equal to it)
link = soup.find('a', href=re.compile(r'google\.com'))

# Find the first `<div>` element with the class "content"
content_div = soup.find('div', class_='content')

# Find the first `<p>` element whose text contains "Beautiful Soup"
paragraph = soup.find('p', string=re.compile('Beautiful Soup'))
```

**2. find_all():**

The `find_all()` method returns a list of all instances of a specific element or string within the parsed HTML document. The list is empty if nothing matches.

**Syntax:**

```python
soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```

**Parameters:**

Same as `find()`, plus `limit`, the maximum number of results to return. The result is a list instead of a single element.

**Examples:**

```python
import re

# Find all `<a>` elements within the document
links = soup.find_all('a')

# Find all `<div>` elements with the class "content"
content_divs = soup.find_all('div', class_='content')

# Find all `<p>` elements whose text contains "Beautiful Soup"
paragraphs = soup.find_all('p', string=re.compile('Beautiful Soup'))
```
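
The difference in return values matters for error handling: `find_all()` gives back an empty list when nothing matches, while `find()` gives back `None`. A short sketch on an inline snippet (invented for illustration) shows both cases, along with the `limit` argument of `find_all()`:

```python
from bs4 import BeautifulSoup

html = "<div class='content'>A</div><div class='content'>B</div><p>C</p>"
sample = BeautifulSoup(html, "html.parser")

# find_all() always returns a list-like ResultSet, empty when nothing matches.
print(len(sample.find_all("div", class_="content")))  # 2
print(sample.find_all("table"))                       # []

# find() returns None in the same situation.
print(sample.find("table"))                           # None

# limit stops the search after the given number of matches.
print([d.get_text() for d in sample.find_all("div", limit=1)])  # ['A']
```

Looping over the result of `find_all()` is therefore always safe, whereas the result of `find()` needs an explicit `is None` check before any attribute access.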

**3. select():**

The `select()` method uses CSS selectors to find elements within the parsed HTML document.

**Syntax:**

```python
soup.select(selector)
```

**Parameter:**

A CSS selector string.

**Examples:**

```python
# Find all `<a>` elements with the class "link" whose href attribute contains "google.com"
links = soup.select('a.link[href*="google.com"]')

# Find all `<div>` elements with the class "content"
content_divs = soup.select('div.content')

# Find all `<p>` elements whose text contains "Beautiful Soup"
# (`:-soup-contains()` replaces the deprecated `:contains()` in soupsieve 2.1 and later)
paragraphs = soup.select('p:-soup-contains("Beautiful Soup")')
```

**4. select_one():**

The `select_one()` method applies a CSS selector and returns the first matching element, similar to `find()`.

**Syntax:**

```python
soup.select_one(selector)
```

**Parameter:**

Same as `select()`, but returns a single element instead of a list.

**Examples:**

```python
# Find the first `<a>` element with the class "link" whose href attribute contains "google.com"
link = soup.select_one('a.link[href*="google.com"]')

# Find the first `<div>` element with the class "content"
content_div = soup.select_one('div.content')

# Find the first `<p>` element whose text contains "Beautiful Soup"
# (`:-soup-contains()` replaces the deprecated `:contains()` in soupsieve 2.1 and later)
paragraph = soup.select_one('p:-soup-contains("Beautiful Soup")')
```
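
The CSS-selector methods can be tried end to end on a small document. The sketch below assumes soupsieve 2.1 or later (the notebook installed 2.5), where `:-soup-contains()` is the current name of the text-matching pseudo-class and `:contains()` is a deprecated alias; the HTML and variable names are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<a class="link" href="https://www.google.com/search">Search</a>
<a class="link" href="https://example.org">Other</a>
<p>Parsing with Beautiful Soup</p>
"""
css_soup = BeautifulSoup(html, "html.parser")

# [href*="..."] matches a substring of the attribute value.
google_links = css_soup.select('a.link[href*="google.com"]')
print([a["href"] for a in google_links])

# :-soup-contains() matches elements whose text contains the given string.
paragraph = css_soup.select_one('p:-soup-contains("Beautiful Soup")')
print(paragraph.get_text())

# select() returns an empty list and select_one() returns None when nothing matches.
print(css_soup.select("table"), css_soup.select_one("table"))
```

As with `find()` and `find_all()`, only the single-element variant needs a `None` check before its result is used.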

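**5. descendants:**

The `.descendants` attribute is a generator that yields every descendant of the specified element, both tags and text strings, in document order. It is an attribute, not a method, so it is accessed without parentheses and takes no arguments.

**Syntax:**

```python
element.descendants
```

**Parameter:**

None. Iterate over the attribute of the element whose descendants you want.

**Examples:**

```python
from bs4 import Tag

# Find all descendant `<span>` elements within the `<div>` element with the class "content"
spans = [node for node in content_div.descendants if isinstance(node, Tag) and node.name == 'span']

# Find all descendant elements (excluding text content) within the `<p>` element containing "Beautiful Soup"
elements = [node for node in paragraph.descendants if isinstance(node, Tag)]
```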
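
In Beautiful Soup, `.descendants` is exposed as a generator attribute rather than a callable method, so it is iterated over and filtered by hand. A minimal runnable sketch, assuming a small snippet invented for illustration, shows what it yields and how it relates to `find_all()`:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<div class='content'><p>Intro <span>one</span></p><span>two</span></div>"
tree = BeautifulSoup(html, "html.parser")
content_div = tree.find("div", class_="content")

# .descendants is a generator attribute: iterate over it, do not call it.
nodes = list(content_div.descendants)
tags = [n.name for n in nodes if isinstance(n, Tag)]
strings = [str(n) for n in nodes if isinstance(n, NavigableString)]
print(tags)     # ['p', 'span', 'span']
print(strings)  # ['Intro ', 'one', 'two']

# Filtering descendants by tag name gives the same tags as find_all().
spans = [n for n in nodes if isinstance(n, Tag) and n.name == "span"]
assert spans == content_div.find_all("span")
```

When only tags of one kind are needed, `find_all()` is usually the shorter option; iterating over `.descendants` is mainly useful when text nodes and tags must be processed together in document order.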

**6. parents:**

The `.parents` attribute is a generator that yields every ancestor of the specified element, from its direct parent up to the root of the document.