<center>
    <img src="files/logo2.png" width="200" />

<h1 style="color: #00BFFF;">Web Scraping: Mining the Web for Data</h1>
<img src="files/cavalls-de-valltorta.jpg" width="400" />
<h3 style="color: #00BFFF;">"Data Hunting and Gathering"</h3>
</center>

<!-- vscode-markdown-toc -->
* 1. [HTML](#HTML)
* 2. [CSS Selectors](#CSS)
* 3. [JavaScript](#JavaScript)
* 4. [Summary](#Summary)

<!-- vscode-markdown-toc-config
	numbering=true
	autoSave=true
	/vscode-markdown-toc-config -->
<!-- /vscode-markdown-toc -->

____________

Welcome to the first part of our journey into the world of web scraping. Web scraping, also known as web harvesting or web data extraction, is a technique used for extracting data from websites. This process involves fetching the web page and then extracting data from it.

In this lesson, we'll use the `requests` library to fetch web pages and `Beautiful Soup` from the `bs4` package to parse these pages and extract information.

<h3 style="color: #00BFFF;">By the end of this lesson, you'll:</h3>

- Understand **the value of web scraping**.
- Extract data from basic **HTML** and **CSS** structures.
- Learn to use **BeautifulSoup** and the **requests** library.
- Dive deeper into **web scraping ethics**.
- Explore advanced techniques for **handling JavaScript-based content**.

<h2 style="color: #008080;">Web Scraping?</h2>

![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExbzQ4eDdobnZlenhtN3c5MndmcDZpMW4wdXZzZTcxaDl1Zmo2YWt3dSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/SpopD7IQN2gK3qN4jS/giphy.gif)

Understanding how to scrape data from the web is a valuable skill for any data professional. In the digital era, **data is the new gold**, and web scraping is the mining equipment that helps you extract it. Web scraping allows you to automatically collect and analyze data from websites, providing insights that can be a game-changer in competitive fields.

#### Why Learn Web Scraping?
- **Data Availability**: The internet is a vast, dynamic source of data that can be used for all kinds of analyses, from understanding market trends to conducting academic research.
- **Automation**: Manual data collection can be tedious and time-consuming. Web scraping automates the process, saving time and ensuring a consistent data pipeline.
- **Competitive Advantage**: In fields like marketing, finance, and e-commerce, having timely and relevant data is crucial. Web scraping provides a competitive edge by allowing you to access real-time information.

#### How Does Web Scraping Work?
- **Fetching Web Pages**: Web scraping starts with making a request to a webpage to fetch its content.
- **Parsing the Data**: Once the content is fetched, it is parsed using tools like BeautifulSoup or Scrapy to extract relevant information. This is especially useful when the data isn't available through public APIs.
- **Data Usage**: After extracting data, it can be cleaned, analyzed, and stored for further data analytics tasks, enabling deeper insights and more informed decision-making.

#### Real-World Applications
- **Market Research**: Gather insights about competitors, identify customer sentiments, or track market trends by scraping review sites and social media.
- **Price Comparison**: Aggregate pricing data from different e-commerce platforms to create comparison tools.
- **Social Media Analysis**: Collect data from social networks for sentiment analysis, trend spotting, or brand monitoring.

> **Note**: While web scraping is a powerful tool, always ensure compliance with website terms of service and legal regulations regarding data usage.


<h2 style="color: #008080;">Web structure</h2>

The fundamental web technologies that form the structure of the websites we aim to scrape are:

- **HTML**: Standing as the backbone of almost all websites, HTML, the core markup language, is instrumental in creating web pages. It houses all the content available on a webpage.
  
- **CSS**: This stylesheet language works alongside HTML, taking charge of the presentation aspect of the webpages. It controls how HTML elements are displayed, setting the stage for a visually pleasing and organized web interface.

- **JavaScript**: Adding a dynamic touch to the websites, JavaScript comes into play to create interactive and animated content. This programming language has the power to alter webpage content even after it has loaded, bringing a dynamic and responsive element to web designs.

In this lesson, we will work with the **HTML** and **CSS** of websites using the **BeautifulSoup** and **requests** libraries, and we will use **Selenium** to handle **JavaScript** content.

# HTML

![image.png](attachment:image.png)

The most fundamental web pages are constructed using HTML and CSS. These technologies serve two primary purposes: **HTML (Hypertext Markup Language)** structures and stores the content, making it the primary target for web scraping, while **CSS (Cascading Style Sheets)** formats and styles the content, highlighting visual elements like fonts, colors, borders, and layout.

HTML is a markup language typically rendered by web browsers. It uses 'tags' to define elements on a web page. A typical tag format includes a tag name, attributes (if any), and the content between opening and closing tags.

#### Key Components of an HTML File

- **HTML Structure**:
   - **Hierarchical**: HTML documents are structured hierarchically, meaning elements are nested within other elements, forming a tree-like structure.
   - **Tags**: These are the building blocks of HTML, defining elements that hold different types of content.
   - **Attributes**: HTML tags can have attributes, which define properties of an element and are used to set various characteristics such as class, ID, and style.

- **DOCTYPE Declaration**: 
  - Begins with `<!DOCTYPE html>`, indicating the use of HTML5.
  - Earlier HTML versions had different DOCTYPEs.

- **HTML Tag**: 
  - The `html` tag (and its closing `/html` tag) encloses the entire web page content.

- **Head and Body**: 
  - The `head` section often includes the `title` tag, defining the webpage's name, links to CSS stylesheets, and JavaScript files for dynamic behavior.
  - The `body` contains the visible webpage content.

- **Common HTML Elements**:
  - **Headings and Paragraphs**: Use `h1` to `h6` for headings (the number sets the heading level) and `p` for paragraphs.
  - **Hyperlinks**: Defined with the `href` attribute in `a` (anchor) tags.
  - **Images**: Embedded using `img` tags with the `src` attribute. Note: `img` is self-closing.
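
The hierarchy, tags, and attributes described above can be made visible with a small sketch. It uses Python's built-in `html.parser` module rather than any scraping library, and the sample page, the `OutlineBuilder` class, and the `VOID_TAGS` set are all made up for illustration:

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (illustrative subset)
VOID_TAGS = {"img", "br", "meta", "link", "hr", "input"}

class OutlineBuilder(HTMLParser):
    """Records every opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        attr_names = ", ".join(name for name, _ in attrs)
        label = f"{tag} [{attr_names}]" if attr_names else tag
        self.outline.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

sample_page = """<!DOCTYPE html>
<html>
<head><title>Ducks</title></head>
<body>
<h1>About ducks</h1>
<p>Read <a href="https://example.com/ducks">more</a>.</p>
<img src="duck.jpg">
</body>
</html>"""

builder = OutlineBuilder()
builder.feed(sample_page)
print("\n".join(builder.outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document tree.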

In [None]:
%%html

<!-- Start of the HTML head section -->
<head>
    <!-- Title of the webpage -->
    <title>
        Basic knowledge for web scraping.
    </title>	
</head>
<!-- Start of the HTML body section -->
<body>
    <!-- Header 1 indicating the subject of the content -->
    <h1>About HTML
    </h1>

    <!-- Empty paragraph for now; the images are added in the next cell -->
    <p>

    </p>
</body>

In [None]:
%%html

<!-- Start of the HTML head section -->
<head>
    <!-- Title of the webpage -->
    <title>
        Basic knowledge for web scraping.
    </title>	
</head>
<!-- Start of the HTML body section -->
<body>
    <!-- Header 1 indicating the subject of the content -->
    <h1>About HTML
    </h1>
    <!-- Paragraph explaining what HTML is and providing a link for further information -->
    <p>HTML (HyperText Markup Language) is the basic language for providing content on the web. It is a tagged language. You can read more about it at the <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
    
    <!-- Paragraph indicating that one of the following images is clickable -->
    <p> One of the following rubberduckies is clickable
    </p>
    <!-- Image of a rubber ducky; this one is not clickable -->
    <p>
        <img src = "files/rubberduck.jpg"/>
    
        <!-- Clickable image (hyperlinked) of a rubber ducky -->
        <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
    </p>
</body>

![image.png](attachment:image.png)

<h3 style="color: #00BFFF;">Use Case: Three Sisters doc from Alice in Wonderland</h3>

In [1]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

In [None]:
html_doc

<h2 style="color: #00BFFF;">Creating the Soup</h2>

In [3]:
# pip install bs4

In [4]:
from bs4 import BeautifulSoup

![Cool GIF](https://i.pinimg.com/originals/e2/9e/f7/e29ef7444d2b503da0720e67ffee4002.gif)

In [None]:
# parse the element
soup = BeautifulSoup(html_doc, 'html.parser') # standard
soup

In [None]:
type(soup)

#### `.prettify()`

In [None]:
# Makes things pretty!!!
print(soup.prettify())

#### `.title`

In [None]:
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside the tag

In [None]:
title = soup.title.string
title

In [None]:
print(soup.title.parent)  # the <head> tag that contains the title
print(soup.title.name)    # the name of the tag: "title"

#### `.p`

In [None]:
soup.p

In [None]:
soup.p["class"]

#### `.a`

In [None]:
soup.a

In [None]:
soup.find_all("a")

#### `.find()`

In [None]:
soup.find(id="link1")

#### `.body`

In [None]:
soup.body.parent.name

In [None]:
soup.find("p")

#### `.find_all()`

`find` and `find_all` (also available under the older name `findAll`) are methods used to search the soup tree for tags that match a certain criterion.

1. **find**:
    - Returns only the **first** tag that matches a given set of criteria.
    - Useful when you know there's only one tag of interest or you only want the first occurrence.
    - Example: If you have multiple `<p>` tags on a page and you use `soup.find('p')`, you'll get only the first `<p>` tag.

2. **find_all (older alias: findAll)**:
    - Returns a **list** of tags that match the given criteria.
    - Useful when you want to capture all occurrences of a particular tag or set of tags.
    - Example: Using `soup.find_all('p')` will give you a list containing all `<p>` tags on the page.

Here's a simple illustration:

```html
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
```

Using `find('p')` would return the first `<p>` tag (the one containing "First paragraph."), while `find_all('p')` would return a list with both `<p>` tags. If nothing matches, `find` returns `None` and `find_all` returns an empty list.
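
The difference can be checked directly on the snippet above. This is a small sketch; the variable names (`demo_soup`, `first_p`, `all_p`) are chosen here only to avoid overwriting the lesson's `soup`:

```python
from bs4 import BeautifulSoup

snippet = """
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

first_p = demo_soup.find("p")    # a single tag
all_p = demo_soup.find_all("p")  # a list-like collection of tags

print(first_p.get_text())
print([p.get_text() for p in all_p])

# When nothing matches, the two methods behave differently
print(demo_soup.find("table"))      # None
print(demo_soup.find_all("table"))  # []
```

Because `find` can return `None`, it is worth checking the result before calling methods such as `.get_text()` on it.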


To get all the paragraph content from a page:

In [None]:
p_tags = soup.find_all("p")
len(p_tags)

In [None]:
p_tags

A common web scraping task is to extract all urls from a website:

#### `.get()`

In [None]:
soup.find_all("a")

In [None]:
for link in soup.find_all("a"):
    print(link.get("href"))

#### `.get_text()`

In [None]:
for link in soup.find_all("a"):
    print(link.get_text())

**Access the attribute**: Once you have the element, use the `.get()` method to access the attribute value.

```python
link_url = link_element.get('href')
```
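
One reason to prefer `.get()` over square-bracket access is how each handles a missing attribute. A minimal sketch on a made-up two-link snippet (the names `links_soup`, `with_href`, and `without_href` are illustrative):

```python
from bs4 import BeautifulSoup

links_html = '<a href="http://example.com/elsie">Elsie</a><a>No link here</a>'
links_soup = BeautifulSoup(links_html, "html.parser")
with_href, without_href = links_soup.find_all("a")

print(with_href.get("href"))                # the attribute value
print(without_href.get("href"))             # None, no error
print(without_href.get("href", "missing"))  # optional fallback value

# Square brackets raise an error when the attribute does not exist
try:
    without_href["href"]
except KeyError:
    print("This <a> tag has no href attribute")
```

When looping over many links, `.get()` lets you skip or flag incomplete tags instead of stopping at the first one.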

To search for HTML elements by class in a webpage using BeautifulSoup, you can also use the `find` and `find_all` methods.

1. **Using `find` method to get the first matching element**:
   
   ```python
   result = soup.find(class_='your-class-name')
   ```

2. **Using `find_all` method to get a list of all matching elements**:

   ```python
   results = soup.find_all(class_='your-class-name')
   ```
   
Note that we are using the `class_` parameter because `class` is a reserved keyword in Python.
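
As a quick illustration of both calls, here is a sketch on a made-up shopping list; the HTML, the class names, and the variable names are assumptions for the example only:

```python
from bs4 import BeautifulSoup

shop_html = """
<ul>
  <li class="product">Notebook <span class="price">4.50</span></li>
  <li class="product">Pencil <span class="price">0.80</span></li>
  <li class="note">Prices include tax.</li>
</ul>
"""
shop_soup = BeautifulSoup(shop_html, "html.parser")

first_price = shop_soup.find(class_="price")     # first match only
all_prices = shop_soup.find_all(class_="price")  # every match

print(first_price.get_text())
print([price.get_text() for price in all_prices])
```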

### 💡 **Activity**: Your turn:

Write code to print the following contents (not including the html tags, only human-readable text):

1. All the "fun facts".

2. The names of all the places.

3. The content (name and fact) of all the cities (only cities, not countries!)

4. The names (not facts!) of all the cities (not countries!)

In [23]:
geography = """
<!DOCTYPE html>
<html>
<head><title>Geography</title></head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [24]:
soup = BeautifulSoup(geography, 'html.parser')
# print(soup.prettify())

In [None]:
# hint: all the fun facts
soup.find_all("p")

# CSS

![image.png](attachment:image.png)

**CSS selectors** are patterns used to select and manipulate one or more elements in an HTML or XML document. When web scraping with Python, CSS selectors can be used to target specific elements of interest within the page's content.

The `select` method in BeautifulSoup allows you to pass a CSS selector and returns a list of elements matching that selector.

1. **Tag Selector**: Targets elements by their tag name.
   - `p`: selects all `<p>` elements.
   - `soup.select("p")` will retrieve all `<p>` elements

2. **Class Selector**: Targets elements by their class attribute.
   - `.classname`: selects all elements with `class="classname"`.
   - If the `class` attribute holds several space-separated classes (e.g. `class="btn primary"`), chain them with dots instead of spaces: `.btn.primary`
   - `soup.select(".classname")`
   - To combine both, we can have `soup.select("tagname.classname")`

3. **Descendant Selector**: Targets an element that is a descendant of another element.
   - `div p`: selects all `<p>` elements inside a `<div>` element.
   - `.class1 .class2`: selects all elements with class2 that are descendants of an element with class1.
   
4. **Attribute Selector**: Targets elements based on their attributes and values.
   - `a[href]`: selects all `<a>` elements with an `href` attribute.
   - `a[href="https://www.example.com"]`: selects all `<a>` elements with an `href` value of "https://www.example.com".
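
The four selector types can be tried out on a small, made-up document. This is only a sketch; the HTML and the names `css_soup`, `tag_matches`, and so on are illustrative:

```python
from bs4 import BeautifulSoup

css_html = """
<div class="menu main">
  <p class="intro">Welcome</p>
  <a href="https://www.example.com">Home</a>
  <a>Placeholder</a>
</div>
<p>Outside the div</p>
"""
css_soup = BeautifulSoup(css_html, "html.parser")

tag_matches = css_soup.select("p")             # tag selector
class_matches = css_soup.select(".intro")      # class selector
multi_class = css_soup.select("div.menu.main") # tag plus two classes
descendants = css_soup.select("div p")         # descendant selector
with_href = css_soup.select("a[href]")         # attribute present
exact_href = css_soup.select('a[href="https://www.example.com"]')  # attribute value

for name, result in [("p", tag_matches), (".intro", class_matches),
                     ("div.menu.main", multi_class), ("div p", descendants),
                     ("a[href]", with_href)]:
    print(name, "->", len(result), "match(es)")

# select_one returns only the first match (or None)
print(css_soup.select_one("p").get_text())
```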

### 💡 **Activity**: Using CSS selectors

One. Step. At a time:

- Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

**Everyone should reach level 6!**

### 💡 **Activity**: Using HTML selectors

More exercises with solutions to practise: https://www.w3resource.com/python-exercises/BeautifulSoup/index.php

In [None]:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>An example of HTML page</title>
</head>
<body>
<h2>This is an example HTML page</h2>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc at nisi velit,
aliquet iaculis est. Curabitur porttitor nisi vel lacus euismod egestas. In hac
habitasse platea dictumst. In sagittis magna eu odio interdum mollis. Phasellus
sagittis pulvinar facilisis. Donec vel odio volutpat tortor volutpat commodo.
Donec vehicula vulputate sem, vel iaculis urna molestie eget. Sed pellentesque
adipiscing tortor, at condimentum elit elementum sed. Mauris dignissim
elementum nunc, non elementum felis condimentum eu. In in turpis quis erat
imperdiet vulputate. Pellentesque mauris turpis, dignissim sed iaculis eu,
euismod eget ipsum. Vivamus mollis adipiscing viverra. Morbi at sem eget nisl
euismod porta.</p>
<p><a href="https://www.w3resource.com/html/HTML-tutorials.php">Learn HTML from
w3resource.com</a></p>
<p><a href="https://www.w3resource.com/css/CSS-tutorials.php">Learn CSS from
w3resource.com</a></p>
</body>
</html>
"""

In [None]:
### Write a Python program to find the title tags from a given html document.



In [None]:
### Write a Python program to retrieve all the paragraph tags from a given html document



In [None]:
### Write a Python program to get the number of paragraph tags of a given html document.



In [None]:
### Write a Python program to extract the text in the first paragraph tag of a given html document.



In [None]:
### Write a Python program to find the length of the text of the first <h2> tag of a given html document.



In [None]:
### Write a Python program to find the href of the first <a> tag of a given html document.



In [None]:
### Write a Python program to extract all the text from a given web page.



![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExZDhsYmRqMXh5M2t3Njh2dmQ2NW5zaTI4emw3azZpcGZqYzYxZ255cCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/IwTWTsUzmIicM/giphy.gif)

<h1 style="color: #00BFFF;">00 | Use case 1</h1>

In [None]:
# pip install requests

In [None]:
# 📚 Basic libraries
import pandas as pd

#❗New Libraries !
from bs4 import BeautifulSoup
import requests

# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

Let's go to https://books.toscrape.com/, where we'll see a collection of books.
Notice how each book has the following elements:

- Title

- Price

Our objective is going to be to scrape this information and store it in a pandas dataframe.



### Exploring Web Page Structures

To inspect the underlying HTML of a web page, right-click anywhere on the page. 
- Choose "View Page Source" in browsers like Chrome or Firefox.
- For Internet Explorer, choose "View Source," and for Safari, select "Show Page Source."
- (In Safari, if this option isn't visible, navigate to Safari Preferences, click on the Advanced tab, and enable "Show Develop menu in menu bar.")

To embark on your web scraping journey, you just need to grasp **three foundational aspects** of HTML:

### Fact 1: HTML is Built on Tags

At its core, HTML is composed of content enveloped in `<tags>`. Tags come in various types:
 * **Headings**: `<h1>`, `<h2>`, `<h3>`, `<h4>`...
 * **Phrasing**: `<b>`, `<strong>`, `<sub>`, `<i>`, `<a>`...
 * **Embedded Content**: `<audio>`, `<img>`, `<video>`, `<iframe>`...
 * **Tabulated Data**: `<table>`, `<tr>`, `<td>`, `<tbody>`...
 * **Page Sections**: `<header>`, `<section>`, `<nav>`, `<article>`...
 * **Metadata and Scripts**: `<meta>`, `<title>`, `<script>`, `<link>`...

### Fact 2: Tags Can Have Attributes

HTML tags can possess "attributes," which are defined within the opening tag itself. Examine the following example:
- `<a class="text-monospace" id="name_132" href="http://www.example.com"> Page Content </a>`: This `a` tag has the following attributes:
    + `class`: With the value "text-monospace". Remember, the class isn't unique across the page.
    + `id`: With the value "name_132". IDs are meant to be unique identifiers for tags on the page.
    + `href`: With the value "http://www.example.com". The href commonly represents a link to another section of the page or to an external website.

**Key Notes**:
- The `id` attribute should be unique for a tag; no two tags should share the same `id`.
- The `class` attribute isn't meant to be unique. Instead, it often groups tags exhibiting similar behavior or styles.

For web scraping purposes, **understanding the semantics** behind terms like `<span>`, `class`, or `short-desc` **isn't crucial**.
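
To see how these attributes look once parsed, the example tag can be loaded into BeautifulSoup. A small sketch; the variable names `tag_html` and `anchor` are chosen here for illustration:

```python
from bs4 import BeautifulSoup

tag_html = '<a class="text-monospace" id="name_132" href="http://www.example.com"> Page Content </a>'
anchor = BeautifulSoup(tag_html, "html.parser").a

print(anchor.name)    # the tag name
print(anchor.attrs)   # all attributes as a dictionary
print(anchor["id"])   # a single attribute value
print(anchor["class"])  # class is returned as a list, since a tag can have several classes
print(anchor.get_text(strip=True))  # the content between the opening and closing tags
```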


### Fact 3: Tags Can Be Nested

Imagine the following segment of HTML code:

`Hello <strong><em>Ironhack</em> students</strong>`

Here, the phrase **Ironhack students** would be displayed in bold since it resides between the `<strong>` and `</strong>` tags. Additionally, the word ***Ironhack*** would be italicized due to the `<em>` tag, which signifies italic formatting. However, the word "Hello" remains unaffected by any formatting, as it lies outside both the `<strong>` and `<em>` tags. This results in the display:

Hello ***Ironhack* students**

This example illustrates a key principle: **tags influence the text from their opening to their closing points,** even if they are nested within other tags.
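
The same nesting can be explored in code. A minimal sketch, parsing the segment above with BeautifulSoup (the name `nested_soup` is illustrative):

```python
from bs4 import BeautifulSoup

nested_html = "Hello <strong><em>Ironhack</em> students</strong>"
nested_soup = BeautifulSoup(nested_html, "html.parser")

# Text covered by each tag, from its opening to its closing point
print(nested_soup.strong.get_text())
print(nested_soup.em.get_text())

# <em> sits inside <strong>, so <strong> is its parent
print(nested_soup.em.parent.name)

# The full text, including the part outside both tags
print(nested_soup.get_text())
```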

In [None]:
# get the link
link = "https://books.toscrape.com/"

In [None]:
response = requests.get(link)
response.status_code # 200 status code means OK!

![HTTPStatus](https://www.whatismyip.com/static/51e6afd43d8a39f7a6e03805c1328e11/https-codes.webp)
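
A numeric status code can be mapped to its standard name with Python's built-in `http.HTTPStatus`. The sketch below is independent of `requests`, and the four codes are just common examples:

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
phrases = {code: HTTPStatus(code).phrase for code in (200, 301, 404, 500)}

for code, phrase in phrases.items():
    print(code, phrase)
```

With `requests`, `response.ok` is `True` for status codes below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.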

In [None]:
html = response.text

In [None]:
soup = BeautifulSoup(html, 'html.parser')

<h2 style="color: #008080;">Getting all Books title</h2>

### Selecting Specific Elements in Web Scraping

When diving into web scraping, it's essential to target specific elements efficiently. To home in on the precise content you need, consider filtering tags based on:

 * **Tag Name**: The main type of the element (e.g., `<div>`, `<a>`, `<p>`).
 * **Class**: A descriptor that groups multiple elements with similar characteristics.
 * **ID**: A unique identifier assigned to a particular element.
 * **Other Attributes**: Additional properties like `href`, `title`, or `lang` that can further specify the elements of interest.
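
Each of these filters maps onto an argument of `find` or `find_all`. The sketch below uses a made-up catalogue snippet; the HTML, titles, and variable names are hypothetical and do not come from the site scraped in this lesson:

```python
from bs4 import BeautifulSoup

page_html = """
<div id="catalogue">
  <article class="book"><a href="/books/1" title="Book One" lang="en">Book O...</a></article>
  <article class="book"><a href="/books/2" title="Livre Deux" lang="fr">Livre D...</a></article>
  <p class="footer">2 results</p>
</div>
"""
page_soup = BeautifulSoup(page_html, "html.parser")

by_tag = page_soup.find_all("article")                 # tag name
by_class = page_soup.find_all(class_="footer")         # class
by_id = page_soup.find(id="catalogue")                 # ID
by_attr = page_soup.find_all("a", attrs={"lang": "fr"})  # any other attribute

# title=True keeps only tags that have a title attribute at all
full_titles = [a["title"] for a in page_soup.find_all("a", title=True)]

print(len(by_tag), by_class[0].get_text(), by_id.name)
print(by_attr[0]["href"])
print(full_titles)
```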


In [None]:
# Loop method: print the text of every <h3> tag
for h3 in soup.find_all("h3"):
    print(h3.get_text())

In [None]:
# comprehension method to get titles
titles = [a.get_text() for a in soup.find_all("h3")]
titles

In [None]:
# Nested-loop method: go through each <li> of the book list and read its <h3>
for element in soup.find_all("ol", {"class": "row"}):
    for i in element.find_all("li"):
        print(i.h3.get_text())

In [None]:
# Collect the titles from the `title` attribute of each book link
titles = []
for a in soup.select("ol.row li h3 a"):
    titles.append(a["title"])
titles

<h2 style="color: #008080;">Getting all Books Prices</h2>

In [None]:
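# simplified version
# Keep only what follows the pound sign; slicing by position breaks if the text is decoded differently
for element in soup.find_all("p", {"class": "price_color"}):
    print(element.get_text().split("£")[-1])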

In [None]:
prices = []
for element in soup.find_all("ol", {"class": "row"}):
    for i in element.find_all("p", {"class": "price_color"}):
        # Keep only what follows the pound sign, whatever precedes it
        prices.append(i.get_text().split("£")[-1])

In [None]:
prices

<h1 style="color: #00BFFF;">02 | Extracted Data</h1>

In [None]:
dict_data = {"book_titles": titles,
            "book_prices": prices}

In [None]:
df = pd.DataFrame(dict_data)
df

#### 💡 **Activity** Build a Basic HTML Web Page

Let's put your HTML knowledge into practice:

- Create a file named 'example.html' in your favorite text editor.
- Build a basic HTML web page containing elements like `title`, `h1`, `p`, `img`, and `a` tags. Remember that nearly all tags need to be closed with a `/tag`.

This exercise aims to familiarize you with the basic structure of HTML and how various elements come together to form a web page.

<h1 style="color: #00BFFF;">Use case 2</h1>

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In [None]:
response = requests.get('https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)')
response.status_code

In [None]:
html = response.text

In [None]:
soup = BeautifulSoup(html, 'html.parser')

BeautifulSoup allows filtering results using combinations, such as filtering by tag and class.

```python
tags = soup.find_all(name=tag_name, class_=class_name)
```
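
Combining tag and class has one subtlety worth seeing in action: a single class name matches tags that carry several classes, while a string with spaces is compared against the exact attribute value. A small sketch on made-up HTML (all names below are illustrative):

```python
from bs4 import BeautifulSoup

tables_html = """
<table class="wikitable sortable"><tr><td>Season 1</td></tr></table>
<table class="infobox"><tr><td>Genre</td></tr></table>
<div class="wikitable">Not a table</div>
"""
tables_soup = BeautifulSoup(tables_html, "html.parser")

# Tag + one class: matches the table even though it has a second class
single = tables_soup.find_all(name="table", class_="wikitable")

# Tag + exact class string: matches only this exact attribute value
exact = tables_soup.find_all("table", class_="wikitable sortable")

# Same classes in a different order: no match with find_all
reordered = tables_soup.find_all("table", class_="sortable wikitable")

# A CSS selector matches regardless of class order
css = tables_soup.select("table.sortable.wikitable")

print(len(single), len(exact), len(reordered), len(css))
```

When class order or extra classes may vary between pages, the single-class filter or the CSS selector is usually the more robust choice.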

In [None]:
html_table = soup.find_all("table", {"class": "wikitable sortable"})[0]

To extract the table from the page, you can:

1. Use the `find_all` method to locate the `<table>` tags with the specific class (`wikitable sortable` in this case) and keep the first match.
2. Convert the HTML of that table into a DataFrame with `pd.read_html`.

<h1 style="color: #00BFFF;">02 | Extracted Data</h1>

In [None]:
html_pretty = html_table.prettify()

In [None]:
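from io import StringIO

# Recent pandas versions expect a file-like object rather than a raw HTML string
df2 = pd.read_html(StringIO(html_pretty))[0]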
df2

<h1 style="color: #00BFFF;">Ethical Considerations in Web Scraping</h1>

Web scraping, while a powerful technique for data extraction, comes with significant ethical and legal responsibilities. It's crucial to navigate this landscape with a deep understanding and respect for these considerations.

#### Respecting Website Policies and Laws
- **Adhering to Terms of Service**: Every website has its own set of rules, usually outlined in its Terms of Service (ToS). It's important to read and understand these rules before scraping, as violating them can have legal implications.
- **Following Copyright Laws**: The data you scrape is often copyrighted. Ensure that your use of scraped data complies with copyright laws and respects intellectual property rights.
- **Privacy Concerns**: Be mindful of personal data. Scraping and using personal information without consent can breach privacy laws and ethical standards.

#### Example: Understanding `robots.txt`
- **Selective Access**: Websites use `robots.txt` files to communicate their scraping policies, specifying which pages can or cannot be scraped. For example, [Google's `robots.txt`](https://www.google.com/robots.txt) allows certain parts of its site to be crawled while restricting others.
- **Dynamic Nature**: The content of `robots.txt` files can change, reflecting a website's evolving stance on web scraping. Regular checks are necessary for compliance.
- **Respecting the Limits**: Even if a `robots.txt` file allows scraping of some pages, it does not automatically mean all scraping activities are legally or ethically acceptable. It's a guideline, not a blanket permission.
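
Python's standard library can read these rules for you. The sketch below parses a made-up `robots.txt` from a string, so it runs without a network connection; the rules, the `my-scraper` user agent, and the URLs are all hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, given as a list of lines
sample_rules = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

blocked = parser.can_fetch("my-scraper", "https://www.example.com/private/data.html")
allowed = parser.can_fetch("my-scraper", "https://www.example.com/catalogue/page-1.html")
delay = parser.crawl_delay("my-scraper")

print("Private page allowed?", blocked)
print("Catalogue page allowed?", allowed)
print("Requested delay between requests (seconds):", delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead of parsing a string.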

> **Note**: While web scraping is a powerful tool, always ensure compliance with website terms of service and legal regulations regarding data usage, including regional regulations such as GDPR.


<h1 style="color: #00BFFF;">Extra Activities</h1>

#### 💡 **Activity** Learn more on European Regulations (GDPR):

- **General Data Protection Regulation (GDPR)**: In Europe, the GDPR sets strict rules on how personal data must be handled, including data obtained through web scraping. Compliance means having a lawful basis (such as consent or legitimate interest) for any personal data you scrape, and respecting data subjects' rights to access, correct, and delete their data.
- **Data Minimization**: GDPR emphasizes data minimization, meaning only the data that is necessary should be collected. When scraping, ensure that you are not collecting excessive or unnecessary personal data.
- **Legal Basis for Data Processing**: Under GDPR, there must be a lawful basis for processing personal data, such as consent or legitimate interest. Make sure that your scraping activities can justify their legal basis under GDPR.

In [None]:
# Discuss with the class the ethics of web scraping

# JavaScript

![image.png](attachment:image.png)

JavaScript is the programming language used to create dynamic and interactive content on websites. Unlike HTML and CSS, which provide structure and style, JavaScript adds behavior to web pages, allowing them to respond to user actions, update content dynamically, and interact with back-end servers.

For web scraping, understanding JavaScript is crucial when dealing with dynamic websites that load content asynchronously. These websites might use JavaScript to load parts of their content after the initial page load, making it difficult for traditional scraping tools like `requests` and BeautifulSoup to capture all the desired information. This is where tools like **Selenium** come in, as they are capable of interacting directly with the browser to wait for JavaScript content to load fully.

![image.png](attachment:image.png)

Selenium is a powerful tool primarily used for automating web browsers. It is highly popular in web scraping, automated testing, and automating web-based administration tasks. Selenium is particularly useful for handling websites that use JavaScript to load content, enabling users to interact directly with the browser and automate actions such as scrolling, clicking buttons, and filling out forms.

#### Introduction to Selenium WebDriver

WebDriver is the main component of Selenium that facilitates interaction with web browsers. It acts as an interface, allowing developers to programmatically control the browser to perform tasks like opening pages, locating elements, interacting with them, and extracting data.

Key Features of Selenium WebDriver:
- **Browser Automation**: It can navigate to URLs, click buttons, and fill forms automatically.
- **JavaScript Handling**: Selenium can wait for JavaScript to execute fully, making it suitable for scraping JavaScript-heavy websites.
- **Automatic Driver Management**: Selenium still talks to each browser through a driver (such as ChromeDriver or GeckoDriver), but recent Selenium releases (4.6 and later) include Selenium Manager, which locates or downloads a matching driver automatically, so a manual driver installation is often unnecessary.

<h3 style="color: #00BFFF;">Use Case: Navigating a Simple Web Page with Selenium</h3>

Let's start with a simple example to demonstrate how Selenium can be used to navigate a web page and interact with elements. Selenium allows us to automate the process of interacting with web pages, which is particularly useful for websites that use JavaScript to load content.

In [None]:
# Install Selenium and webdriver-manager (used below to download ChromeDriver)
# pip install selenium webdriver-manager

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Initialize the Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a web page
driver.get("https://www.python.org")

# Find an element by its name and print its tag name
element = driver.find_element(By.NAME, "q")  # Finds the search bar element by its name attribute
print("Element Tag Name:", element.tag_name)

# Close the browser
driver.quit()

<h3 style="color: #00BFFF;">Using XPath to Select Elements</h3>

Selenium can also be combined with XPath to select elements precisely. For example, you can use XPath to locate specific elements on the page.

#### XPath: Precise Selection for Data Extraction

XPath (XML Path Language) is a versatile tool for navigating and selecting elements within HTML documents. While libraries like BeautifulSoup and `requests` are great for basic scraping, XPath offers a unique and powerful approach to extracting data, especially when dealing with complex or deeply nested HTML structures.

#### Key Advantages of XPath:
- **Granular Selection**: XPath allows precise control over element selection, making it easy to target elements based on attributes, tags, or positions within the document.
- **Hierarchical Navigation**: XPath is well-suited for navigating the hierarchical structure of HTML, allowing traversal up, down, or across branches of the document tree.
- **Advanced Queries**: XPath supports powerful querying capabilities, enabling users to locate elements with specific attributes, perform text extraction, and create complex conditions for data retrieval.

Common XPath Syntax:
- **Absolute Path (`/`)**: Selects elements starting from the root, e.g., `/html/body/p` for all `<p>` elements that are direct children of `<body>`.
- **Relative Path (`//`)**: Selects elements regardless of location in the document, e.g., `//a` selects all `<a>` tags.
- **Attributes (`@`)**: Select elements based on attributes, e.g., `//div[@id='main']` selects a `<div>` with an `id` of 'main'.
- **Built-in Functions**: Functions like `contains()` or `text()` help locate elements based on their content or other conditions, e.g., `//*[contains(@class, 'button')]` selects elements with a class containing 'button'.

Using XPath with Selenium allows for precise data extraction, especially when CSS selectors fall short or when the HTML structure is deeply nested or complex.
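
XPath expressions can also be tried without opening a browser. The sketch below swaps in Python's built-in `xml.etree.ElementTree`, which is not Selenium and supports only a subset of XPath (relative paths and simple attribute predicates, but not functions such as `contains()`). The well-formed sample page and all names are made up for illustration:

```python
import xml.etree.ElementTree as ET

sample_xhtml = """<html>
  <body>
    <div id="main">
      <p>Intro</p>
      <a href="/docs">Docs</a>
    </div>
    <p>Footer</p>
    <a href="/about">About</a>
  </body>
</html>"""

root = ET.fromstring(sample_xhtml)  # root is the <html> element

# Path step by step: only <p> tags that are direct children of <body>
body_paragraphs = root.findall("./body/p")

# Anywhere below the root: every <a> tag, however deeply nested
all_links = root.findall(".//a")

# Attribute predicate: the <div> whose id is 'main'
main_div = root.find(".//div[@id='main']")

print([p.text for p in body_paragraphs])
print([a.get("href") for a in all_links])
print(main_div.find("p").text)
```

Note that the "Intro" paragraph is not returned by the first query because it is nested inside the `<div>`, not directly inside `<body>`.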

In [None]:
# Initialize the Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a web page
driver.get("https://www.python.org")

# Use XPath to find the search bar and print the tag name
search_bar = driver.find_element(By.XPATH, "//input[@name='q']")
print("Search Bar Tag Name:", search_bar.tag_name)

# Close the browser
driver.quit()

<h3 style="color: #00BFFF;">Interacting with Web Elements</h3>

In this example, we will automate filling out a form and clicking a button.

In [None]:
# Initialize the Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a web page
driver.get("https://www.python.org")

# Find the search bar using its name attribute and enter text
search_bar = driver.find_element(By.NAME, "q")
search_bar.send_keys("web scraping")

# Find the submit button and click it
search_button = driver.find_element(By.ID, "submit")
search_button.click()

# Print the current URL after searching
print("Current URL:", driver.current_url)

# Close the browser
driver.quit()

<h3 style="color: #00BFFF;">Handling Dynamic Content with Selenium</h3>

Selenium is also very effective for handling dynamic content, such as pages that load additional elements after an interaction.

In [None]:
import time

# Initialize the Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a web page with dynamic content
# (placeholder URL: replace it, and the class name in the XPath below, with those of a real page)
driver.get("https://example.com")

# Scroll down to load more content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for dynamic content to load
time.sleep(3)

# Find a dynamically loaded element using XPath
dynamic_element = driver.find_element(By.XPATH, "//div[@class='dynamic-content']")
print("Dynamic Element Text:", dynamic_element.text)

# Close the browser
driver.quit()

<h3 style="color: #00BFFF;">Using Explicit Waits</h3>

Sometimes, elements take time to load, and Selenium needs to wait until the element is available before interacting with it. For such cases, we use explicit waits.

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the Chrome WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Open a web page
driver.get("https://www.python.org")

# Wait until the search bar is present, then interact with it
wait = WebDriverWait(driver, 10)
search_bar = wait.until(EC.presence_of_element_located((By.NAME, "q")))
search_bar.send_keys("selenium waits")

# Close the browser
driver.quit()

# Summary

Refer to the `robots.txt` file on a website (by visiting, for example, `www.example.com/robots.txt`) to understand the server's guidelines and limitations regarding web scraping.

1. **Web Technologies**:
   - **HTML**: This is the standard markup language that holds the content of the webpage. It is the primary target when we engage in web scraping.
   - **CSS**: Cascading Style Sheets are used to describe the look and formatting of a document written in HTML.
   - **JavaScript**: This is a scripting language used to create interactive and dynamic website content.

2. **Web Scraping Tools**:
   - **Requests**: A Python library that allows you to send HTTP requests to get the HTML content of a webpage.
   - **Beautiful Soup**: A Python library that facilitates the programmatic analysis of HTML, helping in parsing the HTML and navigating the parse tree.
   - **Selenium**: In cases where the webpage content is dynamic and generated using JavaScript, tools like Selenium are often used. Selenium can interact with JavaScript to load dynamic content, making it accessible for scraping.
   
3. **Finding and Selecting Elements**:
   - **Selection by Tag, Class, and ID**: We can find elements using various attributes such as their tag name, class name, or ID.
   - **CSS Selectors**: These are patterns that select elements in more expressive ways, leveraging the relationships between elements (such as nesting, classes, and attributes).