# Techniques of Webscraping with Python 
### From BeautifulSoup to Selenium

BeautifulSoup and Selenium are two of the most commonly used Python tools for web scraping. Here is a brief overview of each:

1. **BeautifulSoup:**
   - **Purpose:** BeautifulSoup is used for parsing HTML and extracting information from web pages.
   - **Installation:** You can install it using `pip install beautifulsoup4`. The example below also uses `requests` (`pip install requests`) to download the page.
   - **Example:**
     ```python
     from bs4 import BeautifulSoup
     import requests

     url = 'your_target_url'
     response = requests.get(url)
     soup = BeautifulSoup(response.text, 'html.parser')

     # Extracting data
     title = soup.title.text
     ```

2. **Selenium:**
   - **Purpose:** Selenium is used for browser automation, which is useful when a website relies heavily on JavaScript.
   - **Installation:** You can install it using `pip install selenium`.
   - **Example:**
     ```python
     from selenium import webdriver
     from selenium.webdriver.common.by import By

     # Set up the WebDriver (e.g., Chrome); Selenium 4.6+ locates a matching driver itself
     driver = webdriver.Chrome()

     # Open a website
     driver.get('your_target_url')

     # Extracting data (after the page has loaded, if it has dynamic content)
     element = driver.find_element(By.CSS_SELECTOR, 'your_css_selector')
     data = element.text

     # End the browser session
     driver.quit()
     ```



### The BeautifulSoup Library

`BeautifulSoup` is a parsing library: given the HTML content of a web page, it lets you locate the data of interest and extract it in a suitable format.

One of the great things about BeautifulSoup is its ease of use. Indeed, a few lines of code are enough to create a scraper and the library benefits from clear and complete documentation. It is for these and other reasons that `BeautifulSoup` is popular with developers.

However, this library also has significant drawbacks:

- `BeautifulSoup` does not download pages itself. It relies on another library (such as `urllib` or `requests`) to fetch the HTML, so a scraper always combines it with at least one other module.
- this package cannot perform dynamic web scraping. More and more web pages embed JavaScript in their HTML source code to add dynamic content (widgets) and user interaction (mouse clicks, keyboard input, etc.). BeautifulSoup does not execute JavaScript, so it only sees the HTML as it was delivered by the server.

This library is therefore recommended for retrieving large volumes of data from static websites.
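As a minimal sketch of the static case, the snippet below parses an HTML string that already contains all the data. The HTML, the `teams` id and the link targets are invented for illustration; on a real site the string would come from `requests` or `urllib`.

```python
from bs4 import BeautifulSoup

# Hypothetical static page: everything we need is already in the HTML source.
html = """
<html><body>
  <h1>Results</h1>
  <ul id="teams">
    <li><a href="/wiki/France">France</a></li>
    <li><a href="/wiki/Croatia">Croatia</a></li>
    <li><a href="/wiki/Belgium">Belgium</a></li>
  </ul>
</body></html>
"""

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; each tag exposes its text and attributes.
links = soup.find("ul", id="teams").find_all("a")
teams = [(a.text, a["href"]) for a in links]

print(soup.h1.text)
print(teams)
```

Because no browser is involved, this kind of extraction stays cheap even when it is repeated over many pages.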

### The Selenium Library

`Selenium` is a library that lets you control a browser (Chrome, Internet Explorer, Firefox, Safari, ...) automatically from a script. Originally created for automated web testing, it is also used for web scraping because the browser it drives executes JavaScript. This makes it a real alternative to BeautifulSoup for dynamic web pages, which are increasingly common.

On the other hand, Selenium comes with a major constraint: driving a real browser consumes a lot of resources, which makes it slower and less efficient than a library like BeautifulSoup.

`Selenium` is therefore recommended (even essential) for websites that rely on JavaScript, but it is not recommended for retrieving large volumes of data.

- **Run the next cell.**

In [1]:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = bs.BeautifulSoup(source, 'lxml')

js_test = soup.find('p', class_='jstest')
js_test.text


'y u bad tho?'

> The returned text is not 'Look at you shinin!' but 'y u bad tho?'. Let's take a closer look at the HTML code of the web page stored in the `soup` variable.
>
- **Run the following cell, then use CTRL + F to search for the word "jstest".**

In [2]:
soup

<html>
<head>
<!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Python Programming Tutorials</title>
<meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
<link href="/static/favicon.ico" rel="shortcut icon"/>
<link href="/static/css/materialize.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification"/>
<link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
<!-- Compiled and minified CSS -->
<!-- Compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<script src="https://cdnjs.cloudflare.com/a

> We see that the element of class `jstest` initially contains the text 'y u bad tho?'. A JavaScript script is then meant to run and change the text to 'Look at you shinin!'. Since BeautifulSoup cannot execute
> JavaScript, it retrieves the untransformed text.
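The same behaviour can be reproduced offline. The sketch below uses an invented stand-in for the tutorial page (the `yesnojs` id and the script body are assumptions, not copied from the real site) to show that BeautifulSoup treats a script as plain text and never runs it.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page: the paragraph ships with one text, and a
# script is meant to replace it once a browser executes the page.
html = """
<html><body>
  <p class="jstest" id="yesnojs">y u bad tho?</p>
  <script>
    document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The paragraph still holds the text delivered in the HTML source.
static_text = soup.find("p", class_="jstest").text

# The script is available only as a string; nothing has executed it.
script_source = soup.find("script").string

print(static_text)
print("Look at you shinin!" in script_source)
```

The replacement text exists only inside the script source, which is why a tool that drives a real browser is needed to see it in the paragraph.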
> 
> Let's now try to retrieve the text using the Selenium library.
>
- **Run the next cell.**

In [4]:
%pip install selenium



In [6]:
%pip install webdriver_manager



In [7]:
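from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import warnings
warnings.filterwarnings("ignore")

try:
    # Selenium 4.6+ can locate a matching ChromeDriver on its own
    driver = webdriver.Chrome()
except Exception:
    # Fallback: download a driver with webdriver_manager and pass it through a Service
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://pythonprogramming.net/parsememcparseface/')
driver.find_element(by='class name', value='jstest').text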

'Look at you shinin!'

> The text has now been successfully retrieved.
>
> This simple example shows the limitations of BeautifulSoup on websites that use JavaScript.

### Identification of an element on a web page 

Web scraping with Selenium consists of controlling a browser automatically and performing actions comparable to those of a regular user in front of a screen. The approach is therefore always the same:

- identify an element on the web page
- interact with an element
- retrieve interesting information

The objective of this notebook is to understand the essential methods for identifying elements on a web page.

- **Execute the following cell to hide the warnings and import the necessary packages.**

In [7]:
import warnings
warnings.filterwarnings("ignore")
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager


> The first step is to create a WebDriver instance, which lets us perform operations on the website to be scraped. The `Chrome()`, `Firefox()`, `Edge()` and `Safari()` classes of the `webdriver` module create a WebDriver for Chrome, Firefox, Edge and Safari respectively.

- **(a) Create a `driver` instance of a Chrome WebDriver.**

> If the command does not find a suitable driver, use the following command instead: `webdriver.Chrome(service=Service(ChromeDriverManager().install()))`, where `Service` is imported from `selenium.webdriver.chrome.service`.

In [10]:
driver = webdriver.Chrome()

In [11]:
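from selenium.webdriver.chrome.service import Service

try:
    driver = webdriver.Chrome()
except Exception:
    # Fallback: download a driver with webdriver_manager and pass it through a Service
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))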

This operation should open an empty Chrome window.

Once the WebDriver has been created, the procedure is the same as for a regular user. To access the website to be scraped, pass its address to the `get()` method of the `driver` object. The objective is to retrieve information about the 2018 FIFA World Cup.

- **(b) Go to the following Wikipedia page : https://en.wikipedia.org/wiki/2018_FIFA_World_Cup**

In [12]:
driver.get("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup") 

This operation should load the 2018 FIFA World Cup Wikipedia page.

Once the target web page is loaded, we can identify its elements in order to extract information from them or to interact with them.

To find an element, we use the `find_element` method, which takes 2 parameters:

- **by:** the strategy used to locate the element. Common values include:
  - `id` to identify a web element by its id.
  - `name` to identify a web element by its name attribute.
  - `class name` to identify a web element by its class.
  - `link text` to identify a hyperlink by its exact visible text.
  - `xpath` to identify a web element using XPath, a language for locating nodes in an XML document. This strategy is very useful when an element cannot be located by its id or its name.
- **value:** the value to look for with the chosen strategy

This method returns a `WebElement` object, whose `text` attribute holds its visible text and whose `click()` method clicks on it. It also exists in the plural form (`find_elements`), which returns the list of all matching elements.
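The difference between the singular and plural forms can be sketched with two small helpers. The function names `first_text` and `all_texts` are hypothetical, and `driver` is assumed to be a WebDriver instance such as the one created earlier in this notebook.

```python
def first_text(driver, by, value):
    """Return the text of the first element matching the locator."""
    # find_element returns a single WebElement (and raises if nothing matches).
    return driver.find_element(by=by, value=value).text


def all_texts(driver, by, value):
    """Return the text of every element matching the locator."""
    # find_elements returns a list, which is empty when nothing matches.
    return [element.text for element in driver.find_elements(by=by, value=value)]


# Intended use with a live driver (not executed here):
#   first_text(driver, "id", "siteSub")
#   all_texts(driver, "tag name", "h2")
```

The strategy strings are passed through unchanged, so any value accepted by `find_element` (for example `"id"`, `"class name"` or `"xpath"`) can be used with both helpers.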

In [13]:
webelement = driver.find_element(by='id', value='siteSub')

# Get the text of the element
webelement.text


'From Wikipedia, the free encyclopedia'

- **(e) Find the "Main page" link on the left by its link text and store it in a variable named `webelement`.**
- **(f) Click on the element using the `click()` method.**

In [14]:
webelement = driver.find_element(by='link text', value='Main page')

# Click on the element
webelement.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"Main page"}
  (Session info: chrome=120.0.6099.110); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

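The `NoSuchElementException` above means that `find_element` found no match for the given link text in the page as it was rendered in that session. One possible way to make such a cell tolerant of a missing element is sketched below. The helper name `click_link_if_present` is hypothetical, and `driver` is assumed to be a live WebDriver instance.

```python
def click_link_if_present(driver, link_text):
    """Click the first hyperlink whose visible text is link_text.

    Returns True when a link was found and clicked, False otherwise.
    """
    # find_elements returns an empty list instead of raising when nothing matches.
    matches = driver.find_elements(by="link text", value=link_text)
    if not matches:
        return False
    matches[0].click()
    return True


# Intended use with a live driver (not executed here):
#   if not click_link_if_present(driver, "Main page"):
#       print("Link not found on this version of the page")
```

An alternative with the same effect is to keep `find_element` and catch `NoSuchElementException`, which can be imported from `selenium.common.exceptions`.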

- **(g) Close the WebDriver.**

In [15]:
driver.close()

WebDriverException: Message: disconnected: not connected to DevTools
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=120.0.6099.110)

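The error output above suggests that the session had already lost its connection to the browser when `close()` was called. A minimal sketch for making cleanup more predictable is to wrap the work in `try`/`finally`. The helper name `run_and_quit` and its `task` callback are hypothetical, and `driver` is assumed to be a freshly created WebDriver instance.

```python
def run_and_quit(driver, url, task):
    """Load url, run task(driver), and always end the browser session."""
    try:
        driver.get(url)
        return task(driver)
    finally:
        # quit() closes every window and ends the session;
        # close() only closes the current window.
        driver.quit()


# Intended use with a live driver (not executed here):
#   subtitle = run_and_quit(
#       webdriver.Chrome(),
#       "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup",
#       lambda d: d.find_element(by="id", value="siteSub").text,
#   )
```

With this pattern the browser is shut down even when the scraping step raises, so no orphaned window or driver process is left behind.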

In [16]:
import warnings
warnings.filterwarnings("ignore")
from time import sleep
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
import pandas as pd
import re
import numpy as np
