# Scraping Data from Nested HTML Pages with Python Selenium 
In this tutorial, I illustrate how to scrape a list of terms, distributed over two levels of nested pages, with Python `selenium`. As an example, I scrape the list of terms from [Brocardi](https://www.brocardi.it).

## Recognize the Web Site Structure
In order to scrape data from a Web site, I first need to study the structure of its URIs.
In my example, the list of terms is organized alphabetically, and each letter of the alphabet has a dedicated page, available at `<basic_url>/dizionario/<current_letter>/` (first level of URI). For example, for the letter `a`, the dedicated page is available at `https://www.brocardi.it/dizionario/a/`.
In addition, the list of terms for each letter is split over several pages. For each letter, the first page is available at the first level of URI, while from the second page onwards the URI becomes `<basic_url>/dizionario/<current_letter>/?page=<page_number>`. For example, for the letter `a`, the second page of terms is available at `https://www.brocardi.it/dizionario/a/?page=2`.
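
The URI scheme described above can be sketched as a small helper. This is only an illustration: the name `page_url` and its `page` parameter are my own and are not used in the rest of the tutorial, which follows the "next" links instead of building page numbers.

```python
BASIC_URL = "https://www.brocardi.it"

def page_url(letter, page=1):
    """Return the URL of one page of terms for a letter (hypothetical helper)."""
    url = f"{BASIC_URL}/dizionario/{letter}/"
    # The first page lives at the first level of URI; later pages add ?page=N.
    if page > 1:
        url += f"?page={page}"
    return url

print(page_url("a"))     # first page for the letter a
print(page_url("a", 2))  # second page for the letter a
```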

## Environment Setup
In my code, I need to implement two loops: an external loop over letters and an internal loop over pages. I note that some letters have no page on the site (`j`, `k`, `w`, `x`, `y`). For the external loop, I build a list containing all the letters except the missing ones. I use `string.ascii_lowercase` to build the list of letters.

In [None]:
import string
letters = string.ascii_lowercase
letters = letters.replace('jk', '')
letters = letters.replace('wxy', '')
letters = list(letters)
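
As a possible alternative, the same list can be built with a list comprehension, which does not rely on the missing letters being adjacent in the alphabet. The name `missing` is introduced here only for illustration.

```python
import string

# Letters with no page on the site, as noted above.
missing = set("jkwxy")

# Keep every lowercase letter that is not in the missing set.
letters = [c for c in string.ascii_lowercase if c not in missing]
print(len(letters))  # → 21
```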

Then I define two variables: `basic_url`, which contains the base URL of the Web site, and `table`, which will contain the list of all extracted terms. Initially, `table` is an empty list.

In [None]:
table = []
basic_url = "https://www.brocardi.it"

Now I import the `selenium` webdriver, the Chrome `Options` class, and the `NoSuchElementException` exception, which is used to detect the last page while performing the internal loop. The `pandas` library is imported later, when the results are stored.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  
from selenium.common.exceptions import NoSuchElementException

## Nested Loops
I implement the external loop through a `for` over the `letters` list. At each step of the external loop, I build the URL of the first page for the current letter. Then I implement the internal loop as an infinite `while`. Within the internal loop, I build a driver, which performs the scraping. I use a `Chrome()` webdriver, which receives as input the `--headless` and `--lang=it` options. The first option runs the browser without opening a window, while the second option sets the language of the browser.

Once connected, I search for two elements:
* the elements which contain the list of terms
* the element which contains the link to the next page.

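Both elements depend on the structure of the HTML page. I use the driver's XPath lookup, which returns the first element matching a given XPath expression and raises `NoSuchElementException` when nothing matches. In older Selenium releases this is `find_element_by_xpath()`; that method was removed in Selenium 4.3, where the equivalent call is `find_element(By.XPATH, ...)`.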
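
To make the two XPath expressions concrete without opening a browser, here is a minimal sketch that applies them to a hand-written page with the standard library's `xml.etree.ElementTree`. The markup and the terms in `SAMPLE_PAGE` are invented for illustration; only the class names `terms-list` and `next` come from the expressions used in this tutorial. `ElementTree` supports only a subset of XPath and needs well-formed markup, so this is not a replacement for Selenium on real pages.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for one page of the dictionary.
SAMPLE_PAGE = """
<html><body>
  <ul class="terms-list">
    <li>term one</li>
    <li>term two</li>
  </ul>
  <a class="next" href="/dizionario/a/?page=2">next</a>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Element that contains the list of terms.
terms_list = root.find('.//ul[@class="terms-list"]')
terms = [li.text for li in terms_list.findall("li")]

# Element that contains the link to the next page (None on the last page).
next_link = root.find('.//a[@class="next"]')
next_href = next_link.get("href") if next_link is not None else None

print(terms)
print(next_href)
```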

As already said, the internal loop is an infinite loop, where the break condition is given by a `NoSuchElementException`, raised when the current page has no link to a next page. The terms found on each page are appended to the `table` variable.

In [None]:
from selenium.webdriver.common.by import By  # locator strategies for find_element()

for letter in letters:
    # first page for the current letter
    url = basic_url + '/dizionario/' + letter + '/'
    while True:
        print(url)
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--lang=it")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)

            # get the list of terms
            xpath = '//ul[@class="terms-list"]'
            words = driver.find_element(By.XPATH, xpath).text
            table.extend(words.split('\n'))

            # get the next page
            xpath = '//a[@class="next"]'
            url = driver.find_element(By.XPATH, xpath).get_attribute('href')
        except NoSuchElementException:
            # no "next" link: this was the last page for the current letter
            break
        finally:
            # always shut the browser down, including on the last page
            driver.quit()
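
The pagination logic of the internal loop can also be isolated from the browser, which makes it easy to try out. The following is a minimal sketch under my own assumptions: `collect_terms` and `fetch_page` are hypothetical names, and `fetch_page(url)` is assumed to return the terms of a page together with the URL of the next page, or `None` when there is none. Here a dictionary with invented data plays the role of the Web site.

```python
def collect_terms(first_url, fetch_page):
    """Follow "next" links starting from first_url and gather every term.

    fetch_page is a hypothetical callable: fetch_page(url) -> (terms, next_url),
    where next_url is None on the last page.
    """
    terms = []
    url = first_url
    while url is not None:
        page_terms, url = fetch_page(url)
        terms.extend(page_terms)
    return terms

# Invented stand-in for the site: two pages for a single letter.
fake_site = {
    "/dizionario/a/": (["term 1", "term 2"], "/dizionario/a/?page=2"),
    "/dizionario/a/?page=2": (["term 3"], None),
}

collected = collect_terms("/dizionario/a/", fake_site.__getitem__)
print(collected)
```

In this variant the end of the pagination is signalled by an explicit `None` rather than by an exception; with Selenium, `fetch_page` would wrap the two XPath lookups and translate `NoSuchElementException` into `None`.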
    
    

## Store results

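The variable `table` contains the list of all terms. I can store it in a CSV file by building a `pandas` DataFrame, converting the terms to lowercase, and writing the result to disk. Note that `to_csv()` does not create missing directories, so the `outputs` directory must already exist.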

In [None]:
import pandas as pd

df = pd.DataFrame(table, columns=['word'])
df.head()

In [None]:
df['word'] = df['word'].str.lower()

In [None]:
df.to_csv('outputs/glossary.csv')
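
If `pandas` is not available, a similar result can be obtained with the standard library. This is a sketch that swaps in the `csv` module in place of `pandas`; the function name `save_terms`, the filtering of empty strings, and the sample terms are my own assumptions. Unlike the cell above, it creates the target directory if needed and does not write an index column.

```python
import csv
import os
import tempfile

def save_terms(terms, path):
    """Write lowercased, non-empty terms to a one-column CSV file.

    Returns the number of terms written (hypothetical helper).
    """
    folder = os.path.dirname(path)
    if folder:
        # Create the output directory when it does not exist yet.
        os.makedirs(folder, exist_ok=True)
    cleaned = [t.strip().lower() for t in terms if t.strip()]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word"])  # same column name as the DataFrame
        writer.writerows([t] for t in cleaned)
    return len(cleaned)

# Demonstration on invented terms, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "outputs", "glossary.csv")
written = save_terms(["Alpha", "", "BETA "], out_path)
print(written)
```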