## Extracting Content from a Website

Let us read the content from the website and compute its length using `BeautifulSoup` and plain Python collections.
* First, we will explore how to read the main content of a single page.
* We will then create a list of all the URLs from which we want to extract the content.

In [1]:
import requests

page_url = 'https://python.itversity.com/04_postgres_database_operations/04_ddl_data_definition_language.html'
page = requests.get(page_url)

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

* `soup.get_text()` returns the text of the entire page, including the navigation menu. However, we are interested only in the main content at the center.

In [2]:
soup.get_text()

"\n\n\n\n\n\nDDL – Data Definition Language — Mastering Python\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMastering Python\n\n\n\n\n\n\n\n\n\n\n   Mastering Python\n  \n\n\n\n\n\n   Overview of Windows Operating System\n  \n\n\n\n   Setup Ubuntu VM on GCP\n  \n\n\n\n   Setup Postgres Database\n  \n\n\n\n   Perform Database Operations\n  \n\n\n\n     Overview of SQL\n    \n\n\n\n     Create Database and Users Table\n    \n\n\n\n     DDL – Data Definition Language\n    \n\n\n\n     DML – Data Manipulation Language\n    \n\n\n\n     DQL – Data Query Language\n    \n\n\n\n     CRUD Operations – DML and DQL\n    \n\n\n\n     TCL – Transaction Control Language\n    \n\n\n\n     Example - Data Engineering\n    \n\n\n\n     Example - Web Application\n    \n\n\n\n     Exercise - Database Operations\n    \n\n\n\n\n\n   Getting Started with Python\n  \n\n\n\n   Basic Programming Constructs\n  \n\n\n\n   Pre-defined Functions\n  \n\n\n\n   User Defined Funct

* The main content at the center is under a `div` tag with the id `main-content`. We can find that tag and use `get_text` to extract the main content from a single page.

In [4]:
soup.find('div', id='main-content').get_text()

"\n\n\n\nDDL – Data Definition Language¶\nLet us get an overview of DDL Statements which are typically used to create database objects such as tables.\n\nDDL Stands for Data Definition Language.\nWe execute DDL statements less frequently as part of the application development process.\nTypically DDL Scripts are maintained separately than the code.\nFollowing are the common DDL tasks.\n\nCreating Tables - Independent Objects\nCreating Indexes for performance - Typically dependent on tables\nAdding constraints to existing tables\n\n\n\nCREATE TABLE users (\n  user_id SERIAL PRIMARY KEY,\n  user_first_name VARCHAR(30) NOT NULL,\n  user_last_name VARCHAR(30) NOT NULL,\n  user_email_id VARCHAR(50) NOT NULL,\n  user_email_validated BOOLEAN DEFAULT FALSE,\n  user_password VARCHAR(200),\n  user_role VARCHAR(1) NOT NULL DEFAULT 'U', --U and A\n  is_active BOOLEAN DEFAULT FALSE,\n  created_dt DATE DEFAULT CURRENT_DATE,\n  last_updated_ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\n\n\nFollowing a

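The raw text returned by `get_text` contains long runs of newline characters, so its length overstates how much readable content a page has. As a minimal sketch, assuming only that the text is an ordinary Python string, whitespace can be collapsed before counting. The helper name `normalized_length` is hypothetical, and the sample is a short prefix copied from the output above.

```python
def normalized_length(text):
    """Return (raw length, length after collapsing whitespace) for a string."""
    # str.split() without arguments splits on any run of whitespace,
    # so joining the pieces with single spaces removes the padding.
    collapsed = " ".join(text.split())
    return len(text), len(collapsed)


# Short prefix of the main content shown above (not the full page text)
sample = "\n\n\n\nDDL – Data Definition Language¶\nLet us get an overview of DDL Statements"

raw_length, collapsed_length = normalized_length(sample)
print(f'raw: {raw_length}, collapsed: {collapsed_length}')
```

`get_text` also accepts `separator` and `strip` arguments, which may be a convenient alternative when the cleanup should happen during extraction.
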
* Now let us get the text content for all the pages.
  * Get all the URLs that need to be scraped into a list.
  * For each URL, extract the content and add it to a list along with the URL.
* The new list should contain both the URL and the content of each page.

In [5]:
import requests

python_base_url = 'https://python.itversity.com'
python_url = f'{python_base_url}/mastering-python.html'
python_page = requests.get(python_url)

from bs4 import BeautifulSoup

soup = BeautifulSoup(python_page.content, 'html.parser')

In [6]:
nav = soup.find('nav', {'id': 'bd-docs-nav'})

In [7]:
first_level_urls = []
for a in nav.find_all('a', {'class': 'reference internal'}):
    if a['href'] != '#':
        first_level_urls.append(a['href'])

In [8]:
all_urls = []
for first_level_url in first_level_urls:
    # Each first-level link is relative to the site root
    url = f"{python_base_url}/{first_level_url}"
    all_urls.append(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    current_nav = soup.find('nav', {'id': 'bd-docs-nav'})
    # The navigation entry for the page being viewed lists its sub-pages
    current_href = current_nav.find('li', {'class': 'toctree-l1 current active'})
    for second_level_href in current_href.find_all('a', {'class': 'reference internal'}):
        # Second-level links are relative to the directory of the current page
        all_urls.append(f"{'/'.join(url.split('/')[:-1])}/{second_level_href['href']}")
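
Building absolute URLs by splitting and joining strings works for the links on this site, but it can break if a link contains `../` or is already absolute. As a hedged alternative, the standard library function `urllib.parse.urljoin` resolves a relative link against the page it appears on. In the sketch below, `05_sibling_page.html` is a hypothetical file name used only for illustration, and the de-duplication step is optional, for the case where the navigation links a page more than once.

```python
from urllib.parse import urljoin

# Page that holds the top-level navigation (taken from the cells above)
index_url = 'https://python.itversity.com/mastering-python.html'

# A first-level link, relative to the index page
chapter_href = '04_postgres_database_operations/04_ddl_data_definition_language.html'
chapter_url = urljoin(index_url, chapter_href)

# A second-level link, relative to the chapter page (hypothetical file name)
sibling_href = '05_sibling_page.html'
sibling_url = urljoin(chapter_url, sibling_href)

# dict.fromkeys keeps the first occurrence of each URL and preserves order
collected = [chapter_url, sibling_url, chapter_url]
unique_urls = list(dict.fromkeys(collected))

for unique_url in unique_urls:
    print(unique_url)
```
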

In [9]:
%%time
url_and_content_list = []
for content_url in all_urls:
    content_page = requests.get(content_url)
    content_soup = BeautifulSoup(content_page.content, 'html.parser')
    content_text = content_soup.find('div', id='main-content').get_text()
    url_and_content_list.append((content_url, content_text))

CPU times: total: 16.5 s
Wall time: 1min 26s
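
The wall time is much larger than the CPU time, which suggests that most of the run is spent waiting for network responses. One possible way to shorten it, sketched here under that assumption, is to download several pages at once with a small thread pool from the standard library. The names `fetch_all` and `fake_fetch_text` are hypothetical, and the stand-in function replaces the real download so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch_text, max_workers=4):
    """Apply fetch_text to every URL using worker threads.

    executor.map returns results in the order of the input URLs,
    so each text can be paired with its URL using zip.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        texts = list(executor.map(fetch_text, urls))
    return list(zip(urls, texts))


def fake_fetch_text(url):
    # Stand-in for the real download and parsing step (assumption for the demo)
    return f'content of {url}'


demo_urls = ['https://example.com/a.html', 'https://example.com/b.html']
demo_result = fetch_all(demo_urls, fake_fetch_text)
print(demo_result[0])
```

In the notebook, `fetch_text` would wrap the `requests.get`, `BeautifulSoup`, `find`, and `get_text` calls from the loop above. Keeping `max_workers` small avoids putting unnecessary load on the website.
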


In [10]:
for url, content in url_and_content_list[:10]:
    print(f'{url} : {len(content)}')

https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html : 233
https://python.itversity.com/01_overview_of_windows_os/02_getting_system_details.html : 463
https://python.itversity.com/01_overview_of_windows_os/03_managing_windows_system.html : 475
https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html : 573
https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html : 39
https://python.itversity.com/01_overview_of_windows_os/06_power_shell_and_command_prompt.html : 41
https://python.itversity.com/01_overview_of_windows_os/07_connecting_to_linux_servers.html : 38
https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html : 28
https://python.itversity.com/02_setup_ubuntu_vm_on_gcp/01_setup_ubuntu_vm_on_gcp.html : 323
https://python.itversity.com/02_setup_ubuntu_vm_on_gcp/02_signing_up_for_gcp.html : 465
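
Once the lengths are available, plain Python collections are enough to summarize them. The following sketch uses a few URL and length pairs copied from the output above. The cut-off of 50 characters is an arbitrary value chosen for illustration, and pages below it may be worth checking by hand, since they might contain little more than a title.

```python
# A few (url, length) pairs copied from the output above
lengths = [
    ('https://python.itversity.com/01_overview_of_windows_os/01_overview_of_windows_os.html', 233),
    ('https://python.itversity.com/01_overview_of_windows_os/04_overview_of_microsoft_office.html', 573),
    ('https://python.itversity.com/01_overview_of_windows_os/05_overview_of_editors_and_ides.html', 39),
    ('https://python.itversity.com/01_overview_of_windows_os/08_folders_and_files.html', 28),
    ('https://python.itversity.com/02_setup_ubuntu_vm_on_gcp/02_signing_up_for_gcp.html', 465),
]

threshold = 50  # arbitrary cut-off, an assumption for this example

# Pages whose content is shorter than the cut-off
short_pages = [url for url, length in lengths if length < threshold]

# Page with the longest content
longest_url, longest_length = max(lengths, key=lambda pair: pair[1])

# Total length per section; the section is the first path segment of the URL
total_by_section = {}
for url, length in lengths:
    section = url.split('/')[3]
    total_by_section[section] = total_by_section.get(section, 0) + length

print(short_pages)
print(longest_url, longest_length)
print(total_by_section)
```

In the notebook, the same logic could be applied to `url_and_content_list` by using `len(content)` in place of the hard-coded lengths.
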
