# Scraping Websites with Beautiful Soup

## Overview

Web scraping is the process of extracting data from websites. This can be useful for a variety of purposes, such as collecting data for research or building a dataset for a machine learning model. In Python, one popular library for web scraping is Beautiful Soup.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which lets you navigate and search the document hierarchically and in a more readable way.

## Installation

Before we begin, you'll need to install the Beautiful Soup library. You can do this using pip, the package installer for Python:

In [1]:
!pip install beautifulsoup4
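
To confirm that the installation worked, a quick check like the following can help. It is only a sketch, and it assumes the package was installed into the same environment the notebook is running in:

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# __version__ holds the installed Beautiful Soup version as a string.
print(bs4.__version__)
```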



## Example 1: Scraping a Simple Website

Let's start with a simple example. We'll scrape the contents of a webpage and print them to the console.

First, let's import the necessary libraries:
- The requests module sends HTTP requests to the website we want to scrape.
- BeautifulSoup parses the HTML code. Put simply, parsing means decomposing the HTML into the parts that make up its structure; the word "parse" comes from the Latin "pars", which means "part".

In [2]:
import requests
from bs4 import BeautifulSoup

Next, we'll specify the URL of the page we want to scrape and use the requests library to retrieve it, storing the result in the "response" variable:

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
response = requests.get(url)
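
A request can fail or return an error page, so it may be worth checking the response before parsing it. The sketch below wraps the call in a helper; the name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original notebook:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or raise if the request fails.

    fetch_html and the default timeout are hypothetical choices
    made for this example.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed by mistake.
    response.raise_for_status()
    return response.content


# Usage (needs network access):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
```

The returned bytes can be passed to BeautifulSoup in the same way as response.content.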

Now we can create a BeautifulSoup object, which parses the HTML and turns it into a tree of Python objects:

In [6]:
soup = BeautifulSoup(response.content, 'html.parser')
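
To get a feel for what "a tree of Python objects" means, here is a minimal sketch on a tiny made-up document (demo_html and demo_soup are names invented for this example, used so that the notebook's soup variable is left untouched):

```python
from bs4 import BeautifulSoup

# A tiny made-up document, used instead of the Wikipedia page.
demo_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p>First paragraph</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# Tags are reachable as attributes of the tree.
print(demo_soup.title.string)    # text inside <title>
print(demo_soup.h1.name)         # the tag name
print(demo_soup.h1["id"])        # an HTML attribute, read like a dict key
print(demo_soup.h1.parent.name)  # the enclosing tag
```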

Now you can print the parsed document with prettify(), which returns the entire HTML of the page in an indented, readable form. The output is very long, so only the beginning is shown here:

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of countries and dependencies by population - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-f

## Example 2: Extracting Elements

To extract elements from a website, you need to understand how its HTML is structured, because that structure is what we use to locate the elements we want.

Take a look at the URL we are scraping: 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

Open the page in a browser, right-click the title and select "Inspect". The HTML code will appear in the browser's developer tools panel. In this case, for the title, we have the following:

<h1 id="firstHeading" class="firstHeading mw-first-heading">
    <span class="mw-page-title-main"> List of countries and dependencies by population
    </span>
</h1>

The tag names and attributes that appear in this HTML structure (here, the h1 tag and its id and class attributes) are what we pass as arguments to the find() and find_all() methods to locate the element we want.

### find() method
Searches the HTML for elements that match the given arguments and returns only the first match, or None if nothing matches.

### find_all() method
Allows you to find all matching tags on the page, rather than just the first one. Returns a list of matching tags, which you can then loop over to extract the data you need.
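
The difference between the two methods is easiest to see on a small example. The following sketch uses a made-up list (items_html is not from the page being scraped) and also shows what each method returns when nothing matches:

```python
from bs4 import BeautifulSoup

items_html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
items = BeautifulSoup(items_html, "html.parser")

first_item = items.find("li")              # a single Tag
all_items = items.find_all("li")           # list-like ResultSet of Tags
first_two = items.find_all("li", limit=2)  # stop after two matches

missing_one = items.find("table")          # None: nothing matched
missing_all = items.find_all("table")      # empty result: nothing matched

print(first_item.text)
print([li.text for li in all_items])
print(len(first_two), missing_one, len(missing_all))
```

Because find() can return None, it is safer to check its result before reading attributes such as .text.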

### find() and find_all() arguments
find() and find_all() take arguments that describe the element you want. These arguments are based on the HTML code, as we saw previously. The most common ones are:

- **'name':** The name of the tag you want to find, such as "div" or "a".

- **'attrs':** A dictionary of attribute-value pairs that you want to search for. For example, you could search for all `<a>` tags that have an href attribute with the value "https://example.com".

- **'id':** The value of the id attribute of the tag you want to find. This is a shortcut for searching for a tag with a specific id attribute value.

- **'class_':** The value of the class attribute of the tag you want to find. This is a shortcut for searching for a tag with a specific class attribute value.

- **'string':** The string content that you want to search for within the tag. This can be useful for finding specific text within a larger HTML document.
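
The arguments listed above can be tried out on a small snippet. This is a sketch on made-up HTML (sample_html and the variable names are invented for the example):

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="intro" class="box highlight">
  <a href="https://example.com">Example</a>
  <a href="https://example.org" class="external">Other</a>
  <p>Hello</p>
</div>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# name: every <a> tag
by_name = sample.find_all("a")
# attrs: <a> tags whose href has exactly this value
by_attrs = sample.find_all("a", attrs={"href": "https://example.com"})
# id: the tag with id="intro"
by_id = sample.find(id="intro")
# class_: tags that carry the class "external"
by_class = sample.find_all(class_="external")
# string: a <p> tag whose text is exactly "Hello"
by_string = sample.find("p", string="Hello")

print(len(by_name), by_attrs[0].text, by_id.name)
print(by_class[0]["href"], by_string.text)
```

The trailing underscore in class_ is needed because class is a reserved word in Python.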

Now we can start scraping.

### Title (h1)
To extract the title, we use the soup.find() method to find the page's main heading in the HTML and read its text through the .text attribute. As we saw before, the tag for the title is h1:

In [11]:
title = soup.find('h1').text
print(title)

List of countries and dependencies by population
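
A slightly more defensive variant is sketched below. It runs on the heading markup shown earlier rather than on the live page, looks the heading up by its id, and uses get_text(strip=True) to trim surrounding whitespace. The names heading_html, heading_soup and page_title are assumptions made for the example:

```python
from bs4 import BeautifulSoup

# Heading markup copied from the inspected page shown earlier.
heading_html = """
<h1 id="firstHeading" class="firstHeading mw-first-heading">
    <span class="mw-page-title-main"> List of countries and dependencies by population
    </span>
</h1>
"""
heading_soup = BeautifulSoup(heading_html, "html.parser")

heading = heading_soup.find("h1", id="firstHeading")
if heading is not None:
    # get_text(strip=True) trims the whitespace around each text fragment.
    page_title = heading.get_text(strip=True)
else:
    page_title = None  # the page has no such heading

print(page_title)
```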


### Subtitles (h2)

Now, if we want to extract the subtitles of the page, we need a method that finds all the elements matching the condition. Here, instead of find(), we use find_all(), specifying the "h2" tag. To print the elements, we loop over the result:

In [16]:
subtitles = soup.find_all("h2")

for subtitle in subtitles:
    print(subtitle.text)

Contents
Method[edit]
Sovereign states and dependencies by population[edit]
Notes[edit]
References[edit]
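
The "[edit]" suffix in the output comes from the edit links that Wikipedia places inside each section heading. One possible way to drop it is sketched below on simplified stand-in markup. The class names mw-headline and mw-editsection are an assumption about how the page was structured when it was scraped, so check them with "Inspect" before relying on them:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for Wikipedia section headings (assumed markup).
section_html = """
<h2><span class="mw-headline" id="Method">Method</span>
<span class="mw-editsection">[<a href="#">edit</a>]</span></h2>
<h2><span class="mw-headline" id="Notes">Notes</span>
<span class="mw-editsection">[<a href="#">edit</a>]</span></h2>
"""
section_soup = BeautifulSoup(section_html, "html.parser")

clean_subtitles = []
for h2 in section_soup.find_all("h2"):
    edit_link = h2.find("span", class_="mw-editsection")
    if edit_link is not None:
        edit_link.decompose()  # remove the edit link from the tree
    clean_subtitles.append(h2.get_text(strip=True))

print(clean_subtitles)
```

Note that decompose() modifies the parsed tree, so the edit links are gone for any later search on the same object.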


We could also specify more than one tag in the same search:

In [17]:
all_subtitles = soup.find_all(["h2", "h3", "h4"])

for subtitle in all_subtitles:
    print(subtitle.text)

Contents
Method[edit]
Sovereign states and dependencies by population[edit]
Notes[edit]
References[edit]


In this case, the result is the same because there are no h3 or h4 elements on the page.
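
To see a list of tags make a difference, here is a sketch on a made-up outline that does contain several heading levels (outline_html is invented for the example). The matches come back in document order, and tag.name tells you which level each one is:

```python
from bs4 import BeautifulSoup

outline_html = """
<h2>Chapter 1</h2>
<h3>Section 1.1</h3>
<h2>Chapter 2</h2>
<h4>Detail 2.0.1</h4>
"""
outline = BeautifulSoup(outline_html, "html.parser")

# Collect (tag name, text) pairs for every h2, h3 and h4 in the document.
headings = [(tag.name, tag.text) for tag in outline.find_all(["h2", "h3", "h4"])]
for level, text in headings:
    print(level, text)
```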

### Table

Now let's extract a table from the page. For this, we specify the HTML tag (table) and the class name (wikitable), and we store the first matching table in the variable "table".

In [19]:
table = soup.find("table", class_="wikitable")

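Next, we loop over the table rows (`<tr>` tags), skipping the first one, which is the header. For each row, we find all the table cells (`<td>` tags) and, if the row has at least two of them, extract the text content of the first ([0]) and second ([1]) cells, which correspond to the country name and population, respectively. Finally, we print the name and population of each country. The first line of the output comes from the world total row, whose first cell holds a dash instead of a country name.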

In [24]:
for row in table.find_all("tr")[1:]:
    columns = row.find_all("td")
    if len(columns) >= 2:
        name = columns[0].text.strip()
        population = columns[1].text.strip()
        print(f"{name}: {population}")

–: World
China: 1,411,750,000
India: 1,388,163,000
United States: 334,524,000
Indonesia: 275,773,800
Pakistan: 235,825,000
Nigeria: 218,541,000
Brazil: 215,916,931
Bangladesh: 169,828,911
Russia: 146,424,729
Mexico: 128,665,641
Japan: 124,490,000
Philippines: 110,515,406
Ethiopia: 105,163,988
Egypt: 104,587,438
Vietnam: 99,460,000
DR Congo: 99,010,000
Iran: 86,285,704
Turkey: 85,279,553
Germany: 84,270,625
France: 68,042,591
United Kingdom: 67,026,292
Thailand: 66,908,267
Tanzania: 61,741,120
South Africa: 60,604,992
Italy: 58,887,359
Myanmar: 55,294,979
South Korea: 51,439,038
Colombia: 51,049,498
Spain: 47,615,034
Kenya: 47,564,296
Argentina: 46,044,703
Algeria: 45,400,000
Sudan: 45,246,845
Uganda: 42,885,900
Iraq: 41,190,700
Ukraine: 41,130,432
Canada: 39,155,563
Poland: 37,767,000
Morocco: 36,901,040
Uzbekistan: 36,117,144
Saudi Arabia: 34,110,821
Yemen: 33,697,000
Peru: 33,396,698
Angola: 33,086,278
Afghanistan: 32,890,171
Malaysia: 32,777,400
Mozambique: 32,419,747
Ghana: 30,832,
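
Printing is fine for a first look, but for further processing it is usually more convenient to collect the rows into a data structure. The sketch below does this on a made-up two-row table with the same shape as the one above. The names sample_table and records are assumptions, and rows whose second cell is not a plain number (such as the world total row) are skipped:

```python
from bs4 import BeautifulSoup

# A made-up two-row table with the same shape the loop above expects.
table_html = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>China</td><td>1,411,750,000</td></tr>
  <tr><td>India</td><td>1,388,163,000</td></tr>
</table>
"""
sample_table = BeautifulSoup(table_html, "html.parser").find("table", class_="wikitable")

records = []
for row in sample_table.find_all("tr")[1:]:  # skip the header row
    cells = row.find_all("td")
    if len(cells) < 2:
        continue
    name = cells[0].get_text(strip=True)
    raw_population = cells[1].get_text(strip=True).replace(",", "")
    # Keep only rows whose second cell is a plain number.
    if raw_population.isdigit():
        records.append({"name": name, "population": int(raw_population)})

print(records)
```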

### URLs
Now we are going to get a list of all the URLs cited in the references section of the page.
1. We use the find_all() method to get all the div elements with the reflist class.
2. With a loop, we search each of those elements for links (a tags) that have an "href" attribute, which holds the URL.
3. We print only the URLs that start with http, which leaves out internal and relative links.


In [30]:
ref_urls = soup.find_all('div', class_='reflist')

for ref in ref_urls:
    for link in ref.find_all('a', href=True):
        href = link['href']
        if href.startswith('http'):
            print(href)

https://ec.europa.eu/eurostat/databrowser/view/tps00001/default/table?lang=en
https://www.unfpa.org/data/world-population-dashboard
https://population.un.org/wpp/
https://population.un.org/wpp/
http://www.stats.gov.cn/english/PressRelease/202302/t20230227_1918979.html
https://main.mohfw.gov.in/sites/default/files/Population%20Projection%20Report%202011-2036%20-%20upload_compressed_0.pdf
https://www.census.gov/popclock/
https://www.bps.go.id/indicator/12/1975/1/mid-year-population.html
https://www.ibge.gov.br/apps/populacao/projecao/index.html
https://bdnews24.com/bangladesh/7tnmyurq0p
https://rosstat.gov.ru/storage/mediabank/PrPopul2023_Site_.xlsx
https://www.economy.com/mexico/population/not-seasonally-adjusted
https://www.stat.go.jp/english/data/jinsui/tsuki/index.html
https://www.pna.gov.ph/articles/1163852
https://popcom.gov.ph/
https://www.statsethiopia.gov.et/wp-content/uploads/2022/07/Population-Size-by-Sex-Zone-and-Wereda-July-2022.pdf
http://www.capmas.gov.eg/
https://web.arch
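
The same URL can appear more than once in the output (https://population.un.org/wpp/ is listed twice above). If you only want each URL once, one option is to collect the links in a list and remove duplicates while keeping their order. This is a sketch on made-up markup; refs_html, found and unique_urls are names invented for the example:

```python
from bs4 import BeautifulSoup

refs_html = """
<div class="reflist">
  <a href="https://population.un.org/wpp/">UN</a>
  <a href="#cite_note-1">^</a>
  <a href="https://population.un.org/wpp/">UN again</a>
  <a href="https://www.census.gov/popclock/">Census</a>
</div>
"""
refs_soup = BeautifulSoup(refs_html, "html.parser")

found = []
for ref in refs_soup.find_all("div", class_="reflist"):
    for link in ref.find_all("a", href=True):
        href = link["href"]
        if href.startswith("http"):
            found.append(href)

# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_urls = list(dict.fromkeys(found))
print(unique_urls)
```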