# Scraping the Web using Python and BeautifulSoup
#### Follow this link for the full post: [https://www.gettingstarted.ai/scraping-the-web-using-python-and-beautifulsoup/](https://www.gettingstarted.ai/scraping-the-web-using-python-and-beautifulsoup/)
For this project, we'll scrape country data from https://www.scrapethissite.com. As the domain name suggests, the site exists to help people learn and practice web scraping.

## Import required modules

In [1]:
# Import required libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

## Retrieve content using GET method

In [2]:
url = "https://www.scrapethissite.com/pages/simple/"

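# Make a GET request (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=10)

# Raise an exception if the server returned an error status (4xx or 5xx)
response.raise_for_status()

# Store the raw HTML (bytes) in html_content
html_content = response.content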

## Parse HTML content using BeautifulSoup

In [3]:
# Instantiate BeautifulSoup object

soup = BeautifulSoup(html_content, "html.parser")
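
To get a feel for what a parsed `soup` object offers without touching the network, here is a minimal sketch that parses a small inline fragment. The fragment and the names `sample_html` and `sample_soup` are made up for illustration; the markup is only modeled on the class names used in this notebook and is not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, shaped like the markup this notebook targets
sample_html = """
<div class="country">
  <h3 class="country-name">Andorra</h3>
  <span class="country-capital">Andorra la Vella</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
first_name = sample_soup.find("h3", class_="country-name")
missing = sample_soup.find("span", class_="country-area")

# get_text(strip=True) returns the tag's text without surrounding whitespace
name_text = first_name.get_text(strip=True)
print(name_text)
print(missing)
```

The `None` result for a tag that is absent matters later: calling `.text` on it would raise an `AttributeError`.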

## Retrieve all countries using a for loop

In [4]:
# Find all the div tags with class 'country'

countries = soup.find_all("div", class_="country")

data = []

# Loop through each tag and extract the target text
for country in countries:
    name = country.find("h3", class_="country-name").text.strip()
    capital = country.find("span", class_="country-capital").text.strip()
    population = country.find("span", class_="country-population").text.strip()
    area = country.find("span", class_="country-area").text.strip()

    # Add the country and its details to the list `data`
    data.append([name, capital, population, area])
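
The loop above assumes every country block contains all four tags. If one were missing, `find()` would return `None` and `.text` would fail. One possible way to guard against that is sketched below. The helper `text_or_default` and the sample markup (including the fictional "Nowhere" entry) are assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second country has no capital tag at all
sample_html = """
<div class="country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
<div class="country">
  <h3 class="country-name">Nowhere</h3>
  <span class="country-population">0</span>
  <span class="country-area">1.0</span>
</div>
"""


def text_or_default(parent, tag, css_class, default=""):
    """Return the stripped text of the first matching child, or `default`."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else default


sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for country in sample_soup.find_all("div", class_="country"):
    rows.append([
        text_or_default(country, "h3", "country-name"),
        text_or_default(country, "span", "country-capital"),
        text_or_default(country, "span", "country-population"),
        text_or_default(country, "span", "country-area"),
    ])

print(rows)
```

Returning an empty string keeps the row shape intact, so the list can still be loaded into a DataFrame with a fixed set of columns.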

## Convert to pandas DataFrame

In [5]:
# Instantiate DataFrame with column names
df = pd.DataFrame(data, columns=["Country Name", "Capital", "Population", "Area"])

# Convert column values to numeric
df["Area"] = pd.to_numeric(df["Area"])
df["Population"] = pd.to_numeric(df["Population"])
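
By default, `pd.to_numeric` raises an error when it meets a value it cannot parse. The sketch below shows one way to handle messy input, using `errors="coerce"` to turn unparseable values into `NaN`. The sample values (a thousands separator and an `"N/A"` marker) are invented for illustration and do not come from the scraped site.

```python
import pandas as pd

# Hypothetical raw strings: one clean, one with a thousands separator, one junk
raw = pd.Series(["468.0", "82,880", "N/A"])

# errors="coerce" replaces anything unparseable with NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Removing the separator first lets the second value parse as well
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(numeric.tolist())
print(cleaned.tolist())
```

Checking `numeric.isna().sum()` afterwards is a quick way to see how many values failed to convert.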

## Return first five rows in DataFrame

In [6]:
# Print first five rows in DataFrame
df.head()

|  | Country Name | Capital | Population | Area |
| --- | --- | --- | --- | --- |
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
| 1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
| 4 | Anguilla | The Valley | 13254 | 102.0 |


## Sort alphabetically by country name

In [7]:
# Sort the dataframe by the "Country Name" column in alphabetical order
df = df.sort_values(by="Country Name")

# Print first five rows in DataFrame
df.head()

|  | Country Name | Capital | Population | Area |
| --- | --- | --- | --- | --- |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 5 | Albania | Tirana | 2986952 | 28748.0 |
| 61 | Algeria | Algiers | 34586184 | 2381740.0 |
| 10 | American Samoa | Pago Pago | 57881 | 199.0 |
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
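
After sorting, the index still shows each row's original position (2, 5, 61, and so on). If a fresh 0-based index is preferred, one option is to chain `reset_index(drop=True)`. This is a small sketch on a made-up three-row frame named `sample`, not a change to the notebook's `df`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the scraped data
sample = pd.DataFrame({
    "Country Name": ["Andorra", "Afghanistan", "Albania"],
    "Capital": ["Andorra la Vella", "Kabul", "Tirana"],
})

# Sort alphabetically, then discard the old index and renumber from 0
sorted_sample = sample.sort_values(by="Country Name").reset_index(drop=True)

print(sorted_sample)
```

Keeping the original index is also a valid choice, since it records where each row appeared on the page.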


## Get top 10 largest countries by area

In [8]:
# Sort the dataframe by the "Area" column in descending order
sorted_by_area = df.sort_values(by="Area", ascending=False)

# Retrieve the ten countries with the biggest area
top_countries_by_area = sorted_by_area.head(10)

print("Top 10 Countries with the Biggest Area:")
top_countries_by_area

Top 10 Countries with the Biggest Area:


|  | Country Name | Capital | Population | Area |
| --- | --- | --- | --- | --- |
| 190 | Russia | Moscow | 140702000 | 17100000.0 |
| 8 | Antarctica |  | 0 | 14000000.0 |
| 37 | Canada | Ottawa | 33679000 | 9984670.0 |
| 232 | United States | Washington | 310232863 | 9629091.0 |
| 47 | China | Beijing | 1330044000 | 9596960.0 |
| 30 | Brazil | Brasília | 201103330 | 8511965.0 |
| 12 | Australia | Canberra | 21515754 | 7686850.0 |
| 104 | India | New Delhi | 1173108018 | 3287590.0 |
| 9 | Argentina | Buenos Aires | 41343201 | 2766890.0 |
| 124 | Kazakhstan | Astana | 15340000 | 2717300.0 |
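
As an alternative to sorting the whole frame and then taking the head, pandas also provides `DataFrame.nlargest`, which selects the top rows by a column in one call. The sketch below uses a small made-up frame named `sample` with a few area values taken from the output above.

```python
import pandas as pd

# Small stand-in frame; the area values match rows shown in this notebook
sample = pd.DataFrame({
    "Country Name": ["Russia", "Canada", "Monaco", "China"],
    "Area": [17100000.0, 9984670.0, 1.95, 9596960.0],
})

# Select the two rows with the largest "Area", largest first
top_two = sample.nlargest(2, "Area")

# The sort-then-head approach gives the same rows in the same order
same_result = sample.sort_values(by="Area", ascending=False).head(2)

print(top_two)
```

Both approaches keep the original index labels, so the rows can still be traced back to the full DataFrame.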


## Get top 10 most populated countries

In [9]:
# Sort the dataframe by the "Population" column in descending order
sorted_by_population = df.sort_values(by="Population", ascending=False)

# Retrieve the ten countries with the largest population
top_countries_by_population = sorted_by_population.head(10)

top_countries_by_population

|  | Country Name | Capital | Population | Area |
| --- | --- | --- | --- | --- |
| 47 | China | Beijing | 1330044000 | 9596960.0 |
| 104 | India | New Delhi | 1173108018 | 3287590.0 |
| 232 | United States | Washington | 310232863 | 9629091.0 |
| 100 | Indonesia | Jakarta | 242968342 | 1919440.0 |
| 30 | Brazil | Brasília | 201103330 | 8511965.0 |
| 177 | Pakistan | Islamabad | 184404791 | 803940.0 |
| 18 | Bangladesh | Dhaka | 156118464 | 144000.0 |
| 163 | Nigeria | Abuja | 154000000 | 923768.0 |
| 190 | Russia | Moscow | 140702000 | 17100000.0 |
| 113 | Japan | Tokyo | 127288000 | 377835.0 |


## Calculate population density

In [10]:
# Calculate population density as population divided by area
df["Population Density"] = df["Population"] / df["Area"]

# Sort the dataframe by the "Population Density" column in descending order
sorted_by_density = df.sort_values(by="Population Density", ascending=False)

# Retrieve the five countries with the highest population density
top_countries_by_density = sorted_by_density.head(5)

print("Top 5 Countries with the Highest Population Density (people per km²):")
top_countries_by_density

Top 5 Countries with the Highest Population Density (people per km²):


|  | Country Name | Capital | Population | Area | Population Density |
| --- | --- | --- | --- | --- | --- |
| 137 | Monaco | Monaco | 32965 | 1.95 | 16905.128205 |
| 197 | Singapore | Singapore | 4701069 | 692.7 | 6786.587267 |
| 94 | Hong Kong | Hong Kong | 6898686 | 1092.0 | 6317.478022 |
| 82 | Gibraltar | Gibraltar | 27884 | 6.5 | 4289.846154 |
| 235 | Vatican City | Vatican City | 921 | 0.44 | 2093.181818 |
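
Dividing by an area of zero does not raise an error in pandas; it produces `inf`, which would then sort to the top of a density ranking. The sketch below shows one possible guard that replaces infinite results with `NaN`. The zero-area row is invented for illustration, and it is an assumption that such a row could appear in scraped data at all.

```python
import numpy as np
import pandas as pd

# Monaco's figures come from the output above; the second row is hypothetical
sample = pd.DataFrame({
    "Country Name": ["Monaco", "Hypothetical Zero Area"],
    "Population": [32965, 100],
    "Area": [1.95, 0.0],
})

# Plain division: the zero-area row yields inf rather than an exception
density = sample["Population"] / sample["Area"]

# Treat infinite densities as missing so they do not dominate the ranking
sample["Population Density"] = density.replace([np.inf, -np.inf], np.nan)

print(sample)
```

`sort_values` places `NaN` last by default, so rows without a usable density fall to the bottom of the ranking.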


## Save DataFrame as a CSV file

In [11]:
df.to_csv("country_data.csv", index=False)
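
Passing `index=False` leaves the row index out of the file. To illustrate why that is useful, the sketch below writes a tiny made-up frame to in-memory buffers (instead of files on disk) with and without the index, then reads both back. The frame `sample` and the buffer names are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a non-default index label
sample = pd.DataFrame(
    {"Country Name": ["Andorra"], "Population": [84000]},
    index=[7],
)

# Write with the index (the default), then read the CSV back
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Write without the index, then read the CSV back
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column unless `index_col=0` is passed to `pd.read_csv`.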

### Author
jeff @ [gettingstarted.ai](https://www.gettingstarted.ai) © 2023