# Web Scraping Project

This Python notebook demonstrates a simple web scraping project that extracts information about the world's countries from [ScrapeThisSite](https://www.scrapethissite.com/pages/simple/). It uses the BeautifulSoup library to parse the page's HTML and extract the relevant data.

In [33]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [8]:
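url = "https://www.scrapethissite.com/pages/simple/"
# Fail fast on a hung connection or an HTTP error instead of parsing an error page
response = requests.get(url, timeout=10)
response.raise_for_status()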

In [9]:
html = response.text
soup = BeautifulSoup(html,'html.parser')

In [10]:
soup.title.text

'Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping'

In [11]:
# Initialize the lists that will store the data
countries = []
capitals = []
populations = []
areas = []

In [12]:
# Find all the div elements with class "col-md-4 country"
country_divs = soup.find_all('div', class_='col-md-4 country')

In [13]:
# Loop through each div element and extract information
for country_div in country_divs:
    # Extract country name
    country_name = country_div.find('h3', class_='country-name').text.strip()
    countries.append(country_name)

    # Extract capital
    capital = country_div.find('span', class_='country-capital').text.strip()
    capitals.append(capital)

    # Extract population
    population = country_div.find('span', class_='country-population').text.strip()
    populations.append(population)

    # Extract area
    area = country_div.find('span', class_='country-area').text.strip()
    areas.append(area)
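
The loop above assumes that every country block contains all four elements; if one were missing, `find` would return `None` and `.text` would raise an `AttributeError`. The following is a minimal sketch of a more defensive variant that collects one dictionary per country. The `SAMPLE_HTML` snippet, the "Testland" entry, and the helper names `text_or_none` and `extract_records` are assumptions made for illustration and are not part of the original notebook or the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure of the target page.
# The second block deliberately omits population and area.
SAMPLE_HTML = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon"></i> Andorra </h3>
  <div class="country-info">
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
    <span class="country-area">468.0</span>
  </div>
</div>
<div class="col-md-4 country">
  <h3 class="country-name"> Testland </h3>
  <div class="country-info">
    <span class="country-capital">Test City</span>
  </div>
</div>
"""


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def extract_records(soup):
    """Build one dict per country block instead of four parallel lists."""
    records = []
    for div in soup.find_all("div", class_="country"):
        records.append({
            "Country": text_or_none(div, "h3", "country-name"),
            "Capital": text_or_none(div, "span", "country-capital"),
            "Population": text_or_none(div, "span", "country-population"),
            "Area": text_or_none(div, "span", "country-area"),
        })
    return records


records = extract_records(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)`, which keeps each country's fields together and leaves missing values as `None` rather than stopping the loop.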

In [14]:
# Create a DataFrame from the collected data
world_countries = pd.DataFrame({
    'Country': countries,
    'Capital': capitals,
    'Population': populations,
    'Area': areas
})

In [15]:
# Save the DataFrame to a CSV file
world_countries.to_csv('world_countries.csv', index=False)

In [17]:
# Display the DataFrame
world_countries

|  | Country | Capital | Population | Area |
|---|---|---|---|---|
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
| 1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
| 4 | Anguilla | The Valley | 13254 | 102.0 |
| ... | ... | ... | ... | ... |
| 245 | Yemen | Sanaa | 23495361 | 527970.0 |
| 246 | Mayotte | Mamoudzou | 159042 | 374.0 |
| 247 | South Africa | Pretoria | 49000000 | 1219912.0 |
| 248 | Zambia | Lusaka | 13460305 | 752614.0 |


In [18]:
world_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     250 non-null    object
 1   Capital     250 non-null    object
 2   Population  250 non-null    object
 3   Area        250 non-null    object
dtypes: object(4)
memory usage: 7.9+ KB


In [19]:
# Convert 'Population' to int and 'Area' to float
world_countries['Population'] = world_countries['Population'].astype(int)
world_countries['Area'] = world_countries['Area'].astype(float)
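
`astype(int)` raises a `ValueError` as soon as one scraped string is not a clean number, and plain `int` maps to a platform-dependent integer width. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype. The small `raw` frame, including the "Hypothetical" row with unparseable values, is invented for illustration and does not come from the scraped data.

```python
import pandas as pd

# Illustrative stand-in for the scraped strings; the last row is made up
# to show what happens when a value cannot be parsed.
raw = pd.DataFrame({
    "Country": ["Andorra", "Anguilla", "Hypothetical"],
    "Population": ["84000", "13254", "n/a"],
    "Area": ["468.0", "102.0", "unknown"],
})

# Unparseable strings become NaN instead of raising an error.
population = pd.to_numeric(raw["Population"], errors="coerce")
# The nullable Int64 dtype keeps whole numbers and represents gaps as <NA>.
raw["Population"] = population.astype("Int64")
raw["Area"] = pd.to_numeric(raw["Area"], errors="coerce")

print(raw.dtypes)
print(raw)
```

Whether to coerce bad values silently or let the conversion fail is a design choice: coercion keeps the pipeline running, while the stricter `astype` call surfaces unexpected page changes immediately.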

In [20]:
# Save the converted DataFrame to the CSV file again
world_countries.to_csv('world_countries.csv', index=False)
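
To illustrate what `index=False` does when saving, here is a small round-trip sketch that writes to an in-memory buffer instead of a file. The three-column `df` is a cut-down stand-in for `world_countries`, and the variable names `restored` and `with_index` are assumptions for this example only.

```python
import io

import pandas as pd

# Small stand-in for the scraped table.
df = pd.DataFrame({
    "Country": ["Andorra", "Anguilla"],
    "Population": [84000, 13254],
    "Area": [468.0, 102.0],
})

# Write without the index, then read the CSV text back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Writing with the default index adds an unnamed leading column on re-read.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(list(restored.columns))
print(list(with_index.columns))
```

Reading the file back this way is also a quick check that the numeric columns are parsed as numbers again, since CSV itself stores only text.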

In [21]:
world_countries

|  | Country | Capital | Population | Area |
|---|---|---|---|---|
| 0 | Andorra | Andorra la Vella | 84000 | 468.0 |
| 1 | United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 |
| 2 | Afghanistan | Kabul | 29121286 | 647500.0 |
| 3 | Antigua and Barbuda | St. John's | 86754 | 443.0 |
| 4 | Anguilla | The Valley | 13254 | 102.0 |
| ... | ... | ... | ... | ... |
| 245 | Yemen | Sanaa | 23495361 | 527970.0 |
| 246 | Mayotte | Mamoudzou | 159042 | 374.0 |
| 247 | South Africa | Pretoria | 49000000 | 1219912.0 |
| 248 | Zambia | Lusaka | 13460305 | 752614.0 |


In [22]:
world_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     250 non-null    object 
 1   Capital     250 non-null    object 
 2   Population  250 non-null    int32  
 3   Area        250 non-null    float64
dtypes: float64(1), int32(1), object(2)
memory usage: 7.0+ KB


In [23]:
top5_population = world_countries.nlargest(5, 'Population')
print("Top 5 Countries by Population:")
print(top5_population[['Country', 'Population']])

Top 5 Countries by Population:
           Country  Population
47           China  1330044000
104          India  1173108018
232  United States   310232863
100      Indonesia   242968342
30          Brazil   201103330


In [24]:
top5_area = world_countries.nlargest(5, 'Area')
print("Top 5 Countries by Area in square kilometers:")
print(top5_area[['Country', 'Area']])

Top 5 Countries by Area in square kilometers:
           Country        Area
190         Russia  17100000.0
8       Antarctica  14000000.0
37          Canada   9984670.0
232  United States   9629091.0
47           China   9596960.0
