## Web Scraping Tutorial
For this assignment, use the techniques learnt in the previous session to scrape the following page: "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"

For web scraping, use the following libraries:
1. BeautifulSoup
2. requests 
3. pandas

Objective:
* Create a DataFrame containing all countries listed on the Wikipedia page

Steps:
1. Import the libraries:
   * pandas
   * requests
   * BeautifulSoup
2. Ping the website and return the HTML of the website
3. Use the prettify function to view how the tags are nested in the document
4. Find class 'sortable wikitable sticky-header col2left' in the HTML document
5. Extract all the links (the 'a' tags) using find_all()
6. From the links found earlier, extract the titles using the 'get' method
   * Note: Create a list to hold the countries and name the list variable 'countries'.
7. Create the DataFrame called df_countries
8. Set the column 'Country' in df_countries to countries

### 1. Import Libraries

In [1]:

import pandas as pd  # DataFrame for the final table of countries
import requests  # HTTP request to download the page
from bs4 import BeautifulSoup  # HTML parsing
from pprint import pprint # pretty print helps for formatting

### 2. Ping the website and return the HTML of the website

In [2]:
# Scraping HTML
url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"
page = requests.get(url) # using requests to pull the page
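
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html`, the `HEADERS` value, and the 10-second timeout are assumptions chosen for illustration. Wikimedia's User-Agent policy asks automated clients to identify themselves, so a descriptive header is included.

```python
import requests

# Hypothetical User-Agent string for illustration; replace it with your own details.
HEADERS = {"User-Agent": "web-scraping-tutorial/0.1 (educational example)"}


def fetch_html(url, timeout=10):
    """Return the page HTML as text, raising an error if the request failed."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Example call (needs network access):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area")
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` in the same way as `page.content`.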

### 3. Use the prettify function to view how the tags are nested in the document

In [None]:
# Making HTML more presentable
soup = BeautifulSoup(page.content, "html.parser")
print(soup)
print(soup.prettify())
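
Printing the full Wikipedia page produces a very long output, so the effect of `prettify()` is easier to see on a small fragment. The sketch below uses a made-up HTML snippet and the hypothetical names `demo_soup` and `pretty`; it is only meant to show that each nested tag is placed on its own indented line.

```python
from bs4 import BeautifulSoup

# Made-up fragment, loosely shaped like one table row of the page
html = (
    "<table class='wikitable'><tr><td>"
    "<a href='/wiki/Russia' title='Russia'>Russia</a>"
    "</td></tr></table>"
)

demo_soup = BeautifulSoup(html, "html.parser")
pretty = demo_soup.prettify()  # one tag per line, indented by nesting depth
print(pretty)
```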

In [4]:
# Finding tags
soup.find_all('p')

[<p> 
 Below is a <b>list of countries in <a href="/wiki/Asia" title="Asia">Asia</a> by area</b>.<sup class="reference" id="cite_ref-unstats21_1-0"><a href="#cite_note-unstats21-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup> <a href="/wiki/Russia" title="Russia">Russia</a> is the largest country in Asia and the world, even after excluding its European portion. The <a href="/wiki/Maldives" title="Maldives">Maldives</a> is the smallest country in Asia.
 </p>]

In [8]:
soup.find_all('p')[0].get_text()

' \nBelow is a list of countries in Asia by area.[1] Russia is the largest country in Asia and the world, even after excluding its European portion. The Maldives is the smallest country in Asia.\n'

### 4. Find class 'sortable wikitable sticky-header col2left' in the HTML

In [None]:
soup.find_all(class_ = 'sortable wikitable sticky-header col2left')
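
Passing a string with several class names to `class_` matches only when it equals the tag's `class` attribute exactly, including the order of the names. If Wikipedia adds or reorders a class, the search above would return an empty list. A possible alternative, sketched here on a made-up fragment with hypothetical variable names, is a CSS selector, which matches the classes in any order:

```python
from bs4 import BeautifulSoup

# Made-up fragment using the same class attribute as the real table
html = """
<table class="sortable wikitable sticky-header col2left">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the whole class attribute
exact = demo_soup.find_all(class_="sortable wikitable sticky-header col2left")

# Same classes in a different order or as a subset: no match
reordered = demo_soup.find_all(class_="wikitable sortable")

# CSS selector: requires both classes, in any order
by_selector = demo_soup.select("table.wikitable.sortable")

print(len(exact), len(reordered), len(by_selector))  # prints: 1 0 1
```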

### 5. Extract all the links using find_all()

In [14]:
# Note: this collects every <a> tag on the page, not only those inside the table
a_tags = soup.find_all('a')
links = [a.get('href') for a in a_tags if a.get('href')]

In [15]:
print("Extracted links:", links)

Extracted links: ['#bodyContent', '/wiki/Main_Page', '/wiki/Wikipedia:Contents', '/wiki/Portal:Current_events', '/wiki/Special:Random', '/wiki/Wikipedia:About', '//en.wikipedia.org/wiki/Wikipedia:Contact_us', '/wiki/Help:Contents', '/wiki/Help:Introduction', '/wiki/Wikipedia:Community_portal', '/wiki/Special:RecentChanges', '/wiki/Wikipedia:File_upload_wizard', '/wiki/Main_Page', '/wiki/Special:Search', 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=List+of+Asian+countries+by+area', '/w/index.php?title=Special:UserLogin&returnto=List+of+Asian+countries+by+area', 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en', '/w/index.php?title=Special:CreateAccount&returnto=List+of+Asian+countries+by+area', '/w/index.php?title=Special:UserLogin&returnto
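
The list above includes sidebar and login links because `soup.find_all('a')` searches the whole page. One way to narrow the result, shown here as a sketch on a made-up fragment with hypothetical names (`demo_soup`, `table`, `table_links`), is to locate the table first and call `find_all` on that element only:

```python
from bs4 import BeautifulSoup

# Made-up fragment: one navigation link outside the table, two links inside it
html = """
<a href="/wiki/Main_Page" title="Visit the main page">Main page</a>
<table class="wikitable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
  <tr><td><a href="/wiki/Maldives" title="Maldives">Maldives</a></td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

table = demo_soup.find("table", class_="wikitable")

# href=True keeps only <a> tags that actually carry an href attribute
table_links = [a["href"] for a in table.find_all("a", href=True)]
print(table_links)  # ['/wiki/Russia', '/wiki/Maldives']
```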

### 6. Extract the titles from the links using the 'get' method

In [16]:
links_and_titles = [
    {"link": a.get('href'), "title": a.get('title')}
    for a in a_tags if a.get('href')
]

In [19]:
print("Extracted links and titles:")
for item in links_and_titles:
    print(f"Link: {item['link']}, Title: {item['title']}")

Extracted links and titles:
Link: #bodyContent, Title: None
Link: /wiki/Main_Page, Title: Visit the main page [z]
Link: /wiki/Wikipedia:Contents, Title: Guides to browsing Wikipedia
Link: /wiki/Portal:Current_events, Title: Articles related to current events
Link: /wiki/Special:Random, Title: Visit a randomly selected article [x]
Link: /wiki/Wikipedia:About, Title: Learn about Wikipedia and how it works
Link: //en.wikipedia.org/wiki/Wikipedia:Contact_us, Title: How to contact Wikipedia
Link: /wiki/Help:Contents, Title: Guidance on how to use and edit Wikipedia
Link: /wiki/Help:Introduction, Title: Learn how to edit Wikipedia
Link: /wiki/Wikipedia:Community_portal, Title: The hub for editors
Link: /wiki/Special:RecentChanges, Title: A list of recent changes to Wikipedia [r]
Link: /wiki/Wikipedia:File_upload_wizard, Title: Add images or other media for use on Wikipedia
Link: /wiki/Main_Page, Title: None
Link: /wiki/Special:Search, Title: Search Wikipedia [f]
Link: https://donate.wikimedi

In [20]:
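countries = []  # Initialize the list to store country names

# a_tags covers every link on the page, so titles of navigation and footer links are collected too
for a in a_tags:
    title = a.get('title')  # Extract the title attribute
    if title:  # Skip links that have no title
        countries.append(title)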
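
The loop above keeps the title of every titled link on the page. To collect only country names, one option is to walk the table row by row and keep the first titled link in each row. This is a sketch under assumptions: the fragment is made up, the names `demo_soup` and `demo_countries` are hypothetical, and it assumes the first titled link in a row is the country, which should be checked against the real table.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a header row, then rows with a country link and a footnote link
html = """
<table class="wikitable">
  <tr><th>Country</th></tr>
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a>
      <sup><a href="#cite_note-1">[1]</a></sup></td></tr>
  <tr><td><a href="/wiki/Maldives" title="Maldives">Maldives</a></td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
table = demo_soup.find("table", class_="wikitable")

demo_countries = []
for row in table.find_all("tr"):
    link = row.find("a", title=True)  # first link in the row that has a title
    if link is not None:  # the header row has no link, so it is skipped
        demo_countries.append(link.get("title"))

print(demo_countries)  # ['Russia', 'Maldives']
```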

### 7. Create the DataFrame called df_countries

In [21]:
df_countries = pd.DataFrame({'Country': countries})  # Create the DataFrame
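
The cell above creates the DataFrame and fills the column in one call. Steps 7 and 8 describe the same thing as two separate actions, which can be sketched as follows. The names `df_demo` and `demo_countries` and the two stand-in values are assumptions for illustration, not data scraped from the page.

```python
import pandas as pd

demo_countries = ["Russia", "Maldives"]  # stand-in values, not scraped data

df_demo = pd.DataFrame()              # step 7: create an empty DataFrame
df_demo["Country"] = demo_countries   # step 8: set the 'Country' column

print(df_demo.shape)  # (2, 1)
```

Both forms produce the same single-column DataFrame; the one-call version is simply shorter.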

In [22]:
# Display the DataFrame
print("Countries DataFrame:")
print(df_countries)

Countries DataFrame:
                                               Country
0                              Visit the main page [z]
1                         Guides to browsing Wikipedia
2                   Articles related to current events
3                Visit a randomly selected article [x]
4               Learn about Wikipedia and how it works
..                                                 ...
244           Category:Articles with short description
245  Category:Short description is different from W...
246  Wikipedia:Text of the Creative Commons Attribu...
247  foundation:Special:MyLanguage/Policy:Terms of Use
248  foundation:Special:MyLanguage/Policy:Privacy p...

[249 rows x 1 columns]
