# CM4125 Extra Topic 2 (Web Scraping) Lab 1

The aim of this activity is to scrape Wikipedia and build a data frame containing the population of the **eight** localities considered "cities" in Scotland (i.e. Aberdeen, Dundee, Dunfermline, Edinburgh, Glasgow, Inverness, Perth and Stirling).

The data can be found [here](https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Scotland_by_population).

First, install and import the two packages seen in the lecture, `requests` and `BeautifulSoup`:

In [None]:
# install necessary packages
!pip install requests
!pip install bs4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# import required packages
import requests
from bs4 import BeautifulSoup

Now we define a URL to target and get its content:

In [None]:
# define the url
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Scotland_by_population"
# request the url
r = requests.get(url)
# soup it
soup = BeautifulSoup(r.content,"html.parser")
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of towns and cities in Scotland by population - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"cae910eb-8fc2-447c-aa90-e7de2bda29c9","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_towns_and_cities_in_Scotland_by_population","wgTitle":"List of towns and cities in Scotland by population","wgCurRevisionId":1104723865,"wgRevisionId":1104723865,"wgArticleId":14184004,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short 

Now get the rows of the table by finding all the elements inside it that have the `tr` tag:

In [None]:
# get rows of the table
rows = soup.find("table", {"class":"wikitable sortable"}).find_all("tr")
# print all rows
rows

[<tr>
 <th>Rank</th>
 <th>Locality</th>
 <th>Population</th>
 <th>Status</th>
 <th>Council area
 </th></tr>, <tr>
 <th>1
 </th>
 <td><a href="/wiki/Glasgow" title="Glasgow">Glasgow</a><span class="anchor" id="Glasgow"></span>
 </td>
 <td>632,350
 </td>
 <td>City
 </td>
 <td><a href="/wiki/Glasgow_City_Council" title="Glasgow City Council">Glasgow City</a>
 </td></tr>, <tr>
 <th>2
 </th>
 <td><a href="/wiki/Edinburgh" title="Edinburgh">Edinburgh</a><span class="anchor" id="Edinburgh"></span>
 </td>
 <td>506,520
 </td>
 <td>City
 </td>
 <td><a href="/wiki/City_of_Edinburgh_Council" title="City of Edinburgh Council">City of Edinburgh</a>
 </td></tr>, <tr>
 <th>3
 </th>
 <td><a href="/wiki/Aberdeen" title="Aberdeen">Aberdeen</a><span class="anchor" id="Aberdeen"></span>
 </td>
 <td>198,590
 </td>
 <td>City
 </td>
 <td><a href="/wiki/Aberdeen_City_Council" title="Aberdeen City Council">Aberdeen City</a>
 </td></tr>, <tr>
 <th>4
 </th>
 <td><a href="/wiki/Dundee" title="Dundee">Dundee</a><sp
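
The lookup above passes the full string `"wikitable sortable"` as the class, which `BeautifulSoup` compares against the exact value of the `class` attribute. The sketch below uses a small made-up table (not the real Wikipedia markup) whose classes are in a different order, to show how that exact-string lookup can miss a table and two more tolerant alternatives:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as the real table, but reordered and with one extra
html = """
<table class="sortable wikitable extra">
  <tr><th>Rank</th><th>Locality</th></tr>
  <tr><th>1</th><td>Glasgow</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with a space is matched against the whole class attribute, so this misses
exact = soup.find("table", {"class": "wikitable sortable"})

# A single class name matches any table that has that class among others
by_one_class = soup.find("table", class_="wikitable")

# A CSS selector requires both classes, in any order
rows = soup.select("table.wikitable.sortable tr")

print(exact)
print(by_one_class is not None)
print(len(rows))
```

On the real page the exact-string lookup works, as the output above shows, but the single-class and CSS-selector forms are less likely to break if the page markup changes.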

As an example, querying `rows[0].find_all("th")` returns the cells of the first row, which contains the headers:

In [None]:
# query all columns of the first row (the header)
rows[0].find_all("th")

[<th>Rank</th>,
 <th>Locality</th>,
 <th>Population</th>,
 <th>Status</th>,
 <th>Council area
 </th>]

Querying the `td` cells of `rows[1]` returns the first data entry:

In [None]:
# This is the first data row (the rank is a <th>, so it is not included)
rows[1].find_all("td")

[<td><a href="/wiki/Glasgow" title="Glasgow">Glasgow</a><span class="anchor" id="Glasgow"></span>
 </td>, <td>632,350
 </td>, <td>City
 </td>, <td><a href="/wiki/Glasgow_City_Council" title="Glasgow City Council">Glasgow City</a>
 </td>]
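
To make the link between the header row and a data row explicit, here is a minimal sketch on a simplified copy of the two rows shown above (the markup is trimmed for illustration). It relies on `get_text(strip=True)` to remove the newlines, and shows why the status ends up at index 2 of the `td` list:

```python
from bs4 import BeautifulSoup

# Simplified copy of the header row and the Glasgow row
html = """
<table>
<tr><th>Rank</th><th>Locality</th><th>Population</th><th>Status</th><th>Council area
</th></tr>
<tr><th>1
</th>
<td><a href="/wiki/Glasgow" title="Glasgow">Glasgow</a>
</td>
<td>632,350
</td>
<td>City
</td>
<td><a href="/wiki/Glasgow_City_Council">Glasgow City</a>
</td></tr>
</table>
"""
rows = BeautifulSoup(html, "html.parser").find_all("tr")

# strip=True removes the surrounding whitespace and newlines from each cell
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
values = [td.get_text(strip=True) for td in rows[1].find_all("td")]

# The rank is a <th>, so the <td> list starts at "Locality":
# values[i] belongs to headers[i + 1], and "Status" is values[2]
record = dict(zip(headers[1:], values))
print(headers)
print(values)
print(record)
```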

The following code scrapes this table. It is largely based on the code seen in class, but it has been adapted to keep only the rows whose status is `City`, drop the trailing newline (`\n`) in each entry (which is noise for us) and remove the commas from the numbers:

In [None]:
# create two lists to hold the names and populations
cities = []
population = []
# Go over all rows to find the cities,
# i.e. rows with the word "City" in the third <td> cell (the Status column)
# "if cells" skips the header row, which has no <td> cells
for row in rows:
    cells = row.find_all("td")
    if cells and "City" in cells[2].get_text():
        # if city, get name and population
        cities.append(cells[0].get_text().strip())
        # strip the trailing newline and remove the thousands separators
        population.append(int(cells[1].get_text().strip().replace(",","")))
print(cities)
print(population)

['Glasgow', 'Edinburgh', 'Aberdeen', 'Dundee', 'Dunfermline', 'Inverness', 'Perth', 'Stirling']
[632350, 506520, 198590, 148210, 54990, 47790, 47350, 37910]
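
The number clean-up can also be isolated in a small helper. The following is only a sketch using the standard `re` module; the name `parse_population` is made up for illustration, and the second test string (with a year and a footnote marker) is an invented example of the extra noise a cell might contain:

```python
import re

def parse_population(text):
    """Return the first number found in a table cell as an int."""
    # first run of digits and commas; whitespace, years in brackets, etc. are ignored
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))

glasgow = parse_population("632,350\n")
noisy = parse_population("198,590 (2020)[3]")  # invented noisy cell
print(glasgow, noisy)
```

Raising an error when no number is found makes an unexpected cell visible instead of silently storing a wrong value.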


The data can then be converted into a `pandas` data frame, which is more convenient for storing and managing the data:

In [None]:
import pandas as pd
scotland = pd.DataFrame({'City':cities,'Population':population})
scotland

|   | City | Population |
|---|---|---|
| 0 | Glasgow | 632350 |
| 1 | Edinburgh | 506520 |
| 2 | Aberdeen | 198590 |
| 3 | Dundee | 148210 |
| 4 | Dunfermline | 54990 |
| 5 | Inverness | 47790 |
| 6 | Perth | 47350 |
| 7 | Stirling | 37910 |
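
As a brief illustration of why the data frame is convenient, the sketch below rebuilds it from the figures printed earlier and then looks up a city, sorts by population and does a CSV round trip. The variable names are chosen for this example only:

```python
import io
import pandas as pd

# figures copied from the scraped output above
cities = ['Glasgow', 'Edinburgh', 'Aberdeen', 'Dundee',
          'Dunfermline', 'Inverness', 'Perth', 'Stirling']
population = [632350, 506520, 198590, 148210, 54990, 47790, 47350, 37910]
scotland = pd.DataFrame({'City': cities, 'Population': population})

# look up a single city by name
perth_pop = scotland.set_index('City').loc['Perth', 'Population']

# sort from smallest to largest
smallest_first = scotland.sort_values('Population').reset_index(drop=True)

# write to CSV without the row index, then read it back
buffer = io.StringIO()
scotland.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(perth_pop)
print(smallest_first.head(3))
print(reloaded.columns.tolist())
```

Passing `index=False` to `to_csv` matters here: without it the row index is saved as an unnamed column, which `read_csv` then loads back as `Unnamed: 0`.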


Another (more complicated) option would be to scrape the data from each city's own Wikipedia entry:
     
    https://en.wikipedia.org/wiki/Dundee
    https://en.wikipedia.org/wiki/Edinburgh
    https://en.wikipedia.org/wiki/Glasgow
    https://en.wikipedia.org/wiki/Aberdeen
    https://en.wikipedia.org/wiki/Perth,_Scotland
    https://en.wikipedia.org/wiki/Inverness
    https://en.wikipedia.org/wiki/Stirling
    https://en.wikipedia.org/wiki/Dunfermline

You may want to do this to check whether the population figures on the individual pages match the ones on the "List of towns and cities" page!

In [None]:
### To scrape a different page, uncomment its url

url = "https://en.wikipedia.org/wiki/Aberdeen"
#url = "https://en.wikipedia.org/wiki/Dundee"
#url = "https://en.wikipedia.org/wiki/Dunfermline"
#url = "https://en.wikipedia.org/wiki/Edinburgh"
#url = "https://en.wikipedia.org/wiki/Glasgow"
#url = "https://en.wikipedia.org/wiki/Inverness"
#url = "https://en.wikipedia.org/wiki/Perth,_Scotland"
#url = "https://en.wikipedia.org/wiki/Stirling"


import requests
from bs4 import BeautifulSoup
pop = 0  # stays 0 if no population row is found
r = requests.get(url)
soup = BeautifulSoup(r.content,"html.parser")

# Stirling and Dunfermline use a different infobox class from the other cities
if "Stirling" in url or "Dunfermline" in url:
    rows = soup.find("table", {"class":"infobox ib-uk-place vcard"}).find_all("tr")
else:
    rows = soup.find("table", {"class":"infobox ib-settlement vcard"}).find_all("tr")

for i, row in enumerate(rows):
    cells = row.find_all("th")
    if cells and "population" in cells[0].get_text().lower():
        if row.find_all("td"):
            data_cells = row.find_all("td")
        else:
            data_cells = rows[i+1].find_all("td") # The figure is in the row AFTER the population header!
        # drop anything in parentheses or square brackets, then the commas
        pop = int(data_cells[0].get_text().split("(")[0].split("[")[0].strip().replace(",",""))
print("The population of "+url.split('/')[-1]+" is "+str(pop))

The population of Aberdeen is 198590


As you can see, this solution is far from ideal, since different pages use different markup, table class names, etc. For instance, the Wikipedia entries for Stirling and Dunfermline use the class `infobox ib-uk-place vcard`, while the rest use `infobox ib-settlement vcard`!

**BONUS:** Can you optimise my code to handle this exception more effectively? Also, can you "ask" the user to enter only the city name and have the code build and query the URL by itself?
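
One possible starting point for the bonus, offered as a sketch rather than a full solution. It assumes that every city page keeps its figures in a table whose class list includes `infobox`, and that only Perth needs a different page title. The names `TITLE_OVERRIDES`, `city_url` and `find_population` are made up for this example, and the two HTML snippets are heavily simplified stand-ins for the real infoboxes. Unlike the loop above, which keeps the last matching row, this version returns the first population figure it finds.

```python
from urllib.parse import quote
from bs4 import BeautifulSoup

# Hypothetical lookup for page titles that differ from the city name
TITLE_OVERRIDES = {"Perth": "Perth,_Scotland"}

def city_url(city):
    """Build the Wikipedia URL for a city name typed by the user."""
    name = city.strip().title()
    title = TITLE_OVERRIDES.get(name, name.replace(" ", "_"))
    return "https://en.wikipedia.org/wiki/" + quote(title, safe=",_")

def find_population(soup):
    """Return the first population figure in an infobox table, or None."""
    # class_="infobox" matches any table with "infobox" among its classes,
    # so both "infobox ib-uk-place vcard" and "infobox ib-settlement vcard" qualify
    table = soup.find("table", class_="infobox")
    if table is None:
        return None
    rows = table.find_all("tr")
    for i, row in enumerate(rows):
        header = row.find("th")
        if header and "population" in header.get_text().lower():
            # the figure is either in this row or in the row that follows
            cells = row.find_all("td")
            if not cells and i + 1 < len(rows):
                cells = rows[i + 1].find_all("td")
            if cells:
                text = cells[0].get_text()
                digits = text.split("(")[0].split("[")[0].strip().replace(",", "")
                if digits.isdigit():
                    return int(digits)
    return None

# Simplified, made-up infoboxes covering the two class names
uk_place = ('<table class="infobox ib-uk-place vcard">'
            '<tr><th>Population</th><td>37,910 (2020)[1]</td></tr></table>')
settlement = ('<table class="infobox ib-settlement vcard">'
              '<tr><th colspan="2">Population (2020)</th></tr>'
              '<tr><th>Total</th><td>198,590</td></tr></table>')

print(city_url(" perth "))
print(find_population(BeautifulSoup(uk_place, "html.parser")))
print(find_population(BeautifulSoup(settlement, "html.parser")))
```

To query the live site, the URL from `city_url(input("City: "))` could be passed to `requests.get` and the response content parsed with `BeautifulSoup` before calling `find_population`. Whether the assumptions hold for every page is something to verify against the real markup.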