# Lab | Web Scraping Multiple Pages

- Business goal:
Check the case_study_gnod.md file.

Make sure you've understood the big picture of your project:

- the goal of the company (Gnod),
- their current product (Gnoosic),
- their strategy, and
- how your project fits into this context.


Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

## Instructions Part 1

Prioritize the MVP (Minimum Viable Product)


In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

### Expand the project
If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
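
Once songs come from several scraped lists, the same track tends to show up more than once with small differences in case or spacing. Below is a minimal sketch of merging such lists, assuming each scraper returns `(title, artist)` tuples. The helper names `normalize` and `merge_song_lists` and the sample songs are hypothetical placeholders, not data from any real chart.

```python
def normalize(text):
    """Lower-case and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.split()).lower()


def merge_song_lists(*song_lists):
    """Merge lists of (title, artist) tuples, keeping the first spelling of each song."""
    seen = set()
    merged = []
    for songs in song_lists:
        for title, artist in songs:
            key = (normalize(title), normalize(artist))
            if key not in seen:
                seen.add(key)
                merged.append((title.strip(), artist.strip()))
    return merged


# Hypothetical placeholder data standing in for two scraped lists
list_a = [("Song One", "Artist A"), ("Song Two", "Artist B")]
list_b = [("song one ", "ARTIST A"), ("Song Three", "Artist C")]

all_songs = merge_song_lists(list_a, list_b)
print(all_songs)
```

The merged list can be passed to `pd.DataFrame(all_songs, columns=["title", "artist"])` if a dataframe is more convenient.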

## Instructions Part 2
Practice web scraping. This part is not related to this week's Gnod project.

As you've seen, web scraping is a skill that can get you all sorts of information. Here are some small challenges to help you gain more experience. Open a new Jupyter notebook and scrape at least 3 of these sites.

In [1]:
# Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
import random

#### 1. Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'

In [2]:
url = 'https://en.wikipedia.org/wiki/Python'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)


200

In [3]:
#links --> #mw-content-text > div.mw-parser-output > ul:nth-child(7) > li:nth-child(1) > a

In [5]:
pytab = soup.select("#mw-content-text > div.mw-parser-output a")
pytab

[<a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="/w/index.php?title=Python&amp;action=edit&amp;section=1" title="Edit section: Snakes">edit</a>,
 <a href="/wiki/Pythonidae" title="Pythonidae">Pythonidae</a>,
 <a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>,
 <a href="/wiki/Python_(mythology)" title="Python (mythology)">Python (mythology)</a>,
 <a href="/w/index.php?title=Python&amp;action=edit&amp;section=2" title="Edit section: Computing">edit</a>,
 <a href="/wiki/Python_(programming_language)" title="Python (programming language)">Python (programming language)</a>,
 <a href="/wiki/CMU_Common_Lisp" title="CMU Common Lisp">CMU Common Lisp</a>,
 <a href="/wiki/PERQ#PERQ_3" title="PERQ">PERQ 3</a>,
 <a href="/w/index.php?title=Python&amp;action=edit&amp;section=3" title="Edit section: People">edi

In [17]:
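link_list = []
for py in pytab:
    link = py.get("href")
    # Keep internal article links; skip Wiktionary and disambiguation-template links
    if (link is not None
            and "/wiki/" in link
            and "/en.wiktionary.org/" not in link
            and ":Disambig" not in link):
        link_list.append(link)

# Note: len(link) is the character count of the last href seen, not a link count;
# len(pytab) would give the total number of anchors found.
print(len(link), len(link_list))
print(link_list)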

83 26
['/wiki/Pythonidae', '/wiki/Python_(genus)', '/wiki/Python_(mythology)', '/wiki/Python_(programming_language)', '/wiki/CMU_Common_Lisp', '/wiki/PERQ#PERQ_3', '/wiki/Python_of_Aenus', '/wiki/Python_(painter)', '/wiki/Python_of_Byzantium', '/wiki/Python_of_Catana', '/wiki/Python_Anghelo', '/wiki/Python_(Efteling)', '/wiki/Python_(Busch_Gardens_Tampa_Bay)', '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)', '/wiki/Python_(automobile_maker)', '/wiki/Python_(Ford_prototype)', '/wiki/Python_(missile)', '/wiki/Python_(nuclear_primary)', '/wiki/Colt_Python', '/wiki/Python_(codename)', '/wiki/Python_(film)', '/wiki/Monty_Python', '/wiki/Python_(Monty)_Pictures', '/wiki/Timon_of_Phlius', '/wiki/Pyton', '/wiki/Pithon']


In [14]:
links = pd.DataFrame({"links":link_list})
links

|   | links |
|---|-------|
| 0 | /wiki/Pythonidae |
| 1 | /wiki/Python_(genus) |
| 2 | /wiki/Python_(mythology) |
| 3 | /wiki/Python_(programming_language) |
| 4 | /wiki/CMU_Common_Lisp |
| 5 | /wiki/PERQ#PERQ_3 |
| 6 | /wiki/Python_of_Aenus |
| 7 | /wiki/Python_(painter) |
| 8 | /wiki/Python_of_Byzantium |
| 9 | /wiki/Python_of_Catana |
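
The collected links are relative paths such as `/wiki/Pythonidae`, so they cannot be requested directly. The sketch below repeats the filter used above and resolves each link against the page URL with the standard library's `urljoin`. The helper names `is_article_link` and `absolute_links` are hypothetical, and the sample values are a handful of hrefs from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"


def is_article_link(href):
    """Return True for internal article links, mirroring the filter used above."""
    return (
        href is not None
        and "/wiki/" in href
        and "/en.wiktionary.org/" not in href
        and ":Disambig" not in href
    )


def absolute_links(hrefs, base_url=BASE_URL):
    """Keep article links and resolve them against the page they came from."""
    return [urljoin(base_url, href) for href in hrefs if is_article_link(href)]


# A few href values from the scraped output, plus None for an anchor without href
sample = [
    "https://en.wiktionary.org/wiki/Python",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",
    "/wiki/PERQ#PERQ_3",
    None,
]
print(absolute_links(sample))
```

In the notebook, the same function could be applied to every anchor with `absolute_links(a.get("href") for a in pytab)`.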


#### 2. Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'


In [18]:
url = 'https://www.fbi.gov/wanted/topten'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [19]:
#query-results-0f737222c5054a81a120bce207b0446a > ul > li:nth-child(1) > h3 > a
soup.select("h3")[0:10]


[<h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/michael-james-pratt">MICHAEL JAMES PRATT</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fb

In [20]:
# Collect the text of every <a> nested inside an <h3>
top_top = soup.select("h3 a")
top = [tag.get_text() for tag in top_top]

print(top[0:10])

['OMAR ALEXANDER CARDENAS', 'ALEXIS FLORES', 'BHADRESHKUMAR CHETANBHAI PATEL', 'ALEJANDRO ROSALES CASTILLO', 'YULAN ADONAY ARCHAGA CARIAS', 'RUJA IGNATOVA', 'ARNOLDO JIMENEZ', 'MICHAEL JAMES PRATT', 'JOSE RODOLFO VILLARREAL-HERNANDEZ', 'RAFAEL CARO-QUINTERO']


In [21]:
display(len(top))

10

In [22]:
top_wanted = pd.DataFrame({"TOP 10":top})
top_wanted

|   | TOP 10 |
|---|--------|
| 0 | OMAR ALEXANDER CARDENAS |
| 1 | ALEXIS FLORES |
| 2 | BHADRESHKUMAR CHETANBHAI PATEL |
| 3 | ALEJANDRO ROSALES CASTILLO |
| 4 | YULAN ADONAY ARCHAGA CARIAS |
| 5 | RUJA IGNATOVA |
| 6 | ARNOLDO JIMENEZ |
| 7 | MICHAEL JAMES PRATT |
| 8 | JOSE RODOLFO VILLARREAL-HERNANDEZ |
| 9 | RAFAEL CARO-QUINTERO |


#### 3. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

In [23]:
url = 'https://www.data.gov.uk/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [24]:
soup.select("h3 > a")

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>,
 <a class="govuk-link" href="/search?f

In [25]:
# Collect the text of every <a> that is a direct child of an <h3>
data = soup.select("h3 > a")
datasets = [tag.get_text() for tag in data]

print(datasets[0:20])

['Business and economy', 'Crime and justice', 'Defence', 'Education', 'Environment', 'Government', 'Government spending', 'Health', 'Mapping', 'Society', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']


In [26]:
dataset_uk = pd.DataFrame({"datasets":datasets})
dataset_uk

|   | datasets |
|---|----------|
| 0 | Business and economy |
| 1 | Crime and justice |
| 2 | Defence |
| 3 | Education |
| 4 | Environment |
| 5 | Government |
| 6 | Government spending |
| 7 | Health |
| 8 | Mapping |
| 9 | Society |
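
Each topic anchor also carries an `href`, so the topic name and its search URL can be collected together. The sketch below runs on an inline HTML snippet instead of the live site. The two anchors are copied from the output above, while the surrounding `<h3>` tags are an assumption based on the `h3 > a` selector.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.data.gov.uk/"

# Two anchors copied from the output above; the wrapping <h3> tags are assumed
html = """
<h3><a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a></h3>
<h3><a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# One record per topic: visible text plus the absolute search URL
topics = [
    {"topic": a.get_text(strip=True), "url": urljoin(BASE_URL, a["href"])}
    for a in soup.select("h3 > a")
]
print(topics)
```

Passing `topics` to `pd.DataFrame(topics)` would give a two-column dataframe with the topic names and their links.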


#### 4. Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [27]:
url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [None]:
# Get date, time, latitude, longitude and region name 

#reg0
# region --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(2) > td.point2
# date & time --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(3) > td.point2
# location --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(5) > td.point2


In [28]:
soup.select("tbody tr td")

[<td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=1242938">2023-03-28   06:08:05.6</a></b><i class="ago" id="ago0">03min ago</i></td>,
 <td class="tabev1">38.23 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">38.81 </td>,
 <td class="tabev2">E  </td>,
 <td class="tabev3">5</td>,
 <td class="tabev5" id="magtyp0">M </td>,
 <td class="tabev2">2.9</td>,
 <td class="tb_region" id="reg0"> EASTERN TURKEY</td>,
 <td class="comment updatetimeno" id="upd0" style="text-align:right;">2023-03-28 06:11</td>,
 <td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=1242935">2023-03-28   05:53:54.3</a></b><i class="ago" id="ago1">17min ago</i></td>,
 <td class="tabev1">37.34 </td>,
 <td class="tabev2">S  </td>,
 <td class="tabev1">177.29 </

In [29]:
# Collect the text of every table cell (all columns, not only the region)
region = soup.select("tbody tr td")
region_list = [cell.get_text() for cell in region]

print(region_list[0:20])


['', '', '', 'earthquake2023-03-28\xa0\xa0\xa006:08:05.603min ago', '38.23\xa0', 'N\xa0\xa0', '38.81\xa0', 'E\xa0\xa0', '5', 'M ', '2.9', '\xa0EASTERN TURKEY', '2023-03-28 06:11', '', '', '', 'earthquake2023-03-28\xa0\xa0\xa005:53:54.317min ago', '37.34\xa0', 'S\xa0\xa0', '177.29\xa0']
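
The flat list above mixes all columns and still contains non-breaking spaces (`\xa0`). Here is a minimal sketch of turning it into one record per earthquake. It assumes 13 cells per table row and the column positions visible in the output above; the live page may differ, so both are assumptions to verify. The names `CELLS_PER_ROW`, `STAMP` and `parse_rows` are hypothetical.

```python
import re

CELLS_PER_ROW = 13  # assumption: inferred from the cell pattern in the output above
# Date, whitespace (including \xa0), then a time with one decimal digit of seconds
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d)")


def parse_rows(cells, cells_per_row=CELLS_PER_ROW):
    """Group a flat list of cell texts into earthquake records; drop incomplete rows."""
    rows = []
    for start in range(0, len(cells) - cells_per_row + 1, cells_per_row):
        row = [c.replace("\xa0", " ").strip() for c in cells[start:start + cells_per_row]]
        match = STAMP.search(row[3])
        if match is None:
            continue  # skip rows that do not carry a timestamp
        # Southern latitudes and western longitudes are stored as negative numbers
        latitude = float(row[4]) * (-1 if row[5] == "S" else 1)
        longitude = float(row[6]) * (-1 if row[7] == "W" else 1)
        rows.append({
            "date": match.group(1),
            "time": match.group(2),
            "latitude": latitude,
            "longitude": longitude,
            "region": row[11],
        })
    return rows


# The 20 cell texts printed above (the second row is incomplete and gets dropped)
sample_cells = ['', '', '', 'earthquake2023-03-28\xa0\xa0\xa006:08:05.603min ago', '38.23\xa0', 'N\xa0\xa0', '38.81\xa0', 'E\xa0\xa0', '5', 'M ', '2.9', '\xa0EASTERN TURKEY', '2023-03-28 06:11', '', '', '', 'earthquake2023-03-28\xa0\xa0\xa005:53:54.317min ago', '37.34\xa0', 'S\xa0\xa0', '177.29\xa0']

records = parse_rows(sample_cells)
print(records)
```

With the full `region_list` from the notebook, `pd.DataFrame(parse_rows(region_list)).head(20)` would give the requested dataframe, provided the assumed row layout holds.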


#### Second attempt: following the link to each earthquake's detail page


In [None]:
url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

In [None]:
# link --> #\31 242674 > td.tabev6 > b > a

In [None]:
prestab=soup.select("b a")
prestab[0:5]

In [None]:
# Collect the links for the 20 most recent earthquakes
earthquake = []

for quake in prestab:
    if len(earthquake) >= 20:
        break
    link = quake.get("href")
    # Detail-page links look like /Earthquake/earthquake.php?id=...
    if link is not None and "/Earthquake" in link and "earthquake" in link:
        earthquake.append(link)
            

In [None]:
display(len(earthquake))

In [None]:
# Build the full URL for the first earthquake and send a test request
url = "https://www.emsc-csem.org" + earthquake[0]
response = requests.get(url)
print(response.status_code)

In [None]:
# Parse & store html
soup = BeautifulSoup(response.content, "html.parser")
#soup.select("table")

In [None]:
# Fetch the detail page for each of the 20 latest earthquakes
quake_soups = []

for quake in earthquake:
    # Build the full URL and send the request
    url = "https://www.emsc-csem.org" + quake
    response = requests.get(url)
    print(quake, response.status_code)

    # Parse the HTML and keep only its <table> elements
    soup = BeautifulSoup(response.content, "html.parser")
    quake_soups.append(soup.select("table"))

    # Respectful nap between requests
    wait_time = random.randint(1, 4)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)
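
The request-then-sleep pattern above can be wrapped in a small reusable function. This is only a sketch: `fetch_all` is a hypothetical helper, and the request function is passed in as `get` (for example `requests.get`) so the loop can be tried out without touching the network.

```python
import random
import time


def fetch_all(urls, get, min_wait=1.0, max_wait=4.0, sleep=time.sleep):
    """Fetch each URL with `get`, pausing a random interval between requests.

    Returns a dict mapping URL -> response body for responses with status 200.
    """
    urls = list(urls)
    pages = {}
    for position, url in enumerate(urls):
        response = get(url)
        if response.status_code == 200:
            pages[url] = response.content
        else:
            print(f"Skipping {url}: status {response.status_code}")
        if position < len(urls) - 1:  # no need to wait after the last request
            sleep(random.uniform(min_wait, max_wait))
    return pages
```

In the notebook this could be called as `fetch_all(["https://www.emsc-csem.org" + path for path in earthquake], requests.get)`, after which each body can be parsed with `BeautifulSoup(body, "html.parser")`.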

In [None]:
quake_soups

In [None]:
# Get date, time, latitude, longitude and region name AND MAGNITUDE...

# magnitude --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(1) > td.point2
# region --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(2) > td.point2
# date & time --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(3) > td.point2
# location --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(5) > td.point2


In [None]:
quake_soups[-1][0].select("td.point2")

#### Remaining challenges

- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'