# Lab | Web Scraping Multiple Pages

- Business goal:
Check the case_study_gnod.md file.

Make sure you've understood the big picture of your project:

- the goal of the company (Gnod),
- their current product (Gnoosic),
- their strategy, and
- how your project fits into this context.


Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

## Instructions Part 1

Prioritize the MVP (Minimum Viable Product)


In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

### Expand the project
If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
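
Once several lists have been scraped, they can be pooled into one collection without duplicates. The sketch below is only an illustration: `merge_song_lists` is a hypothetical helper, and the titles and artists are made-up placeholders rather than scraped data.

```python
def merge_song_lists(*song_lists):
    """Merge lists of (title, artist) pairs, keeping the first copy of each song."""
    seen = set()
    merged = []
    for songs in song_lists:
        for title, artist in songs:
            title, artist = title.strip(), artist.strip()
            # Compare case-insensitively so "Song A" and "song a" count as one song
            key = (title.casefold(), artist.casefold())
            if key not in seen:
                seen.add(key)
                merged.append((title, artist))
    return merged


# Placeholder data standing in for two scraped charts (hypothetical values)
chart_one = [("Song A", "Artist X"), ("Song B", "Artist Y")]
chart_two = [("song a ", "artist x"), ("Song C", "Artist Z")]

pool = merge_song_lists(chart_one, chart_two)
print(pool)
```

The merged pairs could then be passed to `pd.DataFrame(pool, columns=["title", "artist"])` to continue working in pandas.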

## Instructions Part 2
Practice web scraping. This part is not related to the Gnod project of the week.
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

In [2]:
# Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
import random

#### 1. Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'

In [11]:
url = 'https://en.wikipedia.org/wiki/Python'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)


200

In [24]:
#links --> #mw-content-text > div.mw-parser-output > ul:nth-child(7) > li:nth-child(1) > a

In [35]:
pytab =soup.select("#mw-content-text > div.mw-parser-output a")
pytab[0:3]

[<a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="/w/index.php?title=Python&amp;action=edit&amp;section=1" title="Edit section: Snakes">edit</a>]

In [37]:
link_list = []
for py in pytab:
    link = py.get("href")
    print(link)
    # Keep only anchors whose href points to a wiki page
    if link is not None and "/wiki" in link:
        link_list.append(link)

https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/wiki/Python_(mythology)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/w/index.php?title=Python&action=edit&section=5
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(missile)
/wiki/Python_(nuclear_primary)
/wiki/Colt_Python
/w/index.php?title=Python&action=edit&section=7
/wiki/Python_(codename)
/wiki/Python_(film)
/wiki/Monty_Python
/wiki/Python_(Monty)_Picture

In [39]:
links = pd.DataFrame({"links":link_list})
links

| | links |
|---|---|
| 0 | https://en.wiktionary.org/wiki/Python |
| 1 | https://en.wiktionary.org/wiki/python |
| 2 | /wiki/Pythonidae |
| 3 | /wiki/Python_(genus) |
| 4 | /wiki/Python_(mythology) |
| 5 | /wiki/Python_(programming_language) |
| 6 | /wiki/CMU_Common_Lisp |
| 7 | /wiki/PERQ#PERQ_3 |
| 8 | /wiki/Python_of_Aenus |
| 9 | /wiki/Python_(painter) |
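
Most of the collected links are relative (`/wiki/...`), so they cannot be requested as they are. A minimal sketch of turning them into full URLs with the standard library's `urljoin` follows; `absolute_article_links` is a hypothetical helper name, and the sample hrefs are copied from the printed output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"

# A few hrefs taken from the printed output, plus None for an anchor without href
hrefs = [
    "https://en.wiktionary.org/wiki/Python",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",
    "/wiki/PERQ#PERQ_3",
    "/wiki/Pythonidae",
    None,
]


def absolute_article_links(hrefs, base_url=BASE_URL):
    """Return full URLs for wiki links, skipping edit links and duplicates."""
    links = []
    for href in hrefs:
        if href is None or "/wiki/" not in href:
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        full_url = urljoin(base_url, href)
        if full_url not in links:
            links.append(full_url)
    return links


article_links = absolute_article_links(hrefs)
for link in article_links:
    print(link)
```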


#### 2. Create a Python list with the names of the FBI's top ten Most Wanted: url = 'https://www.fbi.gov/wanted/topten'


In [65]:
url = 'https://www.fbi.gov/wanted/topten'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [78]:
#query-results-0f737222c5054a81a120bce207b0446a > ul > li:nth-child(1) > h3 > a
soup.select("h3")[0:10]


[<h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/michael-james-pratt">MICHAEL JAMES PRATT</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fb

In [81]:
# Each name is the text of the link inside an <h3> heading
top = [a.get_text() for a in soup.select("h3 a")]

print(top[0:10])

['OMAR ALEXANDER CARDENAS', 'ALEXIS FLORES', 'BHADRESHKUMAR CHETANBHAI PATEL', 'ALEJANDRO ROSALES CASTILLO', 'YULAN ADONAY ARCHAGA CARIAS', 'RUJA IGNATOVA', 'ARNOLDO JIMENEZ', 'MICHAEL JAMES PRATT', 'JOSE RODOLFO VILLARREAL-HERNANDEZ', 'RAFAEL CARO-QUINTERO']


In [82]:
display(len(top))

10

In [83]:
top_wanted = pd.DataFrame({"TOP 10":top})
top_wanted

| | TOP 10 |
|---|---|
| 0 | OMAR ALEXANDER CARDENAS |
| 1 | ALEXIS FLORES |
| 2 | BHADRESHKUMAR CHETANBHAI PATEL |
| 3 | ALEJANDRO ROSALES CASTILLO |
| 4 | YULAN ADONAY ARCHAGA CARIAS |
| 5 | RUJA IGNATOVA |
| 6 | ARNOLDO JIMENEZ |
| 7 | MICHAEL JAMES PRATT |
| 8 | JOSE RODOLFO VILLARREAL-HERNANDEZ |
| 9 | RAFAEL CARO-QUINTERO |


#### 3. Create a list with the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

In [135]:
url = 'https://www.data.gov.uk/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [136]:
soup.select(" h3 > a")

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>,
 <a class="govuk-link" href="/search?f

In [138]:
# Each topic name is the text of a link directly inside an <h3> heading
datasets = [a.get_text() for a in soup.select("h3 > a")]

print(datasets[0:20])

['Business and economy', 'Crime and justice', 'Defence', 'Education', 'Environment', 'Government', 'Government spending', 'Health', 'Mapping', 'Society', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']


In [139]:
dataset_uk = pd.DataFrame({"datasets":datasets})
dataset_uk

| | datasets |
|---|---|
| 0 | Business and economy |
| 1 | Crime and justice |
| 2 | Defence |
| 3 | Education |
| 4 | Environment |
| 5 | Government |
| 6 | Government spending |
| 7 | Health |
| 8 | Mapping |
| 9 | Society |
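
Each topic link in the output above points to a search page such as `/search?filters%5Btopic%5D=Business+and+economy`. As a sketch, the same URLs can be rebuilt from the topic names with the standard library's `urlencode`, which is handy when scraping one page per topic. `topic_search_url` is a hypothetical helper, and the base address assumes the relative links resolve against `https://www.data.gov.uk/`.

```python
from urllib.parse import urlencode

# Assumption: the relative "/search" links resolve against this host
BASE_SEARCH_URL = "https://www.data.gov.uk/search"


def topic_search_url(topic):
    """Build the search URL for one topic name, e.g. 'Crime and justice'."""
    # urlencode percent-encodes the brackets and turns spaces into '+'
    return BASE_SEARCH_URL + "?" + urlencode({"filters[topic]": topic})


topics = ["Business and economy", "Crime and justice", "Defence"]
topic_urls = [topic_search_url(topic) for topic in topics]
for url in topic_urls:
    print(url)
```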


#### 4. Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [142]:
url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [None]:
# Get date, time, latitude, longitude and region name 

#reg0
# region --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(2) > td.point2
# date & time --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(3) > td.point2
# location --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(5) > td.point2


In [162]:
soup.select("tbody tr td")

[<td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev0"></td>,
 <td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=1242721">2023-03-27   17:13:33.0</a></b><i class="ago" id="ago0">07min ago</i></td>,
 <td class="tabev1">12.50 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">88.20 </td>,
 <td class="tabev2">W  </td>,
 <td class="tabev3">27</td>,
 <td class="tabev5" id="magtyp0"> M</td>,
 <td class="tabev2">3.1</td>,
 <td class="tb_region" id="reg0"> OFFSHORE EL SALVADOR</td>,
 <td class="comment updatetimeno" id="upd0" style="text-align:right;">2023-03-27 17:20</td>,
 <td class="tabev0" style="text-align:center;"><a href="https://www.emsc-csem.org/Earthquake/Testimonies/comments.php?id=1242720" onmouseout="info_b2('notshow','');" onmouseover="info_b2('show','See the &lt;b&gt;4 testimonies&lt;/b&gt; for this earthquake');"><span class="" style="vertical-align:middle;">4</span></a></td>,
 <td class="tabev0"></td>,


In [170]:
# Collect the text of every table cell (all columns, not only the region)
cells = soup.select("tbody tr td")
region_list = [cell.get_text() for cell in cells]

print(region_list[0:20])


['', '', '', 'earthquake2023-03-27\xa0\xa0\xa017:13:33.007min ago', '12.50\xa0', 'N\xa0\xa0', '88.20\xa0', 'W\xa0\xa0', '27', ' M', '3.1', '\xa0OFFSHORE EL SALVADOR', '2023-03-27 17:20', '4', '', 'III', 'earthquake2023-03-27\xa0\xa0\xa017:13:16.808min ago', '37.11\xa0', 'N\xa0\xa0', '36.92\xa0']
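
The flat list of cell texts still has to be split into one record per earthquake. The sketch below shows one possible way, assuming the layout visible in the output above: each row's date cell starts with the word `earthquake`, the seconds carry one decimal, and the next cells hold latitude, N/S, longitude, E/W, depth, magnitude type, magnitude and region. `parse_rows` is a hypothetical helper, and the sample list is copied from the printed output.

```python
import re

# Cell texts copied from the printed output above (first row only)
cells = ['', '', '', 'earthquake2023-03-27\xa0\xa0\xa017:13:33.007min ago',
         '12.50\xa0', 'N\xa0\xa0', '88.20\xa0', 'W\xa0\xa0', '27', ' M', '3.1',
         '\xa0OFFSHORE EL SALVADOR', '2023-03-27 17:20', '4', '', 'III']

# Date cell: "earthquake", the date, non-breaking spaces, the time, then "NNmin ago"
DATE_TIME = re.compile(r"earthquake(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d)")


def parse_rows(cells, limit=20):
    """Turn the flat list of cell texts into one dict per earthquake."""
    rows = []
    for i, text in enumerate(cells):
        match = DATE_TIME.match(text)
        # Skip cells that are not a date cell or that lack the following columns
        if match is None or i + 8 >= len(cells):
            continue
        lat, ns, lon, ew = (cell.strip() for cell in cells[i + 1:i + 5])
        rows.append({
            "date": match.group(1),
            "time": match.group(2),
            # South and West become negative coordinates
            "latitude": float(lat) * (1 if ns == "N" else -1),
            "longitude": float(lon) * (1 if ew == "E" else -1),
            "region": cells[i + 8].strip(),
        })
        if len(rows) == limit:
            break
    return rows


rows = parse_rows(cells)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give the dataframe the exercise asks for.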


#### Second attempt: follow the link to each earthquake's detail page (url = 'https://www.emsc-csem.org/Earthquake/')

In [84]:
url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

display(response.status_code)
#soup

200

In [85]:
# link --> #\31 242674 > td.tabev6 > b > a

In [86]:
prestab=soup.select("b a")
prestab[0:5]

[<a href="/Earthquake/earthquake.php?id=1242703">2023-03-27   16:27:25.0</a>,
 <a href="/Earthquake/earthquake.php?id=1242709">2023-03-27   16:26:06.4</a>,
 <a href="/Earthquake/earthquake.php?id=1242700">2023-03-27   16:23:56.8</a>,
 <a href="/Earthquake/earthquake.php?id=1242702">2023-03-27   16:21:20.2</a>,
 <a href="/Earthquake/earthquake.php?id=1242696">2023-03-27   16:05:12.1</a>]

In [87]:
# Collect the links for the 20 most recent earthquakes
earthquake = []

for quake in prestab:
    link = quake.get("href")
    print(link)
    if len(earthquake) >= 20:
        break
    # Keep only links to earthquake detail pages
    if link is not None and "/Earthquake" in link and "earthquake" in link:
        earthquake.append(link)
            

/Earthquake/earthquake.php?id=1242703
/Earthquake/earthquake.php?id=1242709
/Earthquake/earthquake.php?id=1242700
/Earthquake/earthquake.php?id=1242702
/Earthquake/earthquake.php?id=1242696
/Earthquake/earthquake.php?id=1242697
/Earthquake/earthquake.php?id=1242695
/Earthquake/earthquake.php?id=1242691
/Earthquake/earthquake.php?id=1242693
/Earthquake/earthquake.php?id=1242692
/Earthquake/earthquake.php?id=1242694
/Earthquake/earthquake.php?id=1242689
/Earthquake/earthquake.php?id=1242688
/Earthquake/earthquake.php?id=1242687
/Earthquake/earthquake.php?id=1242684
/Earthquake/earthquake.php?id=1242681
/Earthquake/earthquake.php?id=1242680
/Earthquake/earthquake.php?id=1242682
/Earthquake/earthquake.php?id=1242685
/Earthquake/earthquake.php?id=1242678
/Earthquake/earthquake.php?id=1242677


In [88]:
display(len(earthquake))

20

In [89]:
# Build the full URL for the first earthquake and send a test request
url = "https://www.emsc-csem.org" + earthquake[0]
response = requests.get(url)
print(response.status_code)

200


In [90]:
# Parse & store html
soup = BeautifulSoup(response.content, "html.parser")
#soup.select("table")

In [91]:
# Request the detail page of each of the 20 latest earthquakes
# and keep the tables found on each page
quake_soups = []

for quake in earthquake:
    # send request
    url =  "https://www.emsc-csem.org" + quake
    response = requests.get(url)
    print(quake, response.status_code)

    # parse & store html
    soup = BeautifulSoup(response.content, "html.parser")
    quake_soups.append(soup.select("table"))

    # respectful nap:
    wait_time = random.randint(1,4)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)

/Earthquake/earthquake.php?id=1242703 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242709 200
I will sleep for 2 second/s.
/Earthquake/earthquake.php?id=1242700 200
I will sleep for 2 second/s.
/Earthquake/earthquake.php?id=1242702 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242696 200
I will sleep for 2 second/s.
/Earthquake/earthquake.php?id=1242697 200
I will sleep for 1 second/s.
/Earthquake/earthquake.php?id=1242695 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242691 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242693 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242692 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242694 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242689 200


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


I will sleep for 2 second/s.
/Earthquake/earthquake.php?id=1242688 200
I will sleep for 4 second/s.
/Earthquake/earthquake.php?id=1242687 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242684 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242681 200


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


I will sleep for 2 second/s.
/Earthquake/earthquake.php?id=1242680 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242682 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242685 200
I will sleep for 3 second/s.
/Earthquake/earthquake.php?id=1242678 200
I will sleep for 2 second/s.
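
The request-and-wait loop above can be wrapped in a small reusable function. This is a sketch under stated assumptions rather than a definitive implementation: `fetch_pages` is a hypothetical helper, and the `fetch` and `sleep` parameters are injected so the example runs with stand-ins and sends no real request.

```python
import random
import time

BASE_URL = "https://www.emsc-csem.org"


def fetch_pages(paths, fetch, min_wait=1, max_wait=4, sleep=time.sleep):
    """Fetch BASE_URL + path for each path, pausing between consecutive requests."""
    pages = {}
    for position, path in enumerate(paths):
        pages[path] = fetch(BASE_URL + path)
        # No need to wait after the last request
        if position < len(paths) - 1:
            sleep(random.randint(min_wait, max_wait))
    return pages


# Demonstration with stand-ins (no network access, no real waiting)
requested = []
waits = []


def fake_fetch(url):
    requested.append(url)
    return "<html>" + url + "</html>"


pages = fetch_pages(
    ["/Earthquake/earthquake.php?id=1242703", "/Earthquake/earthquake.php?id=1242709"],
    fetch=fake_fetch,
    sleep=waits.append,
)
print(requested)
print(waits)
```

In the notebook, the real call could look like `fetch_pages(earthquake, fetch=lambda url: requests.get(url, timeout=10).content)`, where the 10-second timeout is an arbitrary choice.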


In [None]:
quake_soups

In [None]:
# Get date, time, latitude, longitude and region name AND MAGNITUDE...

# magnitude --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(1) > td.point2
# region --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(2) > td.point2
# date & time --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(3) > td.point2
# location --> #ong1 > table:nth-child(1) > tbody > tr > td:nth-child(1) > table > tbody > tr:nth-child(5) > td.point2


In [None]:
quake_soups[-1][0].select("td.point2")

# ?

- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'