# Lab | Web Scraping Multiple Pages

## Scraping multiple pages
Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'
2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
3. Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten'
4. Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
6. A list with the different kind of datasets available in data.gov.uk: url = 'https://data.gov.uk/'
7. Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

## Loading the libraries

In [99]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## 1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url ='https://en.wikipedia.org/wiki/Python'

## Setting the web page to scrape

In [2]:
url ='https://en.wikipedia.org/wiki/Python'

## Downloading the web's html content with requests

In [3]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200

## Pasring the hmtl

In [4]:
soup = BeautifulSoup(response.content, "html.parser")

## Checking our soup

In [5]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"59d26441-5007-412e-9e02-abf3c608f721","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":987482924,"wgRevisionId":987482924,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common name disambiguatio

## Requesting all the pages

## Storing the web pages to scrape

In [6]:
##mw-content-text > div > ul > li > a

links = soup.select("div > ul > li > a")

# Checking the output
for i in range(len(links)):
    print(links[i])    

<a class="mw-redirect" href="/wiki/Pythons" title="Pythons">Pythons</a>
<a href="#Computing"><span class="tocnumber">1</span> <span class="toctext">Computing</span></a>
<a href="#People"><span class="tocnumber">2</span> <span class="toctext">People</span></a>
<a href="#Roller_coasters"><span class="tocnumber">3</span> <span class="toctext">Roller coasters</span></a>
<a href="#Vehicles"><span class="tocnumber">4</span> <span class="toctext">Vehicles</span></a>
<a href="#Weaponry"><span class="tocnumber">5</span> <span class="toctext">Weaponry</span></a>
<a href="#Other_uses"><span class="tocnumber">6</span> <span class="toctext">Other uses</span></a>
<a href="#See_also"><span class="tocnumber">7</span> <span class="toctext">See also</span></a>
<a href="/wiki/Python_(programming_language)" title="Python (programming language)">Python (programming language)</a>
<a href="/wiki/CMU_Common_Lisp" title="CMU Common Lisp">CMU Common Lisp</a>
<a href="/wiki/PERQ#PERQ_3" title="PERQ">PERQ 3</a>
<

In [7]:
soup.find_all('a')

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a class="mw-redirect" href="/wiki/Pythons" title="Pythons">Pythons</a>,
 <a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>,
 <a href="#Computing"><span class="tocnumber">1</span> <span class="toctext">Computing</span></a>,
 <a href="#People"><span class="tocnumber">2</span> <span class="toctext">People</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">3</span> <span class="toctext">Roller coasters</span></a>,
 <a href="#Vehicles"><span class="tocnumber">4</span> <span class="toctext">Vehicles</span></a>,
 <a href="#Weaponry"><span class="tocnumber">5</span> <span class="toctext">Weaponry</span></

As you can see, the link we get is "relative" a common webpage:
"https://en.wikipedia.org". We need to correct this.

In [8]:
urls = []

base = "https://en.wikipedia.org"

for link in links:
    urls.append(base + link['href'])

urls

['https://en.wikipedia.org/wiki/Pythons',
 'https://en.wikipedia.org#Computing',
 'https://en.wikipedia.org#People',
 'https://en.wikipedia.org#Roller_coasters',
 'https://en.wikipedia.org#Vehicles',
 'https://en.wikipedia.org#Weaponry',
 'https://en.wikipedia.org#Other_uses',
 'https://en.wikipedia.org#See_also',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia.org/wiki/Python_(automobile_maker)',
 'https://en.wik

## 2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'

In [9]:
url = 'http://uscode.house.gov/download/download.shtml'
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [10]:
soup = BeautifulSoup(response.content, "html.parser")
soup

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=scripts" type="text/javasc

In [11]:
# us\/usc\/t11
# class: usctitlechanged
# in bold:

soup.select(".usctitlechanged")

[<div class="usctitlechanged" id="us/usc/t7">
 
           Title 7 - Agriculture
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t11">
 
           Title 11 - Bankruptcy <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitlechanged" id="us/usc/t13">
 
           Title 13 - Census <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitlechanged" id="us/usc/t14">
 
           Title 14 - Coast Guard <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitlechanged" id="us/usc/t15">
 
           Title 15 - Commerce and Trade
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t16">
 
           Title 16 - Conservation
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t21">
 
           Title 21 - Food and Drugs
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t24">
 
           Title 24 - Hospitals and Asylums
 
         </div>,
 <div class="uscti

In [16]:
#looking at just text in bold

l = len(soup.select(".usctitlechanged"))
changed = []

for i in range(l):
    #changed.append(soup.select(".usctitlechanged")[i].get_text().replace("\n\n          ","").replace('\n\n        ',"").replace(' ٭\n', ''))
    changed.append(soup.select(".usctitlechanged")[i].get_text())    

In [17]:
changed

['\n\n          Title 7 - Agriculture\n\n        ',
 '\n\n          Title 11 - Bankruptcy ٭\n',
 '\n\n          Title 13 - Census ٭\n',
 '\n\n          Title 14 - Coast Guard ٭\n',
 '\n\n          Title 15 - Commerce and Trade\n\n        ',
 '\n\n          Title 16 - Conservation\n\n        ',
 '\n\n          Title 21 - Food and Drugs\n\n        ',
 '\n\n          Title 24 - Hospitals and Asylums\n\n        ',
 '\n\n          Title 27 - Intoxicating Liquors\n\n        ',
 '\n\n          Title 32 - National Guard ٭\n',
 '\n\n          Title 33 - Navigation and Navigable Waters\n\n        ',
 '\n\n          Title 34 - Crime Control and Law Enforcement\n\n        ',
 '\n\n          Title 36 - Patriotic and National Observances, Ceremonies, and Organizations ٭\n',
 "\n\n          Title 38 - Veterans' Benefits ٭\n",
 '\n\n          Title 42 - The Public Health and Welfare\n\n        ',
 '\n\n          Title 45 - Railroads\n\n        ',
 '\n\n          Title 49 - Transportation ٭\n',
 '\n\n 

In [18]:
#Looking for just numbers:

import re #regex

In [25]:
p = re.compile("\d+") # integer 
numbers = list(map(lambda x: int(p.search(x).group(0)),changed))
#.group(0) always returns the fully matched string

In [26]:
#Number of titles:
numbers

[7, 11, 13, 14, 15, 16, 21, 24, 27, 32, 33, 34, 36, 38, 42, 45, 49, 54]

## 3. Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten'

In [28]:
url = 'https://www.fbi.gov/wanted/topten'
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [29]:
soup = BeautifulSoup(response.content, "html.parser")

In [36]:
#query-results-0f737222c5054a81a120bce207b0446a > ul > li:nth-child(1) > h3 > a

l = len(soup.select('h3 > a'))
top10 = []

for i in range(l):
    top10.append(soup.select('h3 > a')[i].get_text())
    print(top10[i])

BHADRESHKUMAR CHETANBHAI PATEL
ALEJANDRO ROSALES CASTILLO
ARNOLDO JIMENEZ
JASON DEREK BROWN
ALEXIS FLORES
JOSE RODOLFO VILLARREAL-HERNANDEZ
EUGENE PALMER
RAFAEL CARO-QUINTERO
ROBERT WILLIAM FISHER
YASER ABDEL SAID


## 4. Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [63]:
url = 'https://www.emsc-csem.org/Earthquake/'
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [64]:
soup = BeautifulSoup(response.content, "html.parser")

In [92]:
soup.select("td > b > a")[0].get_text()


'2020-11-23\xa0\xa0\xa022:38:01.0'

In [93]:
soup.select('.tb_region')[0]

<td class="tb_region" id="reg0"> TARAPACA, CHILE</td>

In [95]:
for i in range(20):
    print(soup.select('.tabev1')[i]) #shows both lat & long

<td class="tabev1">20.76 </td>
<td class="tabev1">69.18 </td>
<td class="tabev1">18.02 </td>
<td class="tabev1">71.45 </td>
<td class="tabev1">56.72 </td>
<td class="tabev1">135.87 </td>
<td class="tabev1">53.44 </td>
<td class="tabev1">165.81 </td>
<td class="tabev1">22.64 </td>
<td class="tabev1">66.18 </td>
<td class="tabev1">17.93 </td>
<td class="tabev1">66.99 </td>
<td class="tabev1">42.94 </td>
<td class="tabev1">0.17 </td>
<td class="tabev1">0.78 </td>
<td class="tabev1">127.53 </td>
<td class="tabev1">36.83 </td>
<td class="tabev1">121.56 </td>
<td class="tabev1">44.66 </td>
<td class="tabev1">22.35 </td>


In [120]:
#date_time 
#selector path: [id="\39 23739"] > td.tabev6 > b > a
date = []
time = []

lat_n = []  #class tabev1
lat_d = []  #class tabev2
long_n = []
long_d = []
region = [] #.tb_region

In [121]:
for i in range(20):
    date.append(soup.select("td > b > a")[i].
                     get_text().replace('\xa0\xa0\xa0',',').split(',')[0])
    time.append(soup.select("td > b > a")[i].
                     get_text().replace('\xa0\xa0\xa0',',').split(',')[1])
    region.append(soup.select('.tb_region')[0].get_text().replace('\xa0',''))

len(region)

20

In [122]:
for i in range(0,40,2):                     
    lat_n.append(soup.select('.tabev1')[i].get_text().replace('\xa0',''))   
    long_n.append(soup.select('.tabev1')[i+1].get_text().replace('\xa0','')) #yes

len(lat_n)

20

In [123]:
for i in range(0,60,3):                     
    lat_d.append(soup.select('.tabev2')[i].get_text().replace('\xa0',''))    
    long_d.append(soup.select('.tabev2')[i+1].get_text().replace('\xa0',''))

len(lat_d)

20

In [124]:
eq = pd.DataFrame({"date":date,
                       "time":time,
                       "lat_n":lat_n,
                       "lat_d":lat_d,
                       "long_n":long_n,
                       "long_d":long_d,
                       "region": region
                      })

In [125]:
eq

Unnamed: 0,date,time,lat_n,lat_d,long_n,long_d,region
0,2020-11-23,22:38:01.0,20.76,S,69.18,W,"TARAPACA, CHILE"
1,2020-11-23,22:36:50.0,18.02,N,71.45,W,"TARAPACA, CHILE"
2,2020-11-23,22:19:56.6,56.72,N,135.87,W,"TARAPACA, CHILE"
3,2020-11-23,22:14:29.7,53.44,N,165.81,W,"TARAPACA, CHILE"
4,2020-11-23,22:07:51.1,22.64,S,66.18,W,"TARAPACA, CHILE"
5,2020-11-23,22:07:35.8,17.93,N,66.99,W,"TARAPACA, CHILE"
6,2020-11-23,21:50:27.9,42.94,N,0.17,E,"TARAPACA, CHILE"
7,2020-11-23,21:36:47.0,0.78,S,127.53,E,"TARAPACA, CHILE"
8,2020-11-23,21:22:29.3,36.83,N,121.56,W,"TARAPACA, CHILE"
9,2020-11-23,21:05:50.9,44.66,N,22.35,E,"TARAPACA, CHILE"
