<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping

INSTRUCTIONS:

- Run the cells
- Observe and understand the results
- Answer the questions

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Read the statements about legal use of data carefully; usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program, as this may overload or even break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
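
Parts of these rules can be supported in code. The following is a minimal sketch, not part of the lab itself: it checks a hypothetical `robots.txt` (inlined here so the example runs offline) with the standard library's `urllib.robotparser`, and spaces requests at least one second apart with a small `Throttle` helper. The `Throttle` class, the `ROBOTS_TXT` content and the `example.com` URLs are all assumptions made for illustration. A `robots.txt` check complements reading the Terms and Conditions; it does not replace it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_robot_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

class Throttle:
    """Make consecutive calls to wait() at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the behaviour can be tested
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

robots = build_robot_parser(ROBOTS_TXT)
throttle = Throttle(min_interval=1.0)

for url in ["https://example.com/wiki/Page", "https://example.com/private/x"]:
    if robots.can_fetch("*", url):
        throttle.wait()  # a real scraper would issue its request after this call
        print("allowed:", url)
    else:
        print("skipped:", url)
```

In a real scraper, `throttle.wait()` would be called immediately before each request to the same site.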

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note that the main text sits inside several levels of nested HTML tags.

![image.png](attachment:image.png)

In [1]:
pip install regex

Note: you may need to restart the kernel to use updated packages.


In [2]:
## Import Libraries
import regex as re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [3]:
# specify the url
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

### Retrieve the page
- Requires an Internet connection

In [4]:
# query the website and store the returned HTML in the variable `page`
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 428058
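
The output shows that `page` is a `bytes` object, not text. BeautifulSoup accepts bytes directly, but when plain text is needed the bytes have to be decoded with the right character set. The sketch below is one way to do that with the standard library only. The helper name `decode_page` is invented for this example, and the assumption is that the `Content-Type` header value is available (with urllib3 it could be read from `r.headers`).

```python
from email.message import Message

def decode_page(raw, content_type=None, default="utf-8"):
    """Decode response bytes, preferring the charset named in Content-Type."""
    charset = None
    if content_type:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()  # e.g. 'utf-8', or None if absent
    # errors="replace" substitutes malformed bytes instead of raising
    return raw.decode(charset or default, errors="replace")

# Hypothetical sample bytes standing in for a downloaded page
sample = "<p>café</p>".encode("utf-8")
text = decode_page(sample, "text/html; charset=utf-8")
print(type(sample).__name__, "->", type(text).__name__)
print(text)
```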


### Convert the stream of bytes into a BeautifulSoup representation

In [5]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [6]:
print(soup.prettify()[:1000])

<!DOCTYPE doctype html>
<html class="" dir="ltr" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, user-scalable=yes" name="viewport"/>
  <meta content="MediaWiki 1.19.24" name="generator">
   <meta content="The Big Bang Theory Wiki,bigbangtheory,Barry Kripke,Amy Farrah Fowler,Beverly Hofstadter,The Killer Robot Instability,Season 2,Season 3,Season 4,Season 5,Season 6,Season 7,Season 8" name="keywords">
    <meta content="Barry Kripke, Ph.D. is a Caltech plasma-physicist-turned-string-theorist and he is a colleague of Leonard and Sheldon. He has a case of rhotacism, where he pronounces &amp;quot;r&amp;quot; and &amp;quot;l&amp;quot; as &amp;quot;w&amp;quot;, much like Elmer Fudd in Looney Tunes. With a knack for ribbing people, he is a friend to Leonard, Howard..." name="description"/>
    <meta content="summary" name="twitter:card"/>
    <meta content="@getfandom" name="twitter:site"/>
    <meta content="http

### Check the HTML's Title

In [7]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Barry Kripke | The Big Bang Theory Wiki | Fandom</title>:
Title text:Barry Kripke | The Big Bang Theory Wiki | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [8]:
tag = 'article'
article = soup.find_all(tag)[0]
print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': Tag


### Get some of the text
- Plain text without HTML tags

In [9]:
# show the first 500 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', article.text)[:500])


watch						01:50
The Loop (TV)
 
	Do you like this video?	
 
				define('wikia.articleVideo.featuredVideo.data', function () {
					return {"mediaId":"CR0MZ2ZP","impressionsPerSession":1,"title":"The Loop (TV)","description":"","kind":"DYNAMIC","feedid":"CR0MZ2ZP","links":{"first":"https:\/\/cdn.jwplayer.com\/v2\/media\/CR0MZ2ZP?resource_id=CR0MZ2ZP&internal=false&page_offset=1&page_limit=10","last":"https:\/\/cdn.jwplayer.com\/v2\/media\/CR0MZ2ZP?resource_id=CR0MZ2ZP&internal=false&page_offset
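
The output above contains JavaScript because the text of `<script>` elements inside the article is returned along with the visible text. The sketch below illustrates the idea of skipping such elements. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the class `VisibleTextParser`, the function `visible_text` and the sample HTML are made up for this example. With BeautifulSoup, a comparable result can usually be reached by calling `decompose()` on the unwanted tags before reading `.text`.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Hypothetical fragment that mimics the structure seen in the output above
sample_html = """
<article>
  <p>Barry Kripke is a colleague of Leonard and Sheldon.</p>
  <script>define('wikia.articleVideo', function () { return {}; });</script>
  <p>He works at Caltech.</p>
</article>
"""
print(visible_text(sample_html))
```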


### Find the links in the text

In [10]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in article.find_all(tag)]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 393


['https://vignette.wikia.nocookie.net/bigbangtheory/images/e/e7/Plag16.jpg/revision/latest?cb=20190506103936',
 'https://vignette.wikia.nocookie.net/bigbangtheory/images/f/f6/BarryKripke.png/revision/latest?cb=20110225223327',
 '/wiki/John_Ross_Bowie',
 '/wiki/Amy_Farrah_Fowler',
 '/wiki/Beverly_Hofstadter',
 '/wiki/The_Killer_Robot_Instability',
 '/wiki/The_Change_Constant',
 '/wiki/Season_2',
 '/wiki/Season_3',
 '/wiki/Season_4',
 '/wiki/Season_5',
 '/wiki/Season_6',
 '/wiki/Season_7',
 '/wiki/Season_8',
 '/wiki/Season_9',
 '/wiki/Season_10',
 '/wiki/Season_11',
 '/wiki/Season_12',
 '/wiki/Caltech',
 '/wiki/String_theory',
 '/wiki/Leonard',
 '/wiki/Sheldon',
 '/wiki/Leonard',
 '/wiki/Howard',
 '/wiki/Rajesh',
 '/wiki/Amy',
 '#',
 '/wiki/Season_1',
 '/wiki/Kripke_Krippler',
 '/wiki/M.O.N.T.E.',
 '/wiki/Caltech',
 '/wiki/The_Killer_Robot_Instability',
 '/wiki/Penny',
 '/wiki/Howard',
 '/wiki/Penny',
 '/wiki/The_Friendship_Algorithm',
 '/wiki/Season_3',
 '/wiki/The_Electric_Can_Opener_F
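
Most of the links above are relative (`/wiki/...`), so they cannot be requested as they are. If the linked pages were to be fetched later, each link would first have to be combined with the URL of the page it came from. A small sketch using `urllib.parse.urljoin`; the sample list is a hand-picked subset with one shortened image URL and a `None` entry (which `t.get('href')` returns for an `<a>` tag without an `href`).

```python
from urllib.parse import urljoin

base_url = "https://bigbangtheory.fandom.com/wiki/Barry_Kripke"

# Hand-picked sample of href values; None stands for an <a> tag without href
sample_hrefs = [
    "/wiki/Penny",
    "https://vignette.wikia.nocookie.net/bigbangtheory/images/f/f6/BarryKripke.png",
    None,
]

# Relative links are resolved against base_url; absolute links are left unchanged
absolute = [urljoin(base_url, h) for h in sample_hrefs if h]
for url in absolute:
    print(url)
```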

In [11]:
# keep only the links to the wiki itself
tag_list = [t[6:] for t in tag_list if (t) and (t.startswith('/wiki/'))]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 380


['John_Ross_Bowie',
 'Amy_Farrah_Fowler',
 'Beverly_Hofstadter',
 'The_Killer_Robot_Instability',
 'The_Change_Constant',
 'Season_2',
 'Season_3',
 'Season_4',
 'Season_5',
 'Season_6',
 'Season_7',
 'Season_8',
 'Season_9',
 'Season_10',
 'Season_11',
 'Season_12',
 'Caltech',
 'String_theory',
 'Leonard',
 'Sheldon',
 'Leonard',
 'Howard',
 'Rajesh',
 'Amy',
 'Season_1',
 'Kripke_Krippler',
 'M.O.N.T.E.',
 'Caltech',
 'The_Killer_Robot_Instability',
 'Penny',
 'Howard',
 'Penny',
 'The_Friendship_Algorithm',
 'Season_3',
 'The_Electric_Can_Opener_Fluctuation',
 'Sheldon%27s_office',
 'The_Cafeteria',
 'Leonard',
 'Raj',
 'Sheldon',
 'President_Siebert',
 'The_Vengeance_Formulation',
 'Apartment_4A',
 'Zack_Johnson',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Raj%27s_apartment',
 'The_Toast_Derivation',
 'Raj',
 'Professor_Rothman',
 'Siri',
 'The_Beta_Test_Initiation',
 'The_Rothman_Disintegration',
 'Professor_Rothman',
 'President_Siebert',
 'The_Stag_Convergence',
 'File:Kirpike.jpg',
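
The slice `t[6:]` in the cell above works because `'/wiki/'` is six characters long, and the `(t)` test guards against `None` entries. The sketch below shows the same step on a small hand-made sample, with the prefix length derived from the prefix itself instead of a hard-coded number. The names `PREFIX`, `sample_hrefs` and `titles` are illustrative only.

```python
PREFIX = "/wiki/"

# Hand-made sample: wiki links, a bare anchor, a missing href and an external link
sample_hrefs = [
    "/wiki/John_Ross_Bowie",
    "#",
    None,
    "https://vignette.wikia.nocookie.net/x.png",
    "/wiki/Season_2",
]

# `h and ...` skips None and empty strings before startswith() is called
titles = [h[len(PREFIX):] for h in sample_hrefs if h and h.startswith(PREFIX)]
print(titles)  # ['John_Ross_Bowie', 'Season_2']
```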
 

In [12]:
# create a filter for undesired links: a regex alternation of substrings
# (named `link_filter` to avoid shadowing the built-in `filter`)
link_filter = '(%s)' % '|'.join([
    'Season_',
    'Category:',
    'File:',
    'Help:',
    'Portal:',
    'action=',
    'Special:',
    'Talk:'
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 249


['John_Ross_Bowie',
 'Amy_Farrah_Fowler',
 'Beverly_Hofstadter',
 'The_Killer_Robot_Instability',
 'The_Change_Constant',
 'Caltech',
 'String_theory',
 'Leonard',
 'Sheldon',
 'Leonard',
 'Howard',
 'Rajesh',
 'Amy',
 'Kripke_Krippler',
 'M.O.N.T.E.',
 'Caltech',
 'The_Killer_Robot_Instability',
 'Penny',
 'Howard',
 'Penny',
 'The_Friendship_Algorithm',
 'The_Electric_Can_Opener_Fluctuation',
 'Sheldon%27s_office',
 'The_Cafeteria',
 'Leonard',
 'Raj',
 'Sheldon',
 'President_Siebert',
 'The_Vengeance_Formulation',
 'Apartment_4A',
 'Zack_Johnson',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Raj%27s_apartment',
 'The_Toast_Derivation',
 'Raj',
 'Professor_Rothman',
 'Siri',
 'The_Beta_Test_Initiation',
 'The_Rothman_Disintegration',
 'Professor_Rothman',
 'President_Siebert',
 'The_Stag_Convergence',
 'The_Cooper/Kripke_Inversion',
 'Amy',
 'The_Tenure_Turbulence',
 'Janine_Davis',
 'The_Discovery_Dissipation',
 'Sheldon',
 'The_Relationship_Diremption',
 'String_Theory',
 'Stephen_Hawking',

In [13]:
# remove duplicates
tag_list = list(set(tag_list))
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 201


['The_Geology_Elevation',
 'Keith_Carradine',
 'Sarah_Michelle_Gellar',
 'The_Discovery_Dissipation',
 'The_cafeteria',
 'Christine_Baranski',
 'Missy_Cooper',
 'Amy_Farrah_Fowler',
 'The_Cooper/Kripke_Inversion',
 'V._M._Koothrappali',
 'Mark_Hamill',
 'Sara_Gilbert',
 'The_Stag_Convergence',
 'Kaley_Cuoco',
 'Katee_Sackhoff',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Katey_Sagal',
 'Sara_Rue',
 'The_Comic_Book_Store_Regeneration',
 'Fun_with_Flags',
 'Bill_Prady',
 'Mary_T._Quigley',
 'Steve_Wozniak',
 'Brian_Patrick_Wade',
 'Dave_Goetsch',
 'Chuck_Lorre_Productions',
 'Charlie_Sheen',
 'Simon_Helberg',
 'Halley_Wolowitz',
 'Template_talk:Characterf',
 'Judd_Hirsch',
 'Alessandra_Torresani',
 'Janine_Davis',
 'Courtney_Henggeler',
 'The_Valentino_Submergence',
 'Howard_Wolowitz',
 'Susan',
 'The_Electric_Can_Opener_Fluctuation',
 'Chuck_Lorre',
 'Eric_Kaplan',
 'Barenaked_Ladies',
 'Caltech',
 'Emily_Sweeney',
 'Jimmy_Speckerman',
 'The_Big_Bang_Theory',
 'The_History_of_Everything',
 'The
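
Converting to a `set` removes duplicates but does not keep the original order, which is why the list above starts with different entries than before. That is harmless here because the list is sorted in a later step. If the order of first appearance mattered, `dict.fromkeys` would be one alternative, sketched below on a small sample.

```python
sample = ['Leonard', 'Sheldon', 'Leonard', 'Howard', 'Sheldon']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(sample))
print(unique_in_order)  # ['Leonard', 'Sheldon', 'Howard']

# set() gives the same members, but the order is not guaranteed
print(sorted(set(sample)) == sorted(unique_in_order))  # True
```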

In [14]:
# convert escaped sequences
tag_list = [unquote(t) for t in tag_list]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 201


['The_Geology_Elevation',
 'Keith_Carradine',
 'Sarah_Michelle_Gellar',
 'The_Discovery_Dissipation',
 'The_cafeteria',
 'Christine_Baranski',
 'Missy_Cooper',
 'Amy_Farrah_Fowler',
 'The_Cooper/Kripke_Inversion',
 'V._M._Koothrappali',
 'Mark_Hamill',
 'Sara_Gilbert',
 'The_Stag_Convergence',
 'Kaley_Cuoco',
 'Katee_Sackhoff',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Katey_Sagal',
 'Sara_Rue',
 'The_Comic_Book_Store_Regeneration',
 'Fun_with_Flags',
 'Bill_Prady',
 'Mary_T._Quigley',
 'Steve_Wozniak',
 'Brian_Patrick_Wade',
 'Dave_Goetsch',
 'Chuck_Lorre_Productions',
 'Charlie_Sheen',
 'Simon_Helberg',
 'Halley_Wolowitz',
 'Template_talk:Characterf',
 'Judd_Hirsch',
 'Alessandra_Torresani',
 'Janine_Davis',
 'Courtney_Henggeler',
 'The_Valentino_Submergence',
 'Howard_Wolowitz',
 'Susan',
 'The_Electric_Can_Opener_Fluctuation',
 'Chuck_Lorre',
 'Eric_Kaplan',
 'Barenaked_Ladies',
 'Caltech',
 'Emily_Sweeney',
 'Jimmy_Speckerman',
 'The_Big_Bang_Theory',
 'The_History_of_Everything',
 'The
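
The truncated output above does not happen to show an entry that changed. A short illustration of what `unquote` does, using two percent-encoded titles that appear earlier in this lab (`%27` is the encoding of an apostrophe):

```python
from urllib.parse import unquote

samples = ["Sheldon%27s_office", "Raj%27s_apartment", "Caltech"]

# Percent-escapes are decoded; strings without escapes are returned unchanged
decoded = [unquote(s) for s in samples]
print(decoded)  # ["Sheldon's_office", "Raj's_apartment", 'Caltech']
```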

In [15]:
# convert underscore to space
tag_list = [t.replace('_', ' ') for t in tag_list]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 201


['The Geology Elevation',
 'Keith Carradine',
 'Sarah Michelle Gellar',
 'The Discovery Dissipation',
 'The cafeteria',
 'Christine Baranski',
 'Missy Cooper',
 'Amy Farrah Fowler',
 'The Cooper/Kripke Inversion',
 'V. M. Koothrappali',
 'Mark Hamill',
 'Sara Gilbert',
 'The Stag Convergence',
 'Kaley Cuoco',
 'Katee Sackhoff',
 'Stuart Bloom',
 'LeVar Burton',
 'Katey Sagal',
 'Sara Rue',
 'The Comic Book Store Regeneration',
 'Fun with Flags',
 'Bill Prady',
 'Mary T. Quigley',
 'Steve Wozniak',
 'Brian Patrick Wade',
 'Dave Goetsch',
 'Chuck Lorre Productions',
 'Charlie Sheen',
 'Simon Helberg',
 'Halley Wolowitz',
 'Template talk:Characterf',
 'Judd Hirsch',
 'Alessandra Torresani',
 'Janine Davis',
 'Courtney Henggeler',
 'The Valentino Submergence',
 'Howard Wolowitz',
 'Susan',
 'The Electric Can Opener Fluctuation',
 'Chuck Lorre',
 'Eric Kaplan',
 'Barenaked Ladies',
 'Caltech',
 'Emily Sweeney',
 'Jimmy Speckerman',
 'The Big Bang Theory',
 'The History of Everything',
 'The

In [16]:
# order the list
tag_list.sort()
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 201


['Aarti Mann',
 'Adam West',
 'Alessandra Torresani',
 'Alex Jensen',
 'Alfred Hofstadter',
 'Alice Amter',
 'Althea Davis',
 'Amy',
 'Amy Farrah Fowler',
 'Anthony Del Broccolo',
 'Anthony Rich',
 'Anu',
 'Apartment 4A',
 'Barenaked Ladies',
 'Bernadette',
 'Bernadette Rostenkowski-Wolowitz',
 'Bert Kibbler',
 'Beverly Hofstadter',
 'Bill Nye',
 'Bill Prady',
 'Brent Spiner',
 'Brian George',
 'Brian Greene',
 'Brian Patrick Wade',
 'Brian Posehn',
 'Brian Thomas Smith',
 'Buzz Aldrin',
 'Caltech',
 'Carol Ann Susi',
 'Carrie Fisher',
 'Casey Sander',
 'Charlie Sheen',
 'Christine Baranski',
 'Chuck Lorre',
 'Chuck Lorre Productions',
 'Cinnamon',
 'Claire',
 'Colonel Richard Williams',
 'Courtney Henggeler',
 'Dan',
 'Dave Goetsch',
 'David Gibbs',
 'David Saltzberg',
 'Dean Norris',
 'Debbie Wolowitz',
 'Denise',
 'Dennis Kim',
 'Dimitri',
 'Dr. Pemberton',
 'Emily Sweeney',
 'Eric Gablehauser',
 'Eric Kaplan',
 'Fun with Flags',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'George 
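
`list.sort()` compares strings by code point, so every uppercase letter sorts before every lowercase one. With entries that differ only in case, such as `The Cafeteria` and `The cafeteria`, this can place near-identical titles far apart. A sketch of a case-insensitive sort using `key=str.casefold` as an alternative (the lab itself uses the default order):

```python
sample = ['The cafeteria', 'The Stag Convergence', 'The Cafeteria']

# Default order: 'C' < 'S' < 'c' by code point
print(sorted(sample))
# ['The Cafeteria', 'The Stag Convergence', 'The cafeteria']

# Case-insensitive order; ties keep their original relative order (stable sort)
print(sorted(sample, key=str.casefold))
# ['The cafeteria', 'The Cafeteria', 'The Stag Convergence']
```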

### Create a filter for unwanted types of articles

In [17]:
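# drop every title that contains 'The ' anywhere in the string
# (named `link_filter` to avoid shadowing the built-in `filter`)
link_filter = '(%s)' % '|'.join([
    'The '
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]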
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 170


['Aarti Mann',
 'Adam West',
 'Alessandra Torresani',
 'Alex Jensen',
 'Alfred Hofstadter',
 'Alice Amter',
 'Althea Davis',
 'Amy',
 'Amy Farrah Fowler',
 'Anthony Del Broccolo',
 'Anthony Rich',
 'Anu',
 'Apartment 4A',
 'Barenaked Ladies',
 'Bernadette',
 'Bernadette Rostenkowski-Wolowitz',
 'Bert Kibbler',
 'Beverly Hofstadter',
 'Bill Nye',
 'Bill Prady',
 'Brent Spiner',
 'Brian George',
 'Brian Greene',
 'Brian Patrick Wade',
 'Brian Posehn',
 'Brian Thomas Smith',
 'Buzz Aldrin',
 'Caltech',
 'Carol Ann Susi',
 'Carrie Fisher',
 'Casey Sander',
 'Charlie Sheen',
 'Christine Baranski',
 'Chuck Lorre',
 'Chuck Lorre Productions',
 'Cinnamon',
 'Claire',
 'Colonel Richard Williams',
 'Courtney Henggeler',
 'Dan',
 'Dave Goetsch',
 'David Gibbs',
 'David Saltzberg',
 'Dean Norris',
 'Debbie Wolowitz',
 'Denise',
 'Dennis Kim',
 'Dimitri',
 'Dr. Pemberton',
 'Emily Sweeney',
 'Eric Gablehauser',
 'Eric Kaplan',
 'Fun with Flags',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'George 
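
Because `re.search` looks for the pattern anywhere in the string, this filter removes any title that contains `'The '`, not only titles that begin with it. The sketch below contrasts the two behaviours on a small sample; `'Meet The Parents'` is a made-up entry added purely to show the difference, and `str.startswith` is offered as a possible stricter alternative rather than what the lab does.

```python
import re

# 'Meet The Parents' is a hypothetical title, not taken from the wiki
sample = ['The Big Bang Theory', 'Adam West', 'Meet The Parents']

# As in the lab: match 'The ' anywhere in the title
kept_anywhere = [t for t in sample if not re.search('(The )', t)]
print(kept_anywhere)  # ['Adam West']

# Stricter alternative: only drop titles that start with 'The '
kept_prefix = [t for t in sample if not t.startswith('The ')]
print(kept_prefix)    # ['Adam West', 'Meet The Parents']
```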

---



© 2019 Institute of Data

