<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>



# Demo 8.3: Web Scraping

INSTRUCTIONS:

- Run the cells
- Observe and understand the results

# Web Scraping in Python (using BeautifulSoup)

# HTML Basics
Before starting with the code, let’s understand the basics of HTML and some rules of scraping.

## HTML tags
Below is the source code for a simple HTML webpage.

    <!DOCTYPE html>  
    <html>  
        <head>
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
    
This is the basic structure of an HTML webpage. Each tag marks out a block of the page:
1. HTML documents must start with a document type declaration, `<!DOCTYPE html>`.
2. The HTML document is contained between `<html>` and `</html>`.
3. Metadata and script declarations go between `<head>` and `</head>`.
4. The visible part of the HTML document is between the `<body>` and `</body>` tags.
5. Headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.

Also, HTML tags sometimes come with `id` or `class` attributes. The `id` attribute gives a tag an identifier whose value must be unique within the HTML document. The `class` attribute assigns a tag to a named group, typically so that all tags in the group share the same styling; many tags can carry the same class. We can use these ids and classes to locate the data we want.
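
As a minimal sketch of how `id` and `class` help locate data, the snippet below parses a small HTML fragment with BeautifulSoup. The fragment, the id `main-title`, and the class `quote` are invented for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example
html = """
<html>
  <body>
    <h1 id="main-title">First Scraping</h1>
    <p class="quote">Hello World</p>
    <p class="quote">Hello Again</p>
    <p>Not a quote</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# an id is unique, so find() returns the single matching tag (or None)
title = soup.find(id='main-title')
print(title.get_text())

# a class can be shared, so find_all() returns every matching tag
quotes = [p.get_text() for p in soup.find_all('p', class_='quote')]
print(quotes)
```

Note that the keyword is `class_` with a trailing underscore, because `class` is a reserved word in Python.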

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may overload the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
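
One way to follow the second rule is to pause between requests. The sketch below is only an illustration: `fetch_politely` and `fake_fetch` are hypothetical names, and the download function is passed in as a parameter so the example runs without a network connection.

```python
import time

def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Stand-in for a real download function (hypothetical, makes no network call)
def fake_fetch(url):
    return 'content of %s' % url

# a short delay keeps the demonstration quick; use about 1 second on a real site
pages = fetch_politely(['page-1', 'page-2', 'page-3'], fake_fetch, delay=0.01)
print(pages)
```

In real use, `fetch` would wrap the actual HTTP call and `delay` would stay at one second or more.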

## Inspecting the Page
Let’s take one page from the **Memory Alpha** website as an example.

To investigate some relationships, let's get the links from this page.

Open the [Prinadora](https://memory-alpha.fandom.com/wiki/Prinadora) page in a browser and inspect it.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, find the navigation menu text, which is nested inside a couple of levels of HTML tags: `<div class="fandom-sticky-header">` → `<nav class="fandom-community-header__local-navigation">`.

This is the section of the page we focus on.

In [1]:
# ! pip3 install regex

In [2]:
# ! pip3 install bs4

In [3]:
## Import Libraries
import regex as re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [4]:
# specify the url
quote_page = 'https://memory-alpha.fandom.com/wiki/Prinadora'

### Retrieve the page
- Requires an Internet connection

In [5]:
# query the website and store the returned HTML in the variable `page`
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 421899


### Convert the stream of bytes into a BeautifulSoup representation

In [7]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [14]:
print(soup.prettify()[:980])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Prinadora | Memory Alpha | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Prinadora","wgTitle":"Prinadora","wgCurRevisionId":2846125,"wgRevisionId":2846125,"wgArticleId":8581,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Ferengi"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","


### Check the HTML's Title

In [15]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Prinadora | Memory Alpha | Fandom</title>:
Title text:Prinadora | Memory Alpha | Fandom:


### nav tag
- This page uses the tag `nav` for navigation links

        <nav class="fandom-community-header__local-navigation">

In [16]:
tag = 'nav'
nav = soup.find_all(tag)[0]
print('Type of the variable \'nav\':', nav.__class__.__name__)

Type of the variable 'nav': Tag


### Get some of the text
- Plain text without HTML tags

In [25]:
# show the first 500 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', nav.text)[:500])

# print(re.sub(r'\n\n+', '', nav.text)[:500])


 Explore
 
 Main Page
 Discuss
All Pages
Community
Portals
 
People
 
Bajorans
Borg
Ferengi
Humans
Klingons
Romulans
Vulcans
Starfleet personnel
Society & Culture
 
Borg
Cardassian Union
Dominion
Ferengi Alliance
Federation
Klingon Empire
Romulan Empire
Vulcan
Science
 
Alpha Quadrant
Beta Quadrant
Gamma Quadrant
Delta Quadrant
Technology
 
Spacecraft
Starships
Spacecraft classes
Starship classes
Stations
All technology
The Alternate Reality
 
History 
People
Places
Things
Spacecraft
Starfleet 
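
The blank-looking lines left in the output above contain a single space, so the pattern `\n\n+` (two or more consecutive newlines) does not match across them. The sketch below, on a made-up string, compares that pattern with `\n\s*\n`, which also swallows whitespace-only lines. It uses the standard `re` module, whose `sub` behaves the same way for these patterns.

```python
import re

# Made-up text: one whitespace-only line and one run of empty lines
raw = 'Explore\n \nMain Page\n\n\nDiscuss'

# collapses runs of newlines only; the line holding a single space survives
newlines_only = re.sub(r'\n\n+', '\n', raw)
print(repr(newlines_only))

# \s* also matches spaces and tabs between the newlines
whitespace_too = re.sub(r'\n\s*\n', '\n', raw)
print(repr(whitespace_too))
```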


### Find the links in the text

In [26]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in nav.find_all(tag)]
tag_list

['#',
 'https://memory-alpha.fandom.com/wiki/Portal:Main',
 '/f',
 'https://memory-alpha.fandom.com/wiki/Special:AllPages',
 'https://memory-alpha.fandom.com/wiki/Special:Community',
 'https://memory-alpha.fandom.com/wiki/Category:Memory_Alpha_portals',
 'https://memory-alpha.fandom.com/wiki/Portal:People',
 'https://memory-alpha.fandom.com/wiki/Category:Bajorans',
 'https://memory-alpha.fandom.com/wiki/Category:Borg',
 'https://memory-alpha.fandom.com/wiki/Category:Ferengi',
 'https://memory-alpha.fandom.com/wiki/Category:Humans',
 'https://memory-alpha.fandom.com/wiki/Category:Klingons',
 'https://memory-alpha.fandom.com/wiki/Category:Romulans',
 'https://memory-alpha.fandom.com/wiki/Category:Vulcans',
 'https://memory-alpha.fandom.com/wiki/Category:Starfleet_personnel',
 'https://memory-alpha.fandom.com/wiki/Portal:Society_and_Culture',
 'https://memory-alpha.fandom.com/wiki/Borg',
 'https://memory-alpha.fandom.com/wiki/Cardassian_Union',
 'https://memory-alpha.fandom.com/wiki/Domin

In [22]:
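# keep only the links to the wiki itself, dropping the first 37 characters
# (the length of 'https://memory-alpha.fandom.com/wiki/') from each one
tag_list = [t[37:] for t in tag_list if (t) and ('/wiki/' in t)]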
tag_list

['Portal:Main',
 'Special:AllPages',
 'Special:Community',
 'Category:Memory_Alpha_portals',
 'Portal:People',
 'Category:Bajorans',
 'Category:Borg',
 'Category:Ferengi',
 'Category:Humans',
 'Category:Klingons',
 'Category:Romulans',
 'Category:Vulcans',
 'Category:Starfleet_personnel',
 'Portal:Society_and_Culture',
 'Borg',
 'Cardassian_Union',
 'Dominion',
 'Ferengi_Alliance',
 'United_Federation_of_Planets',
 'Klingon_Empire',
 'Romulan_Star_Empire',
 'Vulcan',
 'Portal:Science',
 'Alpha_Quadrant',
 'Beta_Quadrant',
 'Gamma_Quadrant',
 'Delta_Quadrant',
 'Portal:Technology',
 'Category:Spacecraft',
 'Category:Starships',
 'Category:Spacecraft_classes',
 'Category:Starship_classes',
 'Category:Stations',
 'Category:Technology',
 'Portal:Alternate_Reality',
 'Alternate_reality',
 'Category:Alternate_reality_inhabitants',
 'Category:Locations_(alternate_reality)',
 'Category:Alternate_reality',
 'Category:Spacecraft_(alternate_reality)',
 'Category:Starfleet_personnel_(alternate_rea
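
The slice `t[37:]` depends on every link starting with a prefix of exactly 37 characters. A less fragile alternative, sketched here under the assumption that article links always have a path beginning with `/wiki/`, is to parse each URL and strip the path prefix. The helper name `wiki_title` is hypothetical.

```python
from urllib.parse import urlparse

WIKI_PATH = '/wiki/'  # path prefix assumed for article links

def wiki_title(href):
    """Return the part of the path after '/wiki/', or None for other links."""
    if not href:
        return None
    path = urlparse(href).path
    if not path.startswith(WIKI_PATH):
        return None
    return path[len(WIKI_PATH):]

links = [
    '#',
    '/f',
    None,
    'https://memory-alpha.fandom.com/wiki/Portal:Main',
    'https://memory-alpha.fandom.com/wiki/Cardassian_Union',
]
titles = [wiki_title(t) for t in links if wiki_title(t)]
print(titles)
```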

In [27]:
# create a filter for undesired links
# (named `link_filter` to avoid shadowing Python's built-in `filter`)
link_filter = '(%s)' % '|'.join([
    'Category:',
    'File:',
    'Help:',
    'Memory_Alpha:',
    'Portal:',
    'action=',
    'Special:',
    'Star_Trek:',
    'Star_Trek_',
    'Talk:'
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]
tag_list

['#',
 '/f',
 'https://memory-alpha.fandom.com/wiki/Borg',
 'https://memory-alpha.fandom.com/wiki/Cardassian_Union',
 'https://memory-alpha.fandom.com/wiki/Dominion',
 'https://memory-alpha.fandom.com/wiki/Ferengi_Alliance',
 'https://memory-alpha.fandom.com/wiki/United_Federation_of_Planets',
 'https://memory-alpha.fandom.com/wiki/Klingon_Empire',
 'https://memory-alpha.fandom.com/wiki/Romulan_Star_Empire',
 'https://memory-alpha.fandom.com/wiki/Vulcan',
 'https://memory-alpha.fandom.com/wiki/Alpha_Quadrant',
 'https://memory-alpha.fandom.com/wiki/Beta_Quadrant',
 'https://memory-alpha.fandom.com/wiki/Gamma_Quadrant',
 'https://memory-alpha.fandom.com/wiki/Delta_Quadrant',
 'https://memory-alpha.fandom.com/wiki/Alternate_reality',
 'https://memory-alpha.fandom.com/wiki/Studio_model',
 'https://memory-alpha.fandom.com/wiki/Retroactive_continuity',
 'https://memory-alpha.fandom.com/wiki/Deleted_scene',
 'https://memory-alpha.fandom.com/wiki/DVD',
 'https://memory-alpha.fandom.com/wiki/B
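
To see the alternation filter at work on a small input, the sketch below applies a shortened list of unwanted terms to four links. Two choices here are additions rather than part of the demo: the pattern is compiled once, and each term goes through `re.escape` so that any regex metacharacters in a term are treated literally. The standard `re` module is used in place of `regex`.

```python
import re

# unwanted terms (a shortened version of the list above)
unwanted = ['Category:', 'Portal:', 'Special:']

# join the escaped terms with '|' so that any one of them can match
pattern = re.compile('|'.join(re.escape(u) for u in unwanted))

links = [
    'https://memory-alpha.fandom.com/wiki/Portal:People',
    'https://memory-alpha.fandom.com/wiki/Category:Borg',
    'https://memory-alpha.fandom.com/wiki/Borg',
    'https://memory-alpha.fandom.com/wiki/Vulcan',
]
# search() looks for the pattern anywhere in the string
kept = [t for t in links if not pattern.search(t)]
print(kept)
```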

In [28]:
# remove duplicates
tag_list = list(set(tag_list))
tag_list

['https://memory-alpha.fandom.com/wiki/The_Diviner',
 'https://memory-alpha.fandom.com/wiki/Pavel_Chekov',
 'https://memory-alpha.fandom.com/wiki/Nyota_Uhura',
 'https://memory-alpha.fandom.com/wiki/Thy%27lek_Shran',
 'https://memory-alpha.fandom.com/wiki/Leonard_McCoy_(alternate_reality)',
 'https://memory-alpha.fandom.com/wiki/Charles_Tucker_III',
 'https://memory-alpha.fandom.com/wiki/Rok-Tahk',
 'https://memory-alpha.fandom.com/wiki/Jake_Sisko',
 'https://memory-alpha.fandom.com/wiki/T%27Pol',
 'https://memory-alpha.fandom.com/wiki/DVD',
 'https://memory-alpha.fandom.com/wiki/Jonathan_Archer',
 'https://memory-alpha.fandom.com/wiki/D%27Vana_Tendi',
 'https://memory-alpha.fandom.com/wiki/Dal_R%27El',
 'https://memory-alpha.fandom.com/wiki/Hugh_Culber',
 'https://memory-alpha.fandom.com/wiki/Deep_Space_9',
 'https://memory-alpha.fandom.com/wiki/Enterprise_(NX-01)',
 'https://memory-alpha.fandom.com/wiki/Soji_Asha',
 'https://memory-alpha.fandom.com/wiki/USS_Discovery',
 'https://memo
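
Converting to a `set` removes duplicates but does not preserve the original order, which is why the list above looks shuffled. If the order of first appearance matters, `dict.fromkeys` is one alternative, sketched below on a made-up list.

```python
links = ['Borg', 'Vulcan', 'Borg', 'Dominion', 'Vulcan']

# set() removes duplicates but does not promise any particular order
unordered = list(set(links))

# dict keys are unique and keep insertion order (Python 3.7+)
ordered = list(dict.fromkeys(links))
print(ordered)
```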

In [29]:
# convert escaped sequences
tag_list = [unquote(t) for t in tag_list]
tag_list

['https://memory-alpha.fandom.com/wiki/The_Diviner',
 'https://memory-alpha.fandom.com/wiki/Pavel_Chekov',
 'https://memory-alpha.fandom.com/wiki/Nyota_Uhura',
 "https://memory-alpha.fandom.com/wiki/Thy'lek_Shran",
 'https://memory-alpha.fandom.com/wiki/Leonard_McCoy_(alternate_reality)',
 'https://memory-alpha.fandom.com/wiki/Charles_Tucker_III',
 'https://memory-alpha.fandom.com/wiki/Rok-Tahk',
 'https://memory-alpha.fandom.com/wiki/Jake_Sisko',
 "https://memory-alpha.fandom.com/wiki/T'Pol",
 'https://memory-alpha.fandom.com/wiki/DVD',
 'https://memory-alpha.fandom.com/wiki/Jonathan_Archer',
 "https://memory-alpha.fandom.com/wiki/D'Vana_Tendi",
 "https://memory-alpha.fandom.com/wiki/Dal_R'El",
 'https://memory-alpha.fandom.com/wiki/Hugh_Culber',
 'https://memory-alpha.fandom.com/wiki/Deep_Space_9',
 'https://memory-alpha.fandom.com/wiki/Enterprise_(NX-01)',
 'https://memory-alpha.fandom.com/wiki/Soji_Asha',
 'https://memory-alpha.fandom.com/wiki/USS_Discovery',
 'https://memory-alpha
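
As a small illustration of the decoding step, `%27` is the percent-encoded form of an apostrophe. The sketch below decodes one title and then re-encodes it with `quote`, which performs the reverse conversion.

```python
from urllib.parse import quote, unquote

encoded = 'T%27Pol'
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way; safe='' means no extra characters are left unescaped
reencoded = quote(decoded, safe='')
print(reencoded)
```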

In [30]:
# convert underscore to space
tag_list = [re.sub('_', ' ', t) for t in tag_list]
tag_list

['https://memory-alpha.fandom.com/wiki/The Diviner',
 'https://memory-alpha.fandom.com/wiki/Pavel Chekov',
 'https://memory-alpha.fandom.com/wiki/Nyota Uhura',
 "https://memory-alpha.fandom.com/wiki/Thy'lek Shran",
 'https://memory-alpha.fandom.com/wiki/Leonard McCoy (alternate reality)',
 'https://memory-alpha.fandom.com/wiki/Charles Tucker III',
 'https://memory-alpha.fandom.com/wiki/Rok-Tahk',
 'https://memory-alpha.fandom.com/wiki/Jake Sisko',
 "https://memory-alpha.fandom.com/wiki/T'Pol",
 'https://memory-alpha.fandom.com/wiki/DVD',
 'https://memory-alpha.fandom.com/wiki/Jonathan Archer',
 "https://memory-alpha.fandom.com/wiki/D'Vana Tendi",
 "https://memory-alpha.fandom.com/wiki/Dal R'El",
 'https://memory-alpha.fandom.com/wiki/Hugh Culber',
 'https://memory-alpha.fandom.com/wiki/Deep Space 9',
 'https://memory-alpha.fandom.com/wiki/Enterprise (NX-01)',
 'https://memory-alpha.fandom.com/wiki/Soji Asha',
 'https://memory-alpha.fandom.com/wiki/USS Discovery',
 'https://memory-alpha

In [31]:
# order the list
tag_list.sort()
tag_list

['#',
 '/f',
 'https://memory-alpha.fandom.com/wiki/Adira Tal',
 'https://memory-alpha.fandom.com/wiki/Agnes Jurati',
 'https://memory-alpha.fandom.com/wiki/Alpha Quadrant',
 'https://memory-alpha.fandom.com/wiki/Alternate reality',
 "https://memory-alpha.fandom.com/wiki/B'Elanna Torres",
 'https://memory-alpha.fandom.com/wiki/Beckett Mariner',
 'https://memory-alpha.fandom.com/wiki/Benjamin Sisko',
 'https://memory-alpha.fandom.com/wiki/Beta Quadrant',
 'https://memory-alpha.fandom.com/wiki/Beverly Crusher',
 'https://memory-alpha.fandom.com/wiki/Blu-ray Disc',
 'https://memory-alpha.fandom.com/wiki/Borg',
 'https://memory-alpha.fandom.com/wiki/Brad Boimler',
 'https://memory-alpha.fandom.com/wiki/Calendars',
 'https://memory-alpha.fandom.com/wiki/Cardassian Union',
 'https://memory-alpha.fandom.com/wiki/Carol Freeman',
 'https://memory-alpha.fandom.com/wiki/Chakotay',
 'https://memory-alpha.fandom.com/wiki/Charles Tucker III',
 'https://memory-alpha.fandom.com/wiki/Christine Chapel',

### Create a filter for unwanted types of articles

In [32]:
link_filter = '(%s)' % '|'.join([
    'episode',
    'lternate[ _]reality', # both "Alternate reality" and "alternate reality"; underscores are already spaces here
    'mirror',
    'rank',
    'production',
    'Season'
])
# remove the links that match the filter
tag_list = [t for t in tag_list if not re.search(link_filter, t)]
tag_list

['#',
 '/f',
 'https://memory-alpha.fandom.com/wiki/Adira Tal',
 'https://memory-alpha.fandom.com/wiki/Agnes Jurati',
 'https://memory-alpha.fandom.com/wiki/Alpha Quadrant',
 "https://memory-alpha.fandom.com/wiki/B'Elanna Torres",
 'https://memory-alpha.fandom.com/wiki/Beckett Mariner',
 'https://memory-alpha.fandom.com/wiki/Benjamin Sisko',
 'https://memory-alpha.fandom.com/wiki/Beta Quadrant',
 'https://memory-alpha.fandom.com/wiki/Beverly Crusher',
 'https://memory-alpha.fandom.com/wiki/Blu-ray Disc',
 'https://memory-alpha.fandom.com/wiki/Borg',
 'https://memory-alpha.fandom.com/wiki/Brad Boimler',
 'https://memory-alpha.fandom.com/wiki/Calendars',
 'https://memory-alpha.fandom.com/wiki/Cardassian Union',
 'https://memory-alpha.fandom.com/wiki/Carol Freeman',
 'https://memory-alpha.fandom.com/wiki/Chakotay',
 'https://memory-alpha.fandom.com/wiki/Charles Tucker III',
 'https://memory-alpha.fandom.com/wiki/Christine Chapel',
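
The cleaning steps in this demo can also be gathered into a single function. The version below is a sketch, not the notebook's own code: `clean_links` and its parameters are hypothetical names, it uses the standard `re` module, and it filters before converting underscores so that terms written with underscores still match.

```python
import re
from urllib.parse import unquote

def clean_links(hrefs, unwanted):
    """Filter, de-duplicate, decode and sort a list of href values."""
    pattern = re.compile('|'.join(re.escape(u) for u in unwanted))
    # drop missing hrefs and anything matching an unwanted term
    kept = [h for h in hrefs if h and not pattern.search(h)]
    # remove duplicates while keeping first-seen order
    kept = list(dict.fromkeys(kept))
    # decode %xx escapes, then turn underscores into spaces
    kept = [unquote(h).replace('_', ' ') for h in kept]
    return sorted(kept)

hrefs = [
    'https://memory-alpha.fandom.com/wiki/T%27Pol',
    'https://memory-alpha.fandom.com/wiki/Category:Borg',
    'https://memory-alpha.fandom.com/wiki/Jake_Sisko',
    'https://memory-alpha.fandom.com/wiki/T%27Pol',
    None,
]
cleaned = clean_links(hrefs, ['Category:', 'Portal:'])
print(cleaned)
```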



---

© 2021 Institute of Data



