<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
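
Rule 2 can be sketched as a small helper that pauses between consecutive requests. This is a minimal sketch, not part of the lab: the name `fetch_politely` and its parameters are hypothetical, and the function that actually downloads a page is passed in as `fetch`, so the example runs without a network connection.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` and `sleep` are injected so the helper can be tried offline.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Offline demonstration: a fake fetch and a recorded (not real) sleep.
pauses = []
pages = fetch_politely(['a', 'b', 'c'], fetch=str.upper, sleep=pauses.append)
print(pages, pauses)
```

With `urllib3`, `fetch` could be something like `lambda url: http.request('GET', url).data`, assuming `http` is a `urllib3.PoolManager()` as used later in this lab.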

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note which HTML tags (usually a few levels deep) enclose the main text.

In [2]:
!pip install regex

Collecting regex
  Downloading regex-2021.9.24-cp38-cp38-win_amd64.whl (273 kB)
Installing collected packages: regex
Successfully installed regex-2021.9.24


In [4]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.2.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=10b1512a69d81526b2eedf7767681a6e9c44387be9774cc2d44348aee7ad0c48
  Stored in directory: c:\users\nnama\appdata\local\pip\cache\wheels\75\78\21\68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.10.0 bs4-0.0.1 soupsieve-2.2.1


In [5]:
import regex as re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [6]:
# specify the url
page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

### Retrieve the page
- Requires an Internet connection

In [7]:
# query the website; on success, `page` is overwritten with the response body (bytes)
http = urllib3.PoolManager()
r = http.request('GET', page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 518424


### Convert the stream of bytes into a BeautifulSoup representation

In [8]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [11]:
print(soup.prettify()[:2000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Barry Kripke | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":352395,"wgRevisionId":352395,"wgArticleId":2273,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Physicists","Experimental Physicists","Theoretical Physicists","Particle Physicists","Recurring Characters","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","Season 9","The Big Bang Theory","Kripke","Single","Sheldon

### Check the HTML's Title

In [12]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Barry Kripke | The Big Bang Theory Wiki | Fandom</title>:
Title text:Barry Kripke | The Big Bang Theory Wiki | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [None]:
# Article was out of range by the time of completing this lab.
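
One way to narrow the soup down to the main content is to look up the container element found with the browser inspector. The sketch below runs on a made-up HTML snippet and assumes the article body sits in a `div` with the class `mw-parser-output`; that class name is an assumption and should be checked against the live page, since the layout may change.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = """
<html><body>
  <nav><a href="/wiki/Main_Page">Main Page</a></nav>
  <div class="mw-parser-output">
    <p><b>Sample title</b> followed by sample article text.</p>
    <a href="/wiki/Caltech">Caltech</a>
  </div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Assumed container: verify the tag and class in the browser inspector first.
article = sample_soup.find('div', class_='mw-parser-output')

if article is None:
    # Fall back to the whole document when the container is not found.
    article = sample_soup

# Only links inside the article container are collected.
article_links = [a.get('href') for a in article.find_all('a')]
print(article_links)
```

Searching inside `article` instead of the whole soup leaves out navigation and footer links, which make up much of the output in the cells below.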

### Get some of the text
- Plain text without HTML tags

In [32]:
print(soup.get_text())





Barry Kripke | The Big Bang Theory Wiki | Fandom



















































The Big Bang Theory Wiki





 Explore

 




 Main Page




 Discuss




All Pages




Community




Recent Blog Posts








Characters






Big Bang Theory

 




Main Characters
 




Leonard Hofstadter




Penny Hofstadter




Sheldon Cooper




Amy Farrah Fowler




Howard Wolowitz




Bernadette Rostenkowski-Wolowitz




Rajesh Koothrappali




Stuart Bloom




Leslie Winkle




Emily Sweeney







Recurring Characters
 




Beverly Hofstadter




Mary Cooper




Debbie Wolowitz




Mike Rostenkowski




V. M. Koothrappali




Priya Koothrappali




Denise




Barry Kripke




Wil Wheaton




Zack Johnson







Seasons (1-6)
 




Season 1




Season 2




Season 3




Season 4




Season 5




Season 6







Seasons (7-12)
 




Season 7




Season 8




Season 9




Season 10




Season 11




Season 12











Young Sheldon

 




Main Characters
 




Sheldon Coope

In [38]:
# show the text characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', soup.text))


Barry Kripke | The Big Bang Theory Wiki | Fandom
The Big Bang Theory Wiki
 Explore
 
 Main Page
 Discuss
All Pages
Community
Recent Blog Posts
Characters
Big Bang Theory
 
Main Characters
 
Leonard Hofstadter
Penny Hofstadter
Sheldon Cooper
Amy Farrah Fowler
Howard Wolowitz
Bernadette Rostenkowski-Wolowitz
Rajesh Koothrappali
Stuart Bloom
Leslie Winkle
Emily Sweeney
Recurring Characters
 
Beverly Hofstadter
Mary Cooper
Debbie Wolowitz
Mike Rostenkowski
V. M. Koothrappali
Priya Koothrappali
Denise
Barry Kripke
Wil Wheaton
Zack Johnson
Seasons (1-6)
 
Season 1
Season 2
Season 3
Season 4
Season 5
Season 6
Seasons (7-12)
 
Season 7
Season 8
Season 9
Season 10
Season 11
Season 12
Young Sheldon
 
Main Characters
 
Sheldon Cooper
Mary Cooper
George Cooper Sr.
George Cooper Jr.
Missy Cooper
Meemaw
Jeff Difford
Recurring Characters
 
Tam Nguyen
Veronica Duncan
Billy Sparks
Brenda Sparks
John Sturgis
Dale Ballard
Paige Swanson
Seasons
 
Season 1
Season 2
Season 3
Season 4
 Explore
 
 Main Page
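
The output above still contains lines that hold only a space, because the pattern `\n\n+` matches consecutive newlines only. A possible alternative, sketched here with a hypothetical helper name, strips every line and drops the empty ones:

```python
def collapse_blank_lines(text):
    """Strip each line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in stripped if line)


# Sample text shaped like the get_text() output above.
sample_text = '\n\n Explore\n\n \n\n\n Main Page\n\n'
collapsed = collapse_blank_lines(sample_text)
print(collapsed)
```

In the notebook this would be called as `collapse_blank_lines(soup.get_text())`.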


### Find the links in the text

In [20]:
for i in soup.find_all('a'):
    print(i)

<a class="fandom-sticky-header__logo" href="//bigbangtheory.fandom.com">
<img alt="The Big Bang Theory Wiki" height="65" src="https://static.wikia.nocookie.net/bigbangtheory/images/e/e6/Site-logo.png/revision/latest?cb=20210531192123" width="250"/>
</a>
<a class="fandom-sticky-header__sitename" href="//bigbangtheory.fandom.com">The Big Bang Theory Wiki</a>
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
</a>
<a data-tracking="explore-all-pages" href="https

In [21]:
for i in soup.find_all('b'):
    print(i)

<b>Barry Kripke</b>
<b><a href="/wiki/Season_1" title="Season 1">Season 1</a></b>
<b><a href="/wiki/Season_2" title="Season 2">Season 2</a></b>
<b><a href="/wiki/Season_3" title="Season 3">Season 3</a></b>
<b><a href="/wiki/Season_4" title="Season 4">Season 4</a></b>
<b><a href="/wiki/Season_5" title="Season 5">Season 5</a></b>
<b><a href="/wiki/Season_6" title="Season 6">Season 6</a></b>
<b><a href="/wiki/Season_7" title="Season 7">Season 7</a></b>
<b><a href="/wiki/Season_8" title="Season 8">Season 8</a></b>
<b><a href="/wiki/Season_9" title="Season 9">Season 9</a></b>
<b><a href="/wiki/Season_10" title="Season 10">Season 10</a></b>
<b><a href="/wiki/Season_11" title="Season 11">Season 11</a></b>
<b><a href="/wiki/Season_12" title="Season 12">Season 12</a></b>
<b><a href="/wiki/Season_1_(Young_Sheldon)" title="Season 1 (Young Sheldon)">Season 1</a></b>
<b><a href="/wiki/Season_2_(Young_Sheldon)" title="Season 2 (Young Sheldon)">Season 2</a></b>
<b><a href="/wiki/Season_3_(Young_Sheld

In [23]:
# identify the type of tag to retrieve
link_tag = 'a'

# create a list with the links from the `<a>` tag
tag_list = []
for t in soup.find_all(link_tag):
    tag_list.append(t.get('href'))

# List comprehension version:
# tag_list = [t.get('href') for t in soup.find_all(link_tag)]

print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 1048


['//bigbangtheory.fandom.com',
 '//bigbangtheory.fandom.com',
 '#',
 'https://bigbangtheory.fandom.com/wiki/Main_Page',
 '/f',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community',
 '/wiki/Blog:Recent_posts',
 'https://bigbangtheory.fandom.com/wiki/Category:Characters',
 'https://bigbangtheory.fandom.com/wiki/Big_Bang_Theory',
 'https://bigbangtheory.fandom.com/wiki/Category:Main_Characters',
 'https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
 'https://bigbangtheory.fandom.com/wiki/Amy_Farrah_Fowler',
 'https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
 'https://bigbangtheory.fandom.com/wiki/Stuart_Bloom',
 'https://bigbangtheory.fandom.com/wiki/Leslie_Winkl
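
The list mixes protocol-relative (`//host`), root-relative (`/wiki/...`) and absolute links. A sketch of how they could be normalised with `urllib.parse` before filtering; the helper names and the sample list are illustrative, not part of the lab:

```python
from urllib.parse import urljoin, urlparse

base_url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

# A few hrefs in the shapes seen in the output above, plus a missing href.
sample_links = [
    '//bigbangtheory.fandom.com',
    '/f',
    '/wiki/Blog:Recent_posts',
    'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
    None,
]


def to_absolute(links, base):
    """Resolve every non-empty href against the URL of the scraped page."""
    return [urljoin(base, link) for link in links if link]


def wiki_page_names(absolute_links, host):
    """Keep links to /wiki/ pages on the given host and return the page names."""
    names = []
    for link in absolute_links:
        parts = urlparse(link)
        if parts.netloc == host and parts.path.startswith('/wiki/'):
            names.append(parts.path[len('/wiki/'):])
    return names


absolute_links = to_absolute(sample_links, base_url)
page_names = wiki_page_names(absolute_links, 'bigbangtheory.fandom.com')
print(page_names)
```

Unlike a check on the first six characters of the href, this also picks up wiki pages that are linked with a full `https://` URL. Because `urlparse` separates the query string from the path, a suffix such as `?action=edit` is not part of the returned name.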

In [24]:
# keep only the links to the wiki itself
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:6] == '/wiki/':
        wiki_link = link[6:]
        wiki_tag_list.append(wiki_link)

# List comprehension:
# wiki_tag_list = [link[6:] for link in tag_list if link is not None and link[:6] == '/wiki/']

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 387


['Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Special:Search',
 'Special:Search',
 'Special:Search',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Barry_Kripke?action=edit',
 'Category:Characters',
 'Category:Caltech_Faculty',
 'Category:Scientists',
 'Category:Physicists',
 'Category:Experimental_Physicists',
 'Category:Theoretical_Physicists',
 'Category:Particle_Physicists',
 'Category:Recurring_Characters',
 'Category:Season_2',
 'Category:Season_3',
 'Category:Season_4',
 'Category:Season_5',
 'Category:Season_6',
 'Category:Season_7',
 'Category:Season_8',
 'Category:Season_9',
 'Category:The_Big_Bang_Theory',
 'Category:Kripke',
 'Category:Single',
 'Category:Sheldon%27s_Mortal_Enemies',
 'Category:Ph.D.',
 'Category:Season_3_Characters',
 'Category:Season_4_Characters',
 'Category:Season_5_Characters',
 'Category:Season_6_Characters',
 'Category:Season_8_Characters',
 'Category:Season_9_Char

In [37]:

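# remove the links that match the `filter` pattern
# NOTE: `filter` must already hold a regex pattern (it is defined in a cell not shown here);
# otherwise the name refers to Python's built-in `filter` function and re.search fails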
filtered_tag_list = []
for t in wiki_tag_list:
    if not re.search(filter, t):
        filtered_tag_list.append(t)

# filtered_tag_list = [t for t in wiki_tag_list if not re.search(filter, t)]
print('Size of \'filtered_tag_list\':', len(filtered_tag_list))
filtered_tag_list

Size of 'filtered_tag_list': 264


['Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'John_Ross_Bowie',
 'Amy_Farrah_Fowler',
 'Beverly_Hofstadter',
 'The_Killer_Robot_Instability',
 'The_Change_Constant',
 'The_Relationship_Diremption',
 'Caltech',
 'String_theory',
 'Leonard_Hofstadter',
 'Sheldon_Cooper',
 'Leonard_Hofstadter',
 'Howard_Wolowitz',
 'Rajesh',
 'Amy_Farrah_Fowler',
 'Kripke_Krippler',
 'M.O.N.T.E.',
 'Caltech',
 'The_Killer_Robot_Instability',
 'Penny',
 'Howard_Wolowitz',
 'Penny',
 'The_Friendship_Algorithm',
 'The_Electric_Can_Opener_Fluctuation',
 'Sheldon%27s_office',
 'The_Cafeteria',
 'Leonard_Hofstadter',
 'Rajesh_Koothrappali',
 'Sheldon_Cooper',
 'President_Siebert',
 'The_Vengeance_Formulation',
 'Apartment_4A',
 'Zack_Johnson',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Raj%27s_apartment',
 'The_Toast_Derivation',
 'Rajesh_Koothrappali',
 'Professor_Rothman',
 'Siri',
 'Th
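
The pattern used by the cell above is not shown in this notebook. The sketch below uses a hypothetical pattern, stored under the name `unwanted_pattern` so that the built-in `filter` is not shadowed, and treats `Category:` pages, `Special:` pages and edit links as unwanted. It is an assumption for illustration and is not claimed to reproduce the counts above.

```python
import re

# Hypothetical pattern: namespace prefixes and edit links are treated as unwanted.
unwanted_pattern = re.compile(r'^(Category|Special):|\?action=')

# Sample entries taken from the lists printed above.
sample_tags = [
    'Special:Search',
    'Barry_Kripke?action=edit',
    'Category:Characters',
    'John_Ross_Bowie',
    'Caltech',
]

kept_tags = [t for t in sample_tags if not unwanted_pattern.search(t)]
print(kept_tags)
```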

In [43]:
print('Size of \'unique_tags\':', len(list(set(filtered_tag_list))))
list(set(filtered_tag_list))


Size of 'unique_tags': 197


['June_Squibb',
 'Laura_Spencer',
 'Kate_Micucci',
 'Sara_Gilbert',
 'Mark_Hamill',
 'The_Killer_Robot_Instability',
 'V._M._Koothrappali',
 'Raj%27s_apartment',
 'Mike_Rostenkowski',
 'Meemaw',
 'Priya_Koothrappali',
 'The_Helium_Insufficiency',
 'Dimitri',
 'James_Earl_Jones',
 'Stan_Lee',
 'Claire',
 'Mrs._Koothrappali',
 'Jimmy_Speckerman',
 'Laurie_Metcalf',
 'Brian_George',
 'Neil_deGrasse_Tyson',
 'Chuck_Lorre',
 'Stuart_Bloom',
 'Steven_V._Silver',
 'Eric_Kaplan',
 'The_Perspiration_Implementation',
 'Anu',
 'Professor_Proton',
 'Kareem_Abdul-Jabbar',
 'Anthony_Rich',
 'Rajesh',
 'Jim_Parsons',
 'Dr._Pemberton',
 'Dave_Goetsch',
 'Stephen_Hawking',
 'Eric_Gablehauser',
 'The_Tesla_Recoil',
 'Tara_Hernandez',
 'Steven_Molaro',
 'M.O.N.T.E.',
 'Dennis_Kim',
 'The_Discovery_Dissipation',
 'Christine_Baranski',
 'William_Shatner',
 'Mike_Massimino',
 'The_Cafeteria',
 'Jeanie',
 'Mrs._Rostenkowski',
 'Mayim_Bialik',
 'Bill_Nye',
 'David_Saltzberg',
 'Dan',
 'Meagen_Fay',
 'Kevin_Su

In [44]:
unique_tags = list(set(filtered_tag_list))
print('Size of \'unique_tags\':', len(unique_tags))
unique_tags

Size of 'unique_tags': 197


['June_Squibb',
 'Laura_Spencer',
 'Kate_Micucci',
 'Sara_Gilbert',
 'Mark_Hamill',
 'The_Killer_Robot_Instability',
 'V._M._Koothrappali',
 'Raj%27s_apartment',
 'Mike_Rostenkowski',
 'Meemaw',
 'Priya_Koothrappali',
 'The_Helium_Insufficiency',
 'Dimitri',
 'James_Earl_Jones',
 'Stan_Lee',
 'Claire',
 'Mrs._Koothrappali',
 'Jimmy_Speckerman',
 'Laurie_Metcalf',
 'Brian_George',
 'Neil_deGrasse_Tyson',
 'Chuck_Lorre',
 'Stuart_Bloom',
 'Steven_V._Silver',
 'Eric_Kaplan',
 'The_Perspiration_Implementation',
 'Anu',
 'Professor_Proton',
 'Kareem_Abdul-Jabbar',
 'Anthony_Rich',
 'Rajesh',
 'Jim_Parsons',
 'Dr._Pemberton',
 'Dave_Goetsch',
 'Stephen_Hawking',
 'Eric_Gablehauser',
 'The_Tesla_Recoil',
 'Tara_Hernandez',
 'Steven_Molaro',
 'M.O.N.T.E.',
 'Dennis_Kim',
 'The_Discovery_Dissipation',
 'Christine_Baranski',
 'William_Shatner',
 'Mike_Massimino',
 'The_Cafeteria',
 'Jeanie',
 'Mrs._Rostenkowski',
 'Mayim_Bialik',
 'Bill_Nye',
 'David_Saltzberg',
 'Dan',
 'Meagen_Fay',
 'Kevin_Su
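
Converting to a `set` removes duplicates but does not keep the order in which the links appeared on the page. If that order matters, one option (a sketch on sample data) is `dict.fromkeys`, which keeps the first occurrence of each item in Python 3.7 and later:

```python
# Sample entries with repeats, in page order.
sample_tags = ['Caltech', 'Penny', 'Caltech', 'Howard_Wolowitz', 'Penny']

# dict keys are unique and preserve insertion order.
unique_in_order = list(dict.fromkeys(sample_tags))
print(unique_in_order)
```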

In [45]:
# convert percent-encoded sequences (e.g. %27) back to characters
unquoted_tags = [unquote(t) for t in unique_tags]
print('Size of \'unquoted_tags\':', len(unquoted_tags))
unquoted_tags

Size of 'unquoted_tags': 197


['June_Squibb',
 'Laura_Spencer',
 'Kate_Micucci',
 'Sara_Gilbert',
 'Mark_Hamill',
 'The_Killer_Robot_Instability',
 'V._M._Koothrappali',
 "Raj's_apartment",
 'Mike_Rostenkowski',
 'Meemaw',
 'Priya_Koothrappali',
 'The_Helium_Insufficiency',
 'Dimitri',
 'James_Earl_Jones',
 'Stan_Lee',
 'Claire',
 'Mrs._Koothrappali',
 'Jimmy_Speckerman',
 'Laurie_Metcalf',
 'Brian_George',
 'Neil_deGrasse_Tyson',
 'Chuck_Lorre',
 'Stuart_Bloom',
 'Steven_V._Silver',
 'Eric_Kaplan',
 'The_Perspiration_Implementation',
 'Anu',
 'Professor_Proton',
 'Kareem_Abdul-Jabbar',
 'Anthony_Rich',
 'Rajesh',
 'Jim_Parsons',
 'Dr._Pemberton',
 'Dave_Goetsch',
 'Stephen_Hawking',
 'Eric_Gablehauser',
 'The_Tesla_Recoil',
 'Tara_Hernandez',
 'Steven_Molaro',
 'M.O.N.T.E.',
 'Dennis_Kim',
 'The_Discovery_Dissipation',
 'Christine_Baranski',
 'William_Shatner',
 'Mike_Massimino',
 'The_Cafeteria',
 'Jeanie',
 'Mrs._Rostenkowski',
 'Mayim_Bialik',
 'Bill_Nye',
 'David_Saltzberg',
 'Dan',
 'Meagen_Fay',
 'Kevin_Suss

In [46]:
# convert underscore to space
spaced_tag_list = []
for tag in unquoted_tags:
    processed_tag = re.sub('_', ' ', tag)
    spaced_tag_list.append(processed_tag)

# spaced_tag_list = [re.sub('_', ' ', t) for t in unquoted_tags]
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 197


['June Squibb',
 'Laura Spencer',
 'Kate Micucci',
 'Sara Gilbert',
 'Mark Hamill',
 'The Killer Robot Instability',
 'V. M. Koothrappali',
 "Raj's apartment",
 'Mike Rostenkowski',
 'Meemaw',
 'Priya Koothrappali',
 'The Helium Insufficiency',
 'Dimitri',
 'James Earl Jones',
 'Stan Lee',
 'Claire',
 'Mrs. Koothrappali',
 'Jimmy Speckerman',
 'Laurie Metcalf',
 'Brian George',
 'Neil deGrasse Tyson',
 'Chuck Lorre',
 'Stuart Bloom',
 'Steven V. Silver',
 'Eric Kaplan',
 'The Perspiration Implementation',
 'Anu',
 'Professor Proton',
 'Kareem Abdul-Jabbar',
 'Anthony Rich',
 'Rajesh',
 'Jim Parsons',
 'Dr. Pemberton',
 'Dave Goetsch',
 'Stephen Hawking',
 'Eric Gablehauser',
 'The Tesla Recoil',
 'Tara Hernandez',
 'Steven Molaro',
 'M.O.N.T.E.',
 'Dennis Kim',
 'The Discovery Dissipation',
 'Christine Baranski',
 'William Shatner',
 'Mike Massimino',
 'The Cafeteria',
 'Jeanie',
 'Mrs. Rostenkowski',
 'Mayim Bialik',
 'Bill Nye',
 'David Saltzberg',
 'Dan',
 'Meagen Fay',
 'Kevin Suss
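
The two clean-up steps (percent-decoding and replacing underscores) could also be combined in one small function. A sketch with a hypothetical helper name, using `str.replace` in place of a regular expression since the replacement is a fixed character:

```python
from urllib.parse import unquote


def clean_title(tag):
    """Decode percent-escapes, then turn underscores into spaces."""
    return unquote(tag).replace('_', ' ')


# Sample entries taken from the lists printed above.
sample_tags = ['Raj%27s_apartment', 'Sheldon%27s_office', 'M.O.N.T.E.']
clean_titles = [clean_title(t) for t in sample_tags]
print(clean_titles)
```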

In [47]:
# order the list
spaced_tag_list.sort()
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 197


['Aarti Mann',
 'Adam West',
 'Alessandra Torresani',
 'Alex Jensen',
 'Alfred Hofstadter',
 'Alice Amter',
 'Althea Davis',
 'Amy Farrah Fowler',
 'Anthony Del Broccolo',
 'Anthony Rich',
 'Anu',
 'Apartment 4A',
 'Barenaked Ladies',
 'Bernadette Rostenkowski-Wolowitz',
 'Bert Kibbler',
 'Beverly Hofstadter',
 'Bill Nye',
 'Bill Prady',
 'Blog:Recent posts',
 'Brent Spiner',
 'Brian George',
 'Brian Greene',
 'Brian Patrick Wade',
 'Brian Posehn',
 'Brian Thomas Smith',
 'Buzz Aldrin',
 'Caltech',
 'Carol Ann Susi',
 'Carrie Fisher',
 'Casey Sander',
 'Charlie Sheen',
 'Christine Baranski',
 'Chuck Lorre',
 'Chuck Lorre Productions',
 'Cinnamon',
 'Claire',
 'Colonel Richard Williams',
 'Courtney Henggeler',
 'Dan',
 'Dave Goetsch',
 'David Gibbs',
 'David Saltzberg',
 'Dean Norris',
 'Debbie Wolowitz',
 'Denise',
 'Dennis Kim',
 'Dimitri',
 'Dr. Pemberton',
 'Emily Sweeney',
 'Eric Gablehauser',
 'Eric Kaplan',
 'Fun with Flags',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'George S
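
`list.sort()` compares strings by code point, so every uppercase letter sorts before any lowercase one. If a case-insensitive order is preferred, a key function can be supplied. A sketch on sample titles:

```python
sample_titles = ['The cafeteria', 'The Vengeance Formulation', 'The Cafeteria', 'Anu']

# Default ordering: 'V' sorts before 'c', so 'The cafeteria' ends up last.
default_order = sorted(sample_titles)

# Case-insensitive ordering: compare the casefolded form of each title.
casefold_order = sorted(sample_titles, key=str.casefold)

print(default_order)
print(casefold_order)
```

Titles that differ only in case compare as equal under `str.casefold`, so they keep their original relative order.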

### Create a filter for unwanted types of articles

In [51]:
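# collect the titles that start with 'The' (mostly episode titles);
# despite its name, this list holds the candidates to filter out
no_episodes_tag_list = []
for tag in spaced_tag_list:
    if tag.startswith('The'):
        no_episodes_tag_list.append(tag)

# no_episodes_tag_list = [t for t in spaced_tag_list if t.startswith('The')]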

print('Size of \'no_episodes_tag_list\':', len(no_episodes_tag_list))
no_episodes_tag_list

Size of 'no_episodes_tag_list': 31


['The Allowance Evaporation',
 'The Athenaeum Allocation',
 'The Beta Test Initiation',
 'The Big Bang Theory',
 'The Bow Tie Asymmetry',
 'The Cafeteria',
 'The Celebration Experimentation',
 'The Champagne Reflection',
 'The Change Constant',
 'The Comic Book Store Regeneration',
 'The Cooper/Kripke Inversion',
 'The Discovery Dissipation',
 'The Electric Can Opener Fluctuation',
 'The Friendship Algorithm',
 'The Geology Elevation',
 'The Grant Allocation Derivation',
 'The Helium Insufficiency',
 'The History of Everything',
 'The Killer Robot Instability',
 'The Perspiration Implementation',
 'The Plagiarism\ufeff\ufeff\ufeff \ufeffSchism\ufeff\ufeff\ufeff\ufeff\ufeff\ufeff',
 'The Relationship Diremption',
 'The Rothman Disintegration',
 'The Social Group',
 'The Stag Convergence',
 'The Tenure Turbulence',
 'The Tesla Recoil',
 'The Toast Derivation',
 'The Valentino Submergence',
 'The Vengeance Formulation',
 'The cafeteria']
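
Two follow-ups suggest themselves from this output. One title carries invisible `\ufeff` characters, and the list holds the titles that start with 'The' instead of excluding them. The sketch below, with hypothetical helper and variable names, strips the invisible characters and inverts the condition. The prefix test is only a rough heuristic: it also matches non-episode pages such as 'The Big Bang Theory' and 'The Cafeteria'.

```python
def strip_invisible(title):
    """Remove zero-width no-break spaces (U+FEFF) and surrounding whitespace."""
    return title.replace('\ufeff', '').strip()


# Sample titles, including the one with U+FEFF characters from the output above.
sample_titles = [
    'Caltech',
    'The Plagiarism\ufeff\ufeff\ufeff \ufeffSchism\ufeff\ufeff',
    'The Tesla Recoil',
    'Stan Lee',
]

cleaned_titles = [strip_invisible(t) for t in sample_titles]

# Invert the test to drop the titles that start with 'The '.
titles_without_the = [t for t in cleaned_titles if not t.startswith('The ')]
print(cleaned_titles)
print(titles_without_the)
```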



---

© 2021 Institute of Data

---



