
# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and rewrite the code as needed.
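
The one-request-per-second guideline in rule 2 can be enforced in code. The following is a minimal sketch, not part of the original lab: the names `Throttle` and `fetch_all` are hypothetical, and `fetch` stands for whatever function actually downloads a page.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between requests
        self._last = None                 # monotonic time of the previous request

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_all(urls, fetch, min_interval=1.0):
    """Call fetch(url) for each URL, pausing so requests are spaced out."""
    throttle = Throttle(min_interval)
    results = []
    for url in urls:
        throttle.wait()  # no pause before the first request
        results.append(fetch(url))
    return results
```

With the default `min_interval=1.0`, a list of ten URLs takes at least nine seconds to process, which keeps the load on the website modest.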

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note that the main text sits a few levels deep inside nested HTML tags.

In [1]:
## Import Libraries
import urllib3
from bs4 import BeautifulSoup
from urllib.parse import unquote
import warnings
warnings.filterwarnings('ignore')
import re  # the standard-library module is enough for the patterns used here

### Define the content to retrieve (webpage's URL)

In [2]:
quote_page = 'https://www.fandom.com/articles/loki-psychology-mcu-marvel'

### Retrieve the page
- Requires an Internet connection

In [23]:
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
    print('Page variable type \'page\':', page.__class__.__name__)    
else:
    print('Error. Request Status: %s' % r.status)

Retrieved. Request Status: 200, Page Size: 128382
Page variable type 'page': bytes


### Convert the stream of bytes into a BeautifulSoup representation

In [24]:
soup = BeautifulSoup(page, 'html.parser')
print('Page variable type \'soup\':', soup.__class__.__name__)

Page variable type 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [5]:
print(soup.prettify()[:2000])

<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <title>
   The God of Mischief Who Would be King: The Psychology of Loki | Fandom
  </title>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://www.fandom.com/f2/assets/favicons/apple-touch-icon.png?v=76825e58ec45f2db300a1ad70b034309c5474765" rel="apple-touch-icon" sizes="180x180"/>
  <link href="https://www.fandom.com/f2/assets/favicons/favicon-32x32.png?v=76825e58ec45f2db300a1ad70b034309c5474765" rel="icon" sizes="32x32" type="image/png"/>
  <link href="https://www.fandom.com/f2/assets/favicons/favicon-16x16.png?v=76825e58ec45f2db300a1ad70b034309c5474765" rel="icon" sizes="16x16" type="image/png"/>
  <link href="https://www.fandom.com/f2/assets/favicons/manifest.json?v=76825e58ec45f2db300a1ad70b034309c5474765" rel="manifest"/>
  <link href="https://www.fandom.com/f2/assets/favicons/favic

### Check the HTML's Title

In [26]:
print('Text-Title:%s:' % soup.title.string)
print('Tag-Title :%s:' % soup.title)

Text-Title:The God of Mischief Who Would be King: The Psychology of Loki | Fandom:
Tag-Title :<title>The God of Mischief Who Would be King: The Psychology of Loki | Fandom</title>:


### Find the main content
- Check if it is possible to use only the relevant data

In [27]:
article_tag = 'article'
article = soup.find_all(article_tag)[0]
print('Variable type \'article\':', article.__class__.__name__)

Variable type 'article': Tag


### Get some of the text
- Plain text without HTML tags

In [8]:
print(re.sub(r'\n\n+', '\n', article.text)[:1000])


The God of Mischief Who Would be King: The Psychology of Loki
Drea Letamendi
			3d
		
TV
Movies
TV
Movies
Comics
Streaming
Marvel
You can’t keep a good God of Mischief down! Loki is about to return in the new Disney+ series, Loki, debuting June 9 – well, sort of, given Loki really and truly died in Avengers: Infinity War. But as seen in Avengers: Endgame, there is now a divergent timeline Loki running amok, who escaped with the Tesseract. We’ve seen Loki at some huge extremes, as both hero and villain, but what motivates him? Clinical psychologist Dr. Drea Letamendi provides us with all the info below…
Meet Loki
Loki Odinson wished to be extraordinary, remarkable, and illustrious. Surrounded by the ostentatious milieu of gods, rulers, and royalty, and raised among the symbology and mythos of intergalactic warfare, Loki developed the belief that one’s worthiness was intrinsically derived from power. In truth, Loki earned his feelings of supremacy, mastering an impressive amount of form
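
The output above still contains tab-indented and whitespace-only lines, because the pattern `\n\n+` only collapses newlines that are directly adjacent. One possible cleanup is sketched below; the helper name `tidy_text` is hypothetical and not part of the lab.

```python
def tidy_text(text):
    """Strip every line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)
```

Calling `print(tidy_text(article.text)[:1000])` in place of the `re.sub` call should give the same text with the stray tabs and blank lines removed.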

### Find the links in the text

In [9]:
for t in article.find_all('a'):
    print(t)

<a class="article-header__author" href="https://www.fandom.com/u/Doctor%20Drea">
<span class="author vcard">
<span class="author-name fn">Drea Letamendi</span>
</span>
</a>
<a class="article-topic-tags__tag" data-tracking='{"category":"card","label":"topic.link","action":136080,"post_id":136080}' href="https://www.fandom.com/topics/tv" target="_top">TV</a>
<a class="article-topic-tags__tag" data-tracking='{"category":"card","label":"topic.link","action":136080,"post_id":136080}' href="https://www.fandom.com/topics/movies" target="_top">Movies</a>
<a class="article-topic-tags__tag" data-tracking='{"category":"card","label":"topic.link","action":136080,"post_id":136080}' href="https://www.fandom.com/topics/tv" target="_top">TV</a>
<a class="article-topic-tags__tag" data-tracking='{"category":"card","label":"topic.link","action":136080,"post_id":136080}' href="https://www.fandom.com/topics/movies" target="_top">Movies</a>
<a class="article-topic-tags__tag" data-tracking='{"category":"card

In [29]:
tag = 'a'

# Collect the href attribute of every link inside the article
tag_list = []
for t in article.find_all(tag):
    tag_list.append(t.get('href'))

print('Size \'tag_list\':', len(tag_list))
tag_list

Size 'tag_list': 41


['https://www.fandom.com/u/Doctor%20Drea',
 'https://www.fandom.com/topics/tv',
 'https://www.fandom.com/topics/movies',
 'https://www.fandom.com/topics/tv',
 'https://www.fandom.com/topics/movies',
 'https://www.fandom.com/topics/comics',
 'https://www.fandom.com/topics/streaming',
 'https://www.fandom.com/topics/marvel',
 'https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.fandom.com%2Farticles%2Floki-psychology-mcu-marvel',
 'https://twitter.com/share?url=https%3A%2F%2Fwww.fandom.com%2Farticles%2Floki-psychology-mcu-marvel',
 'http://www.reddit.com/submit?url=https%3A%2F%2Fwww.fandom.com%2Farticles%2Floki-psychology-mcu-marvel',
 'https://marvelcinematicuniverse.fandom.com/wiki/Loki_(TV_series)',
 'https://marvelcinematicuniverse.fandom.com/wiki/Avengers:_Infinity_War',
 'https://marvelcinematicuniverse.fandom.com/wiki/Avengers:_Endgame',
 'https://marvelcinematicuniverse.fandom.com/wiki/Tesseract',
 'https://marvelcinematicuniverse.fandom.com/wiki/Loki',
 'https://marve
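
Several of the links above are percent-encoded: the author URL contains `%20`, and the share links carry the article URL in a `u` or `url` query parameter. The sketch below shows one way to decode them with the standard library; the helper name `embedded_url` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs, unquote


def embedded_url(share_link):
    """Return the URL carried in a share link's 'u' or 'url' parameter, if any."""
    query = parse_qs(urlparse(share_link).query)  # parse_qs also decodes %XX escapes
    for key in ('u', 'url'):
        if key in query:
            return query[key][0]
    return None


# unquote() turns %20 back into a space
print(unquote('https://www.fandom.com/u/Doctor%20Drea'))
```

For a link without either parameter, such as a topic link, `embedded_url` returns `None`.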

### Create a filter for unwanted types of articles
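
What counts as "unwanted" is left open by the task, so the following is only one possible starting point. It assumes that author pages (`/u/`), topic pages (`/topics/`), and social-media share links should be dropped, and that duplicates should be removed. The names `SHARE_HOSTS`, `UNWANTED_PATH_PREFIXES`, `is_wanted` and `filter_links` are hypothetical.

```python
from urllib.parse import urlparse, unquote

# Assumed rule set, based on the links printed earlier; adjust as needed
SHARE_HOSTS = {'www.facebook.com', 'twitter.com', 'www.reddit.com'}
UNWANTED_PATH_PREFIXES = ('/u/', '/topics/')


def is_wanted(url):
    """Return True if the URL passes the assumed filter rules."""
    if not url:  # t.get('href') returns None when a tag has no href
        return False
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https'):
        return False
    if parts.netloc in SHARE_HOSTS:
        return False
    path = unquote(parts.path)
    return not path.startswith(UNWANTED_PATH_PREFIXES)


def filter_links(urls):
    """Keep wanted URLs, preserving order and dropping duplicates."""
    seen = set()
    kept = []
    for url in urls:
        if is_wanted(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

Applied to the list of links collected in the previous step, `filter_links` would keep the wiki article links and discard the author, topic, and share links.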



---

© 2021 Institute of Data



