![image.png](https://i.imgur.com/1WaY7aA.png)

---




# Data Science and AI
## Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
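
Rule 2 can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: `fetch_all_politely`, its parameters, and the `example.com` URLs are hypothetical names, not part of the lab code, and the fetcher is passed in so the demonstration never contacts a real site.

```python
import time


def fetch_all_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


# Demonstration with a stand-in fetcher, so nothing is requested from a real site.
demo_pages = fetch_all_politely(
    ['https://example.com/a', 'https://example.com/b'],
    fetch=lambda url: 'contents of ' + url,
    delay_seconds=0.01,
)
print(demo_pages)
```

In a real run, `fetch` could be a function that downloads one page, and `delay_seconds` would stay at one second or more.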

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open a web page with the browser and inspect it.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note that the main text sits a few levels deep inside nested HTML tags.

In [1]:
## Import Libraries
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import re


### Define the content to retrieve (webpage's URL)

In [2]:
url = 'https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki'

### Retrieve the page
- Requires an Internet connection

In [3]:
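def is_good_response(resp):
    """
    Return True if the response succeeded and its content type is HTML.
    """
    # .get() avoids a KeyError when the Content-Type header is missing
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def get_content(url):
    """
    Fetch the URL and return the raw HTML bytes, or None on failure.
    """
    try:
        with closing(get(url)) as resp:
            if is_good_response(resp):
                print("Success: ", url)
                return resp.content
            # The response was not a successful HTML page
            return None
    except RequestException as e:
        # Network problems (DNS failure, refused connection, ...) end up here
        print("Request failed:", e)
        return None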
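
To see how a response check of this kind behaves without touching the network, it can be run against stand-in response objects. The sketch below is an assumption-laden illustration: `looks_like_html` is a hypothetical variant of the check (it reads the header with `.get()` so a missing `Content-Type` does not raise), and the `SimpleNamespace` objects only imitate the two attributes the check reads.

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """Return True for a 200 response whose Content-Type mentions HTML."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


# Stand-in responses; real ones come from requests.get().
# Note: these use plain dicts, so header names are case-sensitive here.
stub_responses = {
    'html page': SimpleNamespace(status_code=200,
                                 headers={'Content-Type': 'text/html; charset=utf-8'}),
    'image': SimpleNamespace(status_code=200,
                             headers={'Content-Type': 'image/png'}),
    'not found': SimpleNamespace(status_code=404,
                                 headers={'Content-Type': 'text/html'}),
    'no header': SimpleNamespace(status_code=200, headers={}),
}

for name, resp in stub_responses.items():
    print(name, '->', looks_like_html(resp))
```

Only the first stub passes: the others fail on content type, status code, or a missing header.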



### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
content = get_content(url)
soup = BeautifulSoup(content, 'html.parser')  

Success:  https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [5]:
print(soup.prettify()[:1000])

<!DOCTYPE doctype html>
<html class="" dir="ltr" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, user-scalable=yes" name="viewport"/>
  <meta content="MediaWiki 1.19.24" name="generator">
   <meta content="Game of Thrones Wiki,gameofthrones,Game of Thrones Wiki,Special Report: Max Borenstein prequel leaks to Game of Thrones Wiki,Game of Thrones prequel projects,Game of Thrones: The Long Night,Rhaegar Targaryen,Wight Hunt,Daenerys Targaryen's war for Westeros,Great War,Timeline,Dragonpit,King's Landing" name="keywords">
    <meta content="...an encyclopedic guide to the HBO television series Game of Thrones that anyone can edit. Our content is up to date with the latest aired episode so beware of unwanted plot details if you are not. &amp;quot;Empire of Ash&amp;quot; Report on Max Borenstein's prequel TV series about the Doom of Valyria leaks to..." name="description"/>
    <meta content="summary" name="twitt

### Check the HTML's Title

In [6]:
soup.find('title').text

'Game of Thrones Wiki | FANDOM powered by Wikia'

### Find the main content
- Check whether the relevant content can be isolated from the rest of the page (here, the first `<article>` tag)

In [7]:
tag = 'article'
article = soup.find_all(tag)[0]
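
Indexing `find_all(tag)[0]` raises an `IndexError` when a page has no such tag. A minimal sketch of a more forgiving lookup follows, assuming it is acceptable to fall back to `<body>`. The helper `find_main_content` and the miniature `sample_html` page are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, standing in for the real HTML.
sample_html = """
<html><body>
  <nav><a href="/wiki/Special:Random">Random page</a></nav>
  <article>
    <p>Main text with a <a href="/wiki/House_Stark">link</a>.</p>
  </article>
</body></html>
"""


def find_main_content(soup, tag='article'):
    """Return the first <tag> element, or fall back to <body> if it is absent."""
    main = soup.find(tag)  # find() returns None instead of raising
    if main is None:
        main = soup.body
    return main


soup_sample = BeautifulSoup(sample_html, 'html.parser')
main = find_main_content(soup_sample)
print(main.name)
print([a.get('href') for a in main.find_all('a')])
```

Because the search is restricted to the main element, the navigation link outside `<article>` is not collected.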


### Get some of the text
- Plain text without HTML tags

In [8]:
print(len(article.text))

9219


In [9]:
print(re.sub(r'\n\n+', '\n', article.text)[:500])


									window.adslots2.push(["top_boxad"]);
							
...an encyclopedic guide to the HBO television series Game of Thrones that anyone can edit. Our content is up to date with the latest aired episode so beware of unwanted plot details if you are not.
Jon
Sansa
Arya
Bran
Brienne
Daenerys
Tyrion
Missandei
Grey Worm
Jorah
Cersei
Jaime
Bronn
Gregor
Qyburn
Euron
Yara
Theon
Davos
Gendry
Varys
Samwell
Tormund
Sandor
Melisandre
"Empire of Ash"
Report on Max Borenstein's prequel TV series about the Do
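
The output above still begins with blank, tab-indented lines because `\n\n+` only collapses runs of consecutive newlines, and a line holding nothing but tabs breaks such a run. One possible cleanup, sketched here on a made-up sample string, strips every line and drops the empty ones. `tidy_text` and `sample_text` are hypothetical names, and the helper does not remove leftover script text such as the `adslots2` call.

```python
def tidy_text(text):
    """Strip each line and drop the lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in stripped if line)


# Made-up sample imitating the shape of the scraped text.
sample_text = (
    "\n\n\t\t\twindow.adslots2.push([\"top_boxad\"]);\n"
    "\t\t\t\n"
    "...an encyclopedic guide\n"
    "Jon\n\n\n"
    "Sansa\n"
)

print(tidy_text(sample_text))
```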


### Find the links in the text

In [10]:
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
links = [t.get('href') for t in article.find_all(tag)]
print('Number of links:', len(links))
links





Number of links: 537


['/wiki/HBO',
 '/wiki/Game_of_Thrones',
 '/wiki/Jon_Snow',
 '/wiki/Jon_Snow',
 '/wiki/Sansa_Stark',
 '/wiki/Sansa_Stark',
 '/wiki/Arya_Stark',
 '/wiki/Arya_Stark',
 '/wiki/Bran_Stark',
 '/wiki/Bran_Stark',
 '/wiki/Brienne_of_Tarth',
 '/wiki/Brienne_of_Tarth',
 '/wiki/Daenerys_Targaryen',
 '/wiki/Daenerys_Targaryen',
 '/wiki/Tyrion_Lannister',
 '/wiki/Tyrion_Lannister',
 '/wiki/Missandei',
 '/wiki/Missandei',
 '/wiki/Grey_Worm',
 '/wiki/Grey_Worm',
 '/wiki/Jorah_Mormont',
 '/wiki/Jorah_Mormont',
 '/wiki/Cersei_Lannister',
 '/wiki/Cersei_Lannister',
 '/wiki/Jaime_Lannister',
 '/wiki/Jaime_Lannister',
 '/wiki/Bronn',
 '/wiki/Bronn',
 '/wiki/Gregor_Clegane',
 '/wiki/Gregor_Clegane',
 '/wiki/Qyburn',
 '/wiki/Qyburn',
 '/wiki/Euron_Greyjoy',
 '/wiki/Euron_Greyjoy',
 '/wiki/Yara_Greyjoy',
 '/wiki/Yara_Greyjoy',
 '/wiki/Theon_Greyjoy',
 '/wiki/Theon_Greyjoy',
 '/wiki/Davos_Seaworth',
 '/wiki/Davos_Seaworth',
 '/wiki/Gendry',
 '/wiki/Gendry',
 '/wiki/Varys',
 '/wiki/Varys',
 '/wiki/Samwell_Tarl
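
The links above are relative to the site root, so they need the site's address in front before they can be requested. A minimal sketch using the standard library's `urljoin` follows. The list `sample_hrefs` is a hypothetical sample, including a `None` (an `<a>` tag without `href`) and a made-up external link.

```python
from urllib.parse import urljoin

base_url = 'https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki'

# Hypothetical sample of href values, not the real scraped list.
sample_hrefs = ['/wiki/HBO', '/wiki/Jon_Snow', None, 'https://example.com/external']

# Skip missing hrefs; urljoin leaves already-absolute URLs unchanged.
absolute_links = [urljoin(base_url, href) for href in sample_hrefs if href]
print(absolute_links)
```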

### Create a filter to keep only the wanted types of articles

In [11]:
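# named `pattern` rather than `filter` to avoid shadowing the built-in filter()
pattern = 'House'

# Keep only the links about the houses in GOT (skip tags that had no href)
links = [t for t in links if t and re.search(pattern, t)]

# remove duplicates (note: a set does not preserve the original order)
links = list(set(links))
print('Number of links:', len(links))
links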


Number of links: 27


['/wiki/House_Glover',
 '/wiki/House_Baratheon',
 '/wiki/House_Manderly',
 '/wiki/House_words',
 '/wiki/House_Mormont',
 '/wiki/House_Forrester',
 '/wiki/House_Targaryen',
 '/wiki/House_Baratheon_of_Dragonstone',
 '/wiki/House_Blackfyre',
 '/wiki/House_Frey',
 '/wiki/House_Florent',
 '/wiki/Great_House',
 '/wiki/House_Baratheon_of_King%27s_Landing',
 '/wiki/House_Arryn',
 '/wiki/House_Umber',
 '/wiki/House_Tully',
 '/wiki/House_Stark',
 '/wiki/House_Bolton',
 '/wiki/House_Reed',
 '/wiki/House_Tarly',
 '/wiki/House_Lannister',
 '/wiki/House_Whitehill',
 '/wiki/House_Royce',
 '/wiki/House_Tyrell',
 '/wiki/House_Martell',
 '/wiki/House_Karstark',
 '/wiki/House_Greyjoy']
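
The result includes `/wiki/House_words` and `/wiki/Great_House`, which match the word "House" but are not pages about a single house. A stricter filter might look like the sketch below, which assumes that house pages follow the pattern `/wiki/House_<Capitalised name>`. That pattern is a guess based on the output above, and `candidate_links` is a hypothetical sample rather than the scraped list.

```python
import re

# Hypothetical sample with a duplicate, a missing href and two near-misses.
candidate_links = ['/wiki/House_Stark', '/wiki/Great_House', '/wiki/House_words',
                   '/wiki/House_Stark', None, '/wiki/Jon_Snow', '/wiki/House_Tully']

# Anchored at the start; the name after "House_" must begin with a capital letter.
house_pattern = re.compile(r'^/wiki/House_[A-Z]')

house_links = [href for href in candidate_links
               if href and house_pattern.search(href)]

# dict.fromkeys removes duplicates while keeping first-seen order.
house_links = list(dict.fromkeys(house_links))
print(house_links)
```

Unlike `set`, `dict.fromkeys` keeps the links in the order they appeared on the page, which makes repeated runs easier to compare.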