<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.3: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
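
Rule 2 can be sketched as a small throttling helper. This is only an illustration under stated assumptions: `polite_fetch`, its `fetch` callback and the `min_interval` parameter are hypothetical names that do not appear elsewhere in this lab, and the demonstration uses a stand-in function instead of a real HTTP request.

```python
import time


def polite_fetch(urls, fetch, min_interval=1.0):
    """Call fetch(url) for each URL, leaving at least min_interval seconds
    between the start of consecutive requests.

    `fetch` is any callable that takes a URL and returns the page; it is
    passed in so the throttling logic stays independent of the HTTP library.
    """
    results = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Sleep only for the part of the interval that has not elapsed yet.
            remaining = min_interval - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function (no network access needed).
pages = polite_fetch(['page-a', 'page-b', 'page-c'], fetch=str.upper, min_interval=0.05)
print(pages)
```

With a real client, `fetch` could be any function that wraps a single request; the default of one second matches the guideline in rule 2.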

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you and pick a page to work with.

Open a web page with the browser and inspect it.

Hover the cursor on the text and follow the shaded box surrounding the main text.

In the inspector, note which HTML tags wrap the main text; it usually sits a few levels deep.

In [23]:
## Import Libraries
import re  # the standard-library module covers everything used in this lab

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [4]:
OP_page = 'https://onepiece.fandom.com/wiki/One_Piece_Wiki'

### Retrieve the page
- Requires an Internet connection

In [7]:
http = urllib3.PoolManager()
r = http.request('GET', OP_page)
if r.status == 200:
    page = r.data
    print('Type of variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 488756


### Convert the stream of bytes into a BeautifulSoup representation

In [8]:
# pass the page to BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [10]:
print(soup.prettify()[:3000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   One Piece Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"One_Piece_Wiki","wgTitle":"One Piece Wiki","wgCurRevisionId":1838613,"wgRevisionId":1838613,"wgArticleId":291028,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using DynamicPageList parser tag","One Piece Encyclopedia"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May

### Check the HTML's Title

In [11]:
print('Title tag : %s' % soup.title)
print('Title text : %s' % soup.title.string)


Title tag : <title>One Piece Wiki | Fandom</title>
Title text : One Piece Wiki | Fandom
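
The difference between the title tag and its text can be reproduced offline. The sketch below parses a tiny hand-written page (`sample_page` is made up for illustration, not downloaded) and guards against a missing `<title>`, in which case `soup.title` is `None`.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for the downloaded bytes.
sample_page = (b'<html><head><title>Sample Wiki | Fandom</title></head>'
               b'<body><p>Hello</p></body></html>')
sample_soup = BeautifulSoup(sample_page, 'html.parser')

title_tag = sample_soup.title  # a Tag object, or None if the page has no <title>
title_text = title_tag.string if title_tag is not None else None

print(title_tag)   # the whole element, including the <title> markup
print(title_text)  # only the text inside the element
```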


### Find the main content
- Check if it is possible to use only the relevant data

In [14]:
tag = 'div'
class_name = {"class": "mw-parser-output"}
results = soup.find_all(tag, class_name)

In [15]:
article = results[0]
print('Type of the variable \'article\':', article.__class__.__name__)
article.text

Type of the variable 'article': Tag
'\n\n\nWelcome to the One Piece Wiki!\n\n\nRead the first chapter or latest chapter | Watch the first episode or latest episode\n\n\n\n\n\n#portal_content-0\n#portal_content-1\n#portal_content-2\n#portal_content-3\n#portal_content-4\n#portal_content-5\n\n\n\nSail into the World of One Piece\nThe Manga\nThe Anime\nCharacters\nStampede\n\n\n\n\n\n\n\n\n\nedit\n\n\n\nCharacters\nSocieties\nLocations\nOrganizations\nDevil Fruits\nFor New Users\n\nMonkey D. Luffy\nStraw Hat Pirates\nList of Canon Characters\nList of Non-Canon Characters\nPirates\nMarines\nBy Race\nAntagonists\n\nOccupation\nBy Race\nTechnology\nAnimal Species\n\nGrand Line\nIslands\nEast Blue\nNew World\n\nMarines\nPirates\nSeven Warlords of the Sea\nFour Emperors\nWorld Government\nRevolutionary Army\n\n\nAbout Devil Fruits\nGomu Gomu no Mi\nParamecia\nZoan\nLogia\nArtificial Devil Fruits\nNatural Devil Fruits\nSMILES\nNon-Canon Devil Fruits\n\n\nGuidelines to read before editing\nReferencing Information\nCustomizing and U
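
Because `find_all` returns an empty list when nothing matches, indexing with `[0]` fails as soon as the site changes its layout (see rule 3). The sketch below shows one way to guard against that, using a small hand-written HTML fragment (`sample_html` is an assumption for illustration) instead of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment: one navigation block and one main-content block.
sample_html = """
<div class="header">Navigation</div>
<div class="mw-parser-output"><p>Main text</p><a href="/wiki/Zoan">Zoan</a></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

matches = sample_soup.find_all('div', {'class': 'mw-parser-output'})
if matches:
    main = matches[0]
    # Join the text nodes with a space and strip surrounding whitespace.
    print(main.get_text(' ', strip=True))
else:
    main = None
    print('No matching <div> found; the page layout may have changed.')
```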


### Get some of the text
- Plain text without HTML tags

In [18]:
print(re.sub(r'\n\n+','\n', article.text)[:1000])

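
The effect of the substitution is easiest to see on a short string. The example below applies the same pattern to hand-written text (`raw_text` is made up for illustration): every run of two or more newlines collapses to one, while single newlines are left alone.

```python
import re

# Hand-written text imitating the blank-line-heavy output of article.text.
raw_text = 'Welcome\n\n\nCharacters\n\nLocations\nIslands\n\n\n\n'

# '\n\n+' matches two or more consecutive newlines; each run becomes one newline.
compact = re.sub(r'\n\n+', '\n', raw_text)
print(repr(compact))
```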

### Find the links in the text

In [19]:
tag ='a'

tag_list = [t.get('href') for t in article.find_all(tag)]
tag_list

['https://mangaplus.shueisha.co.jp/viewer/1000486',
 'https://mangaplus.shueisha.co.jp/viewer/1012878',
 'https://www.crunchyroll.com/one-piece/episode-1-im-luffy-the-man-whos-gonna-be-king-of-the-pirates-650673',
 'https://www.crunchyroll.com/one-piece/episode-1013-yamatos-past-the-man-who-came-for-an-emperor-of-the-sea-825467',
 '#portal_content-0',
 '#portal_content-1',
 '#portal_content-2',
 '#portal_content-3',
 '#portal_content-4',
 '#portal_content-5',
 'https://static.wikia.nocookie.net/onepiece/images/6/6b/Slide_1_preview.png/revision/latest?cb=20210905184758',
 'https://static.wikia.nocookie.net/onepiece/images/b/b8/Slide_2_preview.png/revision/latest?cb=20171010172517',
 'https://static.wikia.nocookie.net/onepiece/images/d/d7/Slide_3_preview.png/revision/latest?cb=20220215140036',
 'https://static.wikia.nocookie.net/onepiece/images/1/1d/Slide_4_preview.png/revision/latest?cb=20180307050112',
 'https://static.wikia.nocookie.net/onepiece/images/d/d7/Slide_5_preview.png/revisio

In [20]:
tag_list_link = [t[len('/wiki/'):] for t in tag_list if t and t.startswith('/wiki/')]
tag_list_link

['Monkey_D._Luffy',
 'Straw_Hat_Pirates',
 'List_of_Canon_Characters',
 'List_of_Non-Canon_Characters',
 'Pirates',
 'Marines',
 'Category:Races_and_Tribes',
 'Category:Antagonists',
 'Category:Occupations',
 'Category:Races_and_Tribes',
 'Category:Technology',
 'Animal_Species',
 'Grand_Line',
 'Category:Islands',
 'East_Blue',
 'New_World',
 'Marines',
 'Pirates',
 'Seven_Warlords_of_the_Sea',
 'Four_Emperors',
 'World_Government',
 'Revolutionary_Army',
 'Devil_Fruit',
 'Gomu_Gomu_no_Mi',
 'Paramecia',
 'Zoan',
 'Logia',
 'Artificial_Devil_Fruit',
 'Devil_Fruit#Natural_Devil_Fruits',
 'SMILE',
 'Category:Non-Canon_Devil_Fruits',
 'One_Piece_Wiki:Guidebook',
 'One_Piece_Wiki:Guidebook/Referencing_Information',
 'User_blog:Leviathan_89/What_you_have_to_know_about_signatures',
 'One_Piece_Wiki:FAQ',
 'One_Piece_Wiki:Wiki_Crews',
 'Category:Chapter_Stubs',
 'Category:Episode_Stubs',
 'Chapters_and_Volumes/Volume_1-10',
 'Chapters_and_Volumes/Volume_11-20',
 'Chapters_and_Volumes/Volume_
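
The prefix filter keeps only internal article links and drops anchors, external URLs and `<a>` tags without an `href` (for which `t.get('href')` returns `None`). The sketch below repeats that step on a hand-written list and, as an optional extra not used elsewhere in this lab, rebuilds absolute URLs with `urljoin`; `sample_hrefs`, `WIKI_PREFIX` and `base_url` are illustrative names.

```python
from urllib.parse import urljoin

WIKI_PREFIX = '/wiki/'
base_url = 'https://onepiece.fandom.com'  # site root of the page used in this lab

# Hand-written href values; None stands for an <a> tag without an href.
sample_hrefs = ['/wiki/Zoan', '#portal_content-0', None,
                'https://mangaplus.shueisha.co.jp/viewer/1000486', '/wiki/East_Blue']

# Keep internal article links only and strip the '/wiki/' prefix.
article_names = [h[len(WIKI_PREFIX):] for h in sample_hrefs
                 if h and h.startswith(WIKI_PREFIX)]

# Rebuild absolute URLs in case the pages are to be requested later.
article_urls = [urljoin(base_url, WIKI_PREFIX + name) for name in article_names]

print(article_names)
print(article_urls)
```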

### Create a filter for unwanted types of articles

In [24]:
word_filter = '(%s)' % '|'.join([
    'Category:',
    'One_Piece_Wiki:',
    'User:',
    'User_blog:',
    'Help:',
    'action=',
    'Special:',
    'SBS_Volume',
    'Chapters_and_Volumes',
    'Author%27s_Notes',
    'Episode_Guide',
    'Movie_',
    'Episode_Special',
    'Forum:',
    'File_',
    'Category_',
    'Talk:',
])
tag_list_final = [t for t in tag_list_link if not re.search(word_filter, t)]
tag_list_final

['Monkey_D._Luffy',
 'Straw_Hat_Pirates',
 'List_of_Canon_Characters',
 'List_of_Non-Canon_Characters',
 'Pirates',
 'Marines',
 'Animal_Species',
 'Grand_Line',
 'East_Blue',
 'New_World',
 'Marines',
 'Pirates',
 'Seven_Warlords_of_the_Sea',
 'Four_Emperors',
 'World_Government',
 'Revolutionary_Army',
 'Devil_Fruit',
 'Gomu_Gomu_no_Mi',
 'Paramecia',
 'Zoan',
 'Logia',
 'Artificial_Devil_Fruit',
 'Devil_Fruit#Natural_Devil_Fruits',
 'SMILE',
 'One_Piece_Red:_Grand_Characters',
 'One_Piece_Blue:_Grand_Data_File',
 'One_Piece_Yellow:_Grand_Elements',
 'One_Piece_Green:_Secret_Pieces',
 'One_Piece_Blue_Deep:_Characters_World',
 'Vivre_Card_-_One_Piece_Visual_Dictionary',
 'Buggy%27s_Crew:_After_the_Battle!',
 'Diary_of_Coby-Meppo',
 'Jango%27s_Dance_Paradise',
 'Hatchan%27s_Sea-Floor_Stroll',
 'Wapol%27s_Omnivorous_Hurrah',
 'Ace%27s_Great_Blackbeard_Search',
 'Gedatsu%27s_Accidental_Blue-Sea_Life',
 'Miss_Goldenweek%27s_%22Operation:_Meet_Baroque_Works%22',
 'Enel%27s_Great_Space_Oper
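
The filter works by joining the unwanted markers into one regular-expression alternation and dropping every link in which any marker is found. A minimal variant of the same idea is sketched below on hand-written data; it additionally passes each marker through `re.escape`, which is an assumption on my part and only matters if a marker ever contains a regex metacharacter such as `.` or `(`. Note that `search` matches anywhere in the link, not only at the start.

```python
import re

unwanted_markers = ['Category:', 'User_blog:', 'Chapters_and_Volumes']

# Escape each marker so it is matched literally, then join with '|' (alternation).
pattern = re.compile('|'.join(re.escape(m) for m in unwanted_markers))

sample_links = ['Zoan', 'Category:Islands', 'Chapters_and_Volumes/Volume_1-10',
                'User_blog:Leviathan_89/What_you_have_to_know_about_signatures',
                'Grand_Line']

# Keep a link only if none of the markers occurs anywhere in it.
kept = [link for link in sample_links if not pattern.search(link)]
print(kept)
```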

In [25]:
tag_list_final = list(set(tag_list_final))
tag_list_final

['Krieg',
 'Fight_Together',
 'Pekoms',
 'Marianne',
 'Charlotte_Linlin',
 'Galdino',
 'Inuarashi',
 'FUNimation',
 'Episode_of_East_Blue',
 'Emporio_Ivankov',
 'Template_talk:ChapterPages',
 'Charlotte_Katakuri',
 'Hard_Knock_Days',
 'Gedatsu',
 'Pedro',
 'Baron_Omatsuri_and_the_Secret_Island',
 'CP9%27s_Independent_Report',
 'Gedatsu%27s_Accidental_Blue-Sea_Life',
 'The_Cursed_Holy_Sword',
 'Senor_Pink',
 'Episode_of_Sky_Island',
 'Sakazuki',
 'Mythbusters',
 'Jack',
 'Ruluka_Island_Arc',
 'One_Piece_Yellow:_Grand_Elements',
 'We_Can!',
 'Caesar_Clown',
 'Aim!_The_King_of_Belly',
 'Monkey_D._Luffy',
 'One_Piece_Red:_Grand_Characters',
 'Zoan',
 'Yosaku',
 'One_Piece_Wiki_talk:Guidebook/Spoiler_Rules',
 'Artificial_Devil_Fruit',
 'Karoo',
 'Hatchan',
 'Caribou',
 'Gan_Fall',
 'Oars',
 'Grand_Line',
 'Pagaya',
 'Trebol',
 'Kurozumi_Orochi',
 'Devil_Fruit',
 'Raizo',
 'One_Piece_Wiki_talk:Guidebook/Manual_of_Style',
 'Buffalo',
 'Caribou%27s_Kehihihihi_in_the_New_World',
 'Z%27s_Ambitio
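
Converting to a `set` removes duplicates but also discards the original order, which is why the list above no longer starts with `Monkey_D._Luffy`. If the order matters, one alternative is `dict.fromkeys`, sketched below on hand-written data. The sketch also decodes the percent-encoding before deduplicating, so that two spellings of the same title compare equal; the plain-apostrophe variant in `links` is made up to show that case.

```python
from urllib.parse import unquote

# Hand-written sample: 'Pirates' appears twice, and one title appears both
# percent-encoded (%27) and with a plain apostrophe.
links = ['Pirates', 'Buggy%27s_Crew:_After_the_Battle!', 'Marines',
         'Pirates', "Buggy's_Crew:_After_the_Battle!", 'Zoan']

# Decode first so that differently encoded spellings of one title compare equal.
decoded = [unquote(link) for link in links]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_in_order = list(dict.fromkeys(decoded))
print(unique_in_order)
```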

In [26]:
tag_list_final = [unquote(t) for t in tag_list_final]
tag_list_final

['Krieg',
 'Fight_Together',
 'Pekoms',
 'Marianne',
 'Charlotte_Linlin',
 'Galdino',
 'Inuarashi',
 'FUNimation',
 'Episode_of_East_Blue',
 'Emporio_Ivankov',
 'Template_talk:ChapterPages',
 'Charlotte_Katakuri',
 'Hard_Knock_Days',
 'Gedatsu',
 'Pedro',
 'Baron_Omatsuri_and_the_Secret_Island',
 "CP9's_Independent_Report",
 "Gedatsu's_Accidental_Blue-Sea_Life",
 'The_Cursed_Holy_Sword',
 'Senor_Pink',
 'Episode_of_Sky_Island',
 'Sakazuki',
 'Mythbusters',
 'Jack',
 'Ruluka_Island_Arc',
 'One_Piece_Yellow:_Grand_Elements',
 'We_Can!',
 'Caesar_Clown',
 'Aim!_The_King_of_Belly',
 'Monkey_D._Luffy',
 'One_Piece_Red:_Grand_Characters',
 'Zoan',
 'Yosaku',
 'One_Piece_Wiki_talk:Guidebook/Spoiler_Rules',
 'Artificial_Devil_Fruit',
 'Karoo',
 'Hatchan',
 'Caribou',
 'Gan_Fall',
 'Oars',
 'Grand_Line',
 'Pagaya',
 'Trebol',
 'Kurozumi_Orochi',
 'Devil_Fruit',
 'Raizo',
 'One_Piece_Wiki_talk:Guidebook/Manual_of_Style',
 'Buffalo',
 "Caribou's_Kehihihihi_in_the_New_World",
 "Z's_Ambition_Arc",


© 2020 Institute of Data