# Assignment - Web Scraping
---

## Exercise 1: toscrape.com

For this exercise, we will use a site that was actually _made for scraping_: [Web Scraping Sandbox](https://toscrape.com/) 

### 1.1

Import all the required libraries.

In [1]:
# 1.1 Answer 
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup
# For performing regex operations
import re

### 1.2

Scrape ALL urls from https://toscrape.com/

In [2]:
# 1.2 Answer

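response = requests.get('https://toscrape.com/')

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')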

print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Scraping Sandbox
  </title>
  <link href="./css/bootstrap.min.css" rel="stylesheet"/>
  <link href="./css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row">
    <div class="col-md-1">
    </div>
    <div class="col-md-10 well">
     <img class="logo" src="img/zyte.png" width="200px"/>
     <h1 class="text-right">
      Web Scraping Sandbox
     </h1>
    </div>
   </div>
   <div class="row">
    <div class="col-md-1">
    </div>
    <div class="col-md-10">
     <h2>
      Books
     </h2>
     <p>
      A
      <a href="http://books.toscrape.com">
       fictional bookstore
      </a>
      that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at:
      <a href="http://books.toscrape.com">
       books.toscra

In [3]:
anchor_tags = soup.find_all('a', href=True)

for tag in anchor_tags:
    print(tag['href'])

http://books.toscrape.com
http://books.toscrape.com
http://books.toscrape.com
http://quotes.toscrape.com/
http://quotes.toscrape.com
http://quotes.toscrape.com/
http://quotes.toscrape.com/scroll
http://quotes.toscrape.com/js
http://quotes.toscrape.com/js-delayed
http://quotes.toscrape.com/tableful
http://quotes.toscrape.com/login
http://quotes.toscrape.com/search.aspx
http://quotes.toscrape.com/random
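
The list above repeats several links and mixes URLs with and without a trailing slash. Below is a minimal sketch of de-duplicating and normalizing scraped `href` values using only the standard library. The helper name `unique_absolute_urls` and the decision to treat a trailing slash as insignificant are assumptions made for illustration; they are not part of the exercise.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def unique_absolute_urls(hrefs, base_url):
    """Return absolute URLs in first-seen order, without duplicates.

    Assumption: URLs that differ only by a trailing slash or a
    #fragment are treated as the same page.
    """
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links against the page they were found on
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        path = parts.path.rstrip('/')
        # Rebuild the URL without the fragment
        normalized = urlunsplit((parts.scheme, parts.netloc, path, parts.query, ''))
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


hrefs = [
    'http://books.toscrape.com',
    'http://quotes.toscrape.com/',
    'http://quotes.toscrape.com',
    'http://quotes.toscrape.com/scroll',
]
print(unique_absolute_urls(hrefs, 'https://toscrape.com/'))
```

In the notebook, the same helper could be fed with `[tag['href'] for tag in anchor_tags]`.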


### 1.3

Scrape all paragraph text (`<p>` tags) from https://toscrape.com/

In [4]:
# 1.3 Answer
paragraph_tags = soup.find_all('p')

for tag in paragraph_tags:
    print(tag.text)

A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: books.toscrape.com
A website that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.
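
`tag.text` returns the text as it is laid out in the HTML source, so a paragraph may carry line breaks and runs of spaces. Below is a small sketch of collapsing that whitespace before further processing. `normalize_whitespace` is a hypothetical helper name, and the sample string is made up to mimic indented markup.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())


# Made-up sample that imitates text pulled from indented HTML
sample = '  A\n      fictional bookstore\n   that desperately wants to be scraped.  '
print(normalize_whitespace(sample))
```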


## Exercise 2: Wikipedia

For this exercise, you will scrape the sidebar data (text box only) from [The Office Wikipedia page](https://en.wikipedia.org/wiki/The_Office_(American_TV_series)).

### 2.1

Scrape the side-bar data.

In [5]:
# 2.1 Answer

url = 'https://en.wikipedia.org/wiki/The_Office_(American_TV_series)'

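response = requests.get(url)

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')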

In [6]:
# Find the infobox
infobox = soup.find('table', {'class': 'infobox'})

if infobox:
    for row in infobox.find_all('tr'):
        cells = row.find_all(['th', 'td'])
        for cell in cells:
            print(cell.text.strip())
else:
    print("Infobox not found")

The Office

Genre
Mockumentary
Workplace comedy
Cringe comedy
Sitcom
Based on
The Officeby Ricky GervaisStephen Merchant
Developed by
Greg Daniels
Showrunners
Greg Daniels
Paul Lieberstein
Jennifer Celotta
Starring
Steve Carell
Rainn Wilson
John Krasinski
Jenna Fischer
B. J. Novak
Melora Hardin
David Denman
Leslie David Baker
Brian Baumgartner
Kate Flannery
Angela Kinsey
Oscar Nunez
Phyllis Smith
Ed Helms
Mindy Kaling
Paul Lieberstein
Creed Bratton
Craig Robinson
Ellie Kemper
Zach Woods
Amy Ryan
James Spader
Catherine Tate
Clark Duke
Jake Lacy
Theme music composer
Jay Ferguson
Country of origin
United States
Original language
English
No. of seasons
9
No. of episodes
201 (list of episodes)
Production
Executive producers
Ben Silverman
Greg Daniels
Ricky Gervais
Stephen Merchant
Howard Klein
Ken Kwapis
Paul Lieberstein
Jennifer Celotta
B. J. Novak
Mindy Kaling
Brent Forrester
Dan Sterling
Producers
Kent Zbornak
Michael Schur
Steve Carell
Lee Eisenberg
Gene Stupnitsky
Randy Cordray
Justin 
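
The "Based on" value above prints as `The Officeby Ricky GervaisStephen Merchant` because `.text` concatenates the strings of nested tags with nothing between them. One possible way around this is `get_text()` with a separator, sketched below. The HTML fragment is a simplified stand-in written for this example, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for an infobox cell (assumption: the real markup is more complex)
html = '<td><i>The Office</i>by <a>Ricky Gervais</a><a>Stephen Merchant</a></td>'
cell = BeautifulSoup(html, 'html.parser').find('td')

# .text glues the nested strings together with no separator
joined = cell.text.strip()

# get_text() can insert a separator between strings and strip each one
spaced = cell.get_text(separator=' ', strip=True)

print(joined)
print(spaced)
```

Using `separator='\n'` instead would keep each nested string on its own line, which may suit list-like cells such as "Starring".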

### 2.2

Save the data into a dictionary.

In [9]:
# 2.2 Answer
data = {}

if infobox:
    for row in infobox.find_all('tr'):
        cells = row.find_all(['th', 'td'])
        if len(cells) == 2:
            key = cells[0].text.strip()
            value = cells[1].text.strip()
            data[key] = value

data

{'Genre': 'Mockumentary\nWorkplace comedy\nCringe comedy\nSitcom',
 'Based on': 'The Officeby Ricky GervaisStephen Merchant',
 'Developed by': 'Greg Daniels',
 'Showrunners': 'Greg Daniels\nPaul Lieberstein\nJennifer Celotta',
 'Starring': 'Steve Carell\nRainn Wilson\nJohn Krasinski\nJenna Fischer\nB. J. Novak\nMelora Hardin\nDavid Denman\nLeslie David Baker\nBrian Baumgartner\nKate Flannery\nAngela Kinsey\nOscar Nunez\nPhyllis Smith\nEd Helms\nMindy Kaling\nPaul Lieberstein\nCreed Bratton\nCraig Robinson\nEllie Kemper\nZach Woods\nAmy Ryan\nJames Spader\nCatherine Tate\nClark Duke\nJake Lacy',
 'Theme music composer': 'Jay Ferguson',
 'Country of origin': 'United States',
 'Original language': 'English',
 'No. of seasons': '9',
 'No. of episodes': '201 (list of episodes)',
 'Executive producers': 'Ben Silverman\nGreg Daniels\nRicky Gervais\nStephen Merchant\nHoward Klein\nKen Kwapis\nPaul Lieberstein\nJennifer Celotta\nB. J. Novak\nMindy Kaling\nBrent Forrester\nDan Sterling',
 'Produ
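
Several values in the dictionary, such as `'Genre'` and `'Starring'`, hold many entries joined by `\n`. If lists are easier to work with downstream, a sketch like the following could split them. `split_multiline_values` is a hypothetical helper, and keeping single values as plain strings is a design choice assumed here, not something the exercise requires.

```python
def split_multiline_values(data):
    """Turn newline-separated strings into lists; leave single values as strings."""
    cleaned = {}
    for key, value in data.items():
        # Drop empty fragments that appear when the text has blank lines
        parts = [part.strip() for part in value.split('\n') if part.strip()]
        cleaned[key] = parts if len(parts) > 1 else value
    return cleaned


# Two entries copied from the scraped output above
sample = {
    'Genre': 'Mockumentary\nWorkplace comedy\nCringe comedy\nSitcom',
    'Developed by': 'Greg Daniels',
}
print(split_multiline_values(sample))
```

Mixing lists and strings keeps short fields readable, but code that consumes the result then has to check the type of each value; returning a list for every key would be the more uniform alternative.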

### 2.3

Convert the dictionary into a dataframe that looks as follows:

![](../Data/the_office_DF.png)

In [8]:
# 2.3 Answer
df = pd.DataFrame(data.items(), columns=['Key', 'Value'])

df

|   | Key | Value |
|---|-----|-------|
| 0 | Genre | `Mockumentary\nWorkplace comedy\nCringe comedy\...` |
| 1 | Based on | The Officeby Ricky GervaisStephen Merchant |
| 2 | Developed by | Greg Daniels |
| 3 | Showrunners | `Greg Daniels\nPaul Lieberstein\nJennifer Celotta` |
| 4 | Starring | `Steve Carell\nRainn Wilson\nJohn Krasinski\nJe...` |
| 5 | Theme music composer | Jay Ferguson |
| 6 | Country of origin | United States |
| 7 | Original language | English |
| 8 | No. of seasons | 9 |
| 9 | No. of episodes | 201 (list of episodes) |


# The End!