# Assignment - Web Scraping
---

## Exercise 1: toscrape.com

For this exercise, we will use a site that was actually _made for scraping_: [Web Scraping Sandbox](https://toscrape.com/) 

### 1.1

Import all the required libraries.

In [1]:
# 1.1 Answer 
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

### 1.2

Scrape all URLs from https://toscrape.com/.

In [9]:
# 1.2 Answer

# Send a GET request to 'https://toscrape.com/' and parse the HTML content

response = requests.get('https://toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

# Find all anchor tags (<a>) in the parsed HTML

links = soup.find_all('a')

# Extract the 'href' attribute from each anchor tag to get URLs

urls = [link.get("href") for link in links]    

# Print each URL along with its 1-based index

for i, url in enumerate(urls, start=1):
    print(f"The link N° {i} : {url}")

The link N° 1 : http://books.toscrape.com
The link N° 2 : http://books.toscrape.com
The link N° 3 : http://books.toscrape.com
The link N° 4 : http://quotes.toscrape.com/
The link N° 5 : http://quotes.toscrape.com
The link N° 6 : http://quotes.toscrape.com/
The link N° 7 : http://quotes.toscrape.com/scroll
The link N° 8 : http://quotes.toscrape.com/js
The link N° 9 : http://quotes.toscrape.com/js-delayed
The link N° 10 : http://quotes.toscrape.com/tableful
The link N° 11 : http://quotes.toscrape.com/login
The link N° 12 : http://quotes.toscrape.com/search.aspx
The link N° 13 : http://quotes.toscrape.com/random
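
The listing above contains repeated links, some differing only by a trailing slash. A minimal sketch of one way to deduplicate them while keeping first-seen order follows. The helper name `unique_urls` and the short `sample` list are assumptions made for illustration, and treating a trailing slash as insignificant is a simplification that may not suit every site.

```python
def unique_urls(urls):
    """Return URLs in first-seen order, ignoring a trailing slash when comparing."""
    seen = {}
    for url in urls:
        if url is None:  # link.get("href") returns None for anchors without an href
            continue
        key = url.rstrip("/")      # comparison key without trailing slash
        seen.setdefault(key, url)  # keep the first spelling encountered
    return list(seen.values())


# Hypothetical sample, shaped like the scraped output above
sample = [
    "http://books.toscrape.com",
    "http://books.toscrape.com",
    "http://quotes.toscrape.com/",
    "http://quotes.toscrape.com",
    "http://quotes.toscrape.com/scroll",
]
print(unique_urls(sample))
```

Passing the `urls` list from the cell above to `unique_urls` should leave one entry per distinct address.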


### 1.3

Scrape all paragraph text (`<p>` tags) from https://toscrape.com/.

In [10]:
# 1.3 Answer

# Find all <p> tags in the parsed HTML content

para = soup.find_all('p')

# Iterate through each <p> tag and print its text content

for i in para:
    print(i.text)

A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: books.toscrape.com
A website that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.


## Exercise 2: Wikipedia

For this exercise, you will scrape the sidebar data (the infobox text only) from [The Office Wikipedia Page](https://en.wikipedia.org/wiki/The_Office_(American_TV_series)).

### 2.1

Scrape the side-bar data.

In [11]:
# 2.1 Answer

# Send a GET request to fetch the HTML content from the Wikipedia page

rep = requests.get("https://en.wikipedia.org/wiki/The_Office_(American_TV_series)")

# Parse the HTML content using BeautifulSoup

soup = BeautifulSoup(rep.content, 'html.parser')

# Iterate over the children of the first <tbody> tag (the sidebar table on this page) and print their text

for i in soup.find("tbody"):
    print(i.text)

The Office

Genre
Mockumentary
Workplace comedy
Cringe comedy
Sitcom

Based onThe Officeby Ricky GervaisStephen Merchant
Developed byGreg Daniels
Showrunners
Greg Daniels
Paul Lieberstein
Jennifer Celotta

Starring
Steve Carell
Rainn Wilson
John Krasinski
Jenna Fischer
B. J. Novak
Melora Hardin
David Denman
Leslie David Baker
Brian Baumgartner
Kate Flannery
Angela Kinsey
Oscar Nunez
Phyllis Smith
Ed Helms
Mindy Kaling
Paul Lieberstein
Creed Bratton
Craig Robinson
Ellie Kemper
Zach Woods
Amy Ryan
James Spader
Catherine Tate
Clark Duke
Jake Lacy

Theme music composerJay Ferguson
Country of originUnited States
Original languageEnglish
No. of seasons9
No. of episodes201 (list of episodes)
Production
Executive producers
Ben Silverman
Greg Daniels
Ricky Gervais
Stephen Merchant
Howard Klein
Ken Kwapis
Paul Lieberstein
Jennifer Celotta
B. J. Novak
Mindy Kaling
Brent Forrester
Dan Sterling

Producers
Kent Zbornak
Michael Schur
Steve Carell
Lee Eisenberg
Gene Stupnitsky
Randy Cordray
Justin Spi

### 2.2

Save the data into a dictionary.

In [12]:
# 2.2 Answer

# Find the sidebar element

infobox = soup.find('table', class_='infobox')

# Initialize an empty dictionary to store the data

office_data = {}

# Extract text from the infobox

if infobox:
    
    # Find all rows (<tr>) in the infobox
    
    rows = infobox.find_all('tr')
    
    # Loop through each row and extract key-value pairs
    
    for row in rows:
        
        # Find the header cell (<th>) which usually contains the category
        
        header = row.find('th')
        
        # Find the data cell (<td>) which usually contains the corresponding value
        
        data = row.find('td')
        
        if header and data:
            
            # Strip any extra whitespace and newlines from text
            
            key = header.get_text(strip=True)
            value = data.get_text(strip=True)
            
            # Add key-value pair to dictionary
            
            office_data[key] = value


print(office_data)

{'Genre': 'MockumentaryWorkplace comedyCringe comedySitcom', 'Based on': 'The OfficebyRicky GervaisStephen Merchant', 'Developed by': 'Greg Daniels', 'Showrunners': 'Greg DanielsPaul LiebersteinJennifer Celotta', 'Starring': 'Steve CarellRainn WilsonJohn KrasinskiJenna FischerB. J. NovakMelora HardinDavid DenmanLeslie David BakerBrian BaumgartnerKate FlanneryAngela KinseyOscar NunezPhyllis SmithEd HelmsMindy KalingPaul LiebersteinCreed BrattonCraig RobinsonEllie KemperZach WoodsAmy RyanJames SpaderCatherine TateClark DukeJake Lacy', 'Theme music composer': 'Jay Ferguson', 'Country of origin': 'United States', 'Original language': 'English', 'No.of seasons': '9', 'No.of episodes': '201(list of episodes)', 'Executive producers': 'Ben SilvermanGreg DanielsRicky GervaisStephen MerchantHoward KleinKen KwapisPaul LiebersteinJennifer CelottaB. J. NovakMindy KalingBrent ForresterDan Sterling', 'Producers': 'Kent ZbornakMichael SchurSteve CarellLee EisenbergGene StupnitskyRandy CordrayJustin Sp
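
In the dictionary above, multi-item values run together (for example `'MockumentaryWorkplace comedyCringe comedySitcom'`) because `get_text(strip=True)` joins the text fragments with an empty string. A minimal sketch of one alternative is to pass a separator to `get_text`. The HTML fragment below is hand-written for illustration and is only an assumption about the general shape of an infobox, not the actual Wikipedia markup; `soup_demo` and `demo_data` are hypothetical names chosen so the notebook's `soup` and `office_data` are not overwritten.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox (assumed structure, not the real page)
html = """
<table class="infobox">
  <tr><th>Genre</th><td><ul><li>Mockumentary</li><li>Sitcom</li></ul></td></tr>
  <tr><th>Developed by</th><td>Greg Daniels</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
demo_data = {}

for row in soup_demo.find_all("tr"):
    header = row.find("th")
    data = row.find("td")
    if header and data:
        key = header.get_text(" ", strip=True)
        # Join the stripped text fragments with ", " instead of an empty string
        demo_data[key] = data.get_text(", ", strip=True)

print(demo_data)
```

The same `get_text(", ", strip=True)` call could replace the value extraction in the cell above if separated values are preferred.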

### 2.3

Convert the dictionary into a dataframe that looks as follows:

![](../Data/the_office_DF.png)

In [13]:
# 2.3 Answer

df = pd.DataFrame.from_dict(office_data, orient='index', columns=['Value'])
df

| | Value |
|---|---|
| Genre | MockumentaryWorkplace comedyCringe comedySitcom |
| Based on | The OfficebyRicky GervaisStephen Merchant |
| Developed by | Greg Daniels |
| Showrunners | Greg DanielsPaul LiebersteinJennifer Celotta |
| Starring | Steve CarellRainn WilsonJohn KrasinskiJenna Fi... |
| Theme music composer | Jay Ferguson |
| Country of origin | United States |
| Original language | English |
| No.of seasons | 9 |
| No.of episodes | 201(list of episodes) |
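
With `orient='index'`, the dictionary keys become an unnamed index rather than a regular column. If a two-column layout is wanted instead, a minimal sketch could look like the following. The column label `Field`, the `sample_data` dictionary, and the name `df_demo` are assumptions for illustration; whether this matches the target image is not confirmed.

```python
import pandas as pd

# Hypothetical sample shaped like office_data
sample_data = {
    "Genre": "Sitcom",
    "Developed by": "Greg Daniels",
    "No. of seasons": "9",
}

df_demo = (
    pd.DataFrame.from_dict(sample_data, orient="index", columns=["Value"])
    .rename_axis("Field")  # give the index a name
    .reset_index()         # move the index into an ordinary column
)
print(df_demo)
```

This keeps the dictionary's insertion order and yields a default integer index alongside the `Field` and `Value` columns.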


# The End!