# InterActief Python Session - Intro to Web Scraping

For this exercise, we'll build a web scraper that collects book titles, prices and ratings from http://books.toscrape.com/ and exports the result to a CSV file sorted by price or rating.

The following packages must be installed before starting this exercise:
* **requests**: Requests is an elegant and simple HTTP library for Python, built for human beings. We will use it to make GET requests to web pages. https://requests.readthedocs.io/en/master/
* **beautifulsoup4**: Beautiful Soup is a Python package for parsing HTML and XML documents. We will use it to parse the web pages we fetch with requests. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* **pandas**: pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool. We will use it to manipulate, organize and export the data. https://pandas.pydata.org/

In [1]:
!pip install requests



In [2]:
!pip install beautifulsoup4



In [3]:
!pip install pandas



### Before starting the exercise, it helps to read a few tutorials first.
Here are some tutorials that briefly explain how to do web scraping:
* https://www.codementor.io/@oluwagbengajoloko/how-to-scrape-data-from-a-website-using-python-n3fmtc63q
* https://medium.com/@kashaziz/scrap-a-web-page-in-20-lines-of-code-with-python-and-beautifulsoup-b95c58e93124

More tutorials can be found on Google.

In [4]:
# Let us import the installed libraries
import urllib.parse # standard library module; used later in this exercise to build absolute URLs
import pandas
import requests
from bs4 import BeautifulSoup

### If the cell above runs without errors, the required libraries are installed

In [5]:
url = 'http://books.toscrape.com/'

In [6]:
# Get the webpage HTML
# Using the requests module, fetch the page at the url defined above and store the response in a variable named `resp`

## WRITE YOUR CODE BELOW
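
As a hint, one possible shape for this step is sketched below. It is a minimal sketch rather than the intended solution: the helper name `fetch_page` and the 10-second timeout are assumptions made for illustration, while `requests.get` and `raise_for_status` are part of the requests library.

```python
import requests


def fetch_page(page_url, timeout=10):
    """Fetch a page with a GET request and return the Response object."""
    # The timeout (in seconds) is an illustrative choice, not part of the exercise
    response = requests.get(page_url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response
```

With this helper, `resp = fetch_page(url)` would produce the `resp` object used in the next cell.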



In [7]:
print('Status code:', resp.status_code) # check the status code of resp; 200 means OK (success)
print(resp.text[:2000]) # print the first 2000 characters of the response

Status code: 200
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
        <meta name="created" content="24th Jun 2016 09:29" />
        <meta name="description" content="" />
        <meta name="viewport" content="width=device-width" />
        <meta name="robots" content="NOARCHIVE,NOCACHE" />

        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
        <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->

        
            <link rel="shortcut icon" href="stat

In [8]:
# Now let's create a BeautifulSoup object that takes the HTML text of `resp` as input and parses it. We will store it in a variable named `soup`

## WRITE YOUR CODE HERE
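
To show what parsing looks like without touching the network, here is a small sketch on a hand-written HTML string. The names `sample_html` and `sample_soup` are made up for this example; in the exercise the input would be the text of `resp`.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for resp.text so the example runs without a network call
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='note'>Hello</p></body></html>"
)

# 'html.parser' is the parser shipped with Python; naming it explicitly
# avoids the warning Beautiful Soup emits when no parser is given
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text
note_text = sample_soup.find(class_="note").text
print(page_title)
print(note_text)
```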



In [9]:
# Now we can print a prettified, more readable version of the page
print(soup.prettify())

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-us">
 <!--<![endif]-->
 <head>
  <title>
   All products | Books to Scrape - Sandbox
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="tex

In [10]:
# Having parsed the HTML into the `soup` object, we can access the page's tags, such as title, with the dot operator
print('Title HTML: ', soup.title)

# To get the text from an HTML element, use .text as shown below
print('Title text: ', soup.title.text)

soup.title.text

Title HTML:  <title>
    All products | Books to Scrape - Sandbox
</title>
Title text:  
    All products | Books to Scrape - Sandbox



'\n    All products | Books to Scrape - Sandbox\n'

In [11]:
# As you can see, there is whitespace (newlines and spaces) around the title. It is better to remove it, which can be done with the strip() method
soup.title.text.strip()

'All products | Books to Scrape - Sandbox'

In [12]:
# We can also find HTML elements by the class names
side_categories = soup.find(class_='side_categories')
side_categories

<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
                            
                                Historical Fiction
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/sequential-art_5/index.html">
                            
                                Sequential Art
          

In [13]:
# We can also get the list elements inside side_categories and print them using a for loop
sc_list = side_categories.find('li').find_all('li')
for ind, cat in enumerate(sc_list):
    print(ind, cat.text.strip())

0 Travel
1 Mystery
2 Historical Fiction
3 Sequential Art
4 Classics
5 Philosophy
6 Romance
7 Womens Fiction
8 Fiction
9 Childrens
10 Religion
11 Nonfiction
12 Music
13 Default
14 Science Fiction
15 Sports and Games
16 Add a comment
17 Fantasy
18 New Adult
19 Young Adult
20 Science
21 Poetry
22 Paranormal
23 Art
24 Psychology
25 Autobiography
26 Parenting
27 Adult Fiction
28 Humor
29 Horror
30 History
31 Food and Drink
32 Christian Fiction
33 Business
34 Biography
35 Thriller
36 Contemporary
37 Spirituality
38 Academic
39 Self Help
40 Historical
41 Christian
42 Suspense
43 Short Stories
44 Novels
45 Health
46 Politics
47 Cultural
48 Erotica
49 Crime


In [14]:
# Let's create a variable `first_book` that holds the first element with class `product_pod`

## WRITE YOUR CODE HERE



In [15]:
first_book

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

In [16]:
# Let's see the title text of first_book
print(first_book.find('h3').text)

A Light in the ...


In [17]:
# The link text is truncated, so we use the title attribute of the <a> tag (as seen in the HTML of the page) to get the full title
print(first_book.find('h3').find('a')['title'])

A Light in the Attic


In [18]:
# Now let's try to get the Price and rating of the `first_book` 

## WRITE YOUR CODE HERE, it should match the output



Â£51.77
Three
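
The sketch below shows one way to read these two values, using a hand-written snippet shaped like a `product_pod` article. The variable names are illustrative, and the snippet uses a plain `£` rather than the `Â£` seen in the notebook output.

```python
from bs4 import BeautifulSoup

# Hand-written snippet mirroring the structure of one product_pod article
snippet = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
book = BeautifulSoup(snippet, "html.parser").find(class_="product_pod")

price = book.find(class_="price_color").text
# The class attribute is multi-valued, so Beautiful Soup returns a list:
# ['star-rating', 'Three']; the rating word is the second entry
rating = book.find(class_="star-rating")["class"][1]
print(price)
print(rating)
```

Indexing with `[1]` relies on the rating word always being the second class, which holds for the HTML shown above but is an assumption about the rest of the site.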


In [19]:
# Get all book names, price and rating

In [20]:
# Now let's create a function called `get_book_details` that performs the task above, i.e. given an article element, it returns the title, price and rating

## WRITE YOUR CODE HERE



In [21]:
get_book_details(first_book)

('A Light in the Attic', 'Â£51.77', 'Three')
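
The `Â£` in the price is what a UTF-8 pound sign looks like after being decoded as Latin-1 (ISO-8859-1). A likely cause, stated here as an assumption, is that the response does not declare a charset and requests falls back to ISO-8859-1 for `resp.text`. The sketch below reproduces the effect with plain strings.

```python
# '£' is two bytes in UTF-8: 0xC2 0xA3
pound_bytes = "£".encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character
garbled = pound_bytes.decode("latin-1")
print(garbled)  # Â£

# Reversing the wrong decoding recovers the original character
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # £
```

Two common ways to avoid the problem are setting `resp.encoding = 'utf-8'` before reading `resp.text`, or passing the raw bytes in `resp.content` to BeautifulSoup. The outputs in this notebook keep the `Â£` form.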

In [22]:
# Now let's find all the elements with class `product_pod` and print their details

## WRITE YOUR CODE HERE



('A Light in the Attic', 'Â£51.77', 'Three')
('Tipping the Velvet', 'Â£53.74', 'One')
('Soumission', 'Â£50.10', 'One')
('Sharp Objects', 'Â£47.82', 'Four')
('Sapiens: A Brief History of Humankind', 'Â£54.23', 'Five')
('The Requiem Red', 'Â£22.65', 'One')
('The Dirty Little Secrets of Getting Your Dream Job', 'Â£33.34', 'Four')
('The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'Â£17.93', 'Three')
('The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'Â£22.60', 'Four')
('The Black Maria', 'Â£52.15', 'One')
('Starving Hearts (Triangular Trade Trilogy, #1)', 'Â£13.99', 'Two')
("Shakespeare's Sonnets", 'Â£20.66', 'Four')
('Set Me Free', 'Â£17.46', 'Five')
("Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Â£52.29', 'Five')
('Rip it Up and Start Again', 'Â£35.02', 'Five')
('Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Â£57.25', 'Three')
('Olio', 'Â£23.88',

In [23]:
# Once we have all the book details from page 1, we can repeat the process for pages 2, 3 and so on until the last page.
# This can be done by finding the element with class `next` and repeating the whole process

next_page = soup.find(class_='next')
if next_page: # check whether a next page exists or this is the last page
    print(next_page.find('a')['href']) # as we can see, this is a relative URL path, and we need the full URL
    next_page_url = urllib.parse.urljoin(url, next_page.find('a')['href']) # urljoin combines the absolute URL of the current page with the relative path to give the full URL of page 2
    print(next_page_url)

catalogue/page-2.html
http://books.toscrape.com/catalogue/page-2.html
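
One detail worth keeping in mind for later pages: `urljoin` resolves a relative path against the URL it is given, so the first argument should be the page the link was found on. The sketch below assumes, for illustration only, that page 2 links to the next page with the href `page-3.html`.

```python
from urllib.parse import urljoin

site_root = "http://books.toscrape.com/"
page_two = urljoin(site_root, "catalogue/page-2.html")

# Assumed href found on page 2, relative to that page's own directory
assumed_href = "page-3.html"

joined_with_root = urljoin(site_root, assumed_href)
joined_with_page = urljoin(page_two, assumed_href)
print(joined_with_root)   # http://books.toscrape.com/page-3.html
print(joined_with_page)   # http://books.toscrape.com/catalogue/page-3.html
```

Under that assumption, joining with the site root would drop the `catalogue/` directory, while joining with the current page URL keeps it.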


In [24]:
# We already have code that gets the book details from a single page url.
# Let's create a function called `extract_url` that extracts all the books from that url,
# finds the next page and keeps collecting books from each following page until the last page is reached


## WRITE YOUR CODE HERE
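
The loop structure can be sketched offline with a fake two-page site held in a dictionary. Everything here is hypothetical scaffolding (`FAKE_PAGES`, `crawl_titles`, the `example.com` URLs); it only illustrates following `next` links until none is left, not the full `extract_url` function.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Fake two-page site keyed by URL, standing in for real HTTP responses
FAKE_PAGES = {
    "http://example.com/index.html": """
        <article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>
        <li class="next"><a href="catalogue/page-2.html">next</a></li>
    """,
    "http://example.com/catalogue/page-2.html": """
        <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
    """,
}


def crawl_titles(start_url, fetch=FAKE_PAGES.get):
    """Collect titles page by page until no 'next' link is found."""
    titles = []
    current_url = start_url
    while current_url:
        page = BeautifulSoup(fetch(current_url), "html.parser")
        for pod in page.find_all(class_="product_pod"):
            titles.append(pod.find("h3").find("a")["title"])
        next_link = page.find(class_="next")
        # Resolve the relative href against the page it was found on
        current_url = (
            urljoin(current_url, next_link.find("a")["href"]) if next_link else None
        )
    return titles


print(crawl_titles("http://example.com/index.html"))
```

For the real site, the `fetch` argument could be replaced by a function that downloads the page with requests and returns its HTML text.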



In [25]:
base_url = 'http://books.toscrape.com/'
extract_url(base_url)

('A Light in the Attic', 'Â£51.77', 'Three')
('Tipping the Velvet', 'Â£53.74', 'One')
('Soumission', 'Â£50.10', 'One')
('Sharp Objects', 'Â£47.82', 'Four')
('Sapiens: A Brief History of Humankind', 'Â£54.23', 'Five')
('The Requiem Red', 'Â£22.65', 'One')
('The Dirty Little Secrets of Getting Your Dream Job', 'Â£33.34', 'Four')
('The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'Â£17.93', 'Three')
('The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'Â£22.60', 'Four')
('The Black Maria', 'Â£52.15', 'One')
('Starving Hearts (Triangular Trade Trilogy, #1)', 'Â£13.99', 'Two')
("Shakespeare's Sonnets", 'Â£20.66', 'Four')
('Set Me Free', 'Â£17.46', 'Five')
("Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Â£52.29', 'Five')
('Rip it Up and Start Again', 'Â£35.02', 'Five')
('Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Â£57.25', 'Three')
('Olio', 'Â£23.88',

### Pandas & Dataframes 

We now have the data from the web pages, but we still need to store it in a tabular format, sort it by price or rating, and export it to an external file such as a CSV. This can be done with a pandas DataFrame.

In [26]:
# Let's start by creating an empty dataframe and understand how to use it.
df = pandas.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [27]:
# Let's create an empty dataframe with columns
df = pandas.DataFrame(columns=['First Name', 'Last Name', 'Role'])
print(df)

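# We'll add a row to this table. DataFrame.append was removed in pandas 2.0,
# so we use pandas.concat with a one-row DataFrame instead.
df = pandas.concat([df, pandas.DataFrame([{'First Name': 'Akul',
                                           'Last Name': 'Mehra',
                                           'Role': 'TA'}])], ignore_index=True)
print(df)
print()

# Let's add another row
df = pandas.concat([df, pandas.DataFrame([{'First Name': 'Aditya',
                                           'Last Name': 'Pappu',
                                           'Role': 'Student'}])], ignore_index=True)
print(df)
print()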
print(df)
print()

# We can sort the table based on columns too!
print(df.sort_values('First Name'))

Empty DataFrame
Columns: [First Name, Last Name, Role]
Index: []
  First Name Last Name Role
0       Akul     Mehra   TA

  First Name Last Name     Role
0       Akul     Mehra       TA
1     Aditya     Pappu  Student

  First Name Last Name     Role
1     Aditya     Pappu  Student
0       Akul     Mehra       TA


In [28]:
# Now we know how to use Pandas Dataframes.
# Let's modify our `extract_url` function such that it stores all the rows inside a dataframe called all_books_df

all_books_df = pandas.DataFrame(columns=['Title', 'Price', 'Rating'])

## WRITE YOUR CODE HERE
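
An alternative to growing a DataFrame one row at a time is to collect the rows in a plain list and build the frame once at the end. The sketch below uses made-up rows in the same (title, price, rating) shape that `get_book_details` returns; `rows` and `books_df` are illustrative names.

```python
import pandas

# Made-up rows in the same (title, price, rating) shape as the exercise
rows = [
    ("Book A", "£12.50", "Three"),
    ("Book B", "£9.99", "Five"),
]

# Building the frame once from a list avoids repeated row-by-row copies
books_df = pandas.DataFrame(rows, columns=["Title", "Price", "Rating"])
print(books_df.shape)
```

In the exercise, `rows.append(get_book_details(book))` inside the page loop would fill the list before the DataFrame is created.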



In [29]:
base_url = 'http://books.toscrape.com/'
extract_url(base_url)
all_books_df

| | Title | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | Â£51.77 | Three |
| 1 | Tipping the Velvet | Â£53.74 | One |
| 2 | Soumission | Â£50.10 | One |
| 3 | Sharp Objects | Â£47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | Five |
| ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | Â£55.53 | One |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Â£57.06 | Four |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Â£16.97 | Five |
| 998 | 1st to Die (Women's Murder Club #1) | Â£53.98 | One |


In [30]:
# Let's sort the dataframe by the Price column

## WRITE YOUR CODE HERE




| | Title | Price | Rating |
|---|---|---|---|
| 638 | An Abundance of Katherines | Â£10.00 | Five |
| 501 | The Origin of Species | Â£10.01 | Four |
| 716 | The Tipping Point: How Little Things Can Make ... | Â£10.02 | Two |
| 84 | Patience | Â£10.16 | Three |
| 302 | Greek Mythic History | Â£10.23 | Five |
| ... | ... | ... | ... |
| 366 | The Diary of a Young Girl | Â£59.90 | Three |
| 560 | The Barefoot Contessa Cookbook | Â£59.92 | Five |
| 860 | Civilization and Its Discontents | Â£59.95 | Two |
| 617 | Last One Home (New Beginnings #1) | Â£59.98 | Three |
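
The Price column holds strings, so this sort compares text. That gives the expected order here because every price shown has two digits before the decimal point, but text comparison would place `£9.99` after `£10.00`. A more robust option, sketched below, is to add a numeric column; the column name `PriceNumber` is an assumption made for this example.

```python
import pandas

prices_df = pandas.DataFrame({"Price": ["£10.00", "£9.99", "£51.77"]})

# Keep only digits and the decimal point, then convert to float
prices_df["PriceNumber"] = (
    prices_df["Price"].str.replace(r"[^0-9.]", "", regex=True).astype(float)
)

text_order = prices_df.sort_values("Price")["Price"].tolist()
numeric_order = prices_df.sort_values("PriceNumber")["Price"].tolist()
print(text_order)
print(numeric_order)
```

Because the pattern removes everything except digits and the dot, it also copes with the `Â£` prefix seen in the notebook output.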


In [37]:
# The Rating column holds strings ('One', 'Two', ...) rather than numbers, so sorting on it gives alphabetical order, not rating order.
# Find a way to convert these ratings to numbers; search online if you need help.

## WRITE YOUR CODE HERE
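
One common approach, shown here as a sketch on a small made-up frame, is a dictionary lookup applied with `Series.map`. The dictionary name `rating_to_number` is illustrative.

```python
import pandas

ratings_df = pandas.DataFrame({"Rating": ["Three", "One", "Five"]})

# Lookup table from rating word to integer
rating_to_number = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Series.map replaces each value using the dictionary
ratings_df["RatingNumber"] = ratings_df["Rating"].map(rating_to_number)
print(ratings_df)
```

Any rating word missing from the dictionary would become `NaN`, which is a quick way to spot unexpected values.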




In [32]:
# After mapping the strings to numbers, you should have a new column with integer ratings that you can sort on
all_books_df

| | Title | Price | Rating | RatingNumber |
|---|---|---|---|---|
| 0 | A Light in the Attic | Â£51.77 | Three | 3 |
| 1 | Tipping the Velvet | Â£53.74 | One | 1 |
| 2 | Soumission | Â£50.10 | One | 1 |
| 3 | Sharp Objects | Â£47.82 | Four | 4 |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | Five | 5 |
| ... | ... | ... | ... | ... |
| 995 | Alice in Wonderland (Alice's Adventures in Won... | Â£55.53 | One | 1 |
| 996 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Â£57.06 | Four | 4 |
| 997 | A Spy's Devotion (The Regency Spies of London #1) | Â£16.97 | Five | 5 |
| 998 | 1st to Die (Women's Murder Club #1) | Â£53.98 | One | 1 |


In [33]:
# sort the data frame now on the new rating column

## WRITE YOUR CODE HERE




| | Title | Price | Rating | RatingNumber |
|---|---|---|---|---|
| 239 | The Rosie Project (Don Tillman #1) | Â£54.04 | One | 1 |
| 175 | Poses for Artists Volume 1 - Dynamic and Sitti... | Â£41.06 | One | 1 |
| 177 | Nightingale, Sing | Â£38.28 | One | 1 |
| 178 | Night Sky with Exit Wounds | Â£41.05 | One | 1 |
| 815 | Let's Pretend This Never Happened: A Mostly Tr... | Â£45.11 | One | 1 |
| ... | ... | ... | ... | ... |
| 569 | How to Stop Worrying and Start Living | Â£46.49 | Five | 5 |
| 856 | Dark Places | Â£23.90 | Five | 5 |
| 857 | Crazy Rich Asians (Crazy Rich Asians #1) | Â£49.13 | Five | 5 |
| 216 | You (You #1) | Â£43.61 | Five | 5 |


In [34]:
# We would like the highest ratings at the top, so find a way to sort the dataframe in descending order

## WRITE YOUR CODE HERE




| | Title | Price | Rating | RatingNumber |
|---|---|---|---|---|
| 999 | 1,000 Places to See Before You Die | Â£26.08 | Five | 5 |
| 560 | The Barefoot Contessa Cookbook | Â£59.92 | Five | 5 |
| 601 | The Darkest Corners | Â£11.33 | Five | 5 |
| 598 | The False Prince (The Ascendance Trilogy #1) | Â£56.00 | Five | 5 |
| 592 | The Mathews Men: Seven Brothers and the War Ag... | Â£42.91 | Five | 5 |
| ... | ... | ... | ... | ... |
| 817 | Lean In: Women, Work, and the Will to Lead | Â£25.02 | One | 1 |
| 86 | orange: The Complete Collection 1 (orange: The... | Â£48.41 | One | 1 |
| 820 | Jurassic Park (Jurassic Park #1) | Â£44.97 | One | 1 |
| 821 | It's Never Too Late to Begin Again: Discoverin... | Â£42.38 | One | 1 |


In [35]:
# Sometimes sorting on a single column doesn't give the best results, so we need to sort on multiple columns.
# Let's sort the dataframe on rating first so the best books come up, and then on Price to find the cheapest of the best books. Store the result in a variable called `sorted_df`

## WRITE YOUR CODE HERE



| | Title | Price | Rating | RatingNumber |
|---|---|---|---|---|
| 638 | An Abundance of Katherines | Â£10.00 | Five | 5 |
| 302 | Greek Mythic History | Â£10.23 | Five | 5 |
| 590 | The Power Greens Cookbook: 140 Delicious Super... | Â£11.05 | Five | 5 |
| 316 | Dear Mr. Knightley | Â£11.21 | Five | 5 |
| 601 | The Darkest Corners | Â£11.33 | Five | 5 |
| ... | ... | ... | ... | ... |
| 752 | The Girl Who Kicked the Hornet's Nest (Millenn... | Â£57.48 | One | 1 |
| 805 | Miracles from Heaven: A Little Girl, Her Journ... | Â£57.83 | One | 1 |
| 704 | Unstuffed: Decluttering Your Home, Mind, and Soul | Â£58.09 | One | 1 |
| 393 | The Improbability of Love | Â£59.45 | One | 1 |
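
Sorting on several columns with a different direction for each can be sketched as follows. The frame and its `PriceNumber` column are made up for this example; `sort_values` accepts a list for both `by` and `ascending`.

```python
import pandas

demo_df = pandas.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "PriceNumber": [20.0, 10.0, 15.0, 5.0],
    "RatingNumber": [5, 5, 1, 1],
})

# Highest rating first; within the same rating, lowest price first
demo_sorted = demo_df.sort_values(
    by=["RatingNumber", "PriceNumber"], ascending=[False, True]
)
print(demo_sorted["Title"].tolist())  # ['B', 'A', 'D', 'C']
```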


In [36]:
# Finally, let's save our dataframe to a csv so that we can export and share the dataset.
sorted_df.to_csv('books_data.csv', index=False)

# Done!

Web scraping is an important part of a data collection pipeline, and this notebook covers the basics of getting started. We hope it helped. Feel free to ask the TAs how to go further with web scraping and programming.