## Web Scraping Using Python

Website to be scraped - https://books.toscrape.com/catalogue/page-1.html

In [7]:
# Run these commands in a command prompt or terminal
# pip install beautifulsoup4   (the package that provides the bs4 module)
# pip install requests pandas lxml

**Import necessary libraries**

In [8]:
import requests
import bs4
import pandas as pd

**Send request using URL of the webpage to be scraped**

In [45]:
# accessing second page of the website
request1 = requests.get('http://books.toscrape.com/catalogue/page-2.html')

In [47]:
# status 200 means connection successful
print(request1)

<Response [200]>
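Printing the response works for a quick check, but a slightly more defensive request sets a timeout and raises on a non-2xx status. The sketch below is one way to do that; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the raw body of `url`, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    http = session or requests   # a requests.Session reuses connections across calls
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content      # bytes, so the parser can detect the encoding


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('http://books.toscrape.com/catalogue/page-2.html')
```

Passing a `requests.Session()` as `session` may help when many pages are fetched from the same host.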


**Create a soup object with the BeautifulSoup library, which parses the HTML code of the webpage**

In [11]:
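# pass the raw bytes and name the parser, so BeautifulSoup can detect the UTF-8 charset declared in the page
soup = bs4.BeautifulSoup(request1.content, 'lxml')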
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="../static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="../static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="

**Open the webpage in a browser, right-click an element and choose 'Inspect'. The HTML code of that element appears in the developer tools panel.**

**To choose any HTML tag, use the `select` method, which accepts a CSS selector**

In [12]:
soup.select('article')

[<article class="product_pod">
 <div class="image_container">
 <a href="in-her-wake_980/index.html"><img alt="In Her Wake" class="thumbnail" src="../media/cache/5d/72/5d72709c6a7a9584a4d1cf07648bfce1.jpg"/></a>
 </div>
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
 <div class="product_price">
 <p class="price_color">£12.84</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="how-music-works_979/index.html"><img alt="How Music Works" class="thumbnail" src="../media/cache/5c/c8/5cc8e107246cb478960d4f0aba1e1c8e.jpg"/></a>
 

**To select tags nested inside another tag:**

In [13]:
soup.select('article h3 a')

[<a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a>,
 <a href="how-music-works_979/index.html" title="How Music Works">How Music Works</a>,
 <a href="foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html" title="Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More">Foolproof Preserving: A Guide ...</a>,
 <a href="chase-me-paris-nights-2_977/index.html" title="Chase Me (Paris Nights #2)">Chase Me (Paris Nights ...</a>,
 <a href="black-dust_976/index.html" title="Black Dust">Black Dust</a>,
 <a href="birdsong-a-story-in-pictures_975/index.html" title="Birdsong: A Story in Pictures">Birdsong: A Story in ...</a>,
 <a href="americas-cradle-of-quarterbacks-western-pennsylvanias-football-factory-from-johnny-unitas-to-joe-mont

**Let's check the length of the list, since the `select` method returns a list**

In [14]:
len(soup.select('article h3 a'))

20

**So there are 20 books on a single webpage**

In [15]:
# inspecting any element in the list
soup.select('article h3 a')[0]

<a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a>

In [16]:
# accessing the name of book ( similar to key-value pair in Python dictionary )
soup.select('article h3 a')[0]['title']

'In Her Wake'

In [17]:
# accessing the link of each book ( similar to key-value pair in Python dictionary )
soup.select('article h3 a')[0]['href']

'in-her-wake_980/index.html'
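The `href` is relative to the page it was found on, so it has to be combined with that page's URL before it can be requested. A minimal sketch using the standard library's `urljoin` (the variable names are illustrative):

```python
from urllib.parse import urljoin

page_url = 'http://books.toscrape.com/catalogue/page-2.html'  # page the link was found on
href = 'in-her-wake_980/index.html'                           # relative link from the <a> tag

book_url = urljoin(page_url, href)
print(book_url)
# http://books.toscrape.com/catalogue/in-her-wake_980/index.html
```

`urljoin` also resolves `../` segments, which plain string concatenation would leave in the URL.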

**Let's look for rating of each book in `article` tag**

In [18]:
soup.select('article p')

[<p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">£12.84</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating Two">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">£37.32</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="price_color">£30.52</p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="star-rating Five">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <

In [19]:
# inspecting first element of a list
(soup.select('article p'))[0]

<p class="star-rating One">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

In [20]:
# accessing the class
(soup.select('article p'))[0]['class']

['star-rating', 'One']

In [21]:
(soup.select('article p'))[0]['class'][1]

'One'
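The rating comes back as a word, which is awkward to sort or average. One possible conversion to a number is sketched below; `RATING_WORDS` and `rating_from_classes` are hypothetical names, and the mapping covers only the five words seen in the outputs of this notebook.

```python
# Mapping from the class word to a number (the five values seen in the outputs).
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_from_classes(classes):
    """Return the numeric rating found in a list of CSS classes, or None."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None  # no known rating word among the classes


print(rating_from_classes(['star-rating', 'One']))  # 1
```

Searching the whole class list avoids depending on the rating word always being at index 1.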

**Using for loop to get the names of all the books and their rating**

In [22]:
title_list = []
rating_list = []

for page_num in range(1, 51):  # total pages
    req_page = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')  # url of each page
    soup_page = bs4.BeautifulSoup(req_page.content, 'lxml')

    for item in soup_page.select('article h3 a'):
        title_list.append(item['title'])

    # select only the rating paragraphs instead of stepping through every third <p>
    for item in soup_page.select('article p.star-rating'):
        rating_list.append(item['class'][1])

In [23]:
print(title_list)

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas", 'In Her Wake', 'How Music Works', 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More',

In [24]:
# total books - 20 books on each of the 50 pages
print(len(title_list))

1000


In [25]:
print(rating_list)

['Three', 'One', 'One', 'Four', 'Five', 'One', 'Four', 'Three', 'Four', 'One', 'Two', 'Four', 'Five', 'Five', 'Five', 'Three', 'One', 'One', 'Two', 'Two', 'One', 'Two', 'Three', 'Five', 'Five', 'Three', 'Three', 'Three', 'Five', 'Four', 'Five', 'Three', 'Five', 'One', 'Five', 'Three', 'Two', 'One', 'Four', 'Two', 'Three', 'Two', 'Five', 'Five', 'Two', 'One', 'Five', 'Four', 'Four', 'Three', 'One', 'One', 'Three', 'Four', 'Five', 'One', 'One', 'One', 'Four', 'Three', 'Four', 'Three', 'Four', 'Four', 'Three', 'Five', 'One', 'One', 'Four', 'Three', 'Three', 'One', 'Five', 'Four', 'Two', 'Two', 'Three', 'Two', 'Two', 'Three', 'Five', 'Five', 'One', 'Two', 'Three', 'Four', 'One', 'One', 'Three', 'Two', 'Two', 'Two', 'Four', 'Two', 'Three', 'Two', 'One', 'Two', 'Five', 'Four', 'Five', 'Two', 'Three', 'One', 'One', 'Two', 'Three', 'Four', 'One', 'Two', 'Two', 'Four', 'Three', 'Four', 'Four', 'Five', 'Three', 'Two', 'Two', 'Two', 'One', 'One', 'Five', 'One', 'Five', 'Four', 'One', 'Five', 'Fou

In [26]:
print(len(rating_list))

1000
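Titles and ratings are collected into two parallel lists, which stay aligned only as long as both selectors return exactly one entry per book. A sketch of an alternative that reads both fields from the same `article` element is shown below; the HTML string is a shortened stand-in for the listing page, and `soup_demo` and `books` are illustrative names.

```python
import bs4

# Two shortened product blocks in the same shape as the listing page.
html = """
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
  <p class="price_color">£12.84</p>
</article>
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="how-music-works_979/index.html" title="How Music Works">How Music Works</a></h3>
  <p class="price_color">£37.32</p>
</article>
"""

soup_demo = bs4.BeautifulSoup(html, 'html.parser')

books = []
for article in soup_demo.select('article.product_pod'):
    title = article.select_one('h3 a')['title']               # full title from the attribute
    rating = article.select_one('p.star-rating')['class'][1]  # second class is the rating word
    books.append((title, rating))

print(books)
# [('In Her Wake', 'One'), ('How Music Works', 'Two')]
```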


**All other information about a book is available on that book's own page. Let's pick one book and try to access its information.**

In [27]:
# we already know the url
soup.select('article h3 a')[0]['href']

'in-her-wake_980/index.html'

In [28]:
# new request for each url
new_request = requests.get('https://books.toscrape.com/catalogue/in-her-wake_980/index.html') 

In [29]:
# new soup object for each book url
soup_new = bs4.BeautifulSoup(new_request.content, 'lxml')  # raw bytes, so the page's UTF-8 charset is detected
soup_new

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    In Her Wake | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:30" name="created"/>
<meta content="
    A perfect life … until she discovered it wasn’t her own.A tragic family event reveals devastating news that rips apart Bella’s comfortable existence. Embarking on a personal journey to uncover the truth, she faces a series of traumatic discoveries that take her to the ruggedly beautiful Cornish coast, where hidden truths, past betrayals and a 25-year-old mystery threaten n A perfect life … until she discovered it wasn’t her own.A tragic f

In [30]:
# visit the URL and go to the Product Information table
soup_new.select('tr')

[<tr>
 <th>UPC</th><td>23356462d1320d61</td>
 </tr>,
 <tr>
 <th>Product Type</th><td>Books</td>
 </tr>,
 <tr>
 <th>Price (excl. tax)</th><td>£12.84</td>
 </tr>,
 <tr>
 <th>Price (incl. tax)</th><td>£12.84</td>
 </tr>,
 <tr>
 <th>Tax</th><td>£0.00</td>
 </tr>,
 <tr>
 <th>Availability</th>
 <td>In stock (19 available)</td>
 </tr>,
 <tr>
 <th>Number of reviews</th>
 <td>0</td>
 </tr>]

In [31]:
# feature of book
soup_new.select('tr th')

[<th>UPC</th>,
 <th>Product Type</th>,
 <th>Price (excl. tax)</th>,
 <th>Price (incl. tax)</th>,
 <th>Tax</th>,
 <th>Availability</th>,
 <th>Number of reviews</th>]

In [32]:
# accessing random feature
soup_new.select('tr th')[2].text

'Price (excl. tax)'

In [33]:
# value of that feature
soup_new.select('tr td')

[<td>23356462d1320d61</td>,
 <td>Books</td>,
 <td>£12.84</td>,
 <td>£12.84</td>,
 <td>£0.00</td>,
 <td>In stock (19 available)</td>,
 <td>0</td>]
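Selecting all `th` and all `td` cells separately gives two lists that have to be matched by position. A sketch that instead reads the header and value from the same table row is shown below; the HTML string is a shortened stand-in for the Product Information table, and `table_soup` and `product_info` are illustrative names.

```python
import bs4

# Three rows in the same shape as the Product Information table.
html = """
<table>
  <tr><th>UPC</th><td>23356462d1320d61</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Availability</th>
      <td>In stock (19 available)</td></tr>
</table>
"""

table_soup = bs4.BeautifulSoup(html, 'html.parser')

product_info = {}
for row in table_soup.select('tr'):
    header = row.select_one('th')
    value = row.select_one('td')
    if header is not None and value is not None:  # skip rows missing either cell
        product_info[header.get_text(strip=True)] = value.get_text(strip=True)

print(product_info)
```

Reading both cells from one row keeps each feature next to its value even if a row is missing a cell.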

In [34]:
# accessing value of random feature
soup_new.select('tr td')[2].text

'£12.84'
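The price is returned as text with a currency symbol, so it needs converting before any arithmetic. A minimal sketch, assuming the numeric part always looks like digits with an optional decimal fraction; `parse_price` is a hypothetical helper name.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the numeric part of a price string such as '£12.84'."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return Decimal(match.group())


print(parse_price('£12.84'))   # 12.84
print(parse_price('Â£12.84'))  # 12.84, stray characters before the digits are ignored
```

`Decimal` is used here to avoid floating-point rounding on currency values; `float` would also work for rough analysis.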

**Using `for` loop to get all the information about each book in the form of dataframe**

In [35]:
%%time

rows = []  # one dictionary per book

for page_num in range(1, 51):    # total pages
    page_res = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')  # url of each page
    page_soup = bs4.BeautifulSoup(page_res.content, 'lxml')
    book_info_list = page_soup.select('article h3 a')   # all books on that page

    for book_link in book_info_list:
        res = requests.get('http://books.toscrape.com/catalogue/' + book_link['href'])  # url of each book on this page
        book_soup = bs4.BeautifulSoup(res.content, 'lxml')

        column_names = [item.text for item in book_soup.select('tr th')]    # features such as UPC, price, availability
        column_values = [item.text for item in book_soup.select('tr td')]   # values of those features

        rows.append(dict(zip(column_names, column_values)))  # features become columns, values become one row

# DataFrame.append was removed in pandas 2.0, so build the dataframe once from the list of rows
new_df = pd.DataFrame(rows)   # default index runs from 0, one row per book
    

Wall time: 11min 27s


In [36]:
new_df

| | UPC | Product Type | Price (excl. tax) | Price (incl. tax) | Tax | Availability | Number of reviews |
|---|---|---|---|---|---|---|---|
| 0 | a897fe39b1053632 | Books | £51.77 | £51.77 | £0.00 | In stock (22 available) | 0 |
| 1 | 90fa61229261140a | Books | £53.74 | £53.74 | £0.00 | In stock (20 available) | 0 |
| 2 | 6957f44c3847a760 | Books | £50.10 | £50.10 | £0.00 | In stock (20 available) | 0 |
| 3 | e00eb4fd7b871a48 | Books | £47.82 | £47.82 | £0.00 | In stock (20 available) | 0 |
| 4 | 4165285e1663650f | Books | £54.23 | £54.23 | £0.00 | In stock (20 available) | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | cd2a2a70dd5d176d | Books | £55.53 | £55.53 | £0.00 | In stock (1 available) | 0 |
| 996 | bfd5e1701c862ac3 | Books | £57.06 | £57.06 | £0.00 | In stock (1 available) | 0 |
| 997 | 19fec36a1dfb4c16 | Books | £16.97 | £16.97 | £0.00 | In stock (1 available) | 0 |
| 998 | f684a82adc49f011 | Books | £53.98 | £53.98 | £0.00 | In stock (1 available) | 0 |


In [37]:
print(new_df.shape)

(1000, 7)

In [38]:
# adding new columns in a dataframe 
new_df["Rating"] = rating_list
new_df["Title"] = title_list
new_df

| | UPC | Product Type | Price (excl. tax) | Price (incl. tax) | Tax | Availability | Number of reviews | Rating | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a897fe39b1053632 | Books | £51.77 | £51.77 | £0.00 | In stock (22 available) | 0 | Three | A Light in the Attic |
| 1 | 90fa61229261140a | Books | £53.74 | £53.74 | £0.00 | In stock (20 available) | 0 | One | Tipping the Velvet |
| 2 | 6957f44c3847a760 | Books | £50.10 | £50.10 | £0.00 | In stock (20 available) | 0 | One | Soumission |
| 3 | e00eb4fd7b871a48 | Books | £47.82 | £47.82 | £0.00 | In stock (20 available) | 0 | Four | Sharp Objects |
| 4 | 4165285e1663650f | Books | £54.23 | £54.23 | £0.00 | In stock (20 available) | 0 | Five | Sapiens: A Brief History of Humankind |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | cd2a2a70dd5d176d | Books | £55.53 | £55.53 | £0.00 | In stock (1 available) | 0 | One | Alice in Wonderland (Alice's Adventures in Won... |
| 996 | bfd5e1701c862ac3 | Books | £57.06 | £57.06 | £0.00 | In stock (1 available) | 0 | Four | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) |
| 997 | 19fec36a1dfb4c16 | Books | £16.97 | £16.97 | £0.00 | In stock (1 available) | 0 | Five | A Spy's Devotion (The Regency Spies of London #1) |
| 998 | f684a82adc49f011 | Books | £53.98 | £53.98 | £0.00 | In stock (1 available) | 0 | One | 1st to Die (Women's Murder Club #1) |
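The `Availability` column holds text such as `In stock (22 available)`, so the count has to be extracted before it can be used as a number. A minimal sketch with a regular expression; `parse_stock` is a hypothetical helper, and returning 0 when no count is found is an assumption.

```python
import re


def parse_stock(text):
    """Return the number of copies from text like 'In stock (22 available)'."""
    match = re.search(r'\((\d+) available\)', text)
    return int(match.group(1)) if match else 0  # assumption: no count means 0


print(parse_stock('In stock (22 available)'))  # 22
```

With the dataframe above, something like `new_df['Availability'].map(parse_stock)` would give a numeric column.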


In [37]:
# to select the category of the book 
soup_new.select('ul')

[<ul class="breadcrumb">
 <li>
 <a href="../../index.html">Home</a>
 </li>
 <li>
 <a href="../category/books_1/index.html">Books</a>
 </li>
 <li>
 <a href="../category/books/thriller_37/index.html">Thriller</a>
 </li>
 <li class="active">In Her Wake</li>
 </ul>,
 <ul class="row">
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="../the-rise-of-theodore-roosevelt-theodore-roosevelt-1_276/index.html"><img alt="The Rise of Theodore Roosevelt (Theodore Roosevelt #1)" class="thumbnail" src="../../media/cache/ff/d4/ffd45d95f314555e20c923d3522adea7.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="../the-rise-of-theodore-roosevelt-theodore-roosevelt-1_276/index.html" title="The Rise of Theodore Roosevelt (Theodore Roosevelt #1)">The Rise of Theodore ...</a></h3>
 <div c

In [41]:
# we use index since it is a list  
(soup_new.select('ul'))[0]

<ul class="breadcrumb">
<li>
<a href="../../index.html">Home</a>
</li>
<li>
<a href="../category/books_1/index.html">Books</a>
</li>
<li>
<a href="../category/books/thriller_37/index.html">Thriller</a>
</li>
<li class="active">In Her Wake</li>
</ul>

In [42]:
# from the HTML code above, as we can see 'a' tag has the genre of the book, so we select that tag
(soup_new.select('ul'))[0].select('a')

[<a href="../../index.html">Home</a>,
 <a href="../category/books_1/index.html">Books</a>,
 <a href="../category/books/thriller_37/index.html">Thriller</a>]

In [43]:
# genre is at index 2 (the third element) of the above list
(soup_new.select('ul'))[0].select('a')[2]

<a href="../category/books/thriller_37/index.html">Thriller</a>

In [44]:
(soup_new.select('ul'))[0].select('a')[2].text

'Thriller'
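Indexing the first `ul` and then the third link depends on the exact page layout. A sketch that targets the breadcrumb by its class and takes its last link is shown below; the HTML string is a shortened stand-in for the book page, and `crumb_soup` is an illustrative name.

```python
import bs4

# Breadcrumb in the same shape as the book page, plus one unrelated list.
html = """
<ul class="breadcrumb">
  <li><a href="../../index.html">Home</a></li>
  <li><a href="../category/books_1/index.html">Books</a></li>
  <li><a href="../category/books/thriller_37/index.html">Thriller</a></li>
  <li class="active">In Her Wake</li>
</ul>
<ul class="row"><li><a href="#">Other link</a></li></ul>
"""

crumb_soup = bs4.BeautifulSoup(html, 'html.parser')

# The last link in the breadcrumb is the category; the current book is not a link.
category = crumb_soup.select('ul.breadcrumb a')[-1].get_text(strip=True)
print(category)  # Thriller
```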

**Using `for` loop to get the categories of each book in the form of a list**

In [3]:
%%time

category_list = []

for page_num in range(1, 51):
    page_res = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')  # url of each page
    page_soup = bs4.BeautifulSoup(page_res.content, 'lxml')
    book_info_list = page_soup.select('article h3 a')

    for book_link in book_info_list:
        res = requests.get('http://books.toscrape.com/catalogue/' + book_link['href'])  # url of each book on this page
        book_soup = bs4.BeautifulSoup(res.content, 'lxml')

        # the first <ul> is the breadcrumb; its third link (index 2) holds the category
        category = book_soup.select('ul')[0].select('a')[2].text

        category_list.append(category)
        
len(category_list)

Wall time: 9min 53s


1000

In [4]:
print(category_list[0:10])

['Poetry', 'Historical Fiction', 'Fiction', 'Mystery', 'History', 'Young Adult', 'Business', 'Default', 'Default', 'Poetry']


In [39]:
print(len(category_list))

1000


In [40]:
# adding category as a column in the above dataframe
final_dataset = new_df.copy()

final_dataset['Category'] = category_list

final_dataset.head()

| | UPC | Product Type | Price (excl. tax) | Price (incl. tax) | Tax | Availability | Number of reviews | Rating | Title | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a897fe39b1053632 | Books | £51.77 | £51.77 | £0.00 | In stock (22 available) | 0 | Three | A Light in the Attic | Poetry |
| 1 | 90fa61229261140a | Books | £53.74 | £53.74 | £0.00 | In stock (20 available) | 0 | One | Tipping the Velvet | Historical Fiction |
| 2 | 6957f44c3847a760 | Books | £50.10 | £50.10 | £0.00 | In stock (20 available) | 0 | One | Soumission | Fiction |
| 3 | e00eb4fd7b871a48 | Books | £47.82 | £47.82 | £0.00 | In stock (20 available) | 0 | Four | Sharp Objects | Mystery |
| 4 | 4165285e1663650f | Books | £54.23 | £54.23 | £0.00 | In stock (20 available) | 0 | Five | Sapiens: A Brief History of Humankind | History |


In [43]:
print(final_dataset.shape)

(1000, 10)


**Store this data as a `csv` file**

In [44]:
final_dataset.to_csv('scraped_dataset.csv')
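By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below writes a small stand-in dataframe without the index and reads it back; the `demo` frame, the temporary path and the file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for final_dataset, with two books from the outputs above.
demo = pd.DataFrame({
    'Title': ['A Light in the Attic', 'Tipping the Velvet'],
    'Price (incl. tax)': ['£51.77', '£53.74'],
})

path = os.path.join(tempfile.mkdtemp(), 'scraped_dataset_demo.csv')

# index=False leaves out the row index, so no 'Unnamed: 0' column appears on reload.
demo.to_csv(path, index=False, encoding='utf-8')

reloaded = pd.read_csv(path, encoding='utf-8')
print(list(reloaded.columns))
# ['Title', 'Price (incl. tax)']
```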