# Working with Multiple Pages and Items

Realistically, when web scraping, you will have to grab multiple elements across multiple pages. To do that, you will need loops, functions, and sometimes classes.

For this notebook, we will use a website designed for web scraping: [toscrape.com/](https://toscrape.com/). Our goal is to grab every book with a 2-star rating.

We will do the following:

1. Figure out the URL structure to go through every page
2. Figure out what tag/class represents the Star rating
3. Scrape the first page
4. Create a structure to scrape
5. Scrape every page in the catalogue and display final results

In [1]:
import requests
import bs4
import lxml

# STEP 1: Figure out the URL structure to go through every page

We need to understand how the catalogue moves from one page to the next.

* The first page: https://books.toscrape.com/catalogue/page-1.html
* The second page: https://books.toscrape.com/catalogue/page-2.html

That means the catalogue URL is built by increasing the number after *page-* by 1 for each new page. So we can loop over the page numbers and insert each one into the URL to walk through the catalogue.

We will store this URL pattern in a variable, `base_url`, with `{}` as a placeholder for the page number.

In [2]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with `.format()` to go to a specific page.

In [3]:
base_url.format("20")

'http://books.toscrape.com/catalogue/page-20.html'
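
To see how the same pattern covers several pages at once, here is a minimal sketch that builds the URLs for the first three pages. The name `page_urls` is only illustrative; note that `.format()` accepts an integer just as well as a string.

```python
# Same pattern as base_url in the notebook
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# Build the URLs for pages 1 to 3; range() excludes its upper bound
page_urls = [base_url.format(n) for n in range(1, 4)]

for url in page_urls:
    print(url)
```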

# STEP 2: Figure out what tag/class represents the Star rating

Inspecting the web page, we can see that each rating has its own class: the rating is written as a word after `star-rating`, for example `star-rating One` or `star-rating Two`. Since we want 2-star books, we will look for `star-rating Two`.
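
As a rough illustration of how the rating is encoded, the sketch below parses a small hand-written fragment (modelled on the site's markup, not fetched from it) and turns the rating word into a number. The `WORD_TO_NUMBER` mapping is a hypothetical helper, and Python's built-in `html.parser` is used instead of `lxml` only so the snippet runs without extra dependencies.

```python
import bs4

# Hand-written fragment imitating one rating tag
html = '<p class="star-rating Two"><i class="icon-star"></i></p>'
soup = bs4.BeautifulSoup(html, "html.parser")

# Hypothetical lookup table from rating word to number of stars
WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

tag = soup.select_one(".star-rating")
classes = tag["class"]  # Beautiful Soup returns the class attribute as a list

# Pick the class that is a rating word and convert it to a number
rating_word = next(c for c in classes if c in WORD_TO_NUMBER)
stars = WORD_TO_NUMBER[rating_word]
print(classes, stars)
```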

# STEP 3: Scrape the first page

In [4]:
# Let's extract the first page and analyze it

res = requests.get(base_url.format('1'))
soup = bs4.BeautifulSoup(res.text,"lxml")

soup.select(".product_pod") #product_pod is designated for items in a catalogue

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

In [5]:
len(soup.select(".product_pod")) #20 items as there are 20 products/page

20

In [6]:
# Assign product_pod to products

products = soup.select(".product_pod")

# STEP 4: Create a structure to scrape

We will first test our approach on the first item in the catalogue, a three-star book called `A Light in the Attic`.

In [7]:
# Source on the first item:

products[0]

<article class="product_pod">
<div class="image_container">
<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

Let's check whether the first book is rated three stars. There are two ways:

In [8]:
# First method

'star-rating Three' in str(products[0])

True

In [9]:
# Second, more robust method: a CSS selector

products[0].select(".star-rating.Three") #replace the space between the class names with a dot

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]
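
One reason to prefer the selector is that it does not depend on how the tag is written out as text. The sketch below uses a made-up fragment in which the two class names appear in the opposite order: the string test misses it, while the selector still matches. The fragment and the `html.parser` choice are assumptions for illustration only; the real site writes `star-rating` first.

```python
import bs4

# Made-up fragment: same two classes, but in reverse order
html = '<article class="product_pod"><p class="Three star-rating"></p></article>'
product = bs4.BeautifulSoup(html, "html.parser").select_one(".product_pod")

# The string test looks for the exact text "star-rating Three"
string_match = 'star-rating Three' in str(product)

# The selector only requires that both classes are on the same tag
selector_match = len(product.select(".star-rating.Three")) != 0

print(string_match)    # False
print(selector_match)  # True
```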

Next, we need to select the title of the book. 

In [10]:
products[0].select("a") #the title is stored in an <a> tag, so select every "a" element

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [11]:
# Select the second item: the first <a> tag wraps the image, while the second holds the title

products[0].select("a")[1] 

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [12]:
# Getting the title

products[0].select("a")[1]["title"]

'A Light in the Attic'
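
An alternative that avoids counting `a` tags is to target the link inside the `h3` heading directly. This is a sketch on a hand-written fragment shaped like the output above, not the notebook's own method; `select_one` returns the first match (or `None` if there is none), and `html.parser` is used here only so the snippet has no extra dependency.

```python
import bs4

# Hand-written fragment shaped like one product_pod entry
html = """
<article class="product_pod">
  <div class="image_container">
    <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic"/></a>
  </div>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""
product = bs4.BeautifulSoup(html, "html.parser").select_one(".product_pod")

# The title link is the only <a> inside the <h3>
title = product.select_one("h3 a")["title"]
print(title)  # A Light in the Attic
```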

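We now have the two building blocks for capturing the titles of 2-star books: a rating check (either the string test or `select`) and the title lookup:

        # rating check, here with a CSS selector
        products[n].select(".star-rating.Two")
        # title lookup
        products[n].select("a")[1]["title"]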
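
These two pieces can be wrapped in a small helper so that a page loop stays short. This is one possible sketch, not the only way to organise it: the function name `title_if_rated` and its parameters are hypothetical, and the test fragment is hand-written rather than scraped.

```python
import bs4

def title_if_rated(product, rating_word):
    """Return the book title if `product` has the given rating, else None."""
    if product.select(".star-rating." + rating_word):
        return product.select("a")[1]["title"]
    return None

# Hand-written fragment with one two-star book
html = """
<article class="product_pod">
  <a href="x/index.html"><img alt="Example Book"/></a>
  <p class="star-rating Two"></p>
  <h3><a href="x/index.html" title="Example Book">Example ...</a></h3>
</article>
"""
product = bs4.BeautifulSoup(html, "html.parser").select_one(".product_pod")

print(title_if_rated(product, "Two"))    # Example Book
print(title_if_rated(product, "Three"))  # None
```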

# STEP 5: Scrape every page in the catalogue and display final results

Let's give it a shot by combining all the ideas we've talked about! This should take up to a minute to run. Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a pause between requests with `time.sleep(1)`.
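
If requests fail intermittently, one option is to wrap the download in a helper that waits and tries again. The following is a minimal sketch under assumptions: `fetch_html`, its `retries` and `delay` parameters, and the injectable `get` argument are all hypothetical names introduced here, and the 10-second timeout is an arbitrary choice.

```python
import time
import requests

def fetch_html(url, retries=3, delay=1.0, get=requests.get):
    """Return the page HTML, pausing `delay` seconds after each failed attempt."""
    for attempt in range(1, retries + 1):
        try:
            res = get(url, timeout=10)
            res.raise_for_status()  # turn HTTP error codes into exceptions
            return res.text
        except requests.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)
```

With this helper, the `requests.get(scrape_url)` call and `res.text` in the page loop could be replaced by a single `fetch_html(scrape_url)`.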

In [13]:
#Create a list to hold the results
two_star_titles = []

In [14]:
#Loop to go through the pages
for n in range(1,51):

    #using the base_url.format
    scrape_url = base_url.format(n)
    
    #scrape the requested page
    res = requests.get(scrape_url)
    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")
    
    #loop to go through each item 
    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_titles.append(book.select('a')[1]['title'])

In [15]:
two_star_titles

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day