## Multi-Page Beautiful Soup Tutorial

#### Combined tutorial from:

- Corey Schafer: https://www.youtube.com/watch?v=ng2o98k983k
- https://www.dataquest.io/blog/web-scraping-beautifulsoup/


#### Uses test webscrape site: http://quotes.toscrape.com/

#### This tutorial builds on BeautifulSoup_SinglePage_Scrape.ipynb
- loops through each page to scrape the data and combine it in a single csv file

In [155]:
from bs4 import BeautifulSoup
import requests
import csv

import numpy as np
import re

In [156]:
from time import sleep
from random import randint
from time import time

from IPython.display import clear_output
from warnings import warn


#### Controlling the crawl rate
- Controlling the crawl rate keeps the scraper from overwhelming the site with too many requests per second:
    - it lowers the risk of having your IP address banned
    - it lets the site keep responding to other users during your scrape

- The crawl rate is controlled with the sleep() function from Python's time module:
    - it pauses execution for the specified number of seconds

- The randint() function from Python's random module generates a random integer within a specified interval (both endpoints included), so the pauses vary from request to request

- Example code:
```
for _ in range(0,5):
   print('Blah')
   sleep(randint(1,4))
```
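
A possible variation on the pause above, offered as a sketch rather than part of the original tutorials: `random.uniform()` draws a fractional delay, which makes the request timing less regular than whole seconds. The names `polite_pause` and `sleep_fn` are hypothetical; `sleep_fn` exists only so the pause can be skipped while testing.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=4.0, sleep_fn=time.sleep):
    """Sleep for a random, possibly fractional, duration and return it.

    polite_pause and sleep_fn are hypothetical names for this sketch.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep_fn(delay)
    return delay


for _ in range(3):
    # A no-op sleep_fn makes the demo finish instantly;
    # drop the argument to pause for real.
    waited = polite_pause(1, 4, sleep_fn=lambda seconds: None)
    print('Would wait %.2f seconds' % waited)
```

In a real scrape, calling `polite_pause(8, 12)` once per request would play the same role as `sleep(randint(8, 12))`.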

#### Monitoring Scraping Projects

- With large scraping projects, monitoring progress is important
    - we will monitor
        - the frequency and number of requests
        - the status code of each response, to confirm the server is sending back responses

- Calculate frequency
    - number of requests / time elapsed since the first request

```
start_time = time()  # record the starting time
requestcnt = 0  # counts the number of requests, starting from 0


# start a loop
for _ in range(5):
    # A request would go here  # simulate a request
    requestcnt += 1   # increase number of requests by 1
    
    sleep(randint(1, 3))  # pause loop for a randomly selected period (1-3 secs)
    elapsed_time = time() - start_time  # calculate elapsed time since 1st request
    
    # print each request and the frequency
    print('Request %d: Frequency = %.3f requests/sec' % (requestcnt, requestcnt/elapsed_time))
    clear_output(wait = True) # prevent long list of print outputs
```    
  
    
- Monitor status
    - a successful request is indicated by a status code of 200
    - use the warn() function to issue a warning if the status code is not 200

    - for example, the following code:
        ```
        warn('Warning Simulation')
        ```
        
        would produce:
        ```
        'C:\Users\delos001\Anaconda3\lib\site-packages\ipykernel_launcher.py:31: UserWarning: Warning Simulation'    
        ```
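
The status check described above might be wrapped in a small function, sketched here with simulated status codes so it runs without a network connection. `check_status` is a hypothetical helper name; in a real scrape the code would come from the `status_code` attribute of the response returned by `requests.get()`.

```python
import warnings


def check_status(status_code, request_number):
    """Warn when a status code is not 200 and report whether it was OK.

    check_status is a hypothetical helper name for this sketch.
    """
    if status_code != 200:
        warnings.warn('Request %d: Status Code = %d'
                      % (request_number, status_code))
        return False
    return True


# Simulated status codes stand in for real responses.
for number, code in enumerate([200, 404, 200], start=1):
    check_status(code, number)
```

Returning a boolean lets the calling loop decide whether to skip parsing for that page, while the warning still appears in the notebook output.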




In [162]:
## Set primary variables

# parent URL
src = 'http://quotes.toscrape.com'

## set path for saving csv output file
savePath = r'D:\OneDrive - QJA\My Files\DataScience\DataSets'  # raw string keeps the backslashes literal

## set csv parameters to write scraped data to csv
csv_file = open(savePath + '\\' + 'quotetoscrape_tutorial.csv', 
                'w', 
                encoding = 'utf-8')
csv_writer = csv.writer(csv_file, lineterminator = '\n')
csv_writer.writerow(['Quote', 'Author', 'Tags', 'About URL'])

28
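
The cell above leaves the file open until `csv_file.close()` runs at the end of the scrape, so an error in between could leave it unclosed. One alternative, sketched under assumptions, is a `with` block that closes the file automatically. The temporary directory stands in for `savePath`, and the rows are placeholders rather than scraped data.

```python
import csv
import os
import tempfile

# A temporary directory stands in for savePath (an assumption for this sketch).
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'quotetoscrape_tutorial.csv')

# Placeholder rows, used only to show the file layout.
rows = [
    ['placeholder quote one', 'placeholder author A', 'tag1,tag2', 'placeholder-url'],
    ['placeholder quote two', 'placeholder author B', None, None],
]

# newline='' lets the csv module control line endings itself.
with open(out_path, 'w', encoding='utf-8', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Quote', 'Author', 'Tags', 'About URL'])
    csv_writer.writerows(rows)
# The file is closed here, even if an error occurred inside the block.

# Read the file back to confirm what was written.
with open(out_path, encoding='utf-8', newline='') as csv_file:
    saved = list(csv.reader(csv_file))
print(saved[0])
```

Fields that are `None` are written as empty strings, which matches how the scrape loop records a missing quote, author, or tag.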

In [153]:
## Identify website, send request, and review and prettify source

getsrc = requests.get(src).text
peak = BeautifulSoup(getsrc, 'lxml')
print(peak.prettify())


<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

In [163]:
## Review URL for webpage of interest to determine which 
##   parameters in the URL will need to be modified for 
##   each page
## quotes.toscrape.com has 10 pages
pages = 10
pagesrng = [str(i) for i in range(1, pages + 1)]


## variables to monitor scrape rate and progress
start_time = time() # set starting time and assign as variable
requestcnt = 0  # counts the number of requests, starting from 0


for page in pagesrng:
    
    ## specify url and use pagesrng to allow looping by page number
    ## ex url http://quotes.toscrape.com/page/9/
    source = requests.get(src + '/' + 'page/' + page)  # removed .text
    
    ## DEFINE PAUSE RATE----------------------------------------------------------
    ## add pauses in loop
    sleep(randint(8, 12))
    
    ## MONITOR REQUEST COUNT------------------------------------------------------
    requestcnt +=1  # increase number of requests by 1
    elapsed_time = time() - start_time
    print('Request %.0f: Frequency = %.3f requests/sec' % (requestcnt, 
                                                           requestcnt/elapsed_time))
    clear_output(wait = True)
    
    ## SET WARNING: warning code for non-200 status codes-------------------------
    if source.status_code != 200:
        warn('Request %d: Status Code = %d' % (requestcnt, 
                                               source.status_code))
        
    ## SET COUNTER: requests count stopper (if cnt exceeds x)---------------------
    if requestcnt > pages:
        warn('Number of requests has exceeded page number')
        break
    
    
    
    ## SCRAPE CODE----------------------------------------------------------------
    
    ## in prev tutorial, .text is added to the requests.get, but since it is 
    ##   looped above, it was removed and instead added here
    bsPage = BeautifulSoup(source.text, 'lxml')#.encode("utf-8")
    
    for target in bsPage.find_all('div', class_ = 'quote'): 
        
        ## Use try/except in case a target field is missing-----------------
        
        ## retrieve quotes--------------------------------------------------     
        try:
            quote = target.find('span', class_ = 'text').text
        except Exception as e:
            quote = None
        print(quote)

        ## retrieve author name---------------------------------------------
        ## for demo purposes, use split to extract part of a tag
        try:
            authraw = target.a.get('href').split('/')[1] + ": "
            auth = authraw + target.find('small', class_ = 'author').text
        except Exception as e:
            auth = None
        print(auth)

        ## retrieve quote tags----------------------------------------------        
        try:
            tags = target.find('meta', class_ = 'keywords').get('content')
        except Exception as e:
            tags = None
        print(tags)

        ## retrieve about link----------------------------------------------
        try:
            about = src + target.a.get('href') #about only set if successful
        except Exception as e:
            about = None
        print(about)  

        ## write each output to a new row in csv file
        csv_writer.writerow([quote, auth, tags, about])
    

csv_file.close()
        

## Additional error handling option: test for the tag before reading it
# tagmeta = target.find('meta', class_ = 'keywords')
# if tagmeta is not None:
#     tags = tagmeta.get('content')
# else:
#     tags = None
# print(tags)

“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”
author: J.K. Rowling
truth
http://quotes.toscrape.com/author/J-K-Rowling
“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”
author: Jimi Hendrix
death,life
http://quotes.toscrape.com/author/Jimi-Hendrix
“To die will be an awfully big adventure.”
author: J.M. Barrie
adventure,love
http://quotes.toscrape.com/author/J-M-Barrie
“It takes courage to grow up and become who you really are.”
author: E.E. Cummings
courage
http://quotes.toscrape.com/author/E-E-Cummings
“But better to get hurt by the truth than comforted with a lie.”
author: Khaled Hosseini
life
http://quotes.toscrape.com/author/Khaled-Hosseini
“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”
author: Harper Lee
better-life-empathy
http://quotes.toscrape.com/

'books,classic,reading'
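
The four try/except blocks in the scrape loop repeat the same pattern, so they could be folded into one helper. The sketch below is an assumption-laden illustration, not the method used in either source tutorial: `safe_get` is a hypothetical name, and `FakeTag` is a stand-in object so the example runs without fetching a page.

```python
def safe_get(extract, default=None):
    """Run extract() and return its result, or default if the lookup fails.

    safe_get is a hypothetical helper. It catches the errors a missing tag
    or attribute typically raises instead of a blanket Exception.
    """
    try:
        return extract()
    except (AttributeError, IndexError, TypeError):
        return default


class FakeTag:
    """Stand-in for a parsed tag, used so the sketch runs offline."""
    text = 'placeholder quote text'


present = FakeTag()
missing = None  # find() returns None when nothing matches

print(safe_get(lambda: present.text))
print(safe_get(lambda: missing.text))
print(safe_get(lambda: missing.text, default=''))
```

Inside the scrape loop this might be used as `quote = safe_get(lambda: target.find('span', class_ = 'text').text)`, which leaves `quote` as `None` when the span is absent.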