## Basic Scrape Starting Point


#### Citations: 
- template comes from: 

    - Corey Schafer: https://www.youtube.com/watch?v=ng2o98k983k
    - https://www.dataquest.io/blog/web-scraping-beautifulsoup/


- see recipes/webscraping for the original tutorial

In [5]:
from bs4 import BeautifulSoup
import requests
import csv

import pandas as pd
import numpy as np
import re

In [7]:
from time import sleep
from random import randint
from time import time

from IPython.display import clear_output
from warnings import warn

#### Controlling the crawl rate
- Controlling the crawl rate is important so the site is not overwhelmed with too many requests per second:
    - it reduces the risk of getting your IP address banned
    - it lets the site keep responding to other users during your scrape

- Crawl rate is controlled with the sleep() function from Python's time module:
    - it pauses execution for the specified number of seconds

- The randint() function from Python's random module generates a random integer within a specified interval, so the pause length varies between requests

- Example code:
```
for _ in range(5):
    print('Blah')
    sleep(randint(1, 4))  # pause 1-4 seconds between iterations
```

#### Monitoring Scraping Projects

- With large scraping projects, monitoring progress is important
    - we will monitor
        - the frequency and number of requests
        - the status code of each response, to confirm the server is sending back responses

- Calculate frequency
    - number of requests / time elapsed since the first request

```
start_time = time()  # record the starting time
requestcnt = 0  # counts the number of requests, starting from 0


#start a loop
for _ in range(5):
    # A request would go here  #simulate a request
    requestcnt += 1   # increase number of requests by 1
    
    sleep(randint(1, 3))  # pause loop for randomly selected time period (1-3 secs)
    elapsed_time = time() - start_time  # calculate elapsed time since 1st request
    
    # print each request and the frequency
    #print('Request: {}; Frequency: {} requests/sec'.format(requestcnt, requestcnt/elapsed_time))
    print('Request %.0f: Frequency = %.3f requests/sec' % (requestcnt, requestcnt/elapsed_time))
    clear_output(wait = True) # prevent long list of print outputs
```    
  
    
- Monitor status
    - a successful request is indicated by a status code of 200
    - use the warn() function from Python's warnings module to issue a warning if the status code is not 200

    - for example, the following code:
        ```
        warn('Warning Simulation')
        ```
        
        would produce:
        ```
        'C:\Users\delos001\Anaconda3\lib\site-packages\ipykernel_launcher.py:31: UserWarning: Warning Simulation'    
        ```
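
The status check described above can be sketched as a small helper. This is only an illustration: `check_status` is a hypothetical name, and the simulated status codes stand in for real responses so the example runs without a network connection.

```python
import warnings

def check_status(status_code, request_number):
    """Warn when a status code is not 200; return True when it is 200."""
    if status_code != 200:
        warnings.warn('Request {}: Status Code = {}'.format(request_number,
                                                            status_code))
        return False
    return True

# Simulated status codes (assumption: real code would pass response.status_code)
for number, code in enumerate([200, 404, 200], start=1):
    check_status(code, number)
```

Returning a boolean lets the calling loop decide whether to skip parsing a failed page; whether that is useful depends on the scrape.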




### Set Primary Variables

In [20]:
# parent URL
src = 'http://quotes.toscrape.com'

## set path for saving csv output file
savePath = r'D:\OneDrive - QJA\My Files\DataScience\DataSets'  # raw string keeps backslashes literal

## set csv parameters to write scraped data to csv
# csv_file = open(savePath + '\\' + 'quotetoscrape_tutorial.csv', 
#                 'w', 
#                 encoding = 'utf-8')
# csv_writer = csv.writer(csv_file, lineterminator = '\n')
# csv_writer.writerow(['Quote', 'Author', 'Tags', 'About URL'])
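
If the commented-out csv approach above is used instead of pandas, one possible shape is sketched below. The temporary directory and the sample row are made up for illustration; in practice the path would come from `savePath` and the rows from the scrape loop.

```python
import csv
import os
import tempfile

# Hypothetical output location; substitute the notebook's savePath as needed
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, 'quotetoscrape_tutorial.csv')

# Made-up sample row, only to show the shape of the data
rows = [['A sample quote', 'Sample Author', 'tag-one,tag-two',
         'http://quotes.toscrape.com/author/Sample-Author']]

# newline='' lets the csv module control line endings itself;
# the with block closes the file even if writing fails
with open(out_file, 'w', encoding='utf-8', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, lineterminator='\n')
    csv_writer.writerow(['Quote', 'Author', 'Tags', 'About URL'])
    csv_writer.writerows(rows)
```

Using `with` avoids having to remember a separate `csv_file.close()` call after the scrape finishes.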

### Primary Scrape Script

In [11]:
## Create blank lists to store fields of interest
##   these will have info appended at each iteration
quotes = []
authors = []
tags = []
abouts = []



## Review the URL of the webpage of interest to determine which 
##   parameters in the URL need to be modified for 
##   each page
## quotes.toscrape.com has 10 pages
pages = 10
pagesrng = [str(i) for i in range(1, pages + 1)]


## variables to monitor scrape rate and progress
start_time = time()  # record the starting time
requestcnt = 0  # counts the number of requests, starting from 0


for page in pagesrng:
    
    ## specify url and use pagesrng to allow looping by page number
    ## ex url http://quotes.toscrape.com/page/9/
    source = requests.get(src + '/' + 'page/' + page)  # removed .text
    
    ## DEFINE PAUSE RATE----------------------------------------------------------
    ## add pauses in loop
    sleep(randint(8, 12))
    
    ## MONITOR REQUEST COUNT------------------------------------------------------
    requestcnt +=1  # increase number of requests by 1
    elapsed_time = time() - start_time
    print('Request %.0f: Frequency = %.3f requests/sec' % (requestcnt, 
                                                           requestcnt/elapsed_time))
    clear_output(wait = True)
    
    ## SET WARNING: warning code for non-200 status codes-------------------------
    if source.status_code != 200:
        warn('Request %.0f: Status Code = %s' % (requestcnt, 
                                                 source.status_code))
        
    ## SET COUNTER: requests count stopper (if cnt exceeds x)---------------------
    if requestcnt > pages:
        warn('Number of requests has exceeded page number')
        break
    
    
    
    ## SCRAPE CODE----------------------------------------------------------------
    
    ## in prev tutorial, .text is added to the requests.get, but since it is 
    ##   looped above, it was removed and instead added here
    bsPage = BeautifulSoup(source.text, 'lxml')
    
    for target in bsPage.find_all('div', class_ = 'quote'): 
        
        ## Use Try/except in cases target field is missing------------------ 
        
        ## retrieve quotes--------------------------------------------------     
        try:
            quote = target.find('span', class_ = 'text').text
        except Exception as e:
            quote = None
        quotes.append(quote)

        ## retrieve author name---------------------------------------------
        ## for demo purposes, use split to extract part of a tag
        try:
            authraw = target.a.get('href').split('/')[1] + ": "
            author = authraw + target.find('small', class_ = 'author').text
        except Exception as e:
            author = None
        authors.append(author)

        ## retrieve quote tags----------------------------------------------        
        try:
            tag = target.find('meta', class_ = 'keywords').get('content')
        except Exception as e:
            tag = None
        tags.append(tag)

        ## retrieve about link----------------------------------------------
        try:
            about = src + target.a.get('href') #about only set if successful
        except Exception as e:
            about = None
        abouts.append(about)  

## Additional error handling option: check for the element before reading it
# meta = target.find('meta', class_ = 'keywords')
# if meta is not None:
#     tag = meta.get('content')  # .get returns None if the attribute is missing
# else:
#     tag = None
# print(tag)





Request 10: Frequency = 0.090 requests/sec
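
The repeated try/except blocks in the scrape loop could be folded into one helper. The sketch below is an assumption about how that might look: `safe_extract` is a hypothetical name, and a plain dict stands in for the parsed `target` element so the example runs without BeautifulSoup. With BeautifulSoup, `find()` returns None when nothing matches, so reading `.text` from the result raises AttributeError, which is one of the errors caught here.

```python
def safe_extract(extractor, default=None):
    """Call extractor() and return its result, or default if the lookup fails."""
    try:
        return extractor()
    except (AttributeError, TypeError, IndexError, KeyError):
        return default

# Stand-in for a parsed element; in the notebook this would be `target`
record = {'text': 'A sample quote'}
missing = None  # mimics find() returning None

quote = safe_extract(lambda: record['text'])
tag = safe_extract(lambda: record['keywords'])        # missing key
author = safe_extract(lambda: missing.text, default='')  # None has no .text
```

In the loop, a call might look like `safe_extract(lambda: target.find('span', class_ = 'text').text)`. Catching a short list of exception types rather than `Exception` keeps unrelated bugs visible.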


#### future work at this point:
- using the 'about' links, we could access each link and pull the age of each author and add to the df
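
As a rough sketch of the age calculation only, assume the birth date has already been scraped from an author's about page as text in the form 'March 14, 1879'. That format is an assumption to check against the actual pages, `years_since` is a hypothetical helper name, and month-name parsing with `%B` assumes an English locale.

```python
from datetime import date, datetime

def years_since(born_text, today=None):
    """Return whole years between a date like 'March 14, 1879' and today."""
    born = datetime.strptime(born_text, '%B %d, %Y').date()
    today = today or date.today()
    # subtract one if this year's anniversary has not happened yet
    before_anniversary = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_anniversary)

# Fixed reference date so the result does not depend on when this is run
print(years_since('March 14, 1879', today=date(2020, 1, 1)))
```

For authors who are no longer living this gives years since birth rather than an age, so a real version would need a rule for that case.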

### Move Data to Data frame

#### possible next steps
- export data to csv
- analyze/plot data

In [14]:
## Create pandas df with all stored data

bsPagedf = pd.DataFrame({'Quote': quotes, 
                         'Author': authors, 
                         'Tags': tags, 
                         'About URL': abouts})

print(bsPagedf.info())
bsPagedf.head(10)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
Quote        100 non-null object
Author       100 non-null object
Tags         100 non-null object
About URL    100 non-null object
dtypes: object(4)
memory usage: 3.2+ KB
None


| | Quote | Author | Tags | About URL |
|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | author: Albert Einstein | change,deep-thoughts,thinking,world | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | “It is our choices, Harry, that show what we t... | author: J.K. Rowling | abilities,choices | http://quotes.toscrape.com/author/J-K-Rowling |
| 2 | “There are only two ways to live your life. On... | author: Albert Einstein | inspirational,life,live,miracle,miracles | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | author: Jane Austen | aliteracy,books,classic,humor | http://quotes.toscrape.com/author/Jane-Austen |
| 4 | “Imperfection is beauty, madness is genius and... | author: Marilyn Monroe | be-yourself,inspirational | http://quotes.toscrape.com/author/Marilyn-Monroe |
| 5 | “Try not to become a man of success. Rather be... | author: Albert Einstein | adulthood,success,value | http://quotes.toscrape.com/author/Albert-Einstein |
| 6 | “It is better to be hated for what you are tha... | author: André Gide | life,love | http://quotes.toscrape.com/author/Andre-Gide |
| 7 | “I have not failed. I've just found 10,000 way... | author: Thomas A. Edison | edison,failure,inspirational,paraphrased | http://quotes.toscrape.com/author/Thomas-A-Edison |
| 8 | “A woman is like a tea bag; you never know how... | author: Eleanor Roosevelt | misattributed-eleanor-roosevelt | http://quotes.toscrape.com/author/Eleanor-Roos... |
| 9 | “A day without sunshine is like, you know, nig... | author: Steve Martin | humor,obvious,simile | http://quotes.toscrape.com/author/Steve-Martin |


In [18]:
## Re-order columns:

bsPagedf = bsPagedf[['Author', 'Quote', 'About URL', 'Tags']]
bsPagedf.head()

| | Author | Quote | About URL | Tags |
|---|---|---|---|---|
| 0 | author: Albert Einstein | “The world as we have created it is a process ... | http://quotes.toscrape.com/author/Albert-Einstein | change,deep-thoughts,thinking,world |
| 1 | author: J.K. Rowling | “It is our choices, Harry, that show what we t... | http://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
| 2 | author: Albert Einstein | “There are only two ways to live your life. On... | http://quotes.toscrape.com/author/Albert-Einstein | inspirational,life,live,miracle,miracles |
| 3 | author: Jane Austen | “The person, be it gentleman or lady, who has ... | http://quotes.toscrape.com/author/Jane-Austen | aliteracy,books,classic,humor |
| 4 | author: Marilyn Monroe | “Imperfection is beauty, madness is genius and... | http://quotes.toscrape.com/author/Marilyn-Monroe | be-yourself,inspirational |


### Analyze Data
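
One simple analysis is counting how often each tag appears. The sketch below uses a short made-up list in place of the scraped `tags` list, and assumes each entry is either None or a comma-separated string like the values in the Tags column above.

```python
from collections import Counter

# Made-up sample standing in for the scraped `tags` list
sample_tags = ['life,love', 'humor,life', None, 'inspirational,life,love']

tag_counts = Counter(
    tag
    for entry in sample_tags
    if entry                      # skip None or empty strings
    for tag in entry.split(',')   # one string per quote -> individual tags
)

# Two most frequent tags with their counts
print(tag_counts.most_common(2))
```

The same counts could also be produced with pandas; `Counter` is used here only because it needs nothing beyond the standard library.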

### Write to External File

In [26]:
bsPagedf.to_csv(savePath + '\\' + 'quotetoscrape_tutorial2.csv', index = False, encoding = 'utf-8')