## Beautiful Soup Tutorial

#### Corey Schafer 
#### https://www.youtube.com/watch?v=ng2o98k983k

#### libraries to install:
-  beautifulsoup4
-  lxml
-  requests


#### This tutorial uses http://books.toscrape.com/ and http://quotes.toscrape.com/ as the sample websites to scrape


- There are three sections to this notebook:
    1. using a local html file
    2. using requests package to directly access a webpage 
        - includes error handling
        - write results to file
    3. using requests package to directly access multiple pages manually specified
        - includes error handling
        - access multiple pages
        - write results to file

In [182]:
from bs4 import BeautifulSoup
import requests
import csv

### Section 1: Use Local HTML File

- uses http://books.toscrape.com/ 

In [6]:
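## First, the Books to Scrape website was saved locally as an
##   HTML file. This part of the tutorial opens it from
##   the local dir.

filename = 'Books_to_Scrape.html'
path = r'D:\OneDrive - QJA\My Files\DataScience\DataSets'  # raw string keeps the backslashes literal

## the page is UTF-8 encoded; without an explicit encoding,
##   Windows may decode '£' as 'Â£'
with open(path + '\\' + filename, encoding='utf-8') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')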
    
#print(soup.prettify())

match = soup.title  # the full <title> tag, markup included
match = soup.title.text  # just the text inside the tag
print(match)


    All products | Books to Scrape - Sandbox
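
The difference between a tag object and its text can be checked on a small inline document. This is only a sketch: the HTML string is made up to stand in for the saved page, and it uses Python's built-in `html.parser` instead of `lxml` so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the saved page (assumption for illustration)
html = ("<html><head><title>All products | Books to Scrape - Sandbox</title>"
        "</head><body></body></html>")
soup_demo = BeautifulSoup(html, "html.parser")

title_tag = soup_demo.title        # a Tag object, markup included
title_text = soup_demo.title.text  # only the string inside the tag

print(type(title_tag).__name__)
print(title_tag.name)
print(title_text)
```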



In [7]:
## produce div tags

## this will produce the first div tag on the page
match = soup.div
print(match)

<div class="page_inner">
<div class="row">
<div class="col-sm-8 h1"><a href="index.html">Books to Scrape</a><small> We love being scraped!</small>
</div>
</div>
</div>


In [13]:
## Use the find() method:

## since we don't always want first tag on page,
##   need to be able to specify diff tag locations

## with find() method, we can pass in arguments to narrow
##   down which tags we want


match = soup.find('div')  # this will produce same as soup.div

## however, we can specify the div class to get a specific div
##   note the underscore (since class is a reserved keyword in Python)
match = soup.find('div', class_ = 'page_inner')

## another example
match = soup.find('div', class_ = 'container-fluid page')
print(match)


<div class="container-fluid page">
<div class="page_inner">
<ul class="breadcrumb">
<li>
<a href="index.html">Home</a>
</li>
<li class="active">All products</li>
</ul>
<div class="row">
<aside class="sidebar col-sm-4 col-md-3">
<div id="promotions_left">
</div>
<div class="side_categories">
<ul class="nav nav-list">
<li>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<ul>
<li>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>
</li>
<li>
<a href="catalogue/category/books/historical-fiction_4/index.html">
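
One detail worth knowing when passing a multi-class string such as `'container-fluid page'` to `class_`: Beautiful Soup compares it with the exact value of the `class` attribute, so the class order matters. The sketch below shows this on a made-up two-div document (not the real page) and uses `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<div class="page_inner">first</div>
<div class="container-fluid page">second</div>
"""
demo = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that carries that class
by_one_class = demo.find("div", class_="page")

# A string with several classes must equal the class attribute exactly
by_exact_string = demo.find("div", class_="container-fluid page")
reordered = demo.find("div", class_="page container-fluid")  # no match

# A CSS selector matches both classes regardless of their order
by_selector = demo.select_one("div.page.container-fluid")

print(by_one_class.text, by_exact_string.text, reordered, by_selector.text)
```

If the class order on a page is not guaranteed, `select_one()` or a single class name is the safer choice.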
                            
        

In [92]:
## Get multiple things from a page
## In this case, we will get the html address and the book title
##    from a book image on the page
##
## Then we can later create a loop to do for each one

## use var because classtype is long
classtype = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'

## instead of div, 'li' is the tag with class specified above
im_cont = soup.find('li', class_ = classtype)
#print(im_cont)


## using im_cont we get the anchor tag (a) information
## alternatively you could chain this onto the call above: soup.find('li', class_ = classtype).h3.a
## h3 is a level-3 heading and a is the anchor
im_cont_info = im_cont.h3.a
print(im_cont_info)


## the anchor tag carries several attributes.  We want the
##   href since it holds the relative path to the book's page
url = im_cont_info.get('href')
print(url)

## similarly, get the title attribute of the anchor tag
title = im_cont_info.get('title')
print(title)



<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
catalogue/a-light-in-the-attic_1000/index.html
A Light in the Attic
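
The `href` printed above is a relative path, so it has to be combined with the site address before it can be requested. A minimal sketch using the standard library's `urljoin`, assuming the saved page corresponds to the site root `http://books.toscrape.com/`:

```python
from urllib.parse import urljoin

# Assumption: the saved page was the site's root index page
base_url = "http://books.toscrape.com/"
relative_href = "catalogue/a-light-in-the-attic_1000/index.html"

full_url = urljoin(base_url, relative_href)
print(full_url)
```

`urljoin` also handles root-relative links (those starting with `/`), which makes it less error-prone than joining strings by hand.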


In [90]:
## now use find_all to get all the tags that match
##   the arguments you specify, using a loop

## using same code from above

classtype = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'

for imcont in soup.find_all('li', class_ = classtype):


    htmlName = imcont.h3.a.get('href')
    print(htmlName)
    titleText = imcont.h3.a.get('title')
    print(titleText)
    
    print() # puts blank line between each one

catalogue/a-light-in-the-attic_1000/index.html
A Light in the Attic

catalogue/tipping-the-velvet_999/index.html
Tipping the Velvet

catalogue/soumission_998/index.html
Soumission

catalogue/sharp-objects_997/index.html
Sharp Objects

catalogue/sapiens-a-brief-history-of-humankind_996/index.html
Sapiens: A Brief History of Humankind

catalogue/the-requiem-red_995/index.html
The Requiem Red

catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
The Dirty Little Secrets of Getting Your Dream Job

catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull

catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics

catalogue/the-black-maria_991/index.html
The Black Maria

catal

In [114]:
## going to do the same thing but this time,
##   get the name of the book, rating and price
##   which can be compiled to get a full 
##   listing 

## since name and price are under multiple tags,
##   the code is slightly more complicated where
##   a second '.find' is needed


## var for class type
classtype = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'

## remember 'soup' is the parsed BeautifulSoup object for the page
## the info is in the article tag under li class = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'
for article in soup.find_all('li', class_ = classtype):
    #print(im_cont)

    bkTitle = article.h3.a.get('title')
    print(bkTitle)
    bkRate = article.p.get('class')[1]  # there are 2 elements in the class
    print(bkRate)
    bkPrice = article.find('div', class_ = 'product_price').p.text
    print(bkPrice)

    print()  # print space between each


A Light in the Attic
Three
£51.77

Tipping the Velvet
One
£53.74

Soumission
One
£50.10

Sharp Objects
Four
£47.82

Sapiens: A Brief History of Humankind
Five
£54.23

The Requiem Red
One
£22.65

The Dirty Little Secrets of Getting Your Dream Job
Four
£33.34

The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Three
£17.93

The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Four
£22.60

The Black Maria
One
£52.15

Starving Hearts (Triangular Trade Trilogy, #1)
Two
£13.99

Shakespeare's Sonnets
Four
£20.66

Set Me Free
Five
£17.46

Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Five
£52.29

Rip it Up and Start Again
Five
£35.02

Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Three
£57.25

Olio
One
£23.88

Mesaerion: The Best Science Fiction Stories 1800-1849
One
£37.59

Libertarianism for Beginners
Two
£51.33

It's Only the Himalayas
Two
£45.17
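
To compile these values into a listing that can be sorted or filtered, the rating word and the price string usually need converting to numbers. The helpers below are hypothetical (they are not part of the tutorial) and only sketch one way to do it. The price parser keeps just the numeric part, so it ignores whatever characters come before the digits.

```python
import re

# Hypothetical lookup table for the rating class names seen in the output
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(word):
    """Map a rating class name such as 'Three' to a number; None if unknown."""
    return RATING_WORDS.get(word)

def price_to_float(price_text):
    """Pull the numeric part out of a price string such as '£51.77'."""
    found = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(found.group()) if found else None

print(rating_to_int("Three"))
print(price_to_float("£51.77"))
```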


In [None]:
## Future work on this: need to learn how to go to new pages, ie: 2-50
##    and get all info from each page
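
One possible approach to the multi-page problem is to follow the page's "next" link until there is none. The sketch below assumes the pager markup looks like `<li class="next"><a href="...">`; that selector, the helper name `next_page_url`, and the HTML fragments are assumptions for illustration and should be checked against the real page source.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(page_html, current_url):
    """Return the absolute URL of the 'next' link, or None on the last page."""
    page = BeautifulSoup(page_html, "html.parser")
    link = page.select_one("li.next a")  # assumed pager markup
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])

# Made-up pager fragments for illustration
first_page = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

print(next_page_url(first_page, "http://books.toscrape.com/index.html"))
print(next_page_url(last_page, "http://books.toscrape.com/catalogue/page-50.html"))
```

A scraping loop would fetch a page, extract the books, call `next_page_url`, and stop when it returns `None`.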

### Section 2: Access Website Directly

- uses http://quotes.toscrape.com/

In [123]:
## requests.get sends a request to the website to access it
##   add .text to retrieve the response body as a string

source = requests.get('http://quotes.toscrape.com').text
#print(source)
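
The request above assumes the site answers normally. A hedged sketch of the same call with basic error handling follows; the helper name `fetch_html` and the 10-second timeout are assumptions, not part of the tutorial.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")
        return None
    return response.text

# A URL without a scheme is rejected before any network traffic,
# so this call reports the error and returns None
result = fetch_html("quotes.toscrape.com")
print(result)
```

With a valid address, `fetch_html('http://quotes.toscrape.com')` would return the same text as `requests.get(...).text`.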

In [124]:
## Read the get request and use prettify to see the html source
booksoup = BeautifulSoup(source, 'lxml')
print(booksoup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

#### Before we get started, review this example of finding the number of items of interest on a page

- NOTE: this isn't incorporated into the code below but is noted here as fyi

- You need to identify the class that is associated with your object of interest, use find_all, and then you can count them

```
bookpages = booksoup.find_all('div', class_ = 'quote')
print(type(bookpages))
print(len(bookpages))
```

In [194]:
## set base URL to be added to partial address
parentURL = 'http://quotes.toscrape.com'

## set path for saving csv output file (raw string keeps backslashes literal)
path = r'D:\OneDrive - QJA\My Files\DataScience\DataSets'

## open the csv file in a with block so it is closed even if the loop fails;
##   newline='' lets the csv module control line endings
with open(path + '\\' + 'quotetoscrape_tutorial.csv', 'w',
          newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file, lineterminator = '\n')
    csv_writer.writerow(['Quote', 'Author', 'Tags', 'About URL'])

    ## SCRAPE CODE
    for box in booksoup.find_all('div', class_ = 'quote'):
        #print(box)

        quote = box.find('span', class_ = 'text').text
        print(quote)

        # for demo purposes, use split to extract part of a tag
        authraw = box.a.get('href').split('/')[1] + ": "

        auth = authraw + box.find('small', class_ = 'author').text
        print(auth)

        tags = box.find('meta', class_ = 'keywords').get('content')
        print(tags)

        ## get about link:
        ## exception handling if about link not present
        try:
            # add parent URL to get full link; about only set if successful
            about = parentURL + box.a.get('href')
        except (AttributeError, TypeError):
            about = None

        print(about)

        ## write each output to a new row in csv file
        csv_writer.writerow([quote, auth, tags, about])

        print() # print blank line after each iteration for ease of reading

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
author: Albert Einstein
change,deep-thoughts,thinking,world
http://quotes.toscrape.com/author/Albert-Einstein

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
author: J.K. Rowling
abilities,choices
http://quotes.toscrape.com/author/J-K-Rowling

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
author: Albert Einstein
inspirational,life,live,miracle,miracles
http://quotes.toscrape.com/author/Albert-Einstein

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
author: Jane Austen
aliteracy,books,classic,humor
http://quotes.toscrape.com/author/Jane-Austen

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
author: Marilyn Monroe
be-yourself,insp
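
Both the quotes and the tag strings contain commas, so it is worth checking that they survive the trip through a csv file. The sketch below writes one row taken from the output above to an in-memory buffer (a stand-in for the real file, used here only so the example needs no disk path) and reads it back.

```python
import csv
import io

rows = [
    ["Quote", "Author", "Tags", "About URL"],
    ["“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
     "author: J.K. Rowling", "abilities,choices",
     "http://quotes.toscrape.com/author/J-K-Rowling"],
]

buffer = io.StringIO()  # in-memory stand-in for the csv file
csv.writer(buffer, lineterminator="\n").writerows(rows)

buffer.seek(0)
read_back = list(csv.reader(buffer))

print(len(read_back[1]))            # number of fields in the data row
print(read_back[1][2].split(","))   # tags turned into a list
```

The csv module quotes any field that contains a comma, which is why the row still has four fields after reading it back.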

### Section 3: Access Website Directly and Scrape Multiple Pages

- uses http://quotes.toscrape.com/