# Lesson 5.02 - Web Scraping using Beautiful Soup

### Why Web Scraping?
    
- Web scraping is a popular way to gather data online.
- With so much data available on the web, knowing how to extract it efficiently and reliably is a valuable skill.

## Install lxml
#### What is lxml?
1. lxml is a feature-rich, easy-to-use library for processing XML and HTML in Python.
2. It is also fast and memory-efficient.

More information is available in the [lxml repository on GitHub](https://github.com/lxml/lxml).

#### Installing lxml
1. Launch the Anaconda Command Prompt
2. Run `conda install lxml`
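
One way to confirm that the install succeeded is to ask Python whether it can locate the package. The sketch below uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of lxml or Anaconda.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# 'lxml' is the import name of the parser installed above
print("lxml available:", is_installed("lxml"))
```

If this prints `False`, the package was probably installed into a different environment from the one running the notebook.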

## Scrape a single page

### Import Packages

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Set up requests library to retrieve data from target website

In [2]:
# Indicate base url
url = 'http://quotes.toscrape.com/'
response = requests.get(url)
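
A request can fail or hang, so it may help to wrap the call in a small helper that sets a timeout and raises on HTTP error codes. This is a minimal sketch; the function name `fetch_html` and the 10-second default are assumptions chosen for illustration, not part of the lesson.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html('http://quotes.toscrape.com/')` would return the same text as `response.text` above, but a missing page would surface as an exception rather than as an error page passed on to the parser.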

### Create a BeautifulSoup object

In [3]:
soup = BeautifulSoup(response.text,'lxml')

### `soup.find()`

Returns either:

1. A `Tag` object for the first match
2. `None` if nothing matches

In [4]:
quote = soup.find("span", class_="text")

In [5]:
quote.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
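
Because `find()` returns `None` when nothing matches, calling `.text` on the result can raise an `AttributeError`. The sketch below shows one way to guard against that. It runs on a tiny stand-in HTML string rather than the live site, uses Python's built-in `html.parser` so it does not depend on lxml, and the helper name `text_or_default` is hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page; the real site's markup is larger
sample_html = '<div><span class="text">Hello</span></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")


def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() returned None."""
    return tag.text if tag is not None else default


found = sample_soup.find("span", class_="text")
missing = sample_soup.find("span", class_="no-such-class")

print(text_or_default(found))
print(text_or_default(missing, default="(not found)"))
```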

### `soup.find_all()`

Returns a **_LIST_** of all the `Tag` objects that match your query. The list is empty if nothing matches.

In [6]:
# Store the quotes and their corresponding authors and tags in separate variables
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")
tags = soup.find_all("div", class_="tags")

# Walk the three lists in parallel and print each quote, its author and its tags
for quoteItem, authorItem, tagBlock in zip(quotes, authors, tags):
    print(quoteItem.text)
    print(authorItem.text)
    for quoteTag in tagBlock.find_all('a', class_='tag'):
        print(quoteTag.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Albert Einstein
change
deep-thoughts
thinking
world
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
J.K. Rowling
abilities
choices
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Albert Einstein
inspirational
life
live
miracle
miracles
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Jane Austen
aliteracy
books
classic
humor
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Marilyn Monroe
be-yourself
inspirational
“Try not to become a man of success. Rather become a man of value.”
Albert Einstein
adulthood
success
value
“It is better to be hated for what you are than to be loved for what you are not.”
André Gide
life
love
“I have not fa
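
Matching quotes, authors and tags by position works only while the three lists stay the same length and in the same order. An alternative is to find each quote's container first and then search inside it. The sketch below assumes every quote sits in its own `<div class="quote">` wrapper; the stand-in HTML is invented for illustration and parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Stand-in markup: one <div class="quote"> per quote (an assumption)
sample_html = """
<div class="quote">
  <span class="text">Quote one</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag">x</a><a class="tag">y</a></div>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <small class="author">Author B</small>
  <div class="tags"></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for block in sample_soup.find_all("div", class_="quote"):
    # Searching inside the block keeps each quote with its own author and tags
    records.append({
        "quote": block.find("span", class_="text").text,
        "author": block.find("small", class_="author").text,
        "tags": [a.text for a in block.find_all("a", class_="tag")],
    })

for record in records:
    print(record)
```

A quote without tags then yields an empty list instead of shifting the tags of later quotes out of alignment.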

## Scrape multiple pages (Pagination)

### Print the data from Page 1

In [7]:
# Set the base URL
url = 'https://scrapingclub.com/exercise/list_basic/'

# Set a counter for numbering the items across all pages
count = 1

# Set up requests library to retrieve data from target website
response = requests.get(url)

# Instantiate BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')

# Return a list of soup objects that are div tags with class = col-lg-4 col-md-6 mb-4 
items = soup.find_all('div', class_='col-lg-4 col-md-6 mb-4')

# Iterate through each Soup object
for i in items:
    
    # Retrieve the item name, which is enclosed in an h4 tag with the card-title class.
    # Strip any leading or trailing whitespace, including blank lines
    itemName = i.find('h4', class_='card-title').text.strip()

    # Retrieve the item price, which is enclosed in an h5 tag.
    # Strip any leading or trailing whitespace, including blank lines
    itemPrice = i.find('h5').text.strip()
    
    # Display the Item Number, Price and Name
    print('%s) Price: %s , Item Name: %s' % (count, itemPrice, itemName))
    
    # Increment the count to move to the next item
    count = count + 1

1) Price: $24.99 , Item Name: Short Dress
2) Price: $29.99 , Item Name: Patterned Slacks
3) Price: $49.99 , Item Name: Short Chiffon Dress
4) Price: $59.99 , Item Name: Off-the-shoulder Dress
5) Price: $24.99 , Item Name: V-neck Top
6) Price: $49.99 , Item Name: Short Chiffon Dress
7) Price: $24.99 , Item Name: V-neck Top
8) Price: $24.99 , Item Name: V-neck Top
9) Price: $59.99 , Item Name: Short Lace Dress
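
The same extraction logic is needed again for the remaining pages, so it can be convenient to wrap it in a function. The sketch below is one possible shape, not the lesson's own code: the name `parse_items` is hypothetical, the stand-in HTML only imitates the product cards described above, and `html.parser` is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup


def parse_items(page_soup):
    """Return a list of {'itemName', 'itemPrice'} dicts found in one page."""
    results = []
    for card in page_soup.find_all("div", class_="col-lg-4 col-md-6 mb-4"):
        name_tag = card.find("h4", class_="card-title")
        price_tag = card.find("h5")
        # Skip cards that lack either field instead of raising AttributeError
        if name_tag is None or price_tag is None:
            continue
        results.append({
            "itemName": name_tag.text.strip(),
            "itemPrice": price_tag.text.strip(),
        })
    return results


# Stand-in HTML shaped like the product cards; the second card has no name
sample_html = """
<div class="col-lg-4 col-md-6 mb-4">
  <h4 class="card-title"><a>Short Dress</a></h4><h5>$24.99</h5>
</div>
<div class="col-lg-4 col-md-6 mb-4"><h5>$9.99</h5></div>
"""
sample_items = parse_items(BeautifulSoup(sample_html, "html.parser"))
print(sample_items)
```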


### Print the data from remaining pages

In [8]:
# Retrieve the pagination object encompassed within the ul tag
pagination = soup.find('ul', class_='pagination')

# Retrieve the list of pages within tags of page-link class from the pagination ul tag
pages = pagination.find_all('a', class_='page-link')

# Create a list for storing the URL of each page
urls = []

# Iterate through the list of pages
for page in pages:
    
    # Keep only the links whose text is a page number.
    # This skips other items found in the pagination object, such as Next
    if page.text.isdigit():

        # Retrieve the URL of each page from the value of its href attribute
        link = page.get('href')

        # Add each page URL to the urls list
        urls.append(link)
        
# Iterate through the list of urls
for pageUrl in urls:
    # Retrieve the text of each page and store it in the response variable
    # url = base url and pageUrl = the page's query string, e.g. ?page=2
    response = requests.get(url + pageUrl)

    # Create a BeautifulSoup object using the data from each page
    soup = BeautifulSoup(response.text, 'lxml')

    # Return a list of soup objects that are div tags with class = col-lg-4 col-md-6 mb-4
    items = soup.find_all('div', class_='col-lg-4 col-md-6 mb-4')

    # Iterate through each Soup object
    for item in items:

        # Retrieve the item name, which is enclosed in an h4 tag with the card-title class.
        # Strip any leading or trailing whitespace, including blank lines
        itemName = item.find('h4', class_='card-title').text.strip()

        # Retrieve the item price, which is enclosed in an h5 tag.
        # Strip any leading or trailing whitespace, including blank lines
        itemPrice = item.find('h5').text.strip()

        # Display the Item Number, Price and Name
        print('%s) Price: %s , Item Name: %s' % (count, itemPrice, itemName))

        # Increment the count to move to the next item
        count = count + 1

10) Price: $34.99 , Item Name: Fitted Dress
11) Price: $69.99 , Item Name: V-neck Jumpsuit
12) Price: $54.99 , Item Name: Chiffon Dress
13) Price: $39.99 , Item Name: Skinny High Waist Jeans
14) Price: $19.99 , Item Name: Super Skinny High Jeans
15) Price: $19.99 , Item Name: Oversized Denim Jacket
16) Price: $24.99 , Item Name: Short Sweatshirt
17) Price: $12.99 , Item Name: Long-sleeved Jersey Top
18) Price: $39.99 , Item Name: Skinny High Waist Jeans
19) Price: $24.99 , Item Name: Short Sweatshirt
20) Price: $12.99 , Item Name: Long-sleeved Jersey Top
21) Price: $12.99 , Item Name: Long-sleeved Jersey Top
22) Price: $19.99 , Item Name: Jersey Dress
23) Price: $24.99 , Item Name: Short Sweatshirt
24) Price: $24.99 , Item Name: Crinkled Flounced Blouse
25) Price: $29.99 , Item Name: Bib Overall Dress
26) Price: $17.99 , Item Name: Loose-knit Sweater
27) Price: $29.99 , Item Name: Skinny Regular Jeans
28) Price: $12.99 , Item Name: Henley-style Top
29) Price: $17.99 , Item Name: Jogger
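
Concatenating the base URL and the `href` value works when the link is a bare query string such as `?page=2`, but it breaks if the site uses absolute paths. The standard library's `urljoin` handles both cases. The sketch below is a hedged illustration: `build_page_urls` is a made-up helper, and the `(text, href)` pairs are hypothetical stand-ins for what the pagination block might contain.

```python
from urllib.parse import urljoin


def build_page_urls(base_url, links):
    """Return absolute URLs for the (text, href) pairs whose text is a page number."""
    return [urljoin(base_url, href) for text, href in links if text.isdigit()]


base = "https://scrapingclub.com/exercise/list_basic/"

# Hypothetical (link text, href) pairs, one relative and one absolute path
sample_links = [
    ("2", "?page=2"),
    ("3", "/exercise/list_basic/?page=3"),
    ("Next", "?page=2"),
]

for page_url in build_page_urls(base, sample_links):
    print(page_url)
```

The `Next` entry is dropped by the `isdigit()` check, which mirrors the filter used in the loop above.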

## Store scraped data in Data Frame

In [9]:
# Instantiate a list to store all the Products
products = []

# Set the base URL
url = 'https://scrapingclub.com/exercise/list_basic/'

# Set a counter for numbering the items
count = 1

# Set up requests library to retrieve data from target website
response = requests.get(url)

# Instantiate BeautifulSoup object
soup = BeautifulSoup(response.text, 'lxml')

# Return a list of soup objects that are div tags with class = col-lg-4 col-md-6 mb-4 
items = soup.find_all('div', class_='col-lg-4 col-md-6 mb-4')

# Iterate through each Soup object
for i in items:
    
    # Retrieve the item name, which is enclosed in an h4 tag with the card-title class.
    # Strip any leading or trailing whitespace, including blank lines
    itemName = i.find('h4', class_='card-title').text.strip()

    # Retrieve the item price, which is enclosed in an h5 tag.
    # Strip any leading or trailing whitespace, including blank lines
    itemPrice = i.find('h5').text.strip()
   
    # Instantiate a new product Dictionary
    product = {}
    
    # Store the Item Number, Name and Price
    product['itemNumber'] = count
    product['itemName'] = itemName
    product['itemPrice'] = itemPrice
    
    # Add the product to the Products List
    products.append(product)
  
    # Test if the product has been added successfully
    print(product)
    
    # Increment the count to move to the next item
    count = count + 1

{'itemNumber': 1, 'itemName': 'Short Dress', 'itemPrice': '$24.99'}
{'itemNumber': 2, 'itemName': 'Patterned Slacks', 'itemPrice': '$29.99'}
{'itemNumber': 3, 'itemName': 'Short Chiffon Dress', 'itemPrice': '$49.99'}
{'itemNumber': 4, 'itemName': 'Off-the-shoulder Dress', 'itemPrice': '$59.99'}
{'itemNumber': 5, 'itemName': 'V-neck Top', 'itemPrice': '$24.99'}
{'itemNumber': 6, 'itemName': 'Short Chiffon Dress', 'itemPrice': '$49.99'}
{'itemNumber': 7, 'itemName': 'V-neck Top', 'itemPrice': '$24.99'}
{'itemNumber': 8, 'itemName': 'V-neck Top', 'itemPrice': '$24.99'}
{'itemNumber': 9, 'itemName': 'Short Lace Dress', 'itemPrice': '$59.99'}


In [10]:
# Store the list of product dictionaries in a Data Frame
df = pd.DataFrame(products)

# Display all data in the Data Frame
df

|   | itemNumber | itemName | itemPrice |
|---|---|---|---|
| 0 | 1 | Short Dress | $24.99 |
| 1 | 2 | Patterned Slacks | $29.99 |
| 2 | 3 | Short Chiffon Dress | $49.99 |
| 3 | 4 | Off-the-shoulder Dress | $59.99 |
| 4 | 5 | V-neck Top | $24.99 |
| 5 | 6 | Short Chiffon Dress | $49.99 |
| 6 | 7 | V-neck Top | $24.99 |
| 7 | 8 | V-neck Top | $24.99 |
| 8 | 9 | Short Lace Dress | $59.99 |
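
The scraped prices are stored as text such as `$24.99`, so they cannot be averaged or sorted numerically until they are converted. The sketch below shows one way to do that with pandas string methods. It uses a two-row stand-in list rather than the full scrape, and the column name `priceValue` is an assumption chosen for illustration.

```python
import pandas as pd

# A small stand-in for the scraped products list
sample_products = [
    {"itemNumber": 1, "itemName": "Short Dress", "itemPrice": "$24.99"},
    {"itemNumber": 2, "itemName": "Patterned Slacks", "itemPrice": "$29.99"},
]
sample_df = pd.DataFrame(sample_products)

# Drop the dollar sign (as a literal, not a regex) and convert the text to float
sample_df["priceValue"] = (
    sample_df["itemPrice"].str.replace("$", "", regex=False).astype(float)
)

# Numeric operations such as mean() now work on the new column
print(sample_df["priceValue"].mean())
```

Keeping the original `itemPrice` column alongside the numeric one makes it easy to check the conversion by eye.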
