# Project 8 - Web scraping

This notebook contains a set of tasks for learning web scraping. The packages used include [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) and [Selenium](https://selenium-python.readthedocs.io/).

In [2]:
# relevant modules
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup  # explicit import instead of a wildcard import
import os
import lxml

## Task 1 

- Use the requests library and BeautifulSoup to connect to Quotes to Scrape and get the HTML text from the homepage.
- Get the names of all the authors on the first page.
- Create a list of all the quotes on the first page.

**Website Text**

In [3]:
# obtaining raw text from website

result = requests.get('http://quotes.toscrape.com/')

type(result)

result.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <sp

In [5]:
# using BeautifulSoup to get a reader-friendly version of the scraped content

# 'html.parser' is Python's built-in parser, the same one used in the later cells
soup = bs4.BeautifulSoup(result.text, 'html.parser')

# separator='\n' puts each block of text on its own line;
# strip=True removes leading and trailing whitespace from each block
text = soup.get_text(separator='\n', strip=True)

print(text)

Quotes to Scrape
Quotes to Scrape
Login
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by
Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by
J.K. Rowling
(about)
Tags:
abilities
choices
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by
Albert Einstein
(about)
Tags:
inspirational
life
live
miracle
miracles
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by
Jane Austen
(about)
Tags:
aliteracy
books
classic
humor
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by
Marilyn Monroe
(about)
Tags:
be-yourself
inspirational
“Try not to become a man of success. Rather become a man of value.”
by
Albert Einstein
(about)
Tags:
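
To make the effect of `separator` and `strip` easier to see, here is a minimal sketch on a small inline snippet rather than the live page. The HTML is a hypothetical stand-in loosely modelled on one quote block, so the exact whitespace on the real site may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet loosely modelled on one quote block from the page
html = """
<div class="quote">
    <span class="text">Some quote.</span>
    <span>by <small class="author">Some Author</small></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Default call: text nodes are concatenated as-is, indentation included
raw_text = soup.get_text()

# separator joins the text nodes; strip=True trims each node and drops empty ones
clean_text = soup.get_text(separator="\n", strip=True)

print(repr(raw_text))
print(clean_text)
```

The second call yields one line per text node, which is why "by" and the author name land on separate lines in the output above.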


**Authors' names**

In [6]:
# obtaining the authors on the first page

base_url = 'https://quotes.toscrape.com/page/1/'

result_a = requests.get(base_url)

soup = bs4.BeautifulSoup(result_a.text, 'html.parser')

In [9]:
# using the browser's Inspect tool, I can see which element to pass to select(). The element is: <small class="author" itemprop="author">J.K. Rowling</small>

authors = soup.select('small.author')

In [10]:
# raw result from the soup-processed request 
authors

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

In [11]:
# extracting the text content from each author element and storing it in a list
author_names = [author.get_text() for author in authors]

# print the list of author names
for name in author_names:
    print(name)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
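
The list above contains repeated names because it has one entry per quote. As a small sketch using only the standard library, the names can be counted or de-duplicated while keeping page order. The list is copied by hand from the output above rather than scraped again.

```python
from collections import Counter

# Author names copied from the first-page output above
author_names = [
    "Albert Einstein", "J.K. Rowling", "Albert Einstein", "Jane Austen",
    "Marilyn Monroe", "Albert Einstein", "André Gide", "Thomas A. Edison",
    "Eleanor Roosevelt", "Steve Martin",
]

# Count how many quotes each author has on the page
author_counts = Counter(author_names)

# dict.fromkeys keeps the first occurrence of each name, preserving page order
unique_in_order = list(dict.fromkeys(author_names))

print(author_counts.most_common(1))
print(unique_in_order)
```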


**Quotes**

In [21]:
# getting the quotes on the first page of the website

# the request and parsing steps are the same as above; only the CSS selector changes

base_url = 'https://quotes.toscrape.com/page/1/'

result_a = requests.get(base_url)

soup = bs4.BeautifulSoup(result_a.text, 'html.parser')

quotes = soup.select('span.text')


In [23]:
quotes

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

In [22]:
quote_texts = [quote.get_text() for quote in quotes]

# Print the list of quotes
for text in quote_texts:
    print(text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
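
The authors and quotes above are collected as two separate lists. One possible way to keep each quote tied to its author is to loop over the enclosing `div.quote` blocks and select both elements inside each one. This is only a sketch: the inline HTML is a hypothetical two-quote page that reuses the class names seen in the scraped markup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-quote page using the class names seen in the scraped HTML
html = """
<div class="quote">
    <span class="text">First quote.</span>
    <span>by <small class="author">Author A</small></span>
</div>
<div class="quote">
    <span class="text">Second quote.</span>
    <span>by <small class="author">Author B</small></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for block in soup.select("div.quote"):
    # select_one returns the first match inside this block, or None
    text_el = block.select_one("span.text")
    author_el = block.select_one("small.author")
    if text_el is None or author_el is None:
        continue  # skip blocks that lack either element
    pairs.append((author_el.get_text(strip=True), text_el.get_text(strip=True)))

for author, quote in pairs:
    print(f"{author}: {quote}")
```

Selecting inside each block avoids relying on the two lists having the same length and order.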


## Task 2

- Use requests and BeautifulSoup to extract the top ten tags shown on the top right of the homepage (e.g., Love, Inspirational, Life, etc.).

- Using loops, get all the unique authors on the website (across all pages).


**Tags**

In [50]:
# the code for this task is similar to the above, but the CSS selector changes from "span.text" to the one that matches the top tags

base_url = 'https://quotes.toscrape.com/'

result_a = requests.get(base_url)

soup = bs4.BeautifulSoup(result_a.text, 'html.parser')

tags = soup.select('span.tag-item a.tag')


In [51]:
tags

[<a class="tag" href="/tag/love/" style="font-size: 28px">love</a>,
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>,
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>,
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>,
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>,
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>,
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>,
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>,
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>,
 <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>]

In [52]:
tag_text = [tag.get_text() for tag in tags]

# Print the list of tags
for tag in tag_text:
    print(tag)

love
inspirational
life
humor
books
reading
friendship
friends
truth
simile
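
Each tag element above also carries a relative `href`. As a hedged sketch, the links can be turned into absolute URLs with `urllib.parse.urljoin`. The inline HTML copies two of the `<a>` elements from the output above; the wrapping `span.tag-item` is assumed from the selector used in the cell, not copied from the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com/"

# Two tag links in the shape of the output above; the span wrapper is an assumption
html = """
<span class="tag-item"><a class="tag" href="/tag/love/" style="font-size: 28px">love</a></span>
<span class="tag-item"><a class="tag" href="/tag/life/" style="font-size: 26px">life</a></span>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each tag name to the absolute URL of its tag page
tag_urls = {
    a.get_text(strip=True): urljoin(base_url, a["href"])
    for a in soup.select("span.tag-item a.tag")
}

for name, url in tag_urls.items():
    print(name, url)
```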


**Unique authors - loop**

In [53]:
# defining an empty set to store unique authors
unique_authors = set()

# defining the page number
page_number = 1

# start of the while loop, with the same request and parsing steps as above
while True:
    url = f'http://quotes.toscrape.com/page/{page_number}/'
    result = requests.get(url)

    soup = bs4.BeautifulSoup(result.text, 'html.parser')

    # small.author is the CSS selector identified with Inspect on the website
    authors = soup.select('small.author')

    # add the name of each author to the set; duplicates are ignored automatically
    for author in authors:
        unique_authors.add(author.get_text())

    # the "next" pagination element decides whether the loop continues or stops
    next_page_link = soup.find('li', class_='next')

    # if the page has no "next" element, it is the last page, so the loop breaks
    if not next_page_link:
        break
    
    # page number increment for next iteration
    page_number += 1

# results are printed out from the now filled set
for author in unique_authors:
    print(author)

Helen Keller
Eleanor Roosevelt
George R.R. Martin
W.C. Fields
Jorge Luis Borges
Thomas A. Edison
Ernest Hemingway
Pablo Neruda
Ralph Waldo Emerson
J.D. Salinger
Elie Wiesel
Mark Twain
E.E. Cummings
John Lennon
George Carlin
Terry Pratchett
Haruki Murakami
Charles Bukowski
Mother Teresa
George Eliot
James Baldwin
Friedrich Nietzsche
Jimi Hendrix
Suzanne Collins
Stephenie Meyer
Albert Einstein
J.R.R. Tolkien
André Gide
Garrison Keillor
C.S. Lewis
Alfred Tennyson
Allen Saunders
William Nicholson
Dr. Seuss
Charles M. Schulz
George Bernard Shaw
J.M. Barrie
Bob Marley
Alexandre Dumas fils
Madeleine L'Engle
Khaled Hosseini
Ayn Rand
Marilyn Monroe
Martin Luther King Jr.
J.K. Rowling
Jim Henson
Douglas Adams
Harper Lee
Steve Martin
Jane Austen
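
The names above come out in no particular order because a set is unordered. The sketch below restates the same pagination idea as a function and sorts the result for stable output. To let it run without network access, the page source is passed in through `fetch_page`, a hypothetical callable that maps a page number to an HTML string; `max_pages` is an assumed safety cap and `fake_pages` is made-up stand-in data.

```python
from bs4 import BeautifulSoup


def collect_unique_authors(fetch_page, max_pages=100):
    """Collect author names page by page until no 'next' link is found.

    fetch_page is a hypothetical callable: page number -> HTML string.
    max_pages is an assumed safety cap, not something the site requires.
    """
    unique_authors = set()
    for page_number in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_page(page_number), "html.parser")
        for author in soup.select("small.author"):
            unique_authors.add(author.get_text(strip=True))
        # Stop when the page has no <li class="next"> pagination element
        if soup.find("li", class_="next") is None:
            break
    return unique_authors


# Stand-in pages so the sketch runs without network access
fake_pages = {
    1: '<small class="author">Author A</small>'
       '<li class="next"><a href="/page/2/">Next</a></li>',
    2: '<small class="author">Author B</small>'
       '<small class="author">Author A</small>',
}

authors = collect_unique_authors(lambda n: fake_pages[n])
print(sorted(authors))
```

For the real site, a fetcher such as `lambda n: requests.get(f'http://quotes.toscrape.com/page/{n}/', timeout=10).text` could be passed instead.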
