
# Know what interests famous people have by reading their quotes

![](https://i.imgur.com/Zemp7ZZ.png)

Sometimes we all need a little inspiration or advice on how to react to given
life situations, whether on how to be a valuable person, a better friend, or
how to react to something adverse. Various famous and successful people have
said things that we all can find helpful. There is a website, "Quotes to scrape",
that offers dozens, if not hundreds, of such quotes.

Web scraping is the process of gathering useful information from a website of
interest and presenting it in a meaningful way.

In this project we read in a list of quotes from famous people using the
"Quotes to scrape" website, based either on the default top quotes or on quotes
filtered by one of the following subjects:
 - love
 - inspirational
 - life
 - humor
 - books
 - reading
 - friendship
 - friends
 - truth
 - simile

Once you pick a subject of interest, a request is sent over the web and the
website to which the request was submitted returns an HTTP response document.

Information will be extracted from the document using the Python library
Beautiful Soup. Here is some general information from its documentation:

"Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating, 
searching, and modifying the parse tree. It commonly saves programmers hours 
or days of work."

We will analyze the data and report:
 - author's name (the person being quoted)
 - an 'about' link, giving information about the author
 - the text of the quote

## We will then create a dataset, storing the gathered information

Using the authors and corresponding quotes listed, create a list of dictionaries,
each one with an entry containing the author's name, a link to the author's about
page, and the quote itself. This dataset will be stored as a table in CSV format
and can be downloaded for subsequent data analysis and machine learning tasks.

In [8]:
# The Jovian platform is where this notebook was developed
# and copies of it are maintained 
#
# since this notebook's initial development is complete, we 
# will not use these cells but keep them here for reference
#!pip install jovian --upgrade --quiet

In [5]:
#import jovian

In [9]:
# Execute this to save new versions of the notebook
#jovian.commit(project="dataanalyst-bootcamp-project1-web-scraping")

## Install the libraries
 - requests allows this notebook to interact with websites
 - bs4, or Beautiful Soup allows us to parse information from HTML documents

In [10]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet # note we're using BeautifulSoup V4

## Import the packages:

In [11]:
import requests
from bs4 import BeautifulSoup

## Read in a web page from the site containing famous quotes (see https://quotes.toscrape.com)

![](https://i.imgur.com/tKxKtG7.png)

In [12]:
'''
function read_page_from_quotes_toscrape
function to get author/quote data from the website
return the response
params:
    url = base url for website we're scraping
    page = page number if we're paginating 
        NOTE: even if we're not paginating we need to supply a page number to get 
        a full amount of quotes in the text, for the first page, if any are in the
        given tag, selected or default (see below)
    (optional) tag = filter for quotes we're requesting
'''
def read_page_from_quotes_toscrape(url, page, tag=''):
    response = ''
    quotes_url = url
    if tag != '':
        quotes_url += '/tag/' + tag
    if page != '':
        quotes_url += '/page/' + str(page)
    else:
        raise Exception('enter a valid page number or tag to filter')
    quotes_url += '/'
    # for debugging: print(quotes_url)
    response = requests.get(quotes_url)

    # Check for success in reading the page
    if response.status_code != 200:
        raise Exception('Failed to load {}'.format(quotes_url))
    return response
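
The URL assembly in the function above can be checked without touching the network. The following is a minimal sketch, assuming the same `/tag/<tag>/page/<n>/` layout that the function builds. The helper name `build_quotes_url` and its default arguments are hypothetical and not part of the notebook.

```python
def build_quotes_url(base_url, page=1, tag=""):
    """Return the listing URL for a page number and an optional tag (hypothetical helper)."""
    parts = [base_url.rstrip("/")]
    if tag:
        # filtered listings live under /tag/<tag>
        parts += ["tag", tag]
    # a page number is always appended, as in the notebook's function
    parts += ["page", str(page)]
    return "/".join(parts) + "/"


print(build_quotes_url("https://quotes.toscrape.com", 1))
print(build_quotes_url("https://quotes.toscrape.com", 2, "love"))
```

Keeping the string building separate from the request makes it easy to confirm which address will be fetched before any request is sent.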

## Convert the returned web page into a Beautiful Soup document

In [13]:
'''
convert the web page to a BeautifulSoup object
'''
def parse_page_with_bs4(page):
    html_source = BeautifulSoup(page, 'html.parser')
    return html_source

## Create the list of quotes and their authors

In [14]:
'''
 function get_quotes_and_authors 
 function to scrape the data
 params:
  - bs4 document
  - base url for building author links
 returns:
  - tuple of quotes list and corresponding authors list
'''
def get_quotes_and_authors(document, base_url):
    quote_list_from_pages = []
    authors_list_from_pages = []
    # get all authors and their quotes in a list of tags
    tags = document.find_all('div', class_='quote')

    # append the quotes and author links
    for i in tags:
        # quotes
        quote = i.find('span').text
        quote_list_from_pages.append(quote)
        # authors
        author_link = i.find('a')['href']
        authors_list_from_pages.append(base_url+author_link)
        
    return (quote_list_from_pages,authors_list_from_pages)
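
To see what this extraction does without requesting a page, here is a small sketch that runs the same kind of lookups on a hand-written HTML fragment. The fragment only imitates the structure the function relies on (a `div` with class `quote`, a `span` holding the text, and an about link); the class names `text` and `author`, the sample names, and the sample quotes are assumptions for illustration. Unlike the notebook, the sketch reads the author's name from its own element rather than deriving it from the link.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the site's markup (an assumption, not a saved page)
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"First sample quote."</span>
  <span>by <small class="author">Jane Doe</small>
    <a href="/author/Jane-Doe">(about)</a>
  </span>
</div>
<div class="quote">
  <span class="text">"Second sample quote."</span>
  <span>by <small class="author">John Roe</small>
    <a href="/author/John-Roe">(about)</a>
  </span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for block in soup.find_all("div", class_="quote"):
    rows.append({
        # name read from its own element instead of the link slug
        "author": block.find("small", class_="author").get_text(strip=True),
        # the first link inside the block is the about link
        "about": "https://quotes.toscrape.com" + block.find("a")["href"],
        "quote": block.find("span", class_="text").get_text(strip=True),
    })

for row in rows:
    print(row)
```

Collecting one dictionary per quote keeps each author, link, and quote together, which avoids having to keep two parallel lists in step.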

## Get quote category from the user, request the document from the website, and create Python dictionary structures for subsequent storage into the dataset.

You will be prompted for a quote category out of the categories listed.
Once you type it in, we will
 - get the data from the website
 - create a Beautiful Soup object containing the data
 - build Python dictionaries of the authors and their quotes

This data will be ready for subsequent processing into the desired .csv file, the dataset that is output by this web scraping process.

In [15]:
#
# this is the main program. it will get the desired
# subject from the user to get related quotes, request them
# from the web site, and convert them into tabular data
instructions = 'Choose a category of quotes from the list below. For a random list of quotes, press <enter>:'
    
all_tags = [
'love',
'inspirational',
'life',
'humor',
'books',
'reading',
'friendship',
'friends',
'truth',
'simile']

print(instructions)
for tag in all_tags:
    print(tag,'\n')
    
quote_subject = input("Tag: ")

if quote_subject not in all_tags:
    quote_subject = ''

# reserve a place to save all quotes and authors retrieved
quote_list = []
author_list = []

# page number for iterating through all the pages
page_num = 1

base_url = 'https://quotes.toscrape.com'
topic_url = base_url + '/page/' + str(page_num)

# get a web page from quotes.toscrape.com
page = read_page_from_quotes_toscrape(base_url, page_num, quote_subject)
# convert the page into a BeautifulSoup object
document = parse_page_with_bs4(page.text)

# extract the first page's quotes and authors
(quotes,authors) = get_quotes_and_authors(document, base_url)
for quote in quotes:
    quote_list.append(quote)
for author in authors:
    author_list.append(author)

# see if there are more pages to scrape by checking the number of 'next' links
pager = document.find_all('li', class_='next')

while len(pager) == 1:
    # for debugging only: print('entering while loop',pager)
    page_num += 1
    page = read_page_from_quotes_toscrape(base_url, page_num, quote_subject)
    document = parse_page_with_bs4(page.text)
    (quotes,authors) = get_quotes_and_authors(document, base_url)
    for quote in quotes:
        quote_list.append(quote)
    for author in authors:
        author_list.append(author)
    pager = document.find_all('li', class_='next')
    
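# note: this loop only derives each author's display name and does not store it;
# build_dictionary_list (below) repeats the conversion when assembling the dataset
for i in range(len(author_list)):
    author = author_list[i]
    author = author[len(base_url) + len('/author/'):].strip()
    author = author.replace('-',' ')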


Choose a category of quotes from the list below. For a random list of quotes, press <enter>:
love 

inspirational 

life 

humor 

books 

reading 

friendship 

friends 

truth 

simile 

Tag: 
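
The main cell keeps requesting pages for as long as the current page contains a 'next' link. That control flow can be sketched on its own, with an in-memory stand-in for the website. The names `collect_all_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and the fetch function is assumed to return the page's items together with a flag saying whether another page follows.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(n) for n = 1, 2, ... until a page reports no next page.

    fetch_page is a hypothetical callable returning (items, has_next).
    """
    collected = []
    page_num = 1
    while True:
        items, has_next = fetch_page(page_num)
        collected.extend(items)
        if not has_next:
            break
        page_num += 1
    return collected


# Stand-in for the website: three fake pages held in memory
FAKE_PAGES = {1: ["q1", "q2"], 2: ["q3", "q4"], 3: ["q5"]}


def fake_fetch(page_num):
    # a page has a successor when the next page number exists
    return FAKE_PAGES[page_num], (page_num + 1) in FAKE_PAGES


print(collect_all_pages(fake_fetch))
```

Written this way, the first page and every later page go through the same code path, so the extraction step does not need to be repeated before the loop.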


## Build the dataset from what was just read in.

In [16]:
'''
function write_csv
given a list of dictionaries, assuming the first item is a header,
create a csv file for all items.

items consist of:
 - list of authors
 - list of quotes, corresponding to each item in the list of authors

output a .csv file at the given path
'''
def write_csv(items, path):
    """Write a list of dictionaries to a CSV file"""
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        #f.write(','.join(headers) + '\n')
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
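
Because `write_csv` joins values with plain commas, the notebook strips commas out of each quote before writing. As an alternative sketch, not the approach used in this notebook, Python's standard `csv` module quotes such fields automatically, so the original punctuation can be kept. The helper name `write_rows` and the sample row are made up for illustration.

```python
import csv
import io


def write_rows(rows, fieldnames, handle):
    """Write dictionaries as CSV, quoting any field that contains a comma."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up row for illustration only
sample_rows = [
    {"author": "Jane Doe",
     "about": "https://example.com/author/Jane-Doe",
     "quote": "One, two, three."},
]

buffer = io.StringIO()
write_rows(sample_rows, ["author", "about", "quote"], buffer)
print(buffer.getvalue())

# Reading the text back recovers the quote with its commas intact
buffer.seek(0)
restored = list(csv.DictReader(buffer))
```

When writing to a real file instead of a `StringIO` buffer, the `csv` documentation recommends opening it with `newline=''`.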

In [17]:
'''
function build_dictionary_list
given a header dictionary, list of all quoted authors,
and their quotes, build a dictionary list consisting of
 - the headers
 - the entries (author, author's about link, quote)
'''
def build_dictionary_list(headers,authors,quotes):
    
    result = []
    header_line = {
        'key1' : headers[0],
        'key2' : headers[1],
        'key3' : headers[2]
    }
    result.append(header_line)

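    # use the lists passed in as parameters rather than the notebook's globals;
    # base_url is still the value defined in the main cell above
    for i in range(len(authors)):
        author_about = authors[i]
        author = author_about[len(base_url) + len('/author/'):].strip()
        author = author.replace('-',' ')
        quote = quotes[i]
        # commas are replaced because write_csv does not quote its fields
        quote = quote.replace(',',' ')
        dict_entry = {
            'key1' : author,
            'key2' : author_about,
            'key3' : quote
        }
        result.append(dict_entry)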
    return result
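
The author's display name is derived from the last part of the about link. A minimal sketch of that conversion, taking the final path segment instead of slicing by the length of the base URL, might look like this. The helper name `author_name_from_link` is hypothetical.

```python
def author_name_from_link(about_url):
    """Turn '.../author/Albert-Einstein' into 'Albert Einstein' (hypothetical helper)."""
    # keep only the text after the final slash, ignoring a trailing slash
    slug = about_url.rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ")


print(author_name_from_link("https://quotes.toscrape.com/author/Albert-Einstein"))
print(author_name_from_link("https://quotes.toscrape.com/author/J-K-Rowling/"))
```

Either way, the name is only as accurate as the link text: punctuation that the site drops from its links cannot be recovered from them.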

In [18]:
'''
Format the data for the .CSV file

list out the headers in the top line, then pass in the authors and quotes
we'll receive a list of dictionary entries that can be saved
'''
header = ['author', 'about', 'quote']
csv_data = build_dictionary_list(header,author_list,quote_list)

# use the requested subject in the file name
if quote_subject == '':
    quote_subject = 'general'
    
#print('the csv data:','\n',csv_data)
filename = 'quotes_'+quote_subject+'.csv'
write_csv(csv_data,filename)

In [19]:
'''
view the data using Pandas
'''
import pandas as pd
pd.read_csv(filename)

Unnamed: 0,author,about,quote
0,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,“The world as we have created it is a process ...
1,J K Rowling,https://quotes.toscrape.com/author/J-K-Rowling,“It is our choices Harry that show what we t...
2,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,“There are only two ways to live your life. On...
3,Jane Austen,https://quotes.toscrape.com/author/Jane-Austen,“The person be it gentleman or lady who has ...
4,Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,“Imperfection is beauty madness is genius and...
...,...,...,...
95,Harper Lee,https://quotes.toscrape.com/author/Harper-Lee,“You never really understand a person until yo...
96,Madeleine LEngle,https://quotes.toscrape.com/author/Madeleine-L...,“You have to write the book that wants to be w...
97,Mark Twain,https://quotes.toscrape.com/author/Mark-Twain,“Never tell the truth to people who are not wo...
98,Dr Seuss,https://quotes.toscrape.com/author/Dr-Seuss,“A person's a person no matter how small.”
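
Once the file exists, a first analysis step could be counting how many quotes each author contributes. The sketch below uses only the standard library and a few placeholder rows held in a string; the rows are invented for illustration and are not scraped data. It also assumes a header line of `author,about,quote`, matching the header list used in this notebook.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only, not scraped data
SAMPLE_CSV = """author,about,quote
Author A,https://example.com/author/Author-A,First placeholder quote
Author B,https://example.com/author/Author-B,Second placeholder quote
Author A,https://example.com/author/Author-A,Third placeholder quote
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
quotes_per_author = Counter(row["author"] for row in reader)

# most_common() lists authors from most to fewest quotes
for author, count in quotes_per_author.most_common():
    print(author, count)
```

To run the same count on the real dataset, replace the `StringIO` object with the opened CSV file.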


In [20]:
#jovian.commit(project="dataanalyst-bootcamp-project1-web-scraping", outputs=[filename])

## Summary

In this project we gathered information about famous people from throughout history by scraping their quotes from https://quotes.toscrape.com.

The data gathered shares some of the things they have said and, for each person quoted, a link to more general information about them.

Using the Python libraries requests, Beautiful Soup, and pandas, we took the following steps:
 - scraped the website, gathering names, quotes, and informational links; to do this we
     - prompted the user to enter a quote category by selecting a tag name as given above
     - used the requests library to request pages from the website
     - used the Beautiful Soup library to parse the data from each web page returned (author's name, author's about link, and quote)
 - created a dataset in the form of a .csv file named after the tag entered by the user
 - viewed the resulting file with pandas.
 
### Data scraped from the website is now available in this notebook's enclosing folder.

If you are running this notebook on a Jupyter platform, go to the menu at the top of the page and select
'File | open'. A new tab will open in the browser. Locate and download 'quotes_<tag>.csv'. If no tag was picked, the file will be named 'quotes_general.csv'.

In [21]:
# this was useful while studying Pandas and its management of csv files
#help(pd.read_csv)

In [22]:
#print(filename)
#csv_data

## Future work
 - scrape multiple websites for additional quotes/people and combine the metadata with what is here
 - gather more in-depth information about each person being quoted by following the link provided in the metadata obtained from this scraping
 - for other web scraping projects, use Selenium to scrape websites with dynamically changing data

## References

[jovian project](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis)

[information about web scraping](https://en.wikipedia.org/wiki/Web_scraping)

[Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[HTTP requests using Python](https://docs.python-requests.org/en/master/)

[Pandas: (the tool used for csv file management in this project)](https://pandas.pydata.org/docs/pandas.pdf)

## Make a submission

In [23]:
# again, the Jovian platform is where this project was developed
# and submitted for evaluation. We don't do that here as this work
# is here for demonstration only
#jovian.submit(assignment="zerotoanalyst-project1", outputs=filename)