Import the necessary libraries:

In [73]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Using the requests library, we will get the page we want to scrape and extract its HTML:

In [76]:
f = requests.get('http://quotes.toscrape.com/')
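
A request can fail or hang, so it may help to check the response before parsing it. The sketch below is one way to do that; the `fetch_html` helper name and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising if the request fails.

    The helper name and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('http://quotes.toscrape.com/')` would then return the same text as `f.text`, or raise an exception instead of silently handing an error page to the parser.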

We will pass the site’s HTML text to BeautifulSoup, which parses the raw markup so it can be easily scraped:

In [79]:
soup = BeautifulSoup(f.text, 'html.parser')  # name the parser explicitly to avoid a warning
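
BeautifulSoup picks a parser on its own when none is named, which can give different results on different machines. A small sketch with an explicit parser follows; the `html` string is a made-up stand-in for `f.text` so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for f.text so the example needs no network access
html = "<html><head><title>Quotes to Scrape</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the parsed object
title = soup.title.string
print(title)
```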

We can now run BeautifulSoup’s built-in methods on this object to extract the data we want.
For example, to extract all of the text available on the web page, we can use the following line of code:

In [82]:
print(soup.get_text())




Quotes to Scrape








Quotes to Scrape




Login






“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world



“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)


            Tags:
            
abilities
choices



“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)


            Tags:
            
inspirational
life
live
miracle
miracles



“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)


            Tags:
            
aliteracy
books
classic
humor



“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
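
The output above contains many blank lines because get_text() keeps the whitespace between tags. As a hedged sketch, the `separator` and `strip` arguments of get_text() can tidy this up; the `html` snippet below is a simplified, made-up fragment in the style of the page.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical fragment modelled on one quote block
html = """
<div class="quote">
    <span class="text">“A day without sunshine is like, you know, night.”</span>
    <span>by <small class="author">Steve Martin</small></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator is placed between the fragments that remain
clean = soup.get_text(separator="\n", strip=True)
print(clean)
```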


BeautifulSoup has methods like find() and findAll() that you can use to extract specific HTML tags from the web page. If you right-click one of the quotes on the page and inspect it, the browser highlights a <span> tag with the class text. All the quotes on the page belong to this text class.
We need to extract all the data in this class:

In [85]:
for i in soup.findAll("div",{"class":"quote"}):
    print((i.find("span",{"class":"text"})).text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
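
The same lookup can be written with find_all(), the spelling used in the current BeautifulSoup documentation (findAll() is the older name for the same method), or with a CSS selector through select(). A minimal sketch on a made-up two-quote fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same class names as the page
html = """
<div class="quote"><span class="text">First quote</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Second quote</span><small class="author">Author B</small></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute
texts = [span.get_text() for span in soup.find_all("span", class_="text")]

# select() takes a CSS selector and matches the same elements here
texts_css = [span.get_text() for span in soup.select("div.quote span.text")]
print(texts)
```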


We will use the same find() and findAll() functions to extract all the author names, which sit in a <small> tag with the class author:

In [88]:
for i in soup.findAll("div",{"class":"quote"}):
    print((i.find("small",{"class":"author"})).text)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin
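
Note that find() returns None when nothing matches, so reading .text on the result raises an AttributeError if a block lacks the element. The page used here has an author for every quote, but on other sites a guard may be useful. A sketch on a made-up fragment where the second block has no author:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second block deliberately has no author tag
html = """
<div class="quote"><span class="text">Has an author</span><small class="author">Jane Austen</small></div>
<div class="quote"><span class="text">No author element</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

authors = []
for block in soup.find_all("div", class_="quote"):
    tag = block.find("small", class_="author")
    # Guard against a missing element before reading its text
    authors.append(tag.get_text() if tag is not None else None)
print(authors)
```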


To extract the tags attached to each quote, run the following lines of code:

In [91]:
for i in soup.findAll("div",{"class":"tags"}):
    print((i.find("meta"))['content'])

change,deep-thoughts,thinking,world
abilities,choices
inspirational,life,live,miracle,miracles
aliteracy,books,classic,humor
be-yourself,inspirational
adulthood,success,value
life,love
edison,failure,inspirational,paraphrased
misattributed-eleanor-roosevelt
humor,obvious,simile
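
Each line above is a single comma-separated string. If individual tags are needed, the string can be split; the `split_tags` helper below is a hypothetical name used only for this sketch.

```python
def split_tags(content):
    """Turn a comma-separated tag string into a list of tags."""
    # Filtering out empty pieces makes an empty string yield an empty list
    return [tag for tag in content.split(",") if tag]


print(split_tags("change,deep-thoughts,thinking,world"))
print(split_tags(""))
```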


Let’s create three empty lists so we can store the data collected:

In [94]:
quotes = []
authors = []
tags = []

Create a loop over pages 1 to 10 to iterate through every page on the site. Because range() excludes its upper bound, this requires range(1, 11). We will run the same lines of code we created earlier. The only difference is that instead of printing the output, we now append it to a list. The quote, author, and tags are all read from the same quote block, so the three lists stay the same length.

In [97]:
for page in range(1, 11):
    f = requests.get('http://quotes.toscrape.com/page/' + str(page))
    soup = BeautifulSoup(f.text, 'html.parser')
    # Each div.quote block holds one quote, its author, and its tags
    for i in soup.findAll("div", {"class": "quote"}):
        quotes.append(i.find("span", {"class": "text"}).text)
        authors.append(i.find("small", {"class": "author"}).text)
        tags.append(i.find("div", {"class": "tags"}).find("meta")['content'])

Finally, let’s consolidate all the data collected into a pandas DataFrame:

In [101]:
a = {'Quotes' : quotes ,'Authors' : authors , 'Tags': tags}
df = pd.DataFrame.from_dict(a, orient='index')
df = df.transpose()

In [103]:
df

,Quotes,Authors,Tags
0,“The world as we have created it is a process ...,Albert Einstein,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself,inspirational"
...,...,...,...
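
Because the Tags column stores several tags in one comma-separated string, it may be handy to reshape it to one row per tag before analysis. A minimal sketch, building a small frame from one record per quote; the `records` and `tag_rows` names are illustrative, and the two rows are taken from the output shown earlier.

```python
import pandas as pd

# One dictionary per quote keeps quote, author, and tags together
records = [
    {
        "Quotes": "“The world as we have created it is a process of our thinking. "
                  "It cannot be changed without changing our thinking.”",
        "Authors": "Albert Einstein",
        "Tags": "change,deep-thoughts,thinking,world",
    },
    {
        "Quotes": "“It is our choices, Harry, that show what we truly are, "
                  "far more than our abilities.”",
        "Authors": "J.K. Rowling",
        "Tags": "abilities,choices",
    },
]
df = pd.DataFrame(records)

# Split each tag string into a list, then give every tag its own row
tag_rows = df.assign(Tags=df["Tags"].str.split(",")).explode("Tags")
print(tag_rows[["Authors", "Tags"]])
```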


We have successfully scraped a website using Python libraries and stored the extracted data in a DataFrame. This data can be used for further analysis. For example, you can build a clustering model to group similar quotes together, or train a model that automatically generates tags for an input quote.
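
As a simple first step before any modelling, the tags can be counted to see which topics are most common. The sketch below uses only the ten tag strings printed for the first page; the `page_one_tags` name is an assumption for illustration.

```python
from collections import Counter

# The ten tag strings printed earlier for the first page
page_one_tags = [
    "change,deep-thoughts,thinking,world",
    "abilities,choices",
    "inspirational,life,live,miracle,miracles",
    "aliteracy,books,classic,humor",
    "be-yourself,inspirational",
    "adulthood,success,value",
    "life,love",
    "edison,failure,inspirational,paraphrased",
    "misattributed-eleanor-roosevelt",
    "humor,obvious,simile",
]

# Count every individual tag across all quotes
counts = Counter(tag for line in page_one_tags for tag in line.split(","))
print(counts.most_common(1))  # [('inspirational', 3)]
```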