Source: https://www.kdnuggets.com/2022/02/build-web-scraper-python-5-minutes.html

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [4]:
f = requests.get('https://quotes.toscrape.com/')

In [5]:
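# pass the text to BeautifulSoup for parsing of the raw data
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(f.text, 'html.parser')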

### Exploring the data

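The commands below extract progressively narrower parts of the HTML, starting with whole quote blocks and ending with individual fields.

Inspect the page's HTML in a browser (for example with the element inspector in the developer tools) to find the tag names and classes these commands rely on.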
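
As a rough illustration of turning what the inspector shows into a lookup, the sketch below runs against a small hand-written HTML sample rather than the live page. The sample markup and the variable names (`sample_html`, `sample_soup`) are assumptions made for this example; only the tag names and classes mirror the quote blocks shown in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure of one quote block on the page.
sample_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <meta class="keywords" content="sample,demo"/>
    <a class="tag" href="/tag/sample/page/1/">sample</a>
    <a class="tag" href="/tag/demo/page/1/">demo</a>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag name plus class, exactly as read off the element inspector.
via_find = sample_soup.find("div", class_="quote").find("span", class_="text").get_text()

# The same lookup written as a CSS selector.
via_select = sample_soup.select_one("div.quote span.text").get_text()

author = sample_soup.select_one("div.quote small.author").get_text()
tag_names = [a.get_text() for a in sample_soup.select("div.quote a.tag")]

print(via_find)
print(author)
print(tag_names)
```

Both styles reach the same element; which one reads better is largely a matter of taste.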

In [31]:
# display the full html (uncomment to print; the output is long)
#print(soup.prettify())

In [13]:
# display the first two quote <div>'s
soup.find_all("div", {"class": "quote"})[:2]

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>
 <div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>
 </div>,
 <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
 <span>by <small class="author" itempr

In [18]:
# extract the "text" span from each "quote" div
for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("span", {"class": "text"}))

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
<spa

In [19]:
# extract only the text from each "text" span
for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("span", {"class": "text"}).text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
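
The extracted strings still carry the typographic quotation marks from the page. If those are unwanted, a minimal cleanup could look like the sketch below; the list `raw_quotes` is a hypothetical stand-in for the scraped values, with stray whitespace added to the second entry for illustration.

```python
# Hypothetical sample values shaped like the extracted quote text.
raw_quotes = [
    "“Try not to become a man of success. Rather become a man of value.”",
    "  “A day without sunshine is like, you know, night.”\n",
]

# Strip surrounding whitespace first, then the curly quotation marks at either end.
clean_quotes = [q.strip().strip("“”") for q in raw_quotes]

for q in clean_quotes:
    print(q)
```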


In [20]:
# extract the authors
for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("small", {"class": "author"}).text)

Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin


Note that the "tags" div is nested inside the "quote" div, but the search covers all descendants of the soup, so you can select it directly without going through the parent.

In [21]:
# extract the tags
for i in soup.find_all("div", {"class": "tags"}):
    print(i.find("meta")['content'])

change,deep-thoughts,thinking,world
abilities,choices
inspirational,life,live,miracle,miracles
aliteracy,books,classic,humor
be-yourself,inspirational
adulthood,success,value
life,love
edison,failure,inspirational,paraphrased
misattributed-eleanor-roosevelt
humor,obvious,simile
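
Selecting the "tags" divs directly works, but it relies on every quote having exactly one such div in the same order. A possible alternative, sketched here on a hand-written sample, looks the tags up inside each quote div and splits the comma-separated `content` string into a list. The names `nested_html`, `nested_soup`, and `tags_by_quote` are assumptions for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical two-quote sample with the same nesting as the real page.
nested_html = """
<div class="quote">
  <span class="text">“First sample quote.”</span>
  <div class="tags"><meta class="keywords" content="alpha,beta"/></div>
</div>
<div class="quote">
  <span class="text">“Second sample quote.”</span>
  <div class="tags"><meta class="keywords" content="gamma"/></div>
</div>
"""

nested_soup = BeautifulSoup(nested_html, "html.parser")

tags_by_quote = {}
for quote_div in nested_soup.find_all("div", class_="quote"):
    text = quote_div.find("span", class_="text").get_text()
    # Search inside this quote only, so tags cannot drift out of step with quotes.
    keywords = quote_div.find("div", class_="tags").find("meta")["content"]
    # Split the comma-separated string; an empty string yields an empty list.
    tags_by_quote[text] = keywords.split(",") if keywords else []

print(tags_by_quote)
```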


### Storing the data in a dataframe

In [22]:
quotes = []
authors = []
tags = []

In [23]:
# range(1, 10) requests pages 1 through 9; raise the stop value to fetch more
for page in range(1, 10):
    f = requests.get('https://quotes.toscrape.com/page/' + str(page))
    soup = BeautifulSoup(f.text, 'html.parser')
    # read all three fields from the same "quote" div so the lists stay aligned
    for i in soup.find_all("div", {"class": "quote"}):
        quotes.append(i.find("span", {"class": "text"}).text)
        authors.append(i.find("small", {"class": "author"}).text)
        tags.append(i.find("div", {"class": "tags"}).find("meta")['content'])
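
The loop above hard-codes the number of pages. One way to avoid that, shown here only as a sketch, is to keep requesting pages until one comes back with no quote divs. The helpers `parse_quotes` and `scrape_all`, the `fetch_page` callable, and the `fake_pages` data are all hypothetical names introduced for this example; the fake pages stand in for the network so the snippet runs offline.

```python
from bs4 import BeautifulSoup


def parse_quotes(html):
    """Return one dict per quote <div> found in the given HTML string."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for div in page.find_all("div", class_="quote"):
        rows.append({
            "quote": div.find("span", class_="text").get_text(),
            "author": div.find("small", class_="author").get_text(),
            "tags": div.find("div", class_="tags").find("meta")["content"],
        })
    return rows


def scrape_all(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page has no quotes."""
    rows = []
    for page_number in range(1, max_pages + 1):
        page_rows = parse_quotes(fetch_page(page_number))
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


# Stand-in for the network: two fake pages, then empty pages afterwards.
fake_pages = {
    1: '<div class="quote"><span class="text">“One.”</span>'
       '<small class="author">A. Author</small>'
       '<div class="tags"><meta content="x,y"/></div></div>',
    2: '<div class="quote"><span class="text">“Two.”</span>'
       '<small class="author">B. Author</small>'
       '<div class="tags"><meta content="z"/></div></div>',
}
demo_rows = scrape_all(lambda n: fake_pages.get(n, "<html></html>"))
print(demo_rows)
```

Against the real site, `fetch_page` could be something like `lambda n: requests.get('https://quotes.toscrape.com/page/' + str(n)).text`. Whether an out-of-range page returns an empty listing rather than an error is an assumption worth checking before relying on this stop condition; `max_pages` is there as a safety cap either way.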

In [27]:
df = pd.DataFrame(
    {'quote':quotes,
     'author':authors,
     'tags':tags
    })

In [30]:
df.head(3)

   quote                                              author           tags
0  “The world as we have created it is a process ...  Albert Einstein  change,deep-thoughts,thinking,world
1  “It is our choices, Harry, that show what we t...  J.K. Rowling     abilities,choices
2  “There are only two ways to live your life. On...  Albert Einstein  inspirational,life,live,miracle,miracles
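
Once the data is in a dataframe, two follow-up steps are often useful: splitting the comma-separated tags into lists, and writing the result to CSV. The sketch below uses a small hypothetical dataframe (`sample_df`) with the same three columns. It passes `index=False` to `to_csv` so that the row index is not saved as an extra unnamed column.

```python
import io

import pandas as pd

# Hypothetical rows with the same columns as the scraped dataframe.
sample_df = pd.DataFrame({
    "quote": ["“One.”", "“Two.”"],
    "author": ["A. Author", "B. Author"],
    "tags": ["x,y", "z"],
})

# Turn the comma-separated string into a list per row.
sample_df["tag_list"] = sample_df["tags"].str.split(",")

# One row per (quote, tag) pair, which makes counting tags straightforward.
tag_counts = sample_df.explode("tag_list")["tag_list"].value_counts()

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
sample_df[["quote", "author", "tags"]].to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(tag_counts)
print(csv_text)
```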
