# Web Scraping Project

Objective:
* Get the list of all quote topics from *https://www.brainyquote.com/topics*, along with their URLs.
* Get the quotes for each topic from its URL.
* Build a web app with Streamlit that lets the user select a topic and then displays 10 random quotes, scraped from the site in real time.

## Importing the necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import os

## Defining global constants

In [2]:
site_url='https://www.brainyquote.com/topics'
base_url='https://www.brainyquote.com'
store_dir='files/'

## Getting the list of topics

In [3]:
response=requests.get(site_url)

In [4]:
# check the status of the request 
response.status_code

200

**Inference:** The status_code is in the range [200, 299], so the request succeeded and we are good to go.
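
The same check can be written in code rather than read off by eye. A minimal sketch, assuming a hypothetical helper named `is_success` (it is not part of `requests`); `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
def is_success(status_code: int) -> bool:
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299


print(is_success(200))  # True
print(is_success(404))  # False
```

In the notebook this would be called as `is_success(response.status_code)`.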

### Parse the downloaded data using BeautifulSoup

In [5]:
soup=BeautifulSoup(response.text,'html.parser')

In [6]:
# check the length of the parsed data
len(soup)

5

### Extracting the necessary data from the parsed data

In [7]:
# extracting the section where the list of topics is located
topics_section_selection_class='row bq_left'
topics_section=soup.find_all('div',{'class':topics_section_selection_class})

In [8]:
# check the length of topics_section
len(topics_section)

1

In [9]:
# extracting the topic blocks
topic_block_selection_class='bqLn'
topic_blocks=topics_section[0].find_all('div',{'class':topic_block_selection_class})

In [10]:
# check the length of topic blocks
len(topic_blocks)

250

#### View an Individual topic_block element

In [11]:
topic_blocks[0]

<div class="bqLn">
<div class="bqLn">
<a class="topicIndexChicklet bq_on_link_cl" data-xtracl="PT,Index,1" href="/topics/age-quotes"><span class="topicContentName">Age</span>
<span class="topicIndexArrow"><img alt="Profession Detail" class="bq-fa" src="/st/img/4782095/fa/chv-r.svg"/></span>
<div style="clear:both"></div></a>
</div>
</div>

**Inference:** Although `len(topic_blocks)` returns 250, the actual number of topics is half that (125): each topic is wrapped in two nested `div` tags with the same class, so every topic is matched twice.
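
The double counting can be reproduced on a small inline sample. The sketch below is illustrative only: `sample_html` is a trimmed-down, hand-written imitation of the markup shown above, and selecting the `a` tags directly (instead of stepping through the `div` list by 2) is an alternative suggested here, not the approach used in this notebook.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample that mimics the nested <div class="bqLn"> structure (assumption).
sample_html = """
<div class="bqLn"><div class="bqLn">
<a class="topicIndexChicklet bq_on_link_cl" href="/topics/age-quotes"><span class="topicContentName">Age</span></a>
</div></div>
<div class="bqLn"><div class="bqLn">
<a class="topicIndexChicklet bq_on_link_cl" href="/topics/alone-quotes"><span class="topicContentName">Alone</span></a>
</div></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Both the outer and the inner div match, so every topic is counted twice.
nested_divs = sample_soup.find_all("div", class_="bqLn")

# Each topic has exactly one link, so selecting the links avoids the duplicates.
links = sample_soup.select("a.topicIndexChicklet")

sample_topics = [
    {
        "Topic Name": a.find("span", class_="topicContentName").text.strip(),
        "Topic URL": urljoin("https://www.brainyquote.com", a["href"]),
    }
    for a in links
]

print(len(nested_divs), len(links))  # 4 2
```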

#### Extracting data from individual topic_block element

In [12]:
# extract topic name
topic_name_tag=topic_blocks[0].find('span',class_='topicContentName')
topic_name_tag.text.strip()

'Age'

In [13]:
#extract topic url
topic_url_tag=topic_blocks[0].find('a',class_='topicIndexChicklet bq_on_link_cl')
base_url+topic_url_tag['href']

'https://www.brainyquote.com/topics/age-quotes'

### Putting it all together
* Getting the topic name and url from all elements of the topic_block and storing the results in a dataframe.

In [14]:
topics=[]
for i in range(0,len(topic_blocks),2):
    topic={}
    # extract topic name
    topic_name_tag=topic_blocks[i].find('span',class_='topicContentName')
    topic['Topic Name']=topic_name_tag.text.strip()
    
    #extract topic url
    topic_url_tag=topic_blocks[i].find('a',class_='topicIndexChicklet bq_on_link_cl')
    topic['Topic URL']=base_url+topic_url_tag['href']
    
    topics.append(topic)

#### Making a dataframe of topics and their respective urls

In [15]:
topics_df=pd.DataFrame(topics)
topics_df.head()

| | Topic Name | Topic URL |
|---|---|---|
| 0 | Age | https://www.brainyquote.com/topics/age-quotes |
| 1 | Alone | https://www.brainyquote.com/topics/alone-quotes |
| 2 | Amazing | https://www.brainyquote.com/topics/amazing-quotes |
| 3 | Anger | https://www.brainyquote.com/topics/anger-quotes |
| 4 | Anniversary | https://www.brainyquote.com/topics/anniversary... |


In [16]:
topics_df.shape

(125, 2)

In [17]:
# make the folder that stores the .csv files (no error if it already exists)
os.makedirs(store_dir,exist_ok=True)

In [18]:
# saving the topics_df into a .csv file
topics_df.to_csv(store_dir+'topics.csv',index=None)

## Getting the quotes for an individual topic

In [19]:
topic_url=topics_df['Topic URL'][0]
topic_url

'https://www.brainyquote.com/topics/age-quotes'

In [20]:
response=requests.get(topic_url)

In [21]:
# check the status of the request 
response.status_code

200

### Parse the downloaded data using BeautifulSoup

In [22]:
soup=BeautifulSoup(response.text,'html.parser')

In [23]:
# check the length of the parsed data
len(soup.text)

7579

In [24]:
quotes_selection_class='grid-item qb clearfix bqQt'
quotes_tags=soup.find_all('div',{'class':quotes_selection_class})

In [25]:
# check the number of quotes
len(quotes_tags)

60

### View an individual quote_tag

In [26]:
quotes_tags[0]

<div class="grid-item qb clearfix bqQt" id="pos_1_1">
<a class="b-qt qt_103892 oncl_q" href="/quotes/mark_twain_103892?src=t_age" title="view quote">
<div style="display: flex;justify-content: space-between">
Age is an issue of mind over matter. If you don't mind, it doesn't matter.
<img alt="Share this Quote" class="bq-qb-chv" src="/st/img/4785795/fa/chv-r.svg"/>
</div>
</a>
<a class="bq-aut qa_103892 oncl_a" href="/authors/mark-twain-quotes" title="view author">Mark Twain</a>
</div>

#### Extract the necessary data from the parsed data

In [27]:
#extract the quote 
quote=quotes_tags[0].find('a',class_='b-qt')
quote.text.strip()

"Age is an issue of mind over matter. If you don't mind, it doesn't matter."

In [28]:
# extract the author name
author=quotes_tags[0].find('a',class_='bq-aut')
author.text.strip()

'Mark Twain'

### Putting it all together
* Getting all the quotes along with the author name for an individual topic and storing it in a dataframe.

In [29]:
quotes=[]
for i in range(len(quotes_tags)):

    quote={}
    #extract the quote 
    # print(i)
    quote_tag=quotes_tags[i].find('a',class_='b-qt')
    quote['Quote']=quote_tag.text.strip()
    # extract the author
    author=quotes_tags[i].find('a',class_='bq-aut')
    quote['Author']=author.text.strip()
    
    quotes.append(quote)

#### Making a dataframe of quotes along with its author name for an individual topic

In [30]:
quotes_df=pd.DataFrame(quotes)
quotes_df.head()

| | Quote | Author |
|---|---|---|
| 0 | Age is an issue of mind over matter. If you do... | Mark Twain |
| 1 | Anyone who stops learning is old, whether at t... | Henry Ford |
| 2 | You can't help getting older, but you don't ha... | George Burns |
| 3 | Youth is the gift of nature, but age is a work... | Stanislaw Jerzy Lec |
| 4 | Forty is the old age of youth; fifty the youth... | Victor Hugo |


In [31]:
quotes_df.shape

(60, 2)
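
The extraction steps in this section can be collected into a single reusable function. This is a sketch under stated assumptions: `parse_quotes` is a hypothetical name, it matches quote blocks on the single class `bqQt` rather than the full class string used above, and it skips any block that lacks a quote or an author instead of raising an error. The sample HTML is a shortened copy of the quote block shown earlier.

```python
from bs4 import BeautifulSoup


def parse_quotes(html: str) -> list:
    """Return a list of {'Quote', 'Author'} dicts found in the page HTML."""
    page = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in page.find_all("div", class_="bqQt"):
        quote_tag = tag.find("a", class_="b-qt")
        author_tag = tag.find("a", class_="bq-aut")
        if quote_tag is None or author_tag is None:
            continue  # skip blocks that are missing either piece
        rows.append({
            "Quote": quote_tag.text.strip(),
            "Author": author_tag.text.strip(),
        })
    return rows


# Shortened copy of the quote block displayed earlier in this section.
sample_quote_html = """
<div class="grid-item qb clearfix bqQt" id="pos_1_1">
<a class="b-qt qt_103892 oncl_q" href="/quotes/mark_twain_103892?src=t_age" title="view quote">
<div style="display: flex;justify-content: space-between">
Age is an issue of mind over matter. If you don't mind, it doesn't matter.
</div>
</a>
<a class="bq-aut qa_103892 oncl_a" href="/authors/mark-twain-quotes" title="view author">Mark Twain</a>
</div>
"""

print(parse_quotes(sample_quote_html))
```

With such a function, the result of `parse_quotes(response.text)` could be passed straight to `pd.DataFrame`.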

## Getting the quotes for all the topics
* Storing the quotes of individual topics in separate .csv files.
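
Because each file is named after its topic, names such as "Father's Day" end up containing spaces and apostrophes. The loop below uses the topic name as-is; the following is only an optional sketch of a more conservative naming scheme, assuming a hypothetical helper `topic_to_filename` that replaces every run of non-alphanumeric characters with an underscore.

```python
import re


def topic_to_filename(topic_name: str, store_dir: str = "files/") -> str:
    """Build a .csv path from a topic name using only letters, digits and '_'."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", topic_name).strip("_")
    return f"{store_dir}{slug}.csv"


print(topic_to_filename("Father's Day"))  # files/Father_s_Day.csv
```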

In [32]:

for row in topics_df.itertuples():
    _,topic_name,topic_url=row
    # print(topic_name,"     ",topic_url)
    print(f'Scraping topic {topic_name}')
    response=requests.get(topic_url)
    status_code=response.status_code
    # skip this topic if the request did not succeed (status outside 200-299)
    if status_code<200 or status_code>299:
        continue
    soup=BeautifulSoup(response.text,'html.parser')
    quotes_selection_class='grid-item qb clearfix bqQt'
    quotes_tags=soup.find_all('div',{'class':quotes_selection_class})
    quotes=[]
    for i in range(len(quotes_tags)):

        quote={}
        #extract the quote 
        # print(i)
        quote_tag=quotes_tags[i].find('a',class_='b-qt')
        quote['Quote']=quote_tag.text.strip()
        # extract the author
        author=quotes_tags[i].find('a',class_='bq-aut')
        quote['Author']=author.text.strip()

        quotes.append(quote)
    quotes_df=pd.DataFrame(quotes)
    file_name=store_dir+topic_name+'.csv'
    quotes_df.to_csv(file_name,index=None)

Scraping topic Age
Scraping topic Alone
Scraping topic Amazing
Scraping topic Anger
Scraping topic Anniversary
Scraping topic Architecture
Scraping topic Art
Scraping topic Attitude
Scraping topic Beauty
Scraping topic Best
Scraping topic Birthday
Scraping topic Brainy
Scraping topic Business
Scraping topic Car
Scraping topic Chance
Scraping topic Change
Scraping topic Christmas
Scraping topic Communication
Scraping topic Computers
Scraping topic Cool
Scraping topic Courage
Scraping topic Dad
Scraping topic Dating
Scraping topic Death
Scraping topic Design
Scraping topic Diet
Scraping topic Dreams
Scraping topic Easter
Scraping topic Education
Scraping topic Environmental
Scraping topic Equality
Scraping topic Experience
Scraping topic Experience
Scraping topic Failure
Scraping topic Faith
Scraping topic Family
Scraping topic Famous
Scraping topic Father's Day
Scraping topic Fear
Scraping topic Finance
Scraping topic Fitness
Scraping topic Food
Scraping topic Forgiveness
Scraping topic

**Note:** The first two objectives of the project are complete. The third will be done in a separate `app.py` file.
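
For the third objective, the "10 random quotes" step does not depend on Streamlit and can be sketched on its own. This is a minimal illustration, not the contents of `app.py`: `pick_random_quotes` is a hypothetical helper, and the placeholder records below stand in for the scraped quotes.

```python
import random


def pick_random_quotes(quotes: list, k: int = 10) -> list:
    """Return up to k distinct quotes chosen at random."""
    return random.sample(quotes, min(k, len(quotes)))


# Placeholder records shaped like the scraped data (not real quotes).
sample_quotes = [{"Quote": f"Quote {i}", "Author": f"Author {i}"} for i in range(60)]

chosen = pick_random_quotes(sample_quotes)
print(len(chosen))  # 10
```

Capping `k` at `len(quotes)` keeps `random.sample` from raising an error when a topic page yields fewer than 10 quotes.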