Name: Dhruv Patel

Assignment: Project 2; Part 2 Web Scraped Data Analysis 

In [None]:
### Scraping the data ###


import requests
from bs4 import BeautifulSoup
import pandas as pd

# Initialize an empty list to store data
data = []

# Base URL for the website
base_url = "http://quotes.toscrape.com/page/{}/"

# Scrape the first 5 pages
for page in range(1, 6):
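    # Use a timeout so a stalled request does not hang the loop
    response = requests.get(base_url.format(page), timeout=10)
    response.raise_for_status()  # Stop early if a page fails to load
    soup = BeautifulSoup(response.content, 'html.parser')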
    
    # Extract quotes, authors, and tags
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        data.append({"Quote": text, "Author": author, "Tags": tags})

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Display the first few rows
print(df.head())


In this step, we scrape data from the Quotes to Scrape website, which provides quotes, their authors, and the tags associated with each quote. The data is collected from the first 5 pages of the website.
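
As a minimal offline sketch of the extraction step, the snippet below parses a small hand-written HTML string instead of a live page. `SAMPLE_HTML` and `parse_quotes` are hypothetical names introduced for illustration, and the markup is an assumption that mirrors the class names targeted by the scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; it only mimics the class names the scraper looks for
SAMPLE_HTML = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
  <a class="tag" href="#">life</a>
  <a class="tag" href="#">humor</a>
</div>
<div class="quote">
  <span class="text">A quote with no tags.</span>
  <small class="author">Another Author</small>
</div>
"""


def parse_quotes(html):
    """Return one dict per quote block found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for quote in soup.find_all("div", class_="quote"):
        rows.append({
            "Quote": quote.find("span", class_="text").get_text(strip=True),
            "Author": quote.find("small", class_="author").get_text(strip=True),
            # find_all returns an empty list when a quote has no tag links
            "Tags": [tag.get_text(strip=True) for tag in quote.find_all("a", class_="tag")],
        })
    return rows


rows = parse_quotes(SAMPLE_HTML)
for row in rows:
    print(row)
```

Keeping the parsing in its own function makes it possible to check the extraction logic without sending any requests.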

In [None]:
### Cleaning the data ###

# Convert the Tags column from lists to a single string for each row
df['Tags'] = df['Tags'].apply(lambda tags: ', '.join(tags))

# Check for missing values
print(df.isnull().sum())

# Remove duplicate rows, if any
df.drop_duplicates(inplace=True)

# Display the cleaned data
print(df.head())


After scraping the data, it is important to clean it for better usability and analysis. This step ensures that the dataset has no redundant rows or missing values.
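
The cleaning steps can be tried on a tiny hand-made table, as in the sketch below. The `sample` DataFrame and its values are made up for illustration and are not scraped data. Joining the tag lists into strings first matters because list values are not hashable, which generally prevents pandas from comparing rows for duplicates.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates, the third has no tags
sample = pd.DataFrame({
    "Quote": ["Quote A", "Quote A", "Quote B"],
    "Author": ["Author X", "Author X", "Author Y"],
    "Tags": [["life", "humor"], ["life", "humor"], []],
})

# Turn each list of tags into one comma-separated string
sample["Tags"] = sample["Tags"].apply(", ".join)

# Count missing values per column (an empty string is not counted as missing)
missing = sample.isnull().sum()

# Drop exact duplicate rows and renumber the index
cleaned = sample.drop_duplicates().reset_index(drop=True)

print(missing)
print(cleaned)
```

Note that a quote without tags ends up with an empty string in `Tags`, which `isnull()` does not flag.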

In [None]:
### Data Analysis and Visualization ###
# Most frequently quoted authors #

import seaborn as sns
import matplotlib.pyplot as plt

# Top 5 authors with the most quotes
top_authors = df['Author'].value_counts().head(5)
print("Most Frequently Quoted Authors:")
print(top_authors)

# Plot
sns.barplot(x=top_authors.values, y=top_authors.index, palette='viridis')
plt.title("Top 5 Most Frequently Quoted Authors")
plt.xlabel("Number of Quotes")
plt.ylabel("Author")
plt.show()


This step identifies the most frequently quoted authors in the dataset, showing who has the most quotes.
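
To show how the ranking works, here is a small sketch using placeholder author names rather than the scraped data. `value_counts()` counts how often each value appears and sorts the result from most to least frequent, so `head(n)` returns the top `n`.

```python
import pandas as pd

# Placeholder author names, one entry per quote
authors = pd.Series(
    ["Author A", "Author B", "Author A", "Author C", "Author A", "Author B"],
    name="Author",
)

# Count quotes per author, sorted in descending order of frequency
counts = authors.value_counts()

# Keep only the two most frequently quoted authors
top_two = counts.head(2)
print(top_two)
```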

In [None]:
# Most Common Tags #

from collections import Counter

# Flatten the list of all tags, skipping the empty string left by quotes with no tags
all_tags = [tag for tags in df['Tags'].str.split(', ') for tag in tags if tag]

# Count the most common tags
tag_counts = Counter(all_tags).most_common(5)
print("Most Common Tags:", tag_counts)

# Create a DataFrame for plotting
tags_df = pd.DataFrame(tag_counts, columns=['Tag', 'Count'])

# Plot
sns.barplot(x='Count', y='Tag', data=tags_df, palette='muted')
plt.title("Top 5 Most Common Tags")
plt.xlabel("Frequency")
plt.ylabel("Tag")
plt.show()


This section analyzes the most frequently occurring tags across the quotes in the dataset. These tags reveal the recurring themes and topics discussed in the quotes.
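
The counting logic can be illustrated on a few made-up tag strings, as sketched below. The values in `tag_strings` are assumptions chosen for the example. One detail worth noting is that splitting an empty string yields a list containing one empty string, so empty entries are filtered out before counting.

```python
from collections import Counter

# Made-up tag strings in the same comma-separated form as the cleaned Tags column
tag_strings = ["life, humor", "life", "", "humor, books, life"]

# Split each string into tags and skip the empty entry produced by a quote with no tags
all_tags = [tag for tags in tag_strings for tag in tags.split(", ") if tag]

# Count every tag and keep the two most common
top_tags = Counter(all_tags).most_common(2)
print(top_tags)
```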

In [None]:
# Number of Unique Authors that are quoted #

unique_authors = df['Author'].nunique()
print(f"Total Unique Authors: {unique_authors}")


In [None]:
# How many quotes contain the word "life" #

# Case-insensitive substring match, so longer words such as "lifetime" are counted too
life_quotes = df['Quote'].str.contains("life", case=False).sum()
print(f"Number of Quotes Containing 'life': {life_quotes}")
