
Overview
This assignment is designed to test your skills in web scraping, data cleaning, and exploratory data analysis (EDA). You will start by collecting data from the web, specifically focusing on user reviews. Then, you'll clean this dataset to prepare it for analysis. Finally, you'll dive into the dataset to uncover insights and answer specific questions.

Guidelines
Scrape the data and perform the operations below:

Scrape the website given below, or choose any website of your own.

Books to Scrape: https://books.toscrape.com/

Save all the scraped content to a data frame.

EDA must be performed using pandas to answer 5 questions from your data and answer the last question in words.

What is the size of the dataset?

What are the names and data types of each column?

How many unique values are there for each categorical variable?

If there is any numerical value in the dataset, what are the minimum and maximum values for it?

Drop rows that have missing values
What are the most frequent categories in the data? Write your observation in words.



---



To complete this assignment, I will perform the following steps:

Web Scraping: Scrape book data from https://books.toscrape.com/.

Data Cleaning: Clean the dataset to prepare it for analysis.

Exploratory Data Analysis (EDA): Answer the specific questions above using pandas.

**Step 1: Web Scraping**

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the website to scrape
url = "https://books.toscrape.com/"

# Send a GET request to the website
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all book containers
book_containers = soup.find_all('article', class_='product_pod')

# Initialize lists to store the data
titles = []
prices = []
availabilities = []
ratings = []

# Extract data from each book container
for book in book_containers:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    availability = book.find('p', class_='instock availability').text.strip()
    rating = book.p['class'][1]

    titles.append(title)
    prices.append(price)
    availabilities.append(availability)
    ratings.append(rating)

# Create a DataFrame from the lists
books_df = pd.DataFrame({
    'Title': titles,
    'Price': prices,
    'Availability': availabilities,
    'Rating': ratings
})

# Save the DataFrame to a CSV file (optional)
books_df.to_csv('books.csv', index=False)

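The cell above requests only the landing page, which is why the dataset later shows 20 rows. A minimal sketch of how the same extraction could be wrapped in a function and extended to follow the site's "next" links is shown here. The helper names `parse_books`, `next_page_url`, `scrape_all` and the `max_pages` cap are hypothetical, and the `li.next a` selector is an assumption about the page markup that should be checked against the live site.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START_URL = "https://books.toscrape.com/"


def parse_books(html):
    """Return one dict per book found in a listing page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        # The rating word is the second CSS class, e.g. "star-rating Three".
        rating_tag = book.find("p", class_="star-rating")
        rows.append({
            "Title": book.h3.a["title"],
            "Price": book.find("p", class_="price_color").get_text(strip=True),
            "Availability": book.find("p", class_="availability").get_text(strip=True),
            "Rating": rating_tag["class"][1],
        })
    return rows


def next_page_url(html, current_url):
    """Return the absolute URL of the 'next' link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")  # assumed pager markup
    return urljoin(current_url, link["href"]) if link else None


def scrape_all(start_url=START_URL, max_pages=50):
    """Follow 'next' links from start_url and collect every book row."""
    rows, url = [], start_url
    for _ in range(max_pages):  # hypothetical safety cap on page count
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        rows.extend(parse_books(response.content))
        url = next_page_url(response.content, url)
        if url is None:
            break
    return rows
```

Calling `pd.DataFrame(scrape_all())` would then give a frame with the same four columns as `books_df`. Keeping the parsing separate from the HTTP request also lets the extraction be checked against a saved HTML snippet without touching the network.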

**Step 2: Data Cleaning**

In [None]:
# Load the data into a DataFrame
books_df = pd.read_csv('books.csv')

# Clean the Price column by removing the currency symbol and converting to float
books_df['Price'] = books_df['Price'].replace('£', '', regex=True).astype(float)

# Convert the Rating column to categorical
books_df['Rating'] = books_df['Rating'].astype('category')

# Display the cleaned DataFrame
print("Cleaned DataFrame:")
print(books_df.head())


Cleaned DataFrame:
                                   Title  Price Availability Rating
0                   A Light in the Attic  51.77     In stock  Three
1                     Tipping the Velvet  53.74     In stock    One
2                             Soumission  50.10     In stock    One
3                          Sharp Objects  47.82     In stock   Four
4  Sapiens: A Brief History of Humankind  54.23     In stock   Five

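The cleaning step leaves `Rating` as an unordered category, so 'Five' and 'One' have no numeric meaning. One possible refinement, sketched here on a few rows taken from the output above, is to strip every non-numeric character from the price and to store the rating as an ordered category with a numeric star count. The names `clean_books`, `RATING_ORDER`, and `RatingStars` are hypothetical, and the `Â£` prefix is an assumed example of a mis-decoded pound sign, not something observed in this notebook.

```python
import pandas as pd

# Sample rows in the same shape as books.csv; the "Â£" prefix is an
# assumed example of a mis-decoded pound sign.
raw = pd.DataFrame({
    "Title": ["A Light in the Attic", "Tipping the Velvet", "Sharp Objects"],
    "Price": ["£51.77", "Â£53.74", "£47.82"],
    "Rating": ["Three", "One", "Four"],
})

RATING_ORDER = ["One", "Two", "Three", "Four", "Five"]


def clean_books(df):
    """Return a cleaned copy with numeric prices and ordered ratings."""
    out = df.copy()
    # Keep only digits and the decimal point, then convert to float.
    out["Price"] = out["Price"].str.replace(r"[^\d.]", "", regex=True).astype(float)
    # An ordered categorical lets ratings be sorted and compared.
    out["Rating"] = pd.Categorical(out["Rating"], categories=RATING_ORDER, ordered=True)
    # Category codes start at 0, so add 1 to get the star count
    # (a rating outside RATING_ORDER would come out as 0).
    out["RatingStars"] = out["Rating"].cat.codes + 1
    return out


cleaned = clean_books(raw)
print(cleaned)
```

With the ordering in place, expressions such as `cleaned["Rating"].max()` or `cleaned.sort_values("Rating")` follow the star order instead of alphabetical order.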

**Step 3: Exploratory Data Analysis (EDA)**

In [None]:
# 1. What is the size of the dataset?
dataset_size = books_df.shape
print(f"Size of the dataset: {dataset_size}")

# 2. What are the names and data types of each column?
column_info = books_df.dtypes
print("\nColumn names and data types:")
print(column_info)

# 3. How many unique values are there for each categorical variable?
unique_values = books_df.nunique()
print("\nUnique values for each column:")
print(unique_values)

# 4. If there is any numerical value in the dataset, what are the minimum and maximum values for it?
min_price = books_df['Price'].min()
max_price = books_df['Price'].max()
print(f"\nMinimum price: £{min_price}")
print(f"Maximum price: £{max_price}")

# 5. Drop rows that have missing values
books_df.dropna(inplace=True)

# 6. What are the most frequent categories in the data? Write your observation in words.
most_frequent_availability = books_df['Availability'].mode()[0]
most_frequent_rating = books_df['Rating'].mode()[0]
print(f"\nMost frequent availability status: {most_frequent_availability}")
print(f"Most frequent rating: {most_frequent_rating}")

# Observation in words, built from the computed values so the text cannot drift from the data
observation = f"""
The most frequent availability status is '{most_frequent_availability}'; Availability has a single unique value, so every scraped book is in stock.
The most frequent rating is '{most_frequent_rating}'. This is the mode only and does not by itself mean a majority of the books share that rating.
"""
print(observation)


Size of the dataset: (20, 4)

Column names and data types:
Title             object
Price            float64
Availability      object
Rating          category
dtype: object

Unique values for each column:
Title           20
Price           20
Availability     1
Rating           5
dtype: int64

Minimum price: £13.99
Maximum price: £57.25

Most frequent availability status: In stock
Most frequent rating: One

The most frequent availability status is 'In stock'; Availability has a single unique value, so every scraped book is in stock.
The most frequent rating is 'One'. This is the mode only and does not by itself mean a majority of the books share that rating.

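Question 3 asks about categorical variables only, and the mode alone does not show how ratings are distributed. A small sketch of how those two points could be examined is given here, using the five rows printed in the cleaned DataFrame above as sample data. Treating `Title` as an identifier and not as a category is an assumption, and the variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the cleaned DataFrame output, used as sample data.
books = pd.DataFrame({
    "Title": [
        "A Light in the Attic",
        "Tipping the Velvet",
        "Soumission",
        "Sharp Objects",
        "Sapiens: A Brief History of Humankind",
    ],
    "Price": [51.77, 53.74, 50.10, 47.82, 54.23],
    "Availability": ["In stock"] * 5,
    "Rating": pd.Categorical(["Three", "One", "One", "Four", "Five"]),
})

# Missing values per column, checked before any rows are dropped.
missing = books.isna().sum()

# Unique counts for the non-numeric columns only; Title is dropped because
# it is assumed to be an identifier, not a category.
categorical = books.select_dtypes(exclude="number").drop(columns="Title")
unique_counts = categorical.nunique()

# Full frequency table for the rating, not just the mode.
rating_counts = books["Rating"].value_counts()

# Minimum and maximum price in one call.
price_range = books["Price"].agg(["min", "max"])

print(missing, unique_counts, rating_counts, price_range, sep="\n\n")
```

Running the same calls on the full `books_df` would show how far ahead the most frequent rating actually is, which gives the written observation something concrete to rest on.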




