# Introduction to BS4 for scraping product data from Amazon
This code uses BeautifulSoup (bs4) to scrape Amazon product pages. It reads a list of URLs from a file, requests each URL, parses the returned HTML, and extracts product details: title, price, rating, availability, description, and reviews.

The following BeautifulSoup functions do most of the work:

soup.find(tag, attrs): Finds and extracts elements with specific HTML tags and attributes.
soup.find_all(tag, attrs): Finds and extracts all elements matching the given criteria.
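soup.text and element.string: Extract the text content of HTML elements. .text joins all nested text, while .string returns a value only when the element has a single child and is None otherwise.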
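
To make the difference between these calls concrete, here is a minimal sketch run on a made-up HTML fragment. The markup and the `sample_html` name are illustrative assumptions, not taken from a real Amazon page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; real Amazon pages differ.
sample_html = """
<div id="product">
  <span id="productTitle">  Casual Shirt  </span>
  <span class="a-list-item">100% cotton</span>
  <span class="a-list-item">Machine <b>wash</b></span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element, or None when nothing matches.
title = soup.find("span", attrs={"id": "productTitle"})
title_text = title.text.strip()

# find_all() returns a list of every matching element.
items = soup.find_all("span", class_="a-list-item")
item_texts = [item.text.strip() for item in items]

# .string is None when an element has more than one child node,
# whereas .text concatenates all nested text.
second_string = items[1].string

# A lookup for an id that is absent from the fragment yields None.
missing = soup.find("span", attrs={"id": "priceblock_ourprice"})

print(title_text)
print(item_texts)
print(second_string, missing)
```

Because `find()` can return `None`, chained calls such as `soup.find(...).string.strip()` raise `AttributeError` when the element is absent, which is why the extraction helpers later in this notebook wrap them in `try`/`except`.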

Reasons for using BS4 instead of Scrapy:
a. Simplicity and Readability: BeautifulSoup is a simpler and more lightweight library compared to Scrapy. It provides a more intuitive way to navigate and extract data from HTML documents, which makes the code easier to understand and maintain, especially for small to medium-sized web scraping tasks like this one.

b. Less Overhead: Scrapy is a powerful web scraping framework that is designed for complex and large-scale scraping projects. However, for simple tasks like extracting information from a list of URLs, using Scrapy can introduce unnecessary complexity and overhead. BeautifulSoup provides a more direct and lightweight solution for such cases.

c. Flexibility: BeautifulSoup allows fine-grained control over the scraping process. Developers can easily adapt the code to changes in the website's structure or layout. It also allows for custom parsing logic, which can be essential when dealing with websites that do not follow a standard structure.

d. Ease of Integration: BeautifulSoup combines easily with other Python libraries: requests fetches the pages and pandas processes the extracted data, as demonstrated in the code.

e. Appropriate Scale: Scrapy suits large-scale projects that crawl many pages and have complex extraction requirements. For a smaller, targeted task such as extracting data from a list of product pages, BeautifulSoup offers a better balance between simplicity and functionality.
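
As a rough illustration of the flexibility point, the sketch below tries several candidate selectors in order and returns the first match, so adapting to a layout change means adding one entry to a list. The `first_text` helper and the sample markup are hypothetical and are not part of the notebook's code.

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    """Return the stripped text of the first CSS selector that matches, else ""."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element.get_text(strip=True)
    return ""


# Hypothetical markup: only the second candidate id is present.
html = '<div><span id="priceblock_dealprice"> 499.00 </span></div>'
soup = BeautifulSoup(html, "html.parser")

# Candidates are tried in order; append new selectors when the layout changes.
price_selectors = ["#priceblock_ourprice", "#priceblock_dealprice"]
price = first_text(soup, price_selectors)
print(price)
```

This is one possible design, not the only one: it trades the nested `try`/`except` blocks for an explicit list of fallbacks that is easy to extend.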

# Get links for products using a keyword search request

In [24]:
import requests
import time
from bs4 import BeautifulSoup

# Search request
search_term = 'shirts'
search_url = f"https://www.amazon.in/s?k={search_term}"

# Create a session to manage cookies and maintain state
session = requests.Session()

# Set a User-Agent header to mimic a common web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

# Send an initial request to the search URL (the timeout keeps a stalled connection from hanging forever)
response = session.get(search_url, headers=headers, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the search page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all links whose class attribute is exactly this string
    # (a multi-class string passed to class_ is matched as a whole, in this order)
    links_with_class = soup.find_all('a', class_='a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal')

    # Extract the href attribute from each link, skipping anchors that have none
    link_list = [link.get('href') for link in links_with_class if link.get('href')]

    # Prepend "https://www.amazon.in" to links beginning with "/"
    link_list = ['https://www.amazon.in' + href if href.startswith('/') else href for href in link_list]

    # Create a filename based on the search term
    filename = f'{search_term}_links.txt'

    # Save the list of modified links to a text file with the search term as the filename
    with open(filename, 'w') as file:
        for link in link_list:
            file.write(link + '\n')

    # Print the list of modified links
    for link in link_list:
        print(link)

    # Pause before issuing any further requests to avoid overloading the server
    time.sleep(3)
else:
    print("Failed to retrieve the Amazon search page. Status code:", response.status_code)


https://www.amazon.in/sspa/click?ie=UTF8&spc=MTo2NjQzMzE1ODE5MzkzNjIwOjE2OTQ1OTI1ODc6c3BfYXRmOjMwMDAyOTQ1NTY3OTQzMjo6MDo6&url=%2FSWADESI-STUFF-Ragular-Sleev-Casual%2Fdp%2FB0CD3XTCJM%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dshirts%26qid%3D1694592587%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1
https://www.amazon.in/sspa/click?ie=UTF8&spc=MTo2NjQzMzE1ODE5MzkzNjIwOjE2OTQ1OTI1ODc6c3BfYXRmOjIwMTU2NDgxNTA3MTk4OjowOjo&url=%2FAmazon-Brand-Inkast-Casual-S-02A_Green%2Fdp%2FB08THJZD1S%2Fref%3Dsr_1_2_sspa%3Fkeywords%3Dshirts%26qid%3D1694592587%26sr%3D8-2-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1
https://www.amazon.in/sspa/click?ie=UTF8&spc=MTo2NjQzMzE1ODE5MzkzNjIwOjE2OTQ1OTI1ODc6c3BfYXRmOjIwMDQ0ODQ3MjYxNDk4OjowOjo&url=%2FBen-Martin-Classic-Collar-Cotton%2Fdp%2FB09YMCW9FD%2Fref%3Dsr_1_3_sspa%3Fkeywords%3Dshirts%26qid%3D1694592587%26sr%3D8-3-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1
https://www.amazon.in/sspa/click?ie=UTF8&spc=MTo2NjQzMzE1ODE5MzkzNjIwOjE2OTQ1OTI1ODc6c3BfYX
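
The links printed above are sponsored-result redirects: each `/sspa/click` URL carries the product path percent-encoded in its `url` query parameter. A minimal sketch of recovering that path with the standard library follows. The `resolve_sponsored_link` helper and the example URL are made up for illustration, and whether the decoded URL serves the same page as the redirect is an assumption.

```python
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://www.amazon.in"


def resolve_sponsored_link(link):
    """Return the product URL embedded in a /sspa/click link, or the link unchanged."""
    parsed = urlparse(link)
    if parsed.path.startswith("/sspa/click"):
        # parse_qs percent-decodes values, so "%2F" becomes "/".
        target = parse_qs(parsed.query).get("url", [])
        if target:
            return urljoin(BASE_URL, target[0])
    return link


# Shortened, made-up example in the same shape as the output above.
example = (
    "https://www.amazon.in/sspa/click?ie=UTF8"
    "&url=%2FExample-Shirt%2Fdp%2FB000000000%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dshirts"
)
print(resolve_sponsored_link(example))
```

Links that are not sponsored redirects pass through unchanged, so the helper could be applied to every entry before the list is written to the text file.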

# Scrape product details from the saved links

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Function to extract Product title
def get_title(soup):
    try:
        title = soup.find("span", attrs={"id": 'productTitle'})
        title_value = title.text
        title_string = title_value.strip()
    except Exception:
        title_string = ""
    return title_string

# Function to extract Product Prices
def get_price(soup):
    try:
        price = soup.find("span", attrs={'id': 'priceblock_ourprice'}).string.strip()
    except Exception:
        try:
            price = soup.find("span", attrs={'id': 'priceblock_dealprice'}).string.strip()
        except Exception:
            price = ""
    return price

# Function to extract Product Rating
def get_rating(soup):
    try:
        rating = soup.find("i", attrs={'class': 'a-icon a-icon-star a-star-4-5'}).string.strip()
    except Exception:
        try:
            rating = soup.find("span", attrs={'class': 'a-icon-alt'}).string.strip()
        except Exception:
            rating = ""
    return rating

# Function to extract the number of reviews
def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={'id': 'acrCustomerReviewText'}).string.strip()
    except Exception:
        review_count = ""
    return review_count

# Function to extract the Product description
def get_product_description(soup):
    try:
        description = soup.find("ul", class_="a-unordered-list a-vertical a-spacing-mini")
        if description:
            return " ".join(item.text.strip() for item in description.find_all("span", class_="a-list-item"))
    except Exception:
        pass
    # Fall back to an empty string when the list is missing or parsing fails
    return ""

# Function to extract Availability Status
def get_availability(soup):
    try:
        available = soup.find("div", attrs={'id': 'availability'})
        available = available.find("span").string.strip()
    except Exception:
        available = "Not Available"
    return available

# Function to extract product reviews
def get_reviews(soup):
    reviews = []
    review_elements = soup.find_all("div", class_="a-row a-spacing-small review-data")
    
    for element in review_elements[:10]:  # Extract the first 10 reviews
        review_span = element.find("span", class_="a-size-base review-text")
        if review_span is not None:  # Skip review blocks without a text span
            reviews.append(review_span.text.strip())
    
    return reviews

# Function to read the list of URLs to scrape, skipping blank lines
def read_urls_from_file(filename):
    with open(filename, 'r') as file:
        urls = [line.strip() for line in file if line.strip()]
    return urls

filename = 'shirts_links.txt'  # Replace with your filename
urls = read_urls_from_file(filename)

d = {"title": [], "price": [], "rating": [], "reviews_count": [], "availability": [], "description": [], "reviews": []}

# Create one session up front so cookies persist across all requests
session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

for url in urls:
    try:
        time.sleep(2)  # Pause between requests to avoid overloading the server
        response = session.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Failed to retrieve the Amazon page: {url}")
            continue

        soup = BeautifulSoup(response.text, 'html.parser')
        # Build the whole row first so a failure cannot leave the lists with unequal lengths
        row = {
            "title": get_title(soup),
            "price": get_price(soup),
            "rating": get_rating(soup),
            "reviews_count": get_review_count(soup),
            "availability": get_availability(soup),
            "description": get_product_description(soup),
            "reviews": get_reviews(soup),
        }
        for key, value in row.items():
            d[key].append(value)
        print(row["title"])
    except Exception as e:
        print(f"An error occurred: {e}")

# Data processing and CSV export

# Read the existing CSV file if it exists; otherwise start from an empty DataFrame
try:
    existing_df = pd.read_csv("amazon_data.csv")
except FileNotFoundError:
    existing_df = pd.DataFrame()

# Concatenate the new data with the existing data
combined_df = pd.concat([existing_df, pd.DataFrame.from_dict(d)], ignore_index=True)

# Remove duplicate rows based on the 'title' column
combined_df.drop_duplicates(subset=['title'], keep='last', inplace=True)

# Save the combined DataFrame to the CSV file
combined_df.to_csv("amazon_data.csv", header=True, index=False)

SWADESI STUFF Ragular Fit Half Sleev Solid Casual Shirt for Men
Amazon Brand - INKAST Men Casual Shirt
Ben Martin Men's Classic Collar Slim Fit Cotton Casual Full Sleeve Shirt
ECOLINE Clothing Mens Polyester Filament Solid Regular Full Sleeve Casual Shirt
Majestic Man Men Slim Fit Casual Shirt
Dennis Lingo Men's Solid Slim Fit Cotton Casual Shirt with Spread Collar & Full Sleeves (Also Available in Plus Size)
IndoPrimo Men's Cotton Casual Regular Fit Checks Shirt with Pocket for Men Long Sleeves - BMW
IndoPrimo Men's Regular Fit Checks Cotton Casual Shirt for Men Full Sleeves - Suzuki
Zombom Men's Regular Fit Cotton Blend Printed Full Sleeve Casual Shirts
Lymio Casual Shirt for Men|| Shirt for Men|| Men Stylish Shirt || Men Printed Shirt (Geo)
Lymio Casual Shirt for Men|| Shirt for Men|| Men Stylish Shirt || Men Printed Shirt (Beach-Floral-BSY)
Lymio Casual Shirt for Men|| Shirt for Men|| Men Stylish Shirt || Men Printed Shirt (D-01-08)
Amazon Brand - INKAST Men Casual Shirt
U-TURN Men
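
The merge-and-deduplicate step at the end of the scraping cell can be tried on toy data before running it against `amazon_data.csv`. The sketch below uses two made-up DataFrames (the titles and prices are placeholders, not scraped values) to show that `keep='last'` retains the newest row for each repeated title.

```python
import pandas as pd

# Toy stand-ins for the existing CSV contents and a fresh scrape (made-up values).
existing_df = pd.DataFrame({"title": ["Shirt A", "Shirt B"], "price": ["100", "200"]})
new_df = pd.DataFrame({"title": ["Shirt B", "Shirt C"], "price": ["250", "300"]})

# Stack old rows first and new rows second, then renumber the index.
combined_df = pd.concat([existing_df, new_df], ignore_index=True)

# keep='last' retains the most recently added row for each repeated title.
combined_df = combined_df.drop_duplicates(subset=["title"], keep="last")

print(combined_df.to_dict("records"))
```

Here "Shirt B" survives with the price from `new_df`, which matches the intent of refreshing stale rows on each run. Note that distinct products sharing an identical title would also be collapsed into one row.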