# BestBuy Webscraping Project

Author: Muhammad Fouzan Akhter

The code below implements a web scraping project that targets BestBuy. Any online scraping project should follow the target website's privacy policies. This project scrapes only publicly accessible data from BestBuy and adheres to the platform's privacy standards.

In [None]:
#installing required packages:
!pip install requests
!pip install beautifulsoup4
!pip install pandas
!pip install selenium

In [None]:
#importing required libraries:
import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

**This Project is coded in the Jupyter Notebook Environment**


The webscraper for BestBuy is divided into two parts. The first part uses a Selenium webdriver to extract product links from a specified number of listing pages, set by the `max_pages` variable. The second part takes a single product link and extracts the Heading, Model Number, SKU Number, Rating, Reviews, and Price of that product. The two parts are defined separately. To make the process autonomous, connect the part that extracts links to the part that extracts data from each link, and store the final output in a pandas dataframe.

### Product Link Extractor

In [None]:
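# Launch Chrome; the listing URL below is the "all laptops" category page
driver = webdriver.Chrome()
base_url = 'https://www.bestbuy.com/site/laptop-computers/all-laptops/pcmcat138500050001.c?id=pcmcat138500050001&intl=nosplash'
max_pages = 25  # upper bound on the number of listing pages to visit
collected_links = set()  # a set drops duplicate links automatically
next_page_selector = "a.sku-list-page-next[aria-disabled='false']"
try:
    driver.get(base_url)
    for current_page in range(1, max_pages + 1):
        # Wait until at least one product link is present on the page
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a.image-link'))
        )
        # Collect links before paginating so the last page is not skipped
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for link in soup.find_all('a', class_='image-link'):
            href = link.get('href')
            if href:  # skip anchors that have no href attribute
                collected_links.add(f'https://www.bestbuy.com{href}')
        if current_page == max_pages:
            break
        # Times out (and ends the loop) when no enabled "next" button exists
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_selector))
        )
        next_button.click()
        time.sleep(5)  # give the next page time to load
except Exception as e:
    print(f"An error occurred: {str(e)}")
finally:
    driver.quit()  # always close the browser, even after an error
for link in collected_links:
    print(link)
print(f"Number of links collected: {len(collected_links)}")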
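
The link extractor builds absolute URLs by prefixing every `href` with the site root, which assumes each `href` is a site-relative path. A minimal sketch of a more defensive variant is shown here, using `urllib.parse.urljoin` from the standard library. The helper name `normalize_links` and the example paths are hypothetical and are not part of the original notebook.

```python
from urllib.parse import urljoin

BASE = "https://www.bestbuy.com"


def normalize_links(hrefs, base=BASE):
    """Return a sorted list of unique absolute URLs, skipping empty hrefs."""
    links = set()
    for href in hrefs:
        if not href:  # link.get('href') returns None when the attribute is missing
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative paths
        links.add(urljoin(base, href))
    return sorted(links)


# Placeholder hrefs for illustration only (not real product pages)
example = normalize_links([
    "/site/example-laptop/1234567.p?skuId=1234567",
    "https://www.bestbuy.com/site/example-laptop/1234567.p?skuId=1234567",
    None,
])
print(example)
```

In the loop above, the `href` values read from `soup.find_all('a', class_='image-link')` could be passed through such a helper instead of being formatted into an f-string.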

### Product Information Extractor

In [None]:
url = ''  # paste any URL extracted by the link extractor above to test
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    )
}
with requests.Session() as session:
    try:
        response = session.get(url, headers=headers, timeout=10)  # avoid hanging indefinitely
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        heading_element = soup.find('h1', class_='heading-5 v-fw-regular')
        model_element = soup.find('div', class_='model product-data')
        sku_element = soup.find('div', class_='sku product-data')
        rating_reviews_element = soup.find('p', class_='visually-hidden')
        price_element = soup.find('div', class_='priceView-hero-price priceView-customer-price')
        if heading_element and heading_element.text:
            heading_text = heading_element.text.strip()
            print(f"Heading: {heading_text}")
        if model_element and model_element.text:
            model_text = model_element.text.strip()
            print(f"Model Number: {model_text}")
        if sku_element and sku_element.text:
            sku_text = sku_element.text.strip()
            print(f"SKU Number: {sku_text}")
        if rating_reviews_element and rating_reviews_element.text:
            rating_reviews_text = rating_reviews_element.text.strip()
            match = re.search(r'([\d.]+) out of 5 stars with (\d+) reviews', rating_reviews_text)
            if match:
                rating, reviews = match.groups()
                print(f"Rating: {rating}")
                print(f"Reviews: {reviews}")
            else:
                print("Unable to extract rating and reviews.")
        if price_element and price_element.text:
            price_match = re.search(r'\$([\d,.]+)', price_element.text.strip())
            if price_match:
                price_text = price_match.group(1).replace(',', '')
                print(f"Price: ${price_text}")
            else:
                print("Unable to extract price.")
    except requests.exceptions.HTTPError as err:
        print(f"Failed to retrieve the page. HTTP error: {err}")
    except requests.exceptions.RequestException as err:
        print(f"Failed to retrieve the page. Error: {err}")
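
The rating and price patterns used in this cell can be pulled out into small helpers and checked without any network access. The sketch below is one possible arrangement, not the notebook's own code. The function names are hypothetical, and the support for thousands separators in the review count and for the singular "review" is an assumption about how the page text may vary.

```python
import re

# Same phrasing as the pattern in the cell above, extended (as an assumption)
# to accept "1,234 reviews" and the singular "1 review"
RATING_RE = re.compile(r'([\d.]+) out of 5 stars with ([\d,]+) reviews?')
PRICE_RE = re.compile(r'\$([\d,]+(?:\.\d+)?)')


def parse_rating_reviews(text):
    """Return (rating, review_count), or (None, None) if the text does not match."""
    match = RATING_RE.search(text)
    if not match:
        return None, None
    rating = float(match.group(1))
    reviews = int(match.group(2).replace(',', ''))
    return rating, reviews


def parse_price(text):
    """Return the first dollar amount in the text as a float, or None."""
    match = PRICE_RE.search(text)
    if not match:
        return None
    return float(match.group(1).replace(',', ''))


print(parse_rating_reviews("Rating 4.6 out of 5 stars with 1,234 reviews"))
print(parse_price("$1,299.99"))
```

Returning numbers rather than printing strings makes the values easier to store in a dataframe later.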

Feeding the links collected by the first part into the second part, and storing each product's extracted fields as a row in a pandas dataframe, makes the webscraping process autonomous.
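
The connection between the two parts can be sketched as follows. This is a minimal illustration under stated assumptions: `scrape_to_dataframe`, `COLUMNS`, and `fake_extract` are hypothetical names, and `extract` stands for a function wrapping the Product Information Extractor so that it returns a dict instead of printing. The stand-in extractor returns placeholder values so the sketch runs without network access.

```python
import pandas as pd

# Column names follow the fields listed in this notebook, plus the source URL
COLUMNS = ["Heading", "Model Number", "SKU Number",
           "Rating", "Reviews", "Price", "URL"]


def scrape_to_dataframe(links, extract):
    """Call extract(url) for every link and collect the results in a dataframe.

    extract is assumed to return a dict keyed by the field names above,
    or None when a page could not be retrieved or parsed.
    """
    records = []
    for url in links:
        record = extract(url)
        if record is None:  # skip products that failed to scrape
            continue
        records.append({**record, "URL": url})
    return pd.DataFrame(records, columns=COLUMNS)


def fake_extract(url):
    """Stand-in extractor with placeholder data; makes no HTTP request."""
    if url.endswith("/missing"):
        return None
    return {"Heading": "Example Laptop", "Model Number": "EX-1",
            "SKU Number": "0000000", "Rating": 4.5,
            "Reviews": 10, "Price": 999.99}


df = scrape_to_dataframe(
    ["https://example.com/product", "https://example.com/missing"],
    fake_extract,
)
print(df)
```

In the real notebook, `links` would be `collected_links` from the Product Link Extractor, and a short pause between requests would keep the load on the site low.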
