Q1. What is Web Scraping? Why is it Used? Give three areas where Web Scraping is used to get data.
ans.

**Web Scraping** is the process of automatically extracting data from websites using software tools or scripts. It involves fetching web pages and parsing their content to collect specific information, often in a structured format like CSV, Excel, or a database.



**Why is it Used?**

Web scraping is used to gather large amounts of data quickly and efficiently from websites where APIs may not be available or accessible. It helps in:

* Collecting data for analysis or comparison
* Automating repetitive data-gathering tasks
* Monitoring changes on websites (e.g., price changes, news updates)


**Three Areas Where Web Scraping is Used:**

1. **E-commerce and Price Monitoring:**
   Businesses scrape competitor websites to track product prices, reviews, and availability.

2. **Market Research and Data Analytics:**
   Companies collect user opinions, ratings, and comments from review sites or forums to analyze market trends.

3. **Job Aggregation and Recruitment:**
   Job portals use scraping to gather job listings from multiple company websites and job boards.



Q2. What are the different methods used for Web Scraping?


Answer:

There are several methods used for web scraping, depending on the complexity of the website and the type of data required. Here are the most commonly used methods:


### 1. **Manual Copy-Pasting**

* The simplest form of scraping.
* Involves manually copying data from a webpage and pasting it into a file or spreadsheet.
* Suitable for small amounts of data.


### 2. **HTML Parsing**

* Uses libraries like **BeautifulSoup** (Python) or **Cheerio** (JavaScript) to parse HTML content.
* Allows easy navigation and extraction of specific elements using tags, classes, or IDs.
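
As a minimal sketch of HTML parsing, the snippet below uses BeautifulSoup to locate elements by ID, class, and tag. The markup, the `catalog` ID, and the `product` and `price` class names are made up for illustration; a real page would have its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded product page.
html = """
<div id="catalog">
  <div class="product"><h2>Laptop</h2><span class="price">$999</span></div>
  <div class="product"><h2>Phone</h2><span class="price">$599</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container by ID, then each product by class, then fields by tag.
catalog = soup.find(id="catalog")
products = []
for item in catalog.find_all("div", class_="product"):
    name = item.find("h2").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    products.append((name, price))

print(products)  # [('Laptop', '$999'), ('Phone', '$599')]
```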



### 3. **DOM Parsing**

* Accesses the Document Object Model (DOM) of the webpage using browsers or tools like **Selenium** or **Puppeteer**.
* Useful for scraping data from JavaScript-heavy websites that require interaction (like clicking or scrolling).



### 4. **XPath and CSS Selectors**

* These are methods used to locate elements within the HTML tree.
* XPath supports richer navigation, such as moving from an element to its parent or matching on text content, while CSS selectors are shorter and easier to read for straightforward structures.
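
The sketch below selects the same elements once with an XPath expression and once with a CSS selector. It assumes well-formed markup so that the standard library's `xml.etree.ElementTree` can parse it; note that ElementTree supports only a limited subset of XPath, and full XPath needs a library such as lxml. The markup and the `jobs`, `job`, and `ad` names are hypothetical.

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Hypothetical, well-formed markup standing in for a job listing page.
markup = """<html><body>
<ul id="jobs">
  <li class="job"><a href="/jobs/1">Data Engineer</a></li>
  <li class="job"><a href="/jobs/2">Backend Developer</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
</body></html>"""

# XPath: every <a> that is a direct child of an <li class="job">.
root = ET.fromstring(markup)
xpath_titles = [a.text for a in root.findall(".//li[@class='job']/a")]

# CSS selector: the same elements, expressed with #id, .class, and the > combinator.
soup = BeautifulSoup(markup, "html.parser")
css_titles = [a.get_text() for a in soup.select("ul#jobs li.job > a")]

print(xpath_titles)  # ['Data Engineer', 'Backend Developer']
print(css_titles)    # ['Data Engineer', 'Backend Developer']
```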



### 5. **Regular Expressions (Regex)**

* Extracts data based on patterns found in the raw HTML content.
* Fast but fragile – small changes in structure can break the extraction.
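
A small illustration of both points, using a made-up HTML fragment: the pattern extracts prices quickly, but an extra class on the tag is enough to make it miss.

```python
import re

# Hypothetical raw HTML as it might arrive from a server.
raw_html = '<span class="price">$1,299.00</span><span class="price">$49.50</span>'

# Capture the numeric part of each price, allowing thousands separators.
PRICE_PATTERN = re.compile(r'<span class="price">\$([\d,]+\.\d{2})</span>')

prices = [float(m.replace(",", "")) for m in PRICE_PATTERN.findall(raw_html)]
print(prices)  # [1299.0, 49.5]

# Fragility: the site adds a second class and the pattern no longer matches.
changed_html = '<span class="price sale">$49.50</span>'
print(PRICE_PATTERN.findall(changed_html))  # []
```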


### 6. **APIs (When Available)**

* Many websites provide APIs for structured access to data.
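* When an API exists, it is usually the most reliable option, and using it within the provider's terms is the approach least likely to raise legal or policy issues.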
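
To show why API data is easier to work with, here is a sketch that turns a JSON response body into rows. The payload shape (`items`, `title`, `price`) is an assumption for illustration only; every API defines its own schema. In a real project the body would typically come from something like `requests.get(url, timeout=10).text`.

```python
import json

# Hypothetical response body from a product API.
sample_body = (
    '{"items": [{"title": "Laptop", "price": 999.0},'
    ' {"title": "Phone", "price": 599.0}]}'
)

def extract_products(body: str) -> list:
    """Parse a JSON body and return (title, price) tuples."""
    data = json.loads(body)
    return [(item["title"], item["price"]) for item in data["items"]]

print(extract_products(sample_body))  # [('Laptop', 999.0), ('Phone', 599.0)]
```

No HTML parsing or selectors are needed here, which is why a layout change on the website does not break this kind of extraction.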


### 7. **Web Scraping Tools and Frameworks**

* Tools like **Scrapy**, **Octoparse**, **ParseHub**, and **WebHarvy** provide user-friendly or programmatic ways to scrape data efficiently.



Q3. What is Beautiful Soup? Why is it used?

Answer:

Beautiful Soup is a Python library used for parsing HTML and XML documents. It creates a parse tree from the page source code that can be used to extract data easily.

**Why is it Used?**

Beautiful Soup is used in web scraping to:

* Navigate the structure of a web page (HTML/XML).
* Search and locate elements using tags, attributes, classes, or IDs.
* Extract the required data from specific parts of the page.
* Clean up and convert poorly structured or malformed HTML.

**Key Features:**

* Works well with various HTML parsers (like `html.parser`, `lxml`, or `html5lib`).
* Handles broken HTML gracefully.
* Easy to integrate with other libraries like `requests` for downloading web pages.



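```python
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()               # stop early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title of the page (soup.title is None if the page has no <title>)
title = soup.title.get_text(strip=True) if soup.title else "(no title)"
print("Page Title:", title)
```

```
Page Title: Example Domain
```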
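
The points above about searching by class or attribute and coping with broken HTML can be sketched without any network access. The markup below is invented and deliberately malformed; the `review` class and `data-rating` attribute are hypothetical names. Different parsers may repair broken markup in different ways, so this shows the behaviour of `html.parser` only.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately malformed markup: the <b> tags are never closed
# and the last <div> has no closing tag.
broken_html = (
    '<div class="review" data-rating="5"><b>Great product</div>'
    '<div class="review" data-rating="2"><b>Stopped working'
)

soup = BeautifulSoup(broken_html, "html.parser")

# Search by tag and class, then read an attribute and the text of each match.
reviews = [
    (int(div["data-rating"]), div.get_text(strip=True))
    for div in soup.find_all("div", class_="review")
]
print(reviews)  # [(5, 'Great product'), (2, 'Stopped working')]

# Search by attribute value instead of class.
top_rated = soup.find("div", attrs={"data-rating": "5"})
print(top_rated.get_text(strip=True))  # Great product
```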


Q4. Why is flask used in this Web Scraping project?

Answer:

Flask is a lightweight and flexible web framework for Python. In a web scraping project, Flask is used to create a web application or interface that allows users to interact with the scraped data.

**Why Flask is Used in Web Scraping Projects:**

* **To display scraped data in a web browser:** Flask presents the scraped data in a user-friendly format (like a table or dashboard) via HTML templates.
* **To trigger scraping through a web interface:** You can create a button or form on a webpage that, when clicked or submitted, runs the scraping script.
* **To serve scraped data as an API:** Flask can return scraped data as JSON from HTTP endpoints, allowing other apps or systems to access the data.
* **To run or monitor scraping jobs:** A simple interface built with Flask can help you start scraping jobs manually and check their results.

**Example Use Case:**

1. A user visits your Flask web app.
2. They enter a product name in a search box.
3. Flask calls the scraping function to collect product data from a website.
4. The results are displayed in the browser as soon as the scrape finishes.

Example:

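```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    # Landing page; a real project would render the scraped data here.
    return "<h1>Welcome to the Web Scraper</h1>"

if __name__ == '__main__':
    # debug=True enables auto-reload and the interactive debugger; use it only in development.
    app.run(debug=True)
```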
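
The search use case described above could be wired up roughly as follows. This is only a sketch: the `/search` route, the `q` query parameter, and `scrape_products` (which returns canned data instead of scraping a real site) are all hypothetical names, not part of the original project.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def scrape_products(query: str) -> list:
    """Hypothetical stand-in for the project's scraping function."""
    catalog = [{"name": "Laptop", "price": 999}, {"name": "Phone", "price": 599}]
    return [p for p in catalog if query.lower() in p["name"].lower()]

@app.route("/search")
def search():
    # Read the product name from the query string, e.g. /search?q=laptop
    query = request.args.get("q", "").strip()
    if not query:
        return jsonify(error="missing query parameter 'q'"), 400
    # Run the scraper and return the results as JSON.
    return jsonify(query=query, results=scrape_products(query))
```

Requesting `/search?q=lap` would return the matching items as JSON, while a request without `q` gets a 400 response. The app can be started the same way as the previous example, with `app.run(debug=True)` during development.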


Q5. Write the names of AWS services used in this project. Also, explain the use of each service.

ans.


###  **1. Amazon EC2 (Elastic Compute Cloud)**

* **Use:**
  EC2 provides virtual servers (instances) where your **web scraping scripts and Flask app** can run.
* **Why it's used:**
  To host the scraper and the Flask app so they can run continuously or on demand, with instance capacity adjusted as the load changes.



###  **2. Amazon S3 (Simple Storage Service)**

* **Use:**
  Used to **store scraped data**, such as CSV, JSON, or Excel files, and also to store logs or backups.
* **Why it's used:**
  For reliable, scalable, and cost-effective data storage.
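
One way the upload step might look is sketched below. It assumes `s3_client` is a client created with `boto3.client("s3")` and that AWS credentials are already configured; the bucket name and object key in the usage note are placeholders. The client is passed in as a parameter so the CSV-building logic can be tried without touching AWS.

```python
import csv
import io

def rows_to_csv_bytes(rows, fieldnames):
    """Serialize a list of dicts to CSV and return it as UTF-8 bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def upload_scraped_rows(s3_client, bucket, key, rows, fieldnames):
    """Upload scraped rows to S3 as a single CSV object.

    s3_client is assumed to be a boto3 S3 client (boto3.client("s3")).
    Returns the number of bytes uploaded.
    """
    body = rows_to_csv_bytes(rows, fieldnames)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body, ContentType="text/csv")
    return len(body)
```

A call could look like `upload_scraped_rows(boto3.client("s3"), "my-scraper-bucket", "scrapes/products.csv", rows, ["name", "price"])`, where the bucket and key are hypothetical.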



###  **3. Amazon RDS (Relational Database Service)**

* **Use:**
  Stores structured scraped data in relational databases like MySQL, PostgreSQL, etc.
* **Why it's used:**
  For organizing and querying large amounts of data efficiently.
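
To illustrate storing and querying structured scraped data, the sketch below uses Python's built-in `sqlite3` as a local stand-in for an RDS database. This is a substitution for demonstration only: with MySQL or PostgreSQL on RDS you would connect through a driver for that engine, and the placeholder style and the SQLite-specific `INSERT OR REPLACE` would differ. The `products` table and its columns are hypothetical.

```python
import sqlite3

def save_products(conn, products):
    """Create the table if needed and upsert (name, price) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "name TEXT PRIMARY KEY, price REAL NOT NULL)"
    )
    # INSERT OR REPLACE is SQLite syntax; other engines use their own upsert form.
    conn.executemany(
        "INSERT OR REPLACE INTO products (name, price) VALUES (?, ?)", products
    )
    conn.commit()

def cheapest(conn, limit=1):
    """Return the lowest-priced products as (name, price) tuples."""
    return conn.execute(
        "SELECT name, price FROM products ORDER BY price ASC LIMIT ?", (limit,)
    ).fetchall()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
save_products(conn, [("Laptop", 999.0), ("Phone", 599.0)])
print(cheapest(conn))  # [('Phone', 599.0)]
```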


###  **4. AWS Lambda (Optional)**

* **Use:**
  Runs code in response to triggers (like a scheduled event) without provisioning servers.
* **Why it's used:**
  To **automate web scraping tasks** on a schedule (e.g., scrape every hour) without maintaining servers. A single invocation can run for at most 15 minutes, so this suits short scraping jobs.
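
A scheduled scrape on Lambda might be structured like this sketch. `lambda_handler(event, context)` follows the usual Python handler convention (the handler name is configurable); `run_scraper` is a hypothetical placeholder for the project's scraping routine, and the shape of the returned dictionary is an illustrative choice rather than a requirement for scheduled invocations.

```python
import json
from datetime import datetime, timezone

def run_scraper():
    """Hypothetical placeholder for the project's scraping routine."""
    return [{"name": "Laptop", "price": 999}]

def lambda_handler(event, context):
    """Entry point that Lambda calls on each trigger.

    For a scheduled trigger, `event` describes the schedule event and
    `context` carries runtime information; neither is needed here.
    """
    items = run_scraper()
    summary = {
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "count": len(items),
    }
    return {"statusCode": 200, "body": json.dumps(summary)}
```

The handler can be exercised locally by calling `lambda_handler({}, None)` before deploying it.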



###  **5. Amazon CloudWatch**

* **Use:**
  Monitors logs and performance of the web scraping application.
* **Why it's used:**
  For error tracking, logging, and setting up alerts if the scraping fails or behaves unexpectedly.



###  **6. Amazon Route 53 (Optional)**

* **Use:**
  Manages custom domains for your Flask web app.
* **Why it's used:**
  To map a domain like `www.my-scraper-app.com` to your EC2 instance.


###  **7. AWS IAM (Identity and Access Management)**

* **Use:**
  Manages **permissions and security** for users and applications accessing AWS services.
* **Why it's used:**
  To keep your AWS environment secure and grant limited access where necessary.

