In [27]:
# This notebook is made for learning purposes - Web Scraping

# Introduction

- Web scraping is an automated method to extract data from websites.
- It removes the need for manual copy-paste, saving time and effort.
- Python is commonly used for web scraping due to its simplicity and powerful libraries.
- Web scraping helps in collecting data for research, analysis, and monitoring.
- This project focuses on understanding basic web scraping concepts using Python.

**Market and Competitor Analysis:**
- Businesses collect product prices, customer reviews, and competitor details to track market trends and stay competitive.

**Financial Data Collection:**
- Investors and analysts extract stock prices, historical data, and financial reports to support informed decision-making.

**Social Media Monitoring:**
- Marketers analyze trends, customer sentiment, and campaign performance to improve engagement and strategy.

**SEO Tracking:**
- Companies monitor search engine rankings for keywords to optimize content and improve online visibility.

**Research and Machine Learning:**
- Researchers and data scientists gather large datasets to perform analysis and train machine learning models.

**Overall Benefit:**
- Web scraping makes data collection faster, more scalable, and less error-prone than manual methods.

**Techniques of Web Scraping**

> Manual Extraction:
Data is copied and pasted manually from websites. This method is simple but slow, inefficient, and not suitable for large or frequently updated data.

> Automated Extraction:
Uses scripts or software to collect data automatically. It is faster, more reliable, and suitable for large-scale scraping. Common automated techniques include:

  > *HTML Parsing:* Extracting data from the raw HTML of static web pages.
> 
  > *DOM Parsing:* Working with the Document Object Model to extract dynamically loaded content.
> 
  > *API Access:* Using official APIs to fetch structured data when available (preferred over scraping).
> 
  > *Headless Browsers (driven by tools such as Selenium):* Simulating real user actions to scrape JavaScript-heavy or interactive websites.
> 
  > *Technique Selection:* The scraping method depends on the website's structure, complexity, and data format.
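
To make the HTML parsing technique concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module rather than any of the libraries covered later. The `LinkCollector` class and the `SAMPLE_HTML` string are made up for illustration; a real page would be fetched over HTTP first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag in a static HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Hypothetical static page used only for this example
SAMPLE_HTML = """
<ul>
  <li><a href="/page/1">First page</a></li>
  <li><a href="/page/2">Second <b>page</b></a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Libraries such as BeautifulSoup wrap this kind of low-level parsing in a friendlier search API, which is why they are usually preferred for real projects.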

**1. BeautifulSoup**

In [39]:
"""
BeautifulSoup is a Python library used to extract data from HTML and XML documents.
It converts webpage content into a structured format, making data extraction simple and efficient.

Steps Involved in Web Scraping:
------------------------------
1. Send HTTP Request
2. Parse HTML Content
3. Extract Required Data
4. Store Data for Future Use
"""

"""-------------------------------
### Install Required Libraries
-------------------------------"""

# pip install requests
# pip install beautifulsoup4

# Run the two commands above in a terminal to install the packages


"""-------------------------------
### Fetch HTML Content
-------------------------------"""
import requests
url = "https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/"
response = requests.get(url) 
print(response.text) 

# Explanation:
# Sends a GET request to the given URL
# response.text returns the raw HTML content of the webpage



"""-------------------------------
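### Handling 403 Forbidden Error (Optional)
-------------------------------"""
# Some sites reject the default python-requests User-Agent with 403 Forbidden.
# Sending a browser-like User-Agent header often resolves this.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)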



"""-------------------------------
### Parse HTML Using BeautifulSoup
-------------------------------"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

# Explanation:
# Converts raw HTML into a structured parse tree
# html.parser is Python's built-in HTML parser



"""-------------------------------
### Extract Specific Data (Example: Inspirational Quotes)
-------------------------------"""
import requests
from bs4 import BeautifulSoup

url = "https://www.passiton.com/inspirational-quotes"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

quotes = []

quote_boxes = soup.find_all(
    'div',
    class_='col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'
)

for box in quote_boxes:
    quote_text = box.img['alt'].split(" #")
    quote = {
        'theme': box.h5.text.strip(),
        'image_url': box.img['src'],
        'lines': quote_text[0],
        'author': quote_text[1] if len(quote_text) > 1 else 'Unknown'
    }
    quotes.append(quote)

# To inspect the first few results, uncomment the two lines below:
# for q in quotes[:5]:
#     print(q)

# Explanation:
# find_all() locates all quote containers using class names
# Extracts quote text, author, theme, and image URL
# Stores extracted data as a list of dictionaries
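
The `split(" #")` step is easy to misread, so here is a small offline sketch of the same logic. It assumes, as the code above does, that the image `alt` text holds the quote followed by `" #"` and the author. The helper name `split_alt_text` and both sample strings are made up for illustration.

```python
def split_alt_text(alt_text: str) -> dict:
    """Split an image alt text into quote lines and author, as the loop above does."""
    parts = alt_text.split(" #")
    return {
        "lines": parts[0],
        # Fall back to 'Unknown' when the alt text has no " #" separator
        "author": parts[1] if len(parts) > 1 else "Unknown",
    }


# Hypothetical alt texts, not taken from the real site
print(split_alt_text("A sample quote. #Sample Author"))
print(split_alt_text("A quote with no marker"))
```

The fallback matters because indexing `parts[1]` on an alt text without the separator would raise an `IndexError` and stop the whole loop.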



"""-------------------------------
### Understanding HTML Structure
-------------------------------"""
container = soup.find('div', attrs={'id': 'all_quotes'})

# soup.prettify() helps inspect HTML structure
# find() retrieves a single element
# find_all() retrieves multiple matching elements



"""-------------------------------
### Save Extracted Data to CSV
-------------------------------"""
import csv

filename = "quotes.csv"

with open(filename, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(
        file,
        fieldnames=['theme', 'image_url', 'lines', 'author']
    )
    writer.writeheader()
    for quote in quotes:
        writer.writerow(quote)

# Explanation:
# Creates a CSV file named quotes.csv
# Stores extracted data in a structured tabular format
# Data can be reused for analysis or reporting
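
To reuse the saved data later, the CSV can be read back with `csv.DictReader`. The sketch below writes two made-up rows to a temporary file and loads them again, so it runs without any scraping; the row contents and the temporary path are assumptions for illustration only.

```python
import csv
import os
import tempfile

fieldnames = ["theme", "image_url", "lines", "author"]

# Hypothetical rows in the same shape as the scraped quotes
rows = [
    {"theme": "KINDNESS", "image_url": "https://example.com/1.jpg",
     "lines": "Stay curious, keep learning.", "author": "Sample Author"},
    {"theme": "HOPE", "image_url": "https://example.com/2.jpg",
     "lines": "A second sample quote.", "author": "Unknown"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "quotes.csv")

    # Write the rows exactly as in the section above
    with open(path, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read them back: each CSV line becomes a dict keyed by the header row
    with open(path, newline="", encoding="utf-8") as file:
        loaded = list(csv.DictReader(file))

print(loaded == rows)
```

Values that contain commas, such as the first quote here, are quoted automatically by the `csv` module, so they survive the round trip unchanged.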


""" **CONCLUSION**
This project demonstrates how web scraping can be implemented using Python
and BeautifulSoup. It automates data collection, extracts useful information,
and stores it efficiently, making it a powerful tool for data analysis
and research.
"""


**2. Requests (Python)**

In [33]:
"""
PYTHON REQUESTS LIBRARY – COMPLETE DEMONSTRATION

The Python Requests library is a simple and powerful tool used to send HTTP
requests and interact with web resources. It supports GET, POST, PUT,
DELETE, PATCH, and HEAD requests and is widely used in REST APIs,
web scraping, and backend development.
"""

"""
-------------------------------
### Why Use Requests Library
-------------------------------
1. Simplifies HTTP requests
2. Manages headers, cookies, sessions, and authentication
3. Ideal for REST API consumption and testing
4. Supports all HTTP methods
5. Built-in SSL verification and error handling
"""

"""
-------------------------------
### Installation (Run in Terminal)
-------------------------------
"""

# pip install requests


"""
-------------------------------
### IMPORT REQUIRED LIBRARY
-------------------------------
"""
import requests


"""
-------------------------------
### REQUEST SYNTAX
-------------------------------
requests.get(url, params={key: value}, **kwargs)

Parameters:
- url      : Target URL (Required)
- params   : Query parameters (Optional)
- **kwargs : Headers, cookies, auth, timeout, proxies, SSL, etc.

Return Type:
- Response object
"""


"""
-------------------------------
### SIMPLE GET REQUEST
-------------------------------
"""
response = requests.get("https://example.com/")
print("Status Code:", response.status_code)

# A status code of 200 means the request succeeded


"""
-------------------------------
### GET REQUEST WITH PARAMETERS
-------------------------------
"""
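# Here the user is identified by the URL path, so no query string is sent.
# To send query parameters, pass a dict through params=, e.g. params={"per_page": 5}.
response = requests.get("https://api.github.com/users/octocat")
print("Status Code:", response.status_code)
print("Response Content:", response.content)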
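
The `params` argument from the syntax section turns a dictionary into the query string of the URL. The sketch below approximates that encoding with the standard library so it runs offline. `build_url` is a hypothetical helper, not part of Requests, and the GitHub search endpoint with its `q` and `per_page` parameters is used only as an illustration.

```python
from urllib.parse import urlencode


def build_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a URL-encoded query string."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params, doseq=True)}"


search_url = build_url(
    "https://api.github.com/search/repositories",
    {"q": "web scraping", "per_page": 5},
)
print(search_url)

# The equivalent Requests call (not executed here because it needs network access):
# response = requests.get(
#     "https://api.github.com/search/repositories",
#     params={"q": "web scraping", "per_page": 5},
#     timeout=10,
# )
# print(response.url)
```

Passing a dictionary is usually safer than building the query string by hand, because spaces and special characters are escaped for you.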


"""
-------------------------------
### HTTP REQUEST METHODS
-------------------------------
GET     - Retrieve information from server
POST    - Send data to server
PUT     - Replace existing resource
DELETE  - Delete a resource
HEAD    - Retrieve headers only
PATCH   - Apply partial updates
"""


"""
-------------------------------
### RESPONSE OBJECT EXAMPLE
-------------------------------
"""
response = requests.get("https://api.github.com/")
print("Final URL:", response.url)
print("Status Code:", response.status_code)


"""
-------------------------------
### COMMON RESPONSE ATTRIBUTES
-------------------------------
"""
print("Headers:", response.headers)
print("Encoding:", response.encoding)
print("Elapsed Time:", response.elapsed)
print("Is OK:", response.ok)


"""
-------------------------------
### POST REQUEST EXAMPLE
-------------------------------
"""
payload = {'username': 'test', 'password': 'test123'}
response = requests.post("https://httpbin.org/post", data=payload)
print("POST Response:", response.text)
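
`requests.post` accepts the payload either as `data=` (sent as a form-encoded body) or as `json=` (sent as a JSON body). The sketch below reproduces both encodings with the standard library to show the difference. It is an approximation for illustration and does not call Requests or the network.

```python
import json
from urllib.parse import urlencode

payload = {"username": "test", "password": "test123"}

# Roughly what data=payload sends (Content-Type: application/x-www-form-urlencoded)
form_body = urlencode(payload)

# Roughly what json=payload sends (Content-Type: application/json)
json_body = json.dumps(payload)

print("Form body:", form_body)
print("JSON body:", json_body)
```

Which one to use depends on what the server expects: HTML login forms usually take form-encoded data, while many REST APIs expect JSON.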


"""
-------------------------------
### AUTHENTICATION USING REQUESTS
-------------------------------
"""
from requests.auth import HTTPBasicAuth

response = requests.get(
    "https://api.github.com/user",
    auth=HTTPBasicAuth("user", "pass")
)
print("Auth Status Code:", response.status_code)

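# Replace "user" and "pass" with valid credentials.
# Note: GitHub's API no longer accepts account passwords for Basic auth;
# use a personal access token in place of the password.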


"""
-------------------------------
### SSL CERTIFICATE VERIFICATION
-------------------------------
"""
response = requests.get("https://expired.badssl.com/", verify=False)
print("SSL Bypass Status:", response.status_code)

# verify=False disables SSL verification (not recommended)


"""
-------------------------------
### SESSION OBJECTS
-------------------------------
"""
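session = requests.Session()

# A Session reuses the underlying connection and keeps cookies between requests.
# No cookie is set here, so the endpoint reports an empty cookie jar. To see
# persistence, first request https://httpbin.org/cookies/set/sessioncookie/123456789
session.get("https://httpbin.org/cookies")
response = session.get("https://httpbin.org/cookies")
print("Session Cookies:", response.text)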


"""
-------------------------------
### ERROR HANDLING
-------------------------------
"""
# try:
#    response = requests.get("https://www.example.com/", timeout=5)
#    response.raise_for_status()
#    print("Request Successful")
# except requests.exceptions.HTTPError as errh:
#    print("HTTP Error:", errh)
# except requests.exceptions.ConnectionError as errc:
#    print("Connection Error:", errc)
# except requests.exceptions.Timeout as errt:
#    print("Timeout Error:", errt)
# except requests.exceptions.RequestException as err:
#    print("Other Error:", err)
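
In the block above, `raise_for_status()` is what turns a bad status code into an `HTTPError`: it raises for 4xx and 5xx responses and does nothing otherwise. The offline sketch below mirrors that rule with the standard library; `describe_status` is a hypothetical helper written for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Explain how raise_for_status() would treat a given status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status

    if 400 <= code < 500:
        return f"{code} {phrase}: client error, HTTPError is raised"
    if 500 <= code < 600:
        return f"{code} {phrase}: server error, HTTPError is raised"
    return f"{code} {phrase}: no exception"


for status in (200, 403, 404, 503):
    print(describe_status(status))
```

Connection failures and timeouts never produce a status code at all, which is why the block above catches `ConnectionError` and `Timeout` separately.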


"""
-------------------------------
### CONCLUSION
-------------------------------
The Requests library provides a clean and efficient way to communicate
with web servers. It simplifies HTTP operations, supports APIs,
handles authentication and errors, and is widely used in
web scraping and backend development.
"""

Status Code: 200
Status Code: 200
Response Content: b'{"login":"octocat","id":583231,"node_id":"MDQ6VXNlcjU4MzIzMQ==","avatar_url":"https://avatars.githubusercontent.com/u/583231?v=4","gravatar_id":"","url":"https://api.github.com/users/octocat","html_url":"https://github.com/octocat","followers_url":"https://api.github.com/users/octocat/followers","following_url":"https://api.github.com/users/octocat/following{/other_user}","gists_url":"https://api.github.com/users/octocat/gists{/gist_id}","starred_url":"https://api.github.com/users/octocat/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/octocat/subscriptions","organizations_url":"https://api.github.com/users/octocat/orgs","repos_url":"https://api.github.com/users/octocat/repos","events_url":"https://api.github.com/users/octocat/events{/privacy}","received_events_url":"https://api.github.com/users/octocat/received_events","type":"User","user_view_type":"public","site_admin":false,"name":"The Octocat","company



SSL Bypass Status: 200
Session Cookies: {
  "cookies": {}
}




**3. Scrapy – Command Line Tools**

In [32]:
"""
Prerequisite: Implementing Web Scraping in Python with Scrapy

Scrapy is a Python framework used for web scraping and web crawling.
It uses Spiders to crawl web pages and extract data using selectors.
Scrapy is powerful, fast, and suitable for large-scale scraping tasks.
"""

"""
-------------------------------
### About Scrapy Spiders
-------------------------------
- Spiders crawl websites automatically
- They extract data using CSS/XPath selectors
- Can follow links and scrape multiple pages
- Ideal for structured and large-scale scraping
"""

"""
-------------------------------
### Creating a Scrapy Project
-------------------------------
Before starting:
1. Make sure Python is installed
2. Create and activate a virtual environment
3. Install Scrapy inside the virtual environment
"""

"""
-------------------------------
### Virtual Environment Setup (Terminal Commands)
-------------------------------
These commands should be run in terminal / command prompt
"""

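# python --version
# python -m venv scrapy_env
# cd scrapy_env
# cd Scripts
# activate
# cd ..
# (The Scripts folder exists on Windows. On macOS/Linux, run
#  "source scrapy_env/bin/activate" from the folder that contains scrapy_env.)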


"""
-------------------------------
### Install Scrapy and Create Project
-------------------------------
"""

# pip install scrapy
# scrapy startproject MyScrapyProject


"""
-------------------------------
### Create a Spider
-------------------------------
Change directory to project folder and generate spider
"""

# cd MyScrapyProject
# scrapy genspider quotes_spider https://quotes.toscrape.com/


"""
-------------------------------
### Scrapy Command-Line Help
-------------------------------
"""

# scrapy -h
# scrapy <command> -h


"""
-------------------------------
### Important Scrapy Commands
-------------------------------
bench     : Tests Scrapy performance on system
check     : Checks spider contracts
crawl     : Runs the spider and crawls data
edit      : Edits spider file
genspider : Creates a new spider
version   : Displays Scrapy version
view      : Opens response body in browser
list      : Lists all available spiders
parse     : Parses a URL using spider
settings  : Displays Scrapy settings
"""


"""
-------------------------------
### Examples of Scrapy Commands
-------------------------------
"""

# scrapy bench
# scrapy check quotes_spider
# scrapy crawl quotes_spider
# scrapy version
# scrapy view https://quotes.toscrape.com/
# scrapy list


"""
-------------------------------
### Custom Commands in Scrapy
-------------------------------
Scrapy allows creating custom command-line tools.
Custom commands are defined inside a commands folder, which must be a
Python package (it needs an __init__.py file) so that Scrapy can import it.
"""


"""
-------------------------------
### Configure Custom Commands
-------------------------------
Add the following line in settings.py
"""

# COMMANDS_MODULE = 'MyScrapyProject.commands'


"""
-------------------------------
### Create Custom Command File
-------------------------------
File path:
MyScrapyProject/commands/customcrawl.py
"""


"""
-------------------------------
### Custom Command Code
-------------------------------
"""
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):

    # Indicates that the Scrapy project is required
    requires_project = True

    # Syntax of the custom command
    def syntax(self):
        return '[options]'

    # Short description of the command
    def short_desc(self):
        return 'Runs the spider using a custom command'

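    # Main execution logic: run the first spider found in the project
    def run(self, args, opts):
        # spider_loader lists the names of the spiders defined in the project
        spider_list = self.crawler_process.spider_loader.list()
        self.crawler_process.crawl(spider_list[0], **opts.__dict__)
        self.crawler_process.start()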



"""
-------------------------------
### CONCLUSION
-------------------------------
Scrapy command-line tools provide powerful control over web scraping tasks.
They allow creating projects, managing spiders, crawling data, and building
custom commands. Scrapy is well-suited for scalable, automated, and
production-level web scraping projects.
"""



**4. Selenium – Components, Uses and Limitations**

In [20]:
"""
Selenium is a widely used open-source tool for automating web browsers.
It is primarily used for testing web-based applications and is highly
preferred for cross-browser testing and web automation.
"""

"""
-------------------------------
### Selenium Features
-------------------------------
- Cross-browser testing support
- Multi-language compatibility
- Easy interaction with web elements
- Faster performance compared to many tools
- Supports dynamic web elements
- Open-source and free to use
- Platform independent (Windows, macOS, Linux)
- Code reusability
"""

"""
-------------------------------
### Selenium Components
-------------------------------
Selenium has historically consisted of four major components
(Selenium RC is now deprecated and has been replaced by WebDriver):
1. Selenium IDE
2. Selenium RC (Remote Control)
3. Selenium WebDriver
4. Selenium Grid
"""

"""
-------------------------------
### 1. Selenium IDE
-------------------------------
Selenium IDE is a record-and-playback tool used for quick test creation.

Key Features:
- Record user interactions with web applications
- Playback recorded test cases
- Supports multiple browsers
- Inspect and identify web elements
- Debug test cases step-by-step
- Export tests to languages like Python, Java, C#
"""

"""
-------------------------------
### 2. Selenium RC (Remote Control)
-------------------------------
Selenium RC was an early Selenium tool that allowed writing tests
in multiple programming languages using a server as an intermediary.

Limitations of Selenium RC:
- Slower execution due to server dependency
- Complex API
- Less support for modern web technologies

WebDriver replaced Selenium RC due to better performance and simplicity.
"""

"""
-------------------------------
### 3. Selenium WebDriver
-------------------------------
Selenium WebDriver is the most widely used Selenium component.

Key Features:
- Direct communication with browsers
- No need for an intermediary server
- Faster and more stable execution
- Supports modern web technologies
- Rich APIs for browser actions
- Supports parallel execution
"""

"""
-------------------------------
### 4. Selenium Grid
-------------------------------
Selenium Grid allows running tests on multiple machines and browsers.

Key Benefits:
- Parallel execution of tests
- Supports multiple browsers and operating systems
- Central hub manages test execution
- Reduces overall testing time
"""

"""
-------------------------------
### Applications of Selenium
-------------------------------
- Automated Web Application Testing
- Cross-Browser Compatibility Testing
- Web Scraping of dynamic websites
- CI/CD Pipeline Integration (Jenkins, GitHub Actions)
- Functional Testing of web applications
"""

"""
-------------------------------
### Limitations of Selenium
-------------------------------
- Cross-browser behavior differences
- Slow execution for large applications
- Dynamic web elements often need explicit waits to be handled reliably
- No direct support for mobile app testing
- Limited support for desktop applications
"""

"""
-------------------------------
### CONCLUSION
-------------------------------
Selenium is a powerful and widely used automation tool for web applications.
It is best suited for browser-based testing and automation tasks. However,
it has limitations when dealing with dynamic elements, mobile apps, and
desktop applications. Understanding its strengths and limitations helps
in choosing the right tool for automation needs.
"""


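
The Selenium section above describes WebDriver and the difficulty of dynamic elements but contains no code, so here is a minimal scraping sketch. It assumes the `selenium` package (version 4) and a local Chrome installation, reuses the quotes site from the other sections, and assumes the `.quote`, `.text`, and `.author` class names. The function name `scrape_quotes` is made up for this example.

```python
def scrape_quotes(url="https://quotes.toscrape.com/", timeout=10):
    """Return a list of {'Author', 'Quote'} dicts scraped with Selenium."""
    # Imported here so that defining the function does not itself require
    # Selenium; calling it needs the selenium package and a local Chrome.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicit wait: block until at least one .quote element is in the DOM
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".quote"))
        )
        results = []
        for quote in driver.find_elements(By.CSS_SELECTOR, ".quote"):
            results.append({
                "Author": quote.find_element(By.CSS_SELECTOR, ".author").text,
                "Quote": quote.find_element(By.CSS_SELECTOR, ".text").text,
            })
        return results
    finally:
        driver.quit()  # always release the browser, even on errors
```

Calling `scrape_quotes()` launches the browser and returns the scraped list. The explicit wait is what makes this approach workable on pages that build their content with JavaScript after the initial load.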

**5. Scrape the Web with Playwright in Python**

In [31]:
"""
Playwright is a modern web testing and automation framework developed by Microsoft.
It is often regarded as faster, more reliable, and easier to use than Selenium. Playwright
supports Chromium, Firefox, and WebKit using a single API and is designed for
cross-browser web automation.
"""

"""
-------------------------------
### Features of Playwright
-------------------------------
- Headless execution
- Auto-wait for elements
- Network interception
- Mobile device emulation
- Geolocation and permission handling
- Shadow DOM support
- Screenshots, video, and HAR capture
- Isolated browser contexts
- Parallel execution
"""

"""
-------------------------------
### Advantages of Playwright
-------------------------------
- Cross-browser execution
- Open-source framework
- Well-documented
- Parallel test execution
- API testing support
- Context isolation
- Python language support
"""

"""
-------------------------------
### Creating a Python Virtual Environment
-------------------------------
Recommended to isolate dependencies using a virtual environment.
Run the following commands in terminal.
"""

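# virtualenv venv            (or, without extra packages: python -m venv venv)
# venv/Scripts/activate      (Windows; on macOS/Linux: source venv/bin/activate)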


"""
-------------------------------
### Installing and Setting Up Playwright
-------------------------------
"""

# pip install playwright
# playwright install


"""
-------------------------------
### Automating and Scraping a Webpage
-------------------------------
Target Website: https://quotes.toscrape.com/
"""


"""
-------------------------------
### Playwright Code Implementation
-------------------------------
Scrapes quotes and authors from the webpage
"""
from playwright.sync_api import sync_playwright


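def main():
    with sync_playwright() as p:
        # headless=False opens a visible browser window; use True to run in the background
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/')

        # Each quote on the page sits in an element with the class "quote"
        all_quotes = page.query_selector_all('.quote')

        for quote in all_quotes:
            text = quote.query_selector('.text').inner_text()
            author = quote.query_selector('.author').inner_text()
            print({'Author': author, 'Quote': text})

        # Keep the window open for 10 seconds so the page can be inspected
        page.wait_for_timeout(10000)
        browser.close()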


# To run the scraper as a script, uncomment the two lines below:
# if __name__ == '__main__':
#     main()


"""
-------------------------------
### CONCLUSION
-------------------------------
Playwright is a powerful and modern automation framework that simplifies
web scraping and testing. With its speed, reliability, and cross-browser
support, it is an excellent alternative to Selenium for handling dynamic
websites.
"""


In [28]:
import pandas as pd
from tabulate import tabulate

data = [
    {"Tool": "Requests", "Best For": "APIs, static websites", "Strengths": "Fast, lightweight, simple", "Limitations": "No JavaScript support"},
    {"Tool": "BeautifulSoup", "Best For": "HTML/XML parsing", "Strengths": "Beginner-friendly, clean parsing", "Limitations": "Not for dynamic websites or large-scale scraping"},
    {"Tool": "Scrapy", "Best For": "Large-scale web crawling", "Strengths": "High performance, asynchronous, scalable", "Limitations": "Steep learning curve, JS needs extra setup"},
    {"Tool": "Selenium", "Best For": "Dynamic websites, UI automation", "Strengths": "Simulates real user behavior", "Limitations": "Slow, resource-intensive"},
    {"Tool": "Playwright", "Best For": "Modern JavaScript-heavy websites", "Strengths": "Fast, reliable, auto-wait, cross-browser", "Limitations": "High system usage, overkill for static sites"}
]

df = pd.DataFrame(data)

print(tabulate(df, headers="keys", tablefmt="grid", showindex=False))

+---------------+----------------------------------+------------------------------------------+--------------------------------------------------+
| Tool          | Best For                         | Strengths                                | Limitations                                      |
+===============+==================================+==========================================+==================================================+
| Requests      | APIs, static websites            | Fast, lightweight, simple                | No JavaScript support                            |
+---------------+----------------------------------+------------------------------------------+--------------------------------------------------+
| BeautifulSoup | HTML/XML parsing                 | Beginner-friendly, clean parsing         | Not for dynamic websites or large-scale scraping |
+---------------+----------------------------------+------------------------------------------+--------------------------------------------------+
| Scrapy        | Large-scale web crawling         | High performance, asynchronous, scalable | Steep learning curve, 