# 📘 Web Scraping

✍️ Aziz Ullah Khan | 📅 July 16, 2024

---

## 🚀 Overview

Welcome to Day 21 of our journey! Today, we'll dive into the world of web scraping. This notebook will guide you from the basics to advanced techniques, using popular Python libraries.



---

## 📚 Table of Contents

1. [Introduction to Web Scraping](#Introduction-to-Web-Scraping)
2. [Tools and Libraries](#Tools-and-Libraries)
3. [Practical Examples](#Practical-Examples)
4. [Advanced Techniques](#Advanced-Techniques)
5. [Best Practices and Legal Considerations](#Best-Practices-and-Legal-Considerations)


Let's dive in!

## Introduction to Web Scraping

### What is Web Scraping?

Web scraping is the process of extracting data from websites. It involves fetching the web page, parsing the HTML or XML content, and extracting useful information.
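The parse and extract steps described above can be sketched without any third-party packages. This is only an illustration: the notebook itself uses BeautifulSoup for this work, and the standard-library `html.parser` module is swapped in here so the example runs offline. `SAMPLE_HTML` and `LinkExtractor` are hypothetical names, and the markup stands in for a page that would normally be fetched first.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a fetched page (no network request is made)
SAMPLE_HTML = """
<html><body>
  <h1>Headlines</h1>
  <a href="/national">National</a>
  <a href="/sports">Sports</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        # Remember the href while we are inside an <a> element
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text found inside an <a> element becomes the link label
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


parser = LinkExtractor()
parser.feed(SAMPLE_HTML)  # parse step
print(parser.links)       # extract step
```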

### Use Cases and Applications

- Data analysis
- Price monitoring
- News aggregation
- Market research
- Social media sentiment analysis


## Required Packages

In [34]:
!pip install requests beautifulsoup4 selenium scrapy

## Tools and Libraries

### BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree from a page, which you can then search and navigate to extract data.

In [4]:
from bs4 import BeautifulSoup
import requests

url = 'https://www.thenews.com.pk/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)

The News International: Latest News Breaking, Pakistan News


### Scrapy

Scrapy is an open-source web-crawling framework for Python. It is used to crawl websites, extract structured data, and process that data as needed.

### Selenium

Selenium is a tool for automating a web browser from code. It's useful for scraping websites whose content is loaded dynamically, typically by JavaScript.

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.thenews.com.pk/')
print(driver.title)
driver.quit()

The News International: Latest News Breaking, Pakistan News


## Practical Examples

### Extracting data from HTML

In this section, we'll demonstrate how to extract specific data from a web page using BeautifulSoup.

In [10]:
url = 'https://www.thenews.com.pk/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a')[:5]: # display only 5
    print(link.get('href'))

https://www.thenews.com.pk/
https://www.thenews.com.pk/latest-stories
https://www.thenews.com.pk/latest/category/national
https://www.thenews.com.pk/latest/category/sports
https://www.thenews.com.pk/latest/category/world


### Navigating and parsing websites

BeautifulSoup allows you to navigate the parse tree and extract data from nested elements.

In [13]:
for item in soup.find_all('li')[:5]: # display only 5
    print(item.text)

Latest News
National
Sports
World
Business
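To make the idea of navigating nested elements more concrete, here is a minimal sketch on an inline HTML snippet rather than a live page. The markup and the class names `menu` and `item` are made up for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumptions for this example
html = """
<ul class="menu">
  <li class="item"><a href="/national">National</a></li>
  <li class="item"><a href="/sports">Sports</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

menu = soup.find("ul", class_="menu")            # locate the parent element
for item in menu.find_all("li", class_="item"):  # then search only inside it
    link = item.find("a")                        # step down to the nested <a>
    print(link.get_text(strip=True), link["href"])

# A CSS selector can express the same nesting in a single call
hrefs = [a["href"] for a in soup.select("ul.menu li.item > a")]
print(hrefs)
```

Searching from `menu` instead of `soup` keeps the match limited to that list, which helps on pages that reuse the same tags in several places.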


#### Find the element by class name (Selenium)

In [32]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Open the specified URL
driver.get('https://www.thenews.com.pk/')

# Find the element by class name
content = driver.find_element(By.CLASS_NAME, 'siteContent')

# Print the text of the element
print(content.text[:100]) # display only 100 characters

# Close the WebDriver
driver.quit()

LIVE
Displacement, shortages of food, hospitals endanger pregnant women in Gaza: UN
Water scarcity, 


## Advanced Techniques

### Bypassing anti-scraping mechanisms


Websites often have measures to detect and block web scraping. To bypass these mechanisms, you can:

- Rotate IP addresses
- Use proxies
- Set appropriate headers

Example code for setting headers:

In [11]:
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
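Rotating proxies can be sketched in a similar way. The example below is one possible approach, not a definitive one: it cycles through a small pool and builds the keyword arguments that `requests.get()` accepts (`headers`, `proxies`, `timeout`). The proxy addresses and the helper name `next_request_kwargs` are hypothetical, and the actual request is left commented out so the block runs without network access.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you are allowed to use
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]
proxy_pool = cycle(PROXIES)  # endless round-robin iterator over the pool


def next_request_kwargs(headers):
    """Return keyword arguments for requests.get() using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {
        "headers": headers,
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
for _ in range(3):
    kwargs = next_request_kwargs(headers)
    print(kwargs["proxies"]["http"])
    # response = requests.get(url, **kwargs)  # needs `import requests` and network access
```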

### Managing Sessions and Cookies

Handling sessions and cookies is crucial for scraping websites that require login.

Example code for managing sessions and cookies:


In [16]:
import requests

login_url = 'http://example.com/login'
url = 'http://example.com/some-page' 
payload = {'username': 'user', 'password': 'pass'}

session = requests.Session()
session.post(login_url, data=payload)
response = session.get(url)

print(response.text[:100])


<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <m
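The example above does not check whether either request succeeded. A slightly more defensive sketch is shown below. The helper name `fetch_after_login` is hypothetical, and it assumes `session` behaves like `requests.Session`. Note that a successful status code does not prove the login itself worked, since many sites return 200 for a failed login page, so a real scraper would also inspect the response body or cookies.

```python
def fetch_after_login(session, login_url, page_url, payload, timeout=10):
    """Log in with `session`, then fetch `page_url` using the same cookies.

    `session` is assumed to behave like requests.Session: .post() and .get()
    return a response object with .raise_for_status() and .text.
    """
    login_response = session.post(login_url, data=payload, timeout=timeout)
    login_response.raise_for_status()  # stop early if the login request failed

    # The session sends any cookies set during login with this request
    page_response = session.get(page_url, timeout=timeout)
    page_response.raise_for_status()
    return page_response.text


# Usage with the requests library (commented out: needs network access
# and a real login endpoint):
# import requests
# with requests.Session() as session:
#     html = fetch_after_login(session, login_url, url, payload)
#     print(html[:100])
```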


### Scraping Large Websites Efficiently

Using Scrapy, you can build a spider to crawl and scrape large websites efficiently.

Example code for a simple Scrapy spider:

In [3]:
# Import the necessary modules
import scrapy
from scrapy.crawler import CrawlerProcess
import json

# Define a Scrapy Spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

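# Run the Scrapy Spider
process = CrawlerProcess(settings={
    "FEEDS": {
        # Overwrite on re-runs; appending to an existing file would produce invalid JSON
        "quotes.json": {"format": "json", "overwrite": True},
    },
})

process.crawl(QuotesSpider)
# Note: the Twisted reactor cannot be restarted, so re-running this cell requires a kernel restart
process.start()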

# Read the Scraped Data
with open('quotes.json') as f:
    quotes = json.load(f)

for quote in quotes[:5]:
    print(f"'{quote['text']}' - {quote['author']}")


'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”' - Albert Einstein
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”' - J.K. Rowling
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”' - Albert Einstein
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”' - Jane Austen
'“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”' - Marilyn Monroe
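On larger crawls, it usually helps to tell Scrapy how aggressively it may request pages. The settings below are standard Scrapy options, but the specific values and the name `polite_settings` are illustrative assumptions rather than recommendations from this notebook.

```python
# Politeness-related Scrapy settings; the values are illustrative assumptions
polite_settings = {
    "ROBOTSTXT_OBEY": True,               # skip URLs disallowed by robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests to the same site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # cap parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "FEEDS": {"quotes.json": {"format": "json"}},
}

# The dictionary would be passed to the crawler in place of the inline settings:
# process = CrawlerProcess(settings=polite_settings)
```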


## Best Practices and Legal Considerations

### Ethical scraping

Always follow ethical guidelines when scraping websites. Make sure to:

- Respect the website's `robots.txt` file
- Avoid overloading the server with too many requests
- Use appropriate headers to simulate a real user
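One simple way to avoid overloading a server is to enforce a minimum delay between requests. The sketch below assumes a hypothetical `Throttle` helper and a short delay of 0.2 seconds so it runs quickly; a real scraper would likely use a longer delay. The URLs are placeholders and the request itself is commented out.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_delay - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)  # pause only for the time still owed
        self._last_call = time.monotonic()


throttle = Throttle(min_delay=0.2)  # hypothetical delay, in seconds
urls = [
    "http://example.com/page-1",
    "http://example.com/page-2",
    "http://example.com/page-3",
]

start = time.monotonic()
for url in urls:
    throttle.wait()
    # response = requests.get(url, timeout=10)  # needs `import requests` and network access
elapsed = time.monotonic() - start
print(f"Visited {len(urls)} URLs in {elapsed:.2f} seconds")
```

The first call to `wait()` returns immediately, so three URLs incur two delays.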

### Respecting robots.txt

`robots.txt` is a file that webmasters use to give instructions about their site to web robots. Always check and respect the directives in this file.


In [8]:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://example.com/some-page'))

True
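To see how individual directives affect the result, the parser can also be fed rules directly with `parse()` instead of fetching them. The rules below are a hypothetical `robots.txt`, not the contents of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline instead of fetched
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

print(rp.can_fetch("*", "http://example.com/public/page"))   # path not covered by a Disallow rule
print(rp.can_fetch("*", "http://example.com/private/page"))  # path under /private/
print(rp.crawl_delay("*"))                                    # delay requested by the site, in seconds
```

When a site specifies `Crawl-delay`, the value returned by `crawl_delay()` is a natural choice for the pause between requests.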


🌐 Feel free to connect with [me](https://www.linkedin.com/in/aziz-ullah-khan/) if you have questions or want to discuss this fascinating journey further! Let's continue exploring together.