<h1> Web Scraping with Python </h1>

Web scraping is an automated method of extracting large amounts of data from websites.
Common uses include:
- Price monitoring
- Market research
- News monitoring
- Sentiment analysis
- Email marketing
- Collecting data for machine learning and deep learning models

Data can be collected from a website either through an API that the site exposes or by parsing the HTML of its pages.
Several Python libraries and tools help with fetching and parsing web pages, including BeautifulSoup, Scrapy, Selenium, Playwright and ZenRows.

### Legal and Ethical considerations for web scraping
Just because data is available on the internet does not mean it can or should be scraped.
Always check the site's permissions: what data may be scraped, how much of it, and how you are allowed to use it.
A robots.txt file, served from the root of a website, specifies which pages can and can't be accessed for crawling and scraping.
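
A robots.txt check can be sketched with the standard-library `urllib.robotparser`. The rules and the `example.com` URLs below are made up for illustration and are parsed from a string so the snippet runs offline. For a real site one would call `set_url(...)` and `read()` instead.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used only for illustration
sample_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(sample_rules)  # parse rules from a list of lines (no network access)

# Ask whether any user agent ("*") may fetch each URL
allowed_public = rp.can_fetch("*", "https://example.com/catalogue/page-1.html")
allowed_private = rp.can_fetch("*", "https://example.com/private/data.html")
delay = rp.crawl_delay("*")  # seconds to wait between requests, if declared

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 5
```

A scraper would typically skip any URL for which `can_fetch` returns `False` and wait at least `delay` seconds between requests.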

## Regular Expressions

A regular expression (RegEx) is a sequence of characters that forms a search pattern. It can be used to check whether a string contains a specified pattern.
In web scraping, regular expressions are used to pick out pieces of text that follow a specific pattern. Python's built-in `re` module provides the following functions:

Function | Description
---------|------------
`findall(match_pattern, search_string)` | Returns a list containing all matches
`search(match_pattern, search_string)` | Returns a Match object if there is a match anywhere in the string (only first occurrence)
`split(match_pattern, search_string, maxsplit)` | Returns a list where the string has been split at each match
`sub(match_pattern, replacement, search_string, count)` | Replaces one or many matches with a string


In [None]:
import re

string = "Stop the spinning top."
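
As a quick illustration of the four functions in the table, the sketch below applies each of them to the sample string. The patterns are chosen only for demonstration.

```python
import re

string = "Stop the spinning top."

# findall: every non-overlapping match of "top", as a list
all_top = re.findall(r"top", string)

# search: only the first match, returned as a Match object (or None)
first_top = re.search(r"top", string)

# split: cut the string at runs of whitespace, at most 2 times
parts = re.split(r"\s+", string, maxsplit=2)

# sub: replace only the first occurrence of "top"
replaced = re.sub(r"top", "TOP", string, count=1)

print(all_top)           # ['top', 'top']
print(first_top.span())  # (1, 4)
print(parts)             # ['Stop', 'the', 'spinning top.']
print(replaced)          # STOP the spinning top.
```

Note that the first match sits inside "Stop", because a plain pattern does not care about word boundaries.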

**Match Object**

The Match object has information about the search and result.
- .span() returns a tuple with the start and end position of the match.
- .string returns the string passed to the search function.
- .group() returns the part of the string where the match was found



In [None]:
string = "Stop the spinning top."

x = re.search(r'spin', string)
print(x)    # x = None if the pattern is not found
print("span:", x.span())
print("string:", x.string)
print("group:", x.group())

| Symbol | Description |
|---|---|
| ^ | Matches at the beginning of the string (or of each line in multiline mode) |
| $ | Matches at the end of the string (or of each line in multiline mode) |
| . | Matches any character except a newline |
| \s | Matches whitespace |
| \S | Matches any non-whitespace character |
| * | Zero or more occurrences |
| + | One or more occurrences |
| ? | Zero or one occurrences |
| [aeiou] | Matches a single character in the listed set |
| [^XYZ] | Matches a single character not in the listed set |
| [a-z0-9] | The set of characters can include a range |
| ( | Indicates where string extraction is to start |
| ) | Indicates where string extraction is to end |
| \b | Returns a match where the specified characters are at the beginning or at the end of a word |
| \B | Returns a match where the specified characters are present, but NOT at the beginning or at the end of a word |
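
A few of the symbols in the table are easier to see in action. The sketch below uses a made-up sample string to contrast `\b` with `\B`, the `^` and `$` anchors, and parentheses for extraction.

```python
import re

text = "cat concat cathode"  # made-up sample string

# \b marks a word boundary: "cat" only where it starts a word
starts_word = re.findall(r"\bcat", text)

# \B is the opposite: "cat" only where it does NOT start a word
inside_word = re.findall(r"\Bcat", text)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", text))
ends_with_cat = bool(re.search(r"cat$", text))

# Parentheses extract only part of the match: the letters following "cat"
suffixes = re.findall(r"\bcat(\w+)", text)

print(len(starts_word))   # 2 ("cat" and the start of "cathode")
print(len(inside_word))   # 1 (the end of "concat")
print(starts_with_cat)    # True
print(ends_with_cat)      # False
print(suffixes)           # ['hode']
```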

In [None]:
txt1 = 'the cat sits on the cot.'

In [None]:
# words starting with c and ending with t

In [None]:
# find all 'the's

In [None]:
# find starting 'the's

In [None]:
# extract words
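
One possible way to fill in the four exercise cells above is sketched here. The patterns are suggestions rather than the only correct answers, and the variable names `ex1` to `ex4` are assumptions chosen to match the `ex5` used later.

```python
import re

txt1 = 'the cat sits on the cot.'

# Words starting with c and ending with t
ex1 = re.findall(r'\bc\w*t\b', txt1)

# All occurrences of 'the'
ex2 = re.findall(r'the', txt1)

# Only a 'the' at the very start of the string
ex3 = re.findall(r'^the', txt1)

# Every word (runs of letters, digits, or underscores)
ex4 = re.findall(r'\w+', txt1)

print(ex1)  # ['cat', 'cot']
print(ex2)  # ['the', 'the']
print(ex3)  # ['the']
print(ex4)  # ['the', 'cat', 'sits', 'on', 'the', 'cot']
```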

In [None]:
txt2 = 'THE CAT EATS A CARROT IN THE CT SCAN.'

ex5 = re.findall('C.?T', txt2)
print('C.?T:', ex5)


In [None]:
txt3 = """The majestic mountains towered above the quaint little village nestled in the valley, while the river meandered
gently through the lush greenery, creating a serene and picturesque landscape."""

vowels = re.findall('[aeiouAEIOU]', txt3)
print(len(vowels))
consonants = re.findall(r'[^aeiouAEIOU0-9\s\W_]', txt3)  # raw string keeps \s and \W intact
print(len(consonants))

## HTTP Libraries for Python
To scrape data from a website, a connection to it must first be established, usually over HTTP.

**urllib** <br>
urllib is a package in the Python Standard Library. It provides several modules for working with URLs:
- urllib.request for opening and reading URLs
- urllib.error containing the exceptions raised by urllib.request
- urllib.parse for parsing URLs
- urllib.robotparser for parsing robots.txt files <br><br>
It is suitable for simple tasks but lacks many features that are required for more complex web interactions.
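
As a small illustration of `urllib.parse`, the sketch below takes a URL apart and builds a new query string without touching the network. The `limit` parameter is purely illustrative and is not sent anywhere.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urljoin

url = "https://openlibrary.org/search/authors.json?q=twain"

# Split the URL into its components
parts = urlparse(url)
query = parse_qs(parts.query)  # query string as a dict of lists

# Build a new, properly escaped query string
new_query = urlencode({"q": "mark twain", "limit": 5})

# Resolve a root-relative path against the original URL
robots_url = urljoin(url, "/robots.txt")

print(parts.netloc)  # openlibrary.org
print(parts.path)    # /search/authors.json
print(query)         # {'q': ['twain']}
print(new_query)     # q=mark+twain&limit=5
print(robots_url)    # https://openlibrary.org/robots.txt
```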

**urllib3** <br>
urllib3 is a powerful, user-friendly third-party HTTP client for Python, which includes additional features like:
- Thread safety.
- Connection pooling.
- Client-side TLS/SSL verification.
- File uploads with multipart encoding.
- Helpers for retrying requests and dealing with HTTP redirects.
- Support for gzip, deflate, brotli, and zstd encoding.
- Proxy support for HTTP and SOCKS.
- 100% test coverage.
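
A minimal sketch of how connection pooling and retries might be configured with urllib3 is shown below. The retry counts, status codes, and timeouts are arbitrary example values, and `fetch_status` is a hypothetical helper that is defined but not called here, since calling it needs network access.

```python
import urllib3
from urllib3.util import Retry

# Retry up to 3 times on common transient errors, with exponential backoff
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])

# A single PoolManager reuses connections across requests to the same host
http = urllib3.PoolManager(
    retries=retry,
    timeout=urllib3.Timeout(connect=5.0, read=10.0),
)


def fetch_status(url):
    """Return the HTTP status code of a GET request (performs network access)."""
    response = http.request("GET", url)
    return response.status
```

Creating one `PoolManager` and reusing it for every request is what lets urllib3 keep connections open between calls.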

**Requests** <br>
Requests is a simple and elegant HTTP library for Python, built on top of urllib3.
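
To give a feel for the Requests API, the sketch below builds a GET request without sending it, so the encoded URL can be inspected offline. `fetch_json` is a hypothetical helper showing a typical call with a timeout and error check. It is defined but not called here.

```python
import requests

# Build (but do not send) a GET request; params are URL-encoded automatically
request = requests.Request(
    "GET",
    "https://openlibrary.org/search/authors.json",
    params={"q": "twain"},
)
prepared = request.prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://openlibrary.org/search/authors.json?q=twain


def fetch_json(url, params=None):
    """Send a GET request and return the decoded JSON body (network access)."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()
```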

## Understanding HTTP Headers
An HTTP header carries additional information (metadata) about a request or response. <br>
Headers are commonly grouped into three types: general, request and response headers.
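
Request headers can be inspected before anything is sent. The sketch below assumes the Requests library and a made-up `User-Agent` string. It prepares a request through a session so that the session's default headers are merged in.

```python
import requests

session = requests.Session()
# Hypothetical identifier; a real scraper should describe itself honestly
session.headers.update({"User-Agent": "my-scraper/0.1 (contact@example.com)"})

# Prepare the request through the session so default headers are included
request = requests.Request("GET", "https://openlibrary.org/search/authors.json?q=twain")
prepared = session.prepare_request(request)

# Print each request header that would be sent
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Header names are case-insensitive, so `prepared.headers["user-agent"]` returns the same value as `prepared.headers["User-Agent"]`.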

## Web Scraping with an API


In [None]:
import requests
import json

In [None]:
url = "https://openlibrary.org/search/authors.json?q=twain"
response = requests.get(url)

In [None]:
# View response headers
print("Response Headers:")
print(response.headers)

print("\n")

# View request headers (the headers that were actually sent)
print("Request Headers:")
print(response.request.headers)

In [None]:
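# View General Headers

# Accessing Request URL
print("Request URL:", response.request.url)

# Accessing Request Method
print("Request Method:", response.request.method)

# Accessing Status Code
print("Status Code:", response.status_code)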

In [None]:
from urllib import robotparser

In [None]:
def get_info(book_name):
    ...  # should return (subjects, characters) for the given book title

In [None]:
book1 = input('Enter 1st book name: ')
book2 = input('Enter 2nd book name: ')

In [None]:
sub1, chars1 = get_info(book1)
sub1 = set(sub1)
chars1 = set(chars1)
sub2, chars2 = get_info(book2)
sub2 = set(sub2)
chars2 = set(chars2)

common_sub = list(sub1 & sub2)
print('Common subjects are:',common_sub)
common_chars = list(chars1 & chars2)
print('Common characters are:',common_chars)
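
The cells above rely on `get_info`, whose body is not given. One way it might be written is sketched below, under several assumptions: that the Open Library `search.json` endpoint accepts `title`, `fields` and `limit` parameters, and that each result may carry `subject` and `person` lists. These should be checked against the Open Library API documentation. The sketch uses the standard-library `urllib.request` so that it has no third-party dependency, although `requests.get` would work equally well. `extract_info` and the sample payload are hypothetical, introduced so the parsing step can be tried offline.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://openlibrary.org/search.json"  # assumed endpoint


def extract_info(payload):
    """Pull (subjects, characters) out of a decoded search response.

    Assumes the first entry of payload["docs"] carries optional
    "subject" and "person" lists.
    """
    docs = payload.get("docs", [])
    if not docs:
        return [], []
    first = docs[0]
    return first.get("subject", []), first.get("person", [])


def get_info(book_name):
    """Look up a title and return (subjects, characters). Performs network access."""
    query = urllib.parse.urlencode(
        {"title": book_name, "fields": "subject,person", "limit": 1}
    )
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=10) as response:
        payload = json.load(response)
    return extract_info(payload)


# Offline check with a hand-written payload (not real API output)
sample = {"docs": [{"subject": ["Fiction", "Boys"], "person": ["Tom Sawyer"]}]}
subjects, characters = extract_info(sample)
print(subjects)    # ['Fiction', 'Boys']
print(characters)  # ['Tom Sawyer']
```

Returning empty lists when nothing is found keeps the later `set(...)` calls working even for unknown titles.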

## Web Scraping with *BeautifulSoup*

In [None]:
import urllib.request, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
def scrape_books(url):
    ...  # should return (titles, ratings, prices) for the books listed at url

In [None]:
start_url = 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html'
rp_webs = robotparser.RobotFileParser()
# robots.txt lives at the site root, not under the page URL
rp_webs.set_url(urllib.parse.urljoin(start_url, '/robots.txt'))
rp_webs.read()
print(rp_webs.can_fetch('*', start_url))

In [None]:
# Response Headers

In [None]:
# Some General Headers

In [None]:
titles, ratings, prices = scrape_books(start_url)
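
The body of `scrape_books` is not shown above. A possible sketch follows, assuming the listing markup used by books.toscrape.com: each book in an `article.product_pod` element, the title in the `title` attribute of the `h3` link, the rating as a word in the `star-rating` class, the price in `p.price_color`, and pagination through an `li.next` link. Converting ratings to numbers is a choice made here, not something the original specifies. `parse_books` is a hypothetical helper that lets the parsing be tried on a hand-written fragment without network access.

```python
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

# Word-to-number mapping for the "star-rating <Word>" CSS class
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_books(html):
    """Return (titles, ratings, prices, next_href) parsed from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    titles, ratings, prices = [], [], []
    for book in soup.select("article.product_pod"):
        titles.append(book.h3.a["title"])
        # The class list looks like ["star-rating", "Three"]
        rating_classes = book.select_one("p.star-rating")["class"]
        word = next((c for c in rating_classes if c in RATING_WORDS), None)
        ratings.append(RATING_WORDS.get(word))
        prices.append(book.select_one("p.price_color").get_text(strip=True))
    next_link = soup.select_one("li.next a")
    next_href = next_link["href"] if next_link else None
    return titles, ratings, prices, next_href


def scrape_books(url):
    """Follow "next" links from url and collect every book. Performs network access."""
    titles, ratings, prices = [], [], []
    while url:
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read()
        page_titles, page_ratings, page_prices, next_href = parse_books(html)
        titles += page_titles
        ratings += page_ratings
        prices += page_prices
        # Relative "next" links are resolved against the current page URL
        url = urllib.parse.urljoin(url, next_href) if next_href else None
    return titles, ratings, prices


# Offline check on a hand-written fragment that mimics the assumed markup
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="x.html" title="Sample Book">Sample ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
sample_result = parse_books(sample_html)
print(sample_result)  # (['Sample Book'], [3], ['£51.77'], None)
```

Reading the title from the `title` attribute rather than the link text avoids the truncated titles that the listing page displays.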

In [None]:
data = {'Title': titles, 'Rating': ratings, 'Price': prices}
df = pd.DataFrame(data)

In [None]:
df['Price'] = df['Price'].str.replace('£', '')
print(df)

In [None]:
df.to_csv('books_data.csv', index=False)