<h1> Web Scraping with Python Workshop Resources </h1>

Web scraping is an automated method of collecting large amounts of data from websites.
Web scraping is used for:
- Price monitoring
- Market research
- News monitoring
- Sentiment analysis
- Email marketing
- Collecting data for machine learning and deep learning models

Data can be collected from a website either through an API or by parsing its web pages.
There are various Python libraries for fetching and parsing web pages, including BeautifulSoup, Scrapy, Selenium, ZenRows and Playwright.

### Legal and Ethical considerations for web scraping
Just because data is available on the internet doesn't mean it can or should be scraped.
Always check what permissions apply: what data can be scraped, how much can be scraped, and how you are allowed to use it.
A robots.txt file specifies which pages of a website can and can't be accessed for crawling and scraping.
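
As a rough sketch of how such rules are evaluated, the standard-library `urllib.robotparser` can parse robots.txt rules supplied as a list of lines. The rules and the `example.com` URLs below are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules directly instead of downloading them

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL
allowed = parser.can_fetch("*", "https://example.com/catalogue/index.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)  # True False
```

A robots.txt file only covers crawler access; a site's terms of use may add further restrictions.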

## Regular Expressions

A regular expression (RegEx) is a sequence of characters that forms a search pattern. It is used to check whether a string contains a specified pattern.
In web scraping, regular expressions are used to pick out specific patterns of information.

Function | Description
---------|------------
re.findall(pattern, string) | Returns a list containing all matches
re.search(pattern, string) | Returns a Match object for the first match anywhere in the string, or None if there is no match
re.split(pattern, string, maxsplit=0) | Returns a list where the string has been split at each match (at most maxsplit splits if given)
re.sub(pattern, repl, string, count=0) | Replaces matches with a string (at most count replacements if given)
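
The optional limits in the table are easiest to see on a small input. The following is a minimal sketch; the sample text is made up for illustration.

```python
import re

text = "one, two, three, four"

# maxsplit=1 stops after the first split and leaves the rest of the string intact
parts = re.split(r",\s*", text, maxsplit=1)
print(parts)     # ['one', 'two, three, four']

# count=2 replaces only the first two matches
replaced = re.sub(r",", ";", text, count=2)
print(replaced)  # one; two; three, four
```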


In [None]:
import re

string = "Stop the spinning top."

In [None]:
f = re.findall('top', string)
print(f)

In [None]:
words = re.split(' ', string)
print(words)

In [None]:
newstring = re.sub('top', 'wheel', string)
print(newstring)

**Match Object**

The Match object has information about the search and result.
- .span() returns a tuple with the start and end position of the match.
- .string returns the string passed to the search function.
- .group() returns the part of the string where the match was found



In [None]:
string = "Stop the spinning top."

x = re.search(r'spin', string)
print(x)    # x = None if the pattern is not found
print("span:", x.span())
print("string:", x.string)
print("group:", x.group())
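
Because `re.search` returns None when nothing matches, calling `.span()` on the result fails in that case. One way to guard against it is sketched below; `find_span` is a hypothetical helper name used only for this example.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

found = find_span(r"spin", "Stop the spinning top.")
missing = find_span(r"wheel", "Stop the spinning top.")
print(found)    # (9, 13)
print(missing)  # None
```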

| Symbol | Description |
|---|---|
| ^ | Matches at the beginning of a line |
| $ | Matches at the end of the line |
| . | Matches any character except a newline |
| \s | Matches whitespace |
| \S | Matches any non-whitespace character |
| * | Zero or more occurrences |
| + | One or more occurrences |
| ? | Zero or one occurrences |
| [aeiou] | Matches a single character in the listed set |
| [^XYZ] | Matches a single character not in the listed set |
| [a-z0-9] | The set of characters can include a range |
| ( | Indicates where string extraction is to start |
| ) | Indicates where string extraction is to end |
| \b | Returns a match where the specified characters are at the beginning or at the end of a word |
| \B | Returns a match where the specified characters are present, but NOT at the beginning or at the end of a word |
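
The word-boundary symbols and the extraction parentheses are not used in the cells that follow, so here is a small sketch of how they behave. The sample sentences are made up for illustration.

```python
import re

sentence = "A cat sat near the catalog of concatenated files."

# \b requires "cat" to start at a word boundary
starts = re.findall(r"\bcat\w*", sentence)
print(starts)   # ['cat', 'catalog']

# \B on both sides requires "cat" to sit strictly inside a word
inside = re.findall(r"\w*\Bcat\B\w*", sentence)
print(inside)   # ['concatenated']

# Parentheses mark the part to extract: findall returns only the group
prices = re.findall(r"price: (\d+)", "price: 25, price: 40")
print(prices)   # ['25', '40']
```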

In [None]:
txt1 = 'the cat sits on the cot.'

ex1 = re.findall('c.t', txt1)
print(ex1)

In [None]:
ex2 = re.findall('the', txt1)
print(ex2)

In [None]:
ex3 = re.search('^the', txt1)
if ex3:
    print('present')
else:
    print('not present')

In [None]:
ex4 = re.split(r'\s', txt1)  # raw string so that \s is not treated as a string escape
print(ex4)

In [None]:
txt2 = 'THE CAT EATS A CARROT IN THE CT SCAN.'

ex5 = re.findall('C.?T', txt2)
print('C.?T:', ex5)

C.?T: ['CAT', 'CT']


In [None]:
ex6 = re.findall('C.*T', txt2)
print('C.*T:', ex6)

ex7 = re.findall('C.+T', txt2)
print('C.+T:', ex7)

C.*T: ['CAT EATS A CARROT IN THE CT']
C.+T: ['CAT EATS A CARROT IN THE CT']

In [None]:
ex8 = re.findall(r'C.*?T', txt2)
print(ex8)

ex9 = re.findall(r'C.+?T', txt2)
print(ex9)

In [None]:
txt3 = """The majestic mountains towered above the quaint little village nestled in the valley, while the river meandered
gently through the lush greenery, creating a serene and picturesque landscape."""

vowels = re.findall('[aeiouAEIOU]', txt3)
print(len(vowels))
consonants = re.findall(r'[^aeiouAEIOU0-9\s\W_]', txt3)  # raw string avoids invalid-escape warnings
print(len(consonants))

## HTTP Libraries for Python
In order to scrape data from websites, first a connection must be established with the website.

**urllib** <br>
urllib is a package that is part of the Python Standard Library. It has several modules for working with URLs like:
- urllib.request for opening and reading URLs
- urllib.error containing the exceptions raised by urllib.request
- urllib.parse for parsing URLs
- urllib.robotparser for parsing robots.txt files <br><br>
It is suitable for simple tasks but lacks many features that are required for more complex web interactions.
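
As a small illustration of the `urllib.parse` module, the sketch below splits a URL into its parts, resolves relative links, and builds a query string. The URL is the one scraped later in this notebook; the relative link `page-2.html` and the query value are assumptions chosen for the example.

```python
from urllib.parse import urlencode, urljoin, urlparse

page_url = "https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html"

# urlparse splits a URL into scheme, host, path and so on
parts = urlparse(page_url)
print(parts.scheme, parts.netloc)  # https books.toscrape.com

# A relative link resolves against the page it appears on
next_page = urljoin(page_url, "page-2.html")
print(next_page)

# A link with a leading slash resolves against the site root
robots_url = urljoin(page_url, "/robots.txt")
print(robots_url)  # https://books.toscrape.com/robots.txt

# urlencode builds a query string and escapes spaces and symbols
query = urlencode({"title": "tom sawyer"})
print(query)  # title=tom+sawyer
```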

**urllib3** <br>
urllib3 is a powerful, user-friendly third-party HTTP client for Python, which includes additional features like:
- Thread safety.
- Connection pooling.
- Client-side TLS/SSL verification.
- File uploads with multipart encoding.
- Helpers for retrying requests and dealing with HTTP redirects.
- Support for gzip, deflate, brotli, and zstd encoding.
- Proxy support for HTTP and SOCKS.
- 100% test coverage.

**Requests** <br>
Requests is a simple and elegant HTTP library for Python, built on top of urllib3.

## Understanding HTTP Headers
An HTTP header gives additional information (metadata) about the request or response. <br>
There are three types of headers: General, Response and Request headers.
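
To see what a request header looks like before anything is sent, a request object can be built and inspected offline. This is a minimal sketch using the standard library; the `User-Agent` value is a made-up example, not a required or recommended string.

```python
import urllib.request

# Build a request object without sending it
req = urllib.request.Request(
    "https://openlibrary.org/search/authors.json?q=twain",
    headers={"User-Agent": "workshop-scraper/0.1"},  # hypothetical identifier
)

# The method defaults to GET because no request body is attached
method = req.get_method()

# urllib stores header names capitalized, so the key becomes "User-agent"
user_agent = req.get_header("User-agent")

print(method)              # GET
print(req.header_items())  # [('User-agent', 'workshop-scraper/0.1')]
```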

## Web Scraping with an API


In [None]:
import requests
import json

In [None]:
url = "https://openlibrary.org/search/authors.json?q=twain"
response = requests.get(url)

print('Headers:')
print(response.headers)

Headers:
{'Server': 'nginx/1.18.0 (Ubuntu)', 'Date': 'Sat, 16 Mar 2024 06:51:25 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'access-control-allow-origin': '*', 'access-control-allow-method': 'GET, OPTIONS', 'access-control-max-age': '86400', 'x-ol-stats': '"SR 1 0.462 TT 0 0.480"', 'Referrer-Policy': 'no-referrer-when-downgrade'}


In [None]:
# View response headers
print("Response Headers:")
for header, value in response.headers.items():
    print(f"{header}: {value}")

print("\n")

# View request headers
print("Request Headers:")
for header, value in response.request.headers.items():
    print(f"{header}: {value}")

In [None]:
# View General Headers

# Accessing Request URL
request_url = response.url
print("Request URL:", request_url)

# Accessing Request Method
request_method = response.request.method
print("Request Method:", request_method)

# Accessing Status Code
status_code = response.status_code
print("Status Code:", status_code)

# Accessing Referrer Policy
referrer_policy = response.headers.get('Referrer-Policy')
print("Referrer Policy:", referrer_policy)

In [None]:
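from urllib import robotparser
from urllib.parse import urljoin

# robots.txt lives at the site root, so resolve it against the URL rather than appending to it
rp_api = robotparser.RobotFileParser()
rp_api.set_url(urljoin(url, "/robots.txt"))
rp_api.read()
print(rp_api.can_fetch('*', url))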

True


In [None]:
def get_info(book_name):
    # Let requests URL-encode the title instead of replacing spaces by hand
    resp = requests.get("https://openlibrary.org/search.json", params={"title": book_name.lower()})
    resp.raise_for_status()  # stop early on an HTTP error
    info = resp.json()

    # Use the first search result; fields can be missing, so fall back to defaults
    book = info['docs'][0]
    author_name = book.get('author_name', ['unknown'])[0]
    publishing_year = book.get('first_publish_year', 'unknown')
    avg_rating = book.get('ratings_average', 'unknown')
    subjects = book.get('subject', [])
    people = book.get('person', [])
    print(f"Author of '{book_name}' is {author_name}. It was first published in {publishing_year}. It has an average rating of {avg_rating}")

    return subjects, people

In [None]:
book1 = input('Enter 1st book name: ')
book2 = input('Enter 2nd book name: ')

In [None]:
sub1, chars1 = get_info(book1)
sub1 = set(sub1)
chars1 = set(chars1)
sub2, chars2 = get_info(book2)
sub2 = set(sub2)
chars2 = set(chars2)

common_sub = list(sub1 & sub2)
print('Common subjects are:',common_sub)
common_chars = list(chars1 & chars2)
print('Common characters are:',common_chars)

## Web Scraping with *BeautifulSoup*

In [None]:
import urllib.request, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
def scrape_books(url):
    # Lists to save Titles, Ratings and Prices
    Title = []
    Rating = []
    Price = []

    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    books = soup.find('ol', class_='row').find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

    for book in books:
        title = book.find('h3').find('a').get('title')
        print("Title:", title)
        Title.append(title)

        star = book.find('p', class_='star-rating')
        rating = star['class'][1]
        print("Rating:", rating)
        Rating.append(rating)

        price = book.find('p', class_="price_color").text
        print("Price:", price)
        Price.append(price)

        print("----------")

    # Follow the "next" link if the page has one (a page without a pager has none)
    pager = soup.find('ul', class_='pager')
    next_page_tag = pager.find('li', class_='next') if pager else None
    if next_page_tag:
        next_url = next_page_tag.find('a')['href']
        # urljoin resolves the relative link against the current page URL
        next_page_url = urllib.parse.urljoin(url, next_url)
        next_page_data = scrape_books(next_page_url)
        Title.extend(next_page_data[0])
        Rating.extend(next_page_data[1])
        Price.extend(next_page_data[2])

    return Title, Rating, Price

In [None]:
start_url = 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html'
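rp_webs = robotparser.RobotFileParser()
# Resolve robots.txt against the site root instead of appending it to the page URL
rp_webs.set_url(urllib.parse.urljoin(start_url, '/robots.txt'))
rp_webs.read()
print(rp_webs.can_fetch('*', start_url))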

True


In [None]:
req = urllib.request.Request(start_url)
response = urllib.request.urlopen(req)

# View response headers
print('\nResponse Headers:')
print(response.headers)


Response Headers:
Date: Sat, 16 Mar 2024 07:24:27 GMT
Content-Type: text/html
Content-Length: 50109
Connection: close
Last-Modified: Wed, 08 Feb 2023 21:02:32 GMT
ETag: "63e40de8-c3bd"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=0; includeSubDomains; preload




In [None]:
# Some General Headers
print('final url:', response.geturl())
print('status code:', response.getcode())

final url: https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
status code: 200


In [None]:
titles, ratings, prices = scrape_books(start_url)

Title: Tipping the Velvet
Rating: One
Price: £53.74
----------
Title: Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton
Rating: Three
Price: £29.69
----------
Title: A Flight of Arrows (The Pathfinders #2)
Rating: Five
Price: £55.53
----------
Title: The House by the Lake
Rating: One
Price: £36.95
----------
Title: Mrs. Houdini
Rating: Five
Price: £30.25
----------
Title: The Marriage of Opposites
Rating: Four
Price: £28.08
----------
Title: Glory over Everything: Beyond The Kitchen House
Rating: Three
Price: £45.84
----------
Title: Love, Lies and Spies
Rating: Two
Price: £20.55
----------
Title: A Paris Apartment
Rating: Four
Price: £39.01
----------
Title: Lilac Girls
Rating: Two
Price: £17.28
----------
Title: The Constant Princess (The Tudor Court #1)
Rating: Three
Price: £16.62
----------
Title: The Invention of Wings
Rating: One
Price: £37.34
----------
Title: World Without End (The Pillars of the Earth #2)
Rating: Four
Price: £32.97
----------
Title: The

In [None]:
data = {'Title': titles, 'Rating': ratings, 'Price': prices}
df = pd.DataFrame(data)

In [None]:
df['Price'] = df['Price'].str.replace('£', '')
print(df)

                                                Title Rating  Price
0                                  Tipping the Velvet    One  53.74
1   Forever and Forever: The Courtship of Henry Lo...  Three  29.69
2             A Flight of Arrows (The Pathfinders #2)   Five  55.53
3                               The House by the Lake    One  36.95
4                                        Mrs. Houdini   Five  30.25
5                           The Marriage of Opposites   Four  28.08
6     Glory over Everything: Beyond The Kitchen House  Three  45.84
7                                Love, Lies and Spies    Two  20.55
8                                   A Paris Apartment   Four  39.01
9                                         Lilac Girls    Two  17.28
10         The Constant Princess (The Tudor Court #1)  Three  16.62
11                             The Invention of Wings    One  37.34
12    World Without End (The Pillars of the Earth #2)   Four  32.97
13                              The Passion of D

In [None]:
df.to_csv('books_data.csv', index=False)