# Web Scraping

## What is Web Scraping?

- "The web": a collection of files hosted on a large network of 
communicating servers.
- *Web scraping*: the act of accessing those files and programmatically saving them, or parts of them, to a chosen location (usually your computer). This is often a critical task in projects that require data from the internet.



HTML (HyperText Markup Language): the markup language in which web pages are written, often described as the fabric of the web.


Nearly all of the things that you 
would normally think of as "webpages" are really files 
written in HTML. A browser like Firefox, Chrome, or Safari is
just a program for *rendering* HTML in an attractive visual 
format. 

HTTP (Hypertext Transfer Protocol): the protocol that browsers and other clients use to request and receive HTML files (and other resources) from servers.

- Unfortunately, for scraping, we often need to interact with raw HTML, which can get messy. 
- Fortunately, web scraping packages like `beautifulsoup` or `scrapy` give us tools for doing this.
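
To make the difference between raw HTML and the text it contains concrete, here is a minimal sketch. The HTML string is a made-up example rather than a real page, and `"html.parser"` is Python's built-in parser; the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (illustrative only).
raw_html = """
<html>
  <body>
    <h1>My page</h1>
    <p>Hello, <b>world</b>!</p>
  </body>
</html>
"""

# Parse the raw HTML with Python's built-in parser.
demo_soup = BeautifulSoup(raw_html, "html.parser")

# The raw string still contains all the tags ...
print(raw_html)

# ... while BeautifulSoup lets us pull out just the text we want.
heading = demo_soup.h1.get_text()
paragraph = demo_soup.p.get_text()
print(heading)
print(paragraph)
```

A browser does something similar before rendering: it parses the tags into a tree and then displays the content.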


Resources:

- `pd.read_html`, which looks for tables in a webpage: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

- `requests`, a Python interface for making HTTP requests: https://requests.readthedocs.io/en/latest/

- Introduction to HTML: https://www.w3schools.com/html/html_intro.asp

- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Let's take a quick look at the tutorial website we'll scrape from. 

http://quotes.toscrape.com/

We observe that there are a number of quotes, each with 
text, an author, and tags. There are multiple pages of 
these quotes, which are accessed via the "Next" button. 

For now, let's just try to obtain the text on the webpage. 

In [None]:
import requests
link = "http://quotes.toscrape.com/"
data = requests.get(link).text

In [None]:
print(data)

In [None]:
from bs4 import BeautifulSoup

The `BeautifulSoup` type is the basic type for parsing a webpage.

In [None]:
def link2soup(link):
    """Convert a link to a BeautifulSoup object."""
    data = requests.get(link).text
    return BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning

In [None]:
type(requests.get(link))

In [None]:
soup = link2soup(link)

In [None]:
type(soup)

## CSS Selectors

CSS (Cascading Style Sheets) is a language for styling web pages. It is designed to apply formatting to certain parts of a webpage. How do we select "certain parts"? That is what CSS selectors are for. 


- CSS selector references: https://www.w3schools.com/cssref/css_selectors.php
- a fun activity: https://flukeout.github.io/


__Note__: We are intentionally taking a route using the CSS selectors and `.select()` methods, as this is also useful for other web scraping tools. There are interfaces that might be simpler for you to use in BeautifulSoup -- that will be covered by Alex. 
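
Before applying selectors to the real page, here is a small sketch of the most common selector forms. The HTML below is made up to mimic the structure of a single quote block, so the names and values in it are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics one quote block (illustrative only).
selector_html = """
<div class="quote">
  <span class="text">A made-up quote.</span>
  <small class="author">Jane Doe</small>
  <div class="tags">
    <a class="tag" href="/tag/a/">a</a>
    <a class="tag" href="/tag/b/">b</a>
  </div>
</div>
<a href="/about/">About</a>
"""
selector_demo = BeautifulSoup(selector_html, "html.parser")

all_links = selector_demo.select("a")           # tag selector: every <a> element
tag_links = selector_demo.select("a.tag")       # tag + class: <a> elements of class "tag"
by_class = selector_demo.select(".author")      # class selector: any element of class "author"
nested = selector_demo.select("div.tags a")     # descendant: <a> elements inside <div class="tags">
first_author = selector_demo.select_one("small.author")  # first match, or None if nothing matches

print(len(all_links), len(tag_links), len(nested))
print(first_author.get_text())
```

`.select()` always returns a list (possibly empty), which is why the cells in this section index with `[0]`; `.select_one()` is a shortcut when only the first match is needed.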


A quick snippet to parse the text, author name, and list of tags:



In [None]:
soup.select("small.author")[0].get_text()

In [None]:
l = []

for t in soup.select("div.quote"): # div element of class quote
    text = t.select("span.text")[0].get_text()
    author = t.select("small.author")[0].get_text()
    tags = t.select("div.tags a.tag") # "a element of class tag" inside "div element of class tags"
    tags = [x.get_text() for x in tags]
    l.append((text, author, tags))        

In [None]:
l

### Following the links

At the bottom of each page, there is a "next" button. Can we follow the link?

In [None]:
from urllib.parse import urljoin

next_button = soup.select(".next a")[0] # an <a> element inside an element of class "next"
next_url = urljoin(link, next_button.attrs["href"]) # resolve the href against the base URL

In [None]:
next_url
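
When building the next URL, joining a base URL and an `href` by plain string concatenation can produce a doubled slash if the base ends with `/` and the `href` starts with `/`. The standard library's `urllib.parse.urljoin` resolves the `href` against the base instead. A minimal sketch, assuming the "Next" link's `href` has the form `/page/2/`:

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/"
href = "/page/2/"  # assumed form of the "Next" link's href

naive = base + href           # plain concatenation keeps both slashes
joined = urljoin(base, href)  # resolves the href relative to the base URL

print(naive)
print(joined)
```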

In [None]:
next_soup = link2soup(next_url)

In [None]:
def parse_page(l, soup, base_url):
    """
    Parses the quotes in a page, appending a tuple of (text, author, tags) for each quote to l. 
    Then returns the URL to the next page. If the "next" button is not found, return None.
    """
    pass


__Exercise__: Can we continue on and parse all the quotes on that website?

__iClicker poll__: How many quotes are scraped?

In [None]:
base_url = "http://quotes.toscrape.com/"
l = []
soup = link2soup(base_url)
while True:
    pass

In [None]:
len(l)
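
One possible shape for the pagination loop is sketched below. To keep it runnable offline, it uses a made-up two-page "site" stored in a dictionary; `FAKE_SITE`, `fake_link2soup`, and `parse_page_sketch` are hypothetical names introduced for this illustration, and the HTML only imitates the structure of the real pages. This is a sketch under those assumptions, not the reference solution to the exercise.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# A made-up two-page "site" held in memory, so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": """
        <div class="quote"><span class="text">Quote one</span>
          <small class="author">Author A</small>
          <div class="tags"><a class="tag">x</a><a class="tag">y</a></div></div>
        <li class="next"><a href="/page/2/">Next</a></li>
    """,
    "http://example.test/page/2/": """
        <div class="quote"><span class="text">Quote two</span>
          <small class="author">Author B</small>
          <div class="tags"></div></div>
    """,
}

def fake_link2soup(link):
    """Stand-in for link2soup that reads from FAKE_SITE instead of the network."""
    return BeautifulSoup(FAKE_SITE[link], "html.parser")

def parse_page_sketch(quotes, soup, base_url):
    """Append (text, author, tags) for each quote; return the next URL or None."""
    for q in soup.select("div.quote"):
        text = q.select_one("span.text").get_text()
        author = q.select_one("small.author").get_text()
        tags = [a.get_text() for a in q.select("div.tags a.tag")]
        quotes.append((text, author, tags))
    next_link = soup.select_one(".next a")
    if next_link is None:  # last page: no "Next" button
        return None
    return urljoin(base_url, next_link["href"])

demo_base = "http://example.test/"
demo_quotes = []
demo_url = demo_base
while demo_url is not None:  # stop once a page has no "Next" link
    demo_url = parse_page_sketch(demo_quotes, fake_link2soup(demo_url), demo_base)

print(demo_quotes)
```

The loop condition relies on the parsing function returning `None` on the last page, which matches the contract described in the `parse_page` docstring.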

## Example: The 25 most popular feature films released in 2023

Can be accessed at: https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=25

__Note__: If you attempt to load more than 25 films, you will run into trouble, because this webpage uses JavaScript pagination. In such cases, you can use the `selenium` package to automate web browser interaction from Python.

In [None]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=25"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'} 
# Send a browser-like User-Agent so the request looks like it comes from a user, not a robot.
data = requests.get(url, headers=headers).text
soup = BeautifulSoup(data, "html.parser")

In [None]:
soup;

Suppose we want to scrape the following 6 features from this page:
- Rank (popularity)
- Title
- Description
- Runtime
- User rating
- Metascore

### Rank and title

In [None]:
title_texts = [x.get_text() for x in soup.select('.ipc-title-link-wrapper .ipc-title__text')]

In [None]:
title_texts

In [None]:
import re # the Python regular expression module

In [None]:
rank_data = [int(re.search('^[0-9]+', x).group(0)) for x in title_texts]
rank_data;

In [None]:
title_data = [x[re.search(r'^[0-9]+\. ', x).end():] for x in title_texts] # escape the dot so it matches a literal "."
title_data
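
The rank and the title can also be captured in a single pass with one regular expression that has two groups. The sketch below uses made-up strings in the assumed `"<rank>. <title>"` format rather than scraped data.

```python
import re

# Made-up strings in the assumed "<rank>. <title>" format (illustrative only).
samples = ["1. Some Movie", "12. Another Film: Part 2", "100. 3 Days Later"]

# Group 1: leading digits (the rank). Group 2: everything after ". " (the title).
pattern = re.compile(r"^(\d+)\.\s+(.*)$")

ranks, titles = [], []
for s in samples:
    m = pattern.match(s)
    if m is None:  # skip strings that do not follow the format
        continue
    ranks.append(int(m.group(1)))
    titles.append(m.group(2))

print(ranks)
print(titles)
```

Because the pattern is anchored at the start and the dot is escaped, digits that appear later in a title are left untouched.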

### Descriptions

In [None]:
description_data = [x.get_text() for x in soup.select('.ipc-html-content-inner-div')]
description_data

### Runtimes

In [None]:
runtime_data = [x.get_text() for x in soup.select('.dli-title-metadata-item:nth-child(2)')]
runtime_data[0]

In [None]:
runtime_hr = [int(re.search("\\d+(?=h)", x).group(0)) if re.search("\\d+(?=h)", x) else 0 for x in runtime_data]

In [None]:
runtime_min = [int(re.search("\\d+(?=m)", x).group(0)) if re.search("\\d+(?=m)", x) else 0 for x in runtime_data ]

In [None]:
runtime_data = [h * 60 + m for h, m in zip(runtime_hr, runtime_min)] # total runtime in minutes

In [None]:
runtime_data
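
The hour and minute extraction can be wrapped in a small helper. This is a sketch that assumes runtime strings look like `"2h 30m"`, `"3h"`, or `"45m"`; `runtime_to_minutes` is a hypothetical name, and the sample strings are made up.

```python
import re

def runtime_to_minutes(s):
    """Convert strings like '2h 30m', '3h', or '45m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", s)    # digits directly before an "h"
    minutes = re.search(r"(\d+)\s*m", s)  # digits directly before an "m"
    h = int(hours.group(1)) if hours else 0
    m = int(minutes.group(1)) if minutes else 0
    return 60 * h + m

sample_runtimes = ["2h 30m", "3h", "45m"]  # made-up examples
converted = [runtime_to_minutes(s) for s in sample_runtimes]
print(converted)
```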

### User rating

In [None]:
userrating_data = [x.get_text() for x in soup.select('.ratingGroup--imdb-rating')]

In [None]:
userrating_data[0]

In [None]:
userrating_data = [float(x.split('\xa0')[0]) for x in userrating_data]

In [None]:
userrating_data
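
The character `'\xa0'` is a non-breaking space, which looks like an ordinary space when printed but does not match `' '`. A minimal sketch of the split, using a made-up string in the assumed `"<rating>\xa0(<votes>)"` format:

```python
raw = "7.8\xa0(123K)"  # made-up example in the assumed format

# partition splits at the first non-breaking space into (before, separator, after).
rating_str, _, votes_str = raw.partition("\xa0")

rating = float(rating_str)
votes = votes_str.strip("()")  # drop the surrounding parentheses

print(rating, votes)
```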

### Metascore

In [None]:
metascore_data = [float(x.get_text()) for x in soup.select('.metacritic-score-box')]

In [None]:
len(metascore_data)

In [None]:
metascore_data
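
If some films have no Metascore, a page-wide `.select()` returns fewer scores than there are films, and the scores no longer line up with the titles. One way to keep them aligned is to select within each film's container and record `None` when the score is missing. The sketch below uses made-up HTML; the `film` and `title` class names are hypothetical and are not the class names used on IMDb.

```python
from bs4 import BeautifulSoup

# Made-up listing: the second film has no Metascore. The "film" class is hypothetical.
films_html = """
<li class="film"><span class="title">Film A</span><span class="metacritic-score-box">81</span></li>
<li class="film"><span class="title">Film B</span></li>
<li class="film"><span class="title">Film C</span><span class="metacritic-score-box">64</span></li>
"""
films_demo = BeautifulSoup(films_html, "html.parser")

# Page-wide select: only two scores come back for three films.
flat_scores = [float(x.get_text()) for x in films_demo.select(".metacritic-score-box")]

# Per-film select: one entry per film, with None where the score is missing.
aligned_scores = []
for film in films_demo.select("li.film"):
    box = film.select_one(".metacritic-score-box")
    aligned_scores.append(float(box.get_text()) if box is not None else None)

print(flat_scores)
print(aligned_scores)
```

Lists built this way all have the same length, which matters when they are later combined into a single table.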

__Exercise__: Movie ratings (R, PG, PG-13, etc.) are listed on IMDb, and you can access them with the CSS selector `'.dli-title-metadata-item:nth-child(3)'`. What does this CSS selector mean? How would you obtain the ratings? How many of the films are rated R?

### Visualizing the data

In [None]:
import pandas as pd
df = pd.DataFrame(data={
    "poprank": rank_data,
    "title": title_data,
    "description": description_data,
    "runtime": runtime_data,
    "userrating": userrating_data,
    "metascore": metascore_data,
})

In [None]:
df

In [None]:
from plotly import express as px

In [None]:
fig = px.scatter(df, 
                 x = "userrating",
                 y = "metascore",
                 hover_name = "title",
                 height = 500,
                 trendline="lowess"
)
fig.show()