# Class 5: Scraping Web Data 1 - BeautifulSoup & HTML


## How do websites work?

### What is the "backend" of a website?
The backend of a website has a bunch of moving parts. Content probably lives in a database (or, more likely these days, several databases). There are servers (big computers) with functions that are responsible for getting content from the databases and sending it, in a machine-readable format, to the frontend. This functionality is called an API, or *Application Programming Interface*. It is (generally speaking) how computers request and get data. 
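
To make "machine-readable" a bit more concrete, here is a small sketch of what an API response might look like and how a program could read it. The JSON payload and its field names (`headline`, `author`, `tags`) are made up for illustration and do not come from any real website.

```python
import json

# A made-up API response body (hypothetical fields, not from a real site).
api_response_body = """
{
  "articles": [
    {"headline": "Whale spotted off Cape Cod", "author": "A. Reporter", "tags": ["whales", "ocean"]},
    {"headline": "Local bakery wins award", "author": "B. Writer", "tags": ["food"]}
  ]
}
"""

# json.loads turns the text into ordinary Python dicts and lists.
data = json.loads(api_response_body)

# Pull out just the headlines; a frontend would do something similar
# before laying the content out for a human reader.
headlines = [article["headline"] for article in data["articles"]]
for headline in headlines:
    print(headline)
```

This format is easy for a program to work with, but not something you would want to read directly.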

### What is the "frontend" of a website?
The frontend of a website is where you, the human user, see the content. The frontend takes the content sent from the server via the API and presents it in a readable format. This is what you see as "the website." Websites typically display content using a combination of HTML (HyperText Markup Language) and JavaScript, which handles the interactive features.

#### HTML
HTML is a markup language that people use to make webpages. It consists of *elements* that can be nested within each other. Elements are indicated with *tags*; typically there are open tags (`<p>`) and close tags (`</p>`) that surround the contents of an element. Browsers read HTML and use the tags to figure out how to display content; you don't see the raw HTML when you use a browser (though you can do so using the "inspect element" feature). Here is an example of a (very bare-bones) HTML file:
```
<!DOCTYPE html>
<html>
<head>This is a header!</head>
<body>

<h1>This is a heading!</h1>

<p>And this is a paragraph</p>

</body>
</html>
```
Let's see what this looks like in our notebook!

```python
# display() renders an object in the notebook; HTML() marks a string as HTML
from IPython.display import display, HTML

my_html_string = """
<!DOCTYPE html>
<html>
<head>This is a header!</head>
<body>

<h1>This is a heading!</h1>

<p>And this is a paragraph</p>

</body>
</html>
"""
display(HTML(my_html_string))
```

Here's an example with a link and an image:

```python
now_with_link = """
<!DOCTYPE html>
<html>
<head>This has a link!</head>
<body>
<p>
<a href="https://northeastern.edu">This is a link</a>
</p>
<img src="images/whale.jpg" alt="this is a whale" width=200 height=200>
</body>
</html>
"""
display(HTML(now_with_link))
```

Obviously most websites are fancier than that, but at their core, when you visit them, HTML is being generated -- and you can read that HTML with a program instead of through your browser.
The act of reading web pages programmatically instead of via a conventional browser is called *scraping*, and it's not super hard to do!
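
As a rough sketch of what "looking at HTML with a program" means, the example below pulls the links out of an HTML string using only Python's built-in `html.parser` module. The class name `LinkCollector` and the sample HTML are our own illustration; BeautifulSoup, which this class is named for, offers a friendlier interface for the same kind of task.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # href of the <a> we are inside, if any
        self._current_text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Called for text between tags; keep it only while inside a link.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        # Called for every closing tag; a closing </a> finishes the link.
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


sample_html = """
<html>
<body>
<p><a href="https://northeastern.edu">This is a link</a></p>
<p>No link here.</p>
</body>
</html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

The parser walks through the tags in order, which is the same nested structure a browser reads when it draws the page.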

## Ways to access a website
### Visiting the website via a browser 
Pros: 
* Does not require that much specialized knowledge.
* Is how you're generally encouraged to use websites.
* Easy to understand what you're looking at.

Cons:
* Does not scale well (if you're trying to look at 5000 webpages, this is not a good approach)

### Using a website's API
Pros:
* Much faster
* Scales better
* Output is easily machine-readable

Cons:
* The API exists because the website's owner allows it to exist (see: Twitter/X). 
* Might cost money
* Might have rate limits
* Output is not easy to read if you are a human

### Scraping a website
Pros:
* Also scales pretty well
* Does not require the goodwill of a website's owner
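* A US appeals court has ruled that scraping publicly accessible data does not violate federal computer fraud law (see [this article](https://techcrunch.com/2022/04/18/web-scraping-legal-court/)), though a site's terms of service may still restrict it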

Cons:
* You run the risk of getting your IP banned
* Often have to build a custom scraper for each website
* Not doable for all websites (e.g. Facebook)

## How do we scrape a website?

### First, we practice good robot citizenship via the `robots.txt` file!
https://en.wikipedia.org/wiki/Robots_exclusion_standard

http://www.robotstxt.org/robotstxt.html

- `robots.txt` is a standard used by websites to communicate with web crawlers and other web robots
- The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned
- Robots are often used by search engines to categorize web sites
- Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out

In practice,
- when a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt)
- this text file contains the instructions in a specific format
- robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site
- if this file doesn't exist, web robots assume that the site owner wishes to provide no specific instructions, and crawl the entire site.
- a robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file.
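
Since each origin has its own `robots.txt` at the root of the site, we can work out where to look from any page URL. The helper below is a minimal sketch using the standard library; the function name `robots_txt_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain and port),
    # and replace the path, query, and fragment with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/news/story.html?id=5"))
print(robots_txt_url("https://blog.example.com/posts/1"))
```

Note that the two example pages live on different subdomains, so they point to two different `robots.txt` files.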

### Let's check out the `robots.txt` for tmz.com using the `requests` package.
The `requests` package lets us make HTTP requests to websites or APIs. It gives us back a response object whose `.text` attribute holds the body of the response: HTML for a webpage, or plain text for a file like `robots.txt`.

```python
import requests
res = requests.get('https://www.tmz.com/robots.txt')
print(res.text)
```

```
Sitemap: https://www.tmz.com/sitemaps/article/index.xml
Sitemap: https://www.tmz.com/sitemaps/gallery/index.xml
Sitemap: https://www.tmz.com/sitemaps/page/index.xml
Sitemap: https://www.tmz.com/sitemaps/watch/index.xml
Sitemap: https://www.tmz.com/sitemaps/news.xml

User-agent: Googlebot-News
Disallow: /photos
Disallow: /videos

User-agent: proximic
Disallow:

User-agent: bingbot
Crawl-delay: 60

User-agent: *

Disallow: /_/
Disallow: */print
Disallow: /search
Disallow: /xid
```

Our `User-agent` falls under `*` because it is not Googlebot-News, proximic, or bingbot. This means we're not allowed to go to anything under `tmz.com/_/`, `tmz.com/search`, or `tmz.com/xid`, or to any path matching `*/print`.
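
Rather than reading the rules by eye every time, we can let Python check them. The sketch below uses the standard library's `urllib.robotparser` on a trimmed copy of the rules printed above, so it runs without a network connection. The user-agent name `class-scraper` is hypothetical. The Sitemap lines, the proximic entry, and the `*/print` rule are left out for brevity; as far as we can tell this parser matches rules as plain path prefixes, so it may not treat `*` inside a path as a wildcard.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the tmz.com rules shown above (see the note in the text
# about which lines were left out).
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /photos
Disallow: /videos

User-agent: bingbot
Crawl-delay: 60

User-agent: *
Disallow: /_/
Disallow: /search
Disallow: /xid
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so nothing is downloaded here.
parser.parse(ROBOTS_TXT.splitlines())

OUR_AGENT = "class-scraper"  # hypothetical name; falls under "User-agent: *"

# Ask whether our scraper may fetch each path.
for path in ["/", "/search", "/xid", "/photos"]:
    url = "https://www.tmz.com" + path
    print(path, parser.can_fetch(OUR_AGENT, url))

# Googlebot-News has its own rules, and bingbot has a crawl delay.
print(parser.can_fetch("Googlebot-News", "https://www.tmz.com/photos"))
print(parser.crawl_delay("bingbot"))
```

To check a live site instead, `RobotFileParser` also has `set_url()` and `read()` methods that download the file for you.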

## Actually Scraping Data
