# Module 2: Data Engineering
## Sprint 2: SQL and Data Scraping
## Subproject 3: Web scraping

During this lesson, you will learn what you need to know to start scraping the internet. You will get familiar with the structure of websites and the key elements of HTML. You will also be introduced to the `requests` and `bs4` libraries, which together enable a robust web scraping workflow that can be used to create datasets.

## Learning outcomes
- At the end of this lesson, you will be able to create a scraping strategy for a web page
- You will be able to understand the structure of an HTML document
- You will know how to extract specific information from websites
- You will know how to scrape websites to create your own datasets

---

## Why do we scrape the internet?
Data science without data is not much of a science. In many cases, you can find curated, clean datasets online (for example, on Kaggle). Sometimes, though, your project needs specific data that nobody has published. How can you collect it? You can always search the web for the information and build the dataset by hand. Unfortunately, many models require large data samples, and manually collecting 10,000 or more individual samples would be time-consuming and sometimes outright impossible. Luckily, there is a method that automates this process. It is called data scraping. The technique is widely used by software agents for different purposes. One example is Google, whose crawler visits web pages, scans them, and stores the information in its databases for later indexing.

## What is a Webpage?
Almost all web pages are text documents in `.html` format. HTML is a markup language, like the Markdown you use to write README files on GitHub. To be able to scrape information from HTML files, you first need to understand how one is built. Watch [this](https://www.youtube.com/watch?v=pQN-pnXPaVg) freecodecamp video, which explains the main concepts of creating an **HTML-based** website. Later, we will take a closer look at individual topics.

## Anatomy of an HTML element
### Nesting elements
Almost every HTML element is nested inside another HTML element, and HTML documents follow a strict structure. Here is the basic skeleton of an HTML document:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html> 
```
As you can see, there are three main parts of an HTML document:
* `<html>` - the root element that contains all other HTML content
* `<head>` - holds meta information about the page: title, scripts, stylesheets, etc.
* `<body>` - the main part of the document: all visible content is placed here
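
The nesting can be made visible with a few lines of Python. The sketch below uses the standard library's `html.parser` module (not the scraping tools covered later in this lesson) to record how deep each opening tag sits. The class name `OutlineParser` is made up for this illustration, and the code assumes every opening tag has a matching closing tag, which holds for this skeleton but not for void elements such as `<br>`.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1


html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>"""

parser = OutlineParser()
parser.feed(html_doc)

# Print each tag indented according to how deeply it is nested.
for depth, tag in parser.outline:
    print("    " * depth + tag)
```

Each tag is printed indented by its depth, so `title` appears under `head`, while `h1` and `p` appear under `body`.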

### Tags, Classes and Ids
As you will primarily be extracting information from HTML documents, let's talk about what you can find inside the `<body>` part. All information inside HTML documents is presented via [HTML tags](https://developer.mozilla.org/en-US/docs/Web/HTML/Element). There is a large number of HTML tags available, but almost all of them share the same properties: they mark out a block inside the document and they determine how the information in it is presented. For example:
```html
<div class="content">
    <h1>Text</h1>
    <h2 id="subtxt">Subtext</h2>
</div>
```
As you can see, elements are nested inside other elements, and the information itself sits inside HTML tags. HTML elements can be told apart using **class** and **id**. Classes are used to group certain elements and apply scripts and styles to them. Ids serve the same purpose, but an id is supposed to be unique within a document, although sadly this is not always the case. Why are tags and classes important to us? We can use them to select the parts of the HTML document that we want to extract information from. Let's say there is a page that lists the capitals and native languages of EU countries:

```html
<div>
    <ul class="capitals">
        <li>Vilnius</li>
        <li>Paris</li>
        <li>Riga</li>
        <li>Tallinn</li>
    </ul>
    <ul class="languages">
        <li>Lithuanian</li>
        <li>French</li>
        <li>Latvian</li>
        <li>Estonian</li>
    </ul>
</div>
```
If you select just the `li` tag, you will get everything, capitals and languages alike, but you want only the capitals. Selecting `ul` would not give the expected result either, as there are two `ul` lists in the document. What you want is `ul.capitals` - by adding the class to the selector, you pick out one distinct `ul` and can extract the items inside it.
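
As a minimal sketch of that selection, the snippet below stores the list markup from above in a string and queries it with Beautiful Soup, the library introduced later in this lesson. Its `select` method accepts CSS selectors; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The sample markup from above, stored as a string for illustration.
html_doc = """
<div>
    <ul class="capitals">
        <li>Vilnius</li>
        <li>Paris</li>
        <li>Riga</li>
        <li>Tallinn</li>
    </ul>
    <ul class="languages">
        <li>Lithuanian</li>
        <li>French</li>
        <li>Latvian</li>
        <li>Estonian</li>
    </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "li" alone matches every list item, capitals and languages alike.
all_items = [li.text for li in soup.select("li")]

# "ul.capitals li" matches only the items nested inside <ul class="capitals">.
capitals = [li.text for li in soup.select("ul.capitals li")]

print(len(all_items))
print(capitals)
```

The first selector returns all eight items, while the second returns only the four capitals.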

[Here](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML) is a full introduction to HTML made by Mozilla. You can complete all of it if you want deep knowledge of the subject. If you just want to do some web scraping, knowing the structure of an HTML file and its elements is more than enough.

## Web Scraping
As mentioned at the beginning of the lesson, web scraping is used to automate the collection of information from websites and to create datasets for later use. There are many great tools that can be used to scrape websites. The most popular ones are:
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Scrapy](https://scrapy.org/)
* [Selenium](https://www.selenium.dev/)

Scrapy is a more advanced crawling framework that needs a bit more setup. Selenium automates a real browser, which helps with pages that rely on JavaScript but makes it heavier to run. Beautiful Soup is popular and the easiest of the three to start with, so in this lesson we will be scraping websites using this tool.
First, you should watch [this freecodecamp video](https://www.youtube.com/watch?v=87Gx3U0BDlo) about Beautiful Soup that covers all the basic usage.

## Setup
Setting up Beautiful Soup is simple. Just use `pip` to install it:

In [None]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2; python_version >= "3.0"
  Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 soupsieve-2.0.1


Now we can use it to extract information from HTML elements:

In [None]:
from bs4 import BeautifulSoup

# Defining some HTML format text
html_doc = """
<html>
<head><title>Example text</title></head>
<body><p class="text">Text to extract</p></body>
</html>
"""

# Creating soup object using BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

# Finding all elements inside HTML that have p tag and text class
p_results = soup.find_all("p", class_="text")

# Extracting text value from found elements
for p_result in p_results:
    print(p_result.text)

Text to extract


As you can see from the example, we successfully found desired elements in the HTML document and extracted text values from them. This is pretty neat but the power of web scraping comes from automation and real-world examples. So now let's talk about how to extract information from real web pages. 

## Downloading pages
To extract data from an HTML document, you first need to download it. This is not difficult with the Python `requests` library. We will use the popular forum [Hacker News](https://news.ycombinator.com/) as the website to scrape information from:

In [None]:
import requests

URL = "https://news.ycombinator.com/"
page = requests.get(URL)

page

<Response [200]>

Some websites reject requests that do not appear to come from a browser. Most of the time, adding a `User-Agent` header to the request is enough:

In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://news.ycombinator.com/"
page = requests.get(URL, headers=headers)

page

<Response [200]>
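
To see why the header matters, it can help to look at what `requests` sends without it. The sketch below makes no network call: it prints the library's default `User-Agent` and then builds (but does not send) a request carrying the custom header. `requests.Request(...).prepare()` is used here only for inspection and is not part of the lesson's workflow.

```python
import requests

URL = "https://news.ycombinator.com/"
headers = {"User-Agent": "Mozilla/5.0"}

# The User-Agent that requests announces when no header is supplied.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request("GET", URL, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

The default value starts with `python-requests/`, which makes scripted traffic easy for a server to recognise, while the prepared request carries the browser-like value instead.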

## Looking for text
Now that we have downloaded a page, we can start extracting information from it. First, we need to find the exact parts of the page we want to get text from. We can do this by opening the web page and using the [inspect element](https://zapier.com/blog/inspect-element-tutorial/) functionality of the browser. Let's say we want to collect the titles of all posts on the page:
<div><img src="https://i.imgur.com/HJ8DwAN.png" width="600px"/></div>
We can see that the information we need is inside the `<a>` tag with the `storylink` class. We can use `soup` to extract it by passing the tag name and class we just found:

In [None]:
headers = {"User-Agent": "Mozilla/5.0"}

URL = "https://news.ycombinator.com/"
page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")

for title in soup.find_all("a", class_="storylink"):
    print(title.text)
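
Because a live page changes constantly, and a site can rename its classes at any time, it can help to test the extraction logic on a small saved snippet first. The sketch below uses a made-up fragment that imitates the `storylink` markup described above; the titles and URLs in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the markup seen in the inspector.
sample_html = """
<table>
  <tr><td><a href="https://example.com/first" class="storylink">First story</a></td></tr>
  <tr><td><a href="https://example.com/second" class="storylink">Second story</a></td></tr>
  <tr><td><a href="https://example.com/about">Not a story link</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

stories = []
for link in soup.find_all("a", class_="storylink"):
    # .text gives the visible text, .get("href") reads the attribute value.
    stories.append({"title": link.text, "link": link.get("href")})

print(stories)
```

Only the two links carrying the `storylink` class are collected; the third link is skipped. If the same code returns nothing on the live page, re-checking the class name in the inspector is a sensible first step.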

## Storing collected data
You can use the collected data to create pandas DataFrames. This lets you apply various data processing operations (removing corrupted information, filling in missing data). Pandas will also make your life easier when you need to export the collected information to `csv` files or insert it into databases.

In [None]:
# Run this if needed
!pip install pandas

In [None]:
import pandas as pd

page = requests.get("https://news.ycombinator.com/", headers={"User-Agent": "Mozilla/5.0"})

soup = BeautifulSoup(page.content, "html.parser")

collected_information = [{"title": title.text} for title in soup.find_all("a", class_="storylink")]

df = pd.DataFrame(collected_information)
df.head()

|   | title |
|---|---|
| 0 | Six GRU Officers Charged in Connection with Wo... |
| 1 | First Bitcoin “Mixer” Penalized for Violating ... |
| 2 | You don't need all those root certificates |
| 3 | No-till no-herbicide farming system in trial s... |
| 4 | This page is a truly naked, brutalist HTML quine |


You can see how powerful `soup` is for doing web scraping tasks! 
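
The cleaning and export steps mentioned above can be sketched as follows. The records are made up to stand in for scraped results, and the output is written to an in-memory buffer so the example leaves no file behind; the file name in the comment is only a suggestion.

```python
import io

import pandas as pd

# Made-up records standing in for scraped results; one title is missing.
collected_information = [
    {"title": "First story"},
    {"title": None},
    {"title": "Second story"},
]

df = pd.DataFrame(collected_information)

# Drop rows where the title could not be extracted.
clean_df = df.dropna(subset=["title"]).reset_index(drop=True)

# Write CSV text to an in-memory buffer; passing a file name such as
# "titles.csv" instead would write the same content to disk.
buffer = io.StringIO()
clean_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so reading the CSV back later does not produce an extra unnamed column.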

## Exercise
Now it is your time to shine: you will need to extract more information from the web page and put it inside a pandas DataFrame. Write code inside the cells so that the tests below pass.

In [None]:
# Get score of every element


In [None]:
assert type(collected_scores[0]) == dict
assert list(collected_scores[0].keys())[0] == "score"

In [None]:
# Get age of the post


In [None]:
assert type(collected_age[0]) == dict
assert list(collected_age[0].keys())[0] == "post_age"

In [None]:
# Get post's author.


In [None]:
assert type(collected_users[0]) == dict
assert list(collected_users[0].keys())[0] == "post_author"

Now that you are able to extract information from the page, you need to automate this process and extract information from multiple pages at once, storing each post's information in a `dict` appended to a list. You will need to complete a function that returns a dataframe with the columns title, link, and age.

In [73]:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def collect_information(x):
    """Scrape the first x pages of Hacker News into a DataFrame."""
    collected_titles = []
    collected_ages = []
    collected_links = []
    url = "https://news.ycombinator.com/news?p="

    for page_number in range(1, x + 1):
        page = requests.get(url + str(page_number), headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(page.content, "html.parser")
        for title in soup.find_all("a", class_="storylink"):
            collected_titles.append(title.text)
            collected_links.append(title.get("href"))
        for age in soup.find_all("span", class_="age"):
            collected_ages.append(age.text)
        # Pause between requests to avoid overloading the server
        time.sleep(2)

    return pd.DataFrame({"title": collected_titles, "link": collected_links, "age": collected_ages})


In [74]:
df = collect_information(5)

In [75]:
df = collect_information(2)
assert df.shape == (60, 3)
assert df.columns[0] == "title"

df = collect_information(5)
assert df.shape == (150, 3)
assert df.columns[2] == "age"

In [72]:
df.shape

(150, 3)

## Exercise
For this lesson's sub-project, you will need to scrape Reddit's homepage. As this can be difficult with Reddit's new design, you can use the old one instead: [old.reddit.com](https://old.reddit.com). You will need to complete these tasks:
1. Visit old.reddit.com and look at its layout
2. Create a function that can scrape pages of Reddit. It should return a dataset with: `post score`, `post title`, `post thumb URL`, `posts comments count`, `posts subreddit`. All missing information should be replaced with `None` values
3. Scrape the website and create a dataframe that has at least 300 rows
4. Export the dataframe to CSV format and save it in your repository as `reddit_data.csv`

**IMPORTANT**: You might want to check out [this](https://pypi.org/project/fake-useragent/) Python package that creates a fake user agent for the request.

Do not forget to write clean code!
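
For the `None` requirement, one possible pattern is a small helper that returns `None` whenever an element is absent. This is only a sketch: the helper name `text_or_none` and the sample markup (including its class names) are hypothetical and do not describe Reddit's real HTML.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical markup: the second post has no score element.
sample_html = """
<div class="post"><span class="score">120</span><a class="title">First post</a></div>
<div class="post"><a class="title">Second post</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for post in soup.find_all("div", class_="post"):
    # Searching inside each post keeps the fields of one row together.
    rows.append({
        "score": text_or_none(post, "span", "score"),
        "title": text_or_none(post, "a", "title"),
    })

print(rows)
```

Looping over one container per post, rather than collecting each field in a separate page-wide list, keeps every row the same length even when a field is missing.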


---

In [93]:
page = requests.get("https://old.reddit.com/", headers={"User-Agent": "Mozilla/5.0"})
collected_titles = []
soup = BeautifulSoup(page.content, "html.parser")
for title in soup.find_all("a", class_="title may-blank"):
    collected_titles.append(title.text)

In [94]:
collected_titles

[]
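
An empty list like this can have several causes, such as a blocked request or markup that differs from what was expected. One cause that can be checked offline is how Beautiful Soup treats a `class_` string containing a space: it is compared against the whole `class` attribute, so an element carrying an extra class is not matched. The fragment below is hypothetical, and whether Reddit's links carry extra classes is an assumption to verify in the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical link with three classes.
sample_html = '<a class="title may-blank outbound" href="/r/example">Example post</a>'
soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string comparison: "title may-blank" != "title may-blank outbound".
exact = soup.find_all("a", class_="title may-blank")

# A single class name matches any element that has that class.
single = soup.find_all("a", class_="title")

# A CSS selector requires both classes but tolerates additional ones.
both = soup.select("a.title.may-blank")

print(len(exact), len(single), len(both))
```

Here the first search finds nothing, while the single-class search and the CSS selector each find the link.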

In [95]:
def collect_information(x):
    collected_titles = []
    collected_links = []
    url = "https://old.reddit.com/?count="

    # Work in progress: subreddit, comment count, and score are not collected yet
    for count in range(0, x + 1, 25):
        page = requests.get(url + str(count), headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(page.content, "html.parser")
        # The "may-blank" class is shared by several kinds of links, not only titles
        for link in soup.find_all("a", class_="may-blank"):
            collected_titles.append(link.text)
            collected_links.append(link.get("href"))

    return pd.DataFrame({"title": collected_titles, "link": collected_links})

In [96]:
y = collect_information(2)

In [97]:
y

|   | title | link |
|---|---|---|
| 0 | 0:24 | /r/PublicFreakout/comments/oo4157/tampa_bay_ra... |
| 1 | Tampa Bay Rays game last night had some entert... | /r/PublicFreakout/comments/oo4157/tampa_bay_ra... |
| 2 | DZepperoni | https://old.reddit.com/user/DZepperoni |
| 3 | r/PublicFreakout | https://old.reddit.com/r/PublicFreakout/ |
| 4 | 3320 comments | https://old.reddit.com/r/PublicFreakout/commen... |
| ... | ... | ... |
| 126 |  | /r/dankmemes/comments/oo241d/once_in_a_lifetim... |
| 127 | Once in a lifetime opportunity | /r/dankmemes/comments/oo241d/once_in_a_lifetim... |
| 128 | TheBrownMamba8 | https://old.reddit.com/user/TheBrownMamba8 |
| 129 | r/dankmemes | https://old.reddit.com/r/dankmemes/ |


## Summary
If you want to create your own dataset, web scraping is one way of collecting it. Data will not always come in neatly preprocessed csv columns, and sometimes you will need to extract it from various sources yourself. Automating that process can be a key part of a successful ML project.

---