# Intro to Web Scraping

## 1. What is Web Scraping

In almost every field, a growing amount of data is collected and published on the web, and some of your projects may need that data. For example, you might want to collect the stock prices shown on the Bloomberg or Google Finance websites. You could browse these websites by hand to find the information you are looking for, but how do you let a program read the pages in a structured way and consume the information it needs? Web scraping is the practice of retrieving such information from web pages programmatically, in a form your program can use.

In this micro course we will learn how to use Python to do Web Scraping.

To create the kind of web scraping program described above, we first need to understand the structure of a typical web page or website. The subsections below cover the details.

Besides knowing what to expect in a web page, you need tools to extract information from it. Most programming languages offer libraries for this. In Python there are several, one of which is **Beautiful Soup**. The subsequent sections describe how to use this library in your program.


### a) Structure of a Web Page

A website is an abstraction defined by its designer: a collection of individual pages. Websites differ from one another, and a site may be an application or a set of static pages. However, as web technology has matured, some generic layers have emerged that you can expect on most web pages. A web page typically contains:
```
Header
Navigation
Bread Crumb
Tab Navigation
Content Pane
    Left Navigation Links
    Main content
    Right Details
Footer
```
In most cases, we are interested in the content pane to extract the most useful information from the websites.


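### b) HTTP Request

Web pages are accessed using HTTP requests. A widely used Python library for sending them is **requests**. It is a third-party package, so it may need to be installed first (for example with `pip install requests`).
The requests library allows a Python program to read a web page by taking in a URL string.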

``` python
import requests
response = requests.get("http://www.google.com")
html = response.content
```

As seen above, the `import` statement loads the requests library. The `get` function takes a URL string and fetches the HTML page at that URL. The `content` attribute of the response holds the raw bytes of the response body.
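
The call above can be wrapped in a small helper. This is a minimal sketch, not a prescribed pattern: the function name `fetch_html` and the 10-second timeout are assumptions made for illustration. It uses `raise_for_status()` to turn failed responses (4xx/5xx) into exceptions and returns `response.text`, the decoded string counterpart of `response.content`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url` (hypothetical helper for illustration)."""
    # timeout stops the program from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    # .text is the body decoded to str; .content would be the raw bytes
    return response.text
```

Calling `fetch_html("http://www.google.com")` would then return the page as a string, or raise an exception if the request fails.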

In the following exercise, we will write a code snippet that reads a page and prints out its HTML content.
We will use a website called http://quotes.toscrape.com/, which is specifically designed as a playground for web scraping. The site lists quotes by famous authors. The quotes are presented in the web page as tables and divs. The idea is to read the quotes and the corresponding authors from the page. In addition, we can extract further information about each author, such as the author's date of birth. This information is provided on a separate page for each author.

After opening the website, right-click the page and select "Inspect" to reveal the HTML content. The page contains many "tags", such as ```<head>, <body>, <div>, <span>```. These tags define the structure of the web page. We will focus on ```<body>``` while scraping because the visible content of a page usually lives under this tag. Here is a complete list of [HTML tags](https://www.w3schools.com/TAGS/default.ASP). Most opening tags are followed later by a closing tag marked with **/**, like ```<div>...</div>``` or ```<span>...</span>```. The actual content, which is what we want to scrape, sits between the opening and closing tags.

<img src=Quotes_to_Scrape.jpg width="900"/>

As you may have noticed, some tags carry additional items such as ```class, itemprop```, etc. These are the attributes of the tag, which define the tag's behavior, style, and so on.

In the following sections, we will read all of this information for each quote and create our own quotes dataset.

### Exercise

Write code to open the URL http://quotes.toscrape.com/ and print the HTML content. You can practice in the following code cell.


## 2. Parsing Website Content using Beautiful Soup


There are several libraries available in Python for web scraping, such as Beautiful Soup and Scrapy. In this section we will learn how to use **Beautiful Soup** to scrape a website.

### a) What is Beautiful Soup

**Beautiful Soup** is a Python library for pulling data out of HTML and XML files. Its main class is `BeautifulSoup`, which is part of the `bs4` package (Beautiful Soup 4). To start using it, we import `BeautifulSoup` from `bs4`.

```python
from bs4 import BeautifulSoup
```

If an error prevents you from importing the library, you probably do not have it installed yet. In a notebook, run the following cell before the `import` line (the package is published as **beautifulsoup4** and is imported as `bs4`):
```python
!pip install beautifulsoup4
```

### b) Parsing the web pages 

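To start parsing a web page, pass the HTML document to `BeautifulSoup`, together with the name of the parser to use. `"html.parser"` is the parser that ships with Python; if no parser is named, Beautiful Soup picks one itself and emits a warning.
For example:

```python
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="http://www.google.com">Click to visit google.com</a>', "html.parser")
```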
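
As a small sketch of the same call on a fuller document: the page below is made up for illustration, and it is passed as `bytes` to mimic what `response.content` holds. `BeautifulSoup` accepts either a `str` or raw `bytes`.

```python
from bs4 import BeautifulSoup

# A made-up page, given as bytes to mimic what response.content holds
html_bytes = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(html_bytes, "html.parser")

print(soup.title.text)  # the text inside <title>
print(soup.prettify())  # the parsed document, re-indented for reading
```

Printing `soup.prettify()` is a convenient way to check what Beautiful Soup actually parsed before searching it.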


### c) Reading the content using tags and classes

We can read a tag by using that tag's name. For example, `soup.a` will return the first `<a></a>` tag.

```python
a_tag = soup.a
print(a_tag)
```
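
A minimal sketch of tag-name access on a made-up snippet with two links, assuming the `"html.parser"` parser. It shows that only the first matching tag comes back, that an attribute can be read with `["href"]`, and that a tag missing from the document yields `None`.

```python
from bs4 import BeautifulSoup

# Made-up snippet containing two anchors and no table
html_doc = (
    '<p><a href="http://www.google.com">Google</a> '
    '<a href="http://quotes.toscrape.com">Quotes</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a          # only the first <a> is returned
print(first_link["href"])    # read an attribute like a dictionary key
print(first_link.name)       # the tag's name
print(soup.table)            # there is no <table>, so this is None
```
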
If we want to find all the `<div></div>` tags that have a specific class, `my_class`, we can use `find_all()` with the `class_` argument as below.
```python
specific_tags = soup.find_all("div", class_="my_class")
```
Note that `class` is a reserved word in Python, so Beautiful Soup uses the keyword argument `class_` (with a trailing underscore) instead.
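
The search can be tried on a made-up snippet. This is a sketch under the assumption of the `"html.parser"` parser; the class names and texts are invented. It also shows the equivalent `attrs` dictionary form, and that a tag carrying several classes still matches when one of them is `my_class`.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two matching divs, one non-matching div, one span
html_doc = """
<div class="my_class">first</div>
<div class="other">second</div>
<div class="my_class extra">third</div>
<span class="my_class">fourth</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword form, with the trailing underscore
specific_tags = soup.find_all("div", class_="my_class")
texts = [t.text for t in specific_tags]

# Equivalent dictionary form, which avoids the reserved word entirely
same_tags = soup.find_all("div", attrs={"class": "my_class"})
same_texts = [t.text for t in same_tags]

print(texts)
print(same_texts)
```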

### Exercise

Write code to fetch http://quotes.toscrape.com and parse it using BeautifulSoup (Hint: refer to the first section). Find and print all `<div>` tags with the class "quote". Save the result in the variable 'tag'.

## 3. Getting to the detailed tags in HTML

Once we find a way to parse the html, the next step is to get to the individual tags that we are interested in.

### a) Class and Tags

Beautiful Soup provides functions to obtain specific tags and to search for tags by name and class.

```python
tag = soup.find("div", class_="myclass")
```

In the above example, `soup` is the parsed document, and `find()` looks for a specific tag, `<div></div>` in this case, whose class is `myclass`. It fetches the first such result. The `find_all()` function instead returns all matching tags.
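
The difference between the two functions can be sketched on a made-up snippet (the HTML and the class `no_such_class` are invented for illustration). Note what each returns when nothing matches: `find()` gives `None`, while `find_all()` gives an empty list.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two matching divs
html_doc = """
<div class="myclass">one</div>
<div class="myclass">two</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("div", class_="myclass")        # first match only
every = soup.find_all("div", class_="myclass")    # list of all matches
missing = soup.find("div", class_="no_such_class")          # no match
none_found = soup.find_all("div", class_="no_such_class")   # no match

print(first.text)
print(len(every))
print(missing)
print(len(none_found))
```

Because `find()` can return `None`, it is worth checking the result before reading `.text` from it when a tag might be absent.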

### b) Getting the content within

To fetch the content of a tag, use its `.text` attribute.<br>
For example, we can read the content of the HTML tag `<a href="http://www.google.com">Click to visit google.com</a>` by using `.text`, which will return `Click to visit google.com`. In this case the code will be similar to the following:

```python
content = tag.text
print("Content is :: ", content)
```
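
When a tag contains nested tags or stray whitespace, `.text` returns all the nested text joined together, whitespace included. The sketch below uses a made-up variant of the link above and compares `.text` with `get_text()`, whose separator and `strip` arguments tidy the result.

```python
from bs4 import BeautifulSoup

# Made-up variant of the link: extra spaces and a nested <b> tag
html_doc = '<a href="http://www.google.com">  Click to <b>visit</b> google.com </a>'
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.a

raw = tag.text                          # nested text, whitespace kept
clean = tag.get_text(" ", strip=True)   # pieces stripped, joined by a space

print(repr(raw))
print(repr(clean))
```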

### Exercise

Using the 'tag' variable from the previous exercise, which holds every `<div>` with class `quote`, find the "small" tag with the class "author" inside each one. Print the content within the resulting tags. <br>
(Hint: 'tag' is a list, so first iterate over it with a **for loop** to check each item. Then extract the text of the `small` tag with class `author`, and the text of the `span` tag.)


### Solution
```python
for t in tag:
    print(t.span.text)
    a = t.find("small", class_="author")
    author = a.text
    print("By :: ", author)
```

## 4. Reading Sub-Links and Joining data

In the real world, you often need to gather information from multiple sources, or at the very least from multiple pages or URLs of the same application, to assemble related information. Here is an example of how to do it using **BeautifulSoup**.

We obtain the href from an anchor tag (`<a></a>`) in two steps. The first step is to obtain the anchor by calling `.a` on the parsed HTML document, which returns the first anchor it contains. Then, on the anchor element, obtain the href (link) by calling `.get('href')`.

```python
hrefs = soup.a
link = hrefs.get('href')
```

Then you can call `requests.get(link)` to fetch the linked page and parse its content afresh. If the href is relative (for example, it starts with `/`), prepend the site's base URL before requesting it.
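
Relative links can be combined with the site's address using `urljoin` from Python's standard library. The sketch below is illustrative only: the HTML is a hand-written snippet modeled on a quote block, and the names `base_url` and `link_url` are assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://quotes.toscrape.com/"

# Hand-written snippet modeled on a quote block; the href is relative
html_doc = (
    '<div class="quote"><span>by <small class="author">Albert Einstein</small> '
    '<a href="/author/Albert-Einstein">(about)</a></span></div>'
)
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a.get("href")            # the relative path from the anchor
link_url = urljoin(base_url, link)   # absolute URL, ready for requests.get

print(link)
print(link_url)
```

`urljoin` handles the slashes between the two parts, so it works whether or not the base URL ends with `/`.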


### Exercise

From the quotes example, write a function `get_author_dob()` that takes the link to an author's page and returns the author's date of birth (the element with class `author-born-date`). Call it with the author link of each quote and print the result.


```python
# Modify the code below
def get_author_dob(link_url):
    response_auth = requests.get(link_url)
    ## add your function code here


soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all("div", class_="quote")
#print(tag)
for t in tag:
    text_out = t.span.text
    ## Add your logic to call the function get_author_dob() with links
```

### Solution

```python
import requests
from bs4 import BeautifulSoup

# Base URL without a trailing slash, because the author hrefs start with "/"
url = "http://quotes.toscrape.com"

def get_author_dob(link_url):
    # Fetch and parse the author's page, then read the date of birth
    response_auth = requests.get(link_url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth, "html.parser")
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return auth_tag.text

output = []

html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all("div", class_="quote")
for t in tag:
    text_out = t.span.text
    print(text_out)

    a = t.find("small", class_="author")
    author = a.text
    text_out = text_out + ',"' + author + '"'
    print("By :: ", author)

    # The first <a> inside a quote block links to the author's page
    hrefs = t.a
    link = hrefs.get('href')
    link_url = url + link
    print(link_url)
    dob = get_author_dob(link_url)
    print("Author DOB:", dob)
    text_out = text_out + ',"' + dob + '"'
    output.append(text_out)
```


## 5. Writing to file

When we are done with the scraping operations we can write the data to a file. <br>
(If you are not familiar with this operation, feel free to skip this section. We will talk about `file_IO` in future micro courses.)

```python
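# "a" appends to the file; the with block closes it automatically
with open("output.csv", "a", encoding="utf-8") as file_wr:
    for line in output:
        # End each record with a newline so every quote gets its own line
        file_wr.write(str(line) + "\n")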
```
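
Building CSV lines by hand can break when a field itself contains a comma or a double quote. As an alternative sketch, Python's standard `csv` module takes care of the quoting. The rows below are made up, the column names are assumptions, and the file is written to a temporary directory purely for demonstration.

```python
import csv
import os
import tempfile

# Made-up rows in the order (quote, author, date of birth)
rows = [
    ("A quote, with a comma.", "Some Author", "January 1, 1900"),
    ('A quote with "double quotes".', "Another Author", "February 2, 1950"),
]

# Written to a temporary directory here; any path would do
csv_path = os.path.join(tempfile.mkdtemp(), "quotes.csv")

# newline="" is what the csv module's documentation recommends for files
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author", "born"])  # header row
    writer.writerows(rows)                        # one line per record

# Read the file back to confirm the fields survive the round trip
with open(csv_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

Here the data is kept as tuples of fields rather than pre-joined strings, which lets the writer decide where quoting is needed.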

### Exercise

Using the "output" list from the previous exercise, write all lines to a "quotes.csv" file, appending one line per scraped quote.


# Learn more about Web Scraping using Python
Now you know the basics of scraping data from websites. Scraping gives you more freedom to get data than using APIs does. However, it can be less stable, because websites may block your program's access from time to time. If you want to learn more about how to deal with this problem and how to use third-party APIs to retrieve data, check out our course at https://refactored.ai. Our course on Python covers everything from introductory Python to pandas, to data visualization with Plotly, to statistics and machine learning techniques.