# Web Scraping using Python

## 1.1 What is Web Scraping

Every field produces a growing amount of data that is collected and published across the web, and some of your projects may need access to data presented on various websites. For example, you might want to collect the stock price information shown on the Bloomberg or Google Finance websites. You could browse these sites by hand to find what you are looking for, but how do you let a program read them in a structured way and consume the information your application needs? Web scraping is the technique of retrieving that information programmatically, in a form the program can use.

To create web scraping programs, we first need to understand the structure of a typical web page or website, which the next section explains. In addition to knowing what to expect in a web page, you need tools to extract information from it. These are usually libraries available in your programming language. Python has several such libraries, one of which is Beautiful Soup. The subsequent sections describe how to use it in your program.


## 1.2 Structure of a Web Page

A website is an abstraction defined by its designer: a collection of individual pages whose layout varies from site to site and with the type of site (an application or a set of static pages). However, as web technology has matured, a few generic layers have emerged that you can expect to find. A web page typically contains:

- Header
- Navigation
- Breadcrumb
- Tab navigation
- Content pane
    - Left navigation links
    - Main content
    - Right details
- Footer

In most cases, we are interested in the content pane to extract the most useful information from the websites.
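
To make these layers concrete, here is a minimal sketch that walks a small, made-up page with Python's built-in `html.parser` module and prints the nesting of its tags. The HTML snippet, the `OutlineParser` class, and the `id` values are hypothetical illustrations; real pages use their own tag and class names.

```python
from html.parser import HTMLParser

# A made-up page skeleton; real sites name these regions differently.
PAGE = """
<html><body>
  <header>Site title</header>
  <nav>Home | About</nav>
  <div id="content">
    <div id="left-nav">Links</div>
    <div id="main">Main content</div>
  </div>
  <footer>Contact</footer>
</body></html>
"""

class OutlineParser(HTMLParser):
    """Collects (depth, tag, id) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id", "")
        self.outline.append((self.depth, tag, element_id))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(PAGE)

# Print each tag indented by its nesting depth
for depth, tag, element_id in parser.outline:
    label = f"{tag}#{element_id}" if element_id else tag
    print("  " * depth + label)
```

In this sketch the content pane is the `div` with the id `content`, and the parts we would usually scrape sit one level below it.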


## 1.3 HTTP Request

Web pages are accessed using HTTP requests. A widely used third-party Python library for this is `requests` (installable with `pip install requests`).
It lets a Python program read a web page given a URL string.



``` python
import requests
response = requests.get("http://www.google.com")

```

As seen above, the `import` statement loads the `requests` module. Its `get` function takes a URL string and retrieves the HTML page that the URL refers to.
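
In practice it helps to check that the request succeeded before using the result. The sketch below wraps the call in a small helper; `fetch_html` and its 10-second default timeout are illustrative choices, not part of `requests` itself. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses, and on `response.text`, which holds the body decoded to a string.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the decoded HTML of a page, or raise if the request failed."""
    # Without a timeout, requests can wait indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://quotes.toscrape.com/")` would return the page as a string, assuming the site is reachable.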

The following exercise reads a page and prints its HTML text.
It uses http://quotes.toscrape.com/, a website designed specifically as a playground for web scraping. The site lists quotes by famous authors, laid out on the page inside HTML elements such as `div` and `span`. The idea is to read the quotes and their authors from the page. We can also extract additional information about each author, such as the date of birth, which is provided on a separate author page.

In the following sections, we will read all of this information for each quote and create our own quotes dataset.

### Exercise

Write code to open the URL "http://quotes.toscrape.com/" and print the HTML response.

#write your code below

### Solution

```python
import requests

url = 'http://quotes.toscrape.com/'
response = requests.get(url)
# response.text is the decoded page; response.content holds the raw bytes
html = response.text
print(html)
```

## 2. Parsing Website Content using Beautiful Soup

Several Python libraries are available for web scraping, such as Beautiful Soup and Scrapy. In this lesson we will see how to use Beautiful Soup to scrape a website.

### 2.1 Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Its main class is `BeautifulSoup`, which is part of the `bs4` package (Beautiful Soup 4). To start using it, import `BeautifulSoup` from `bs4`.

```python
from bs4 import BeautifulSoup
```


### 2.2 Parsing the web pages

To parse a web page, start by passing the HTML document to the `BeautifulSoup` constructor, along with the name of the parser to use.
For example:

```python
from bs4 import BeautifulSoup

# "html.parser" is Python's built-in parser; naming it avoids a bs4 warning
html_doc = '<a href="http://www.google.com">Click to visit google.com</a>'
soup = BeautifulSoup(html_doc, "html.parser")
```
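
Once the document is parsed, the tag's name, text, and attributes can be read directly from the resulting object. A short sketch, using a hypothetical one-line HTML string:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="http://www.google.com">Click to visit google.com</a>'
soup = BeautifulSoup(html_doc, "html.parser")

link_tag = soup.a               # first <a> tag in the document
print(link_tag.name)            # tag name
print(link_tag.text)            # text between the opening and closing tags
print(link_tag["href"])         # attribute lookup, like a dictionary
print(link_tag.get("title"))    # .get() returns None for a missing attribute
```

Using `.get()` rather than square brackets is the safer choice when an attribute may be absent, since square brackets raise a `KeyError` in that case.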


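### 2.3 Reading the content using tags and classes

The first tag of a given type is available as an attribute of the soup object, for example `soup.a` for the first anchor. To collect every tag that matches a name and a class, use `find_all()`.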

```python
a_tag = soup.a
print(a_tag)

#For a specific class you can use find_all() with class_ attribute as below
specific_tags = soup.find_all("div", class_="my_class")

```
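
To see `find_all()` at work without a network call, the following sketch parses a small hand-written snippet. The markup and the class names `my_class`, `other`, and `highlighted` are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="my_class">first</div>
<div class="other">second</div>
<div class="my_class highlighted">third</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python
matches = soup.find_all("div", class_="my_class")
texts = [div.text for div in matches]
print(texts)
```

Note that a tag carrying several classes still matches when any one of them equals the value passed as `class_`, which is why the third `div` is included.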

### Exercise

Write code to fetch "http://quotes.toscrape.com/" and parse it using BeautifulSoup. Find all `div` tags with the class "quote", save the result in the variable `tag`, and print it.

#Write your code below
#hint use requests.get() and the url to get the html content

### Solution

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)
html = response.content

# Name the parser explicitly; "html.parser" ships with Python
soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all("div", class_="quote")
print(tag)
```

    
    

## 3. Getting to the detailed tags in HTML

Once we find a way to parse the html, the next step is to get to the individual tags that we are interested in.

### 3.1 Class and Tags

Beautiful Soup provides functions to fetch specific tags and to search by tag name and class.

```python
# soup is the BeautifulSoup object created in the previous section
tag = soup.find("div", class_="myclass")
```

The above example finds the first `div` tag whose class is "myclass". There is also a `find_all` function that returns a list of all matching tags.

### 3.2 Getting the content within

To fetch the content, use `tag.text`.
For example, in this case the code will be similar to below:

```python
content = tag.text
print("Content is :: ", content)
```
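
Two details are worth keeping in mind here: `find()` returns `None` when nothing matches, and the text of a tag often carries surrounding whitespace. The sketch below, built on a made-up snippet, guards against the first and uses `get_text(strip=True)` for the second.

```python
from bs4 import BeautifulSoup

html_doc = '<div class="myclass">  Hello, world!  </div>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.find("div", class_="myclass")
missing = soup.find("div", class_="no_such_class")

# find() returns None when there is no match, so check before reading the text
if tag is not None:
    content = tag.get_text(strip=True)
    print("Content is :: ", content)
print("Missing tag:", missing)
```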

### Exercise

Using the `tag` variable from the previous exercise, which holds every `div` with the class "quote", find the "small" tag with the class "author" inside each one. Print the content of the resulting tags.

#write your code below

### Solution

```python
for t in tag:
    print(t.span.text)
    a = t.find("small", class_="author")
    author = a.text
    print("By :: ", author)
```

## 4. Reading Sub-Links and Joining data

In the real world, you often need to gather related information from multiple sources, or at the very least from multiple pages or URLs of the same application. Here is an example of how to do it using BeautifulSoup.

We obtain the href of an anchor tag in two steps. The first step is to obtain the anchor by accessing `.a` on the parsed HTML document. The second is to read the link from that anchor element by calling `.get('href')`.

```python
hrefs = soup.a              # first <a> tag in the document
link = hrefs.get('href')    # value of its href attribute
    
```

You can then call `requests.get(link)` to fetch the linked page and parse its content in the same way. If the href is relative, join it with the site's base URL first.
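
Links found in a page are often relative (for example `/author/Some-Name`), so they need to be combined with the site's address before they can be requested. One way to do this is `urljoin` from Python's standard `urllib.parse` module; the author path below is a hypothetical example.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"
relative_link = "/author/Some-Name"   # hypothetical href taken from an anchor tag

# urljoin resolves the href against the base, avoiding doubled or missing slashes
full_url = urljoin(base_url, relative_link)
print(full_url)

# An href that is already absolute is returned unchanged
absolute = urljoin(base_url, "http://example.com/page")
print(absolute)
```

Plain string concatenation of the two values above would produce a doubled slash, which is why a helper like this is usually preferable.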


### Exercise

From the quotes example, complete the function `get_author_dob()`. For each quote, pass the link to the author's page to the function, obtain the author's date of birth (the element with the class "author-born-date"), and print it.


```python
# Modify the code below
def get_author_dob(link_url):
    response_auth = requests.get(link_url)
    ## Add your function code here

soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all("div", class_="quote")
#print(tag)
for t in tag:
    text_out = t.span.text
    ## Add your logic to call the function get_author_dob() with links
```

    






### Solution

```python

from urllib.parse import urljoin

def get_author_dob(link_url):
    """Fetch an author's page and return the date of birth shown on it."""
    response_auth = requests.get(link_url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth, "html.parser")
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return auth_tag.text

output = []

soup = BeautifulSoup(html, "html.parser")
tag = soup.find_all("div", class_="quote")
for t in tag:
    text_out = t.span.text
    print(text_out)

    a = t.find("small", class_="author")
    author = a.text
    text_out = text_out + ',"' + author + '"'
    print("By :: ", author)

    # t.a is the first anchor inside the quote, assumed to link to the author's page
    hrefs = t.a
    link = hrefs.get('href')
    # urljoin resolves the relative href against the base url
    link_url = urljoin(url, link)
    print(link_url)
    dob = get_author_dob(link_url)
    print("Author DOB:", dob)
    text_out = text_out + ',"' + dob + '"'
    output.append(text_out)
    
```

## 5. Writing to a file

When we are done with the scraping operations we can write the data to a file.

```python
# Append each scraped record to the file, one record per line
with open("output.csv", "a", encoding="utf-8") as file_wr:
    for line in output:
        file_wr.write(str(line) + "\n")
```
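
Because quotes themselves often contain commas, joining fields by hand can produce rows that other tools misread. A possible alternative is Python's built-in `csv` module, which quotes fields as needed. The sketch below writes to an in-memory buffer so it runs without touching the disk; the `rows` data and the column names are made up, and writing to a real file would pass a file object opened with `newline=""` instead of the buffer.

```python
import csv
import io

# Made-up records: (quote text, author, date of birth)
rows = [
    ("A quote, with a comma.", "Some Author", "January 01, 1900"),
    ("Another quote.", "Other Author", "February 02, 1950"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["quote", "author", "born"])   # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included
parsed = list(csv.reader(io.StringIO(csv_text)))
```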

### Exercise

From the previous exercise with "output" from quotes, write all lines to a "quotes.csv" file by appending each line from scraped quotes.


#Write your code below

### Solution

```python
# The with statement closes the file automatically, even if an error occurs
with open("quotes.csv", "a", encoding="utf-8") as file_wr:
    for line in output:
        file_wr.write(str(line) + '\n')
```