# Web Scraping using Python

## 1.1 What is Web Scraping

In any given field, there is a growing amount of data collected and presented all over the web. Some of your projects might need access to the data presented at various websites. For example, you might want to collect the stock prices information presented at Bloomberg or Google Finance website. You could casually browse these websites to get the information you are looking for. How  do you allow your program to structurally read these websites and consume the information the applications need? Web Scraping helps you in retrieving such information the program needs in a meaningful way.


## 1.2 Structure of a Web Page

A Website is really what the designer of the website thinks, an abstraction. The website is a collection of individual pages. It can change from website to website and the type of website, an aplication or a static page. However, over the years as the web technology matured, there are some generic layers to a webpage that you come to expect. A webpage typically contains:

Header
Navigation
Bread Crumb
Tab Navigation
Content Pane
    Left Navigation Links
    Main content
    Right Details
Footer

In most cases, we are interested in the content pane to extract the most useful information from the websites.


## 1.3 Http Request

Web pages are accessed using http requests. Python provides a module called "requests".
The requests module allows a python program to read a web page by taking in a URL string.



``` python
import requests
response = requests.get("http://www.google.com")

```

As seen above, import imports the requests object.
The get method takes a URL string and gets the html page from the website referred by the URL.

Let us see a code that reads a page and prints out the html text.

In [13]:
import requests

url = 'http://quotes.toscrape.com/'
response = requests.get(url)
html = response.content
print(html)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

## 2. Parsing Website Content using Beautiful Soup

There are several ways to read a file in Python, let's see them one by one.

### 2.1 Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. The main class is BeautifulSoup which is part of a library bs4 (BeautifulSoup 4.0). In order to start using BeautifulSoup, we need to import BeautifulSoup from bs4.

```python
from bs4 import BeautifulSoup
```


### 2.2 Parsing the web pages 

In order to start using the module for parsing web pages, you start by passing the html document to BeautifulSoup.
For example:

```python
from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="http://www.google.com">Click to visit google.com</b>')
```


### 2.3 Reading the content usign tags and classes

We can read a file line-by-line using a **for loop**. This is both efficient and fast.

```python
a_tag = soup.a
print(a_tag)
```


In [14]:
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")
print(tag)

    
    

[<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>, <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K.



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


## 3. Getting to the detailed tags in HTML

Once we find a way to parse the html, the next step is to get to the individual tags that we are interested in.

### 2.1 Class and Tags

Beautiful Soup provides functions to obtain specific tags and search for tags and class.

```python
tag = html_text.find("div", class_="myclass")
```

In the above example, we can find the specific tag , div in this case, where the class is "myclass". This fetches the first such result. There is also a find_all function that provies a list of such tags as a result of the search.

### 2.2 Getting the content within

In order to fetch the content, use tag.text.
For example, in this case the code will be similar to below:

```python
    content = tag.text
    print("Content is :: ", content)
```


In [15]:
for t in tag:
    print(t.span.text)
    a = t.find("small", class_="author")
    author = a.text
    print("By :: ", author)


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
By ::  Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
By ::  J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
By ::  Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
By ::  Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
By ::  Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”
By ::  Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”
By ::  André Gide
“I have not failed. I've just found 10,000 ways that won't work.”
By ::  Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is 

## 3. Reading Sub-Links and Joining data

In real world, you might need to gather information from multiple sources or at the very least multiple pages or urls from the same application to get the related information. Here is an example of how to do it using BeautifulSoup.

We obtain the hrefs from anchor tags in two steps. The first step is to obtain the anchors by calling a .a on the base html doc. Then on the anchor element otain the href (link) by calling .get('href').

```python
    hrefs = soup.a
    link = hrefs.get('href')
    
```

The you could call the requests.get(link) to obtain the sub-link page and perform fresh parsing of the related content.

In the quotes example, 


In [26]:
def get_author_dob(link_url):
    response_auth = requests.get(link_url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return auth_tag.text

output = []

soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")
#print(tag)
for t in tag:
    text_out =  t.span.text 
    print(t.span.text)
    
    a = t.find("small", class_="author")
    author = a.text
    text_out = text_out + ',"' + author + '"' 
    print("By :: ", author)
    hrefs = t.a
    link = hrefs.get('href')
    link_url = url+ link
    print(link_url)
    dob = get_author_dob(link_url)
    print("Author DOB:", )
    text_out = text_out + ',"' + dob + '"' 
    output.append(text_out)
    



 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
By ::  Albert Einstein
http://quotes.toscrape.com//author/Albert-Einstein
Author DOB:
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
By ::  J.K. Rowling
http://quotes.toscrape.com//author/J-K-Rowling
Author DOB:
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
By ::  Albert Einstein
http://quotes.toscrape.com//author/Albert-Einstein
Author DOB:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
By ::  Jane Austen
http://quotes.toscrape.com//author/Jane-Austen
Author DOB:
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
By ::  Marilyn Monroe
http://quotes.toscrape.com//author/Marilyn-Monroe
Author DOB:
“Try not to become a man of success.

## 4. Writing to file

When we are done with the scraping operations we can write the data to a file.

```python
file_wr = open("output.csv", "a")
for line in output:
    file_wr.write(str(line))
file_wr.close()
```



In [27]:
file_wr = open("output.csv", "a")
for line in output:
    file_wr.write(str(line) + '\n')
file_wr.close()