<img src="http://www.daysofcode.nl/assets/img/logo.png" style="float: left; margin: 5px; height: 40px">

# Introduction to Web Scraping with `BeautifulSoup`
---


### Agenda
- Understand the structure and content of HTML
- Learn about elements, attributes, and element hierarchy in HTML
- Practice using Beautiful Soup to parse data from Realty Austin


<img src='https://cdn-images-1.medium.com/max/1200/1*MJ9Y4_tCTv99Gs_xZYlKrA.png' style="margin: 5px; height: 80px">

Webpages hold a wealth of information as unstructured data, most of which can be mined into structured data.


**If you can see it, it can be scraped, mined, and put into a dataframe.**


Before we begin the actual process of web scraping with Python, it is important to cover the basic constructs of HTML, the format that this unstructured data arrives in.

<a id='html'></a>

## Hypertext markup language (HTML)

---

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements is stored in text nodes.
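
The node types above can be seen from Python. The following is a minimal sketch, assuming Beautiful Soup is installed and using Python's built-in `html.parser`; the HTML snippet is made up for illustration. Note that Beautiful Soup does not create separate attribute node objects: it exposes an element's attributes as a dictionary on the element.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up snippet: one element with an attribute, text, and a nested element
html = '<p id="intro">Hello, <strong>world</strong>!</p>'
soup = BeautifulSoup(html, "html.parser")

# The soup object plays the role of the document node
print(type(soup).__name__)

# Element nodes are Tag objects; their attributes live in a dict
p = soup.p
print(isinstance(p, Tag))
print(p.attrs)

# The children of an element are either nested elements or pieces of text
for child in p.children:
    kind = "element" if isinstance(child, Tag) else "text"
    print(kind, repr(str(child)))
```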

<a id='elements'></a>
### Elements
Elements begin and end with opening and closing "tags": a tag name wrapped in angle brackets, with a leading slash in the closing tag. The name in the opening tag and the name in the closing tag must match.

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

Because a single page may contain several titles or paragraphs, you can assign an `id` attribute to an element to give it a unique reference point.  IDs are also very useful for labelling nested elements.
```
<title id='title_1'>I am the first title.</title>
<p id='para_1'>I am the first paragraph.</p>
<title id='title_2'>I am the second title.</title>
<p id='para_2'>I am the second paragraph.</p>
```
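
IDs pay off once you start scraping, because they let you jump straight to one element. Here is a small sketch of that idea, assuming Beautiful Soup with the built-in `html.parser`; the markup mirrors the example above.

```python
from bs4 import BeautifulSoup

html = """
<title id="title_1">I am the first title.</title>
<p id="para_1">I am the first paragraph.</p>
<title id="title_2">I am the second title.</title>
<p id="para_2">I am the second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Look an element up by its id attribute
second_para = soup.find(id="para_2")
print(second_para.get_text())

# find() returns None when nothing matches, so check before using the result
missing = soup.find(id="para_3")
print(missing)
```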



**Elements can have parents and children:**
Remember that an element can be both a parent and a child. Whether you call it a parent or a child depends on which other element you are relating it to.


```
<body id = 'parent'>
    <div id = 'child_1'>I am the child of 'parent'
        <div id = 'child_2'>I am the child of 'child_1'
            <div id = 'child_3'>I am the child of 'child_2'
                <div id = 'child_4'>I am the child of 'child_3'</div>
            </div>
        </div>
    </div>
</body>
```
**or**
```
<body id = 'parent'>
    <div id = 'child_1'>I am the parent of 'child_2'
        <div id = 'child_2'>I am the parent of 'child_3'
            <div id = 'child_3'> I am the parent of 'child_4'
                <div id = 'child_4'>I am not a parent </div>
            </div>
        </div>
    </div>
</body>
```
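
The same parent and child relationships can be walked in code. This is a minimal sketch, assuming Beautiful Soup with the built-in `html.parser`; the markup is a stripped-down copy of the nesting shown above.

```python
from bs4 import BeautifulSoup

html = """
<body id="parent">
    <div id="child_1">
        <div id="child_2">
            <div id="child_3">
                <div id="child_4"></div>
            </div>
        </div>
    </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
child_2 = soup.find(id="child_2")

# One step up: the direct parent
print(child_2.parent["id"])

# One step down: only direct children (recursive=False)
direct_children = [tag["id"] for tag in child_2.find_all("div", recursive=False)]
print(direct_children)

# All the way down: every nested <div>
descendants = [tag["id"] for tag in child_2.find_all("div")]
print(descendants)

# All the way up: every ancestor that carries an id
ancestors = [tag["id"] for tag in child_2.parents if tag.has_attr("id")]
print(ancestors)
```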

<a id='attributes'></a>
### Attributes

HTML elements can have attributes.  Attributes describe the properties and characteristics of an element.  Some affect how the element behaves or how the browser renders it.

One of the most common elements is the "anchor" element (`<a>`).  Anchor elements usually have an "href" attribute, which tells the browser where to go when the link is clicked.  Browsers typically render anchors in a different color, often underlined, as a visual cue that sets them apart from the surrounding text.

**Markup that describes an element with attributes literally looks like this:**

```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this:**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
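
When scraping, the attribute is often the part you want (for example, the link target rather than the link text). Below is a short sketch of reading attributes, assuming Beautiful Soup with the built-in `html.parser` and reusing the anchor from the example above.

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Attributes behave like dictionary entries on the element
print(link["href"])

# .get() returns None instead of raising KeyError when the attribute is absent
print(link.get("title"))

# .attrs holds every attribute of the element as a dict
print(link.attrs)

# The visible link text is separate from the attributes
print(link.get_text())
```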

### Element hierarchy

<img src='http://www.computerhope.com/jargon/d/dom1.jpg' style='height: 200px'>

```
<html>    
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
```
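
To make the hierarchy visible, you can walk the tree and print each element indented by its depth. The helper `outline` below is a hypothetical name introduced for this sketch, not part of Beautiful Soup; the sketch assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
"""


def outline(node, depth=0):
    """Return one line per element, indented four spaces per nesting level."""
    lines = []
    for child in node.children:
        # Skip text nodes (including whitespace); keep only elements
        if isinstance(child, Tag):
            lines.append("    " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


soup = BeautifulSoup(html, "html.parser")
print("\n".join(outline(soup)))
```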

<a id='html-resources'></a>
### You are now qualified HTML experts

![](assets/certified.jpg)

Your HTML learning can continue...

Read all about the different elements supported amongst modern browsers:
 * [HTML5 Cheatsheet](http://websitesetup.org/html5-cheat-sheet/)
 * [Mozilla HTML Element Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
 * [HTML5 Visual Cheatsheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf)
 

<a id='practical'></a>

## Using Requests + Beautiful Soup to extract information from a webpage.

---

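Beautiful Soup is a Python library for pulling data out of HTML and XML files.  It is not a parser itself: it sits on top of one, such as Python's built-in `html.parser`, `lxml`, or `html5lib`, and exposes the parsed document as ordinary Python objects.  Because you can explore those objects interactively in a notebook or IDE, it is an easy tool to start with when first extracting information from HTML.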

Please make sure that the required packages are installed: 

```bash
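# beautiful soup and the parsers used in this lesson:
> conda install beautifulsoup4
> conda install lxml
> conda install html5lib

# or if conda doesn't work
> pip install beautifulsoup4
> pip install lxml
> pip install html5lib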
```

<a id='step1'></a>
### Step 1: fetch the content by URL



In [1]:
# requests fetches the page over HTTP; bs4 parses the HTML that comes back
import requests
from bs4 import BeautifulSoup

In [5]:
# call the webpage
response = requests.get("http://daysofcode.io/")

# Status codes show how the target server responded to your request.
# Ex. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found
print('Status Code: ', response.status_code)

# Pull the decoded HTML out of the response as a Python string
res = response.text

# The first 200 characters of the content
print("\nFirst part of HTML document fetched as string:\n")
print(res[:200])

Status Code:  200

First part of HTML document fetched as string:

<!DOCTYPE html>
<html lang="en">

	<head>
		<!-- Meta -->
		<meta charset="utf-8">
		<meta http-equiv="X-UA-Compatible" content="IE=edge">
		<meta name="viewport" content="width=device-width, initial-
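
In a longer scraping job you usually want a failed request to stop the script rather than quietly hand an error page to the parser. The sketch below wraps the fetch in a small helper; `fetch_html` and its default timeout of 10 seconds are assumptions made for illustration, while `timeout=` and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML for `url`, raising on network or HTTP errors."""
    # Without a timeout, requests can wait indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, the fetch in this step could be written as `res = fetch_html("http://daysofcode.io/")`.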


<a id='step2'></a>
### Step 2: Parse HTML document with Beautiful Soup

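This step turns the HTML string into a tree of Python objects, so we can reach elements by tag name, attributes, or CSS selectors. (Beautiful Soup does not support XPath expressions; use `lxml` directly if you need them.)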

In [9]:
soup = BeautifulSoup(res, 'html5lib')

Querying the soup feels much like accessing attributes and items on an ordinary Python object.  

> **Note:** There are many ways to get the elements in a "soup" object

Here are a few ways to select HTML elements as "objects" within the "soup" document.

In [10]:
# Singular element
soup.html.title

<title>Days of Code | Austin | Tuition-free courses</title>

In [12]:
# Just the text inside the element, without the tags
print(soup.html.title.text)

Days of Code | Austin | Tuition-free courses
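
Dotted access such as `soup.html.title` only ever returns the first matching element. For anything more, `find` and `find_all` are the usual tools. The sketch below runs on a made-up snippet (not the live page) and assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration
html = """
<html>
  <head><title>  Example  </title></head>
  <body>
    <h4>First</h4>
    <h4>Second</h4>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None if there is no match
first = soup.find("h4")
print(first.get_text())
print(soup.find("h5"))

# find_all() returns every match in document order
all_h4 = soup.find_all("h4")
print([h.get_text() for h in all_h4])

# get_text(strip=True) trims the surrounding whitespace
print(soup.title.get_text(strip=True))
```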


In [19]:
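# find_all returns every matching element; find returns only the first.
# First parameter: the tag name; second: a dict of attributes to match
element = soup.find_all("section", {"id": "application"})

In [32]:
# Print the <h4> text (the person's name) from each "single_about" block
for person in element[0].find_all('div', {'class': 'single_about'}):
    print(person.find('h4').text)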

Meghann Agarwal
Youssef Chaker
Natasha Robarge
Ryan Heneise
Mateo Clarke
Iona Olive
Lee Harper
Happiness Kisoso
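
The same extraction can also be written as a single CSS selector with `select`. The sketch below runs on hypothetical markup that only imitates the `section#application` / `div.single_about` / `h4` structure queried above; the names are placeholders, and the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure queried above (placeholder names)
html = """
<section id="application">
  <div class="single_about"><h4>Ada Lovelace</h4></div>
  <div class="single_about"><h4> Grace Hopper </h4></div>
  <div class="single_about"><p>No heading here</p></div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# One selector replaces the nested find_all calls; blocks without an <h4>
# are skipped instead of raising an error
selector = "section#application div.single_about h4"
names = [h4.get_text(strip=True) for h4 in soup.select(selector)]
print(names)

# A list of dicts like this can be passed straight to pandas.DataFrame later
records = [{"name": name} for name in names]
print(records)
```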
