# HTML
Extracting data from HTML has pros and cons.
Some advantages include:
- Abundance of information on the web
- Low cost
- Enormous amount of information in depth and breadth
- Often a high level of accuracy (publishers tend to care about the accuracy of content they push to the internet)
Disadvantages include:
- The domain may not welcome web scraping. This can usually be found in the site's terms and conditions, so make sure you inspect that document before scraping a website.
- The information and structure of web pages can change fast, so scripts will need to be updated accordingly.

The basic HTML structure:

```html
<tagname attribute="value" attribute="value">text</tagname>
```
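
To make the three parts of that structure concrete (tag name, attributes, text), here is a minimal sketch that uses Python's built-in `html.parser` module rather than the libraries introduced below. The `TagCollector` class name and the sample markup are illustrative assumptions, not part of the original walkthrough.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each piece of the markup as it is parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # the text between the opening and closing tag
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagCollector()
parser.feed('<a class="tag" href="/tag/change/page/1/">change</a>')
parser.close()

for event in parser.events:
    print(event)
```

The first event carries the tag name and its attributes, the second the inner text, and the third the closing tag.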

### Imports
`bs4` (Beautiful Soup) is a library that parses XML/HTML. `requests` is used to retrieve the content.

In [1]:
import requests
import bs4

### Set the base url
Set the URL of the website you want to extract information from. In this example I'm using a link to a website designed for people to learn web scraping. It contains quotes by famous people.

In [2]:
base_url = 'http://quotes.toscrape.com/'

### Make a GET request to the url

In [3]:
r = requests.get(base_url)
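
A request can also fail before any status code comes back, for example when the URL is malformed or the connection times out. The sketch below wraps the call in a hypothetical `fetch_html` helper. The helper name and the 10 second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper: the name and default timeout are illustrative.
    """
    try:
        # timeout stops the call from waiting indefinitely on a slow server
        response = requests.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # covers connection errors, timeouts and malformed URLs
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Calling `fetch_html(base_url)` would return the same bytes as `r.content`, while a failed request returns `None` instead of raising.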

### Check the status code
If the status code is 2xx we can proceed. A 4xx code means there is a problem with the request (for example, the resource is not available), and a 5xx code means there is a server-side error.

In [4]:
r.status_code

200
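
The status code ranges described above can be sketched as a small lookup. The `status_category` function is a hypothetical helper written for this example. It uses only the standard library's `http.HTTPStatus` to look up the standard phrase for a code.

```python
from http import HTTPStatus


def status_category(code):
    """Map an HTTP status code to the broad class it belongs to."""
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

With `requests`, a similar check is available directly on the response: `r.ok` is `True` for codes below 400, and `r.raise_for_status()` raises an exception for 4xx and 5xx codes.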

### Create a BeautifulSoup object
Store the content from the request in a `BeautifulSoup` object.

In [5]:
soup = bs4.BeautifulSoup(r.content, 'lxml')
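
As a small offline illustration of what the soup object gives you, the sketch below parses a hard-coded HTML string instead of a live response. The sample markup is made up for this example, and it uses Python's built-in `'html.parser'` rather than `'lxml'` so that no extra parser package is assumed.

```python
import bs4

# Made-up markup standing in for r.content
html = (
    "<html><head><title>Quotes to Scrape</title></head>"
    "<body><p>Hello</p></body></html>"
)

# 'html.parser' ships with Python; 'lxml' needs the lxml package installed
soup = bs4.BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first tag with that name
page_title = soup.title.text
first_paragraph = soup.p.text

print(page_title)
print(first_paragraph)
```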

### (Optional) Preview the soup
This step is optional. It is usually preferable to inspect the page in the browser using the element inspector, as it offers a high level of interactivity.

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

### Store all divs with class of 'quote'
The `findAll()` method returns a result set of every tag that matches the tag name and the attribute:value filter you pass in (current versions of `bs4` name this method `find_all()`, with `findAll()` kept as an older alias). In this case the `div` tag was targeted, filtered to divs with a class of `quote`. Web developers often use classes to apply styling across many similar elements, so classes are a convenient way to target lists of items. In this scenario the list is a list of quotes.

In [7]:
quotes = soup.findAll('div', {'class':'quote'})

In [20]:
type(quotes)

bs4.element.ResultSet
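
The class filter can be written in a few equivalent ways. The sketch below runs them against a made-up HTML snippet (not the real page) to show that a tag carrying several classes still matches a filter on just one of them.

```python
import bs4

# Made-up snippet: two quote divs (one with an extra class) and one unrelated div
html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote featured"><span class="text">Second quote</span></div>
<div class="footer">Not a quote</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Three ways to ask for divs whose class list contains 'quote'
by_dict = soup.find_all("div", {"class": "quote"})
by_keyword = soup.find_all("div", class_="quote")
by_selector = soup.select("div.quote")

texts = [div.span.text for div in by_dict]
print(len(by_dict), len(by_keyword), len(by_selector))
print(texts)
```

Which form to use is mostly a matter of taste. The dictionary form used in this walkthrough also works for attribute names that are not valid Python keyword arguments.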

### Found 10

In [8]:
len(quotes)

10

### Inspect one 'quote'
In this quote, represented by the `div` with a `class` attribute of `quote`, the quote text is stored in a `<span>` tag with an `itemprop` attribute value of `text`. The author name is stored in a `<small>` tag with an `itemprop` attribute value of `author`. The tags are conveniently stored as a comma-separated string in the `content` attribute of the `<meta>` tag with a `class` of `keywords`.

In [9]:
quotes[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

In [21]:
type(quotes[0])

bs4.element.Tag

### Extracting data from tags
Tag-level data can be pulled out with `.find()`, which is similar to `.findAll()` except that it returns the first match instead of a result set. The inner text of a tag can be retrieved using the `.text` property. The value of an attribute can be retrieved using the `.get()` method on a tag.

In [10]:
quotes[0].find('small', {'itemprop': 'author'})

<small class="author" itemprop="author">Albert Einstein</small>

In [11]:
quotes[0].find('small', {'itemprop': 'author'}).text.strip()

'Albert Einstein'

In [12]:
quotes[0].find('small', {'itemprop': 'author'}).get('class')

['author']
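
One thing to watch for: `.find()` returns `None` when nothing matches, so chaining `.text` onto it raises an `AttributeError`. The sketch below shows one way to guard against that. The `text_or_none` helper and the sample markup are assumptions made for this example.

```python
import bs4

# Made-up quote with an author but no keywords <meta> tag
html = """
<div class="quote">
  <span class="text" itemprop="text">Example quote</span>
  <small class="author" itemprop="author"> Example Author </small>
</div>
"""
quote = bs4.BeautifulSoup(html, "html.parser").find("div", {"class": "quote"})


def text_or_none(parent, name, attrs):
    """Hypothetical helper: stripped text of the first match, or None."""
    tag = parent.find(name, attrs)
    return tag.text.strip() if tag is not None else None


author = text_or_none(quote, "small", {"itemprop": "author"})
missing = text_or_none(quote, "meta", {"class": "keywords"})

author_tag = quote.find("small", {"itemprop": "author"})
classes = author_tag.get("class")             # class is multi-valued, so a list
href = author_tag.get("href")                 # absent attribute gives None
href_or_empty = author_tag.get("href", "")    # or a default of your choice

print(author, missing, classes, href, repr(href_or_empty))
```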

### Storing Data

In [13]:
import pandas
from datetime import datetime

### Create empty pandas DataFrame

In [14]:
data = pandas.DataFrame()

### Iterate through quotes result set
A few things are happening here:
- Iterate through `quotes` with `enumerate`. This gives the index as well as the item. The index is used to set the row index in the dataframe.
- Store 4 pieces of information from each quote.
- Set the dataframe cells according to those pieces of information.
- Finally, after the loop is finished, set a `timestamp` and a `base_url` column. This is good practice, so you can trace back when and where you got the data.

In [16]:
for index, q in enumerate(quotes):
    # Pull the four pieces of information out of the current quote tag
    quote_text = q.find('span', {'itemprop': 'text'}).text.strip()
    quote_author = q.find('small', {'itemprop': 'author'}).text.strip()
    quote_tags = q.find('meta', {'class': 'keywords'}).get('content')
    # The first <a> inside a quote is the link to the author page
    author_link = q.find('a').get('href')

    # Write the values into row `index` of the dataframe
    data.loc[index, 'quote_text'] = quote_text
    data.loc[index, 'quote_author'] = quote_author
    data.loc[index, 'quote_tags'] = quote_tags
    data.loc[index, 'author_link'] = author_link

In [17]:
data['base_url'] = base_url
data['timestamp'] = datetime.timestamp(datetime.now())
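
An alternative to filling the dataframe cell by cell is to collect one dictionary per quote and build the dataframe once at the end. The sketch below does this on a made-up two-quote snippet, so it runs without a network connection. The sample markup and the choice of an ISO 8601 UTC timestamp (instead of the epoch float used above) are assumptions for illustration.

```python
from datetime import datetime, timezone

import bs4
import pandas

# Made-up markup that mirrors the structure of the real quote divs
html = """
<div class="quote">
  <span class="text" itemprop="text">Quote one</span>
  <span>by <small class="author" itemprop="author">Author One</small>
  <a href="/author/Author-One">(about)</a></span>
  <div class="tags"><meta class="keywords" content="a,b" itemprop="keywords"/></div>
</div>
<div class="quote">
  <span class="text" itemprop="text">Quote two</span>
  <span>by <small class="author" itemprop="author">Author Two</small>
  <a href="/author/Author-Two">(about)</a></span>
  <div class="tags"><meta class="keywords" content="c" itemprop="keywords"/></div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
for q in soup.find_all("div", {"class": "quote"}):
    # One dictionary per quote; keys become the dataframe columns
    rows.append({
        "quote_text": q.find("span", {"itemprop": "text"}).text.strip(),
        "quote_author": q.find("small", {"itemprop": "author"}).text.strip(),
        "quote_tags": q.find("meta", {"class": "keywords"}).get("content"),
        "author_link": q.find("a").get("href"),
    })

data = pandas.DataFrame(rows)
data["base_url"] = "http://quotes.toscrape.com/"
# Human-readable, timezone-aware alternative to an epoch float
data["timestamp"] = datetime.now(timezone.utc).isoformat()

print(data)
```

Building the dataframe in one step avoids growing it row by row, which tends to matter more as the number of scraped items increases.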

### Preview the data

In [18]:
data

| | quote_text | quote_author | quote_tags | author_link | base_url | timestamp |
|---|---|---|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change,deep-thoughts,thinking,world | /author/Albert-Einstein | http://quotes.toscrape.com/ | 1559013000.0 |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities,choices | /author/J-K-Rowling | http://quotes.toscrape.com/ | 1559013000.0 |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational,life,live,miracle,miracles | /author/Albert-Einstein | http://quotes.toscrape.com/ | 1559013000.0 |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy,books,classic,humor | /author/Jane-Austen | http://quotes.toscrape.com/ | 1559013000.0 |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself,inspirational | /author/Marilyn-Monroe | http://quotes.toscrape.com/ | 1559013000.0 |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | adulthood,success,value | /author/Albert-Einstein | http://quotes.toscrape.com/ | 1559013000.0 |
| 6 | “It is better to be hated for what you are tha... | André Gide | life,love | /author/Andre-Gide | http://quotes.toscrape.com/ | 1559013000.0 |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | edison,failure,inspirational,paraphrased | /author/Thomas-A-Edison | http://quotes.toscrape.com/ | 1559013000.0 |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | misattributed-eleanor-roosevelt | /author/Eleanor-Roosevelt | http://quotes.toscrape.com/ | 1559013000.0 |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | humor,obvious,simile | /author/Steve-Martin | http://quotes.toscrape.com/ | 1559013000.0 |
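
The `author_link` values are relative paths. If absolute URLs are needed later (for example to visit each author page), one option is `urljoin` from the standard library. This is a minimal sketch using two of the links from the preview, not a step from the original walkthrough.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com/"
author_links = ["/author/Albert-Einstein", "/author/J-K-Rowling"]

# urljoin resolves each relative path against the base URL
absolute_links = [urljoin(base_url, link) for link in author_links]

for link in absolute_links:
    print(link)
```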


### Save data (optional)

In [19]:
data.to_csv('../outputs/quotes.csv', index = True)
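
With `index=True` the row index is written as an unnamed first column, so reading the file back with default settings produces an extra `Unnamed: 0` column. The sketch below shows this using an in-memory buffer in place of a file. The two-row dataframe is made up for the example.

```python
import io

import pandas

# Made-up stand-in for the scraped dataframe
data = pandas.DataFrame({
    "quote_author": ["Albert Einstein", "Jane Austen"],
    "quote_tags": ["change,world", "books,humor"],
})

# Write to an in-memory buffer instead of '../outputs/quotes.csv'
buffer = io.StringIO()
data.to_csv(buffer, index=True)

# Default read: the saved index comes back as a column named 'Unnamed: 0'
buffer.seek(0)
naive = pandas.read_csv(buffer)

# Telling pandas that the first column is the index restores the original shape
buffer.seek(0)
restored = pandas.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the row index carries no meaning.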