# Web Scraping with BeautifulSoup

Web scraping is a technique for extracting data from web pages programmatically. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for parsing HTML and XML, designed for quick-turnaround projects like screen-scraping.

## What are we scraping?
The Virginia Dangerous Dog Registry is a **public**, searchable online database of dogs declared dangerous by local courts. It also serves as the mechanism by which local animal control officers must report dangerous dogs to the Virginia Department of Agriculture and Consumer Services.

Source: https://dd.va-vdacs.com/

## Table of Contents:
1. [Install and import dependencies](#Install-and-import-dependencies)
2. [Make a POST request](#Make-a-POST-request)
3. [Extracting Content from HTML](#Extracting-Content-from-HTML)
4. [Writing to a CSV](#Writing-to-a-CSV)

## Install and import dependencies

Install the two third-party packages with pip.

```console
$ pip install requests
$ pip install beautifulsoup4
```

Import the dependencies:
1. HTTP requests: *requests*
2. HTML parsing: *BeautifulSoup* (from the *bs4* package)
3. Regular expressions: *re*

In [81]:
# import dependencies
import requests
from bs4 import BeautifulSoup
import re

## Make a POST request
Send a POST request with form data to retrieve the search results. The form data below is composed so that the search returns every dog in the registry.

In [82]:
# compose the post data that returns the entire search result
postData = {'SYSJURISNO': 0,
            'DD_TAG_NO': '', 
            'NAME': '', 
            'HOME_ADDRESS': '',
            'HOME_ZIP': ''}

# make a POST request
r = requests.post('https://dd.va-vdacs.com/Public/PublicSearch', data=postData)

Inspecting the Response

In [83]:
# inspecting status code
print(r.status_code)

200


In [84]:
# inspecting the response
print(r.text)





    <p class="text-center" style="font-weight: bold;">Search results ( 548 dogs )</p>
    <table class="table table-condensed table-hover">
        <tr style="font-size: .95em;">
            <th>Locality</th>
            <th>Dog Name</th>
            <th>Dangerous Dog Tag #</th>
            <th>Address</th>
            <th>City</th>
            <th>Zip</th>
            <th class="text-center">Details</th>
        </tr>
            <tr>
                <td>Accomack County</td>
                <td>Bentley</td>
                <td>0156</td>
                <td>22094 Sawyer Dr
                <td>Greenbush
                <td>23357
                </td>
                <td class="text-center">
                    <a data-ajax="true" data-ajax-mode="replace" data-ajax-success="ShowDetails" data-ajax-update="#dvDetails" href="/Public/Details/1432" style="text-decoration:none;padding-left:15px;padding-right:15px;">Details</a>
                </td>
            </t

## Extracting Content from HTML
Note: The search returns 548 dogs, so the table has 549 rows once the header row is included.
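The count in the summary line of the response can be read with a regular expression and later compared against the number of parsed rows. The following is a minimal sketch, not part of the original notebook: `parse_result_count` and `COUNT_PATTERN` are hypothetical names, and the pattern assumes the summary line keeps the `Search results ( N dogs )` wording shown in the response.

```python
import re

# Assumed wording of the summary line, e.g. "Search results ( 548 dogs )".
COUNT_PATTERN = re.compile(r"Search results \(\s*(\d+)\s+dogs?\s*\)")


def parse_result_count(html):
    """Return the dog count announced in the summary line, or None if absent."""
    match = COUNT_PATTERN.search(html)
    return int(match.group(1)) if match else None


# Summary line copied from the response printed earlier.
sample = '<p class="text-center" style="font-weight: bold;">Search results ( 548 dogs )</p>'
print(parse_result_count(sample))  # 548
```

In the notebook, `parse_result_count(r.text)` should then equal `len(rows) - 1` if every dog row was parsed.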

In [107]:
# Parse HTML DOM structure
soup = BeautifulSoup(r.text, "html.parser")

# entire search result
table = soup.table

# list of rows
rows = table.find_all('tr')
len(rows)

549

In [109]:
# list of dogs without the details column
dogs = [
    ['Locality', 'Dog Name', 'Dangerous Dog Tag #', 'Address', 'City', 'Zip']
]

# remove header row
rows = rows[1:]

# extract dog information from result
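for row in rows:
    # find_all returns every <td> in document order, including nested ones
    tds = row.find_all('td')
    dog = []
    # keep the first six cells; the seventh is the Details link
    for td in tds[:6]:
        dog.append(td.text.strip())
    dogs.append(dog)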
    
dogs[:5]

[['Locality', 'Dog Name', 'Dangerous Dog Tag #', 'Address', 'City', 'Zip'],
 ['Albemarle County',
  'Jaxson Luis Alvarez',
  '1538',
  '580 Radford Lane #207\r\n                Charlottesville\r\n                22903\r\n                \n\nDetails',
  'Charlottesville\r\n                22903\r\n                \n\nDetails',
  '22903'],
 ['Albemarle County',
  'Kujo',
  '2052',
  '1398 Lonesome Mountain Hollow\r\n                Charlottesville\r\n                22903\r\n                \n\nDetails',
  'Charlottesville\r\n                22903\r\n                \n\nDetails',
  '22903'],
 ['Albemarle County',
  'Blockhead',
  '1657',
  '8095 Langhorn Road\r\n                Scottsville\r\n                24590\r\n                \n\nDetails',
  'Scottsville\r\n                24590\r\n                \n\nDetails',
  '24590'],
 ['Albemarle County',
  'Max',
  '2057',
  '3558 Layton Drive\r\n                Charlottesville\r\n                22903\r\n                \n\nDetails',
  'Ch

In [None]:
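To be continued:
    - The Address cell is an unclosed `<td>`, so the parser nests the City, Zip, and Details cells inside it and its text includes theirs
    - Remove the Details column text that leaks into the Address and City values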
## Writing to a CSV
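The original notebook leaves this section empty. As a minimal sketch of what it might contain, the list of rows can be written with Python's standard `csv` module. The function name `write_dogs_csv`, the file name `dangerous_dogs.csv`, and the temporary output directory are assumptions for illustration; the sample rows reuse values shown earlier and assume the nested-cell cleanup has already been done.

```python
import csv
import os
import tempfile


def write_dogs_csv(rows, path):
    """Write a list of rows (header first) to a CSV file at path."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Header plus one cleaned row taken from the response shown earlier.
dogs_sample = [
    ['Locality', 'Dog Name', 'Dangerous Dog Tag #', 'Address', 'City', 'Zip'],
    ['Accomack County', 'Bentley', '0156', '22094 Sawyer Dr', 'Greenbush', '23357'],
]

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "dangerous_dogs.csv")
write_dogs_csv(dogs_sample, out_path)
```

Every value is written as text, so a tag number such as `0156` keeps its leading zero in the file, although a spreadsheet application may drop it when opening the CSV.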