# Web Scraping with BeautifulSoup

Web scraping is a technique for extracting data from web pages programmatically. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for parsing HTML and XML, designed for quick-turnaround projects like screen-scraping.

## What are we scraping?
The **Virginia Dangerous Dog Registry** is a **public**, searchable online database of dogs declared dangerous by local courts. It is also the mechanism through which local animal control officers must report dangerous dogs to the Virginia Department of Agriculture and Consumer Services.

Source: https://dd.va-vdacs.com/

## Table of Contents:
1. [Install and import dependencies](#Install-and-Import-dependencies)
2. [Make a POST request](#Make-a-POST-request)
3. [Extracting Content from HTML](#Extracting-Content-from-HTML)
4. [Export to a CSV file](#Export-to-a-CSV-file)

## Install and import dependencies

Install these three packages with pip (pandas is needed for the data frame used later).

```console
$ pip install requests
$ pip install beautifulsoup4
$ pip install pandas
```

Import the dependencies:
1. HTTP requests: *requests*
2. HTML parsing: *BeautifulSoup* (from the `bs4` package)
3. Regular expressions: *re*
4. CSV writing: *csv*
5. Data frames: *pandas* (imported as `pd`)

In [1]:
# import dependencies
import requests
from bs4 import BeautifulSoup
import re
import csv
import pandas as pd

## Make a POST request
Send a POST request whose form fields are left empty, so that the search returns the entire result set.

In [2]:
# compose the post data that returns the entire search result
postData = {'SYSJURISNO': 0,
            'DD_TAG_NO': '', 
            'NAME': '', 
            'HOME_ADDRESS': '',
            'HOME_ZIP': ''}

# make a POST request
r = requests.post('https://dd.va-vdacs.com/Public/PublicSearch', data=postData)
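
When `data` is a dict, `requests` form-encodes it into the request body. As a rough illustration of what the server should receive, the sketch below encodes the same fields with the standard library's `urllib.parse.urlencode`. The names `post_data` and `form_body` are only for this example, and the idea that empty fields match every record is an assumption taken from the comment in the cell above.

```python
from urllib.parse import urlencode

# same fields as the notebook's postData; empty strings leave each filter blank
post_data = {'SYSJURISNO': 0,
             'DD_TAG_NO': '',
             'NAME': '',
             'HOME_ADDRESS': '',
             'HOME_ZIP': ''}

# encode the dict as an application/x-www-form-urlencoded body
form_body = urlencode(post_data)
print(form_body)
# SYSJURISNO=0&DD_TAG_NO=&NAME=&HOME_ADDRESS=&HOME_ZIP=
```

Printing the encoded body is a cheap way to compare the request against what the browser sends when the search form is submitted by hand.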

Inspecting the Response

In [3]:
# inspecting status code
print(r.status_code)

200
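
A status of 200 means the request succeeded. A small helper like the one below (hypothetical, not part of the original notebook) could turn a status code into a readable label using the standard library's `http.HTTPStatus`. `requests` also offers `r.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a (label, ok) pair for a standard HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know
    label = f"{code} {HTTPStatus(code).phrase}"
    # 2xx codes indicate success
    ok = 200 <= code < 300
    return label, ok

print(describe_status(200))  # ('200 OK', True)
print(describe_status(404))  # ('404 Not Found', False)
```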


In [4]:
# inspecting the response
# print(r.text)

## Extracting Content from HTML
Note: The search returns 548 dogs plus one header row, so the table has 549 rows in total.

In [5]:
# Parse HTML DOM structure
soup = BeautifulSoup(r.text, "html.parser")

# entire search result
table = soup.table

# list of rows (find_all is the current name for the older findAll alias)
rows = table.find_all('tr')
len(rows)

549

### Data Cleaning
- Remove the details column (a hyperlink to a details page)
- Create an empty pandas DataFrame with column headers

In [6]:
# pandas column header
col_header = ['locality', 'dog_name', 'dangerous_dog_tag', 'address', 'city', 'zip']

# Create pandas dataframe to host dogs info
dogsDf = pd.DataFrame(columns=col_header)

# remove header row
rows = rows[1:]

Print a sample row

In [7]:
# print first row
print(rows[0])

<tr>
<td>Accomack County</td>
<td>Bentley</td>
<td>0156</td>
<td>22094 Sawyer Dr
                <td>Greenbush
                <td>23357
                </td>
<td class="text-center">
<a data-ajax="true" data-ajax-mode="replace" data-ajax-success="ShowDetails" data-ajax-update="#dvDetails" href="/Public/Details/1432" style="text-decoration:none;padding-left:15px;padding-right:15px;">Details</a>
</td>
</td></td></tr>


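**Warning:** The address column is malformed. The `<td>` tags for address and city are never closed, so the parser nests the city and zip cells inside the address cell.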

In [8]:
# extract dog information from result
dogs = []
for row in rows:
    locality = row.find('td')
    dogName = locality.find_next_sibling('td')
    dogTag = dogName.find_next_sibling('td')
    full_address = dogTag.find_next_sibling('td')

    # traverse the nested full_address, which has two parts: a text node and a <td>
    # part 1: address (text node)
    address = full_address.contents[0].strip()

    # part 2: nested <td> holding city, zip, and details
    city_zip_detail = full_address.contents[1]

    # extract city information (text node)
    city = city_zip_detail.contents[0].strip()

    # extract zip code (first nested <td>; the details <td> follows it)
    zip_addr = city_zip_detail.contents[1].contents[0].strip()

    # compose dog and collect it
    dogs.append([locality.string, dogName.string, dogTag.string,
                 address, city, zip_addr])

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
dogsDf = pd.DataFrame(dogs, columns=col_header)

dogsDf.head()

|   | locality | dog_name | dangerous_dog_tag | address | city | zip |
|---|----------|----------|-------------------|---------|------|-----|
| 0 | Accomack County | Bentley | 0156 | 22094 Sawyer Dr | Greenbush | 23357 |
| 1 | Albemarle County | Jaxson Luis Alvarez | 1538 | 580 Radford Lane #207 | Charlottesville | 22903 |
| 2 | Albemarle County | Kujo | 2052 | 1398 Lonesome Mountain Hollow | Charlottesville | 22903 |
| 3 | Albemarle County | Blockhead | 1657 | 8095 Langhorn Road | Scottsville | 24590 |
| 4 | Albemarle County | Max | 2057 | 3558 Layton Drive | Charlottesville | 22903 |
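
The nested `<td>` traversal above depends on exact positions in `.contents`. One possible alternative, sketched here under the assumption that every row has the same shape as the sample row, is to collect all `<td>` elements recursively and keep only the text that belongs directly to each one. The helper `own_text` and the trimmed `html` string are made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# trimmed copy of the sample row: the address and city cells are never closed
html = (
    "<table><tr>"
    "<td>Accomack County</td><td>Bentley</td><td>0156</td>"
    "<td>22094 Sawyer Dr <td>Greenbush <td>23357 </td>"
    "<td class='text-center'><a href='/Public/Details/1432'>Details</a></td>"
    "</tr></table>"
)

def own_text(tag):
    """Join only the text nodes that are direct children of tag."""
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()

row = BeautifulSoup(html, "html.parser").find("tr")

# find_all searches recursively, so nested <td> elements are included in document order
cells = [own_text(td) for td in row.find_all("td")]

# the first six cells are the data columns; the last one is the Details link
dog = cells[:6]
print(dog)
```

Because each cell contributes only its own text, the city and zip values do not leak into the address even though their `<td>` elements are nested inside it.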


## Export to a CSV file
Use the pandas DataFrame `.to_csv()` method to write the results to a CSV file.

In [9]:
dogsDf.to_csv('./datasets/va_dangerous_dog.csv')
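
By default `.to_csv()` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Passing `index=False` omits it. As a pandas-free alternative, the `csv` module imported earlier could write the same rows. The sketch below is only an illustration: `rows_out` holds one sample record, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import csv
import io

col_header = ['locality', 'dog_name', 'dangerous_dog_tag', 'address', 'city', 'zip']

# one sample record; in the notebook this would be the list of scraped dogs
rows_out = [
    ['Accomack County', 'Bentley', '0156', '22094 Sawyer Dr', 'Greenbush', '23357'],
]

# StringIO stands in for open(path, 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(col_header)   # header row
writer.writerows(rows_out)    # data rows, no index column

csv_text = buffer.getvalue()
print(csv_text)

# reading back with the csv module keeps every field as text,
# so the leading zero in the tag number survives
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]['dangerous_dog_tag'])
```

When reading the file with pandas instead, `pd.read_csv(..., dtype=str)` would be one way to keep tag numbers and zip codes as text.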