# Beautiful Soup: Philadelphia Health Inspections

Let's scrape data for restaurant inspections using the searchable database maintained by the Philadelphia Inquirer, available at: [https://data.inquirer.com/inspections/](https://data.inquirer.com/inspections/)

![](imgs/clean-plates.png)

### Getting the HTML content

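We'll use the "requests" package to request the content of the website and load it into Python. Note that "requests" is a third-party package rather than part of the Python standard library, so it must be installed in your environment.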

In [26]:
import requests
import pandas as pd

Use a "get" request to get the content:

In [27]:
url = "https://data.inquirer.com/inspections/"
r = requests.get(url)

In [28]:
type(r)

requests.models.Response

In [29]:
r.status_code

200
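A status code of 200 means the request succeeded. As a minimal sketch, the request can be wrapped in a small helper that fails loudly on error codes instead of silently returning an error page. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes at `url`, raising on HTTP error codes.

    `fetch_html` and the default timeout are hypothetical choices.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("https://data.inquirer.com/inspections/")` would then return the same bytes as `r.content` above, or raise an exception if the server reported an error.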

### BeautifulSoup makes this much more manageable

[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) makes it much easier to extract different parts of a web page.

In [30]:
from bs4 import BeautifulSoup

Initialize the "soup" object, using the content of our get request:

In [31]:
soup = BeautifulSoup(r.content, 'html.parser')
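`BeautifulSoup` accepts a plain string as well as the bytes returned by a request, which makes it easy to experiment offline. The sketch below parses a tiny stand-in page (made up for illustration, reusing the page title shown later in this section) and reads its title; the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page (made up for illustration)
html = (
    "<html><head>"
    "<title>Clean Plates | The Philadelphia Inquirer</title>"
    "</head><body></body></html>"
)

demo_soup = BeautifulSoup(html, "html.parser")

# The first <title> tag is reachable as an attribute of the soup object
page_title = demo_soup.title.string
print(page_title)
```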

### Making the HTML "pretty"

In [32]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="initial-scale=1.0, maximum-scale=1.0, user-scalable=1.0" name="viewport"/>
  <title>
   Clean Plates | The Philadelphia Inquirer
  </title>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="Philadelphia Inquirer" name="publication" property="og:site_name">
   <meta content="noindex" name="robots">
    <meta content="summary" name="twitter:card"/>
    <meta content="@phillyinquirer" name="twitter:site"/>
    <meta content="Clean Plates | The Philadelphia Inquirer" name="title">
     <meta content="Clean Plates | The Philadelphia Inquirer" name="twitter:title"/>
     <meta content="Clean Plates | The Philadelphia Inquirer" name="contenttitle" property="og:title">
      <meta content="Website" name="contenttype">
       <meta content="website" property="o

### How to extract the content we want?

**Two important functions**

1. `soup.select_one(selector)`: finds the first element matching the selector query and returns **one** element
1. `soup.select(selector)`: finds **all** elements matching the selector and returns them as a list

**Recommended reading:** the note on Beautiful Soup and CSS selectors in [this week's repository](https://github.com/MUSA-550-Fall-2023/week-6/blob/master/css-selectors.md)
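To make the difference between the two functions concrete, here is a small sketch on a made-up HTML snippet (the `item` class and the variable names are assumptions for illustration). It also shows what each function returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first = demo_soup.select_one(".item")        # a single Tag (the first match)
every = demo_soup.select(".item")            # a list of all matching Tags
missing_one = demo_soup.select_one(".nope")  # None when nothing matches
missing_all = demo_soup.select(".nope")      # an empty list when nothing matches

print(first.text, len(every), missing_one, len(missing_all))
```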

### To the Web Inspector!

We can use the web inspector to understand the structure of the website and identify the HTML tags that we want to extract content from.

### Let's select the first row

Web browsers will let us copy the CSS selector for individual elements.

Use: **Right Click > Copy > Copy Selector**

In [33]:
selector = "#inspection_unit_0"

In [34]:
# Select the first row
# NOTE: we are using "select_one()" to select only one matching element
first_row = soup.select_one(selector)

In [35]:
first_row

<div class="inspectionUnit inspectionUnitEven transitionBackground" id="inspection_unit_0"><a href="https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124"><div class="inspectionUnitInner"><div class="inspectionNameWrapper"><div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div><div class="inspectionUnitDate"><span class="inspectionUnitDateTitle">Inspection date:</span> Apr 18, 2024</div><div class="clearAll"></div></div><div class="inspectionUnitInfoWrapper"><div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div><div class="inspectionUnitNeigborhood"></div><div class="clearAll"></div></div><div class="inspectionUnitCountWrapper"><span class="inspectionCountLabel">Violations</span><li class="inspectionUnitCount inspectionUnitCountFoodborne inspectionUnitCountFirst"><span class="inspectionCountNumber">5</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfo

### But we need all of the rows!

When you use **Copy > Copy Selector**, the copied CSS selector will match only the specific element you've highlighted, and no others!

#### Generalizing your selectors

We need to **generalize the selector** so that it matches every row in the table, not just the first one. To do this, we'll need to go back to the web inspector and understand the structure of the website.


When trying to identify a general selector, try to look for common patterns, like shared class names or id strings, across the tags you want to extract.


In our case, it looks like the "inspectionUnit" class is shared across all of the row div elements.

![](imgs/clean-plates-2.png)

In [36]:
# Get all tags with the inspectionUnit class name
# Note we are using select() to select ALL elements
rows = soup.select('.inspectionUnit')

In [37]:
len(rows)

50
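The contrast between the copied selector and the generalized one can be sketched on a simplified stand-in for the page. The three empty rows below are made up; only the `inspectionUnit` class and the `inspection_unit_N` id pattern come from the real markup. The attribute-prefix selector is shown as one possible alternative, not something the original notebook uses.

```python
from bs4 import BeautifulSoup

# Simplified, made-up rows that mimic the id/class pattern on the page
html = """
<div class="inspectionUnit" id="inspection_unit_0"></div>
<div class="inspectionUnit" id="inspection_unit_1"></div>
<div class="inspectionUnit" id="inspection_unit_2"></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# An id selector matches only the one element with that id
by_id = demo_soup.select("#inspection_unit_0")

# A class selector matches every element sharing the class
by_class = demo_soup.select(".inspectionUnit")

# Alternative: match any div whose id starts with a common prefix
by_id_prefix = demo_soup.select('div[id^="inspection_unit_"]')

print(len(by_id), len(by_class), len(by_id_prefix))
```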

In [38]:
# get the first row
row = rows[0]

print(row.prettify())

<div class="inspectionUnit inspectionUnitEven transitionBackground" id="inspection_unit_0">
 <a href="https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124">
  <div class="inspectionUnitInner">
   <div class="inspectionNameWrapper">
    <div class="inspectionUnitName transitionAll">
     Bamba and Saduty Produce Market
    </div>
    <div class="inspectionUnitDate">
     <span class="inspectionUnitDateTitle">
      Inspection date:
     </span>
     Apr 18, 2024
    </div>
    <div class="clearAll">
    </div>
   </div>
   <div class="inspectionUnitInfoWrapper">
    <div class="inspectionUnitAddress">
     4603 FRANKFORD AVE 19124
    </div>
    <div class="inspectionUnitNeigborhood">
    </div>
    <div class="clearAll">
    </div>
   </div>
   <div class="inspectionUnitCountWrapper">
    <span class="inspectionCountLabel">
     Violations
    </span>
    <li class="inspectionUnitCount inspectionUnitCountFoodborne i

### Now, let's extract out the content from each row

We'll look for the following items:

1. The link to the full inspection report
1. The name of the restaurant
1. The restaurant address
1. The number of *food-borne violations*


#### 1. The report link

The link is stored as the "href" attribute of the first "a" element:


In [39]:
a = row.select_one("a")

a

<a href="https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124"><div class="inspectionUnitInner"><div class="inspectionNameWrapper"><div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div><div class="inspectionUnitDate"><span class="inspectionUnitDateTitle">Inspection date:</span> Apr 18, 2024</div><div class="clearAll"></div></div><div class="inspectionUnitInfoWrapper"><div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div><div class="inspectionUnitNeigborhood"></div><div class="clearAll"></div></div><div class="inspectionUnitCountWrapper"><span class="inspectionCountLabel">Violations</span><li class="inspectionUnitCount inspectionUnitCountFoodborne inspectionUnitCountFirst"><span class="inspectionCountNumber">5</span><span class="inspectionUnitInfoItemTitle"><span class="inspectionUnitInfoItemTitleLabel">Foodborne Illness Risk Factors</span></span></li><li class="inspectionUnitC

Attributes can be extracted from the "attrs" attribute, which is a dictionary:

In [40]:
a.attrs

{'href': 'https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124'}

In [41]:
link = a.attrs['href']

link

'https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124'
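Besides going through `attrs`, a tag can be indexed directly like a dictionary, and `.get()` returns `None` instead of raising an error when an attribute is absent. A minimal sketch on a made-up link (the link text and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Made-up anchor tag for illustration
html = '<a href="https://data.inquirer.com/inspections/">Clean Plates</a>'
a_tag = BeautifulSoup(html, "html.parser").select_one("a")

href_from_attrs = a_tag.attrs["href"]  # via the attrs dictionary
href_from_index = a_tag["href"]        # equivalent shorthand
missing = a_tag.get("title")           # None, since there is no title attribute

print(href_from_index, missing)
```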

#### 2. The restaurant name

Use the "inspectionUnitName" class name to identify the right element.

![](imgs/clean-plates-name.png)

In [42]:
# Use the . to specify class name
name_tag = row.select_one(".inspectionUnitName")

name_tag

<div class="inspectionUnitName transitionAll">Bamba and Saduty Produce Market</div>

In [43]:
name = name_tag.text

name

'Bamba and Saduty Produce Market'

#### 3. The restaurant address

Use the "inspectionUnitAddress" class name to identify the right element.

![](imgs/clean-plates-address.png)

In [44]:
# Use the . to specify class name
addr_tag = row.select_one(".inspectionUnitAddress")

addr_tag

<div class="inspectionUnitAddress">4603 FRANKFORD AVE 19124</div>

In [45]:
address = addr_tag.text

address

'4603 FRANKFORD AVE 19124'
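The `.text` attribute returns a tag's text exactly as it appears in the HTML, including any surrounding whitespace. The text on this page happens to be clean, but if a page pads its tags with newlines or spaces, `get_text(strip=True)` is one way to trim it. A sketch on a made-up, deliberately padded version of the address tag:

```python
from bs4 import BeautifulSoup

# Made-up variant of the address tag with extra whitespace
html = """
<div class="inspectionUnitAddress">
   4603 FRANKFORD AVE 19124
</div>
"""
addr = BeautifulSoup(html, "html.parser").select_one(".inspectionUnitAddress")

raw_text = addr.text                     # keeps the newlines and spaces
clean_text = addr.get_text(strip=True)   # leading/trailing whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```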

#### 4. The number of food-borne violations

It looks like the count number is within an element with class "inspectionCountNumber". However, this class appears on the retail violations element as well as the food-borne violations element, so selecting on it alone would be ambiguous. We'll need to use *nested* selectors.

The selector below first matches elements with the "inspectionUnitCountFoodborne" class name and then, inside those, elements with the "inspectionCountNumber" class name.

![](imgs/clean-plates-violations.png)

In [46]:
# The number of foodborne violations
count = row.select_one(".inspectionUnitCountFoodborne .inspectionCountNumber")

int(count.text)

5

If the violations count is zero, there won't be any element that matches the above selector (the website instead uses an "inspectionUnitCountZero" class).

If the element doesn't exist, the `select_one()` function will return `None`, so calling `.text` on the result would raise an error.
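One way to handle the missing element is a small helper that treats `None` as a count of zero. This is a sketch under assumptions: the helper name `foodborne_count` is hypothetical, and the markup of the zero-count row below is a guess at how the "inspectionUnitCountZero" class is used, since that part of the page is not shown here.

```python
from bs4 import BeautifulSoup


def foodborne_count(row):
    """Return the food-borne violation count for a row, or 0 if absent."""
    count_tag = row.select_one(
        ".inspectionUnitCountFoodborne .inspectionCountNumber"
    )
    if count_tag is None:
        # No matching element: treat as zero violations
        return 0
    return int(count_tag.text)


# Simplified row with five food-borne violations
with_violations = BeautifulSoup(
    '<div class="inspectionUnit">'
    '<li class="inspectionUnitCount inspectionUnitCountFoodborne">'
    '<span class="inspectionCountNumber">5</span></li></div>',
    "html.parser",
)

# ASSUMED markup for a zero-count row (not taken from the real page)
no_violations = BeautifulSoup(
    '<div class="inspectionUnit">'
    '<li class="inspectionUnitCount inspectionUnitCountZero">'
    '<span class="inspectionCountNumber">0</span></li></div>',
    "html.parser",
)

print(foodborne_count(with_violations), foodborne_count(no_violations))
```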

### Putting it all together

Now, we can put this code into a for loop and extract out the content from every row on the page:

In [47]:
# Store the data from each row
data = []

# Step 1: Get all rows
rows = soup.select(".inspectionUnit")

# Loop over all rows
for this_row in rows:
    
    # Step 2: Get the report link
    # Note: we are using the "this_row" variable from the for loop
    a = this_row.select_one("a")
    url = a.attrs["href"]

    # Step 3: Get the name
    name_tag = this_row.select_one(".inspectionUnitName")
    name = name_tag.text

    # Step 4: Get the address
    addr_tag = this_row.select_one(".inspectionUnitAddress")
    address = addr_tag.text

    # Step 5: Get the violation count
    count_tag = this_row.select_one(".inspectionUnitCountFoodborne .inspectionCountNumber")

    # If there were no matches (None was returned), it means the count was zero
    if count_tag is None:
        count = 0
    else:
        count = int(count_tag.text)

    # Step 6: Save it
    data.append(
        {
            "name": name,
            "address": address,
            "foodborne_count": count,
            "url": url,
        }
    )
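The same logic can also be packaged as a function that parses a single row, which makes it possible to test the extraction on a small HTML snippet without hitting the website. This is only a sketch: `parse_row`, the two made-up rows, and the `example.com` links are assumptions for illustration, while the class names are the ones used above.

```python
from bs4 import BeautifulSoup


def parse_row(row):
    """Extract the fields of interest from one .inspectionUnit element."""
    count_tag = row.select_one(
        ".inspectionUnitCountFoodborne .inspectionCountNumber"
    )
    return {
        "name": row.select_one(".inspectionUnitName").get_text(strip=True),
        "address": row.select_one(".inspectionUnitAddress").get_text(strip=True),
        # A missing count element means zero food-borne violations
        "foodborne_count": 0 if count_tag is None else int(count_tag.text),
        "url": row.select_one("a")["href"],
    }


# Two made-up rows: the second has no food-borne count element
html = """
<div class="inspectionUnit" id="inspection_unit_0">
 <a href="https://example.com/a">
  <div class="inspectionUnitName">Cafe A</div>
  <div class="inspectionUnitAddress">1 MAIN ST 19100</div>
  <li class="inspectionUnitCount inspectionUnitCountFoodborne">
   <span class="inspectionCountNumber">2</span>
  </li>
 </a>
</div>
<div class="inspectionUnit" id="inspection_unit_1">
 <a href="https://example.com/b">
  <div class="inspectionUnitName">Cafe B</div>
  <div class="inspectionUnitAddress">2 MAIN ST 19100</div>
 </a>
</div>
"""
demo_rows = BeautifulSoup(html, "html.parser").select(".inspectionUnit")
demo_data = [parse_row(row) for row in demo_rows]
print(demo_data)
```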

In [49]:
data

[{'name': 'Bamba and Saduty Produce Market',
  'address': '4603 FRANKFORD AVE 19124',
  'foodborne_count': 5,
  'url': 'https://data.inquirer.com/inspections/philly/?detail=Bamba%20and%20Saduty%20Produce%20Market|4603%20FRANKFORD%20AVE%2019124'},
 {'name': 'DiBruno Brothers',
  'address': '1730 CHESTNUT ST 19103',
  'foodborne_count': 5,
  'url': 'https://data.inquirer.com/inspections/philly/?detail=DiBruno%20Brothers|1730%20CHESTNUT%20ST%2019103'},
 {'name': 'Carangi Baking Company',
  'address': '2655 S ISEMINGER ST 19148',
  'foodborne_count': 3,
  'url': 'https://data.inquirer.com/inspections/philly/?detail=Carangi%20Baking%20Company|2655%20S%20ISEMINGER%20ST%2019148'},
 {'name': 'Bright Horizons at Philadelphia Cathedral Learning Center',
  'address': '23 S 38TH ST 19104',
  'foodborne_count': 2,
  'url': 'https://data.inquirer.com/inspections/philly/?detail=Bright%20Horizons%20at%20Philadelphia%20Cathedral%20Learning%20Center|23%20S%2038TH%20ST%2019104'},
 {'name': 'Buddy Buddy B

In [50]:
# Make a dataframe
scraped_df = pd.DataFrame(data)

Sort by violation count:

In [51]:
scraped_df.sort_values("foodborne_count", ascending=False, ignore_index=True)

| | name | address | foodborne_count | url |
|---|---|---|---|---|
| 0 | Bamba and Saduty Produce Market | 4603 FRANKFORD AVE 19124 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 1 | Fado Pub | 1500 LOCUST ST 19102 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 2 | David Gvinianidze Arts and Music Center | 716 RED LION RD 19115 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 3 | DiBruno Brothers | 1730 CHESTNUT ST 19103 | 5 | https://data.inquirer.com/inspections/philly/?... |
| 4 | Hayashi Sushi & Poke | 814 S 47TH ST 19143 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 5 | Asaad Halal Gyro/MC Crepes/V03396 | 7300 E ROOSEVELT BLVD 19149 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 6 | A & S Deli | 2848 S 17TH ST 19145 | 4 | https://data.inquirer.com/inspections/philly/?... |
| 7 | Carangi Baking Company | 2655 S ISEMINGER ST 19148 | 3 | https://data.inquirer.com/inspections/philly/?... |
| 8 | Espinal and Ramos Grocery | 2000 MEDARY AVE 19138 | 2 | https://data.inquirer.com/inspections/philly/?... |
| 9 | DiBruno Brothers Events & Catering | 435 FAIRMOUNT AVE 19123 | 2 | https://data.inquirer.com/inspections/philly/?... |