# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm#MID). Thank goodness we can search for these things.

## Preparation: Knowing your tags

These questions are the same for every data set, so some of them might not fit yours exactly.

**Search for every operator with 'dirt' in their name, including abandoned mines.**

### What is the tag and class name for every row of data?

In [1]:
# <tr></tr>

### What is the tag and class name for every mine operator's name?

In [2]:
# <font style="FONT-SIZE:.75em;"></font>

### What is the tag and class name for every mine's name?

In [3]:
# <font style="FONT-SIZE:.75em;"></font>

## Being lazy

If you only needed these results, what would you do instead of scraping them?

In [6]:
# Copy-paste them to MS Excel and save it as CSV

## Setup: Import what you'll need to scrape the page

Use `requests`, not `urllib`.

In [7]:
import requests
from bs4 import BeautifulSoup

url_mines = "https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp"
response_mines = requests.get(url_mines)
doc_mines = BeautifulSoup(response_mines.text, "html.parser")
doc_mines

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-- ****************************************** Begin META TAGS ********************************************* -->
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<!-- ****************************************** End META TAGS *********************************************** -->
<title>MSHA  - Mine  Data Retrieval System - Basic Mine Information Page</title>
<script src="/2010redesign/Scripts/federated-analytics.js" type="text/javascript"></script>
<script src="/2010redesign/Scripts/AC_RunActiveContent.js" type="text/javascript"></script>
<link href="/2010Redesign/includes/Print.css" media="print" rel="stylesheet" type="text/css"/>
<link href="/2010Redesign/Includes/MSHAwebnew.css" media="screen" rel="stylesheet" type="text/css">
<link href="/2010Redesign/includes/style-screen.css" media="screen" rel="stylesheet" type="tex

## Try to scrape the page

To test if you requested the page correctly, save the BeautifulSoup document as `doc` and run the code `doc.find_all('tr')[-1].text` to get the text of the last `<tr>` element.

- If the result starts with **Total Number of Mines Found**, you were successful.
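
The check above can be wrapped in a small helper. This is only a sketch: `looks_like_results` is a hypothetical name, and the two HTML strings are made-up stand-ins for the real responses, not the actual MSHA markup.

```python
from bs4 import BeautifulSoup


def looks_like_results(doc):
    """Return True if the last <tr> starts with the results footer text."""
    rows = doc.find_all("tr")
    if not rows:
        return False
    last_text = rows[-1].get_text(strip=True)
    return last_text.startswith("Total Number of Mines Found")


# Hypothetical sample pages: a bare search form and a results page
form_page = BeautifulSoup(
    "<table><tr><th>Operator Name or Mine Name Search</th></tr></table>",
    "html.parser",
)
results_page = BeautifulSoup(
    "<table><tr><td>Dirt Con</td></tr>"
    "<tr><td>Total Number of Mines Found: 19</td></tr></table>",
    "html.parser",
)

print(looks_like_results(form_page))
print(looks_like_results(results_page))
```

If the helper returns `False`, the request probably returned the search form rather than the results.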

In [8]:
doc_mines.find_all('tr')

[<tr>
 <td width="30%"><a href="/drs/drshome.htm"><img alt="Mine Data Retrieval System" border="0" height="75" src="/drs/images/drslogo.png" width="300"/></a></td>
 <th width="40%"><font style="FONT-SIZE:1.20em;">Operator Name or Mine Name<br/> Search</font></th>
 <td width="30%"> </td></tr>]

In [9]:
# The font size is set through the style attribute, not a class
doc_mines.find_all('font', attrs = {'style':'FONT-SIZE:.75em;'})

[]

In [10]:
doc_mines.find_all('tr')[-1].text

'\n\nOperator Name or Mine Name Search\n\xa0'

## Actually scraping

### Hopefully you know that each `tr` is supposed to be your data. What is the index of the first row element that is actually a result?

`.text` will help you here.

In [11]:
url_mines = "https://arlweb.msha.gov/drs/ASP/OprNameStatesearch.asp"
data = {
    'OperSearch':'dirt',
    'Abandoned':'No',
    'MineName':'',
    'StateSearch':'None',
    'CM':'All',
    'x':'19',
    'y':'5',
    'MC':'Opersearch'
}

response_mines = requests.post(url_mines, data=data)
doc_mines = BeautifulSoup(response_mines.text, "html.parser")
doc_mines

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-- ****************************************** Begin META TAGS ********************************************* -->
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<!-- ****************************************** End META TAGS *********************************************** -->
<title>MSHA  - Mine  Data Retrieval System - Basic Mine Information Page</title>
<script src="/2010redesign/Scripts/federated-analytics.js" type="text/javascript"></script>
<script src="/2010redesign/Scripts/AC_RunActiveContent.js" type="text/javascript"></script>
<link href="/2010Redesign/includes/Print.css" media="print" rel="stylesheet" type="text/css"/>
<link href="/2010Redesign/Includes/MSHAwebnew.css" media="screen" rel="stylesheet" type="text/css">
<link href="/2010Redesign/includes/style-screen.css" media="screen" rel="stylesheet" type="tex
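
When `requests.post` is given a dictionary as `data`, it sends it as a URL-encoded form body. The sketch below previews roughly what that body looks like for the search form. It uses `urllib.parse.urlencode` from the standard library for illustration only; it approximates the request, it is not a capture of it, and the scrape itself still goes through `requests`.

```python
from urllib.parse import urlencode

# Same form fields as the POST request above
data = {
    'OperSearch': 'dirt',
    'Abandoned': 'No',
    'MineName': '',
    'StateSearch': 'None',
    'CM': 'All',
    'x': '19',
    'y': '5',
    'MC': 'Opersearch',
}

# Encode the fields the way an HTML form submission would
body = urlencode(data)
print(body)
```

Comparing this string with the form data shown in the browser's network tab is a quick way to spot a misspelled or missing field.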

In [12]:
# To find which tr the data starts at, take a look
# at what a high-numbered tr looks like
doc_mines.find_all('tr')[10]

<tr>
<td align="center">
<form action="/drs/ASP/BasicMineInfostatecounty.asp" method="post" name="search">
<input name="MineId" type="hidden" value="4608254"/><font style="FONT-SIZE:.75em;">4608254</font>
</form></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT --><b>WV</b><!-- /DNT --> </font></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT -->Dirt Con<!-- /DNT -->  </font></td>
<td><font style="FONT-SIZE:.75em;"><!-- DNT -->Hog Lick Quarry<!-- /DNT --></font></td>
<td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->Surface             <!-- /DNT --></font></td>
<td align="center"><font style="FONT-SIZE:.75em;"><!-- DNT -->M<!-- /DNT --> </font></td>
<td><font style="FONT-SIZE:.75em;">Temporarily Idled  </font></td>
<td><font style="FONT-SIZE:.75em;">Crushed, Broken Limestone NEC  </font></td>
<th bgcolor="#000000"><input alt="More Information" border="0" name="submit" src="/drs/images/moreinfo.jpg" type="image"/></th></tr>

In [13]:
# trs containing data have an <input> with name='MineId'.
# Count rows with a for loop until the first tr
# containing <input name='MineId'> turns up.

# The list of all trs
trs = doc_mines.find_all('tr')

counter = 0
for tr in trs:
    # Stop at the first tr that holds an <input name='MineId'>:
    # that is the first row containing data.
    if tr.find('input', attrs = {'name':'MineId'}):
        break
    # Until the loop breaks, add 1. The value of counter at the break
    # is the index in trs where the data rows start.
    counter += 1
    
print("Results start from the", counter, "th element of the list.")

Results start from the 7 th element of the list.
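
The counter loop can also be written with `enumerate` and `next`. This is a sketch on a made-up three-row table, not the real page; `sample_html` and `first_data_index` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the results table: two header rows, one data row
sample_html = """
<table>
  <tr><th>Operator Name or Mine Name Search</th></tr>
  <tr><th>Mine ID</th><th>State</th><th>Operator</th></tr>
  <tr><td><input name="MineId" type="hidden" value="4608254"/>4608254</td>
      <td>WV</td><td>Dirt Con</td></tr>
</table>
"""
doc = BeautifulSoup(sample_html, "html.parser")
trs = doc.find_all("tr")

# Index of the first row holding <input name="MineId">, or None if absent
first_data_index = next(
    (i for i, tr in enumerate(trs)
     if tr.find("input", attrs={"name": "MineId"}) is not None),
    None,
)
print(first_data_index)
```

Unlike the counter, this returns `None` when no row matches, which makes the empty case easier to tell apart.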


### Loop through each operator result, printing its name

Use LIST SLICING to skip the non-data row(s).

In [14]:
for mines in doc_mines.find_all('tr')[7:]:
    tds = mines.find_all('td')
    # Skip rows that have no operator cell
    if len(tds) >= 3:
        operator_name = tds[2].text
        print(operator_name)

 Newberg Rock & Dirt  
AM Dirtworks & Aggregate Sales  
Dirt Company  
Dirt Con  
Dirt Doctor Inc  
Dirt Works  
Holley Dirt Company, Inc  
Krueger Brothers Gravel & Dirt  
M R Dirt  
M.R. Dirt Inc.  
P B Dirt Movers, Inc  
PB Dirt Movers  
PB Dirt Movers, Inc  
Prescott Dirt, LLC  
R D Blankenship Dirt Work LLC  
SIMPSON DIRTWORX LLC  
SIMPSON DIRTWORX LLC  
Spry's Dirt & Gravel, Inc.  
Vogt Dirt Service  
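
The names printed above still carry stray spaces from the HTML. One way to tidy them is sketched below; `clean_cell` is a hypothetical helper name, and the two sample strings are copied from the output and markup shown earlier.

```python
def clean_cell(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


print(repr(clean_cell(" Newberg Rock & Dirt  ")))
print(repr(clean_cell("Surface             ")))
```

For cells that only have padding at the ends, plain `.strip()` does the same job.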


### Loop through each operator result, printing its ID

There should be ONE code per row, and NO empty rows between them.

In [15]:
for mines in doc_mines.find_all('tr')[7:]:
    tds = mines.find_all('td')
    # Skip rows that have no operator cell
    if len(tds) >= 3:
        operator_name = tds[2].text.strip()
        print(operator_name)
        operator_id = tds[0].text.strip()
        print(operator_id)
        print('--------')

Newberg Rock & Dirt
3503598
--------
AM Dirtworks & Aggregate Sales
4801789
--------
Dirt Company
5001797
--------
Dirt Con
4608254
--------
Dirt Doctor Inc
2103723
--------
Dirt Works
4104757
--------
Holley Dirt Company, Inc
0801306
--------
Krueger Brothers Gravel & Dirt
3901432
--------
M R Dirt
3609624
--------
M.R. Dirt Inc.
3609931
--------
P B Dirt Movers, Inc
1519799
--------
PB Dirt Movers
4407296
--------
PB Dirt Movers, Inc
4407270
--------
Prescott Dirt, LLC
0203332
--------
R D Blankenship Dirt Work LLC
2901986
--------
SIMPSON DIRTWORX LLC
4300768
--------
SIMPSON DIRTWORX LLC
4300776
--------
Spry's Dirt & Gravel, Inc.
2302283
--------
Vogt Dirt Service
2103518
--------


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.
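
One compact way to build such a dictionary is to pair a list of column names with the cell texts using `zip`. The sketch below assumes the cells arrive in the page's column order (ID, state, operator, mine, type, coal or metal, status, commodity). `COLUMNS` and `row_to_dict` are hypothetical names, and the sample cells come from the Dirt Con row shown earlier.

```python
# Column names in the order the cells appear in each results row
COLUMNS = [
    "Operator_ID", "State", "Operator_name", "Mine_name",
    "Mine_type", "C_M", "Status", "Commodity",
]


def row_to_dict(cells):
    """Pair each cell text with its column name; returns a fresh dict."""
    return dict(zip(COLUMNS, (cell.strip() for cell in cells)))


# Cell texts as they might come out of one <tr>, padding included
sample_cells = [
    "4608254", "WV ", "Dirt Con  ", "Hog Lick Quarry",
    "Surface             ", "M ", "Temporarily Idled  ",
    "Crushed, Broken Limestone NEC  ",
]

mine = row_to_dict(sample_cells)
print(mine["Operator_name"], mine["Status"])
```

Because the function returns a new dictionary on every call, no values can leak from one row into the next.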

In [16]:
# for mines in doc_mines.find_all('tr')[7:]:
#     tds = mines.find_all('td')
#     if len(tds) >= 1:
#         operator_id = mines.find_all('td')[0].text.strip()
#     if len(tds) >= 2:
#         state = mines.find_all('td')[1].text.strip()
#     if len(tds) >= 3:
#         operator_name = mines.find_all('td')[2].text.strip()
#     if len(tds) >= 4:
#         mine_name = mines.find_all('td')[3].text.strip()        
#     if len(tds) >= 5:
#         mine_type = mines.find_all('td')[4].text.strip()
#     if len(tds) >= 6:
#         coal_or_metal = mines.find_all('td')[5].text.strip()
#     if len(tds) >= 7:
#         status = mines.find_all('td')[6].text.strip()
#     if len(tds) >= 8:
#         commodity = mines.find_all('td')[7].text.strip()
#         br = '\n'
#         print('Operator_name: ' + operator_name + br +\
#               'ID: ' + operator_id + br +\
#               'Mine_name: ' + mine_name + br +\
#               'State: ' + state + br +\
#               'Mine_type: ' + mine_type + br +\
#               'C_M: ' + coal_or_metal + br +\
#               'Status: ' + status + br +\
#               'Commodity: ' + commodity + br +\
#               '--------')

In [17]:
dirt_mines = []

for mines in doc_mines.find_all('tr')[7:]:
    tds = mines.find_all('td')
    # Only rows with all eight cells are results
    if len(tds) < 8:
        continue
    # Create a new dictionary for each row
    mines_dict = {
        'Operator_name': tds[2].text.strip(),
        'Operator_ID': tds[0].text.strip(),
        'Mine_name': tds[3].text.strip(),
        'State': tds[1].text.strip(),
        'Mine_type': tds[4].text.strip(),
        'C_M': tds[5].text.strip(),
        'Status': tds[6].text.strip(),
        'Commodity': tds[7].text.strip(),
    }
    dirt_mines.append(mines_dict)

print(len(dirt_mines))
print(dirt_mines)

19
[{'Operator_name': 'Newberg Rock & Dirt', 'Operator_ID': '3503598', 'Mine_name': 'Newberg Rock & Dirt', 'State': 'OR', 'Mine_type': 'Surface', 'C_M': 'M', 'Status': 'Active', 'Commodity': 'Crushed, Broken Stone NEC'}, {'Operator_name': 'AM Dirtworks & Aggregate Sales', 'Operator_ID': '4801789', 'Mine_name': 'AM Dirtworks & Aggregate Sales', 'State': 'ND', 'Mine_type': 'Surface', 'C_M': 'M', 'Status': 'Intermittent', 'Commodity': 'Construction Sand and Gravel'}, {'Operator_name': 'Dirt Company', 'Operator_ID': '5001797', 'Mine_name': 'Bush Pilot', 'State': 'AK', 'Mine_type': 'Surface', 'C_M': 'M', 'Status': 'Intermittent', 'Commodity': 'Construction Sand and Gravel'}, {'Operator_name': 'Dirt Con', 'Operator_ID': '4608254', 'Mine_name': 'Hog Lick Quarry', 'State': 'WV', 'Mine_type': 'Surface', 'C_M': 'M', 'Status': 'Temporarily Idled', 'Commodity': 'Crushed, Broken Limestone NEC'}, {'Operator_name': 'Dirt Doctor Inc', 'Operator_ID': '2103723', 'Mine_name': 'Rock Lake Plant', 'State': 

### Save that to a CSV

In [18]:
import pandas as pd

In [19]:
df_mines = pd.DataFrame(dirt_mines)
df_mines.shape

(19, 8)

In [20]:
df_mines.head()

|   | C_M | Commodity | Mine_name | Mine_type | Operator_ID | Operator_name | State | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | M | Crushed, Broken Stone NEC | Newberg Rock & Dirt | Surface | 3503598 | Newberg Rock & Dirt | OR | Active |
| 1 | M | Construction Sand and Gravel | AM Dirtworks & Aggregate Sales | Surface | 4801789 | AM Dirtworks & Aggregate Sales | ND | Intermittent |
| 2 | M | Construction Sand and Gravel | Bush Pilot | Surface | 5001797 | Dirt Company | AK | Intermittent |
| 3 | M | Crushed, Broken Limestone NEC | Hog Lick Quarry | Surface | 4608254 | Dirt Con | WV | Temporarily Idled |
| 4 | M | Construction Sand and Gravel | Rock Lake Plant | Surface | 2103723 | Dirt Doctor Inc | MN | Intermittent |


### Open the CSV file and examine the first few rows. Make sure you didn't save an extra weird unnamed column.

In [21]:
df_mines.to_csv('dirt_mines.csv', index = False, columns = ['Operator_ID', 'Operator_name', 'State', 'Mine_name', 'Mine_type', 'C_M', 'Commodity', 'Status'])
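
For comparison, the same column ordering can be sketched with the standard library's `csv.DictWriter` instead of pandas. This is an alternative technique, not what the notebook uses: an `io.StringIO` buffer stands in for the file, and the single row is copied from the scraped results above.

```python
import csv
import io

# Desired column order in the output file
columns = ["Operator_ID", "Operator_name", "State", "Mine_name",
           "Mine_type", "C_M", "Commodity", "Status"]

# One scraped row, with keys in the order they were stored
rows = [{
    "Operator_name": "Newberg Rock & Dirt",
    "Operator_ID": "3503598",
    "Mine_name": "Newberg Rock & Dirt",
    "State": "OR",
    "Mine_type": "Surface",
    "C_M": "M",
    "Status": "Active",
    "Commodity": "Crushed, Broken Stone NEC",
}]

buffer = io.StringIO()  # stands in for an open file
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```

The commodity contains a comma, so the writer wraps it in quotes. The key order of the dictionaries does not matter, since `fieldnames` decides the column order.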

In [22]:
pd.read_csv('dirt_mines.csv').shape

(19, 8)

In [23]:
pd.read_csv('dirt_mines.csv').head()

|   | Operator_ID | Operator_name | State | Mine_name | Mine_type | C_M | Commodity | Status |
|---|---|---|---|---|---|---|---|---|
| 0 | 3503598 | Newberg Rock & Dirt | OR | Newberg Rock & Dirt | Surface | M | Crushed, Broken Stone NEC | Active |
| 1 | 4801789 | AM Dirtworks & Aggregate Sales | ND | AM Dirtworks & Aggregate Sales | Surface | M | Construction Sand and Gravel | Intermittent |
| 2 | 5001797 | Dirt Company | AK | Bush Pilot | Surface | M | Construction Sand and Gravel | Intermittent |
| 3 | 4608254 | Dirt Con | WV | Hog Lick Quarry | Surface | M | Crushed, Broken Limestone NEC | Temporarily Idled |
| 4 | 2103723 | Dirt Doctor Inc | MN | Rock Lake Plant | Surface | M | Construction Sand and Gravel | Intermittent |
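
One thing worth checking when reading the file back: some IDs in the results start with a zero (for example `0801306`), and `pd.read_csv` parses an all-digit column as integers by default, which drops that zero. The sketch below shows the effect on a one-row table held in an in-memory buffer rather than the real file; passing `dtype` for `Operator_ID` is one possible fix, not something the original notebook does.

```python
import io

import pandas as pd

# One scraped row whose ID has a leading zero
rows = [{"Operator_ID": "0801306",
         "Operator_name": "Holley Dirt Company, Inc"}]

buffer = io.StringIO()  # stands in for dirt_mines.csv
pd.DataFrame(rows).to_csv(buffer, index=False)

# Default parsing: the ID column becomes an integer
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading the ID column as text keeps the leading zero
buffer.seek(0)
text_read = pd.read_csv(buffer, dtype={"Operator_ID": str})

print(default_read["Operator_ID"].iloc[0])
print(text_read["Operator_ID"].iloc[0])
```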
