# Scraping many pages + Using Selenium

## The pages we'll be looking at

If I want to read information about a specific mine, it takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type, you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

In [1]:
# https://arlweb.msha.gov/drs/drshome.htm

### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

In [2]:
# id='inputdrs'
# CSS selector: content > table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)

### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [3]:
# CSS selector: content > table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)
# click()

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, 'html.parser')`*

In [4]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()

# Visit the home page
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

# Wait until search for Mine ID is loaded
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)'))

# Enter search keys
search_box = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)')
search_box.send_keys('3901432')

# Click search button
search_btn = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)')
search_btn.click()

# Get operator's name
operator_name = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > font:nth-child(1) > b:nth-child(1)')
operator_name.text

'Krueger Brothers Gravel & Dirt'

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [5]:
import pandas as pd
df_mines_subset = pd.read_csv('mines-subset.csv')
df_mines_subset

Unnamed: 0,id
0,4104757
1,801306
2,3609931


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

In [6]:
df_mines_subset = pd.read_csv('mines-subset.csv', converters = {'id':str})
df_mines_subset

Unnamed: 0,id
0,4104757
1,0801306
2,3609931
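
The difference is the leading zero: read as a number, `0801306` becomes `801306`. Here is a minimal sketch of the effect using only the standard library. The file layout (a single `id` column) and the seven-digit ID width are assumptions based on the IDs that appear in this notebook.

```python
import csv
import io

# Assumed file contents: one "id" column holding the three IDs used in this notebook
raw_csv = "id\n4104757\n0801306\n3609931\n"

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Kept as text, the IDs look exactly like they do in a text editor
ids_as_text = [row["id"] for row in rows]

# Converted to numbers, the leading zero is dropped
ids_as_numbers = [int(row["id"]) for row in rows]

# If the zeros are already gone, pad back to seven digits (assumed ID width)
restored = [str(number).zfill(7) for number in ids_as_numbers]

print(ids_as_text)
print(ids_as_numbers)
print(restored)
```

Reading the column as text from the start, as the `converters` argument does above, avoids having to guess the width afterwards.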


### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [7]:
def scrape_operator_name(row):
    # Visit the home page
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')

    # Wait until search for Mine ID is loaded
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)'))

    # Enter search keys
    search_box = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)')
    search_box.send_keys(row['id'])
    
    search_btn = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)')
    search_btn.click()
    
    # Get the operator's name once its element has loaded
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > font:nth-child(1) > b:nth-child(1)'))
    operator_name = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > font:nth-child(1) > b:nth-child(1)')
    print(operator_name.text)
    
driver = webdriver.Firefox()
df_mines_subset.apply(scrape_operator_name, axis=1)

Dirt Works
Holley Dirt Company, Inc
M.R. Dirt Inc.


0    None
1    None
2    None
dtype: object
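
The column of `None` values appears because the function prints the name but never returns it, so `.apply` has nothing to collect. A plain-Python sketch of the same effect follows. The helper names `print_label` and `make_label` are hypothetical, and a list comprehension stands in for `.apply`.

```python
mine_ids = ["4104757", "0801306", "3609931"]

def print_label(mine_id):
    # Prints the label, but there is no return statement, so the result is None
    print("mine " + mine_id)

def make_label(mine_id):
    # Returns the label, so the caller can collect it
    return "mine " + mine_id

printed = [print_label(mine_id) for mine_id in mine_ids]
returned = [make_label(mine_id) for mine_id in mine_ids]

print(printed)
print(returned)
```

The first list holds three `None` values and the second holds the labels, which is why a function used to fill a new column needs a `return`.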

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [8]:
def scrape_operator_name(row):
    # Visit the home page
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')

    # Wait until search for Mine ID is loaded
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)'))

    # Enter search keys
    search_box = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)')
    search_box.send_keys(row['id'])
    
    search_btn = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)')
    search_btn.click()
    
    # Get the operator's name once its element has loaded
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > font:nth-child(1) > b:nth-child(1)'))
    operator_name = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > font:nth-child(1) > b:nth-child(1)')
    operator_name = operator_name.text
    
    return pd.Series({
        'operator_name' : operator_name
    })

    
driver = webdriver.Firefox()
df_mines_subset.apply(scrape_operator_name, axis=1).join(df_mines_subset)

Unnamed: 0,operator_name,id
0,Dirt Works,4104757
1,"Holley Dirt Company, Inc",801306
2,M.R. Dirt Inc.,3609931


# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data to load in the middle of a bunch of steps, but Selenium has to start at the beginning.*

In [9]:
# https://arlweb.msha.gov/drs/drshome.htm

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

In [10]:
# CSS Selector
# form:nth-child(4) > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(1) > font:nth-child(1) > input:nth-child(2)

### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

In [11]:
# CSS Selector
# form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > input:nth-child(1)

### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

In [12]:
# CSS Selector
# form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(1)
# .click()

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
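
As a rough illustration of the parent pattern, here is a standard-library sketch that walks rows first and then the cells inside each row. The markup is a simplified, made-up table (not the real MSHA page), with values taken from the rows shown later in this notebook.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup for illustration only (NOT the real page structure)
sample = """
<table>
  <tr><td>8750964</td><td>000361866</td><td>56.18010</td></tr>
  <tr><td>6426439</td><td>000260865</td><td>56.4201(a)(2)</td></tr>
</table>
"""

table = ET.fromstring(sample.strip())

violations = []
# Parent first: loop over the rows (tr)
for tr in table.findall("tr"):
    # Then the children: collect the text of each cell (td) in this row
    cells = [td.text for td in tr.findall("td")]
    violations.append({
        "citation_number": cells[0],
        "case_number": cells[1],
        "standard_violated": cells[2],
    })

for violation in violations:
    print(violation)
```

With Selenium the shape is the same: find the row elements, then search inside each row for its cells (for example with `row.find_elements_by_tag_name('td')`), rather than grabbing every cell on the page at once.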

In [13]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Visit the home page
driver = webdriver.Firefox()
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

# Wait until search for Mine ID is loaded
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)'))

# Enter search keys
search_box = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)')
search_box.send_keys('3901432')

search_btn = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)')
search_btn.click()

# Wait for violations button to load
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > input:nth-child(1)'))

# Enter search keys for Beginning Date
beginning_date = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(1) > font:nth-child(1) > input:nth-child(2)')
beginning_date.send_keys('1/1/1995')

# Tick violations button
violations_button = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > input:nth-child(1)')
violations_button.click()

# Click Get Report
violations_report = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(1)')
violations_report.click()

In [14]:
# All the cells with data are stored in rows with class name 'drsviols'
# Let's make a list of them and print them out
rows = driver.find_elements_by_class_name('drsviols')
for row in rows:
    print(row.text.strip())
    print('--------')

Violator Contractor
ID Citation/Order No. Case No. Date Issued Final Order Date Section of
Act Date Terminated Citation/
Order S & S Standard Proposed
Penalty ($) Citation/Order
Status Current Penalty
($) Amount Paid
To Date ($)
Krueger Brothers Gravel & Dirt    8750964   000361866 8/12/2014 10/22/2014  104(a) 9/2/2014 C N 56.18010 100.00 Closed 100.00  100.00 
Krueger Brothers Gravel & Dirt    6426439   000260865 6/7/2011 8/18/2011  104(a) 6/8/2011 C N 56.4201(a)(2) 100.00 Closed 100.00  100.00 
Krueger Brothers Gravel & Dirt    6426438   000260865 6/7/2011 8/18/2011  104(a) 6/8/2011 C N 56.4101 100.00 Closed 100.00  100.00 
Krueger Brothers Gravel & Dirt    6588189   000260865 6/2/2011 8/18/2011  104(a) 6/2/2011 C N 56.14200 100.00 Closed 100.00  100.00 
Krueger Brothers Gravel & Dirt    6588210   000238554 10/6/2010 12/26/2010  104(a) 10/6/2010 C N 50.30(a) 100.00 Closed 100.00  100.00 
Krueger Brothers Gravel & Dirt    6328074   000188398 5/8/2009 7/23/2009  104(a) 5/8/2009 C N 56.

55.00
--------
Closed
--------
55.00
--------
55.00
--------
Krueger Brothers Gravel & Dirt    7916126   390143205501 10/27/1998 1/26/1999  104(a) 10/27/1998 C N 56.14100(d) 55.00 Closed 55.00  55.00
--------
Krueger Brothers Gravel & Dirt
--------

--------
7916126
--------
390143205501
--------
10/27/1998
--------
1/26/1999
--------
104(a)
--------
10/27/1998
--------
C
--------
N
--------
56.14100(d)
--------
55.00
--------
Closed
--------
55.00
--------
55.00
--------
Krueger Brothers Gravel & Dirt    7916119   390143205502 10/27/1998 4/6/1999  104(a) 1/21/1999 C N 56.18010 55.00 Closed 55.00  55.00
--------
Krueger Brothers Gravel & Dirt
--------

--------
7916119
--------
390143205502
--------
10/27/1998
--------
4/6/1999
--------
104(a)
--------
1/21/1999
--------
C
--------
N
--------
56.18010
--------
55.00
--------
Closed
--------
55.00
--------
55.00
--------
Krueger Brothers Gravel & Dirt    7916115   390143205501 10/27/1998 1/26/1999  104(a) 10/27/1998 C N 41.20 55.00 Clos

In [15]:
# When we print all the rows with class name 'drsviols', the first element
# of the list is one long string that contains all the
# violations for the given mine ID.

# We can turn this long string into a list of dictionaries
# using regular expressions, list slicing, and string splitting.

# Import the regular expressions module
import re

# The string that contains all the violations data starts with
# the word 'Violator'. We use re.search to keep only strings that start
# that way, then split them on line breaks to get an individual
# string for every violation row.

# We need a new, empty list to store the strings we create by splitting
new_list = []
for row in rows:
    data = row.text.strip()
    # In rows, search only for strings starting with V (from Violator)
    regex_search = re.search(r'^V', data)
    # When a string starts with V, split it on line breaks
    # and append the resulting list of strings to new_list
    if regex_search:
        new_list.append(regex_search.string.split('\n'))
        
print(new_list)

[['Violator Contractor', 'ID Citation/Order No. Case No. Date Issued Final Order Date Section of', 'Act Date Terminated Citation/', 'Order S & S Standard Proposed', 'Penalty ($) Citation/Order', 'Status Current Penalty', '($) Amount Paid', 'To Date ($)', 'Krueger Brothers Gravel & Dirt    8750964   000361866 8/12/2014 10/22/2014  104(a) 9/2/2014 C N 56.18010 100.00 Closed 100.00  100.00 ', 'Krueger Brothers Gravel & Dirt    6426439   000260865 6/7/2011 8/18/2011  104(a) 6/8/2011 C N 56.4201(a)(2) 100.00 Closed 100.00  100.00 ', 'Krueger Brothers Gravel & Dirt    6426438   000260865 6/7/2011 8/18/2011  104(a) 6/8/2011 C N 56.4101 100.00 Closed 100.00  100.00 ', 'Krueger Brothers Gravel & Dirt    6588189   000260865 6/2/2011 8/18/2011  104(a) 6/2/2011 C N 56.14200 100.00 Closed 100.00  100.00 ', 'Krueger Brothers Gravel & Dirt    6588210   000238554 10/6/2010 12/26/2010  104(a) 10/6/2010 C N 50.30(a) 100.00 Closed 100.00  100.00 ', 'Krueger Brothers Gravel & Dirt    6328074   000188398 5

In [16]:
# Our regex and string split matched two results:
# the long string with all the violations (now split into separate strings)
# and a lone string 'Violator'. We only need the first element of the list
# to work with, so:
new_list = new_list[0]

In [17]:
# Elements 0-7 of our list are the column labels.
# We don't want to work with them, so we keep only the
# elements from index 8 onward.
list_used = new_list[8:]
list_used[0]

'Krueger Brothers Gravel & Dirt    8750964   000361866 8/12/2014 10/22/2014  104(a) 9/2/2014 C N 56.18010 100.00 Closed 100.00  100.00 '

In [18]:
# We are about to use regexes and string splitting
# to pull each table cell out of a row string and collect the
# results in a list of lists. That way we will have a list
# with every table row as a list, and every table cell as a string
# element of that list. Then we can loop through every
# list and create a dictionary with our data, append it
# to a new list, and generate the DataFrame
# and our CSV file from that list.

In [19]:
# Empty list to which we will append every row as a list
# with every cell as a list element. We will use regexes and string splitting
newnewlist = []

for element in list_used:
    # Use regexes to insert a special character into each string,
    # then split on that character
    # to turn each string into separate list elements
    # that hold the table cell data
    
    # The following rules are designed to also work for rows
    # that refer to cases that are not assessed yet/non-assessable,
    # so our code does not break in the multi-page scraping functions
    
    # The substitutions run step by step, each on the string
    # left by the one before it
    # regex1: if there are 3-5 whitespace characters before a digit, replace them with ^
    # We use ^ so we don't interfere with text strings that include a comma
    spaces = re.sub(r'(\s{3,5})(\d)', r'^\g<2>', element.strip())
    # regex2: between a closing parenthesis and a following digit or letter, replace the space with ^
    parentheses = re.sub(r'([)]) (\d|[A-Za-z])', r'\g<1>^\g<2>', spaces)
    # regex3: there is a pattern in our data: cell-date, cell-C, cell-N, cell-number
    # so we put a ^ between each one of these
    c_N = re.sub(r'([/]\d{4}) ([A-Z]) ([A-Z]) (\d)', r'\g<1>^\g<2>^\g<3>^\g<4>', parentheses)
    # regex4: digit, then 1 or 2 spaces, then a digit or capital letter: replace the spaces with ^
    decimals = re.sub(r'(\d) ? (\d|[A-Z])', r'\g<1>^\g<2>', c_N)
    # regex5: letter, space, digit: replace the space with ^
    lower = re.sub(r'([A-Za-z]) (\d)', r'\g<1>^\g<2>', decimals)
    # Split the string on ^ and append the resulting list to newnewlist
    newnewlist.append(lower.split('^'))

print(newnewlist[0])

['Krueger Brothers Gravel & Dirt', '8750964', '000361866', '8/12/2014', '10/22/2014', '104(a)', '9/2/2014', 'C', 'N', '56.18010', '100.00', 'Closed', '100.00', '100.00']
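
To check the substitutions without opening a browser, the same five patterns can be wrapped in a small helper and run against the sample row printed earlier. This is only a sketch: the helper name `split_violation_row` is hypothetical, and the patterns are copied from the cell above.

```python
import re

# (pattern, replacement) pairs copied from the notebook cell above, applied in order
SUBSTITUTIONS = [
    (r"(\s{3,5})(\d)", r"^\g<2>"),
    (r"([)]) (\d|[A-Za-z])", r"\g<1>^\g<2>"),
    (r"([/]\d{4}) ([A-Z]) ([A-Z]) (\d)", r"\g<1>^\g<2>^\g<3>^\g<4>"),
    (r"(\d) ? (\d|[A-Z])", r"\g<1>^\g<2>"),
    (r"([A-Za-z]) (\d)", r"\g<1>^\g<2>"),
]

def split_violation_row(text):
    """Mark cell boundaries with ^ and split one row string into cells."""
    marked = text.strip()
    for pattern, replacement in SUBSTITUTIONS:
        marked = re.sub(pattern, replacement, marked)
    return marked.split("^")

# The first data row from the violations page, as printed earlier
sample_row = (
    "Krueger Brothers Gravel & Dirt    8750964   000361866 8/12/2014 "
    "10/22/2014  104(a) 9/2/2014 C N 56.18010 100.00 Closed 100.00  100.00 "
)

fields = split_violation_row(sample_row)
print(len(fields))
print(fields)
```

A fully assessed row should split into 14 cells. Counting the cells for every row is a quick way to spot rows the patterns do not handle.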


In [20]:
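# So now we have one list that contains every violation row as a list
# Inside these lists, every data cell is a string element
# We can loop through these lists and create dictionaries
# that we can feed to a new (final) list from which we will generate
# our DataFrame and CSV

final_data_list = []
# The code assumes the Standard Violated links have ids anch_18, anch_19, ...
# so the counter starts at 17 and is incremented before each lookup.
# NOTE: in the output below, each URL appears to belong to the previous row's
# standard (row 2 links to sec56-18010, which is row 1's standard), which suggests
# the first standard link may really be anch_19. Check the page before relying on this column.
i = 17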
for an_element in newnewlist:
    empty_dict = {}
    empty_dict['mine_id'] = '3901432'
    empty_dict['citation_number'] = an_element[1]
    empty_dict['case_number'] = an_element[2]
    empty_dict['standard_violated'] = an_element[9]
    i = i + 1
    standard_violated_url = driver.find_element_by_id('anch_'+str(i)).get_attribute('href')
    empty_dict['standard_violated_url'] = standard_violated_url
    empty_dict['proposed_penalty'] = an_element[10]
    empty_dict['penalty_paid'] = an_element[-1].strip()
    final_data_list.append(empty_dict)
    
print(final_data_list)

[{'mine_id': '3901432', 'citation_number': '8750964', 'case_number': '000361866', 'standard_violated': '56.18010', 'standard_violated_url': 'https://arlweb.msha.gov/PROGRAMS/ASSESS.HTM#Outreach', 'proposed_penalty': '100.00', 'penalty_paid': '100.00'}, {'mine_id': '3901432', 'citation_number': '6426439', 'case_number': '000260865', 'standard_violated': '56.4201(a)(2)', 'standard_violated_url': 'http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-vol1/pdf/CFR-2014-title30-vol1-sec56-18010.pdf', 'proposed_penalty': '100.00', 'penalty_paid': '100.00'}, {'mine_id': '3901432', 'citation_number': '6426438', 'case_number': '000260865', 'standard_violated': '56.4101', 'standard_violated_url': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4201.pdf', 'proposed_penalty': '100.00', 'penalty_paid': '100.00'}, {'mine_id': '3901432', 'citation_number': '6588189', 'case_number': '000260865', 'standard_violated': '56.14200', 'standard_violated_url': 'http://www.gpo.gov/f

In [21]:
# We have recorded all the 18 violations included on the page 
# we were scraping
print(len(final_data_list))

18


In [22]:
import pandas as pd
df_mine_violations = pd.DataFrame(final_data_list)
df_mine_violations.head()

Unnamed: 0,case_number,citation_number,mine_id,penalty_paid,proposed_penalty,standard_violated,standard_violated_url
0,361866,8750964,3901432,100.0,100.0,56.18010,https://arlweb.msha.gov/PROGRAMS/ASSESS.HTM#Ou...
1,260865,6426439,3901432,100.0,100.0,56.4201(a)(2),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
2,260865,6426438,3901432,100.0,100.0,56.4101,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...
3,260865,6588189,3901432,100.0,100.0,56.14200,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...
4,238554,6588210,3901432,100.0,100.0,50.30(a),http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...


In [23]:
df_mine_violations.shape

(18, 7)

In [24]:
df_mine_violations.to_csv('3901432-violations-selenium.csv', index = False)

In [25]:
df_mine_violations = pd.read_csv('3901432-violations-selenium.csv', converters = {'case_number':str, 'citation_number':str})
df_mine_violations.head()

Unnamed: 0,case_number,citation_number,mine_id,penalty_paid,proposed_penalty,standard_violated,standard_violated_url
0,361866,8750964,3901432,100.0,100.0,56.18010,https://arlweb.msha.gov/PROGRAMS/ASSESS.HTM#Ou...
1,260865,6426439,3901432,100.0,100.0,56.4201(a)(2),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
2,260865,6426438,3901432,100.0,100.0,56.4101,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...
3,260865,6588189,3901432,100.0,100.0,56.14200,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...
4,238554,6588210,3901432,100.0,100.0,50.30(a),http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...


In [26]:
df_mine_violations.shape

(18, 7)
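
The `converters` argument matters when reading the file back because the case numbers are stored with leading zeros, which disappear if the column is parsed as a number. A minimal standard-library sketch of the round trip follows. An in-memory buffer stands in for the CSV file, and only three of the fields are included to keep it short.

```python
import csv
import io

# Two records with values taken from the rows shown in this notebook
records = [
    {"citation_number": "8750964", "case_number": "000361866", "proposed_penalty": "100.00"},
    {"citation_number": "6426439", "case_number": "000260865", "proposed_penalty": "100.00"},
]

# An in-memory buffer stands in for a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["citation_number", "case_number", "proposed_penalty"]
)
writer.writeheader()
writer.writerows(records)

# Read the data back; the csv module keeps every value as text
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

case_as_text = read_back[0]["case_number"]
case_as_number = int(case_as_text)

print(case_as_text)
print(case_as_number)
```

The text value keeps its zeros while the numeric one does not, so identifier columns are safer to read as strings.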

# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [27]:
import pandas as pd

mines_subset = pd.read_csv('mines-subset.csv', converters = {'id':str})
mines_subset

Unnamed: 0,id
0,4104757
1,0801306
2,3609931


### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name them after the mine id.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*

In [28]:
# To loop through the IDs in the CSV file,
# we create a function that repeats the process we used
# above for a single ID
def scrape_mine_violations_selenium(row):
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)'))
    search_box = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(1) > input:nth-child(4)')
    search_box.send_keys(row['id'])
    search_btn = driver.find_element_by_css_selector('table:nth-child(15) > tbody:nth-child(2) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(3)')
    search_btn.click()
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > input:nth-child(1)'))
    beginning_date = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(1) > font:nth-child(1) > input:nth-child(2)')
    beginning_date.send_keys('1/1/1995')
    violations_button = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > input:nth-child(1)')
    violations_button.click()
    violations_report = driver.find_element_by_css_selector('form:nth-child(4) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(2) > input:nth-child(1)')
    violations_report.click()

    table_rows = driver.find_elements_by_class_name('drsviols')
    new_list = []
    for one_of_them in table_rows:
        data = one_of_them.text.strip()
        regex_search = re.search(r'^V', data)
        if regex_search:
            new_list.append(regex_search.string.split('\n'))

    new_list = new_list[0]       
    list_used = new_list[8:]
    
    newnewlist = []
    for element in list_used:
        # The regexes we created for one ID apply to all IDs
        # They are designed to work both with rows that have full data
        # and with rows that haven't been assessed yet
        # (here the separator is a comma plus two spaces instead of ^)
        spaces = re.sub(r'(\s{3,5})(\d)', r',  \g<2>', element.strip())
        parentheses = re.sub(r'([)]) (\d|[A-Za-z])', r'\g<1>,  \g<2>', spaces)
        c_N = re.sub(r'([/]\d{4}) ([A-Z]) ([A-Z]) (\d)', r'\g<1>,  \g<2>,  \g<3>,  \g<4>', parentheses)
        decimals = re.sub(r'(\d) ? (\d|[A-Z])', r'\g<1>,  \g<2>', c_N)
        lower = re.sub(r'([A-Za-z]) (\d)', r'\g<1>,  \g<2>', decimals)
        newnewlist.append(lower.split(',  '))

    final_data_list = []
    i = 17
    for something in newnewlist:
        # If we look at the mine IDs provided in the CSV file,
        # some of them have cases that haven't been assessed yet
        # So for this function to work across mine IDs, we need
        # some extra rules and an if-statement to fill their dictionaries correctly
        # The main difference is that rows for cases that aren't assessable split into
        # fewer elements (9 or fewer, per the check below), so we use the length to decide how each dict is filled
        empty_dict = {}
        empty_dict['mine_id'] = row['id']
        empty_dict['citation_number'] = something[1]
        i = i+1
        standard_violated_url = driver.find_element_by_id('anch_'+str(i)).get_attribute('href')
        empty_dict['standard_violated_url'] = standard_violated_url
        if len(something) <= 9:
            empty_dict['case_number'] = 'Not Assessed Yet/Non-Assessable'
            empty_dict['standard_violated'] = something[-2]
            empty_dict['proposed_penalty'] = 'Not Assessed Yet/Non-Assessable'
            empty_dict['penalty_paid'] = 'Not Assessed Yet/Non-Assessable'
        else:
            empty_dict['case_number'] = something[2]
            empty_dict['standard_violated'] = something[9]
            empty_dict['proposed_penalty'] = something[10]
            empty_dict['penalty_paid'] = something[-1].strip()
        final_data_list.append(empty_dict)

    df_mine_violations = pd.DataFrame(final_data_list)
    df_mine_violations.to_csv(row['id']+'-violations-selenium.csv', index = False)

In [29]:
# Calling our webdriver and then our function
driver = webdriver.Firefox()
mines_subset.apply(scrape_mine_violations_selenium, axis=1)

0    None
1    None
2    None
dtype: object

In [30]:
csvs_names_list = []

def csvs_names(row):
    csvs_names_list.append(row['id']+'-violations-selenium.csv')
    
mines_subset.apply(csvs_names, axis = 1)
csvs_names_list

['4104757-violations-selenium.csv',
 '0801306-violations-selenium.csv',
 '3609931-violations-selenium.csv']

In [31]:
df_4104757 = pd.read_csv(csvs_names_list[0], converters = {'id':str, 'case_number':str, 'citation_number':str})
df_4104757.head()

Unnamed: 0,case_number,citation_number,mine_id,penalty_paid,proposed_penalty,standard_violated,standard_violated_url
0,374480,8778046,4104757,100.0,100.0,56.14132(a),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
1,374480,8778047,4104757,162.0,162.0,56.18010,http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
2,345454,8771784,4104757,100.0,100.0,50.30(a),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
3,348280,8771781,4104757,100.0,100.0,56.14100(b),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...
4,345454,8771783,4104757,243.0,243.0,56.9300(a),http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-...


In [32]:
df_4104757.shape

(18, 7)

In [33]:
df_0801306 = pd.read_csv(csvs_names_list[1], na_values = ['Not Assessed Yet/Non-Assessable'], converters = {'id':str, 'case_number':str, 'citation_number':str})
df_0801306.head()

Unnamed: 0,case_number,citation_number,mine_id,penalty_paid,proposed_penalty,standard_violated,standard_violated_url
0,Not Assessed Yet/Non-Assessable,8912694,801306,,,56.14132(a),http://www.ecfr.gov/cgi-bin/text-idx?SID=f462b...
1,000427623,8638781,801306,351.0,351.0,56.12028,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...
2,000411633,8903434,801306,117.0,117.0,46.11(d),http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...
3,000411633,8903433,801306,100.0,100.0,56.14100(b),http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...
4,000411633,8903435,801306,117.0,117.0,56.9300(a),http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...


In [34]:
df_0801306.shape

(76, 7)

In [35]:
df_3609931 = pd.read_csv(csvs_names_list[2], na_values = ['Not Assessed Yet/Non-Assessable'], converters = {'id':str, 'case_number':str, 'citation_number':str})
df_3609931.head()

Unnamed: 0,case_number,citation_number,mine_id,penalty_paid,proposed_penalty,standard_violated,standard_violated_url
0,Not Assessed Yet/Non-Assessable,9317668,3609931,,,50.30(a),http://www.ecfr.gov/cgi-bin/text-idx?SID=f462b...
1,000421654,8928850,3609931,114.0,114.0,56.9301,http://www.gpo.gov/fdsys/pkg/CFR-2016-title30-...
2,000380669,8807882,3609931,100.0,100.0,56.14132(a),http://www.gpo.gov/fdsys/pkg/CFR-2015-title30-...
3,000282555,8650963,3609931,100.0,100.0,56.14100(b),http://www.gpo.gov/fdsys/pkg/CFR-2012-title30-...
4,000274355,8650926,3609931,100.0,100.0,56.1000,http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-...


In [36]:
df_3609931.shape

(8, 7)