# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before it lets you in.

Done by hand, a search looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [1]:
# https://jportal.mdcourts.gov/license/pbPublicSearch.jsp

**It isn't going to work, though! It's going to redirect to that intro page.** To step through the "check the box" series of pages again in your own browser, open the site in an *Incognito* window.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

In [2]:
# <input name="ac" value="y" id="checkbox" type="checkbox">
# .click()
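
One way to think about "what's unique" here: the checkbox carries both an `id` and a `name`, so several locator strategies point at the same element. The sketch below is an illustration, not the notebook's own answer. `tick_checkbox` and `CHECKBOX_LOCATORS` are hypothetical names, and the strategy is passed as a plain string to Selenium's generic `find_elements(by, value)` method (these strings are the values behind `selenium.webdriver.common.by.By`).

```python
# Locator strategy strings; they match the constants in selenium.webdriver.common.by.By
BY_ID = "id"
BY_NAME = "name"
BY_CSS = "css selector"
BY_XPATH = "xpath"

# Four ways to point at <input name="ac" value="y" id="checkbox" type="checkbox">
CHECKBOX_LOCATORS = [
    (BY_ID, "checkbox"),
    (BY_NAME, "ac"),
    (BY_CSS, "input#checkbox"),
    (BY_XPATH, "//input[@id='checkbox']"),
]


def tick_checkbox(driver, locators=CHECKBOX_LOCATORS):
    """Try each locator in turn, click the first match, and report which one worked."""
    for by, value in locators:
        # find_elements returns an empty list (no exception) when nothing matches
        elements = driver.find_elements(by, value)
        if elements:
            elements[0].click()
            return by, value
    raise LookupError("checkbox not found with any locator")
```

With a live browser this would be called as `tick_checkbox(driver)` right after loading the disclaimer page.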

### How will you identify the button to select it, or the form to submit it?

Selenium can submit a form in either of two ways:

- Select the form and call `.submit()`, or
- Select the submit button and call `.click()`

You only need to find **one of them, not both.**

In [3]:
# <input value="Enter the Site" type="submit">
# .click()

# <form action="pbIndex.jsp" method="get">
# 				<div class="copy"><input name="ac" value="y" id="checkbox" type="checkbox"><label for="checkbox">&nbsp;I have read the below disclaimer&nbsp;</label>
# 				<input value="Enter the Site" type="submit"></div>
# 			</form>
# .submit()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [4]:
# <a href="pbPublicSearch.jsp">SEARCH LICENSE RECORDS</a>
# .click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [5]:
# <option value="50">Statewide</option>
# .click()
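
Since the option has `value="50"`, one possibility is to select it by that value instead of by its position in the list. This is a sketch under assumptions: `choose_statewide` is a hypothetical helper, the dropdown is assumed to have the id `slcJurisdiction` (the id used in the selector later in this notebook), and the locator strategy is passed as the plain string `"css selector"` to Selenium's generic `find_element(by, value)`.

```python
def choose_statewide(driver):
    """Pick "Statewide" by its option value rather than by where it sits in the list."""
    # <option value="50">Statewide</option> inside the #slcJurisdiction dropdown
    option = driver.find_element(
        "css selector", '#slcJurisdiction > option[value="50"]'
    )
    option.click()
    return option
```

Matching on the value keeps working if the site reorders the dropdown, which a positional selector such as `option:nth-child(2)` would not.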

### How do you type "vap%" into the Trade Name field?

In [6]:
# <input id="txtTradeName" name="txtTradeName" value="" type="text">
# .send_keys('vap%')

### How do you click the submit button or submit the form?

In [7]:
# <input value="Submit" type="submit">
# .click()

# <form name="searchForm" method="get" action="pbelservlet/Search">
# 				</form>
# .submit()

### How can you find and click the 'Next' button on the search results page?

In [8]:
# <nobr>Next »</nobr>

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [9]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

In [10]:
driver = webdriver.Firefox()
#driver.implicitly_wait(2)

In [11]:
# Open the intro page
driver.get('https://jportal.mdcourts.gov/license/index_disclaimer.jsp')

In [12]:
# Tick the checkbox
# <input name="ac" value="y" id="checkbox" type="checkbox">
checkbox = driver.find_element_by_id('checkbox')
checkbox.click()

In [13]:
# Click Enter the Site
# <input value="Enter the Site" type="submit">
# div.copy > input:nth-child(3)
enter_site = driver.find_element_by_css_selector('div.copy > input:nth-child(3)')
enter_site.click()

In [15]:
# Click the Search License Records
# .divider > a:nth-child(2)
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('td.divider > a:nth-child(2)'))
license_records = driver.find_element_by_css_selector('td.divider > a:nth-child(2)')
license_records.click()

In [16]:
# Click on statewide
statewide = driver.find_element_by_css_selector('#slcJurisdiction > option:nth-child(2)')
statewide.click()

In [17]:
# Type the search term into the Trade Name field
# <input id="txtTradeName" name="txtTradeName" value="" type="text">
trade_name = driver.find_element_by_id('txtTradeName')
trade_name.send_keys('vap%')

In [18]:
# Submit the form
search_form = driver.find_element_by_css_selector('body > table:nth-child(2) > tbody:nth-child(1) > tr:nth-child(4) > td:nth-child(2) > form:nth-child(7)')
search_form.submit()

In [19]:
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('table.btmnavtable > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(3) > a:nth-child(1) > nobr:nth-child(1)'))
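# Click "Next" up to nine times. On the last page the button is gone,
# so find_element raises and the remaining iterations do nothing.
for i in range(1,10):
    # Kudos to Soma for try-pass
    try:
        next_button = driver.find_element_by_css_selector('table.btmnavtable > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(3) > a:nth-child(1) > nobr:nth-child(1)')
        next_button.click()
    except Exception:
        pass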

## Unused alternative - Kudos to Soma for this
# find_elements (plural) returns a list, which is empty when there is no Next button
# for i in range(1,8):
#     next_buttons = driver.find_elements_by_css_selector('table.btmnavtable > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(3) > a:nth-child(1) > nobr:nth-child(1)')
#     if len(next_buttons) > 0:
#         next_buttons[0].click()
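
The "look for a list, stop when it's empty" idea can also be written as a loop that ends itself instead of guessing the number of pages. This is a minimal sketch, not the approach used in the cells here: `click_through_pages` and `max_pages` are hypothetical names, and the link text `'Next »'` is assumed to match the button on the results page.

```python
def click_through_pages(driver, link_text="Next »", max_pages=50):
    """Click the "Next" link until it disappears; return how many clicks were made."""
    clicks = 0
    # max_pages is a safety cap so a misbehaving site cannot loop forever
    while clicks < max_pages:
        # find_elements returns an empty list instead of raising when nothing matches
        links = driver.find_elements("link text", link_text)
        if not links:
            break
        links[0].click()
        clicks += 1
    return clicks
```

Calling `click_through_pages(driver)` on the first results page should leave the browser on the last one.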

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [20]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [21]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5

In [22]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    #print(rows)

HEADER is 1.
VAPE IT STORE I
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
HEADER is 2.
VAPE IT STORE II
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
HEADER is 3.
VAPEPAD THE
ROW 0 IS ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 IS 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 IS GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 IS Anne Arundel County
HEADER is 4.
VAPE FROG
ROW 0 IS COX TRADING COMPANY L L C
Lic. Status: Issued
ROW 1 IS 110 S. PINEY RD
License: 17165957
ROW 2 IS CHESTER, MD 21619
Issued Date: 5/31/2017
ROW 3 IS Queen Anne's County
HEADER is 5.
VAPE FROG
Pending *
ROW 0 IS COX TRADING LLC
Lic. Status: Pending
ROW 1 IS 346 RITCHIE HIGHWAY
ROW 2 IS SEVERNA PARK, MD 21146
ROW 3 IS Anne Arundel County
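
Each row's text mixes a plain value (a name or an address line) with an optional labelled field such as `Lic. Status: Issued`. One way to pull the two apart is sketched below. `split_row` is a hypothetical helper, and it assumes that labelled fields always contain `': '` and that plain values never do.

```python
def split_row(text):
    """Split a row's text into its plain value and a dict of 'Label: value' fields."""
    # Drop blank lines and surrounding whitespace left over from the table layout
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    value = None
    labelled = {}
    for line in lines:
        label, separator, rest = line.partition(": ")
        if separator:
            # e.g. "Lic. Status: Issued" -> {"Lic. Status": "Issued"}
            labelled[label] = rest
        elif value is None:
            # The first unlabelled line is the name or address
            value = line
    return value, labelled


name, fields = split_row("AMIN NARGIS\nLic. Status: Issued")
street, street_fields = split_row("346 RITCHIE HIGHWAY")
```

Here `fields` holds the license status, while `street_fields` is empty because a pending license has no number on that row yet.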


In [23]:
for header in business_headers:
    business_brand = header.text.strip().split('\n')[1]
    print('business_brand:', business_brand)
    rows = header.find_next_siblings('tr')
    # Scraping business registration name
    business_name = rows[0].text.strip().split('\n')[0]
    print('business_name:', business_name)
    # Scraping street address
    address_street = rows[1].text.strip().split('\n')[0]
    print('address_street:', address_street)
    # Scraping city, state and ZIP code
    address_state = rows[2].text.strip().split('\n')[0]
    print('address_state', address_state)
    # Scraping county
    county = rows[3].text.strip().split('\n')[0]
    print('county:', county)
    # Scraping license status
    license_status = rows[0].text.strip().split(': ')[1]
    print('license_status:', license_status)
    # Scraping license number (pending licenses don't have one yet)
    try:
        license_no = rows[1].text.strip().split(': ')[1]
        print('license_no:', license_no)
    except IndexError:
        pass
    # Scraping license issued date (also missing for pending licenses)
    try:
        license_issued = rows[2].text.strip().split(': ')[1]
        print('license_issued:', license_issued)
    except IndexError:
        pass
    # Scraping details link (header.a is None when there is no link)
    try:
        details_url = header.a['href']
        print('details_url: ' + 'https://jportal.mdcourts.gov/license/' + details_url)
    except (TypeError, KeyError):
        pass
    print('-----')

business_brand: VAPE IT STORE I
business_name: AMIN NARGIS
address_street: 1724 N SALISBURY BLVD UNIT 2
address_state SALISBURY, MD 21801
county: Wicomico County
license_status: Issued
license_no: 22173807
license_issued: 4/27/2017
details_url: https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=%2B1bvyN7iH%2F4%3D
-----
business_brand: VAPE IT STORE II
business_name: AMIN NARGIS
address_street: 1015 S SALISBURY BLVD
address_state SALISBURY, MD 21801
county: Wicomico County
license_status: Issued
license_no: 22173808
license_issued: 4/27/2017
details_url: https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=5jnekbCeW7k%3D
-----
business_brand: VAPEPAD THE
business_name: ANJ DISTRIBUTIONS LLC
address_street: 2299 JOHNS HOPKINS ROAD
address_state GAMBRILLS, MD 21054
county: Anne Arundel County
license_status: Issued
license_no: 02104436
license_issued: 4/05/2017
details_url: https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=BlH4BpkdkBw%3D
-----
business_brand: 
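
The detail links in the page are relative, which is why the base address gets glued onto the front. A slightly more forgiving way to do that is `urllib.parse.urljoin` from the standard library, sketched here. `details_link` and `BASE_URL` are hypothetical names, and the base is assumed to be the same `/license/` folder used in the string concatenation above.

```python
from urllib.parse import urljoin

# Assumed base: the folder the relative detail links are resolved against
BASE_URL = "https://jportal.mdcourts.gov/license/"


def details_link(href, base=BASE_URL):
    """Return an absolute detail-page URL, or None when the row has no link."""
    if not href:
        return None
    # urljoin leaves already-absolute URLs alone and resolves relative ones
    return urljoin(base, href)


link = details_link("pbLicenseDetail.jsp?owi=BlH4BpkdkBw%3D")
missing = details_link(None)
```

Returning `None` for a missing link gives pending licenses an empty cell instead of needing a `try`/`except`.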

### Save these into `vape-results.csv`

In [24]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd


driver = webdriver.Firefox()
driver.get('https://jportal.mdcourts.gov/license/index_disclaimer.jsp')
checkbox = driver.find_element_by_id('checkbox')
checkbox.click()
enter_site = driver.find_element_by_css_selector('div.copy > input:nth-child(3)')
enter_site.click()
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('td.divider > a:nth-child(2)'))
license_records = driver.find_element_by_css_selector('td.divider > a:nth-child(2)')
license_records.click()
statewide = driver.find_element_by_css_selector('#slcJurisdiction > option:nth-child(2)')
statewide.click()
trade_name = driver.find_element_by_id('txtTradeName')
trade_name.send_keys('vap%')
search_form = driver.find_element_by_css_selector('body > table:nth-child(2) > tbody:nth-child(1) > tr:nth-child(4) > td:nth-child(2) > form:nth-child(7)')
search_form.submit()
WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_class_name('searchfieldtitle'))

doc = BeautifulSoup(driver.page_source, 'html.parser')

business_headers = doc.find_all('tr',class_='searchfieldtitle')

vape_page_one_list = []
for header in business_headers:
    vape_page_one_dict = {}
    # The header row holds the result number on one line and the trade name on the next
    business_brand = header.text.strip().split('\n')[1]
    vape_page_one_dict['business_brand'] = business_brand

    rows = header.find_next_siblings('tr')
    # Scraping business registration name
    business_name = rows[0].text.strip().split('\n')[0]
    vape_page_one_dict['business_name'] = business_name

    # Scraping street address
    address_street = rows[1].text.strip().split('\n')[0]
    vape_page_one_dict['address_street'] = address_street

    # Scraping city, state and ZIP code
    address_state = rows[2].text.strip().split('\n')[0]
    vape_page_one_dict['address_state'] = address_state

    # Scraping county
    county = rows[3].text.strip().split('\n')[0]
    vape_page_one_dict['county'] = county

    # Scraping license status
    # (note the trailing colon in the key: it ends up in the column name)
    license_status = rows[0].text.strip().split(': ')[1]
    vape_page_one_dict['license_status:'] = license_status

    # Scraping license number (pending licenses don't have one yet)
    try:
        license_no = rows[1].text.strip().split(': ')[1]
        vape_page_one_dict['license_no'] = license_no
    except IndexError:
        pass
    # Scraping license issued date (also missing for pending licenses)
    try:
        license_issued = rows[2].text.strip().split(': ')[1]
        vape_page_one_dict['license_issued'] = license_issued
    except IndexError:
        pass
    # Scraping details link (header.a is None when there is no link)
    try:
        details_url = header.a['href']
        vape_page_one_dict['details_url'] = 'https://jportal.mdcourts.gov/license/' + details_url
    except (TypeError, KeyError):
        pass
    vape_page_one_list.append(vape_page_one_dict)
    
#print(vape_page_one_list)     

df = pd.DataFrame(vape_page_one_list)
df.head()

| | address_state | address_street | business_brand | business_name | county | details_url | license_issued | license_no | license_status: |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801 | 1724 N SALISBURY BLVD UNIT 2 | VAPE IT STORE I | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173807.0 | Issued |
| 1 | SALISBURY, MD 21801 | 1015 S SALISBURY BLVD | VAPE IT STORE II | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173808.0 | Issued |
| 2 | GAMBRILLS, MD 21054 | 2299 JOHNS HOPKINS ROAD | VAPEPAD THE | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | 2104436.0 | Issued |
| 3 | CHESTER, MD 21619 | 110 S. PINEY RD | VAPE FROG | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | 17165957.0 | Issued |
| 4 | SEVERNA PARK, MD 21146 | 346 RITCHIE HIGHWAY | VAPE FROG | COX TRADING LLC | Anne Arundel County | | | | Pending |


In [25]:
df.dtypes

address_state      object
address_street     object
business_brand     object
business_name      object
county             object
details_url        object
license_issued     object
license_no         object
license_status:    object
dtype: object

In [26]:
df.to_csv('vape-results.csv', index=False)

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [27]:
pd.read_csv('vape-results.csv', converters = {'license_no':str})

| | address_state | address_street | business_brand | business_name | county | details_url | license_issued | license_no | license_status: |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801 | 1724 N SALISBURY BLVD UNIT 2 | VAPE IT STORE I | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173807.0 | Issued |
| 1 | SALISBURY, MD 21801 | 1015 S SALISBURY BLVD | VAPE IT STORE II | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173808.0 | Issued |
| 2 | GAMBRILLS, MD 21054 | 2299 JOHNS HOPKINS ROAD | VAPEPAD THE | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | 2104436.0 | Issued |
| 3 | CHESTER, MD 21619 | 110 S. PINEY RD | VAPE FROG | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | 17165957.0 | Issued |
| 4 | SEVERNA PARK, MD 21146 | 346 RITCHIE HIGHWAY | VAPE FROG | COX TRADING LLC | Anne Arundel County | | | | Pending |
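
License numbers such as `02104436` are identifiers, not quantities, which is the reason for passing `converters = {'license_no':str}` when reading the file back. The sketch below uses only the standard library's `csv` module (not pandas) to show that the file itself keeps the leading zero and that it is the conversion to a number that loses it. `round_trip` is a hypothetical helper and the two rows are a cut-down copy of the results above.

```python
import csv
import io


def round_trip(rows, fieldnames):
    """Write rows to an in-memory CSV and read them back as dictionaries of strings."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


rows = [
    {"business_brand": "VAPEPAD THE", "license_no": "02104436"},
    {"business_brand": "VAPE FROG", "license_no": ""},  # pending, no number yet
]
read_back = round_trip(rows, ["business_brand", "license_no"])

as_text = read_back[0]["license_no"]  # the leading zero survives as text
as_number = float(as_text)            # a numeric conversion drops it
```

A reader that infers column types, as `pd.read_csv` does by default, effectively performs the second conversion, so keeping the column as text is the safer choice for anything that will be looked up later.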


## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [28]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd

# Main scraping function: parses the page currently loaded in the browser.
# It appends each result to vape_page_all_list, so the driver and the
# empty list must both be created outside the function before it is called.
def scrape_vapes():
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr',class_='searchfieldtitle')

    for header in business_headers:
        vape_dict = {}
        # The header row holds the result number on one line and the trade name on the next
        business_brand = header.text.strip().split('\n')[1]
        vape_dict['business_brand'] = business_brand

        rows = header.find_next_siblings('tr')
        # Scraping business registration name
        business_name = rows[0].text.strip().split('\n')[0]
        vape_dict['business_name'] = business_name

        # Scraping street address
        address_street = rows[1].text.strip().split('\n')[0]
        vape_dict['address_street'] = address_street

        # Scraping city, state and ZIP code
        address_state = rows[2].text.strip().split('\n')[0]
        vape_dict['address_state'] = address_state

        # Scraping county
        county = rows[3].text.strip().split('\n')[0]
        vape_dict['county'] = county

        # Scraping license status
        # (note the trailing colon in the key: it ends up in the column name)
        license_status = rows[0].text.strip().split(': ')[1]
        vape_dict['license_status:'] = license_status

        # Scraping license number (pending licenses don't have one yet)
        try:
            license_no = rows[1].text.strip().split(': ')[1]
            vape_dict['license_no'] = license_no
        except IndexError:
            pass
        # Scraping license issued date (also missing for pending licenses)
        try:
            license_issued = rows[2].text.strip().split(': ')[1]
            vape_dict['license_issued'] = license_issued
        except IndexError:
            pass
        # Scraping details link (header.a is None when there is no link)
        try:
            details_url = header.a['href']
            vape_dict['details_url'] = 'https://jportal.mdcourts.gov/license/' + details_url
        except (TypeError, KeyError):
            pass
        vape_page_all_list.append(vape_dict)
        
# Function to navigate to results page with selenium
def get_results_page():
    driver.get('https://jportal.mdcourts.gov/license/index_disclaimer.jsp')
    checkbox = driver.find_element_by_id('checkbox')
    checkbox.click()
    enter_site = driver.find_element_by_css_selector('div.copy > input:nth-child(3)')
    enter_site.click()
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_css_selector('td.divider > a:nth-child(2)'))
    license_records = driver.find_element_by_css_selector('td.divider > a:nth-child(2)')
    license_records.click()
    statewide = driver.find_element_by_css_selector('#slcJurisdiction > option:nth-child(2)')
    statewide.click()
    trade_name = driver.find_element_by_id('txtTradeName')
    trade_name.send_keys('vap%')
    search_form = driver.find_element_by_css_selector('body > table:nth-child(2) > tbody:nth-child(1) > tr:nth-child(4) > td:nth-child(2) > form:nth-child(7)')
    search_form.submit()
    WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_class_name('searchfieldtitle'))
        
# Function to click through next pages with selenium
# and on each new page to call the main scraping function
def click_next_scrape():
    for i in range(1,10):
        WebDriverWait(driver, 100).until( lambda driver: driver.find_element_by_class_name('searchfieldtitle'))
        try:
            next_button = driver.find_element_by_link_text('Next »')
            next_button.click()
            scrape_vapes()
        except:
            pass

# Separate function to create the pandas DataFrame from our new list of dictionaries
# and generate the CSV file
def create_csv():
    df_vape_results_all = pd.DataFrame(vape_page_all_list)
    df_vape_results_all.to_csv('vape-results-all.csv', index=False)

    
# Putting everything together
# First we call the browser
driver = webdriver.Firefox()
# We then get to result page
get_results_page()
# Creating the empty list our function will append the scraped data
vape_page_all_list = []
# Calling main scraping function for first page of results
scrape_vapes()
# Click next button and scrape the results of that page
click_next_scrape()
# Generate DataFrame and CSV
create_csv()
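
Each run of a cell like this one opens a new Firefox window and nothing closes it. One possible pattern, sketched here under assumptions, is to wrap the work in `try`/`finally` so that `driver.quit()` always runs. `run_with_driver` is a hypothetical helper, `make_driver` stands for something like `webdriver.Firefox`, and `job` stands for a function that does the navigating and scraping with the driver it is given.

```python
def run_with_driver(make_driver, job):
    """Create a browser, run job(driver), and always close the browser afterwards."""
    driver = make_driver()
    try:
        return job(driver)
    finally:
        # quit() closes every window and ends the WebDriver session,
        # even when job() raised an exception
        driver.quit()
```

With the functions above this would mean passing the driver in as an argument rather than relying on a global, for example `run_with_driver(webdriver.Firefox, my_job)`, where `my_job` is a placeholder name.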

In [29]:
df_new = pd.read_csv('vape-results-all.csv', converters = {'license_no':str})
df_new.head()

| | address_state | address_street | business_brand | business_name | county | details_url | license_issued | license_no | license_status: |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801 | 1724 N SALISBURY BLVD UNIT 2 | VAPE IT STORE I | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173807.0 | Issued |
| 1 | SALISBURY, MD 21801 | 1015 S SALISBURY BLVD | VAPE IT STORE II | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | 22173808.0 | Issued |
| 2 | GAMBRILLS, MD 21054 | 2299 JOHNS HOPKINS ROAD | VAPEPAD THE | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | 2104436.0 | Issued |
| 3 | CHESTER, MD 21619 | 110 S. PINEY RD | VAPE FROG | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | 17165957.0 | Issued |
| 4 | SEVERNA PARK, MD 21146 | 346 RITCHIE HIGHWAY | VAPE FROG | COX TRADING LLC | Anne Arundel County | | | | Pending |


In [30]:
df_new.shape

(32, 9)

In [31]:
df_new.dtypes

address_state      object
address_street     object
business_brand     object
business_name      object
county             object
details_url        object
license_issued     object
license_no         object
license_status:    object
dtype: object