# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem is that you have to check a box before you can get in.

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

To press Enter:

from selenium.webdriver.common.keys import Keys

# find_element (singular) returns one element; find_elements (plural)
# returns a list, and a list has no send_keys method
element = driver.find_element_by_class_name("q")
element.send_keys(Keys.RETURN)

To use Select:

from selenium.webdriver.support.ui import Select

select = Select(driver.find_element_by_name('phy_city'))
select.select_by_visible_text('Houston')


**It isn't going to work, though! It's going to redirect to that intro page.** You can use *Incognito mode* to go back through the "Check the box, etc" series of pages.
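One way to confirm the redirect from code rather than by eye is to compare the address Selenium ends up on with the one it asked for. This is a minimal sketch: `landed_on_disclaimer` is a hypothetical helper name, and it assumes the intro page's address contains `index_disclaimer`, as the URL used later in this notebook suggests. If the site redirects with JavaScript after the page loads, the check may run too early.

```python
SEARCH_URL = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"


def landed_on_disclaimer(driver, url=SEARCH_URL):
    """Visit url and report whether the site bounced us to the disclaimer page.

    Assumption: the disclaimer page's address contains "index_disclaimer".
    """
    driver.get(url)
    # current_url is the address the browser is on once the page has loaded
    return "index_disclaimer" in driver.current_url
```

With a real browser this would be called as `landed_on_disclaimer(driver)` right after creating the driver.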

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
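To make the list of strategies concrete, here is a sketch of three equivalent ways to describe the disclaimer checkbox, whose `id` is `checkbox` according to the XPath used in the cells below. `find_first` and `CHECKBOX_LOCATORS` are hypothetical names. The strategy strings are the values behind Selenium's `By.ID`, `By.CSS_SELECTOR` and `By.XPATH` constants, which `driver.find_element(by, value)` accepts.

```python
# Three equivalent locators for the same element, as (strategy, value) pairs.
CHECKBOX_LOCATORS = [
    ("id", "checkbox"),
    ("css selector", "#checkbox"),
    ("xpath", '//*[@id="checkbox"]'),
]


def find_first(driver, locators):
    """Return the element matched by the first locator that works."""
    last_error = None
    for by, value in locators:
        try:
            return driver.find_element(by, value)
        except Exception as error:  # Selenium raises NoSuchElementException here
            last_error = error
    raise LookupError(f"no locator matched: {locators}") from last_error
```

Usage would look like `find_first(driver, CHECKBOX_LOCATORS).click()`. An ID is usually the sturdiest choice, since it does not depend on where the element sits in the page.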

In [3]:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://jportal.mdcourts.gov/license/index_disclaimer.jsp')

In [4]:
check_box = driver.find_element_by_xpath('//*[@id="checkbox"]')

In [5]:
check_box.click()

In [6]:
enter_box = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')

In [7]:
enter_box.click()

In [8]:
from selenium.webdriver.support.ui import Select

# Clicking this image link takes us on to the search form
image = driver.find_element_by_xpath('/html/body/table[1]/tbody/tr[2]/td[2]/table/tbody/tr/td[3]/a/img')

In [9]:
image.click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [10]:
# I will click it (the image link clicked in the cell above)

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [11]:
jurisdiction = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')

In [12]:
# Typing an option's text into a <select> usually picks the matching option
jurisdiction.send_keys("Statewide")

### How do you type "vap%" into the Trade Name field?

In [13]:
# One lookup is enough (and naming a variable `type` would shadow the built-in)
search_input = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
search_input.send_keys("vap%")
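In a notebook it is easy to run a cell like this twice, and `send_keys` appends to whatever is already in the box, so the field would end up holding `vap%vap%`. A small sketch of a guard against that follows. `fill_field` is a hypothetical helper, and `"xpath"` is the string value of `By.XPATH`.

```python
def fill_field(driver, xpath, text):
    """Replace whatever is in a text input with text."""
    field = driver.find_element("xpath", xpath)
    field.clear()          # empty the box so re-running the cell does not append
    field.send_keys(text)
    return field
```

For the Trade Name box that would be `fill_field(driver, '//*[@id="txtTradeName"]', "vap%")`.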

### How do you click the submit button or submit the form?

In [14]:
submit = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
submit.click()

### How can you find and click the 'Next' button on the search results page?

In [15]:
while True:
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()
    except Exception:
        # The click failed, most likely because the last page has no Next link
        break
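A `while True` loop like this one depends entirely on the click eventually failing. As a hedged alternative, the sketch below puts an upper bound on the number of clicks and reports how many succeeded. `click_until_gone` and `max_clicks` are hypothetical names, not part of Selenium.

```python
def click_until_gone(driver, xpath, max_clicks=100):
    """Click the element at xpath repeatedly; return how many clicks succeeded."""
    clicks = 0
    while clicks < max_clicks:
        try:
            # "xpath" is the string value of Selenium's By.XPATH
            driver.find_element("xpath", xpath).click()
        except Exception:
            break  # nothing left to click
        clicks += 1
    return clicks
```

The return value doubles as a sanity check: the number of clicks plus one should match the number of result pages the site reports.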


# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [16]:
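# Next: keep clicking until the click fails
while True:
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()
    except Exception:
        # Most likely there is no Next link, so this is the last page
        break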

In [17]:
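# Back: keep clicking until the click fails
while True:
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[1]/a/nobr').click()
    except Exception:
        # Most likely there is no Back link, so this is the first page
        break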

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, 'html.parser')`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [18]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [19]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5
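The loop in the next cell relies on `find_next_siblings('tr')`, which returns every later row in the table, including the rows that belong to other businesses. That works as long as each record has the same number of rows. The sketch below shows one way to keep only the rows up to the next header. The HTML is a made-up, simplified stand-in for the real page, and `rows_for` is a hypothetical helper.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Hypothetical markup: the real page's rows carry more cells than this.
SAMPLE_HTML = """
<table>
  <tr class="searchfieldtitle"><td>1.</td><td>VAPE IT STORE I</td></tr>
  <tr><td></td><td>AMIN NARGIS</td></tr>
  <tr><td></td><td>1724 N SALISBURY BLVD UNIT 2</td></tr>
  <tr class="searchfieldtitle"><td>2.</td><td>VAPE IT STORE II</td></tr>
  <tr><td></td><td>AMIN NARGIS</td></tr>
</table>
"""


def rows_for(header):
    """Return only the rows between this header and the next header row."""
    def is_detail_row(row):
        return "searchfieldtitle" not in (row.get("class") or [])
    return list(takewhile(is_detail_row, header.find_next_siblings("tr")))


doc = BeautifulSoup(SAMPLE_HTML, "html.parser")
first, second = doc.find_all("tr", class_="searchfieldtitle")
print(len(first.find_next_siblings("tr")))  # 4: every later row, other records included
print(len(rows_for(first)))                 # 2: just this record's rows
```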

In [20]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
vape_results = []

for header in business_headers:
    dic = {}
    rows = header.find_next_siblings('tr')
    dic['Header'] = header.find_all('td')[1].text.strip()
    print("HEADER is", header.text.strip())
    link = header.find('a')
    if link:
        url = link['href']
        dic['Details'] = "https://jportal.mdcourts.gov/license/" + url
        # Print inside the if: some rows have no detail link, and printing
        # outside would reuse the previous row's url
        print("More details", dic['Details'])
    dic['name'] = rows[0].find_all('td')[1].text.strip()
    print("ROW 0 IS", rows[0].text.strip())
    dic['Status'] = rows[0].find_all('td')[2].text.strip().replace('Lic. Status: ', '')
    print("ROW 1 IS", rows[1].text.strip())
    dic['Address'] = rows[1].find_all('td')[1].text.strip() + rows[2].find_all('td')[1].text.strip()
    print("ROW 2 IS", rows[2].text.strip())
    dic['County'] = rows[3].text.strip()
    print("ROW 3 IS", rows[3].text.strip())
    dic['lisence'] = rows[1].find_all('td')[2].text.strip().replace('License: ', '')
    dic['Issued Date'] = rows[2].find_all('td')[2].text.strip().replace('Issued Date: ', '')
    vape_results.append(dic)
    print("----")

HEADER is 1.
VAPE IT STORE I
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=LdzVt5WuXSE%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 2.
VAPE IT STORE II
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=53oEnanNtSQ%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 3.
VAPEPAD THE
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=oCr%2FKGeVmPg%3D
ROW 0 IS ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 IS 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 IS GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 IS Anne Arundel County
----
HEADER is 4.
VAPE FROG
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=WmjynRfJAF8
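The detail links on the page are relative, which is why the loop glues a prefix onto each `href`. A sketch of an alternative is `urllib.parse.urljoin`, which resolves a relative link against the address of the page it came from. `PAGE_URL` below is the search page named at the top of this notebook. In a live session `driver.current_url` could play that role, on the assumption that the results page lives in the same `/license/` folder.

```python
from urllib.parse import urljoin

PAGE_URL = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"

# href as it appears in the first result above, without the prefix
href = "pbLicenseDetail.jsp?owi=LdzVt5WuXSE%3D"

detail_url = urljoin(PAGE_URL, href)
print(detail_url)
# https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=LdzVt5WuXSE%3D
```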

### Save these into `vape-results.csv`

In [21]:
import pandas as pd
df = pd.DataFrame(vape_results)

In [22]:
df.to_csv("vape-results.csv", index = False)


### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [23]:
df = pd.read_csv("vape-results.csv")
df.head()
#len(df)

Unnamed: 0,Address,County,Details,Header,Issued Date,Status,lisence,name
0,"1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801",Wicomico County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE IT STORE I,4/27/2017,Issued,22173807.0,AMIN NARGIS
1,"1015 S SALISBURY BLVDSALISBURY, MD 21801",Wicomico County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE IT STORE II,4/27/2017,Issued,22173808.0,AMIN NARGIS
2,"2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054",Anne Arundel County,https://jportal.mdcourts.gov/license/pbLicense...,VAPEPAD THE,4/05/2017,Issued,2104436.0,ANJ DISTRIBUTIONS LLC
3,"110 S. PINEY RDCHESTER, MD 21619",Queen Anne's County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE FROG,5/31/2017,Issued,17165957.0,COX TRADING COMPANY L L C
4,"346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146",Anne Arundel County,,VAPE FROG,,Pending,,COX TRADING LLC
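In the table above, the `lisence` column came back as floats (`22173807.0`), and license `02104436` lost its leading zero. That happens when `read_csv` guesses the column is numeric. A minimal sketch of one fix is to ask for that column as text with the `dtype` parameter. The two-row CSV here is a cut-down stand-in for `vape-results.csv`.

```python
import io

import pandas as pd

# Stand-in for the saved file: one issued license, one pending (blank) license
csv_text = "Header,lisence\nVAPEPAD THE,02104436\nVAPE FROG,\n"

guessed = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"lisence": str})

print(guessed["lisence"].iloc[0])   # read as a number, leading zero gone
print(as_text["lisence"].iloc[0])   # read as text, leading zero kept
```

For the real file that would be `pd.read_csv("vape-results.csv", dtype={"lisence": str})`.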


## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [24]:
vape_results_all = []

while True:
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    for header in business_headers:
        dic = {}
        rows = header.find_next_siblings('tr')
        dic['Header'] = header.find_all('td')[1].text.strip()
        print("HEADER is", header.text.strip())
        link = header.find('a')
        if link:
            url = header.find('a')['href']
            dic['Details'] = "https://jportal.mdcourts.gov/license/" +url
            print("More details", "https://jportal.mdcourts.gov/license/" +url)
        dic['name'] = rows[0].find_all('td')[1].text.strip()
        print("ROW 0 IS", rows[0].text.strip())
        dic['Status'] = rows[0].find_all('td')[2].text.strip().replace('Lic. Status: ', '')
        print("ROW 1 IS", rows[1].text.strip())
        dic['Address'] = rows[1].find_all('td')[1].text.strip() + rows[2].find_all('td')[1].text.strip()
        print("ROW 2 IS", rows[2].text.strip())
        dic['County'] = rows[3].text.strip()
        print("ROW 3 IS", rows[3].text.strip())
        dic['lisence'] = rows[1].find_all('td')[2].text.strip().replace('License: ', '')
        dic['Issued Date'] = rows[2].find_all('td')[2].text.strip().replace('Issued Date: ', '')
        vape_results_all.append(dic)
        print("----")
    try:
        driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()
    except Exception:
        # Most likely there is no Next link, so this was the last page
        break

HEADER is 1.
VAPE IT STORE I
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=LdzVt5WuXSE%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 2.
VAPE IT STORE II
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=53oEnanNtSQ%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 3.
VAPEPAD THE
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=oCr%2FKGeVmPg%3D
ROW 0 IS ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 IS 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 IS GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 IS Anne Arundel County
----
HEADER is 4.
VAPE FROG
More details https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=WmjynRfJAF8
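The cell above mixes two jobs: pulling records out of one page and moving on to the next page. A hedged sketch of how they could be separated follows. `scrape_all_pages` is a hypothetical helper, and `parse_page` stands for any function that takes a page's HTML and returns a list of dicts, such as the body of the `for header in business_headers` loop wrapped in a function. The sketch also stops if a click leaves the page unchanged, which assumes the site returns identical HTML when nothing has moved.

```python
def scrape_all_pages(driver, parse_page, next_xpath, max_pages=100):
    """Parse each results page, clicking Next in between."""
    results = []
    previous_html = None
    for _ in range(max_pages):
        html = driver.page_source
        if html == previous_html:
            break  # the click did not change the page, so do not read it twice
        results.extend(parse_page(html))
        previous_html = html
        try:
            # "xpath" is the string value of Selenium's By.XPATH
            driver.find_element("xpath", next_xpath).click()
        except Exception:
            break  # no Next link: this was the last page
    return results
```

Keeping the parsing in its own function also means it can be tried out on a saved copy of `driver.page_source` without opening a browser.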

In [25]:
df_all = pd.DataFrame(vape_results_all)

In [26]:
df_all.to_csv("vape-results-all.csv", index=False)
df_all = pd.read_csv("vape-results-all.csv")

In [27]:
df_all

Unnamed: 0,Address,County,Details,Header,Issued Date,Status,lisence,name
0,"1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801",Wicomico County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE IT STORE I,4/27/2017,Issued,22173807.0,AMIN NARGIS
1,"1015 S SALISBURY BLVDSALISBURY, MD 21801",Wicomico County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE IT STORE II,4/27/2017,Issued,22173808.0,AMIN NARGIS
2,"2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054",Anne Arundel County,https://jportal.mdcourts.gov/license/pbLicense...,VAPEPAD THE,4/05/2017,Issued,2104436.0,ANJ DISTRIBUTIONS LLC
3,"110 S. PINEY RDCHESTER, MD 21619",Queen Anne's County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE FROG,5/31/2017,Issued,17165957.0,COX TRADING COMPANY L L C
4,"346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146",Anne Arundel County,,VAPE FROG,,Pending,,COX TRADING LLC
5,"185 MITCHELLS CHANCE RDEDGEWATER, MD 21037",Anne Arundel County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE LOFT (THE),4/13/2017,Issued,2102408.0,DISBROW II EMERSON HARRINGTON
6,"7104 MINSTREL UNIT #7COLUMBIA, MD 21045",Howard County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE N CIGAR,5/19/2017,Issued,13141786.0,DISCOUNT TOBACCO ESSEX LLC
7,"330 ONE FORTY VILLAGE ROADWESTMINSTER, MD 21157",Carroll County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE DOJO,4/21/2017,Issued,6126253.0,FAIRGROUND VILLAGE LLC
8,"29890 THREE NOTCH ROADCHARLOTTE HALL, MD 20622",St. Mary's County,,VAPE HAVEN,,Pending,,GRIMM JENNIFER
9,"356 ROMANCOKE ROADSTEVENSVILLE, MD 21666",Queen Anne's County,https://jportal.mdcourts.gov/license/pbLicense...,VAPE BIRD,4/13/2017,Issued,17166688.0,HUTCH VAPES LLC


In [28]:
len(df_all)

32