# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before you can get in. By hand, the process looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [29]:
#https://jportal.mdcourts.gov/license/pbPublicSearch.jsp

**It isn't going to work, though! It's going to redirect to that intro page.** You can use *Incognito mode* to go back through the "Check the box, etc" series of pages.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
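
To see why an XPath built on a unique attribute tends to be a safer answer than one that spells out the element's position, here is a small sketch using Python's built-in `xml.etree.ElementTree`, which understands a limited subset of XPath. The markup is a made-up, simplified stand-in for the disclaimer page, not the real page source, and ElementTree is used only so the example runs without a browser. In Selenium the same kind of expression is what you would hand to `find_element_by_xpath`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the disclaimer page (not the real markup).
PAGE = """
<html>
  <body>
    <table>
      <tr><td>Disclaimer text</td></tr>
      <tr><td>
        <form>
          <input type="checkbox" id="checkbox" name="agree" />
          <input type="submit" value="Enter the Site" />
        </form>
      </td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Attribute-based: "any element whose id is 'checkbox'", wherever it sits.
by_id = root.find(".//*[@id='checkbox']")

# Positional: spell out every step from the top. This stops working as soon
# as someone adds a row or wraps the form in another tag.
by_position = root.find("./body/table/tr[2]/td/form/input[1]")

print(by_id.attrib["name"])    # agree
print(by_id is by_position)    # True: both paths reach the same element
```

Both expressions find the checkbox here, but only the first one keeps working if the layout around it changes.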

In [30]:
#.find_element_by_xpath('//*[@id="checkbox"]').click()

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [31]:
#.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]').click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [32]:
#driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]').click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [33]:
#select = Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]'))
#select.select_by_visible_text('Statewide')

### How do you type "vap%" into the Trade Name field?

In [34]:
#search_input = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
# search_input.send_keys('vap%')

### How do you click the submit button or submit the form?

In [35]:
# driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]').click()

### How can you find and click the 'Next' button on the search results page?

In [36]:
#driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr').click()

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [37]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

In [38]:
driver = webdriver.Chrome()

In [39]:
driver.get("https://jportal.mdcourts.gov/license/pbPublicSearch.jsp")

In [40]:
# .click() returns None, so keep the element and click it in a separate step
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
checkbox.click()

In [41]:
select_button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
select_button.click()

In [42]:
# Avoid the name `license`: it is a Python built-in, and echoing it prints Python's own license notice
license_link = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
license_link.click()

In [43]:
select = Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]'))
select.select_by_visible_text('Statewide')

In [44]:
search_input = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
search_input.send_keys('vap%')

In [45]:
submit_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
submit_button.click()

In [46]:
next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
#next_button.click()

In [47]:
from bs4 import BeautifulSoup
doc=BeautifulSoup(driver.page_source, 'html.parser')

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, 'html.parser')`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [48]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5

In [49]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
my_list = []
for header in business_headers:
    current = {}
    rows = header.find_next_siblings('tr')
    #print("HEADER is", header.text.strip())
    current['trade_name']= header.find_all('td')[1].text.strip()
    #print("ROW 0 IS", rows[0].text.strip())
    current['name']= rows[0].find_all('td')[1].text.strip()
    current['status'] = rows[0].find_all('td')[2].text.strip()
    #print("ROW 1 IS", rows[1].text.strip())
    current['license']= rows[1].find_all('td')[2].text.strip()
    current['address']= rows[1].find_all('td')[1].text.strip()
    #print("ROW 2 IS", rows[2].text.strip())
    current['date'] = rows[2].find_all('td')[2].text.strip()
    current['city'] = rows[2].find_all('td')[1].text.strip()
    #print("ROW 3 IS", rows[3].text.strip())
    current['county'] = rows[3].text.strip()
    my_list.append(current)
    #print("----")

In [50]:
my_list

[{'address': '1015 S SALISBURY BLVD',
  'city': 'SALISBURY, MD 21801',
  'county': 'Wicomico County',
  'date': 'Issued Date: 4/27/2017',
  'license': 'License: 22173808',
  'name': 'AMIN NARGIS',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPE IT STORE II'},
 {'address': '1724 N SALISBURY BLVD UNIT 2',
  'city': 'SALISBURY, MD 21801',
  'county': 'Wicomico County',
  'date': 'Issued Date: 4/27/2017',
  'license': 'License: 22173807',
  'name': 'AMIN NARGIS',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPE IT STORE I'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'city': 'GAMBRILLS, MD 21054',
  'county': 'Anne Arundel County',
  'date': 'Issued Date: 4/05/2017',
  'license': 'License: 02104436',
  'name': 'ANJ DISTRIBUTIONS LLC',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPEPAD THE'},
 {'address': '110 S. PINEY RD',
  'city': 'CHESTER, MD 21619',
  'county': "Queen Anne's County",
  'date': 'Issued Date: 5/31/2017',
  'license': 'License: 17165957',
  'name'
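
The scraped values still carry their on-page labels (`'License: 22173808'`, `'Issued Date: 4/27/2017'`, `'Lic. Status: Issued'`). Below is a minimal sketch of stripping them, assuming every label ends at the first colon. `strip_label`, `clean_record`, and `LABELLED_FIELDS` are hypothetical names, not part of the original notebook. License numbers are kept as strings so that leading zeros such as `02104436` survive.

```python
# Fields whose scraped text starts with an on-page label such as "License:".
LABELLED_FIELDS = ("license", "date", "status")

def strip_label(value):
    """Return the text after the first ':' (or the value unchanged if there is none)."""
    label, sep, rest = value.partition(":")
    return rest.strip() if sep else value.strip()

def clean_record(record):
    """Copy a scraped record, removing the labels from the labelled fields."""
    cleaned = dict(record)
    for field in LABELLED_FIELDS:
        cleaned[field] = strip_label(record.get(field, ""))
    return cleaned

# One of the records from the output above, trimmed to the relevant keys.
sample = {
    "trade_name": "VAPEPAD THE",
    "license": "License: 02104436",
    "date": "Issued Date: 4/05/2017",
    "status": "Lic. Status: Issued",
}
print(clean_record(sample))
```

Pending licenses come back with an empty date and license number; `strip_label` leaves those as empty strings rather than raising an error.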

### Save these into `vape-results.csv`

In [51]:
import pandas as pd
df = pd.DataFrame(my_list)
df.head()

                        address                    city               county  \
0         1015 S SALISBURY BLVD     SALISBURY, MD 21801      Wicomico County   
1  1724 N SALISBURY BLVD UNIT 2     SALISBURY, MD 21801      Wicomico County   
2       2299 JOHNS HOPKINS ROAD     GAMBRILLS, MD 21054  Anne Arundel County   
3               110 S. PINEY RD       CHESTER, MD 21619  Queen Anne's County   
4           346 RITCHIE HIGHWAY  SEVERNA PARK, MD 21146  Anne Arundel County   

                     date            license                       name  \
0  Issued Date: 4/27/2017  License: 22173808                AMIN NARGIS   
1  Issued Date: 4/27/2017  License: 22173807                AMIN NARGIS   
2  Issued Date: 4/05/2017  License: 02104436      ANJ DISTRIBUTIONS LLC   
3  Issued Date: 5/31/2017  License: 17165957  COX TRADING COMPANY L L C   
4                                                       COX TRADING LLC   

                 status        trade_n

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [52]:
df.to_csv("vape-results.csv" , index = False)
vape_df = pd.read_csv('vape-results.csv')
vape_df.head()

|   | address | city | county | date | license | name | status | trade_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173808 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173807 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | Anne Arundel County | Issued Date: 4/05/2017 | License: 02104436 | ANJ DISTRIBUTIONS LLC | Lic. Status: Issued | VAPEPAD THE |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | Queen Anne's County | Issued Date: 5/31/2017 | License: 17165957 | COX TRADING COMPANY L L C | Lic. Status: Issued | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | Anne Arundel County |  |  | COX TRADING LLC | Lic. Status: Pending | VAPE FROG |


## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [53]:
my_final_list =[]
while True:
    try:
        doc=BeautifulSoup(driver.page_source, 'html.parser')
        biz_headers = doc.find_all('tr',class_='searchfieldtitle')
        for header in biz_headers:
            current = {}
            rows = header.find_next_siblings('tr')
            current['trade_name']= header.find_all('td')[1].text.strip()
            current['name']= rows[0].find_all('td')[1].text.strip()
            current['status'] = rows[0].find_all('td')[2].text.strip()
            current['license']= rows[1].find_all('td')[2].text.strip()
            current['address']= rows[1].find_all('td')[1].text.strip()
            current['date'] = rows[2].find_all('td')[2].text.strip()
            current['city'] = rows[2].find_all('td')[1].text.strip()
            current['county'] = rows[3].text.strip()
            my_final_list.append(current)
        next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
        next_button.click()
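    except Exception:
        # Expected on the last page, where the "Next" link can't be found.
        # (A bare `except:` would also swallow Ctrl-C.)
        break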
my_final_list

[{'address': '1015 S SALISBURY BLVD',
  'city': 'SALISBURY, MD 21801',
  'county': 'Wicomico County',
  'date': 'Issued Date: 4/27/2017',
  'license': 'License: 22173808',
  'name': 'AMIN NARGIS',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPE IT STORE II'},
 {'address': '1724 N SALISBURY BLVD UNIT 2',
  'city': 'SALISBURY, MD 21801',
  'county': 'Wicomico County',
  'date': 'Issued Date: 4/27/2017',
  'license': 'License: 22173807',
  'name': 'AMIN NARGIS',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPE IT STORE I'},
 {'address': '2299 JOHNS HOPKINS ROAD',
  'city': 'GAMBRILLS, MD 21054',
  'county': 'Anne Arundel County',
  'date': 'Issued Date: 4/05/2017',
  'license': 'License: 02104436',
  'name': 'ANJ DISTRIBUTIONS LLC',
  'status': 'Lic. Status: Issued',
  'trade_name': 'VAPEPAD THE'},
 {'address': '110 S. PINEY RD',
  'city': 'CHESTER, MD 21619',
  'county': "Queen Anne's County",
  'date': 'Issued Date: 5/31/2017',
  'license': 'License: 17165957',
  'name'
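
The loop above mixes three jobs: parsing the current page, moving to the next one, and deciding when to stop. One possible restructuring, offered as a sketch rather than the notebook's method, passes the first two in as functions and adds a page cap so a mistake can't loop forever. `collect_all_pages`, `parse_page`, `go_to_next_page`, and `max_pages` are all hypothetical names. With Selenium, `go_to_next_page` would click the Next link and return `False` once the link can no longer be found.

```python
def collect_all_pages(parse_page, go_to_next_page, max_pages=100):
    """Collect records page by page until there is no next page.

    parse_page() returns a list of records for the current page.
    go_to_next_page() returns True if it moved on, False on the last page.
    """
    records = []
    for _ in range(max_pages):
        records.extend(parse_page())
        if not go_to_next_page():
            break
    return records


# Stand-in "site" with three pages, so the loop can be tried without a browser.
fake_pages = [["A", "B"], ["C", "D"], ["E"]]
position = {"page": 0}

def parse_fake_page():
    return fake_pages[position["page"]]

def next_fake_page():
    if position["page"] + 1 >= len(fake_pages):
        return False  # already on the last page
    position["page"] += 1
    return True

print(collect_all_pages(parse_fake_page, next_fake_page))
# ['A', 'B', 'C', 'D', 'E']
```

Because the stopping rule is an explicit return value instead of an exception, a parsing error on some page surfaces as an error instead of silently ending the scrape early.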

In [54]:
len(my_final_list)

32

In [55]:
df = pd.DataFrame(my_final_list)
df.head()

|   | address | city | county | date | license | name | status | trade_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173808 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173807 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | Anne Arundel County | Issued Date: 4/05/2017 | License: 02104436 | ANJ DISTRIBUTIONS LLC | Lic. Status: Issued | VAPEPAD THE |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | Queen Anne's County | Issued Date: 5/31/2017 | License: 17165957 | COX TRADING COMPANY L L C | Lic. Status: Issued | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | Anne Arundel County |  |  | COX TRADING LLC | Lic. Status: Pending | VAPE FROG |


In [56]:
df.to_csv("vape-results-all.csv" , index = False)
vape_df = pd.read_csv('vape-results-all.csv')
vape_df.head(32)

|   | address | city | county | date | license | name | status | trade_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173808 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | Wicomico County | Issued Date: 4/27/2017 | License: 22173807 | AMIN NARGIS | Lic. Status: Issued | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | Anne Arundel County | Issued Date: 4/05/2017 | License: 02104436 | ANJ DISTRIBUTIONS LLC | Lic. Status: Issued | VAPEPAD THE |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | Queen Anne's County | Issued Date: 5/31/2017 | License: 17165957 | COX TRADING COMPANY L L C | Lic. Status: Issued | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | Anne Arundel County |  |  | COX TRADING LLC | Lic. Status: Pending | VAPE FROG |
| 5 | 185 MITCHELLS CHANCE RD | EDGEWATER, MD 21037 | Anne Arundel County | Issued Date: 4/13/2017 | License: 02102408 | DISBROW II EMERSON HARRINGTON | Lic. Status: Issued | VAPE LOFT (THE) |
| 6 | 7104 MINSTREL UNIT #7 | COLUMBIA, MD 21045 | Howard County | Issued Date: 5/19/2017 | License: 13141786 | DISCOUNT TOBACCO ESSEX LLC | Lic. Status: Issued | VAPE N CIGAR |
| 7 | 330 ONE FORTY VILLAGE ROAD | WESTMINSTER, MD 21157 | Carroll County | Issued Date: 4/21/2017 | License: 06126253 | FAIRGROUND VILLAGE LLC | Lic. Status: Issued | VAPE DOJO |
| 8 | 29890 THREE NOTCH ROAD | CHARLOTTE HALL, MD 20622 | St. Mary's County |  |  | GRIMM JENNIFER | Lic. Status: Pending | VAPE HAVEN |
| 9 | 356 ROMANCOKE ROAD | STEVENSVILLE, MD 21666 | Queen Anne's County | Issued Date: 4/13/2017 | License: 17166688 | HUTCH VAPES LLC | Lic. Status: Issued | VAPE BIRD |
