### MAINELY IN BUSINESS ###

The following scripts scrape the Maine Secretary of State's website for businesses with "Mainely" in the title. This happens in two steps. 

Because the SoS site limits each search to 100 results, the script first builds the full list of such businesses by searching for "Mainely" plus each letter of the alphabet. It then combines the results into one list that includes the URL of each individual business page.
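
As a minimal sketch of that first step, the search strings can be generated up front. The helper name `build_queries` is hypothetical and only illustrates the idea; the notebook itself builds each query inside its loop.

```python
import string

# Hypothetical helper, not part of the notebook: one query per uppercase letter
def build_queries(prefix="Mainely"):
    """Return search strings such as 'Mainely A' through 'Mainely Z'."""
    return [f"{prefix} {letter}" for letter in string.ascii_uppercase]

queries = build_queries()
print(len(queries), queries[0], queries[-1])
```

Splitting one broad search into 26 narrower ones is what keeps each result set under the site's limit, assuming no single letter returns more than 100 matches.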

The second portion of the script scrapes business information from those individual business URLs and pulls that into the same dataframe, for output to CSV.

In [2]:
import pandas as pd
import requests
import string
import urllib
import datadotworld as dw
import os
from scrapy import Selector

#### Pull down Mainely business names

This loop uses a POST method to generate a list of businesses with "Mainely" in the title, from the Maine SoS website. It collects the tables and individual URLs for each page of results. 
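
For orientation, here is a rough sketch of how a flat form payload is turned into a POST body, using the standard library's `urlencode`. The field names come from this notebook's payload and the values are illustrative; `requests` form-encodes a dict passed as `data=` in much the same way, but treat that as an assumption to check against the actual request.

```python
from urllib.parse import urlencode

# Field names mirror the notebook's payload; the values are examples only
payload = {"WAISqueryString": "Mainely A", "number": ""}
body = urlencode(payload)
print(body)

# A literal "+" inside a value is percent-encoded, not treated as a space,
# so values are best written with plain spaces before encoding
encoded_plus = urlencode({"search": "Click+Here+to+Search"})
print(encoded_plus)
```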

The URLs serve as unique identifiers for records and are used to drop any duplicate records. They are then used in the next step, to pull in additional information about each business.
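
The de-duplication step can be sketched on a toy table. The names and URLs below are made up; the point is only that rows sharing a URL collapse to one, whichever query returned them.

```python
import pandas as pd

# Made-up rows: the first two point at the same business page
results = pd.DataFrame({
    "names": ["MAINELY EXAMPLE ONE", "MAINELY EXAMPLE ONE", "MAINELY EXAMPLE TWO"],
    "urls": [
        "https://example.com/biz/1",
        "https://example.com/biz/1",
        "https://example.com/biz/2",
    ],
})

# Keep the first row for each URL, then renumber the index from zero
deduped = results.drop_duplicates(subset="urls").reset_index(drop=True)
print(deduped)
```

Resetting the index matters for the next step, which looks rows up by position.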

In [3]:
#List of uppercase letters for loop
alpha = string.ascii_uppercase

# Mainely loop and variables
id=0
q = 'Mainely '+alpha[id]
url = 'https://icrs.informe.org/nei-sos-icrs/ICRS?MainPage=x'
url_base = 'https://icrs.informe.org'

#POST form payload; WAISqueryString is updated for each letter in the loop below
data = {'WAISqueryString':q
       ,'number':''
       ,'search': {
           '0':'Click+Here+to+Search'
           ,'1':'search'
       }}

#POST headers
headers = {'Host':'icrs.informe.org'
            ,'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0'
            ,'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
            ,'Accept-Language': 'en-US,en;q=0.5'
            ,'Accept-Encoding': 'gzip, deflate, br'
            ,'Content-Type': 'application/x-www-form-urlencoded'
            ,'Connection': 'keep-alive'
            ,'Cookie': 'JSESSIONID=0DF53489E916020D19FCCAB79D9255EB'
            ,'Referer': 'https://icrs.informe.org/nei-sos-icrs/ICRS?search=&MainPage=x&newsearch=New+Search'
            ,'Upgrade-Insecure-Requests': '1'}

### MAINELY NAMES LOOP ###
dfs=[]

#Loop over every letter, A through Z, updating the query before each request
for letter in alpha:
    data.update(WAISqueryString='Mainely '+letter)
    
    #POST the search form and keep the response
    r = requests.post(url, data=data, headers=headers)
    
    #Make Selector item to scrape
    sel = Selector(text = r.text)
    
    #Scrape Names, Type & URL and merge
    names = sel.xpath('//tr[position()>=6]/td[2]//text()').extract()
    biz_type = sel.xpath('//tr[position()>=6]/td[3]//text()').extract()
    
    ##URL handler: prefix each relative link with the site base
    rel_urls = sel.xpath('//tr[position()>=6]/td[4]//a/@href').extract()
    full_urls = [url_base + rel for rel in rel_urls]
    
    #Concatenate all lists to dataframe
    df = pd.DataFrame({'names':names
                      ,'type':biz_type
                      ,'urls':full_urls
                      })
    dfs.append(df)

#Combine DF results, reset DF index, drop duplicate rows by URL only
mainely_biz=pd.concat(dfs,sort=False,ignore_index=True)
mainely_biz=mainely_biz.drop_duplicates(subset='urls').reset_index(drop = True)

#### Pull in Mainely business details

Using the URLs from the prior step, these operations pull in new details from the individual business registry pages, including filing dates and registered agents.

All of these lists are then concatenated with the original list into a new dataframe that is ready for cleaning steps.
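
A minimal sketch of that final concatenation, with invented values: because `axis=1` joins on the index, the detail lists need to be filled in the same row order as the parent dataframe.

```python
import pandas as pd

# Toy stand-in for the parent dataframe from the previous step
parent = pd.DataFrame({
    "names": ["MAINELY EXAMPLE ONE", "MAINELY EXAMPLE TWO"],
    "urls": ["https://example.com/biz/1", "https://example.com/biz/2"],
})

# Invented detail values, one entry per parent row, in the same order
status = ["GOOD STANDING", "DISSOLVED"]
filing_date = [["01/02/2003"], ["04/05/2006"]]
details = pd.DataFrame({"status": status, "filing_date": filing_date})

# axis=1 places the frames side by side, matching rows on the index
combined = pd.concat([parent, details], axis=1)
print(combined.shape)
```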

In [None]:
#HARVEST INDIVIDUAL BUSINESS DETAILS

#Initialize lists to hold scraped variables
status=[]
org_type=[]
address=[]
filing_date=[]
owner_clerk=[]

#Row index into mainely_biz, incremented at the end of each pass
i = 0

for x in mainely_biz['urls']:

    sel = Selector(text = requests.get(mainely_biz['urls'][i]).content)
    
    if mainely_biz['type'][i] == 'MARK':
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Owner Type")]//ancestor::tr[1]/following::tr[1]/td[5]/text()').get())
        address.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()-1] | //table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()]').extract())
        owner_clerk.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[1]').extract())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').extract())
    else: 
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[4]').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Filing Type")]//ancestor::tr[1]/following::tr[1]/td[3]').get())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::td[1]/text()').extract())
        
        if mainely_biz['type'][i] == 'RESERVED':
            address.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
        else:
            address.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
    i+=1
    
#COMBINE PARENT DATA AND BUSINESS DETAILS
big_df = pd.concat([mainely_biz,pd.DataFrame({'status':status
                                      ,'owner/org_type':org_type
                                      ,'address':address
                                      ,'owner_or_clerk':owner_clerk
                                      ,'filing_date':filing_date})], axis=1)
big_df

In [None]:
big_df[big_df['type']=='RESERVED']

#### Cleaning

List fields are converted to strings, preparing them for trimming and replacement of unnecessary characters. The script then previews the dataframe again, for comparison with the pre-cleaning output from the previous step.
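
To see what the conversion and replacement do to a single scraped value, here is a small sketch that uses the standard `re` module in place of `DataFrame.replace`. The sample address is made up; the patterns mirror the ones applied in this section.

```python
import re

# Made-up scraped value: a list of text nodes with leading whitespace
raw = ["\n\t\tJANE DOE", "\n\t\t1 MAIN ST", "PORTLAND, ME 04101"]

# str() on a list yields its repr, so newlines show up as a literal backslash + n
as_text = str(raw)

# Remove list brackets, escaped newlines/tabs and quote characters, in that order
for pattern in (r"\[|\]", r"\\n|\\t", r"\'"):
    as_text = re.sub(pattern, "", as_text)

cleaned = as_text.strip()
print(cleaned)
```

Joining the list elements directly (for example with `", ".join(...)`) would avoid the repr round trip, but that is a different technique from the one used here.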

In [None]:
##DATA/STRING CLEANING

#Convert lists to strings
big_df[['address'
        ,'owner_or_clerk'
        ,'filing_date']] = big_df[['address'
                                   ,'owner_or_clerk'
                                   ,'filing_date']].astype(str)

#Strip <td> tags, list brackets, escaped newlines/tabs and quote characters
replace_dict = {'<td>':''
               ,'</td>':''
               ,r'\[|\]':''
               ,r'\\n|\\t':''
               ,r"\'":''
               }

big_df.replace(replace_dict,regex=True,inplace=True)

#Trim leading and trailing whitespace from every string cell
def trim_all_columns(df):
    """
    Trim whitespace from all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    #Note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
    return df.applymap(trim_strings)

big_df = trim_all_columns(big_df)

#PREVIEW DATAFRAME
big_df

In [None]:
#OUTPUT TO CSV in the current working directory
cwd = os.getcwd()
big_df.to_csv(os.path.join(cwd, 'mainely_businesses_scraped.csv'))

In [None]:
## WRITE TO DATA.WORLD ##
with dw.open_remote_file('darrenfishell/mainely-businesses', 'mainely-business-names.csv') as w:
    big_df.to_csv(w, index=False)