### MAINELY IN BUSINESS ###

The following scripts scrape the Maine Secretary of State's website for businesses with "Mainely" in the title. This happens in two steps. 

Because the SoS site limits searches to 100 results, the script first generates the full list of such businesses by searching for the combination of Mainely + each letter of the alphabet, combining that into one list that includes URLs for each individual business page.

The second portion of the script scrapes business information from those individual business URLs and pulls that into the same dataframe, for output to CSV.
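
As a rough illustration of the first step described above, the 26 search strings and the absolute page URLs could be built as follows. This is a minimal sketch, not the notebook's own code: the helper names `build_queries` and `absolute_url` and the sample href are hypothetical, and only the base URL is taken from the notebook.

```python
import string
from urllib.parse import urljoin

# Base URL used by the notebook for individual business pages
URL_BASE = 'https://icrs.informe.org'

def build_queries(prefix='Mainely'):
    """Return one search string per uppercase letter: 'Mainely A' ... 'Mainely Z'."""
    return [f'{prefix} {letter}' for letter in string.ascii_uppercase]

def absolute_url(relative_href):
    """Turn a relative link scraped from a results table into a full URL."""
    return urljoin(URL_BASE, relative_href)

queries = build_queries()
print(len(queries), queries[0], queries[-1])

# Hypothetical relative href, for illustration only
print(absolute_url('/nei-sos-icrs/ICRS?example=1'))
```

Splitting the search by letter keeps each result set small, which is how the notebook works around the 100-result cap.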

In [1]:
import pandas as pd
import requests
import string
import urllib
import time
from scrapy import Selector

#### Pull down Mainely business names

This loop uses a POST method to generate a list of businesses with "Mainely" in the title, from the Maine SoS website. It collects the tables and individual URLs for each page of results. 

The URLs serve as unique identifiers for records and are used to drop any duplicate records. They are then used in the next step, to pull in additional information about each business.
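
The de-duplication idea can be shown on toy data. The sketch below assumes two result pages that happen to return the same record; the business names and `example.test` URLs are made up for illustration.

```python
import pandas as pd

# Two hypothetical result pages that overlap on one record
page_one = pd.DataFrame({'names': ['BUSINESS ONE', 'BUSINESS TWO'],
                         'urls': ['https://example.test/1', 'https://example.test/2']})
page_two = pd.DataFrame({'names': ['BUSINESS TWO', 'BUSINESS THREE'],
                         'urls': ['https://example.test/2', 'https://example.test/3']})

# Stack the pages, then keep the first row seen for each URL
combined = pd.concat([page_one, page_two], sort=False, ignore_index=True)
unique = combined.drop_duplicates(subset='urls').reset_index(drop=True)

print(unique)
```

Keying on the URL rather than the name avoids collapsing distinct records that happen to share a name.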

In [2]:
#List of uppercase letters for loop
alpha = string.ascii_uppercase

# Mainely loop and variables
id=0
q = 'Mainely '+alpha[id]
url = 'https://icrs.informe.org/nei-sos-icrs/ICRS?MainPage=x'
url_base = 'https://icrs.informe.org'

#POST form data; WAISqueryString carries the search term and is updated for each letter
data = {'WAISqueryString':q
       ,'number':''
       ,'search': {
           '0':'Click+Here+to+Search'
           ,'1':'search'
       }}

#POST headers
headers = {'Host':'icrs.informe.org'
            ,'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0'
            ,'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
            ,'Accept-Language': 'en-US,en;q=0.5'
            ,'Accept-Encoding': 'gzip, deflate, br'
            ,'Content-Type': 'application/x-www-form-urlencoded'
            ,'Connection': 'keep-alive'
            ,'Cookie': 'JSESSIONID=0DF53489E916020D19FCCAB79D9255EB'
            ,'Referer': 'https://icrs.informe.org/nei-sos-icrs/ICRS?search=&MainPage=x&newsearch=New+Search'
            ,'Upgrade-Insecure-Requests': '1'}

### MAINELY NAMES LOOP ###
dfs=[]

#Loop over every letter, A through Z, setting the search query before each request
for letter in alpha:
    data.update(WAISqueryString='Mainely '+letter)
    
    #Pull in request URL text
    r = requests.post(url, data=data, headers=headers)
    
    #Make Selector item to scrape
    sel = Selector(text = r.text)
    
    #Scrape Names, Type & URL and merge
    names = sel.xpath('//tr[position()>=6]/td[2]//text()').extract()
    biz_type = sel.xpath('//tr[position()>=6]/td[3]//text()').extract()
    
    ##URL handler: prefix each relative link with the site base URL
    rel_urls = sel.xpath('//tr[position()>=6]/td[4]//a/@href').extract()
    full_urls = [url_base + rel_url for rel_url in rel_urls]
    
    #Concatenate all lists to dataframe
    df = pd.DataFrame({'names':names
                      ,'type':biz_type
                      ,'urls':full_urls
                      })
    dfs.append(df)

#Combine DF results, reset DF index, drop duplicate rows by URL only
mainely_biz=pd.concat(dfs,sort=False,ignore_index=True)
mainely_biz=mainely_biz.drop_duplicates(subset='urls').reset_index(drop = True)

#### Pull in Mainely business details

Using the URLs from the prior step, these operations pull in new details from the individual business registry pages, including filing dates and registered agents.

All of these lists are then concatenated with the original list into a new dataframe that is ready for cleaning steps.
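
The cell that follows branches on record type because the contact block sits under a different bold label on each kind of page: "Owner" for MARK records, "Contact" for RESERVED names, and "Clerk" for everything else. A minimal sketch of that dispatch is below; the helper names `contact_label` and `label_row_xpath` are hypothetical and do not appear in the notebook.

```python
def contact_label(record_type):
    """Return the bold label that heads the contact block for a record type."""
    if record_type == 'MARK':
        return 'Owner'
    if record_type == 'RESERVED':
        return 'Contact'
    return 'Clerk'

def label_row_xpath(label):
    """Build the XPath prefix that locates the table row holding a bold label."""
    return f'//table//b[contains(text(),"{label}")]//ancestor::tr[1]'

# Show which row each record type would be anchored on
for record_type in ('MARK', 'RESERVED', 'LEGAL', 'ASSUMED'):
    print(record_type, label_row_xpath(contact_label(record_type)))
```

The remainder of each XPath (which following row and cell to read) still differs by page layout, as the notebook's expressions show.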

In [50]:
#HARVEST INDIVIDUAL BUSINESS DETAILS

#Initialize lists to hold scraped variables
status=[]
org_type=[]
address=[]
filing_date=[]
owner_clerk=[]

#Index
i=int(0)

for x in mainely_biz['urls']:

    sel = Selector(text = requests.get(mainely_biz['urls'][i]).content)
    
    if mainely_biz['type'][i] == 'MARK':
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Owner Type")]//ancestor::tr[1]/following::tr[1]/td[5]/text()').get())
        address.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()-1] | //table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()]').extract())
        owner_clerk.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[1]').extract())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').extract())
    else: 
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[4]').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Filing Type")]//ancestor::tr[1]/following::tr[1]/td[3]').get())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::td[1]/text()').extract())
        
        if mainely_biz['type'][i] == 'RESERVED':
            address.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
        else:
            address.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
    i+=1
    
#COMBINE PARENT DATA AND BUSINESS DETAILS
big_df = pd.concat([mainely_biz,pd.DataFrame({'status':status
                                      ,'owner/org_type':org_type
                                      ,'address':address
                                      ,'owner_or_clerk':owner_clerk
                                      ,'filing_date':filing_date})], axis=1)
big_df



Unnamed: 0,names,type,urls,status,owner/org_type,address,owner_or_clerk,filing_date
0,MAINE-LY A STITCH IN TIME,MARK,https://icrs.informe.org/nei-sos-icrs/ICRS?Mar...,EXPIRED,INDIVIDUAL,"[RR 2, BOX 1008 , \nPHILLIPS, ME 04966 \n\t\t\...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tMARY O....,[06/08/1993]
1,MAINE-LY ACCOUNTING INCORPORATED,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> ADMINISTRATIVELY SUSPENDED</td>,<td>BUSINESS CORPORATION</td>,"[\n162 U.S. ROUTE 1 BOX 4 , \nSCARBOROUGH, ME...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tCRAIG R...,[07/01/1983]
2,"MAINE-LY ACTION RENTALS, INC.",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> DISSOLVED</td>,<td>BUSINESS CORPORATION</td>,"[\n19 MAIN STREET , \nHARRISON, ME 04040 \n\t\...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tTHOMAS ...,[05/21/2012]
3,MAINE-LY AIRBORNE LLC,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> ADMINISTRATIVELY DISSOLVED</td>,<td>LIMITED LIABILITY COMPANY (DOMESTIC)</td>,"[\n130 LAKE STREET , \nAUBURN, ME 04210 \n\t\t...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tJAMES J...,[12/30/2011]
4,MAINE-LY AMISH,ASSUMED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> GOOD STANDING</td>,<td>BUSINESS CORPORATION</td>,"[123 FREE STREET, SUITE 200, \n PORTLAND, ME...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tEZEKIEL...,[04/28/1999]
5,MAINE-LY APPLES,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> GOOD STANDING</td>,<td>BUSINESS CORPORATION</td>,"[\nP.O. BOX 205 , \nPITTSFIELD, ME 04967 \n\t\...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tALFRED ...,[02/06/2009]
6,"MAINE-LY AQUARIUMS, LLC",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> ADMINISTRATIVELY DISSOLVED</td>,<td>LIMITED LIABILITY COMPANY (DOMESTIC)</td>,"[\n41 POND VIEW DR , \nLEBANON, ME 04027 \n\t\...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tERIC DU...,[02/04/2016]
7,MAINE-LY ASIAN IMPORTS INC.,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> DISSOLVED</td>,<td>BUSINESS CORPORATION</td>,"[\n10 BRACKETT STREET , \nBIDDEFORD, ME 04005 ...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tDAVID A...,[05/29/1990]
8,MAINELY A MEMORY,ASSUMED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> CANCELLED</td>,<td>LIMITED LIABILITY COMPANY (DOMESTIC)</td>,"[GOSSELIN & DUBORD, P.A., P.O. BOX 1081, , \...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tPAUL R ...,[02/21/2008]
9,"MAINELY ABODES, L.L.C.",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> ADMINISTRATIVELY DISSOLVED</td>,<td>LIMITED LIABILITY COMPANY (DOMESTIC)</td>,"[\nPO BOX 74 , \nSULLIVAN, ME 04664 \n\t\t\t\t...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tMARLY S...,[05/28/2003]


In [52]:
big_df[big_df['type']=='RESERVED']

Unnamed: 0,names,type,urls,status,owner/org_type,address,owner_or_clerk,filing_date
131,MAINELY DEALS,RESERVED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> EXPIRED</td>,<td>RESERVED NAME (BUSINESS)</td>,"[\nPO BOX 313 , \nSOLON, ME 04979 \n\t\t\t\t\t...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tMALLORY...,[02/08/2019]
164,MAINELY EQUINE BODYWORKS LLC,RESERVED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> EXPIRED</td>,<td>RESERVED NAME (LIMITED LIABI...,"[\n94 TEN LOTS RD , \nFAIRFIELD, ME 04937 \n\t...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tERIKA S...,[03/11/2019]
445,"MAINELY PRIMAL, LLC",RESERVED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,<td> ACTIVE</td>,<td>RESERVED NAME (LIMITED LIABI...,"[\n76 ARNOLD ROAD , \nCHINA, ME 04358 \n\t\t\t...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tSCOTT R...,[06/20/2019]


#### Cleaning

List fields are converted to strings, preparing them for trimming and for removal of unnecessary characters. The cell then previews the dataframe again, for comparison with the pre-cleaning output from the previous step.
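
To make the effect of these steps concrete, here is a minimal sketch that applies the same substitutions to a single value with the standard `re` module. The function name `clean_cell` is hypothetical, and the sample inputs are modeled on the raw output shown earlier.

```python
import re

def clean_cell(value):
    """Convert a scraped value to a string and strip scraping debris from it."""
    text = str(value)                      # a list becomes its repr, e.g. "['\\n...']"
    text = re.sub(r'</?td>', '', text)     # drop <td> and </td> tags
    text = re.sub(r'\[|\]', '', text)      # drop list brackets
    text = re.sub(r'\\n|\\t', '', text)    # drop escaped newlines and tabs left by repr
    text = text.replace("'", '')           # drop quote marks left by repr
    return text.strip()                    # trim surrounding whitespace

print(clean_cell('<td> GOOD STANDING</td>'))
print(clean_cell(['\n\t\tPO BOX 74 , ', '\nSULLIVAN, ME 04664 \n\t']))
```

Because the list is converted with `str`, its newlines and tabs appear as literal backslash sequences, which is why the pattern matches `\\n` and `\\t` rather than real control characters.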

In [3]:
##DATA/STRING CLEANING

#Convert lists to strings
big_df[['address'
        ,'owner_or_clerk'
        ,'filing_date']] = big_df[['address'
                                   ,'owner_or_clerk'
                                   ,'filing_date']].astype(str)

#Eliminate <td> tags, list brackets, escaped newlines/tabs and quote marks
replace_dict = {'<td>':''
               ,'</td>':''
               ,r'\[|\]':''
               ,r'\\n|\\t':''
               ,r"\'":''
               }

big_df.replace(replace_dict,regex=True,inplace=True)

#Trim leading and trailing whitespace from every string cell
def trim_all_columns(df):
    """
    Trim whitespace from all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)

big_df = trim_all_columns(big_df)

#PREVIEW DATAFRAME
big_df


In [2]:
#OUTPUT TO CSV
#big_df.to_csv('mainely_businesses_scraped.csv')


In [None]:
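#Google Credentials
#pygsheets and cwd are not set up earlier in this notebook; this assumes
#the credentials file sits in the current working directory
import os
import pygsheets
cwd = os.getcwd()
gc = pygsheets.authorize(service_file=cwd+'/me-congress-2020-creds.json')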

#Select sheet and worksheet
sh = gc.open('Mainely businesses')
# sh = gc.open_by_key('1AKrgHT9NLpoddV16B7_M_0PEjJmMQAGtXJUnLCTDHjA')
wks = sh[3]

#Clear sheet before load
wks.clear(start='A1',fields='*')

#Write dataframe to sheet
#Note: df_cull is not defined in this notebook; it must be created before this cell runs
wks.set_dataframe(df_cull,(1,1))