### MAINELY IN BUSINESS ###

The following scripts scrape the Maine Secretary of State's (SoS) website for businesses with "Mainely" in the name. The scrape runs in two steps.

Because the SoS site limits each search to 100 results, the script first builds the full list of such businesses by searching for "Mainely" plus each letter of the alphabet and each digit. It then combines the results into one list that includes the URL of each individual business page.

The second portion of the script scrapes business information from those individual business URLs and adds it to the same dataframe, for output to CSV.
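
As a minimal sketch of the first step's workaround for the 100-result cap, the search terms can be generated up front. The helper name `build_queries` is hypothetical and not part of the script; the `'Mainely '` prefix and the letters-plus-digits alphabet follow the search loop in this notebook.

```python
import string

def build_queries(prefix='Mainely '):
    """Return one search term per uppercase letter and digit (hypothetical helper)."""
    alpha = string.ascii_uppercase + string.digits
    return [prefix + ch for ch in alpha]

queries = build_queries()
print(len(queries))             # 36 terms: 26 letters + 10 digits
print(queries[0], queries[-1])  # first and last search terms
```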

In [1]:
import pandas as pd
import requests
import string
import time
import datadotworld as dw
import os
import random
from scrapy import Selector
from datetime import date

#### Pull down Mainely business names

This loop sends one POST request per search term to build a list of businesses with "Mainely" in the name from the Maine SoS website. For each page of results, it collects the results table and the URL of each individual business.

The URLs serve as unique identifiers for records and are used to drop any duplicate records. They are then used in the next step, to pull in additional information about each business.
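
A toy illustration of the de-duplication step, assuming overlapping searches can return the same business more than once. The names and `example.org` URLs are made up for illustration.

```python
import pandas as pd

# Made-up rows: the same business returned by two different searches
results = pd.DataFrame({
    'names': ['MAINELY A', 'MAINELY A', 'MAINELY B'],
    'urls': ['https://example.org/a', 'https://example.org/a', 'https://example.org/b'],
})

# Keep the first row for each URL and renumber the index from 0
deduped = results.drop_duplicates(subset='urls').reset_index(drop=True)
print(deduped)
```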

In [4]:
#List of uppercase letters and numbers for search loop
alpha = string.ascii_uppercase + string.digits

# Mainely loop and variables
id=0
q = 'Mainely '+alpha[id]
url = 'https://icrs.informe.org/nei-sos-icrs/ICRS?MainPage=x'
url_base = 'https://icrs.informe.org'

#POST form data; WAISqueryString is updated on each pass of the loop
data = {'WAISqueryString':q
       ,'number':''
       ,'search': {
           '0':'Click+Here+to+Search'
           ,'1':'search'
       }}

#POST headers
headers = {'Host':'icrs.informe.org'
            ,'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0'
            ,'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
            ,'Accept-Language': 'en-US,en;q=0.5'
            ,'Accept-Encoding': 'gzip, deflate, br'
            ,'Content-Type': 'application/x-www-form-urlencoded'
            ,'Connection': 'keep-alive'
            ,'Cookie': 'JSESSIONID=0DF53489E916020D19FCCAB79D9255EB'
            ,'Referer': 'https://icrs.informe.org/nei-sos-icrs/ICRS?search=&MainPage=x&newsearch=New+Search'
            ,'Upgrade-Insecure-Requests': '1'}

### MAINELY NAMES LOOP ###
dfs=[]

#Loops through every letter A-Z and digit 0-9
for char in alpha:
    
    #Set the search term for this pass
    data.update(WAISqueryString='Mainely '+char)
    
    #Pull in request URL text
    r = requests.post(url, data=data, headers=headers)
    
    #Make Selector item to scrape
    sel = Selector(text = r.text)
    
    #Scrape Names, Type & URL and merge
    names = sel.xpath('//tr[position()>=6]/td[2]//text()').extract()
    types = sel.xpath('//tr[position()>=6]/td[3]//text()').extract()
    
    ##URL handler: prefix each relative link with the site base
    rel_urls = sel.xpath('//tr[position()>=6]/td[4]//a/@href').extract()
    full_urls = [url_base + rel for rel in rel_urls]
    
    #Concatenate all lists to dataframe
    df = pd.DataFrame({'names':names
                      ,'type':types
                      ,'urls':full_urls
                      })
    dfs.append(df)

#Combine DF results, reset DF index, drop duplicate rows by URL only
mainely_biz=pd.concat(dfs,sort=False,ignore_index=True)
mainely_biz=mainely_biz.drop_duplicates(subset='urls').reset_index(drop = True)
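
The URL handler above concatenates `url_base` with each relative link. A hedged alternative, assuming the hrefs are root-relative paths, is `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute. The sample hrefs below are placeholders, not real registry links.

```python
from urllib.parse import urljoin

url_base = 'https://icrs.informe.org'

# Placeholder hrefs for illustration only
rel_urls = ['/nei-sos-icrs/ICRS?CorpSumm=EXAMPLE1',
            'https://icrs.informe.org/nei-sos-icrs/ICRS?CorpSumm=EXAMPLE2']

# urljoin resolves relative paths against the base and leaves absolute URLs alone
full_urls = [urljoin(url_base, rel) for rel in rel_urls]
for u in full_urls:
    print(u)
```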

In [6]:
# #Write initial scrape to disk, to enable testing
# today = date.today().strftime("%d-%m-%Y")
# mainely_biz.to_csv('mainely-biz-scrape-'+today+'.csv')
mainely_biz = pd.read_csv('mainely-biz-scrape-01-01-2020.csv')
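
The commented-out lines above stamp the CSV name with the scrape date in day-month-year order. A small sketch of the same idea, using a hypothetical helper `scrape_filename`, shows the name that format produces for a fixed date:

```python
from datetime import date

def scrape_filename(day, prefix='mainely-biz-scrape-'):
    """Build the dated CSV name (hypothetical helper mirroring the cell above)."""
    return prefix + day.strftime("%d-%m-%Y") + '.csv'

print(scrape_filename(date(2020, 1, 1)))  # mainely-biz-scrape-01-01-2020.csv
```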

#### Pull in Mainely business details

Using the URLs from the prior step, these operations pull in new details from the individual business registry pages, including filing dates and registered agents.

All of these lists are then concatenated with the original list into a new dataframe that is ready for cleaning steps.
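
Because the detail lists are joined to the original dataframe by row position, each list needs exactly one entry per URL. The sketch below shows one way to keep them aligned when a request fails, by recording `None` instead of stopping. This is not the notebook's method, which breaks out of the loop on an error. `fetch_details` and the `fetch` callable are hypothetical, and a stub stands in for the network call.

```python
def fetch_details(urls, fetch):
    """Call fetch(url) for each URL; record None when it raises (hypothetical helper)."""
    details = []
    for url in urls:
        try:
            details.append(fetch(url))
        except Exception:
            # Keep the row position: a failed URL still gets an entry
            details.append(None)
    return details

# Stub standing in for a real HTTP request
def fake_fetch(url):
    if 'bad' in url:
        raise ValueError('simulated failure')
    return 'status for ' + url

print(fetch_details(['u1', 'bad-url', 'u3'], fake_fetch))
# ['status for u1', None, 'status for u3']
```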

In [9]:
#HARVEST INDIVIDUAL BUSINESS DETAILS
#Initialize lists to hold scraped variables
status=[]
org_type=[]
address=[]
filing_date=[]
owner_clerk=[]


#New headers
headers = {'Host':'icrs.informe.org'
            ,'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0'
            ,'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
            ,'Accept-Language': 'en-US,en;q=0.5'
            ,'Accept-Encoding': 'gzip, deflate, br'
            ,'Content-Type': 'application/x-www-form-urlencoded'
            ,'Referer': 'https://icrs.informe.org/nei-sos-icrs/ICRS'
            ,'DNT':'1'
            ,'Connection':'keep-alive'
            ,'Cookie':'JSESSIONID=5DB3513D307D954878674950FF081499'
            ,'Upgrade-Insecure-Requests':'1'
            ,'Cache-Control':'max-age=0'}


# Host: icrs.informe.org
# User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0
# Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
# Accept-Language: en-US,en;q=0.5
# Accept-Encoding: gzip, deflate, br
# Referer: https://icrs.informe.org/nei-sos-icrs/ICRS
# DNT: 1
# Connection: keep-alive
# Cookie: JSESSIONID=5DB3513D307D954878674950FF081499
# Upgrade-Insecure-Requests: 1
# Cache-Control: max-age=0


#Index into mainely_biz
i=0

#Only the first 10 rows are requested here
for x in range(0,10):
    
    #Set variable time delay for scrape
    delay = random.randint(1,3)
    print('delay: ' + str(delay) + ' seconds' + '  index: ' + str(i))
    
    try:
        sel = Selector(text = requests.get(mainely_biz['urls'][i],headers=headers).content)
    except Exception:
        #Report the failing URL and stop the run
        print('URL ERROR on ' + mainely_biz['urls'][i])
        break
    
    if mainely_biz['type'][i] == 'MARK':
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Owner Type")]//ancestor::tr[1]/following::tr[1]/td[5]/text()').get())
        address.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()-1] | //table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[last()]').extract())
        owner_clerk.append(sel.xpath('//table//b[contains(text(),"Owner")]//ancestor::tr[1]/following::tr[3]/td/text()[1]').extract())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::tr[1]/td[2]/text()').extract())
    else: 
        status.append(sel.xpath('//table//b[contains(text(),"Status")]//ancestor::tr[1]/following::tr[1]/td[4]').get())
        org_type.append(sel.xpath('//table//b[contains(text(),"Filing Type")]//ancestor::tr[1]/following::tr[1]/td[3]').get())
        filing_date.append(sel.xpath('//table//b[contains(text(),"Filing Date")]//ancestor::tr[1]/following::td[1]/text()').extract())
        
        if mainely_biz['type'][i] == 'RESERVED':
            address.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Contact")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
        else:
            address.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][position()>(last()-(last()-1))]').extract())
            owner_clerk.append(sel.xpath('//table//b[contains(text(),"Clerk")]//ancestor::tr[1]/following::tr[1]/td/text()[following::br][1]').extract())
    i+=1
    # Pause between requests (1-3 seconds, set by delay above)
    time.sleep(delay)
    
#COMBINE PARENT DATA AND BUSINESS DETAILS
big_df = pd.concat([mainely_biz,pd.DataFrame({'status':status
                                      ,'owner/org_type':org_type
                                      ,'address':address
                                      ,'owner_or_clerk':owner_clerk
                                      ,'filing_date':filing_date})], axis=1)
big_df

delay: 2 seconds  index: 0
delay: 1 seconds  index: 1
URL ERROR on https://icrs.informe.org/nei-sos-icrs/ICRS?CorpSumm=19840005+D


Unnamed: 0.1,Unnamed: 0,names,type,urls,status,owner/org_type,address,owner_or_clerk,filing_date
0,0,MAINE-LY A STITCH IN TIME,MARK,https://icrs.informe.org/nei-sos-icrs/ICRS?Mar...,EXPIRED,INDIVIDUAL,"[RR 2, BOX 1008 , \nPHILLIPS, ME 04966 \n\t\t\...",[\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tMARY O....,[06/08/1993]
1,1,MAINE-LY ACCOUNTING INCORPORATED,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
2,2,"MAINE-LY ACTION RENTALS, INC.",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
3,3,MAINE-LY AIRBORNE LLC,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
4,4,MAINE-LY AMISH,ASSUMED,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
...,...,...,...,...,...,...,...,...,...
601,601,"MAINELY YORK TRAILER PARK, INC.",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
602,602,"MAINELY YOUNG, LLC",LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
603,603,MAINELY YOURS,MARK,https://icrs.informe.org/nei-sos-icrs/ICRS?Mar...,,,,,
604,604,THE MAINELY YOGA NOOK LLC,LEGAL,https://icrs.informe.org/nei-sos-icrs/ICRS?Cor...,,,,,
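
The empty cells in the preview above are what `pd.concat(..., axis=1)` yields when the detail lists are shorter than the parent dataframe: rows are matched on index, and missing positions are filled with `NaN`. A toy illustration with made-up values:

```python
import pandas as pd

parent = pd.DataFrame({'names': ['A', 'B', 'C']})
details = pd.DataFrame({'status': ['EXPIRED']})  # only one row was scraped

# Column-wise concat aligns on the index; rows 1 and 2 have no detail data
combined = pd.concat([parent, details], axis=1)
print(combined)
print(combined['status'].isna().tolist())  # [False, True, True]
```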


#### Cleaning

List fields are converted to strings so they can be trimmed and stripped of unnecessary characters. The script then previews the dataframe again, for comparison with the pre-cleaning output from the previous step.
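
To make the replacements concrete, here is a sketch of the same patterns applied to a single value with the standard `re` module. The function `clean_cell` is hypothetical, and the sample string is modelled on the list-as-string values in the preview, not copied from real output.

```python
import re

# Patterns mirroring the notebook's replace_dict, applied in order
PATTERNS = ['<td>', '</td>', r'\[|\]', r'\\n|\\t', r"'"]

def clean_cell(value):
    """Strip tags, brackets, escaped newlines/tabs and quotes, then trim."""
    for pattern in PATTERNS:
        value = re.sub(pattern, '', value)
    return value.strip()

raw = str(['\n\t\tJANE DOE  '])   # a list converted to text, as astype(str) does
print(repr(raw))
print(clean_cell(raw))            # JANE DOE
```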

In [None]:
##DATA/STRING CLEANING

#Convert lists to strings
big_df[['address'
        ,'owner_or_clerk'
        ,'filing_date']] = big_df[['address'
                                   ,'owner_or_clerk'
                                   ,'filing_date']].astype(str)

#Strip <td> tags, list brackets, escaped newlines/tabs and quote marks
replace_dict = {'<td>':''
               ,'</td>':''
               ,r'\[|\]':''
               ,r'\\n|\\t':''
               ,r"\'":''
               }

big_df.replace(replace_dict,regex=True,inplace=True)

#Trim leading and trailing whitespace from every string cell
def trim_all_columns(df):
    """
    Trim whitespace from all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)

big_df = trim_all_columns(big_df)

#PREVIEW DATAFRAME
big_df

In [None]:
#Pull in list of registered agents
#From: https://www5.informe.org/cgi-bin/online/moraa/cra_list.pl
#Separates out addresses and picks only text

sel = Selector(text = requests.get('https://www5.informe.org/cgi-bin/online/moraa/cra_list.pl').content)

name=sel.xpath('//html//table[@class="at-data-table"]//tr/td[1]/text()').getall()
number=sel.xpath('//html//table[@class="at-data-table"]//tr/td[2]/text()').getall()
address1=sel.xpath('//html//table[@class="at-data-table"]//tr/td[3]/text()[1]').getall()
address2=sel.xpath('//html//table[@class="at-data-table"]//tr/td[3]/text()[2]').getall()
tel=sel.xpath('//html//table[@class="at-data-table"]//tr/td[4]/text()').getall()
email=sel.xpath('//html//table[@class="at-data-table"]//tr/td[5]/text()').getall()

agents = pd.DataFrame({'name':name
              ,'number':number
              ,'address1':address1
              ,'address2':address2
              ,'tel':tel
              ,'email':email})

#Drop the first row; drop() returns a new dataframe, so assign the result
agents = agents.drop([0])

#Concatenate full address
agents['full_address'] = agents['address1'] + ' ' + agents['address2']

In [None]:
## WRITE TO DATA.WORLD ##
with dw.open_remote_file('darrenfishell/mainely-businesses', 'maine-registered-agents.csv') as w:
    agents.to_csv(w, index=False)

In [None]:
#OUTPUT TO CSV (in the current working directory)
cwd = os.getcwd()
big_df.to_csv(os.path.join(cwd, 'mainely_businesses_scraped.csv'))

In [None]:
# ## WRITE TO DATA.WORLD ##
# with dw.open_remote_file('darrenfishell/mainely-businesses', 'raw-mainely-business-names.csv') as w:
#     big_df.to_csv(w, index=False)