### Written by: Brandon Ong

### Project: Web scraping script for collecting company information from a California pest control company directory.

#### Website: http://pcoc.officialbuyersguide.net/

#### Methodology:

#### 1) Before writing code, inspect the website, starting from the home page, to determine the logical (manual) steps required to reach each company's info. At the same time, inspect the HTML and locate the elements you need for each step.

#### 2) Steps required:
   #### a. From the home page, click into the first category (9 categories in total)
   #### b. For each company listed, click the company's name to access the full company info (name, phone and email)
   #### c. Go through all pages (if applicable) to cover all companies within each category
   #### d. Repeat a-c for all 9 categories, skipping any category whose listing is empty

In [11]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

home_url = "http://pcoc.officialbuyersguide.net/"

def link_content(link):
    '''
    Retrieve the page at the given link and return its parsed HTML content
    '''
    r = requests.get(link, timeout=5)
    soup = BeautifulSoup(r.content, "html.parser") #name the parser explicitly for consistent results
    return soup

# retrieve home page content 
home_content = link_content(home_url)

#### Use your browser's Inspect Element tool to examine the page.

#### The 'a' tags whose href links to each category's company listings are nested inside div tags with class='HomeCategory'

In [2]:
def category_url(home_content):  
    '''
    Retrieve all category a hrefs, 
    construct full urls, 
    and append it to a list for later usage
    '''
    cat_links = []
    cat_tags = [div.a for div in home_content.findAll('div',{'class' : 'HomeCategory'})]
    for param in cat_tags:
        cat_param = param.get("href")
        cat_url = home_url + cat_param
        cat_links.append(cat_url)
    return cat_links 

category_links = category_url(home_content)
print(category_links)

['http://pcoc.officialbuyersguide.net//SearchResults?categories=1', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=2', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=4', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=3', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=5', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=6', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=18', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=19', 'http://pcoc.officialbuyersguide.net//SearchResults?categories=21']
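
#### The double slash in the URLs above comes from concatenating home_url, which ends in "/", with an href that starts with "/". The scrape works with these URLs as they are. As a minimal sketch (Python 3 import path; build_url is a hypothetical helper, not part of the notebook), the standard library's urljoin is one way to produce normalized URLs:

```python
from urllib.parse import urljoin

home_url = "http://pcoc.officialbuyersguide.net/"

def build_url(base, href):
    # urljoin resolves href against base, so a leading "/" on href
    # does not produce a double slash
    return urljoin(base, href)

# href value taken from the category links printed above
print(build_url(home_url, "/SearchResults?categories=1"))
```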


#### On each page of listings, the link to a company's page is the href of an 'a' tag with class="companyNameLink", nested within a div with class="ListingNameAddress"

#### For categories with multiple pages of listings, a specific page can be requested by appending the parameter "&pg={pagenumber}" to the category url.

#### For example, page 2 of category 1:

#### "http://pcoc.officialbuyersguide.net//SearchResults?categories=1&pg=2"

#### To scrape all company links in each category, we need a function that iterates through the pages, scrapes each company's info, and stops when it reaches a page with no listings.
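
#### The notebook's extract_listings function handles this with recursion. As an alternative illustration of the same stopping rule, here is a loop-based sketch. Both scrape_all_pages and fetch_page are hypothetical names, and the fake_site dictionary stands in for the real website so the example runs without network access:

```python
def scrape_all_pages(category_link, fetch_page):
    """Collect items from page 1 onward until a page comes back empty.

    fetch_page is a hypothetical callable that takes a full URL and
    returns the list of items found on that page.
    """
    results = []
    page = 1
    while True:
        items = fetch_page(category_link + "&pg=" + str(page))
        if not items:  # empty listing: we have run past the last page
            break
        results.extend(items)
        page += 1
    return results


# Stand-in for the real site: two pages of listings, then nothing
fake_site = {
    "cat?categories=1&pg=1": ["company A", "company B"],
    "cat?categories=1&pg=2": ["company C"],
}
companies = scrape_all_pages("cat?categories=1", lambda url: fake_site.get(url, []))
print(companies)
```

#### A loop avoids growing the call stack by one frame per page, which may matter for categories with very many pages.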

In [4]:
def extract_listings(link, num):
    '''
    Main scraping function. It is recursive and does the following:
    1) Access a category page - category link given as argument
    2) Extract all company links starting on page 1
    3) For each company link, extract individual company's information using extract_info() function
    4) Append company information into a list assigned as "co_list"
    5) Iterate through all pages and repeat step 2, 3 and 4 until page returns empty listing
    6) Finally, return the full list of companies and their information
    '''
    full_link = link + "&pg=" + str(num)
    page_content = link_content(full_link)
    company_tags = [div.a for div in page_content.findAll('div',{'class' : 'ListingNameAddress'})]
    
    co_list = [] #Final list that will house information of all companies scraped
    
    if not company_tags: #for handling pages with no company listings
        return 
    else:
        for a in company_tags:
            company_param = a.get("href")
            company_url = home_url + company_param 
            co_info = extract_info(company_url)
            co_list.append(co_info)
        num+=1
        listings = extract_listings(link, num)
        if listings:
            co_list.extend(listings)
        return co_list
    
def extract_info(co_link):
    '''
    This function is used to extract the required information
    from individual company page
    *Added exceptions for handling missing information
    '''
    company_content = link_content(co_link)
     
    try:
        name = company_content.find('div',{'class' : 'ListingPageNameAddress NONE'}).h1.get_text()
    except AttributeError: #div or h1 not found
        name = None
    
    try:
        phone = company_content.find('span',{'id' : 'hiddenSaTp'}).get_text()
    except AttributeError: #span not found
        phone = None
        
    #extract email within script tag using regex (raw string keeps the backslashes literal)
    regex = re.compile(r'([\w\d\.!#$%&\'\*\+\-\/=\?\^_`{|}~;]+@[\w\d\-]+\.[\w]{2,})')
    
    try:
        script_content = company_content.find('div',{'class' : 'ListingPageNameAddress NONE'}).script.get_text()
    except AttributeError: #div or script tag not found
        script_content = None
    
    try:
        email = re.search(regex, script_content).group(0)
    except (AttributeError, TypeError): #no match, or script_content is None
        email = None
        
    return [str(name), str(phone), str(email)]
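
#### To show the email extraction step in isolation, here is a small sketch that applies the same pattern to a string. The find_email helper and the sample script text are made up for illustration. The real page's script content may look different:

```python
import re

# Same pattern as in extract_info, written as a raw string
email_regex = re.compile(r'([\w\d\.!#$%&\'\*\+\-\/=\?\^_`{|}~;]+@[\w\d\-]+\.[\w]{2,})')

def find_email(text):
    # Return the first email-like match, or None if text is missing or has no match
    if not text:
        return None
    match = email_regex.search(text)
    return match.group(0) if match else None

# Hypothetical script content, not copied from the site
sample = 'var addr = "mailto:info@example.com";'
print(find_email(sample))
print(find_email(None))
```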



#### Here, I loop through the list of category links constructed earlier and call the main scraping function extract_listings, starting each category at page 1.

In [5]:
company_list = []
for page in category_links:
    company_info = extract_listings(page,1)
    if company_info:
        company_list.extend(company_info)
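
#### If a company is listed under more than one category (an assumption, the notebook does not check this), it would be scraped once per category. A minimal sketch of removing exact duplicate rows while keeping the original order, using a hypothetical drop_duplicates helper and made-up rows:

```python
def drop_duplicates(rows):
    # Keep the first occurrence of each [name, phone, email] row, preserving order
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows for illustration only
rows = [["Acme Pest", "555-0100", "a@example.com"],
        ["Bug Co", "555-0101", "None"],
        ["Acme Pest", "555-0100", "a@example.com"]]
print(drop_duplicates(rows))
```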


#### First 15 entries of the final list, which contains all companies on the site and their information.

In [8]:
print(company_list[0:15])

[['None', '866-891-3863', 'None'], ['A 1 Termite & Pest Control Inc', '213-388-4506', 'mikemhnam@aol.com'], ['A-1 Fumigation Inc.', '562-866-7535', 'a-1fumigation@verizon.net'], ['Access Exterminator Service, Inc.', '714-630-6310', 'russ@accessext.com'], ['Admiral Pest Control Inc', '562-925-8308', 'jeff@admiralpest.com'], ['Ag-Fume Service Inc', '562-803-0256', 'agfume@flash.net'], ['Algon Exterminating Co', '619-561-1991', 'merry@algonpest.com'], ['Assured Audit Pest Prevention', '909-767-8940', 'joe@assuredaudit.com'], ['BG Inspections & Pest Control', '707-410-7907', 'beetleman@comcast.net'], ['BPC, Inc', '805-650-6828', 'pat@bpcx.com'], ['Brezden Pest Control, Inc.', '805-544-9446', 'sales@brezdenpest.com'], ['Bugman Termite & PC, The', '714-992-1292', 'brian@thebugman.com'], ['C H Boddie Pest Control Inc.', '310-839-9270', 'carlton@boddiepestmanagement.com'], ['Cal State Termite & Pest Ctl', '559-896-9320', 'None'], ['California Pest Control', '209-992-1190', 'calpestcontrol@gmai

#### Create dataframe with final list

In [9]:
import pandas as pd

columns = ['company_name', 'phone', 'email']
df = pd.DataFrame(company_list, columns=columns)

#remove carriage returns found in a couple of company names
df['company_name'] = df['company_name'].replace(to_replace='\r', value='', regex=True)

df.head()

| | company_name | phone | email |
|---|---|---|---|
| 0 | | 866-891-3863 | |
| 1 | A 1 Termite & Pest Control Inc | 213-388-4506 | mikemhnam@aol.com |
| 2 | A-1 Fumigation Inc. | 562-866-7535 | a-1fumigation@verizon.net |
| 3 | Access Exterminator Service, Inc. | 714-630-6310 | russ@accessext.com |
| 4 | Admiral Pest Control Inc | 562-925-8308 | jeff@admiralpest.com |


#### Write data into csv file

In [10]:
df.to_csv(path_or_buf='pest_control_companies.csv', sep=",")