### Hoovers scraper 2020
#### Zhi Yi Yeo

This notebook updates the Hoovers scraper for 2020, since the underlying database has also changed.


In [1]:
# Initialization
import pickle
import re
import threading
import time
from multiprocessing.pool import ThreadPool, Pool
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm

### Load Data that needs to be scraped 


In [2]:
cdp = pd.read_csv('~/Documents/Yale-NUS/Data-Driven Yale/Global Climate Action 2020/Net Zero/Companies/CDP_companies_nz_revenue_to_scrape_27Aug.csv',
                 encoding = "UTF-8")
cdp.head()

Unnamed: 0,name,iso,entity_type,country,GICS_Sector_Name,GICS_Sub_Industry_Name,account_id,target_year,raw_commitment,target_scope,...,total_scope1_emissions,ghg_emissions_inventory_year,total_scope2_emissions_market_based,total_scope2_emissions_location_based,total_scope3_emissions,net_zero,data_source,revenue,employees,revenue_units
0,A2A,ITA,Company,Italy,"Energy utility networks, Non-energy utilities,...",Power generation,87.0,2023.0,A2A has set specific goals to help reduce the ...,,...,,,,,,0,CDPCompanies2019,,,USD$ M
1,A2A,ITA,Company,Italy,"Energy utility networks, Non-energy utilities,...",Power generation,87.0,2023.0,The definitive project will see the replacemen...,,...,,,,,,0,CDPCompanies2019,,,USD$ M
2,A2A,ITA,Company,Italy,"Energy utility networks, Non-energy utilities,...",Power generation,87.0,2023.0,A2A has set specific goals to develop solution...,,...,,,,,,0,CDPCompanies2019,,,USD$ M
3,A2A,ITA,Company,Italy,"Energy utility networks, Non-energy utilities,...",Power generation,87.0,2023.0,A2A evaluate the natural gas dispersed from ro...,,...,,,,,,0,CDPCompanies2019,,,USD$ M
4,A2A,ITA,Company,Italy,"Energy utility networks, Non-energy utilities,...",Power generation,87.0,2030.0,In a process of setting a target for decarboni...,Scope 1,...,7491395.0,2018-12-31,109807.0,Question not applicable,,1,CDPCompanies2019,,,USD$ M


In [3]:
# Get the names for scraping 
company_names = cdp[['name', 'iso', 'entity_type', 'country']]
company_names = company_names.drop_duplicates()
company_names.head()

Unnamed: 0,name,iso,entity_type,country
0,A2A,ITA,Company,Italy
10,ABN Amro Holding,NLD,Company,Netherlands
15,AccorHotels,FRA,Company,France
17,ADLER & ALLAN,GBR,Company,United Kingdom
19,"Adobe, Inc.",USA,Company,United States of America


#### Set up Driver for scraper

Because the data we need is behind a paywall (which NUS/YNC accounts can access), we need to log in to a student or staff account before the scraper can work.

This part of the code sets up the driver for that.



In [13]:
# To use Chrome, chromedriver must be on the PATH
driver = webdriver.Chrome()
driver.get("https://linc.nus.edu.sg/record=b2679651")
# find_element(By..., ...) works in both Selenium 3 and Selenium 4
elem = driver.find_element(By.LINK_TEXT, "D&B Hoovers (formerly known as D&B Business Browser)")
elem.send_keys(Keys.RETURN)

In [313]:
# Enter your username and password here to send over selenium, or alt tab over to the opened browser to key in manually
username = "test"
password = "test"

In [309]:
# Work in progress - better to sign in manually
#user = driver.find_element_by_name("user")
#pw = driver.find_element_by_name("pass")

#user.send_keys(username)
#pw.send_keys(password)

#driver.find_element_by_css_selector("input[type='submit']").click()

### Start Scraping 

After signing in, we can start searching for companies and scraping data.

We start by writing helper functions that extract the data from a page, then use them to parse the different result pages and collect the data we want.


In [5]:
# def login_check(url):
    
def get_actor(container):
    # Get soup
    soup = BeautifulSoup(container.get_attribute('outerHTML'), 'lxml')
    # Name, sector, and location are in every actor, so no need for checks 
    name = soup.find(attrs={'class':'clickable'}).text
    sector = soup.find(attrs={'class':'large-black-text'}).text
    location = re.sub(r'\s', '', soup.find(attrs={'class':'location'}).text)
    # For the others (revenue, employees, etc.) it is not guaranteed the data is present
    # so need to check 
    if soup.find(attrs={'class':'sales black'}) is not None:
        revenue = soup.find(attrs={'class':'sales black'}).text.replace('\n\xa0\n\n', '')
    else:
        revenue = 'NA'
    labels = []
    for label in soup.find_all(attrs={'class': 'data-label'}):
        labels.append(re.sub('\n|\xa0|  |•', '', label.text))
    labels = '; '.join(labels)
    # Now check for the different fields in labels and scrape whatever we can find 
    # Check for employees and assets 
    if re.search(r"Employees \(This Site\)", labels):
        employees_local = re.sub(r".*?Employees \(This Site\):([0-9,.kMB]+).*", r'\1', labels)
    else:
        employees_local = 'NA'
    if re.search(r"Employees \(All Sites\)", labels):
        employees_all = re.sub(r".*?Employees \(All Sites\):([0-9,.kMB]+).*", r'\1', labels)
    else:
        employees_all = 'NA'
    if re.search("Assets", labels):
        assets = re.sub(r".*?Assets:([0-9.MB]+).*", r'\1', labels)
    else:
        assets = 'NA'
            
    actor_data = pd.DataFrame({'name':name, 'sector':sector, 'location':location,
                               'revenue':revenue, 'employees_local':employees_local,
                               'employees_all': employees_all, 'assets':assets},
                             index = [0])
    return actor_data
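
The label parsing in `get_actor` can be tried in isolation. The sketch below is a minimal, standalone version of the same idea. It assumes the joined label string looks like the hypothetical `sample` shown (the real Hoovers markup may differ), and `parse_labels` is an illustrative helper, not part of the notebook.

```python
import re

# Field patterns taken from get_actor; the capture group holds the value.
FIELD_PATTERNS = {
    'employees_local': r"Employees \(This Site\):([0-9,.kMB]+)",
    'employees_all': r"Employees \(All Sites\):([0-9,.kMB]+)",
    'assets': r"Assets:([0-9.MB]+)",
}

def parse_labels(labels):
    """Return a dict of field values, using 'NA' where a field is absent."""
    parsed = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, labels)
        parsed[field] = match.group(1) if match else 'NA'
    return parsed

# Hypothetical label string, made up for illustration only.
sample = "Employees (This Site):1,200; Employees (All Sites):13.5k"
print(parse_labels(sample))
```

Using `re.search` with a capture group avoids running each pattern twice (once to test, once to substitute), at the cost of differing slightly in form from the `re.sub` calls above.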

In [6]:
def get_page(url):
    driver.get(url)
    # Give the page time to load before reading it
    time.sleep(5)
    # Check whether the search returned any results first
    num_results = driver.find_elements(By.CSS_SELECTOR, '.js-total-results.total-count-number')[0].text
    if int(num_results.replace(',', '')) == 0:
        return None
    # Parse each result container on the page
    actors = driver.find_elements(By.CLASS_NAME, 'container-fluid')
    df = []
    for actor in actors:
        df.append(get_actor(actor))
    return df

### Main Scraping loop 

Incorporate error handling so that a single failed company does not stop the whole run.

Also periodically save files so that I won't cry if everything breaks.
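
These two ideas (isolate failures, checkpoint periodically) can be sketched independently of Selenium. In this minimal sketch, `fetch` and `save` are hypothetical callables standing in for `get_page` and a CSV export; the function name and the `every` parameter are assumptions made for illustration.

```python
def scrape_with_checkpoints(names, fetch, save, every=300):
    """Call fetch(name) for each name, recording failures instead of stopping.

    save(results, count) is called each time `every` more names have been
    processed. Returns (results, errors).
    """
    results, errors = [], []
    for count, name in enumerate(names, start=1):
        try:
            results.append(fetch(name))
        except Exception:
            # Record the failure and move on to the next name.
            errors.append(name)
        if count % every == 0:
            save(results, count)
    return results, errors


# Toy usage with stand-in functions (no network access involved).
def fake_fetch(name):
    if name == 'bad':
        raise ValueError('simulated scraping failure')
    return name.upper()

checkpoints = []
results, errors = scrape_with_checkpoints(
    ['a', 'bad', 'b', 'c'], fake_fetch,
    save=lambda res, n: checkpoints.append((n, list(res))), every=2)
print(results, errors, checkpoints)
```

Counting from 1 means the first checkpoint fires after `every` names rather than at index 0, so no separate "skip the first iteration" test is needed.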


In [7]:
company_names = company_names.reset_index(drop = True)

In [9]:
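# Main scraping loop
stem = 'https://app.avention.com/search/company?q='
main = []
errors = []
for i in tqdm(range(company_names.shape[0])):
    # Save a checkpoint after every 300 names.
    # Use `and` here: `&` binds tighter than `==` and `!=`, which breaks the test.
    if i % 300 == 0 and i != 0:
        # Drop empty results (get_page returns None when nothing is found),
        # then flatten the per-page lists into one list of DataFrames
        main_rm = [page for page in main if page]
        main_export = pd.concat([actor for page in main_rm for actor in page])
        main_export.to_csv('~/Documents/Yale-NUS/Data-Driven Yale/Global Climate Action 2020/Net Zero/Companies/hoovers_revenue_upto' + str(i) + 'actors.csv',
                          encoding = 'UTF-8')
    try:
        url = stem + company_names['name'][i]
        main.append(get_page(url))
    except Exception:
        # Log the failure and keep going; failed names are retried later
        print('Error scraping for ' + company_names['name'][i])
        errors.append(company_names['name'][i])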



  8%|▊         | 106/1274 [13:09<2:21:45,  7.28s/it]

Error scraping for Eneco


 34%|███▎      | 428/1274 [53:07<1:41:06,  7.17s/it]

Error scraping for Nos SGPS


 35%|███▍      | 440/1274 [54:38<1:49:10,  7.85s/it]

Error scraping for Iberia Lineas Aereas de Espana SA


 36%|███▌      | 455/1274 [56:28<1:37:39,  7.15s/it]

Error scraping for FullCycle


 42%|████▏     | 541/1274 [1:07:25<1:36:48,  7.92s/it]

Error scraping for Marlin Communications


 46%|████▌     | 588/1274 [1:13:17<1:27:08,  7.62s/it]

Error scraping for Ionica


 48%|████▊     | 616/1274 [1:16:43<1:15:52,  6.92s/it]

Error scraping for Pela


 53%|█████▎    | 677/1274 [1:24:08<1:10:51,  7.12s/it]

Error scraping for Smart Air


 60%|██████    | 769/1274 [1:35:17<1:02:09,  7.38s/it]

Error scraping for Palm


 70%|██████▉   | 889/1274 [1:49:30<42:45,  6.66s/it]  

Error scraping for Twist Sa | Sweet Rebels


 84%|████████▍ | 1073/1274 [2:11:56<23:51,  7.12s/it]

Error scraping for Workspace


 86%|████████▌ | 1097/1274 [2:14:48<20:42,  7.02s/it]

Error scraping for Aber Food Surplus


 86%|████████▋ | 1100/1274 [2:15:15<24:48,  8.55s/it]

Error scraping for African Bronze Honey Company


 88%|████████▊ | 1124/1274 [2:18:15<19:40,  7.87s/it]

Error scraping for De Smaakspecialist


 89%|████████▊ | 1128/1274 [2:18:45<19:15,  7.91s/it]

Error scraping for Dolphin Blue Inc


 93%|█████████▎| 1184/1274 [2:25:34<10:59,  7.33s/it]

Error scraping for New Hope Ecotech | eureciclo


 94%|█████████▍| 1198/1274 [2:27:08<08:46,  6.93s/it]

Error scraping for PATHFINDER


 97%|█████████▋| 1236/1274 [2:31:44<04:34,  7.22s/it]

Error scraping for Up Marketing


 97%|█████████▋| 1241/1274 [2:32:22<04:07,  7.49s/it]

Error scraping for VIANOVA


100%|██████████| 1274/1274 [2:36:28<00:00,  7.29s/it]
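
Some of the names that failed above contain spaces or characters such as `|`. Whether that caused the failures is not established here, but percent-encoding the query is a cheap safeguard. Below is a minimal sketch using the standard library's `urllib.parse.quote`; `build_search_url` is a hypothetical helper, and it assumes the search endpoint accepts a percent-encoded `q` parameter.

```python
from urllib.parse import quote

STEM = 'https://app.avention.com/search/company?q='

def build_search_url(name, stem=STEM):
    """Percent-encode a company name so it is safe inside a query string."""
    # safe='' also encodes '/', which quote() leaves alone by default.
    return stem + quote(name, safe='')

print(build_search_url('Twist Sa | Sweet Rebels'))
print(build_search_url('ADLER & ALLAN'))
```

Encoding `&` matters in particular, because an unencoded `&` would otherwise be read as the start of a new query parameter.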


In [18]:
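# Try scraping for errors again
stem = 'https://app.avention.com/search/company?q='
# Collect names that fail a second time in a separate list,
# so that `errors` is not modified while it is being retried
still_failing = []
for name in tqdm(errors):
    try:
        url = stem + name
        main.append(get_page(url))
    except Exception:
        print('Error scraping for ' + name)
        still_failing.append(name)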

 47%|████▋     | 9/19 [01:13<01:21,  8.20s/it]

Error scraping for AKF INTERNATIONAL CORP


 79%|███████▉  | 15/19 [01:59<00:31,  7.99s/it]

Error scraping for AMCOL(TIANJIN) INDUSTRIAL MINERALS CO., LTD


100%|██████████| 19/19 [02:28<00:00,  7.79s/it]


In [26]:
flatten = lambda l: [item for sublist in l for item in sublist]
main_rm = [i for i in main if i] 
main_df = pd.concat(flatten(main_rm))
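
`get_page` returns nothing when a search has no results, which is why empty entries are filtered out before flattening. The same filter-and-flatten step can also be written with `itertools.chain.from_iterable`. This is an alternative sketch shown on plain lists rather than DataFrames, and `flatten_results` is a hypothetical name.

```python
from itertools import chain

def flatten_results(pages):
    """Drop empty pages (None or []) and flatten the rest into one list."""
    return list(chain.from_iterable(page for page in pages if page))

pages = [['a1', 'a2'], None, [], ['b1']]
print(flatten_results(pages))
```

The resulting flat list could then be passed to `pd.concat` in the same way as `flatten(main_rm)`.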

In [28]:
main_df.shape

(13148, 7)

In [29]:
main_df.to_csv('~/Documents/Yale-NUS/Data-Driven Yale/Global Climate Action 2020/Net Zero/Companies/hoovers_revenue_full_28Aug.csv',
           encoding = 'UTF-8')