Regular information on Sri Lankan state-owned enterprises (SOEs) is surprisingly sparse. As part of a project to assist Sri Lankan civil society, this code scrapes publicly available data from the Colombo Stock Exchange (CSE) website to collect the names of SOE management.

Under the CSE's terms of service (ToS), web scraping and automated data collection are considered acceptable for non-commercial purposes. All data collected is publicly available and non-proprietary.

The output CSV file has the following columns: 
<ul> 
<li> <i>Name</i>: Name of individual, scraped from respective company's CSE.lk profile </li>
<li> <i>Role</i>: E.g. Board Member, CEO, Chairperson. </li>
<li> <i>Notes</i>: Additional detail on role </li> 
<li> <i>Company_Name</i>: Name of company to look up.  </li> 
<li> <i>CSE_Name</i>: Name of company, as listed on Colombo Stock Exchange. </li> 
<li> <i>CSE_ID</i>: Company symbol, as listed on Colombo Stock Exchange (e.g. AGAL.N0000). For each company, management data is scraped from https://www.cse.lk/home/company-info/[CSE symbol]/company-profile. </li>
<li> <i>Retrieved</i>: Day/Month/Year. Date the data was scraped. </li> 
<li> <i>Check_Needed</i>: Boolean, True/False. True if Company_Name, converted to upper case, does not exactly match CSE_Name. Flags the row for an additional manual check. </li> 
</ul>
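
As a minimal sketch of how the profile URL and the Check_Needed flag can be derived from a company name and symbol: the helper names `profile_url` and `needs_check` are hypothetical and do not appear in the code below; they only mirror the URL pattern and the upper-case comparison this notebook uses.

```python
def profile_url(cse_symbol):
    """Build the CSE profile URL for a symbol such as 'AGAL.N0000'."""
    return "https://www.cse.lk/home/company-info/" + cse_symbol + "/company-profile"


def needs_check(company_name, cse_name):
    """Return True when the looked-up name differs from the CSE listing.

    CSE names are upper case, so the input name is upper-cased first.
    """
    return cse_name != company_name.upper()


print(profile_url("AGAL.N0000"))
print(needs_check("Agalawatte Plantations PLC", "AGALAWATTE PLANTATIONS PLC"))
print(needs_check("Elpitiya Plantations", "ELPITIYA PLANTATIONS PLC"))
```

The second company is flagged because the input name lacks the "PLC" suffix, even though the match is probably correct; this is the kind of row the manual check is meant to confirm.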

This preliminary code serves as a proof-of-concept, and runs on a list of 17 plantation companies to produce 'regional_plantation_companies.csv'. 

TO DO:
<ul> 
<li>Clean code (object-oriented; better manage global/local variables) </li> 
<li>Create a CSV file with names of all SOEs. In this version, I have manually entered the list of company names (and I only have a hard-copy of the 300+ SOEs). Scrape relevant Ministry of Finance page? </li> 
<li>Create a CSV file with all CSE company names and symbols. Currently, there is no comprehensive list of companies and their corresponding CSE symbols. This code uses the CSE website's auto-complete drop-down menu to identify the correct symbol, which limits fuzzy-matching options. </li> 
</ul> 
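
If a local list of CSE names and symbols were available, fuzzy matching could be done offline instead of through the auto-complete menu. The sketch below assumes such a list exists; the `cse_symbols` table is a hand-made, three-row stand-in, and `fuzzy_symbol` and its `cutoff` default are hypothetical choices, not part of this notebook. It uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical lookup table; a real one would cover every CSE-listed company.
cse_symbols = {
    "AGALAWATTE PLANTATIONS PLC": "AGAL.N0000",
    "ELPITIYA PLANTATIONS PLC": "ELPL.N0000",
    "TALAWAKELLE TEA ESTATES PLC": "TPL.N0000",
}


def fuzzy_symbol(company_name, table, cutoff=0.6):
    """Return (cse_name, cse_symbol) for the closest name, or ("NA", "NA")."""
    matches = difflib.get_close_matches(
        company_name.upper(), list(table), n=1, cutoff=cutoff
    )
    if not matches:
        return ("NA", "NA")
    best = matches[0]
    return (best, table[best])


print(fuzzy_symbol("Elpitiya Plantations", cse_symbols))
print(fuzzy_symbol("XYZ", cse_symbols))
```

Similarity scores can favour the wrong company when names share long substrings such as "PLANTATIONS PLC", so any fuzzy match would still need the Check_Needed flag and a manual review.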


In [1]:
import pandas as pd
import numpy as np
import time
import selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC

In [2]:
##Declare functions 
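def find_symbol(company_name, elem):
    'Scrapes the CSE auto-populate search box to return company name and symbol'
    elem.clear()
    elem.send_keys(company_name)
    time.sleep(1) #Give the auto-complete drop-down time to populate
    #find_element(By..., ...) replaces the find_element_by_* helpers removed in Selenium 4
    elem_check = driver.find_element(By.CLASS_NAME, 'search-auto-complete-result')
    button_two = elem_check.find_element(By.CLASS_NAME, "ng-scope")
    #The first suggestion's text is the company name and symbol on separate lines
    cse_name, cse_symbol = button_two.text.split("\n")
    print(button_two.text)
    return (cse_name, cse_symbol)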

def find_profile(company_name):
    'Loads the CSE webpage and searches for the company name; runs the find_symbol function; returns company name and symbol'
    button = WebDriverWait(driver, 2).until(EC.presence_of_element_located((By.ID, 'searchToggle')))
    button.click()
    elem = driver.find_element(By.ID, "search")
    try:
        return find_symbol(company_name, elem)
    except Exception:
        pass
    #If the exact match (with full company name) cannot be found, run search again with only the first term in company name.
    try:
        company_slice = company_name.split(" ")[0]
        return find_symbol(company_slice, elem)
    except Exception:
        return ("NA", "NA")
            
def scrape_page(company_name, cse_symbol, cse_name):
    'Scrapes the company CSE profile page and adds management data to the master list'
    new_url = "https://www.cse.lk/home/company-info/"+ cse_symbol + "/company-profile"
    driver.get(new_url)
    content = WebDriverWait(driver, 3).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.company-profile-header')))
    #"//tbody" is an absolute XPath, so this collects every <tbody> on the page
    check2 = content[1].find_elements(By.XPATH, "//tbody")
    names_list = [value.text for value in check2]
    crop_index = 6 #Hard-coded value; this is where Board of Directors info begins 
    for item in names_list[crop_index:]:
        #Each table holds one person per line
        for x in item.split("\n"):
            final_list.append([x, company_name, cse_symbol, cse_name])

In [3]:
#List of companies 
company_list = ["Agalawatte Plantations PLC", "Balangoda Plantations PLC", 
                "Bogawantalawa Tea Estates PLC", "Elpitiya Plantations", 
                "Horana Plantations PLC", "Kahawatte Plantations PLC", 
                "Kegalle Plantations PLC", 
                "Kelani Valley Plantations PLC", "Kotagala Plantations PLC", 
                "Madulsima Plantations PLC","Malwatte Valley Plantations PLC", 
                "Maskeliya Plantations PLC", 
                "Maturata Plantations Limited","Namunukula Plantations PLC", 
                "Pussellawa Plantations Limited", "Talawakelle Plantations",
                "Watawala Plantations PLC"]

#CSE.lk home page 
home_url = 'https://www.cse.lk/home/market'

In [96]:
final_list = []
na_list = []
driver = webdriver.Chrome()
driver.get(home_url)
for company_name in company_list:
    cse_name, cse_symbol = find_profile(company_name)
    if cse_name != "NA" and cse_symbol != "NA":
        print("scraping...")
        scrape_page(company_name, cse_symbol, cse_name)
    else:
        #Keep track of companies with no CSE match
        na_list.append(company_name)

AGALAWATTE PLANTATIONS PLC
AGAL.N0000
scraping...
BALANGODA PLANTATIONS PLC
BALA.N0000
scraping...
BOGAWANTALAWA TEA ESTATES PLC
BOPL.N0000
scraping...
ELPITIYA PLANTATIONS PLC
ELPL.N0000
scraping...
HORANA PLANTATIONS PLC
HOPL.N0000
scraping...
KAHAWATTE PLANTATIONS PLC
KAHA.N0000
scraping...
KEGALLE PLANTATIONS PLC
KGAL.N0000
scraping...
KELANI VALLEY PLANTATIONS PLC
KVAL.N0000
scraping...
KOTAGALA PLANTATIONS PLC
KOTA.N0000
scraping...
MADULSIMA PLANTATIONS PLC
MADU.N0000
scraping...
MALWATTE VALLEY PLANTATIONS PLC
MAL.X0000
scraping...
MASKELIYA PLANTATIONS PLC
MASK.N0000
scraping...
NAMUNUKULA PLANTATIONS PLC
NAMU.N0000
scraping...
UDAPUSSELLAWA PLANTATIONS PLC
UDPL.N0000
scraping...
TALAWAKELLE TEA ESTATES PLC
TPL.N0000
scraping...
WATAWALA PLANTATIONS PLC
WATA.N0000
scraping...


In [141]:
final_df = pd.DataFrame(data = final_list, columns = ["Name", "Company_Name", "CSE_ID", "CSE_Name"])
#Text before the first "(" is the name (plus any job title); text after it is the role
final_df[["Name", "Role", "Delete"]] =  final_df["Name"].str.split("(", expand = True)
final_df["Role"] = final_df["Role"].str.replace(")", "", regex = False)
#Job titles that may appear alongside a name in the scraped text
title_pattern = r'(?:^|(?<= ))(Co-Chairman|Chairman|Managing Director / CEO|Managing Director|Chief Executive Officer \(Acting\)|Deputy Chief Executive Officer|Chief Executive Officer|Chairperson|Executive Director / CEO)(?:(?= )|$)'
final_df["Notes"] = final_df["Name"].str.extract(title_pattern, expand = True)
#regex = True is required here; recent pandas versions treat the pattern as a literal string by default
final_df["Name"] = final_df["Name"].str.replace(title_pattern, "", regex = True)
final_df.head()

Unnamed: 0,Name,Company_Name,CSE_ID,CSE_Name,Notes,Role,Delete
0,Amarasuriya A.S.,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,Chairman,"Independent, Non-Executive Director",
1,Gunathilake G.P.N.A.G.,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,Chief Executive Officer,"Non-Independent, Executive Director",
2,Ramanayake R.P..L.,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,,"Non-Independent, Non-Executive Director",
3,Asanga W.A.A.,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,,"Non-Independent, Non-Executive Director",
4,Rajasekara L.R.W.S.,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,,"Non-Independent, Non-Executive Director",


In [201]:
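#Reorder columns and drop the unused "Delete" column; .copy() avoids SettingWithCopyWarning below
final_df = final_df[["Name", "Role", "Notes", "Company_Name", "CSE_ID", "CSE_Name"]].copy()
final_df["Retrieved"] = pd.to_datetime('today').strftime("%d/%m/%Y")
#Flag rows where the looked-up name differs from the CSE listing (CSE names are upper case)
final_df["Check_Needed"] = final_df["CSE_Name"] != final_df["Company_Name"].str.upper()
final_df.head()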

Unnamed: 0,Name,Role,Notes,Company_Name,CSE_ID,CSE_Name,Retrieved,Check_Needed
0,Amarasuriya A.S.,"Independent, Non-Executive Director",Chairman,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,04/07/2019,False
1,Gunathilake G.P.N.A.G.,"Non-Independent, Executive Director",Chief Executive Officer,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,04/07/2019,False
2,Ramanayake R.P..L.,"Non-Independent, Non-Executive Director",,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,04/07/2019,False
3,Asanga W.A.A.,"Non-Independent, Non-Executive Director",,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,04/07/2019,False
4,Rajasekara L.R.W.S.,"Non-Independent, Non-Executive Director",,Agalawatte Plantations PLC,AGAL.N0000,AGALAWATTE PLANTATIONS PLC,04/07/2019,False


In [202]:
#Where no role was parsed, fall back to the job title captured in Notes
final_df["Role"] = final_df["Role"].fillna(final_df["Notes"])
#Clear Notes where it would only repeat the Role
final_df.loc[final_df["Role"] == final_df["Notes"], "Notes"] = None
final_df

In [225]:
final_df.to_csv("regional_plantation_companies.csv", index = False)

In [None]:
#TESTING
final_list = []
driver = webdriver.Chrome()
driver.get(home_url)
for company_name in company_list:
    button = driver.find_element(By.ID, "searchToggle")
    button.click()
    elem = driver.find_element(By.ID, "search")
    elem.send_keys(company_name)
    time.sleep(1)
    elem_check = driver.find_element(By.CLASS_NAME, 'search-auto-complete-result')
    try:
        button_two = elem_check.find_element(By.CLASS_NAME, "ng-scope")
        cse_name, cse_symbol = button_two.text.split("\n")
        print(button_two.text)
        new_url = "https://www.cse.lk/home/company-info/"+ cse_symbol + "/company-profile"
        driver.get(new_url)
        content = WebDriverWait(driver, 3).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.company-profile-header')))
        check2 = content[1].find_elements(By.XPATH, "//tbody")
        names_list = []
        crop_index = None #Reset per company; None keeps the whole list if no 'Chairman' table is found
        for index, value in enumerate(check2):
            names_list.append(value.text)
            if value.text.startswith('Chairman'):
                crop_index = index
        for value in names_list[crop_index:]:
            for x in value.split("\n"):
                final_list.append([x, company_name, cse_symbol])
        print("Done", company_name)
    #If company is not listed on Colombo Stock exchange
    except Exception:
        print("Not available")