# Enriching Companies House data

![title](Images/ch_screenshot.png)

This notebook gives an overview of how business data from Companies House is imported, formatted and then enriched using various APIs and websites. Examples of applications include:
-  Retrieving official company websites using Google Places API
-  Scraping websites to get keywords to classify the industry of businesses.
-  Obtaining social media accounts and handles for companies, then using these to get a proxy for their web presence (number of followers, likes etc.)

## Importing data

The Free Company Data Product is a downloadable data snapshot containing basic company data of live companies on the Companies House register, and is the principal dataset for this project. This is updated monthly and needs to be downloaded before importing as a pandas dataframe. First, we need to import some modules...

### Modules

-  Pandas: provides easy-to-use data structures in Python
-  Numpy: provides fast and efficient multidimensional arrays, in addition to linear algebra and mathematical operations.
-  Matplotlib: provides plots to visualise data

In [23]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

import matplotlib.pyplot as plt
# Increase figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

### Loading and formatting Companies House dataset

The latest version of the Free Company Data Product can be downloaded from http://download.companieshouse.gov.uk/en_output.html. The zip file is approximately 300MB, and the raw CSV file around 2GB. Once downloaded, save the data in the root folder of this notebook, or amend the file path in the cell below as required.

In [None]:
# to-do: investigate warning on mixed data types
ch_raw = pd.read_csv('/Users/dataexploitationmac1/Desktop/Faisal/Datasets/BasicCompanyDataAsOneFile-2018-02-01.csv')
# ch_raw = pd.read_csv('/Users/fasamin/Desktop/DS/Datasets/BasicCompanyDataAsOneFile-2018-02-01.csv')
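
The mixed-type warning flagged in the to-do above typically appears because pandas infers column types chunk by chunk on large files. A minimal sketch of one workaround, assuming every column can safely be read as text (the helper name `load_company_csv` is hypothetical, and the tiny in-memory CSV only stands in for the real file):

```python
import io
import pandas as pd

def load_company_csv(path_or_buffer):
    """Read the CSV with every column as a string to avoid mixed-type inference."""
    # dtype=str also preserves leading zeros in company numbers such as 01636551
    return pd.read_csv(path_or_buffer, dtype=str)

# Tiny stand-in for the real 2GB file (column names here are illustrative)
sample = io.StringIO("CompanyName,CompanyNumber\nEXAMPLE LTD,01636551\n")
df_sample = load_company_csv(sample)
print(df_sample.dtypes)
```

Reading everything as text costs memory, so numeric or date columns would need converting afterwards if they are used in calculations.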

In [None]:
# preview the data
ch_raw.head(10) # first 10 rows 

In [None]:
# fields available
ch_raw.columns

In [None]:
# remove unnecessary columns for this project
# why is copy() used? See explanation at link below:
# https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas
ch = ch_raw.iloc[:,[0,1,4,5,6,7,8,9,10,11,12,18,19,21,26,27,28,29]].copy()

# rename columns
ch.columns = ['name','crn','address1','address2','postTown','county','country', \
            'postcode','category','status','origin','accounts_lastMadeUpDate','accountCategory',\
            'returns_lastMadeUpDate','sic1','sic2','sic3','sic4']

In [None]:
# format missing values
ch['sic1'] = ch['sic1'].replace('None Supplied', np.nan) # assign back rather than relying on inplace=True
ch = ch.dropna(subset=['name']) # delete rows with null business names (usually only a few values)

In [None]:
# Produce a range of key stats
print('---------')
print('Number of businesses: %s' %len(ch))
print('Missing SIC codes: %s' %ch.sic1.isnull().sum())
sic_comp = (1.0 - (float(ch.sic1.isnull().sum())/len(ch)))*100
print('SIC code completion: %.2f' %sic_comp + '%')
post_comp = (1.0 - (float(ch.postcode.isnull().sum())/len(ch)))*100
print('Postcode completion: %.2f' %post_comp + '%')
print('---------')
print('Category breakdown (top 5)')
print('')
print(ch.category.value_counts().head())
print('---------')
print('Account category (top 5)')
print('')
print(ch.accountCategory.value_counts().head())
print('---------')
print('Geographical breakdown (top 5)')
print('')
print(ch.origin.value_counts().head())
print('---------')
print('SIC code breakdown (top 5)')
print('')
print(ch.sic1.value_counts().head())
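
The completion percentages printed above repeat the same arithmetic for each column. As a small sketch, the calculation could be wrapped in a helper (`completion_rate` is a hypothetical name, and the example Series is made up for illustration):

```python
import pandas as pd

def completion_rate(series):
    """Percentage of non-null values in a pandas Series."""
    # notnull() gives booleans; their mean is the share of populated entries
    return series.notnull().mean() * 100

example = pd.Series(['62020', None, '99999', None])
print('Completion: %.2f%%' % completion_rate(example))
```

With the real data this would be called as `completion_rate(ch.sic1)` or `completion_rate(ch.postcode)`.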

### Data exploration

Some pandas commands to explore the dataset, including setting up a function to find companies.

In [None]:
ch.dtypes # types of each column - all objects

In [None]:
def find_company(name):
    '''
    Searches the Companies House dataset for company names that contain the given string (case-insensitive).
    '''
    name = name.lower()
    n = ch.name.str.lower().str.contains(name, regex=False) # literal match, so '.' or '(' in a name is safe
    x = input(str(n.sum()) + ' companies found. See list of companies? Y or N? ')
    if x.lower() == 'y':
        return ch[n]
    else:
        return True
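
Because `find_company` prompts for input, it is awkward to reuse inside loops or other functions. A non-interactive sketch of the same lookup, assuming a dataframe with a `name` column (the function name `find_company_matches` and the demo data are hypothetical):

```python
import pandas as pd

def find_company_matches(df, name):
    """Return rows of df whose 'name' column contains `name` (case-insensitive)."""
    # regex=False treats the input literally; na=False treats missing names as non-matches
    mask = df['name'].str.lower().str.contains(name.lower(), regex=False, na=False)
    return df[mask]

demo = pd.DataFrame({'name': ['DYSON LIMITED', 'R.B.W. DEVELOPMENTS LTD.', None]})
print(find_company_matches(demo, 'dyson'))
```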

In [None]:
find_company('Burberry') # testing function on a few cases 

In [None]:
find_company('Dyson')

Exploring SIC codes...

In [None]:
ch.sic1.describe() # counts occurrences and unique values

Sorting by the top 20 SIC codes shows that some of these are not very descriptive. Top of the list is 'Other business support service activities n.e.c'. Third is 'Dormant Company' and this is followed by 'Other service activities n.e.c'.

In [None]:
ch.sic1.value_counts().head(20) # sort by top 20 sic codes

In [None]:
ch.sic1.value_counts().head(30).plot() # shows skew of top categories
plt.show()

Checking if company reference numbers are unique

In [None]:
ch.crn.describe() # all crns are unique

In [None]:
ch.crn.isnull().sum() # 0

Exploring the address data

In [None]:
ch.head() # reminder of the address fields

In [None]:
ch.address1.describe() # 1.6 million unique addresses

In [None]:
ch.address1.isnull().sum() # 27K null addresses

In [None]:
ch.postTown.isnull().sum() # 93K missing town names

In [None]:
ch.postcode.isnull().sum() # 52K missing post codes

### Export formatted dataset

In [None]:
# Export dataset, named after the year and month (YYYY-MM) of the ch data
ch.to_csv('ch_2018-02.csv',index=False)

Optional: remove non-UK companies 

In [None]:
ch_uk = ch[ch['origin'].isin(['United Kingdom','Great Britain','UNITED KINGDOM','GREAT BRITAIN','ENGLAND & WALES','UK'])]
ch_uk = ch_uk.reset_index(drop=True) # drop=True avoids writing the old index out as an extra column
ch_uk.to_csv('ch_2018-02_uk.csv',index=False)

## Scraping data from Google Search Results

This section goes through the process of running Google searches of business names in Companies House, scraping text from the results, and then returning a wordcloud of text from the first page of results.

The code below builds up the functions that run searches and produce wordclouds as follows:

cloud(key_words(search_google('Company Name')))

- search_google(string): returns a list of URLs from Google for the given term
- key_words(list): screen-scrapes all visible text from the given list of URLs, and cleans it
- cloud(string): after removing a given list of stopwords, produces a wordcloud

### Importing further modules and data

In [24]:
import webbrowser # to open web links
import nltk # natural language toolkit
from nltk.corpus import stopwords # Import the stop word list, may require download
# WordCloud modules
from wordcloud import WordCloud, STOPWORDS

import re # regular expressions 
from time import sleep # to pause web-scraper
import requests # allows you to send HTTP requests via Python
from bs4 import BeautifulSoup # beautiful soup for parsing of HTML

Read in formatted CH dataset if starting a new session. We'll refer to this at the end of the section after we've built up our tools to scrape and clean website text.

In [None]:
ch = pd.read_csv('ch_2018-02.csv')

### Returning links for a Google Search term

First, we need to build some functionality to scrape the search results returned by Google.

In [None]:
# Set business search term as an example
biz = 'DYSON LIMITED'

In [None]:
# Read HTML
html = requests.get('https://www.google.co.uk/search?q='+ biz)
# Parse HTML into a BeautifulSoup object
soup = BeautifulSoup(html.content, 'html5lib')

In [None]:
# Get all links and put into list
list_of_links = []
for link in soup.find_all('a'):
    list_of_links.append(link.get('href'))

In [None]:
print(list_of_links) # needs cleaning up 

In [None]:
# Cleaning up results
links = DataFrame({'urls':list_of_links}) #turn list into DF
links = links[links.urls.str.contains('/url?', regex=False, na=False)] #Only search results; literal match, skipping missing hrefs

In [None]:
# remove cached sites
links = links[links.urls.str.contains('webcache.googleusercontent') == False]

In [None]:
# remove opening url?q= string
links = links.urls.str.replace('/url?q=', '', regex=False)
# links is now a Series, so the .urls accessor is no longer needed

In [None]:
# remove suffixed &sa bit by splitting and drop index
links = links.str.split('&sa', n=1).reset_index().drop(columns='index')
# this is now a dataframe with a single 'urls' column of lists

In [None]:
# grab first entry in each list which should be the working url
links_cleaned = [parts[0] for parts in links['urls']]

In [None]:
# convert to dataframe
links_cleaned = DataFrame(links_cleaned)
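
The cleaning steps above rely on string replacement and splitting. As an alternative sketch, the standard library's URL parser can pull the target out of hrefs of the form `/url?q=<target>&sa=...`, which is the shape the steps above assume. The function name `extract_result_urls` and the sample hrefs are hypothetical, and unlike the string approach `parse_qs` also decodes percent-encoded characters:

```python
from urllib.parse import urlparse, parse_qs

def extract_result_urls(hrefs):
    """Pull target URLs out of '/url?q=<target>&sa=...' style hrefs."""
    urls = []
    for href in hrefs:
        if not href or not href.startswith('/url?'):
            continue  # skip missing hrefs and links that are not search results
        # parse_qs maps each query parameter to a list of its values
        target = parse_qs(urlparse(href).query).get('q', [None])[0]
        if target and 'webcache.googleusercontent' not in target:
            urls.append(target)
    return urls

sample_hrefs = [
    '/url?q=https://www.dyson.co.uk/&sa=U&ved=abc',
    '/search?q=dyson',
    None,
    '/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache&sa=U',
]
print(extract_result_urls(sample_hrefs))
```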

We can now bring this together in one function. Note that we're screen-scraping from Google Search results so we'll need to be careful to not overload Google with search requests in quick succession (and potentially get our IP blocked).
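
One way to space out requests is a small throttle that enforces a minimum gap between calls. This is only a sketch: the `Throttle` class is hypothetical, the interval is kept tiny so the demo runs quickly, and a suitable delay for real searches is a judgment call.

```python
import time

class Throttle:
    """Enforce a minimum gap (in seconds) between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleeper=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without sleeping
        self._sleeper = sleeper
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleeper(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_interval=0.05)  # demo value; use whole seconds in practice
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real loop would issue its search request right after this
elapsed = time.monotonic() - start
print('elapsed: %.2fs' % elapsed)
```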

In [None]:
def search_google(business_name):
    '''
    Takes in a business name and returns the links returned in the first page of Google Search results
    '''
    # Read HTML
    html = requests.get('https://www.google.co.uk/search?q='+business_name)
    # Parse HTML into a BeautifulSoup object
    soup = BeautifulSoup(html.content, 'html5lib')

    # Get all links and put into list
    list_of_links = []
    for link in soup.find_all('a'):
        list_of_links.append(link.get('href'))

    # Cleaning up results
    links = DataFrame({'urls':list_of_links}) #turn list into DF
    links = links[links.urls.str.contains('/url?', regex=False, na=False)] #Only search results
    
    # remove cached sites
    links = links[~links.urls.str.contains('webcache.googleusercontent', regex=False)]

    # remove opening /url?q= string; the result is a Series, so .urls is no longer needed
    links = links.urls.str.replace('/url?q=', '', regex=False)

    # remove suffixed &sa bit by splitting on the first '&sa'
    links = links.str.split('&sa', n=1)

    # grab first entry in each list which should be the working url
    links_cleaned = [parts[0] for parts in links]
        
        
    # convert to dataframe
    links_cleaned = DataFrame(links_cleaned)

    return links_cleaned[0]

In [None]:
burberry_links = search_google('burberry limited')

In [None]:
# use webbrowser library to open all links in browser (if needed)
for link in burberry_links:
    webbrowser.open(link)

### Extract key text from company websites 

Now that we've got the functionality to return links from Google search results, we want to navigate to each link, scrape and format the text to find words with explanatory value after removing stopwords and other standard website text. Let's use the Dyson website as an example. 

In [None]:
dyson_links = search_google('Dyson Limited')

In [None]:
# Inspecting the first search result
# Read HTML
html = requests.get(dyson_links[0])
# Parse HTML into a BeautifulSoup object
soup = BeautifulSoup(html.content, 'html5lib')

In [None]:
# Remove non-visible elements (styles, scripts, head, title) from the parsed HTML
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()

In [None]:
# Get all visible text
text = soup.getText().encode('ascii','ignore')

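We want to clean up this text by removing new line, tab and carriage return indicators.

In [None]:
# text is a bytes object, so str(text) contains literal '\n', '\t' and '\r' sequences
# replace them with spaces so that adjacent words are not fused together
text_r = str(text).replace('\\n',' ').replace('\\t',' ').replace('\\r',' ')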
print(text_r)

In [None]:
def stripsymbols(text):
    '''
    Use regular expressions to do a find-and-replace of HTML text.
    Function found online
    '''
    text = str(text)
    # Replace @handles, URLs and any character other than letters, digits, spaces and tabs
    x = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", text)
    # Replace digits and tabs, leaving only letters and spaces
    x = re.sub(r"[^a-zA-Z]", " ", x)
    return x

In [None]:
text_stripped = stripsymbols(text_r)
print(text_stripped)

Getting better! The final step is to remove stop words by invoking the Natural Language Toolkit library we imported earlier. Let's see what kind of stop words are classified in the library.

In [None]:
print(stopwords.words('english')) # lowercase name, so the lookup also works on case-sensitive file systems

We can create a function to remove stopwords from website text.

In [None]:
def removeStopWords(text):
    '''
    Remove stopwords from a given list of words
    '''
    stop_words = set(stopwords.words("english")) # build the set once rather than once per word
    words = [w for w in text if w not in stop_words]
    return words

In [None]:
# Prepare text for function by (1) changing to lowercase (2) removing whitespace from beginning and end
# (3) splitting the text to create a list of individual words
text_stripped_split = text_stripped.lower().strip().split()
text_stripped_split

Finally, running this through the removeStopWords function gives us something in a much better shape than the original text. 

In [None]:
text_cleaned = removeStopWords(text_stripped_split)
text_cleaned

Wrapping this all up into a function...

In [None]:
def cleantext(text):
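    # replace escaped newlines, tabs and carriage returns with spaces so adjacent words are not fused
    text_r = str(text).replace('\\n',' ').replace('\\t',' ').replace('\\r',' ')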
    text_stripped = stripsymbols(text_r)
    text_stripped_split = text_stripped.lower().strip().split()
    text_cleaned = removeStopWords(text_stripped_split)
    return ' '.join(text_cleaned) # returns a joined list of the remaining words 

In [None]:
dyson_cleaned = cleantext(text) # applying the function to the scraped Dyson text
dyson_cleaned

### Using the Natural Language Toolkit to tokenise and tag words

Another feature of the Natural Language Toolkit we can exploit is the ability to tag words and categorise them into their 'parts of speech' (i.e. nouns, verbs, adverbs). Perhaps this can be used to tag words from the scraped text, and retrieve nouns under the assumption that they provide the greatest explanatory power.

Let's test this with the scraped text from the previous section.

In [None]:
dyson_cleaned

In [None]:
tokens = nltk.word_tokenize(dyson_cleaned) # tokenize (split) the string
print(tokens)

The tags below are given according to the Penn Treebank Project here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

The results are quite interesting. Words like 'vacuums', 'cylinders', 'robot' and 'dryers' are correctly given as nouns. However, some words are double-tagged. For instance, 'dyson' is given as a verb as well as a noun. 'Airblade' is given as a past-tense verb. 

From this example alone, filtering on nouns would yield the best results.

In [None]:
tagged = nltk.pos_tag(tokens) # categorise words according to their parts of speech
print(tagged)

In [None]:
# Keep nouns only
tagged_nouns = " ".join([word[0] for word in tagged if word[1] in ['NNS','NN']])
tagged_nouns
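
The noun filter above keeps only the `NN` and `NNS` tags. The Penn Treebank tag set also has `NNP` and `NNPS` for proper nouns, so a parameterised sketch makes it easy to compare both choices. The function name `keep_tags` is hypothetical, and the tagged pairs below are hand-written in the format `nltk.pos_tag` returns rather than real tagger output:

```python
def keep_tags(tagged, wanted=('NN', 'NNS')):
    """Join the words whose part-of-speech tag is in `wanted`."""
    return ' '.join(word for word, tag in tagged if tag in wanted)

# Hand-written (word, tag) pairs for illustration only
tagged_example = [('dyson', 'NN'), ('vacuums', 'NNS'), ('airblade', 'VBD'), ('london', 'NNP')]

print(keep_tags(tagged_example))
print(keep_tags(tagged_example, wanted=('NN', 'NNS', 'NNP', 'NNPS')))
```

Whether proper nouns help or hurt the keyword quality would need checking against real scraped text.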

In [None]:
# Difference in string lengths - 1,352 characters removed (out of 3,767); both variables are strings, so len() counts characters
len(dyson_cleaned) - len(tagged_nouns)

### Master function and wordclouds

Let's collect everything we've built in Section 2 and build our final master functions.

In [None]:
def key_words(list_of_urls):
    '''
    From the given list of urls, this function scrapes, cleans (removing symbols and non-nouns)
    and gathers all text into a single string
    '''
    
    words_cleaned = '' # Set up empty string
    
    for website in list_of_urls:
        # try / except for troublesome websites
        try:
            html = requests.get(website, timeout=10) # give up on sites that do not respond
            soup = BeautifulSoup(html.content, 'html5lib')
            for s in soup(['style', 'script', '[document]', 'head', 'title']):
                s.extract()
            visibleText = soup.getText().encode('ascii','ignore')
            # add a space so words from consecutive sites are not fused together
            words_cleaned += ' ' + cleantext(visibleText) # Calling the function we built earlier 
        except Exception:
            continue
            
    # Tokenize and tag words
    tokens = nltk.word_tokenize(words_cleaned)
    tagged = nltk.pos_tag(tokens)

    # Keep nouns only
    keywords = " ".join([word[0] for word in tagged if word[1] in ['NNS','NN']])
    return keywords

In [None]:
key_words(search_google('Dyson Limited')) # testing on an example - takes a few seconds to run
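
Before plotting, a quick frequency count gives a text-only view of which keywords dominate. A minimal sketch using the standard library (the function name `top_keywords` and the sample string are hypothetical):

```python
from collections import Counter

def top_keywords(text, n=10):
    """Return the n most frequent whitespace-separated words in text."""
    return Counter(text.split()).most_common(n)

sample_keywords = 'vacuum dryer vacuum fan vacuum dryer'
print(top_keywords(sample_keywords, n=2))
```

The same call could be applied to the string returned by `key_words`.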

This is best visualised in a word cloud...

In [None]:
def cloud(words):
    '''
    Function that takes in a string and produces a word cloud
    '''
    wordcloud = WordCloud(stopwords=STOPWORDS,
                          background_color='black',
                          width=1800,
                          height=1400
                         ).generate(words)

    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
    return wordcloud

In [None]:
cloud(key_words(search_google('Royal Dutch Shell'))) # can take up to a minute to run

In [None]:
x = cloud(key_words(search_google('carillion')))

There is a lot of scope to improve this. Certain terms from websites that commonly re-occur can be filtered out by updating the stopwords set as follows.

In [None]:
# ways to improve this: add lists of common names, weights and quantities, months and dates

custom_words = ['sun','new','showbiz','tv','uk','john','lewis','partnership','offers','store',
'business','company','stores','shop','department','partner','street','london','partners','peter',
'jones','duration', 'views', 'minutes', 'month', 'version', 'system','tesco','september','privacy',
'policy','customer','service','home', 'company', 'london', 'price', 'offer', 'customer',
'service', 'home', 'year', 'london', 'day', 'march', 'business', 'shop','item','level','logo','menu',
'account','co','road','centre']


#STOPWORDS is a set, so need to use update method
STOPWORDS.update(custom_words)
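
Calling `update` changes the shared `STOPWORDS` set for the rest of the session. If that is not wanted, a sketch of a non-mutating alternative is to build a new combined set and pass it to the `stopwords=` argument that `cloud` already uses. The helper name `build_stopwords` is hypothetical and the small sets below stand in for the real lists:

```python
def build_stopwords(base, extra):
    """Return a new lower-cased set combining base and extra stopwords."""
    return {w.lower() for w in base} | {w.lower() for w in extra}

base_words = {'the', 'and'}                    # stand-in for wordcloud.STOPWORDS
extra_words = ['London', 'company', 'london']  # duplicates collapse in a set
combined = build_stopwords(base_words, extra_words)
print(sorted(combined))
```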

## Connecting to Google Places API

So far we've relied on Google Search results to find websites that relate to a company name. Alternatively, the Google Places API can be queried with a company's name and postcode to retrieve the company website and other details.

Information from Google Places is displayed for certain businesses and places of interest as shown below.

![title](Images/google_places_screenshot.png)

### Modules and setup

We begin as usual by importing the required modules

In [25]:
import pandas as pd
import numpy as np
import requests
from time import sleep # to pause requests when web-scraping
import geopy.distance # to calculate distances
import re # regex to remove symbols

Google Places API requests can be called with the requests package. By default, you'll be able to send 1,000 requests per day but this can be increased free of charge, up to 150,000 requests per 24 hour period, by enabling billing on the Google API Console to verify your identity. Check quota here (after registering): http://code.google.com/apis/console

In [26]:
YOUR_API_KEY = '' # enter API key which you'll receive from Google 

Check documentation here for info on the various Google Places APIs: https://developers.google.com/places/web-service/

Let's use the postcode for the London Eye as an example: SE1 7PB. The Places API expects a latitude and longitude, so we first convert the postcode to coordinates using postcodes.io.

In [None]:
postcode = 'SE1 7PB'

In [None]:
html = requests.get("http://api.postcodes.io/postcodes/" + postcode)
html.json()

In [None]:
lat = html.json()['result']['latitude']
lon = html.json()['result']['longitude']
lat, lon
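
Indexing straight into the response raises an error when a postcode cannot be resolved. A defensive sketch that returns `None` instead, assuming the `result`, `latitude` and `longitude` keys used above (the helper name `coords_from_lookup` is hypothetical, and the shape of the failing payload is an assumption):

```python
def coords_from_lookup(payload):
    """Return (latitude, longitude) from a postcodes.io-style response dict, or None."""
    result = payload.get('result') or {}
    lat, lon = result.get('latitude'), result.get('longitude')
    if lat is None or lon is None:
        return None
    return lat, lon

ok = {'status': 200, 'result': {'latitude': 51.5028202620979, 'longitude': -0.119252376858172}}
bad = {'status': 404, 'error': 'Invalid postcode'}  # assumed shape of a failed lookup
print(coords_from_lookup(ok), coords_from_lookup(bad))
```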

### Nearby search

A Nearby Search lets you search for places within a specified area. You can refine your search request by supplying keywords or specifying the type of place you are searching for. A nearby search request is an HTTP URL of the following form:

https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=-33.8670522,151.1957362&radius=500&type=restaurant&keyword=cruise&key=YOUR_API_KEY

Using the London Eye as an example:

In [None]:
london_eye = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=' + \
                          str(lat) + ',' + str(lon)  + '&' + 'keyword=London+Eye' + '&' + \
                          'rankby=distance' + '&' + \
                          'key=' + str(YOUR_API_KEY))
london_eye.json()
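
Building the URL by string concatenation works for 'London+Eye', but company names such as 'R.D. HAYLOCK & SON LIMITED' contain characters that would break the query string if left unencoded. A sketch using the standard library to encode every parameter (the function name `build_nearby_search_url` is hypothetical; the parameter names are those used in the request above):

```python
from urllib.parse import urlencode

NEARBY_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'

def build_nearby_search_url(lat, lon, keyword, api_key):
    """Build a Nearby Search URL with every parameter percent-encoded."""
    params = {
        'location': '%s,%s' % (lat, lon),
        'keyword': keyword,
        'rankby': 'distance',
        'key': api_key,
    }
    # urlencode escapes '&', spaces and other reserved characters in the values
    return NEARBY_URL + '?' + urlencode(params)

url = build_nearby_search_url(51.5028, -0.1193, 'R.D. HAYLOCK & SON LIMITED', 'YOUR_API_KEY')
print(url)
```

Passing the same dictionary to `requests.get(NEARBY_URL, params=params)` would have the same effect.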

Exploring the JSON response...

In [None]:
london_eye.json()['results'] # list of results


In [None]:
london_eye.json()['results'][0]['name'] # name of place

In [None]:
le_id = london_eye.json()['results'][0]['place_id'] # id of location 
le_id

Now that we've got the id, we can get the website from the Place Details API.

Before leaving this section, let's build a function that returns the location id from a given set of parameters. We'll need to call upon this function later.

In [27]:
def distance(lat1, lon1, lat2, lon2):
    '''
    Calculate distance (in km) between two sets of latitude and longitude coordinates
    Uses geopy package
    '''
    x = lat1, lon1
    y = lat2, lon2
    return geopy.distance.geodesic(x, y).km # vincenty was removed in geopy 2.0; geodesic replaces it

In [28]:
def nearby_search(lat,lon,keyword,api_key):
    '''
    input: latitude and longitude (floats), keyword such as a company name (string), api_key for Google Places (string)
    output: place_id of the place closest to the co-ordinates that matches the keyword
    if the place is not found or is too far away, a message is returned instead
    '''
    
    # Pass the arguments as params so that characters such as '&' in company names are URL-encoded
    html = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json',
                        params={'location': str(lat) + ',' + str(lon), 'keyword': keyword,
                                'rankby': 'distance', 'key': str(api_key)})
        
    # Error message if location not found
    try: 
        place_id = html.json()['results'][0]['place_id']
    except (KeyError, IndexError, ValueError):
        return 'Location not found'
    
    # Check how close the search result is, reject if more than 10km distance
    
    # Coordinates from Google Places
    lat_g = html.json()['results'][0]['geometry']['location']['lat']
    lon_g = html.json()['results'][0]['geometry']['location']['lng']

    if distance(lat, lon, lat_g, lon_g) > 10: # more than 10km away
        return 'Location too far'
    
    # return html.json()
    return place_id

In [None]:
# testing the function
nearby_search(51.5028202620979,-0.119252376858172,'London Eye',YOUR_API_KEY) # correct

In [None]:
# Returns a KFC about 1 mile from the London Eye. Since we can't restrict radius with the rankby parameter, 
# depending on how strict the business name matching is, we run the risk of mismatches
nearby_search(51.5028202620979,-0.119252376858172,'KFC',YOUR_API_KEY)

In [None]:
nearby_search(51.5028202620979,-0.119252376858172,'London',YOUR_API_KEY) # returns 'London' - the entire city?

In [46]:
nearby_search(51.5028202620979,-0.119252376858172,'Eye',YOUR_API_KEY) # Specsavers in the Strand

'ChIJDYZYNsoEdkgRHCQ30Uysy_M'

In [None]:
nearby_search(51.5028202620979,-0.119252376858172,'Wheel',YOUR_API_KEY) # Kwik-fit in Elephant & Castle

In [29]:
nearby_search(51.5028202620979,-0.119252376858172,'Ferris Wheel',YOUR_API_KEY) # Correct - London Eye

'ChIJc2nSALkEdkgRkuoJJBfzkUI'

In [None]:
nearby_search(50.547036619931,-3.49639521511686,'R.B.W. DEVELOPMENTS LTD.',YOUR_API_KEY) # Testing sample

#### QA (reinforce postcode check)

With the London Eye example, the postcode which is input is 'SE1 7PB', which has corresponding coordinates of (51.5028202620979, -0.119252376858172). Google Nearby Search returns (51.503324, -0.119543) which is close enough.

However, if any results are more than a certain distance away (10km in the `nearby_search` function), we reject them.

In [None]:
# Get coordinates from google places
lat_g = london_eye.json()['results'][0]['geometry']['location']['lat']

In [None]:
lon_g = london_eye.json()['results'][0]['geometry']['location']['lng']

In [None]:
lat, lon

In [None]:
lat_g, lon_g

In [30]:
def distance(lat1, lon1, lat2, lon2):
    '''
    Calculate distance (in km) between two sets of latitude and longitude coordinates
    Uses geopy package
    '''
    x = lat1, lon1
    y = lat2, lon2
    return geopy.distance.geodesic(x, y).km # vincenty was removed in geopy 2.0; geodesic replaces it

In [None]:
distance(lat,lon,lat_g,lon_g)
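
As a dependency-free cross-check on the geopy result, the haversine formula gives the great-circle distance on a sphere. This is a sketch only: it assumes a spherical Earth with a mean radius of 6371 km, so it will differ slightly from an ellipsoidal calculation, and the function name `haversine_km` is hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points on a sphere."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Postcode coordinates vs. the coordinates Google returned for the London Eye
d = haversine_km(51.5028202620979, -0.119252376858172, 51.503324, -0.119543)
print('%.3f km' % d)
```

For the London Eye example the two points are only tens of metres apart, well inside the rejection threshold.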

### Place search

Link here: https://developers.google.com/places/web-service/details

Once you have a place_id from a Place Search, you can request more details about a particular establishment or point of interest by initiating a Place Details request. A Place Details request returns more comprehensive information about the indicated place such as its complete address, phone number, user rating and reviews. A Place Details request is an HTTP URL of the following form:

https://maps.googleapis.com/maps/api/place/details/output?parameters

Calling this on the London Eye place id..

In [47]:
london_eye_website = requests.get('https://maps.googleapis.com/maps/api/place/details/json?' + \
                    'placeid=' + le_id + '&' + 'key=' + str(YOUR_API_KEY))

In [48]:
london_eye_website.json()['result'] # list of results

In [None]:
london_eye_website.json()['result']['formatted_address']

In [None]:
london_eye_website.json()['result']['international_phone_number']

In [None]:
london_eye_website.json()['result']['opening_hours'] # dictionary of opening hours

In [None]:
london_eye_website.json()['result']['reviews'] # reviews

In [None]:
london_eye_website.json()['result']['reviews'][0] # example of a review - NLP potential

In [None]:
london_eye_website.json()['result']['website'] # place website
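
Not every place has every field, so indexing with `['website']` can raise a `KeyError`. A sketch that collects the fields explored above while tolerating missing ones (the function name `place_summary` is hypothetical, and the example dict is a hand-made stand-in for `london_eye_website.json()['result']`):

```python
def place_summary(result):
    """Pick a few optional fields out of a Place Details 'result' dict."""
    fields = ('formatted_address', 'international_phone_number', 'opening_hours', 'website')
    # dict.get returns None for absent keys instead of raising KeyError
    return {field: result.get(field) for field in fields}

example_result = {'formatted_address': '1 Example Street, London', 'website': 'http://www.example.com/'}
print(place_summary(example_result))
```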

As before, let's build a helper function for convenience in the following section

In [31]:
def place_search(place_id):
    '''
    input: place_id from Nearby Search API (string)
    output: company website (if found)
    If a website is not found, then an error message will be given
    '''
    
    # exception in case the Place Details request fails
    try:
        html = requests.get('https://maps.googleapis.com/maps/api/place/details/json',
                            params={'placeid': place_id, 'key': str(YOUR_API_KEY)})
    except requests.RequestException:
        return 'No results in Place Search'
    
    # exception if website is not found
    try:
        website = html.json()['result']['website']
    except (KeyError, ValueError):
        return 'Website not found'
    
    return website

In [49]:
# Testing function with London Eye ID
place_search('ChIJc2nSALkEdkgRkuoJJBfzkUI')

'http://www.londoneye.com/'

In [50]:
# Testing function with other IDs
place_search('ChIJmbKqQlgDdkgRfkx7-tWdNU4') # KFC in Lambeth - not branch-specific

'http://www.kfc.co.uk/'

In [None]:
place_search('ChIJdd4hrwug2EcRmSrV3Vo6llI') # for London - website not available

In [None]:
place_search('ChIJDYZYNsoEdkgRHCQ30Uysy_M') # Specsavers - branch-specific

In [None]:
place_search('ChIJWwBL554EdkgRluPLRtvCKek') # Kwik-fit - branch specific

### Function to return website from given companies house details

Let's bring some of the pieces in the previous sections together to build a function that returns the company website (if found) from details found in the publicly-available Companies House dataset.

In Section 1.4, we exported a cleaned-up dataset of Companies House to a local directory. Let's import this now.

In [34]:
ch = pd.read_csv('ch_2018-02.csv')

In [33]:
ch.shape

We'll be running the business name and the postcode through the Google Places API. The business name should be given but we'll need to exclude those records where the postcode is not given.

In [51]:
ch_pc = ch[ch.postcode.notnull()].copy()
ch_pc.shape

(4088872, 18)

Ok, so we've got our dataset. Let's now put our function together.

In [32]:
def return_website(company_name,postcode):
    '''
    Input: company name and postcode (both strings)
    Output: Company website
    '''
    
    # Step 1: Convert postcodes to latitude and longitude using postcodes.io
    # Some companies may give invalid postcodes in Companies House so build exception
    
    try: 
        html = requests.get("http://api.postcodes.io/postcodes/" + postcode)
        lat = html.json()['result']['latitude']
        lon = html.json()['result']['longitude']
    except (KeyError, TypeError, ValueError, requests.RequestException):
        return 'Postcode not found'
    
    # Step 2: Get place_id by calling upon a Nearby Search (see Section 3.2)
    place_id = nearby_search(lat,lon,company_name,YOUR_API_KEY)
    
    # Step 3: Return website by feeding place_id into Place Search
    website = place_search(place_id)
    
    return website    
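
In `return_website`, a message such as 'Location not found' from the nearby search is passed straight into the place search, which spends an API request on an id that cannot match. One possible restructuring, sketched with stub lookups so it runs offline, stops at the first failed step and reports which one failed. Every name here (`resolve_website` and the `fake_*` stubs) is hypothetical and stands in for the real functions.

```python
def resolve_website(name, postcode, geocode, find_place, find_website):
    """Chain three lookups, stopping at the first one that returns None."""
    coords = geocode(postcode)
    if coords is None:
        return None, 'postcode not found'
    place_id = find_place(coords[0], coords[1], name)
    if place_id is None:
        return None, 'place not found'
    website = find_website(place_id)
    if website is None:
        return None, 'website not found'
    return website, 'ok'

# Stub lookups so the flow can be exercised without network access
calls = []

def fake_geocode(postcode):
    return (51.5, -0.12) if postcode == 'SE1 7PB' else None

def fake_find_place(lat, lon, name):
    return 'place-123' if 'eye' in name.lower() else None

def fake_find_website(place_id):
    calls.append(place_id)  # record each request that would hit the API
    return 'http://www.example.com/'

print(resolve_website('London Eye', 'SE1 7PB', fake_geocode, fake_find_place, fake_find_website))
print(resolve_website('Unknown Co', 'SE1 7PB', fake_geocode, fake_find_place, fake_find_website))
```

Returning a status alongside the website also keeps real URLs and failure messages out of the same column.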

### Testing function on Companies House

Now for the fun stuff: let's test this function on a sample of companies in Companies House and see how many companies we get websites for.

In [52]:
# Random sample of 100 companies from those entries with a postcode, only return name and postcode
ch_random_100 = ch_pc.sample(100)[['name','postcode']].copy()

In [56]:
ch_random_100['website'] = np.nan # set up blank column (.loc['website'] would add a row, not a column)
ch_random_100

| | name | postcode | website |
|---|---|---|---|
| 751960 | CHANNEL FIVE TELEVISION GROUP LIMITED | EH7 5JA | Website not found |
| 3029580 | R.D. HAYLOCK & SON LIMITED | IP33 1NE | |
| 3264450 | SCABIET LTD | NN4 7PA | |
| 1413995 | FOULDS SOLICITORS LIMITED | SG12 0EF | |
| 2458498 | MIDAS COMPONENTS LIMITED | NR31 0DU | |
| 2884993 | PIXEL COMPUTERS SOLUTIONS LIMITED | NP44 1HE | |
| 2932398 | PREDICTO.AI UK LIMITED | WD3 1JE | |
| 1229534 | ELVIVIO LLP | E13 9PJ | |
| 1240114 | EMPTAGE WESTWOOD CONTRACTING LTD | M25 9NX | |
| 2026237 | K.T.E. LEGAL SOLUTIONS LIMITED | BR1 4AW | |


In [57]:
# Retrieve websites for each company with a for loop
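for i, (index, row) in enumerate(ch_random_100.iterrows(), start=1):
    website = return_website(row['name'], row['postcode'])
    ch_random_100.loc[index,'website'] = website
    # print the freshly retrieved website; row is a copy, so it would still hold the old value
    print(str(i) + ' - ' + str(row['name']) + ' - ' + str(row['postcode']) + ' - ' + str(website))
    sleep(0.001)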

1 - CHANNEL FIVE TELEVISION GROUP LIMITED - EH7 5JA - Website not found
2 - R.D. HAYLOCK & SON LIMITED - IP33 1NE - Website not found
3 - SCABIET LTD - NN4 7PA - http://www.x-tremesystems.com/
4 - FOULDS SOLICITORS LIMITED - SG12 0EF - http://www.foulds.uk.com/
5 - MIDAS COMPONENTS LIMITED - NR31 0DU - http://midas-components.com/
6 - PIXEL COMPUTERS SOLUTIONS LIMITED - NP44 1HE - Website not found
7 - PREDICTO.AI UK LIMITED - WD3 1JE - Website not found
8 - ELVIVIO LLP - E13 9PJ - Website not found


KeyboardInterrupt: 

In [None]:
ch_random_100['website'].value_counts().head() # 17 websites out of 100 
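
Rather than reading the match count off `value_counts`, the share of rows with a real URL can be computed directly. A sketch, assuming that anything starting with 'http' is a website and everything else is a status message or missing value (the function name `match_rate` and the sample Series are hypothetical):

```python
import pandas as pd

def match_rate(websites):
    """Share (in %) of entries that look like URLs rather than status messages."""
    found = websites.astype(str).str.startswith('http')
    return found.mean() * 100

results = pd.Series(['Website not found', 'http://www.foulds.uk.com/',
                     'http://midas-components.com/', 'Postcode not found'])
print('Match rate: %.0f%%' % match_rate(results))
```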

In [None]:
ch_random_100.to_csv('sample_of_100_websites.csv') # save results

Let's see if we can improve the match-rate by removing 'LTD.' and 'LIMITED' from the end of the name

In [None]:
ch_random_100.loc[:,'website_2'] = np.nan # set up blank column
ch_random_100

In [None]:
ch_random_100 = pd.read_csv('sample_of_100_websites.csv',index_col=0)

In [None]:
ch_random_100 = ch_random_100.head(100).copy()

In [35]:
def standardise(company_name):
    stopwords = ['limited','ltd','ltd.','lp']
    querywords = str(company_name).split()
    resultwords  = [word for word in querywords if word.lower() not in stopwords]
    result = ' '.join(resultwords)
    result_no_symbols = re.sub(r'[^\w]', '', result) # removes symbols
    return result_no_symbols.lower()

In [36]:
standardise('SHARP & SHINE INDUSTRIAL CO., LTD')

'sharpshineindustrialco'
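
As the output shows, `standardise` removes the spaces along with the symbols, so the words run together. If the cleaned name is meant to be used as a search keyword, a variant that keeps the gaps between words may be worth trying. This is a sketch only; the function name `standardise_keep_spaces` is hypothetical, and whether it improves the match rate is untested.

```python
import re

def standardise_keep_spaces(company_name):
    """Lower-case a name, drop legal suffixes and symbols, but keep word gaps."""
    suffixes = {'limited', 'ltd', 'lp'}
    # Replace anything that is not a letter, digit or underscore with a space
    cleaned = re.sub(r'[^\w]', ' ', str(company_name).lower())
    words = [w for w in cleaned.split() if w not in suffixes]
    return ' '.join(words)

print(standardise_keep_spaces('SHARP & SHINE INDUSTRIAL CO., LTD'))
```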

In [None]:
ch_random_100['name_stripped'] = ch_random_100['name'].apply(standardise)

In [None]:
ch_random_100 = pd.read_csv('datasets/sample_of_100_websites_2.csv', index_col = 0)

In [None]:
ch_random_100.head()

In [None]:
for index, row in ch_random_100.iterrows():
    print(row[0], row[1])

In [None]:
# Retrieve websites for each company with a for loop
i = 1
for index, row in ch_random_100.iterrows():
    ch_random_100.loc[index,'website_4'] = return_website(row[0],row[1])
    print(str(i) + ' - ' + str(row[0]) + ' - ' + str(row[1]) + ' - ' + str(row[6]))
    i += 1
    sleep(0.01)

In [None]:
ch_random_100['website_4'].value_counts() # 18 websites out of 100, lose some cases, gain others

In [None]:
ch_random_100.to_csv('sample_of_100_websites_3.csv') # save results

### Testing on NYER LEP

Import postcode lookups for the North Yorkshire and East Riding LEP from the National Postcode Lookup Table

In [None]:
nyer_postcodes = pd.read_csv('nyer_postcode.csv')

In [None]:
nyer_postcodes.head() # three postcodes to lookup for 

Let's do an inner join from the CH data to the NYER postcodes on the POSTCODE_PCDS column.

In [None]:
ch_nyer = pd.merge(ch_pc, nyer_postcodes, how='inner', left_on='postcode', right_on='POSTCODE_PCDS')
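
An inner join on postcode only matches when both sides use exactly the same formatting. A sketch of normalising postcodes before the merge, assuming the lookup table uses upper case with a single space before the last three characters (as in 'YO21 3DQ') and that null postcodes have already been excluded. The function name `normalise_postcode` and the two small dataframes are hypothetical:

```python
import pandas as pd

def normalise_postcode(postcode):
    """Upper-case a UK postcode and put a single space before the last 3 characters."""
    compact = ''.join(str(postcode).split()).upper()
    return compact[:-3] + ' ' + compact[-3:]

companies = pd.DataFrame({'name': ['A LTD', 'B LTD'], 'postcode': ['yo21 3dq', 'SE17PB']})
lookup = pd.DataFrame({'POSTCODE_PCDS': ['YO21 3DQ'], 'LEP1': ['E37000039']})

companies['postcode'] = companies['postcode'].apply(normalise_postcode)
# validate='m:1' raises if a postcode appears more than once in the lookup table
joined = pd.merge(companies, lookup, how='inner', left_on='postcode',
                  right_on='POSTCODE_PCDS', validate='m:1')
print(joined)
```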

In [None]:
ch_nyer.shape # 12,486 companies. Similar to what we're seeing in CSAT.

In [None]:
ch_nyer.loc[:,'website'] = np.nan # set up blank column (np.NAN alias was removed in NumPy 2.0)
ch_nyer

In [42]:
ch_nyer_1000 = ch_nyer.loc[:999,].copy()

In [43]:
ch_nyer_1000

Unnamed: 0,name,crn,address1,address2,postTown,county,country,postcode,category,status,...,POSTCODE_PCDS,LAT,LONG,LEP1,LEP2,COUNTY_CODE,COUNTY_DIST_CODE,WARD_CODE,website,social_media
0,"""STREONSHALH"" WHITBY LIMITED",01636551,"""STREONSHALH""",KHYBER PASS,WHITBY,NORTH YORKSHIRE,,YO21 3DQ,"PRI/LTD BY GUAR/NSC (Private, limited by guara...",Active,...,YO21 3DQ,54.488915,-0.615535,E37000039,,E10000023,E07000168,E05006340,http://www.streonshalh.co.uk/,set()
1,'OUT OF THIS WOLD' LTD,10553612,MILLHOLME FARM HOUSE,SPEETON STREET,FILEY,,UNITED KINGDOM,YO14 9TG,Private Limited Company,Active,...,YO14 9TG,54.155893,-0.239788,E37000039,,E10000023,E07000168,E05006327,Website not found,Not found
2,CLIFF TOP LTD,04623454,"MILLHOLME FARM HOUSE, SPEETON",FILEY,YORKSHIRE,,,YO14 9TG,Private Limited Company,Active,...,YO14 9TG,54.155893,-0.239788,E37000039,,E10000023,E07000168,E05006327,Website not found,Not found
3,01722212 LIMITED,01722212,HILL VIEW HOUSE,"CORNBOROUGH ROAD, SHERIFF HUTTON",YORK,NORTH YORKSHIRE,,YO60 6QJ,Private Limited Company,Active,...,YO60 6QJ,54.091054,-1.009659,E37000039,,E10000023,E07000167,E05006313,Website not found,Not found
4,WORLD WIDE SHOPPING MALL LIMITED,03307834,HILL VIEW HOUSE,"CORNBOROUGH ROAD, SHERIFF HUTTON",YORK,NORTH YORKSHIRE,,YO60 6QJ,Private Limited Company,Active,...,YO60 6QJ,54.091054,-1.009659,E37000039,,E10000023,E07000167,E05006313,http://www.worldwideshoppingmall.co.uk/,{'https://plus.google.com/110394507306641485115'}
5,01AA LIMITED,10017240,THE ROYAL HUNTING LODGE SHIPTON BY BENINGBROUGH,SHIPTON BY BENINGBROUGH,YORK,NORTH YORKS,UNITED KINGDOM,YO30 1BD,Private Limited Company,Active,...,YO30 1BD,54.051810,-1.166750,E37000039,,E10000023,E07000164,E05009677,Website not found,Not found
6,1 VOYAGE LIMITED,04683525,6 FEVERSHAM ROAD,,HELMSLEY,NORTH YORKSHIRE,,YO62 5HN,Private Limited Company,Active,...,YO62 5HN,54.250017,-1.058227,E37000039,,E10000023,E07000167,E05006302,http://www.voyagecare.com/,"{'https://twitter.com/voyagecare', 'https://ww..."
7,INSITE TRAINING & RESPONSE SOLUTIONS LIMITED,09132004,6 FEVERSHAM ROAD,,HELMSLEY,NORTH YORKSHIRE,,YO62 5HN,Private Limited Company,Active,...,YO62 5HN,54.250017,-1.058227,E37000039,,E10000023,E07000167,E05006302,Website not found,Not found
8,1-4 MUSEUM TERRACE (SCARBOROUGH) MANAGEMENT CO...,06759604,62-63 WESTBOROUGH,,SCARBOROUGH,NORTH YORKSHIRE,,YO11 1TS,Private Limited Company,Active,...,YO11 1TS,54.278977,-0.408069,E37000039,,E10000023,E07000168,E05006317,Website not found,Not found
9,15 ABERDEEN STREET FREEHOLD MANAGEMENT LIMITED,06629449,62 - 63 WESTBOROUGH,,SCARBOROUGH,NORTH YORKSHIRE,,YO11 1TS,Private Limited Company,Active,...,YO11 1TS,54.278977,-0.408069,E37000039,,E10000023,E07000168,E05006317,Website not found,Not found


In [45]:
# Retrieve websites for each company with a for loop
for index, row in ch_nyer_1000.iterrows():
    if pd.notnull(row[28]): # skip rows that already have a website
        continue
    ch_nyer_1000.loc[index,'website'] = return_website(row[0],row[7]) 
    print(str(index) + ' - ' + str(row[0]) + ' - ' + str(row[7]) + ' - ' + str(ch_nyer_1000.loc[index,'website']))
    sleep(0.001)

In [None]:
ch_nyer_1000['website'].value_counts() # 3,176 websites found for 12,486 (25%), some of these are duplicates

In [None]:
ch_nyer_1000.to_csv('nyer_with_websites.csv')
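The loop above skips rows that already have a website, which lets an interrupted run be resumed. A minimal sketch of the same pattern follows, using column names instead of positions so that it keeps working when columns are added later. `fake_lookup` is a hypothetical stand-in for `return_website` that makes no network calls, and the data are made up.

```python
import pandas as pd
from time import sleep

def fake_lookup(name, postcode):
    """Hypothetical stand-in for return_website; no network access."""
    return 'http://example.com/' if name.startswith('A') else 'Website not found'

df = pd.DataFrame({
    'name': ['ALPHA LTD', 'BETA LTD', 'AVOCET LTD'],
    'postcode': ['YO1 1AA', 'YO2 2BB', 'YO3 3CC'],
    'website': ['http://already-found.example/', None, None],
})

for index, row in df.iterrows():
    if pd.notnull(row['website']):   # filled in on a previous run, so skip
        continue
    df.loc[index, 'website'] = fake_lookup(row['name'], row['postcode'])
    sleep(0.001)                     # small pause between requests

print(df['website'].tolist())
```

Saving the dataframe periodically inside the loop would make the resume behaviour more useful for long runs, at the cost of extra disk writes.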

#### Test with improved accuracy (26/03)

In [113]:
# import previous dataset that contained websites and social media accounts using old method
ch_nyer = pd.read_csv('datasets/nyer_with_websites_sm.csv', index_col = 0)

In [114]:
ch_nyer.head(30)

Unnamed: 0,name,crn,address1,address2,postTown,county,country,postcode,category,status,...,POSTCODE_PCDS,LAT,LONG,LEP1,LEP2,COUNTY_CODE,COUNTY_DIST_CODE,WARD_CODE,website,social_media
0,"""STREONSHALH"" WHITBY LIMITED",1636551,"""STREONSHALH""",KHYBER PASS,WHITBY,NORTH YORKSHIRE,,YO21 3DQ,"PRI/LTD BY GUAR/NSC (Private, limited by guara...",Active,...,YO21 3DQ,54.488915,-0.615535,E37000039,,E10000023,E07000168,E05006340,http://www.streonshalh.co.uk/,set()
1,'OUT OF THIS WOLD' LTD,10553612,MILLHOLME FARM HOUSE,SPEETON STREET,FILEY,,UNITED KINGDOM,YO14 9TG,Private Limited Company,Active,...,YO14 9TG,54.155893,-0.239788,E37000039,,E10000023,E07000168,E05006327,Website not found,Not found
2,CLIFF TOP LTD,4623454,"MILLHOLME FARM HOUSE, SPEETON",FILEY,YORKSHIRE,,,YO14 9TG,Private Limited Company,Active,...,YO14 9TG,54.155893,-0.239788,E37000039,,E10000023,E07000168,E05006327,Website not found,Not found
3,01722212 LIMITED,1722212,HILL VIEW HOUSE,"CORNBOROUGH ROAD, SHERIFF HUTTON",YORK,NORTH YORKSHIRE,,YO60 6QJ,Private Limited Company,Active,...,YO60 6QJ,54.091054,-1.009659,E37000039,,E10000023,E07000167,E05006313,Website not found,Not found
4,WORLD WIDE SHOPPING MALL LIMITED,3307834,HILL VIEW HOUSE,"CORNBOROUGH ROAD, SHERIFF HUTTON",YORK,NORTH YORKSHIRE,,YO60 6QJ,Private Limited Company,Active,...,YO60 6QJ,54.091054,-1.009659,E37000039,,E10000023,E07000167,E05006313,http://www.worldwideshoppingmall.co.uk/,{'https://plus.google.com/110394507306641485115'}
5,01AA LIMITED,10017240,THE ROYAL HUNTING LODGE SHIPTON BY BENINGBROUGH,SHIPTON BY BENINGBROUGH,YORK,NORTH YORKS,UNITED KINGDOM,YO30 1BD,Private Limited Company,Active,...,YO30 1BD,54.05181,-1.16675,E37000039,,E10000023,E07000164,E05009677,Website not found,Not found
6,1 VOYAGE LIMITED,4683525,6 FEVERSHAM ROAD,,HELMSLEY,NORTH YORKSHIRE,,YO62 5HN,Private Limited Company,Active,...,YO62 5HN,54.250017,-1.058227,E37000039,,E10000023,E07000167,E05006302,http://www.voyagecare.com/,"{'https://twitter.com/voyagecare', 'https://ww..."
7,INSITE TRAINING & RESPONSE SOLUTIONS LIMITED,9132004,6 FEVERSHAM ROAD,,HELMSLEY,NORTH YORKSHIRE,,YO62 5HN,Private Limited Company,Active,...,YO62 5HN,54.250017,-1.058227,E37000039,,E10000023,E07000167,E05006302,Website not found,Not found
8,1-4 MUSEUM TERRACE (SCARBOROUGH) MANAGEMENT CO...,6759604,62-63 WESTBOROUGH,,SCARBOROUGH,NORTH YORKSHIRE,,YO11 1TS,Private Limited Company,Active,...,YO11 1TS,54.278977,-0.408069,E37000039,,E10000023,E07000168,E05006317,Website not found,Not found
9,15 ABERDEEN STREET FREEHOLD MANAGEMENT LIMITED,6629449,62 - 63 WESTBOROUGH,,SCARBOROUGH,NORTH YORKSHIRE,,YO11 1TS,Private Limited Company,Active,...,YO11 1TS,54.278977,-0.408069,E37000039,,E10000023,E07000168,E05006317,Website not found,Not found


In [115]:
ch_nyer['website_2'] = np.nan # np.NAN alias was removed in NumPy 2.0
ch_nyer['sm_2'] = np.nan

In [105]:
for index, row in ch_nyer.iterrows():
    print(standardise(row[0]))

streonshalhwhitby
outofthiswold
clifftop
01722212
worldwideshoppingmall
01aa
1voyage
insitetrainingresponsesolutions
14museumterracescarboroughmanagementcompany
15aberdeenstreetfreeholdmanagement
15almasquaremanagementcompany
156castleroad
2almacourtmanagementco
272earlsfieldroad
31and33vesperroadrtmcompany
35trafalgarsquare
6blenheimterracemanagement
admotorservicesscarborough
ajstreffordbeverley
ajstreffordscarborough
absolutezero
alexandercourtflatsmanagementco
alnemewsmanagementcompany
andrewcowenestateagent
arrasrestaurants
attenboroughhotelscarborough
auto66racing
autonomoussystems
ayckbournchaptersmanagementcompany
basrosdevelopments
bellehairstudio
belvederemansionsmanagement
boatmanstavern
boydsmillmanagementcompany
breckonservices
bromptonholidayflats
broommillsmanagementcompany
browconhomes
catherinegee
cdvsolutions
centaurhousemanagement
cftsolutions
cheungspropertygroup
chilternseastyorkshire
cholso
cigiit
cityworth
cliffbridgeplacemanagement
cngfoodserviceequipment
cngtec

inspireandenable
insurebefore
ixura
jcfabrications
jetblackofwhitby
jjhughes
johnhodgson
justinwaringbuildingroofing
justinwaringproperties
kpconsultingandproperties
keithlocker
kennelminerals
lmminerals
lawrencecross
lawsongeophysicalaviationuk
lawsonhobbs
lopanesquetranslationservices
mjrigservices
mcengineeringservices
marinak
mariondalefisheries
martinnarey
mhminvestmentadvisersllp
misterchipsholdings
misterchipsproperty
mittenhillminerals
mjlmickley
moutreyswhitby
mrduffpotashco
mwcf
natureslaboratorymanufacturing
nbv
newtonarchitectsdesigners
northeastecigs
northeastfishingvessels
og2016
pprudomson
parkesbuilders
parkesconstructionwhitby
partridgenestcottage
philipburleywhitby
pkswhitby
plumbtechcleveland
prestigeperformanceyorkshire
rjcleaningandmaintenance
rdattridgebuilders
rgcanasolidfuels
rstjay
rdlscaffolding
residentsof11brunswickmews
residentsof82ruswarplane
rhpsafety
robindrilling
roseengineeringwhitby
rubytuesdayswhitby
rursusproperties
rwprojectservices
ryedalesubseaen

allseasonswholesale
ambrosiavillamanagement
baycourtsmanagementfileylimited
bridlingtonescape
churchclifffarmmanagementcompany
deepdenefileymanagementcompany
eastcoastflatsfiley
fileychapelmanagementcompany
fileycommunitysportsclub
hrf4hr
hrf4property
hunmanbyhalloldhall
hunmanbyhallsouth
jcsolutions
newtoncourtfileymanagementcompany
normantonrisemanagementcompany
royalcrescentcourtbridlington
southcrescentcourt
swisscottagecourtmanagement
universalenglishlanguageservices
xennochangemanagement
36esplanaderoadscarboroughmanagementco
36rutlandstreet
37holdings
ageckouk
bearexports
colourheroes
colourhistory
easiexportsllp
easirecyclingservices
ebidesign
limecoast
ukcasemanagement
vestigoprivateinvestigations
38noelscourtmanagementcompany
3edgesworkplace
armstrongsafetymanagement
chronosaeon
gstassociates
jobstopper
keenengineering
nastavnik
pluckmusicagency
stocktonelectricalservices
tlhcontractors
workplaceadvantage
4churchsquarepropertymanagement
7churchsquarewhitbymanagementcompany
4m

autotest2009
adsellerslogistics
asclepiusmedicalsolutions
bcjordanproperties
baildonvillageestates
bendaure
bettybooptrucking
bpeyork
bradfordestatesmanagement
buttercupbranch
cjdholdings
cjdpropertiesbradford
castleburnpropertymanagement
cosfordhouse2014
cosfordhouseaccommodation
cosfordhouse
djoatescoscarborough
dysonblack
eyautomation
earfend
eastcoasthotels
excelsiorwoodmanagement
flatkeys
foneboothlogistics
freckleshairdressing
freightshuttle
g2integratedsecuritysolutions
grahamburchill
harrismechanicalservices
homefinderofbridlington
jjdrivingservices
jdmfreightservices
micrographics
micrographicsuk
nigelpicklestransport
oatesripleyandco
onestopbusinesstraining
paradiseleisureyork
propfindyorkshire
ringalinglogistics
taflooring
theorangetreerosedale
villageestatestitanic
willingtransport
adaisyroomdesignespeciallyforyou
almondsonhaulage
daisyroomdesigns
aebanksandson
aglamondservices
agoodallbuildersandcontractors
amshipmansons
agplantholdings
agplant
aroundandaboutdistribution
a

piervale
venturemanagedprojects
agrikit
agristructure
agrivation
cfast
digitalorthodonticsolutions
smilefast
ukautoelectrics
ukscan
agriwashuk
amscottservices
sandhuttongrowers
shelldrake
theagilerequirementscompany
agronomicservices
bedaleelectrical
bedaleplumbingandheating
bedaleprintshop
blacksheeptraders
bridgefieldsgroup
britmins
chapelsideconsulting
cockburnbutchers
connectiontrainingequine
consortiumfinancialandwealthmanagement
coorecom
corporateallianceagainstdomesticviolence
craggsb
creativemanagementandbusinesssolutions
ddrelectrical
ddrsolar
doughtyplant
draftstructuraldetailing
dspb
emotionalhealth
etruscany
finoteca
gavinsmithbuildingandlandscaping
h10consulting
hadleyhounds
hallsonelectricalcontractors
hannahrussell
hazelbrowevents
holmedalepreschool
humannurture
ianatkinsonhaulage
jfhudson
janlinsleyeducationconsultancy
jeremymoondecoratorandpaperhanger
juliewalkerpathology
keithbattersbyjoinery
kteastearoom
mhfencingandgates
mlsportsandfitness
mlsportscoaching
moorcroft

stokesleydeli
timbrownson
andamconstruction
andarfinancingmarketingservices
jolbyproperties
andersonthompsonproperties
carterjamesassociates
oneaonlineestateagents
woollypigbaconcompany
andersonbrownwhitby
andersonbuilderswhitby
misterchips
plaxtonbuildingplumbing
whitbytownfootballclub
andersoncrawlercranehire
andersonheavycranes
andersonsnorthernengland
farshires
anderwoodproperties
andesmarqualityservices
andhar
andrewbowles
fettled
scienceequip
andrewbrepistainedglass
stainedglasscentre
andrewcrossosteopathicclinic
changingtowardsexcellence
andrewfeatherstonehomemaintenance
grosmontcrossingclub
thegeallgallery
andrewgadsby
arthursland
fortiusconsultingalgeria
fortiusconsulting
andrewhutchinsontransport
davinalovegreencatering
andrewmilburnjoinery
andrewporrittelectricalservices
bronaghhannon
decosemosolutions
andrewrichardsontransport
hutchcassidy
andrewruddickconsulting
andrewstampofstokesleylandscapes
stokesleycommercialservices
andrewsofscarborough
dwmotorcycles
scarboroughcarte

barkerchickens
georgewbarker
barkersnorthallerton
barkislandhomes
britishmohairmarketing
dailycare
secretspot
thepictureplace
wykehamtearooms
barleysyard1958
barlowwilson
bumpdale
barnacles
catchlord
theaytonwatercompany
barnardcastleaviation
enmint
goldcrestconsultants
kdapestcontrol
wbconstuctionnortheast
barrettsolutions
blinkini
westgarthfarm
barrettsbutchers
barriestaveley
barriesices
bartlettgroundcare
bartonrichmondshireplayingfield
bartoncommercialtyres
gcsjohnsonholdings
gcsjohnsonstoragehandling
kneeton
ljc
synergygroupne
bartonshelmsley
cinnamontwist
nicethingshelmsley
pottergatemewsmanagementcompanyno2
barwickhealthsafetyproductionmaintenance
bassmegeve
engageforaction
freedomsnowsports
rytonconsultancy
yorkshireoutdoors
yorkshireshootingschool
bassetlaweducationalconsultancy
jpldrawingservices
bastowco
jonboltonconsulting
bathroomandwetroomspecialists
ianwalters
mcjdesignbuild
battersbyjunctioncommunityassociation
battlefieldstud
burythorpeholdings
burythorpehousehotel
bax

castlehowardinvestments
coneysthorpetrustees
howardtrustees
thearboretumtrusttradingcompany
bridgedaleinvestments
yorkshirewoldssolutions
bridgeheadhousing
bridgesealproducts
mclaneydevelopments
mclaneylettings
brigantium
brightbeginningsdaycare2007
brightbluedigitalmarketing
brightpathishayascic
brightsteels
brillsoft
brillsoftwarellp
ksoftnorthern
ksoftllp
ksoftsweden
lettfaktura
brindlefood
pacemedical
britanniabio
britsrock
givemeachance
thepersonalpotentialpartnership
brittondesignconsultants
diproduce
idpsurveying
lewisconsultants
lewissurveyingassociatesyorkshire
broadacreshousingassociation
broadacresservices
broadacresservicesmanagement
leeminggatemanagementcompany
marcusrichardsonenvironmentalservices
marketgateresidentialmanagementcompany
mowdencontrols
mulberryhomesyorkshire
mulberrylivingyorkshire
sjbrecycling
theelectricalrecyclingcompany
theoakssowerbymanagementcompany
thestablesnorthallertonmanagementcompany
toddwastemanagementgroup
toddwastemanagement
toddpak
yorwaste


clarionhomes
clarksbakerseasingwold
clarksdyeworks
retfordhallsgrounds
victoriacourtmanagementfiley
clarksrestaurant
clarkeprojects
clarkesenvironmental
countycontractorsyorkshire
hclarkesons
harryclarkeholdings
sweetmoments
wensleydaleheating
workplacesafesolutions
clarksplanthirecontractors
clarksondairyservices
clarkysproperties
eaglescliffegas
lakesideeggsllp
sallywalkerdressageplus
southteesmx
classicsportscarholdings
classicsportscar
gilead
maltoncoachworks
theviewsouth
classiccarriers
jddennissons
just1sourceandsupply
logisticsnortheast
sabell
seamerengineeringcompany
yorkshiredistributionlogisticsholdings
classicracewear
claxtonhallcottage
newittco
newittsofyork
clayfoxassociates
electralog
mowthorpeuk
theradiostation
ukradiosystems
claypennyconsulting
cleamondsoftware
cleanbreak
kjsmfrench
cleanflodrinkers
fieldtoforkgame
clearandsimpletaxandbusinessservices
kurtsmobilecaravanservices
lydan
simplifyconstruction
stlinternationaluk
clearbrewnorthyorkshire
equestrianproducts
clea

tspinksons
zeevisserijbedrijfandreabv
danbymanagementservices
danesdalecommercialservices
danielradoi
danielwarringtonpropertyrentals
theviewmanagementcompanythorntondale
thorntonheightsmanagementcompany
wwestatesthorntondale
wwestates
dannycameron
dannyconnellycleaning
sontech
daphaworth
darkblue
johnskaife
tjrsurveys
darlingtonbrewinganddistillingcompany
kilderkinns
northbay
thefamousfirkin
darrenchandler
darrenferrey
darrenhowesautoservices
globalstocksystems
stuartsfoods
vanguardcaravanservices
datacomtechnology
stockdalegascare
daveraymondmobileagriculturalengineers
davecarterjoinery
fccountrywear
wheretheheartis
davelli
levenestates
davidaddisonfinancialservices
davidchadwickdesignservices
davidchalmersphotography
latinoheatscarborough
davidclintengineering
davidclose
davidfogginconsulting
davidgallsolicitors
davidharrisondevelopments
dhgroup
dhgroupmalton
jmpackaging
onlinekitchenware
sjcmaltondevelopments
davidhillagriculturalcontractor
tonyhillenterprises1962
davidhunt
davidke

wrbggroup
externalbusinessreviewebr
hopscotchearlyyearsconsultancy
multileadmedia
extremeenterprise
eyesthetic
fabroadwithsons
fmikewillis
spareleg
fmachinsons
jelsye
machinyorkshirelamb
stonegraveconstruction
stonegraveproperties
svgpropertiesllp
vfourteen
ftconstructiongroupholdings
ftconstruction
fordyfarmsingleby
fordythompsonholdings
georgefordyson
georgefordypropertiesholdings
georgefordyproperties
langtonsnorthallerton
tomwilloughby
walterthompsoncontractors
fwlythbuildingcontractors
fabraweld
fabtekkengineering
howardyorkshire
indiclothingcompany
facetofaceresearch
ivyconstruction
jibrownbuildingservices
professionalmortgagesourcing
whitbyshire
facialrecognitionservices
marckerr
swainbyselfstorage
theluxetradingcompany
trinitydesigns
fairdrive
filproconsultancy
hwoodsonsleyburn
leyburnpets
miad
niaddevelopments
serendipityinteriors
serendipitysaltaire
fairhurstscatering
glowdry
middlehammotors
fairlambremoteservices
fairwaysports
fairx
fairydustcleaningserviceswhitby
fallingfos

resimac
sinderbystainless
thetubecreationco
thirskbodyshopservicecentre
tubecreationthirsk
unishowerdistribution
unishower
honeybournedevelopments
honeybournewest
honeymansbutchers
honeypotinn
honeypotsdodsworthhall
honeysuckleminerals
hooklinesinkerwhitby
theendeavourwhitby
thefleecewhitby
whitbyindianmoments
hopkinsonandsons
normangray
rvroger
ryedalecontainerhire
wallansonbuilding
hoppersremovals
horleysyorkshire
kdawsonengineeringsolutions
hornerbuildingjoinery
hosthousing
hotmamahealthandfitness
hottoddi
patersonsofthirsk
hotelphoenix
hovinghamdevelopments
howardrussellelectricalplumbing
howeestatescompany
howieproperty
onenorthproperty
howshampowerco
renewableheritagetrust
straightrange
themarketplaceofmalton
yorkshireseniorcare
hssilver
sprayshack
hsgengineering
ns
htenergy
htdfiley
huddengineering
huddersfieldgauges
hudsondesignconstruction
hugoparceldelivery
jlmfoodsuk
jlmfoodsassociates
humblesrealestate
pipelineengineeringsupplyco
provincialinns
humeelectricalserviceswhitby


marshallpropertiesprojects
minsterindustrialproperties
minsterproperties
minsterpropertiesspacetowork
smithprentice
trafalgarsquarescarborough
martelnorthern
martialfitllp
martyn
martynsmith
maryculterfoods
mashamsheepbreedersassociation
tanfieldlodgedevelopments
thearchimedesscrewcompany
yorkshirehydroliabilitypartnership
masonmasters
northyorkmoorshistoricalrailwaytrust
northyorkshiremoorsrailwayenterprisesplc
whitbyandpickeringrailway
masonventuresgroup
meeksfarm
swan10
maste
mathewwebsterhandengraver
smallbumpcom
towbarexpress
maxcoatesmotorsportsandmarketing
maxfieldminerals
rawfarmpotash
sensibleproducts
thevictoriahotelrhb
maxos
maxwellevansestateagents
mayfairresidentialcarehome
mayfieldsrentals
maystrading
thirskcommunitybistroscic
mbmechanicalserviceswhitby
passcaffolding
peterbrowndecorator
mbsubseaservices
mbwconsultancy
mcmedics
mccainfinancegb
mccainfoodsgb
mccainukh1
mccainukh2
mcfoyorkshire
mcintyremeats
theburgerqueen
mcluckieprojects
yorkshirecountryholidays
mcsmanage

rollingrestorations
vickyannisphysiotherapy
romatrading
rondor
roofwrap
rootwise
roperscaravanworldcatterick
ropersleisure
ropersproperty
ropner
thorpperrowarboretum
rorykemp
rosebunch
roseberrycontractservices
ssoffshoreservices
rosedaleconsulting
rosedaledevelopments
rosewoodcourtresidentsmanagementcompany
uniquegiftsinternational
rosewooddentalclinic
rosspaxton
rougeinvestments
totalstainless
rountonsvillagehall
rowanmanagementco
rowantreegardendesign
taalco
rowellsgentshairdressers
rowlandjayneproperties
stubbsinspectionservices
rpscivilengineering
rsgildroy
walterboyeswhitby
rsgconsultants
rssyorkshire
rtwroadrestoration
yorkshireweddingbarn
rubyriversboutique
rueburyconsultants
tristansillars
rugbypropertyinvestments
rumblingtums
rupertdruryco
russellskirbymoorside
rustyshears
trattles
ruswarpminiaturerailway
turnerdale
ruthgordonassociates
rutlandrtmcompany
rvelectricalservicesscarborough
rwconsultancyservices
rwsbodyworks
teamengineeringne
yorkshirevanlease
ryecatenterprises
ry

In [126]:
ch_nyer.loc[index,'website_2']

'Website not found'

In [125]:
# Retrieve websites for each company with a for loop
for index, row in ch_nyer.iterrows():
    if pd.notnull(row[30]): # skip rows that already have a website_2 value
        continue
    ch_nyer.loc[index,'website_2'] = return_website(standardise(row[0]),row[7]) 
    ch_nyer.loc[index, 'sm_2'] = str(get_social_media(ch_nyer.loc[index,'website_2']))
    print(str(index) + ' - ' + str(row[0]) + ' - ' + str(ch_nyer.loc[index,'website_2']) + ' - ' + str(ch_nyer.loc[index,'sm_2']))
    sleep(0.001)

In [128]:
ch_nyer.to_csv('ch_nyer_website_sm_2.csv')

## Scraping social media accounts 

Now that we have the websites from Google Places, we can scrape each one for links to social media accounts and other sites of interest (e.g. TripAdvisor), and then scrape activity and follower information from those accounts.

Let's import our NYER dataset from the previous section.

In [14]:
import os
cwd = os.getcwd()
print(cwd)

/Users/dataexploitationmac1/Desktop/Faisal/companies-house


In [None]:
nyer = pd.read_csv('nyer_with_websites.csv', index_col=0, dtype={'crn': str}) # keep crn as text so leading zeros survive

In [None]:
nyer_websites = nyer[nyer['website'] != 'Website not found'] # select those with websites

In [None]:
nyer_websites.head(10)

We choose 'AUTO 66 RACING LIMITED' (CRN 00980520) as the test case because its website links to social media accounts.

In [None]:
auto = nyer_websites[nyer_websites['crn'] == '00980520']

In [None]:
auto

In [None]:
print(auto.website.values) # website

A Python package exists to extract social media links: https://pypi.python.org/pypi/extract-social-media/0.4.0. Regular expressions can also be used: https://github.com/lorey/social-media-profiles-regexs. We'll test the package for now.

In [15]:
# Code copied and pasted from above link
import requests
from html_to_etree import parse_html_bytes
from extract_social_media import find_links_tree # install first with setup.py from link above

In [16]:
res = requests.get('http://www.oliversmountracing.com/')
tree = parse_html_bytes(res.content, res.headers.get('content-type'))

In [17]:
set(find_links_tree(tree))

{'https://twitter.com/auto66racing',
 'https://www.facebook.com/oliversmountracing',
 'https://www.instagram.com/oliversmountracing/',
 'https://www.youtube.com/user/oliversmountracing'}

Wrapping this up in a function

In [96]:
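def get_social_media(url):
    '''
    Retrieve social media accounts from a company website
    Input: string of URL
    Output: set of links, None if the page contains no social media links,
            or 'Not found' if the page could not be fetched or parsed
    '''
    try:
        res = requests.get(url)
        tree = parse_html_bytes(res.content, res.headers.get('content-type'))
        links = set(find_links_tree(tree)) # collect the links once
    except Exception: # e.g. url is 'Website not found', or a network error
        return 'Not found'
    return links if links else None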

In [112]:
get_social_media(ch_nyer.loc[26,'website_2'])

{'https://twitter.com/auto66racing',
 'https://www.facebook.com/oliversmountracing',
 'https://www.instagram.com/oliversmountracing/',
 'https://www.youtube.com/user/oliversmountracing'}

In [111]:
get_social_media('http://www.oliversmountracing.com/')

{'https://twitter.com/auto66racing',
 'https://www.facebook.com/oliversmountracing',
 'https://www.instagram.com/oliversmountracing/',
 'https://www.youtube.com/user/oliversmountracing'}
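As a rough cross-check on the package, the regex route mentioned earlier can be sketched with the standard library alone. The pattern below is a simplified assumption written for this example (it is not taken from the linked regex repository) and covers only four platforms. The sample HTML is made up, reusing two of the links found above.

```python
import re

# Simplified, assumed pattern: matches links to four well-known platforms only
SOCIAL_PATTERN = re.compile(
    r'https?://(?:www\.)?(?:twitter|facebook|instagram|youtube)\.com/[^\s"\'<>]+',
    re.IGNORECASE,
)

def find_social_links(html):
    """Return the set of social media URLs found in a string of HTML."""
    return set(SOCIAL_PATTERN.findall(html))

sample_html = '''
<a href="https://twitter.com/auto66racing">Twitter</a>
<a href="https://www.facebook.com/oliversmountracing">Facebook</a>
<a href="http://www.oliversmountracing.com/contact">Contact</a>
'''
links = find_social_links(sample_html)
print(links)
```

A hand-rolled pattern like this will miss other platforms and will also pick up share buttons or embedded posts, which is part of the reason for testing the package first.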

In [None]:
nyer_websites.website

Running this on the full dataset (this took about an hour):

In [None]:
nyer['social_media'] = nyer.website.apply(get_social_media)

### Getting twitter followers

In [None]:
nyer_websites.head()

In [None]:
nyer.social_media.head(100)

In [None]:
nyer_sm = nyer[nyer['social_media'].notnull() & (nyer['social_media'] != 'Not found')] # select rows where social media links were found

In [None]:
nyer_sm.social_media

In [None]:
nyer_sm.social_media.isnull().sum()

In [None]:
nyer.head()

In [None]:
nyer.to_csv('nyer_with_websites_sm.csv')
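Once the results are saved to CSV, the sets of links come back as strings (for example `"set()"` or `"{'https://twitter.com/...'}"`), as the earlier table output shows. Before any follower counts can be looked up, the Twitter handles need to be recovered from those strings. The sketch below shows one way this might be done. `parse_links` and `twitter_handles` are hypothetical helpers, and fetching the follower counts themselves is not covered here.

```python
import ast
from urllib.parse import urlparse

def parse_links(cell):
    """Turn a stored cell such as "{'https://twitter.com/x'}" back into a set."""
    if not isinstance(cell, str) or cell in ('Not found', 'set()', 'None'):
        return set()
    try:
        value = ast.literal_eval(cell)   # safely evaluate the set literal
    except (ValueError, SyntaxError):
        return set()
    return set(value) if isinstance(value, (set, list, tuple)) else set()

def twitter_handles(links):
    """Extract Twitter handles from a collection of URLs."""
    handles = set()
    for link in links:
        parsed = urlparse(link)
        if parsed.netloc.lower() in ('twitter.com', 'www.twitter.com'):
            handle = parsed.path.strip('/').split('/')[0]
            if handle:
                handles.add(handle)
    return handles

cell = "{'https://twitter.com/auto66racing', 'https://www.facebook.com/oliversmountracing'}"
handles = twitter_handles(parse_links(cell))
print(handles)
```

Applied to the dataframe, this could look like `nyer['social_media'].apply(parse_links).apply(twitter_handles)`, assuming the column holds the string form written by `to_csv`.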