# Last 500 CH Analysis

This notebook was prompted by a LinkedIn [post](https://www.linkedin.com/posts/greybrow53_i-explained-yesterday-at-our-companies-house-activity-6886977306439507968-utQ4) which discussed how shell companies could potentially be spotted by clustering the 500 most recently registered companies on [Companies House](https://www.gov.uk/government/organisations/companies-house) by address or SIC code.

This notebook aims to automate that process (for the last 500 companies registered yesterday) and display a network diagram of the clusters, which could be a useful foundation for further analysis. There are plenty of reasons why companies can be registered at the same address or operate in the same sector, so confirmatory OSINT is necessary.

In using this notebook, you will conduct the following steps:

- Retrieve yesterday's 500 most recently registered companies from Companies House
- Authenticate to the Companies House API (you will need your own API key)
- Style your chart so companies/addresses/directors/sic codes are different colours
- Filter the data for "interesting" nodes
- Display the network diagram



In [1]:
#!pip install -r requirements.txt #if you haven't done already

In [2]:
# imports the modules needed

import requests
import pandas as pd
from ipycytoscape import *
import json
import re
import chwrapper
from collections import Counter
from datetime import datetime
from datetime import timedelta

In [3]:
# get yesterday's date

yesterday = datetime.now() - timedelta(days=1)


## Retrieve yesterday's 500 most recently registered companies from Companies House

The cells below download yesterday's 500 most recently registered companies from Companies House and store them in a pandas DataFrame (similar to an Excel spreadsheet, if you are unfamiliar). If you wish to look at a different day, you can either:
- alter the cell above by changing the 1 in "timedelta(days=1)". Changing it to 7 gives the same day last week; changing it to 365 gives roughly the same day last year
- consult the datetime [documentation](https://docs.python.org/3/library/datetime.html)
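
As a rough illustration of the first option, the sketch below shows how different `days` values shift the date and how the result can be formatted as DD/MM/YYYY. The helper name `days_ago` and the fixed reference date are made up for this example; the notebook itself simply uses `datetime.now()`.

```python
from datetime import datetime, timedelta

def days_ago(days, now=None):
    """Return the date `days` days before `now` (defaults to the current time)."""
    now = now or datetime.now()
    return now - timedelta(days=days)

# Fixed reference date so the example is reproducible
reference = datetime(2022, 1, 27)

one_day_back = days_ago(1, reference)
one_week_back = days_ago(7, reference)

print(one_day_back.strftime('%d/%m/%Y'))   # 26/01/2022
print(one_week_back.strftime('%d/%m/%Y'))  # 20/01/2022
```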

In [4]:
incorporated = yesterday.strftime('%d%%2F%m%%2F%Y') # yesterday as DD/MM/YYYY, URL-encoded (%2F is "/"); used for both ends of the date range
csv = requests.get("https://find-and-update.company-information.service.gov.uk/advanced-search/download?companyNameIncludes=&companyNameExcludes=&registeredOfficeAddress=&incorporatedFrom=" + incorporated + "&incorporatedTo=" + incorporated + "&sicCodes=&dissolvedFrom=&dissolvedTo=")

In [5]:
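# Naive CSV parse: split the response into lines, then split each line on commas.
# Quoted fields that contain commas are split too, and splitting on '\n' leaves a trailing '\r' on the last column name.
df = pd.DataFrame([x.split(',') for x in csv.text.split('\n')])
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header #set the header row as the df header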
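
Splitting on commas is simple, but it breaks rows where a quoted field (a company name or an address, say) contains a comma. One possible alternative, sketched here on made-up sample data, is to let pandas parse the text. Note that the column names would then lose the trailing `\r` that the rest of this notebook relies on, so the later cells would need adjusting if you adopted it.

```python
import io
import pandas as pd

# Made-up sample in the same general shape as the download (not real Companies House data)
sample_text = (
    'company_name,company_number,registered_office_address\r\n'
    '"EXAMPLE, U.K. LIMITED",00000001,"1 High Street, London, AB1 2CD"\r\n'
    'ANOTHER LTD,00000002,"2 Low Road, Leeds, EF3 4GH"\r\n'
)

# dtype=str keeps the leading zeros in company numbers
sample_df = pd.read_csv(io.StringIO(sample_text), dtype=str)

print(sample_df.columns.tolist())
print(sample_df.loc[0, 'company_name'])  # EXAMPLE, U.K. LIMITED
```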

## Authenticate to the Companies House API (you will need your own API key)

You will need to insert your own access_token as a string (in quotation marks) to authenticate to the Companies House API. You can get one by signing up [here](https://developer.company-information.service.gov.uk/).
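
If you would rather not paste the key into the notebook, one common approach is to read it from an environment variable. The variable name `CH_API_KEY` and the helper `load_token` below are hypothetical choices for this sketch, not something the Companies House API or chwrapper requires.

```python
import os

def load_token(var_name="CH_API_KEY"):
    """Read the API key from an environment variable, failing loudly if it is missing."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your Companies House API key")
    return token

# Usage (assumes the variable was set before starting Jupyter):
# search_client = chwrapper.Search(access_token=load_token())
```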

In [9]:
search_client = chwrapper.Search(access_token='yourtokenhere')


## Style your chart so companies/addresses/directors/sic codes are different colours

my_style is a variable that controls the style of the chart at the bottom of the notebook. As it stands, address nodes are green, officer nodes are blue, sic_code nodes are yellow and company nodes are red.

Fonts and font sizes can also be changed by altering the font-family and font-size values.
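
Because the style entries in `my_style` differ only in selector and colour, they could also be generated from a mapping. This is only a sketch of that idea; `build_style` and `type_colours` are names invented here, and the result is meant to have the same shape as `my_style`.

```python
def build_style(type_colours, default_colour='red', font_family='arial', font_size='10px'):
    """Build a Cytoscape style list: one default node rule plus one rule per node type."""
    def rule(selector, colour):
        return {'selector': selector, 'style': {
            'font-family': font_family,
            'font-size': font_size,
            'label': 'data(label)',
            'background-color': colour}}

    # The plain 'node' rule is the fallback, so companies keep the default colour
    style = [rule('node', default_colour)]
    for node_type, colour in type_colours.items():
        style.append(rule(f'node[type = "{node_type}"]', colour))
    return style

type_colours = {'address': 'green', 'officer': 'blue', 'sic_code': 'yellow'}
generated_style = build_style(type_colours)
```

Changing a colour or the font then means editing one value rather than four near-identical blocks.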

In [7]:

my_style = [
    {'selector': 'node','style': {
        'font-family': 'arial',
        'font-size': '10px',
        'label': 'data(label)',
        'background-color': 'red'}},
    
    {'selector': 'node[type = "address"]','style': {
        'font-family': 'arial',
        'font-size': '10px',
        'label': 'data(label)',
        'background-color': 'green'}},
    
    {'selector': 'node[type = "officer"]','style': {
        'font-family': 'arial',
        'font-size': '10px',
        'label': 'data(label)',
        'background-color': 'blue'}},
    
    {'selector': 'node[type = "sic_code"]','style': {
        'font-family': 'arial',
        'font-size': '10px',
        'label': 'data(label)',
        'background-color': 'yellow'}}
    ]

## Filter the data for "interesting" nodes

The step below is the most complex series of operations. In summary, it does the following:

- Each node is assessed for importance based on the number of associations it has
    - An address is considered important if 2 or more companies are registered there
    - A SIC code is considered important if 10 or more companies have declared it as their "nature of business"
    - Any company linked to one of the above nodes is considered important
    - Any officer linked to the above company nodes is considered important
    
Don't worry too much if errors are printed while this step runs. They are hard to avoid, for the following reasons:

- I believe much of the data at Companies House has been manually entered at one point or another, so there are lots of errors in the underlying database. API calls are made to retrieve the officers of each of the 500 companies, so if a company number has been entered incorrectly or has ended up in the wrong column, an error will be raised. Rows can also become misaligned because the CSV is split on every comma, including commas inside quoted company names.

- The Companies House API allows 600 requests every 5 minutes. Each run makes one request per company, so running the step below more than once within 5 minutes will produce a lot of errors.

- If you experience authorisation header errors, check that your API credential is correct.
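
For the rate limit mentioned above, one way to stay under it is to pause between requests. 600 requests per 5 minutes works out at one request every 0.5 seconds on average, so the sketch below enforces a minimum gap between calls. The `Throttle` class is a hypothetical helper written for this example, not part of chwrapper.

```python
import time

class Throttle:
    """Block so that successive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# 600 requests / 300 seconds = 2 requests per second, i.e. 0.5 s between requests
throttle = Throttle(min_interval=300 / 600)

# Usage: call throttle.wait() immediately before each search_client.officers(...) call
```

This slows a full run of 500 companies to a little over four minutes, but it should avoid most rate-limit errors on repeat runs.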

You can adjust the thresholds for "interesting" in the filterer function below.
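
The importance rules above can be illustrated on a toy dataset. This is a simplified sketch rather than the notebook's own code: the rows are made up, officers are left out, and the thresholds are passed as parameters (`min_address` and `min_sic`, names chosen here) so the effect of changing them is easy to see.

```python
from collections import Counter

def interesting_nodes(rows, min_address=2, min_sic=10):
    """Return the addresses, SIC codes and companies that meet the importance thresholds."""
    address_counts = Counter(row['address'] for row in rows)
    sic_counts = Counter(row['sic'] for row in rows)

    addresses = {a for a, c in address_counts.items() if c >= min_address}
    sics = {s for s, c in sic_counts.items() if c >= min_sic}
    # A company is important if it is linked to an important address or SIC code
    companies = {row['name'] for row in rows
                 if row['address'] in addresses or row['sic'] in sics}
    return addresses, sics, companies

# Made-up rows for illustration only
rows = [
    {'name': 'A LTD', 'address': '1 High Street', 'sic': '62020'},
    {'name': 'B LTD', 'address': '1 High Street', 'sic': '47910'},
    {'name': 'C LTD', 'address': '9 Low Road', 'sic': '62020'},
]

addresses, sics, companies = interesting_nodes(rows, min_address=2, min_sic=2)
print(sorted(addresses))  # ['1 High Street']
print(sorted(sics))       # ['62020']
print(sorted(companies))  # ['A LTD', 'B LTD', 'C LTD']
```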

In [22]:
nodelist = []
edgelist = []

#for i,j in df.head(150).iterrows(): #uncomment this line to look at a smaller sample size
for i,j in df.iterrows():
    #print(j['company_number'])
    # One node each for the company, its registered address and its SIC code.
    # The address label is just the postcode(s) matched by the regex, joined into a single string.
    nodelist.append({"data": {"id": j["company_name"], "label": j["company_name"], "type": "company"}})
    nodelist.append({"data": {"id": j["registered_office_address\r"], "label": " ".join(re.findall("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}",j["registered_office_address\r"])), "type": "address"}})
    nodelist.append({"data": {"id": j["nature_of_business"], "label": j["nature_of_business"], "type": "sic_code"}})
    
    try:
        # Look up the company's officers; this is the API call that counts towards the rate limit
        response = search_client.officers(j['company_number'])
        r = response.json()
        for x in r['items']:
            nodelist.append({"data": {"id": x["name"], "label": x["name"], "type": "officer"}})
            edgelist.append({"data": {"source": j["company_name"], "target": x["name"]}})

        # Address and SIC code edges are only added if the officer lookup succeeded
        edgelist.append({"data": {"source": j["company_name"], "target": j["registered_office_address\r"]}})
        edgelist.append({"data": {"source": j["company_name"], "target": j["nature_of_business"]}})
    
    except Exception as e:
        # Malformed rows and rate-limit responses end up here; print the error and move on
        print(e)
    

    

mycount = Counter()
for edge in edgelist:
    mycount.update({str(edge['data']['source']): 1})
    mycount.update({str(edge['data']['target']): 1})

compaines_to_add = []

def filterer(c, k):
    #c is count
    #k is key
    result = ''
    
    if c >= 2 and k in df['registered_office_address\r'].tolist():
        result = 'yes'
        df2 = df.loc[df['registered_office_address\r'] == k]
        for cn in df2['company_name']:
            compaines_to_add.append(cn)
    
    if c >= 10 and k in df['nature_of_business'].tolist():
        result = 'yes'
        df3 = df.loc[df['nature_of_business'] == k]
        for cn in df3['company_name']:
            compaines_to_add.append(cn)
    
    return result
    
    
new_count = Counter({k: c for k, c in mycount.items() if filterer(c,k) == 'yes'})

newedges = []
newnodes = []

# An edge is kept once if either end is an "interesting" node or a company linked to one
interesting = set(new_count.keys()) | set(compaines_to_add)

for edge in edgelist:
    if edge['data']['source'] in interesting or edge['data']['target'] in interesting:
        newedges.append(edge)
    
# Collect the ids of every node that appears in a kept edge
temp = set()
for edge in newedges:
    temp.add(str(edge['data']['source']))
    temp.add(str(edge['data']['target']))

# Keep each matching node once (nodelist repeats shared addresses, SIC codes and officers)
seen = set()
for node in nodelist:
    if node['data']['id'] in temp and node['data']['id'] not in seen:
        seen.add(node['data']['id'])
        newnodes.append(node)


    
    
    
overall = {}
overall['nodes'] = newnodes

overall['edges'] = newedges

overall_json = json.loads(json.dumps(overall))
ipycytoscape_obj = CytoscapeWidget()
ipycytoscape_obj.graph.add_graph_from_json(overall_json)
ipycytoscape_obj.set_style(my_style)



404 Client Error: Not Found for url: https://api.companieshouse.gov.uk/company/%20U.K.%20LIMITED%22/officers?access_token=bdcf808a-e28b-4697-8f43-30d085f2f29c


## Create the chart

Running the step below may take a while, depending on your processing power and how much "interesting" data there is. If it has gone on for longer than a minute or the chart looks "weird", run the step again.

The chart below shows the output of the steps when run on 26/01/2022.

You can zoom in/out and drag and drop the nodes as you see fit.

If the chart is too busy, you can use a smaller sample by uncommenting the "for i,j in df.head(150).iterrows():" line above, commenting out the "for i,j in df.iterrows():" line, and adjusting the 150 as needed.

In [24]:
display(ipycytoscape_obj)

CytoscapeWidget(cytoscape_layout={'name': 'cola'}, cytoscape_style=[{'selector': 'node', 'style': {'font-famil…