# Join the Register of Overseas Entities companies to HMLR titles registered to overseas companies

Notebook to join the new [Register of Overseas Entities data](https://find-and-update.company-information.service.gov.uk/advanced-search/get-results?companyNameIncludes=&companyNameExcludes=&registeredOfficeAddress=&incorporationFromDay=&incorporationFromMonth=&incorporationFromYear=&incorporationToDay=&incorporationToMonth=&incorporationToYear=&sicCodes=&type=registered-overseas-entity&dissolvedFromDay=&dissolvedFromMonth=&dissolvedFromYear=&dissolvedToDay=&dissolvedToMonth=&dissolvedToYear=) to HM Land Registry (HMLR) titles registered to overseas companies. 

In other words, where possible, it joins the beneficial owners of property in England & Wales that is held by an overseas company to the details of that property.

Joining these two datasets isn't totally straightforward because Companies House hasn't yet asked ROE entities to specify their registered titles, and HMLR doesn't yet include the ROE identifier in its own data. There is no shared key, so we have to join on company name, which is only an approximate identifier.

Nevertheless we can match a reasonable number of titles - though anything matched should be double-checked. 
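The queries in this notebook match names exactly. As a rough sketch of how the name join could be loosened, the snippet below normalises names before comparing them. The `normalise_name` helper and the `SUFFIXES` table are hypothetical and not part of this notebook, and looser matching will also raise the number of false matches that need checking.

```python
import re

# Hypothetical suffix spellings to harmonise; extend after inspecting real data.
SUFFIXES = {"LTD": "LIMITED", "INC": "INCORPORATED", "CORP": "CORPORATION"}


def normalise_name(name):
    """Return a simplified, upper-case version of a company name for matching."""
    if name is None:
        return ""
    cleaned = name.upper().replace("&", " AND ")
    # Drop punctuation such as full stops, commas, apostrophes and brackets.
    cleaned = re.sub(r"[^A-Z0-9 ]", "", cleaned)
    # Expand common abbreviations and collapse repeated whitespace.
    words = [SUFFIXES.get(word, word) for word in cleaned.split()]
    return " ".join(words)


print(normalise_name("Cobstone Properties Ltd."))
```

Both datasets would need the same normalisation applied before the join, for example as an extra column on each table.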

This notebook was first run in November 2022, and most recently at the start of February 2023.

**How to use**

1. Make sure you have the latest versions of the following datasets loaded into Google BigQuery (or contact anna@centreforpublicdata.org if you want to take a copy of our BigQuery datasets):

- HM Land Registry [Overseas companies that own property in England and Wales](https://use-land-property-data.service.gov.uk/datasets/ocod).
- Companies House [Basic Companies Data](http://download.companieshouse.gov.uk/en_output.html)
- Companies House [Persons of Significant Control](http://download.companieshouse.gov.uk/en_pscdata.html) (PSC)

2. Set the `BQ_LR_PROJECT_ID` environment variable to your own BigQuery project ID, and update the table names in the constants below.

3. Enjoy!

**Please use any matched data responsibly, and don't republish or map data without removing residential properties.**

**Notes for users**

- Be aware that HMLR records legal title to property, and ROE is about beneficial title. So there is a conceptual difference between the two datasets.

**Attribution and contacts**
- This work is released under CC-BY-SA. If you publish work based on it, please attribute the Centre for Public Data and link to this GitHub repo.
- Please contact anna@centreforpublicdata.org with any questions or for a copy of the raw data.

In [16]:
import os
import pandas as pd
import pandas_gbq

In [17]:
PROJECT_ID = os.getenv('BQ_LR_PROJECT_ID')
OCOD_TABLE = "ocod.ocod_2023_01"
CH_BASIC_TABLE = "co_house.basic_2023_02"
CH_PSCS_TABLE = "co_house.pscs_2023_02"
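If the environment variable is missing, `PROJECT_ID` is `None` and the queries fail with a fairly opaque error. A small guard like the one below can catch that early. This is a sketch only: `table_ref` is a hypothetical helper and `my-project` is a placeholder, neither is used elsewhere in this notebook.

```python
import os


def table_ref(project_id, table):
    """Build a backticked `project.dataset.table` reference for BigQuery SQL."""
    if not project_id:
        raise ValueError("Set the BQ_LR_PROJECT_ID environment variable first")
    return "`%s.%s`" % (project_id, table)


# "my-project" is a placeholder used only so this illustration runs anywhere.
example_project = os.getenv("BQ_LR_PROJECT_ID") or "my-project"
print(table_ref(example_project, "ocod.ocod_2023_01"))
```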

## Baseline data: number of overseas companies registered as owning titles in E&W

First let's look at the HMLR overseas companies data. This is our data source for titles that are directly owned by overseas companies. 

As of 1 February 2023 there are 92,772 titles registered at HMLR as belonging to overseas companies. Of these, 90,725 were added on or after 1 January 1999 and therefore would be expected to be affected by ROE. These titles belong to 30,413 unique companies (counted by first-named proprietor), so that's the baseline for how many companies we expect to see on the ROE register by the registration deadline of end January 2023.

In [18]:
sql = "SELECT * FROM `%s.%s`" % (PROJECT_ID, OCOD_TABLE)
df_titles = pandas_gbq.read_gbq(sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_titles)

92772

In [19]:
df_titles.date_proprietor_added = pd.to_datetime(df_titles.date_proprietor_added)
df_relevant_titles = df_titles.query("date_proprietor_added >= '1999-01-01'")
print(len(df_relevant_titles))
print(df_relevant_titles.proprietor_name_1.nunique())

90725
30413
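One thing to keep in mind with the date filter: a comparison against a missing date is never true, so titles with no `date_proprietor_added` are dropped along with the pre-1999 ones. The toy example below (made-up rows, ISO-formatted date strings assumed) shows how to count the three groups separately.

```python
import pandas as pd

# Toy stand-in for df_titles; the real OCOD extract has many more columns.
toy_titles = pd.DataFrame({
    "title_number": ["T1", "T2", "T3"],
    "date_proprietor_added": ["1995-06-30", "2005-03-01", None],
})
toy_titles["date_proprietor_added"] = pd.to_datetime(
    toy_titles["date_proprietor_added"])

cutoff = pd.Timestamp("1999-01-01")
added = toy_titles["date_proprietor_added"]
summary = {
    "on_or_after_cutoff": int((added >= cutoff).sum()),
    "before_cutoff": int((added < cutoff).sum()),
    "missing_date": int(added.isna().sum()),  # NaT fails both comparisons
}
print(summary)
```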


## Baseline data: number of beneficially-owning entities registered in ROE

Now let's look at the Companies House data, which tells us about the companies now on the ROE register, and their beneficial owners (persons of significant control, or PSCs).

At the start of February, there were 19,790 companies with an `OE*` identifier in the Companies House Basic Company Data. That is quite a few fewer than the 30,413 above - though Companies House says there are another 5,000 or so in the pipeline.

For comparison, on past runs:

- at the start of January, there were 9,994 companies
- on Nov 21 2022, there were 2,609 companies.

In [20]:
sql = "SELECT * FROM `%s.%s` where companynumber LIKE 'OE%%'" % (PROJECT_ID, CH_BASIC_TABLE)
df_companies = pandas_gbq.read_gbq(sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_companies)

19790

And there are currently 78,185 PSCs in the CoHouse PSC data with an `OE*` company number. Of these, all but 2 can be joined to companies in the CoHouse Basic data - so the CoHouse Basic data and PSC data are basically in sync. 

For comparison, on past runs:

- at the start of January, there were 32,570 PSCs (and all but about 400 could be joined to companies)
- on Nov 21 2022, there were 9,593 PSCs registered.

In [21]:
sql = """
SELECT *
FROM `%s.%s` 
WHERE  company_number LIKE 'OE%%'
ORDER BY company_number DESC
""" % (PROJECT_ID, CH_PSCS_TABLE)
df_pscs = pandas_gbq.read_gbq(sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_pscs)

78185

In [22]:
sql = """
SELECT
  pscs.*, companies.*
-- pscs is the PSC table; companies is the Basic Company Data table.
FROM
  `%s.%s` AS pscs 
INNER JOIN
  `%s.%s` AS companies
ON
  companies.companynumber=pscs.company_number
WHERE
  companies.companynumber LIKE 'OE%%'
""" % (PROJECT_ID, CH_PSCS_TABLE, PROJECT_ID, CH_BASIC_TABLE)
df_pscs_with_companies = pandas_gbq.read_gbq(sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_pscs_with_companies)

78183
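To see which PSC rows fail to join, rather than just how many, a left merge with `indicator=True` in pandas is one option. The sketch below uses made-up rows; the `name` column is a hypothetical stand-in for whatever identifies a PSC in the real table.

```python
import pandas as pd

# Toy stand-ins for df_pscs and df_companies, using the key columns above.
toy_pscs = pd.DataFrame({
    "company_number": ["OE000001", "OE000001", "OE000002", "OE000009"],
    "name": ["PSC A", "PSC B", "PSC C", "PSC D"],
})
toy_companies = pd.DataFrame({
    "companynumber": ["OE000001", "OE000002", "OE000003"],
    "companyname": ["ALPHA LIMITED", "BETA LIMITED", "GAMMA LIMITED"],
})

merged = toy_pscs.merge(
    toy_companies,
    how="left",
    left_on="company_number",
    right_on="companynumber",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)
# PSC rows whose company number has no match in the companies table.
orphan_pscs = merged[merged["_merge"] == "left_only"]
print(orphan_pscs[["company_number", "name"]])
```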

## Link the two datasets, and look at the coverage

Now we will see how well the CoHouse and HMLR datasets can be matched up. 

First: if we attempt to match our 90,725 OCOD titles to a ROE entity on direct name match alone, we end up with 49,614 titles - so a match rate of about 55%. (The query below joins against the whole OCOD table, without the 1999 date filter, so treat that rate as approximate.)

This is over-matched though, because some unrelated companies will just happen to share a name.

(NB, at the start of Jan there were 26,069 matched titles, so things have improved quite a bit.)

In [23]:
sql = """
SELECT
  companies.*,  ocod.*
FROM
  `%s.%s` AS companies
INNER JOIN
  `%s.%s` AS ocod
ON
  ocod.proprietor_name_1=companies.companyname 
  -- TODO: Add these in future. 
  -- OR ocod.proprietor_name_2=companies.companyname
  -- OR ocod.proprietor_name_3=companies.companyname
  -- OR ocod.proprietor_name_4=companies.companyname
WHERE
  companies.companynumber LIKE 'OE%%'
""" % (PROJECT_ID, CH_BASIC_TABLE, PROJECT_ID, OCOD_TABLE)
df_titles_joined = pandas_gbq.read_gbq(sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_titles_joined)

49614
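The TODO in the query above is about matching on the other proprietor columns too. One way to sketch that in pandas, rather than SQL, is to reshape the proprietor columns into one long column and join on that. The rows below are made up, and only two of the four proprietor columns are shown.

```python
import pandas as pd

toy_ocod = pd.DataFrame({
    "title_number": ["T1", "T2"],
    "proprietor_name_1": ["ALPHA LIMITED", "BETA LIMITED"],
    "proprietor_name_2": [None, "ALPHA LIMITED"],
})
toy_companies = pd.DataFrame({
    "companynumber": ["OE000001"],
    "companyname": ["ALPHA LIMITED"],
})

# One row per (title, proprietor slot), dropping empty slots.
name_cols = ["proprietor_name_1", "proprietor_name_2"]
long_names = toy_ocod.melt(
    id_vars="title_number",
    value_vars=name_cols,
    var_name="proprietor_slot",
    value_name="proprietor_name",
).dropna(subset=["proprietor_name"])

# Inner join on exact name, as in the SQL above.
matched = long_names.merge(
    toy_companies, left_on="proprietor_name", right_on="companyname")
print(sorted(matched["title_number"]))
```

Here title `T2` is only found because its second proprietor matches, which is the case the single-column join misses.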

Eyeball the matched results. 

In [24]:
cols = ['companyname', 'companynumber', 'regaddresscounty', 'regaddresscountry', 
        'title_number', 'property_address', 'country_incorporated_1']
df_titles_joined.head()[cols]

| | companyname | companynumber | regaddresscounty | regaddresscountry | title_number | property_address | country_incorporated_1 |
|---|---|---|---|---|---|---|---|
| 0 | INDUSTRIALS UK TRUSTEE 1 LIMITED | OE002214 | | JERSEY | WSX302629 | land at Cecil Pashley Way, Shoreham Brighton C... | JERSEY |
| 1 | ADRIATIC LAND 10 LIMITED | OE018421 | | GUERNSEY | WSX360764 | Land on the south side of Upper Shoreham Road,... | GUERNSEY |
| 2 | MCDONALD'S GLOBAL MARKETS LLC | OE016227 | DELAWARE | UNITED STATES | WSX177545 | Restaurant premises at Tesco Store, Holmbush F... | DELAWARE, U.S.A. |
| 3 | COBSTONE PROPERTIES LIMITED | OE011061 | ONCHAN | ISLE OF MAN | WSX63652 | 134 and 136 South Street, Lancing | ISLE OF MAN |
| 4 | COBSTONE PROPERTIES LIMITED | OE011061 | ONCHAN | ISLE OF MAN | WSX71736 | Land and buildings lying to the east of Cheste... | ISLE OF MAN |


Now let's experiment with improving our crude first attempt at matching. How many of the name-matched titles don't also match on country? 13,857 of 49,614, so a bit over a quarter.

In [25]:
query = "regaddresscountry != country_incorporated_1 "+ \
        "& regaddresscounty != country_incorporated_1"
df_unmatched = df_titles_joined.query(query)
len(df_unmatched)

13857

Eyeballing the differences suggests that almost all of these rows do match on country; the country identifiers are just messy and inconsistent between the two datasets.

So there are probably some false matches in this, but not loads.

In [38]:
df_unmatched.head()[cols]

| | companyname | companynumber | regaddresscounty | regaddresscountry | title_number | property_address | country_incorporated_1 |
|---|---|---|---|---|---|---|---|
| 2 | MCDONALD'S GLOBAL MARKETS LLC | OE016227 | DELAWARE | UNITED STATES | WSX177545 | Restaurant premises at Tesco Store, Holmbush F... | DELAWARE, U.S.A. |
| 56 | BNP PARIBAS DEPOSITARY SERVICES (JERSEY) LIMITED | OE000912 | | UNITED KINGDOM | SX160994 | Angmering Medical Centre, Station Road, Angmer... | JERSEY |
| 68 | EASTLIGHT INVESTMENTS LIMITED | OE005719 | | GIBRALTAR | WSX221902 | 66 High Street, Littlehampton (BN17 5EA) | JERSEY |
| 71 | AMAZON PROPERTIES LIMITED | OE003085 | | BERMUDA | WSX332981 | 10 Beach Road, Littlehampton (BN17 5HT) | ISLE OF MAN |
| 90 | HELIX PROPERTY LIMITED | OE004506 | | | WSX396739 | Bizspace, Courtwick Lane, Wick, Littlehampton ... | JERSEY |
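Rather than eyeballing only the first few rows, it can help to count which pairs of country spellings disagree most often, since a handful of pairs usually account for most of the mismatches. A minimal sketch on made-up rows (the column names are the ones used above):

```python
import pandas as pd

# Toy stand-in for df_unmatched.
toy_unmatched = pd.DataFrame({
    "regaddresscountry": [
        "VIRGIN ISLANDS, BRITISH", "VIRGIN ISLANDS, BRITISH",
        "UNITED STATES", "GIBRALTAR",
    ],
    "country_incorporated_1": [
        "BRITISH VIRGIN ISLANDS", "BRITISH VIRGIN ISLANDS",
        "DELAWARE, U.S.A.", "JERSEY",
    ],
})

# Count each (Companies House country, OCOD country) pair, most common first.
pair_counts = (
    toy_unmatched
    .groupby(["regaddresscountry", "country_incorporated_1"], dropna=False)
    .size()
    .sort_values(ascending=False)
)
print(pair_counts.head(10))
```

Pairs that are clearly the same place under two spellings are candidates for a fix; pairs like Gibraltar and Jersey are more likely to be genuine false matches.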


Fix the most obvious problems and re-run. 

In [29]:
# Harmonise the most common country spellings (literal, non-regex replacements).
df_titles_joined.regaddresscountry = (
    df_titles_joined.regaddresscountry
    .str.replace("VIRGIN ISLANDS, BRITISH", "BRITISH VIRGIN ISLANDS", regex=False)
    .str.replace("SAINT ", "ST ", regex=False)
)
df_unmatched = df_titles_joined.query(query)
len(df_unmatched)

3182

We could probably reduce this number quite a bit just by dealing with the most common companies, which are obvious matches but have stated different countries in the two datasets.

In [33]:
df_unmatched.companyname.value_counts()

BNP PARIBAS DEPOSITARY SERVICES (JERSEY) LIMITED    452
CASTLE NEW TOWER HOLDINGS LIMITED                   279
REDWOOD (LIGHT INDUSTRIAL) PROPCO S.A.R.L.          109
HELIX PROPERTY LIMITED                              108
ALDGATE PROPERTY LIMITED                             61
                                                   ... 
JAMAR CREATIVE LIMITED                                1
MAY LIMITED                                           1
WHENUAPAI LIMITED                                     1
GREEN CRESCENT ENTERPRISES LIMITED                    1
TEATIME REALTY LLC                                    1
Name: companyname, Length: 798, dtype: int64

TODO: Map jurisdictions in both datasets to ISO codes and match on those.
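As a starting point for that TODO, here is a minimal sketch of a hand-written lookup from the spellings seen in the outputs above to ISO 3166-1 alpha-2 codes. The lookup is deliberately incomplete, and `to_iso` and `same_jurisdiction` are hypothetical helpers; a real version would need a much fuller mapping and manual review of anything that fails to map.

```python
# Hand-written, incomplete lookup: keys are spellings seen in the outputs
# above; values are ISO 3166-1 alpha-2 codes.
JURISDICTION_TO_ISO = {
    "JERSEY": "JE",
    "GUERNSEY": "GG",
    "ISLE OF MAN": "IM",
    "GIBRALTAR": "GI",
    "BERMUDA": "BM",
    "UNITED KINGDOM": "GB",
    "UNITED STATES": "US",
    "DELAWARE, U.S.A.": "US",
    "BRITISH VIRGIN ISLANDS": "VG",
    "VIRGIN ISLANDS, BRITISH": "VG",
}


def to_iso(jurisdiction):
    """Return an ISO code, or None when the spelling is not in the lookup."""
    if not isinstance(jurisdiction, str):
        return None
    return JURISDICTION_TO_ISO.get(jurisdiction.strip().upper())


def same_jurisdiction(left, right):
    """True only when both sides map to the same known ISO code."""
    left_code, right_code = to_iso(left), to_iso(right)
    return left_code is not None and left_code == right_code


print(same_jurisdiction("UNITED STATES", "DELAWARE, U.S.A."))
print(same_jurisdiction("GIBRALTAR", "JERSEY"))
```

Treating unmapped values as non-matching keeps the check conservative: a missing country never counts as agreement.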

## How many of our ROE entries have we been able to find titles for?

In these matched titles, we have 13,311 unique company names - so about two-thirds of all the ROE companies in the basic data. 

This means we have failed to find related titles for at least a third of the companies, but many of those companies will hold titles in Scotland or Northern Ireland, which aren't in the OCOD dataset. (Thanks to TIUK for pointing that out!)

In [None]:
df_titles_joined.companynumber.nunique()
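To list the ROE companies with no matched title, rather than only counting the matched ones, a set difference on company numbers is enough. A sketch on made-up rows:

```python
import pandas as pd

# Toy stand-ins for df_companies and df_titles_joined.
toy_companies = pd.DataFrame({
    "companynumber": ["OE000001", "OE000002", "OE000003"],
})
toy_titles_joined = pd.DataFrame({
    "companynumber": ["OE000001", "OE000001", "OE000002"],
    "title_number": ["T1", "T2", "T3"],
})

matched_numbers = set(toy_titles_joined["companynumber"])
all_numbers = set(toy_companies["companynumber"])

# Companies on the register with no name-matched OCOD title.
unmatched_numbers = sorted(all_numbers - matched_numbers)
coverage = len(matched_numbers & all_numbers) / len(all_numbers)
print(unmatched_numbers, round(coverage, 2))
```

The unmatched list is the natural place to look for entities holding Scottish or Northern Irish titles, or whose names differ slightly between the two datasets.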

## Join ROE companies with PSC information and related OCOD titles

Get a list of titles per PSC with any related company and OCOD information that we can match, and save to a file.

Note that the country names haven't been normalised, so matches should be checked manually.

In [30]:
sql = """
SELECT
  companies.*, ocod.*, pscs.*
-- First get all the OE companies with their PSCs.
FROM
  `%s.%s` AS companies
INNER JOIN
  `%s.%s` AS pscs
ON
  companies.companynumber=pscs.company_number
-- Then get any matching titles. 
LEFT JOIN
  `%s.%s` AS ocod
ON
  ocod.proprietor_name_1=companies.companyname
WHERE
  companies.companynumber LIKE 'OE%%'
""" % (PROJECT_ID, CH_BASIC_TABLE, PROJECT_ID, CH_PSCS_TABLE, PROJECT_ID, OCOD_TABLE)

In [31]:
df_pscs_with_companies = pandas_gbq.read_gbq(
    sql, project_id=PROJECT_ID, progress_bar_type=None)
len(df_pscs_with_companies)

230895
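The row count is much larger than the number of PSCs because the join produces one row per PSC per matched title: a company with two PSCs and three titles contributes six rows, and a PSC whose company has no matched title still contributes one row with empty title columns. The toy example below reproduces that shape in pandas; the rows and the `psc_name` column are made up.

```python
import pandas as pd

toy_companies = pd.DataFrame({
    "companynumber": ["OE000001", "OE000002"],
    "companyname": ["ALPHA LIMITED", "BETA LIMITED"],
})
toy_pscs = pd.DataFrame({
    "company_number": ["OE000001", "OE000001", "OE000002"],
    "psc_name": ["PSC A", "PSC B", "PSC C"],
})
toy_ocod = pd.DataFrame({
    "title_number": ["T1", "T2", "T3"],
    "proprietor_name_1": ["ALPHA LIMITED"] * 3,
})

# Same shape as the SQL: companies INNER JOIN pscs, then LEFT JOIN titles.
joined = (
    toy_companies
    .merge(toy_pscs, left_on="companynumber", right_on="company_number")
    .merge(toy_ocod, how="left",
           left_on="companyname", right_on="proprietor_name_1")
)
rows_without_title = int(joined["title_number"].isna().sum())
print(len(joined), rows_without_title)
```

When counting PSCs or titles from the saved file, de-duplicate on the relevant identifier first rather than counting rows.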

In [32]:
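# Create the output directory if needed, then write the joined rows to CSV.
os.makedirs("./data", exist_ok=True)
df_pscs_with_companies.to_csv(
    "./data/roe_pscs_with_companies_and_any_matched_titles.csv", index=False)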