# Merging RepRisk and Markit

To compute the ratios we need, and to compare the two datasets, we first have to merge them.

A first idea was to use the ISIN columns available in both dataframes (`primary_isin` in RepRisk, `isin` in Markit), but this column turns out to be poorly populated in the RepRisk dataset. We may therefore need to match companies by name, which requires cleaning the names in both datasets.

Let's first explore the idea of merging on ISIN.

In [1]:
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np
import pandas as pd
import ray
from cleanco import basename
from name_matching.name_matcher import NameMatcher
from pandasql import sqldf

import config

DATA_DIR = Path(config.DATA_DIR)
file_path = Path(DATA_DIR) / "pulled"

In [None]:
RepRisk_company = pd.read_parquet(file_path / "reprisk_company.parquet")
RepRisk_company.head()

In [4]:
(RepRisk_company[['primary_isin']].isna().sum() / len(RepRisk_company)).to_frame(
    "percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
primary_isin,85.03%


Out of more than 600k referenced companies in RepRisk, 85% have a missing ISIN.

Let's have a look at Markit now.

In [77]:
Markit = pd.read_parquet(file_path / "markit.parquet")
Markit_company = Markit[["isin", "instrumentname"]].drop_duplicates()
Markit_company.head()

Unnamed: 0,isin,instrumentname
0,US98956P1021,Zimmer Biomet Holdings Inc
1,US7901481009,St Joe Co
2,US6934751057,Pnc Financial Services Group Inc
3,US6516391066,Newmont Corporation
4,US5951121038,Micron Technology Inc


In [78]:
(Markit_company[['isin']].isna().sum() / len(Markit_company)).to_frame("percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
isin,0.00%


For Markit, the share of companies with a missing ISIN rounds to 0.00%.

We now check how many of the ISINs available in Markit can be found in RepRisk.

In [79]:
isin_intersection = Markit_company['isin'].dropna().isin(RepRisk_company['primary_isin'].dropna())

In [80]:
(isin_intersection.sum() / len(Markit_company['isin'].dropna())).__format__("0.2%")

'61.08%'

Only 61% of the ISINs in Markit are available in RepRisk, so merging on ISIN alone will not be enough. We will also have to merge on company name.

In [81]:
RepRisk_id_on_isin = Markit_company.merge(RepRisk_company[['reprisk_id', 'primary_isin']].dropna(), left_on="isin",
                                          right_on="primary_isin", how="left")
RepRisk_id_on_isin.head()

Unnamed: 0,isin,instrumentname,reprisk_id,primary_isin
0,US98956P1021,Zimmer Biomet Holdings Inc,182884,US98956P1021
1,US7901481009,St Joe Co,7502,US7901481009
2,US6934751057,Pnc Financial Services Group Inc,1849,US6934751057
3,US6516391066,Newmont Corporation,93,US6516391066
4,US5951121038,Micron Technology Inc,8371,US5951121038


In [82]:
(RepRisk_id_on_isin[['reprisk_id']].isna().sum() / len(RepRisk_id_on_isin)).to_frame(
    "percentage_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_missing_reprisk_id
reprisk_id,38.92%


Matching on ISIN only, we cannot match 39% of the companies in Markit with their `reprisk_id` in RepRisk. We now try to match the companies left unmatched by ISIN on their name.
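The unmatched share can also be read directly from the merge itself. The sketch below uses small made-up dataframes (`markit_toy` and `reprisk_toy` are hypothetical stand-ins, not the real data) and relies on the `indicator` and `validate` parameters of `DataFrame.merge`: `indicator=True` labels each row as matched or not, and `validate="many_to_one"` raises an error if an ISIN appears more than once on the RepRisk side, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy stand-ins for Markit_company and RepRisk_company (all values are made up).
markit_toy = pd.DataFrame({
    "isin": ["AAA", "BBB", "CCC", "DDD"],
    "instrumentname": ["Alpha Inc", "Beta Corp", "Gamma Ltd", "Delta Co"],
})
reprisk_toy = pd.DataFrame({
    "reprisk_id": [1, 2, 3],
    "primary_isin": ["AAA", "CCC", None],
})

merged_toy = markit_toy.merge(
    reprisk_toy.dropna(subset=["primary_isin"]),
    left_on="isin",
    right_on="primary_isin",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises if an ISIN is duplicated on the right side
)

# Share of Markit rows with no RepRisk counterpart.
unmatched_share = (merged_toy["_merge"] == "left_only").mean()
print(f"{unmatched_share:.2%}")
```

On the real data, a failing `validate` check would be a signal to deduplicate `primary_isin` before computing any match rate.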

In [83]:
companies_missing_isin = RepRisk_id_on_isin[RepRisk_id_on_isin['reprisk_id'].isna()][['isin', 'instrumentname']]
companies_missing_isin.head()

Unnamed: 0,isin,instrumentname
40,US8724436014,Thq Inc
99,US3696043013,General Electric Co
129,CA8849037095,Thomson Reuters Corp
145,US55273C1071,Mfs Intermediate Income Trust
175,US03748R7474,Apartment Investment And Management Co


In [84]:
RepRisk_id_on_company_name = companies_missing_isin.merge(RepRisk_company, left_on="instrumentname",
                                                          right_on="company_name", how="left")
RepRisk_id_on_company_name.head()

Unnamed: 0,isin,instrumentname,reprisk_id,company_name,primary_isin,isins
0,US8724436014,Thq Inc,,,,
1,US3696043013,General Electric Co,,,,
2,CA8849037095,Thomson Reuters Corp,,,,
3,US55273C1071,Mfs Intermediate Income Trust,,,,
4,US03748R7474,Apartment Investment And Management Co,,,,


In [85]:
(RepRisk_id_on_company_name[['reprisk_id']].isna().sum() / len(RepRisk_id_on_company_name)).to_frame(
    "percentage_still_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_still_missing_reprisk_id
reprisk_id,88.45%


Merging on the raw company name only matches about 12% of the companies left unmatched by ISIN (88.45% are still missing a `reprisk_id`). We will have to clean the company names in both datasets to be able to merge them.

In [86]:
companies_still_missing_reprisk_id = RepRisk_id_on_company_name[RepRisk_id_on_company_name['reprisk_id'].isna()][
    ['isin', 'instrumentname']]
companies_still_missing_reprisk_id.head()

Unnamed: 0,isin,instrumentname
0,US8724436014,Thq Inc
1,US3696043013,General Electric Co
2,CA8849037095,Thomson Reuters Corp
3,US55273C1071,Mfs Intermediate Income Trust
4,US03748R7474,Apartment Investment And Management Co


Let's summarize what we did in a single SQL query.

In [87]:
RepRisk_company_isin = RepRisk_company[['reprisk_id', 'primary_isin']].dropna()
RepRisk_company_name = RepRisk_company[['reprisk_id', 'company_name']].dropna()

match_reprisk_for_company = sqldf(""
                                  "SELECT mkc.isin, mkc.instrumentname, rrn.reprisk_id AS reprisk_id_name, rri.reprisk_id AS reprisk_id_isin "
                                  "FROM Markit_company AS mkc "
                                  "LEFT JOIN RepRisk_company_name AS rrn "
                                  "ON mkc.instrumentname = rrn.company_name "
                                  "LEFT JOIN RepRisk_company_isin AS rri "
                                  "ON mkc.isin = rri.primary_isin "
                                  )

In [88]:
match_reprisk_for_company['reprisk_id_merge'] = match_reprisk_for_company['reprisk_id_isin'].fillna(
    match_reprisk_for_company['reprisk_id_name'])

In [89]:
(1 - match_reprisk_for_company['reprisk_id_merge'].isna().sum() / len(match_reprisk_for_company)).__format__("0.2%")

'65.61%'

Combining both keys, we are able to match about 66% of the companies in Markit with their `reprisk_id` in RepRisk.
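For readers who prefer to stay in pandas, the same two-key logic can be sketched without `pandasql`: two left merges followed by `fillna`, which plays the role of SQL's `COALESCE` by preferring the ISIN match and falling back to the name match. The dataframes below are made-up toy data with hypothetical names, not the real tables.

```python
import pandas as pd

# Toy stand-ins (all values are made up).
markit_toy = pd.DataFrame({
    "isin": ["AAA", "BBB", "CCC"],
    "instrumentname": ["Alpha Inc", "Beta Corp", "Gamma Ltd"],
})
reprisk_isin_toy = pd.DataFrame({"reprisk_id": [1], "primary_isin": ["AAA"]})
reprisk_name_toy = pd.DataFrame({"reprisk_id": [2], "company_name": ["Beta Corp"]})

matched_toy = (
    markit_toy
    # First key: ISIN.
    .merge(reprisk_isin_toy.rename(columns={"reprisk_id": "reprisk_id_isin"}),
           left_on="isin", right_on="primary_isin", how="left")
    # Second key: exact company name.
    .merge(reprisk_name_toy.rename(columns={"reprisk_id": "reprisk_id_name"}),
           left_on="instrumentname", right_on="company_name", how="left")
)

# Prefer the ISIN match, fall back to the name match.
matched_toy["reprisk_id_merge"] = matched_toy["reprisk_id_isin"].fillna(
    matched_toy["reprisk_id_name"])

match_rate = matched_toy["reprisk_id_merge"].notna().mean()
print(f"{match_rate:.2%}")
```

One caveat that applies to both the SQL and the pandas version: if a name or ISIN appears more than once on the RepRisk side, the left joins return extra rows, so the denominator of the match rate is no longer the number of Markit companies.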

In [90]:
def clean_company_name(name):
    """
    Clean a company name by applying the following transformations:
    - Return None for missing or non-string inputs.
    - Strip legal entity suffixes with cleanco's `basename`.
    - Convert to lowercase.
    - Remove punctuation and special characters.
    - Collapse repeated whitespace and trim.

    The manual abbreviation mapping below is commented out and not applied.
    """
    if pd.isnull(name) or not isinstance(name, str):
        return None

    name = basename(name)
    # Convert to lowercase
    name = name.lower()
    # Remove punctuation and special characters (keep alphanumeric and spaces)
    name = re.sub(r'[^\w\s]', '', name)
    # Replace common corporate abbreviations and legal entity identifiers
    # abbreviations = {
    #     ' corporation': ' corp',
    #     ' incorporated': ' inc',
    #     ' company': ' co',
    #     ' limited': ' ltd',
    #     ' plc': '',
    #     ' llc': '',
    #     ' l p': ' lp',
    #     ' lp': ' lp'
    # }
    # for key, value in abbreviations.items():
    #     name = name.replace(key, value)
    # Trim whitespace
    name = re.sub(r'\s+', ' ', name).strip()
    return name
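To illustrate why cleaning helps, here is a simplified, dependency-free variant of the same idea. It is only a sketch: `simple_clean` and the short `LEGAL_SUFFIXES` set are hypothetical and stand in for cleanco's `basename`, which covers far more legal forms and may behave differently on edge cases. The second spelling in the example is invented to show how two variants of a name collapse to the same join key.

```python
import re

# Hypothetical, deliberately short suffix list; cleanco's `basename` covers far more.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "lp", "plc", "sa", "ag"}

def simple_clean(name):
    """Lowercase, drop punctuation, strip trailing legal suffixes, collapse spaces."""
    if not isinstance(name, str):
        return None
    # Remove punctuation, then split on any whitespace (this also collapses runs).
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    # Drop every trailing token that is a known legal suffix.
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

key_a = simple_clean("Zimmer Biomet Holdings Inc")
key_b = simple_clean("ZIMMER BIOMET HOLDINGS, INC.")  # invented variant spelling
print(key_a, "|", key_b)
```

Both spellings reduce to the same string, so an exact join on the cleaned column can succeed where a join on the raw names fails.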

Again, we can update our previous SQL query to match the companies on their cleaned names.

In [91]:
RepRisk_company_isin = RepRisk_company[['reprisk_id', 'primary_isin']].dropna()
RepRisk_company_name = RepRisk_company[['reprisk_id', 'company_name']].dropna()
Markit_company_clean = Markit_company.copy()

Markit_company_clean['cleaned_name'] = Markit_company_clean['instrumentname'].apply(clean_company_name)
RepRisk_company_name['clean_company_name'] = RepRisk_company_name['company_name'].apply(clean_company_name)

match_reprisk_for_company_clean = sqldf(""
                                        "SELECT mkc.isin, mkc.instrumentname, rrn.reprisk_id AS reprisk_id_name, rri.reprisk_id AS reprisk_id_isin, mkc.cleaned_name "
                                        "FROM Markit_company_clean AS mkc "
                                        "LEFT JOIN RepRisk_company_name AS rrn "
                                        "ON mkc.cleaned_name = rrn.clean_company_name "
                                        "LEFT JOIN RepRisk_company_isin AS rri "
                                        "ON mkc.isin = rri.primary_isin "
                                        )

In [92]:
match_reprisk_for_company_clean['reprisk_id_merge'] = match_reprisk_for_company_clean['reprisk_id_isin'].fillna(
    match_reprisk_for_company_clean['reprisk_id_name'])

In [93]:
(1 - match_reprisk_for_company_clean['reprisk_id_merge'].isna().sum() / len(
    match_reprisk_for_company_clean)).__format__("0.2%")

'70.03%'

In [94]:
still_missing2 = match_reprisk_for_company_clean[match_reprisk_for_company_clean['reprisk_id_merge'].isna()][
    ['isin', 'instrumentname', 'cleaned_name']]

In [95]:
RepRisk_company_name_unique_index = RepRisk_company_name.reset_index(drop=True, inplace=False)

In [96]:
ray.shutdown()
ray.init()

matcher = NameMatcher(ngrams=(2, 5),
                      top_n=10,
                      number_of_rows=500,
                      number_of_matches=3,
                      lowercase=True,
                      punctuations=True,
                      remove_ascii=True,
                      legal_suffixes=False,
                      common_words=False,
                      preprocess_split=False,
                      verbose=False)

matcher.set_distance_metrics(['iterative_sub_string', 'pearson_ii', 'bag', 'fuzzy_wuzzy_partial_string', 'editex'])

matcher.load_and_process_master_data(column='clean_company_name',
                                     df_matching_data=RepRisk_company_name_unique_index,
                                     transform=True)

@ray.remote
def match_name_parallel(adjusted_names, matcher):
    results = matcher.match_names(to_be_matched=adjusted_names, column_matching='cleaned_name')
    return results

results = []
for i in range(0, len(still_missing2), 100):
    results.append(match_name_parallel.remote(still_missing2[i:i + 100], matcher))

matches = pd.concat(ray.get(results))

2024-03-01 13:59:47,957	INFO worker.py:1724 -- Started a local Ray instance.
(raylet) Spilled 2597 MiB, 4 objects, write throughput 576 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4545 MiB, 7 objects, write throughput 611 MiB/s.
(raylet) Spilled 8441 MiB, 13 objects, write throughput 734 MiB/s.
(raylet) Spilled 16882 MiB, 26 objects, write throughput 827 MiB/s.


In [99]:
matches_merge = matches.merge(RepRisk_company_name_unique_index, left_on="match_index_0", right_index=True, how="left")
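The fuzzy step returns, for each name that is still unmatched, a few candidate names with a similarity score. As a rough intuition for what such a ranking does, the sketch below scores candidates with the standard library's `difflib.SequenceMatcher`. This is not the algorithm used by `NameMatcher`, which combines several distance metrics, and the scores are not comparable; `rank_candidates` is a hypothetical helper. The candidate strings are the three `match_name_*` values shown for "trustco bank corp n y" in the results table.

```python
from difflib import SequenceMatcher

def rank_candidates(name, candidates, top_n=3):
    """Return the top_n candidates by similarity, scored on a 0-100 scale."""
    scored = [(SequenceMatcher(None, name, c).ratio() * 100, c) for c in candidates]
    # Highest score first; ties are broken by the candidate string.
    return sorted(scored, reverse=True)[:top_n]

candidates = ["trustco bank corp ny", "trust", "nk co"]
ranking = rank_candidates("trustco bank corp n y", candidates)
best_score, best_name = ranking[0]
print(best_name, round(best_score, 1))
```

The near-identical spelling ranks first by a wide margin, while the short fragments score much lower, which is the pattern the score columns are meant to capture.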

In [136]:
matches_merge[(matches_merge['score_0'] > 93) | ((matches_merge['score_0'] > 85) & (matches_merge['match_name_0'].str.len() > 15))]

Unnamed: 0,original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,reprisk_id,company_name,clean_company_name
342,blackrock muniyield california fund,blackrock muniyield arizona fund,86.284391,522499,blackrock muniyield quality fund,83.883048,485193,blackrock muniyield fund,81.291700,346164,39472,BlackRock MuniYield Arizona Fund Inc,blackrock muniyield arizona fund
583,millicom international cellular,millicom international cellular sa millicom,86.238425,619245,rural cellular,58.565879,34171,cellular,57.898700,13250,6542,Millicom International Cellular SA (Millicom),millicom international cellular sa millicom
673,trustco bank corp n y,trustco bank corp ny,95.829297,51694,trust,56.965457,672646,nk co,52.464998,338008,11958,Trustco Bank Corp NY,trustco bank corp ny
689,ebt international,bt international,96.411592,319404,mbt international,94.588956,29893,sbt international,94.588956,509215,22628,BT International AG,bt international
741,ferrellgas partners unt,ferrellgas partners,90.957862,48311,ferrellgas partners finance,85.255312,263564,ferrellgas,69.727922,371862,11825,Ferrellgas Partners LP,ferrellgas partners
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23328,jpmorgan chase financial company,jpmorgan chase financial co,91.973072,597976,morgan,49.849051,215658,morgan,49.849051,57642,597932,JPMorgan Chase Financial Co LLC,jpmorgan chase financial co
23343,venzee technologies,cee technologies,87.556382,362988,ne technologies,86.654805,394921,le technologies,84.549541,5549,2427065,CEE Technologies Pte Ltd,cee technologies
23362,sun pacific holding,pacific holdings,85.734862,647501,sun pacific,78.376674,485451,sun pacific energy,73.935600,252704,84966,Pacific Holdings Ltd,pacific holdings
23376,peak discovery capital,discovery capital,88.162525,659353,discovery,65.655942,317331,discover,62.530403,237345,93560,Discovery Capital Corp,discovery capital
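The filter used above can be restated as a small function, which makes the acceptance rule easier to read and to tune. In the sketch below, `accept_match` is a hypothetical helper, the thresholds are the ones from the filter, and the first three rows are copied from the table. The last row is invented to show a rejected case.

```python
def accept_match(score, match_name, high=93, low=85, min_len=15):
    """Mirror the notebook's filter: very high score, or high score on a longer name."""
    return score > high or (score > low and len(match_name) > min_len)

# (match_name_0, score_0) pairs copied from the table above.
rows = [
    ("trustco bank corp ny", 95.829297),
    ("ferrellgas partners", 90.957862),
    ("bt international", 96.411592),
]
# Hypothetical pair, not from the data: a short name with a middling score.
rows.append(("acme", 90.0))

decisions = {name: accept_match(score, name) for name, score in rows}
print(decisions)
```

Note that "bt international" passes the rule even though the original name was "ebt international", which may well be a different company. A high score on a short name is not proof of identity, so accepted matches of this kind probably deserve a manual review.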
