# Merging RepRisk and Markit

To compute the ratios we need, and to compare the two datasets, we first have to merge them.

A first idea was to use the ISIN columns available in both dataframes, but the column turns out to be sparsely populated in the RepRisk dataset. We may therefore need to match companies on their names, which would require cleaning the names in both datasets.

Let's first explore the idea of merging on ISIN.

In [1]:
from pathlib import Path

import pandas as pd

import config

DATA_DIR = Path(config.DATA_DIR)
file_path = DATA_DIR / "pulled"

Let's have a look at RepRisk first.

In [2]:
RepRisk_company = pd.read_parquet(file_path / "reprisk_company.parquet")
RepRisk_company.head()

Unnamed: 0,reprisk_id,company_name,primary_isin,isins
0,10,Acer Inc,TW0002353000,US0044341065 | US0044342055 | TW0002353000
1,100,Rio Tinto PLC,GB0007188757,GB0007406639 | BRRIOTBDR007 | ARDEUT112638 | G...
2,1000,Terrane Metals Corp,CA88103A1084,CA88103A1167 | US88103A3068 | CA88103A1084 | C...
3,10000,RAK Properties PJSC,AER000601016,AER000601016
4,100000,BLUECOM Co Ltd,KR7033560004,KR7033560004


In [3]:
(RepRisk_company[['primary_isin']].isna().sum() / len(RepRisk_company)).to_frame("percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
primary_isin,85.03%


Out of more than 600k referenced companies, 85% have no primary ISIN in RepRisk.

Let's have a look at Markit now.

In [4]:
Markit = pd.read_parquet(file_path / "markit.parquet")
Markit_company = Markit[["isin", "instrumentname"]].drop_duplicates()
Markit_company.head()

Unnamed: 0,isin,instrumentname
0,DE0005552004,Deutsche Post Ag
1,US98956P1021,Zimmer Holdings Inc
2,US86764P1093,Sunoco Inc
3,US7901481009,St. Joe Co
4,US8265521018,Sigma-aldrich Corp


In [5]:
(Markit_company[['isin']].isna().sum() / len(Markit_company)).to_frame("percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
isin,5.33%


For Markit, only 5% of the companies have a missing ISIN.

Next, we check how many of the ISINs available in Markit also appear as a primary ISIN in RepRisk.

In [6]:
isin_intersection = Markit_company['isin'].dropna().isin(RepRisk_company['primary_isin'].dropna())

In [7]:
format(isin_intersection.mean(), ".2%")

'47.27%'

Only 47% of the ISINs in Markit appear as a primary ISIN in RepRisk, so merging on ISIN alone will not be enough. We will also have to match on company name.
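The check above only compares against `primary_isin`. The `isins` column shown earlier holds several pipe-separated identifiers per company, so one option worth testing is to match against all of them. The following is a minimal sketch on toy data, assuming `|` is the only separator used; the toy frames and the name `isin_lookup` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for RepRisk_company and Markit_company (illustrative values only)
reprisk = pd.DataFrame({
    "reprisk_id": [10, 100],
    "primary_isin": ["TW0002353000", "GB0007188757"],
    "isins": ["US0044341065 | TW0002353000", "GB0007406639 | GB0007188757"],
})
markit = pd.DataFrame({"isin": ["US0044341065", "GB0007188757", "US7901481009"]})

# One row per (reprisk_id, isin) pair.
# Assumption: identifiers are separated by "|" with optional surrounding spaces.
isin_lookup = (
    reprisk.assign(isin=reprisk["isins"].str.split("|"))
    .explode("isin")
    .assign(isin=lambda d: d["isin"].str.strip())[["reprisk_id", "isin"]]
    .dropna()
    .drop_duplicates()
)

# Left join keeps every Markit row; unmatched rows get a missing reprisk_id
matched = markit.merge(isin_lookup, on="isin", how="left")
share_matched = matched["reprisk_id"].notna().mean()
print(f"{share_matched:.2%}")
```

If one ISIN maps to several `reprisk_id` values, the left join adds rows, so it would be worth checking for duplicates before relying on the result.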

In [8]:
RepRisk_id_on_isin = Markit_company.merge(RepRisk_company[['reprisk_id', 'primary_isin']].dropna(), left_on="isin", right_on="primary_isin", how="left")
RepRisk_id_on_isin.head()

Unnamed: 0,isin,instrumentname,reprisk_id,primary_isin
0,DE0005552004,Deutsche Post Ag,3794,DE0005552004
1,US98956P1021,Zimmer Holdings Inc,182884,US98956P1021
2,US86764P1093,Sunoco Inc,978,US86764P1093
3,US7901481009,St. Joe Co,7502,US7901481009
4,US8265521018,Sigma-aldrich Corp,7620,US8265521018


In [9]:
(RepRisk_id_on_isin[['reprisk_id']].isna().sum() / len(RepRisk_id_on_isin)).to_frame("percentage_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_missing_reprisk_id
reprisk_id,55.24%


Matching on ISIN only leaves 55% of the companies in Markit without a reprisk_id. The next step is to try matching these remaining companies, whether their ISIN is missing or simply absent from RepRisk, on their name.
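One way to make the match rate explicit is the `indicator` option of `DataFrame.merge`, which labels every row as matched (`both`) or unmatched (`left_only`). This is a sketch on toy frames; the company "No Isin Co" and the id 999 are invented for illustration.

```python
import pandas as pd

# Toy frames; "No Isin Co" and reprisk_id 999 are made up for the example
markit = pd.DataFrame({
    "isin": ["DE0005552004", "US3371621018", None],
    "instrumentname": ["Deutsche Post Ag", "First Horizon National Corp", "No Isin Co"],
})
reprisk = pd.DataFrame({
    "reprisk_id": [3794, 999],
    "primary_isin": ["DE0005552004", None],
})

# Drop RepRisk rows without an ISIN so missing keys never match each other
merged = markit.merge(
    reprisk.dropna(subset=["primary_isin"]),
    left_on="isin",
    right_on="primary_isin",
    how="left",
    indicator=True,
)

# Share of Markit rows per merge outcome
match_share = merged["_merge"].value_counts(normalize=True)
print(match_share)
```

Rows labelled `left_only` are the ones that still need a name-based match.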

In [10]:
companies_missing_isin = RepRisk_id_on_isin[RepRisk_id_on_isin['reprisk_id'].isna()][['isin', 'instrumentname']]
companies_missing_isin.head()

Unnamed: 0,isin,instrumentname
14,US3371621018,First Horizon National Corp
15,US3199631041,First Data Corp
16,US2473611083,Delta Air Lines Inc
19,US0442041051,Ashland Inc
20,US1251291068,CDW COMPUTER CENTERS INC


In [11]:
RepRisk_id_on_company_name = companies_missing_isin.merge(RepRisk_company, left_on="instrumentname", right_on="company_name", how="left")
RepRisk_id_on_company_name.head()

Unnamed: 0,isin,instrumentname,reprisk_id,company_name,primary_isin,isins
0,US3371621018,First Horizon National Corp,,,,
1,US3199631041,First Data Corp,1524305.0,First Data Corp,US32008D1063,US32008D1063
2,US2473611083,Delta Air Lines Inc,,,,
3,US0442041051,Ashland Inc,,,,
4,US1251291068,CDW COMPUTER CENTERS INC,,,,


In [12]:
(RepRisk_id_on_company_name[['reprisk_id']].isna().sum() / len(RepRisk_id_on_company_name)).to_frame("percentage_still_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_still_missing_reprisk_id
reprisk_id,85.13%


Merging on the raw company name matches only 15% of the companies left unmatched by ISIN. To do better, we will have to clean the company names in both datasets before merging.

In [13]:
companies_still_missing_reprisk_id = RepRisk_id_on_company_name[RepRisk_id_on_company_name['reprisk_id'].isna()][['isin', 'instrumentname']]
companies_still_missing_reprisk_id.head()

Unnamed: 0,isin,instrumentname
0,US3371621018,First Horizon National Corp
2,US2473611083,Delta Air Lines Inc
3,US0442041051,Ashland Inc
4,US1251291068,CDW COMPUTER CENTERS INC
5,CA4532584022,Inco Ltd


Let's summarize what we did in a single SQL query.

In [14]:
from pandasql import sqldf

In [15]:
RepRisk_company_isin = RepRisk_company[['reprisk_id', 'primary_isin']].dropna()

In [16]:
RepRisk_company_name = RepRisk_company[['reprisk_id', 'company_name']].dropna()

In [17]:
match_reprisk_for_company = sqldf(
    """
    SELECT mkc.isin,
           mkc.instrumentname,
           rrn.reprisk_id AS reprisk_id_name,
           rri.reprisk_id AS reprisk_id_isin
    FROM Markit_company AS mkc
    LEFT JOIN RepRisk_company_name AS rrn
        ON mkc.instrumentname = rrn.company_name
    LEFT JOIN RepRisk_company_isin AS rri
        ON mkc.isin = rri.primary_isin
    """
)

In [18]:
# Coalesce the two reprisk_id columns, preferring the ISIN match over the name match
match_reprisk_for_company['reprisk_id_merge'] = match_reprisk_for_company['reprisk_id_isin'].fillna(match_reprisk_for_company['reprisk_id_name'])

In [19]:
format(match_reprisk_for_company['reprisk_id_merge'].notna().mean(), ".2%")

'52.99%'

Combining both keys, we are able to match 53% of the companies in Markit with their reprisk_id in RepRisk.
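The same two-step lookup can also be written in pandas alone. The sketch below uses toy frames and de-duplicates the RepRisk name table first, because several RepRisk entities sharing one name would otherwise multiply Markit rows in a left join. Whether such duplicates exist in the real data is not checked here, and the id 1524306 is invented to illustrate the case.

```python
import pandas as pd

# Toy frames; reprisk_id 1524306 is a made-up duplicate name entry
markit = pd.DataFrame({
    "isin": ["US86764P1093", "US3199631041", "US2473611083"],
    "instrumentname": ["Sunoco Inc", "First Data Corp", "Delta Air Lines Inc"],
})
reprisk = pd.DataFrame({
    "reprisk_id": [978, 1524305, 1524306],
    "company_name": ["Sunoco Inc", "First Data Corp", "First Data Corp"],
    "primary_isin": ["US86764P1093", "US32008D1063", None],
})

by_isin = reprisk[["reprisk_id", "primary_isin"]].dropna()
# Keep one reprisk_id per name so the join cannot add rows (assumption: first one wins)
by_name = (
    reprisk[["reprisk_id", "company_name"]]
    .dropna()
    .drop_duplicates("company_name")
)

out = (
    markit
    .merge(by_isin, left_on="isin", right_on="primary_isin", how="left")
    .merge(by_name, left_on="instrumentname", right_on="company_name",
           how="left", suffixes=("_isin", "_name"))
)

# Prefer the ISIN match and fall back to the name match
out["reprisk_id_merge"] = out["reprisk_id_isin"].fillna(out["reprisk_id_name"])
print(out[["instrumentname", "reprisk_id_merge"]])
```

Comparing `len(out)` with `len(markit)` after the joins is a cheap way to detect row duplication.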

In [20]:
# .copy() so that adding a column later does not act on a view of the original frame
still_missing = match_reprisk_for_company[match_reprisk_for_company['reprisk_id_merge'].isna()][['isin', 'instrumentname']].copy()

In [21]:
import re

def clean_company_name(name):
    """
    Clean the company name by applying the following transformations:
    - Return None for missing or non-string inputs.
    - Convert to lowercase.
    - Remove punctuation and special characters.
    - Abbreviate common corporate terms (e.g. corporation -> corp).
    - Drop the 'plc' and 'llc' legal entity identifiers.
    - Collapse and trim whitespace.
    """
    # NaN and None are not strings, so this also covers missing values
    if not isinstance(name, str):
        return None

    # Convert to lowercase
    name = name.lower()
    # Remove punctuation and special characters (keep word characters and whitespace)
    name = re.sub(r'[^\w\s]', '', name)
    # Normalise corporate terms; \b restricts each match to a whole word,
    # so that e.g. ' corporations' is not turned into ' corps'
    abbreviations = {
        r' corporation\b': ' corp',
        r' incorporated\b': ' inc',
        r' company\b': ' co',
        r' limited\b': ' ltd',
        r' plc\b': '',
        r' llc\b': '',
        r' l p\b': ' lp',
    }
    for pattern, replacement in abbreviations.items():
        name = re.sub(pattern, replacement, name)
    # Collapse repeated whitespace and trim
    name = re.sub(r'\s+', ' ', name).strip()
    return name

still_missing['cleaned_name'] = still_missing['instrumentname'].apply(clean_company_name)
RepRisk_company_name['cleaned_name'] = RepRisk_company_name['company_name'].apply(clean_company_name)

merged_df = still_missing.merge(RepRisk_company_name, on='cleaned_name', how='left')

merged_df

Unnamed: 0,isin,instrumentname,cleaned_name,reprisk_id,company_name
0,US3371621018,First Horizon National Corp,first horizon national corp,,
1,US2473611083,Delta Air Lines Inc,delta air lines inc,,
2,US0442041051,Ashland Inc,ashland inc,,
3,US1251291068,CDW COMPUTER CENTERS INC,cdw computer centers inc,,
4,CA4532584022,Inco Ltd,inco ltd,,
...,...,...,...,...,...
34991,US8200141088,Sharplink Gaming Ltd,sharplink gaming ltd,,
34992,US74019P2074,Precision Biosciences Inc,precision biosciences inc,562341,Precision BioSciences Inc
34993,CA00792K1075,Aero Energy Ltd,aero energy ltd,,
34994,US87975F1049,Telomir Pharmaceuticals Inc,telomir pharmaceuticals inc,,


In [23]:
still_missing2 = merged_df[merged_df['reprisk_id'].isna()][['isin', 'instrumentname', 'cleaned_name']]

In [32]:
from fuzzywuzzy import fuzz

reprisk_df = RepRisk_company_name.copy()

# Map each cleaned name to a reprisk_id once, so lookups do not filter the dataframe.
# When several companies share a cleaned name, the first reprisk_id is kept.
name_to_reprisk_id = (
    reprisk_df.dropna(subset=['cleaned_name'])
    .drop_duplicates('cleaned_name')
    .set_index('cleaned_name')['reprisk_id']
    .to_dict()
)

# Set of unique cleaned names to iterate over
unique_cleaned_names = set(name_to_reprisk_id)

# Cache of fuzzy matching results, keyed by the name that was looked up
fuzzy_match_cache = {}

def get_reprisk_id_optimized(still_missing_name, threshold=90):
    """
    Use fuzzy matching to find a close match in the RepRisk names for a given
    company name. Results are cached to speed up repeated lookups.

    Note: the search stops at the first candidate scoring at or above the
    threshold, which is not necessarily the best one, and partial_ratio gives
    high scores to short candidates contained in a longer name.
    """
    if pd.isnull(still_missing_name):
        return None

    # Check if the name is already in the cache
    if still_missing_name in fuzzy_match_cache:
        return fuzzy_match_cache[still_missing_name]

    # Look for a match for the company name in the set of unique cleaned names
    best_match = None
    best_score = 0
    for candidate in unique_cleaned_names:
        score = fuzz.partial_ratio(still_missing_name, candidate)
        if score > best_score:
            best_score = score
            best_match = candidate
        # If we reach a score that's good enough, we can stop searching
        if score >= threshold:
            break

    # If the best score reaches the threshold, look up the reprisk_id
    if best_score >= threshold:
        reprisk_id = name_to_reprisk_id[best_match]
    else:
        reprisk_id = None

    # Cache the result
    fuzzy_match_cache[still_missing_name] = reprisk_id
    return reprisk_id
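The loop above stops at the first candidate that reaches the threshold, and a partial-match score rewards a short candidate that appears inside a longer name. As a point of comparison, here is a minimal sketch that scans every candidate and scores whole strings. It swaps in `difflib.SequenceMatcher` from the standard library instead of fuzzywuzzy, so the scores are on a 0 to 1 scale and are not directly comparable; `best_match` and the candidate list are hypothetical names and data.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.9):
    """Scan every candidate and return (candidate, score) for the top score,
    or (None, score) when even the top score is below the threshold."""
    best_name, best_score = None, 0.0
    for candidate in candidates:
        # Whole-string similarity in [0, 1]; a short substring scores low here
        score = SequenceMatcher(None, name, candidate).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Illustrative candidates, including a very short one
candidates = ["inc", "delta apparel inc", "delta airlines inc"]
match, score = best_match("delta air lines inc", candidates)
print(match, round(score, 3))

no_match, _ = best_match("ashland inc", candidates)
print(no_match)
```

A full scan is slower than an early exit, so this trades speed for a more defensible match.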

In [None]:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallelize_dataframe(df, func, n_cores=4):
    # Split by position: calling np.array_split directly on a pandas object
    # can trigger a FutureWarning in recent pandas versions
    chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n_cores)]
    with ThreadPoolExecutor(max_workers=n_cores) as executor:
        return pd.concat(executor.map(func, chunks))

def apply_func_to_series(data_series):
    return data_series.apply(get_reprisk_id_optimized)

subset_still_missing_df = still_missing2.copy()
subset_still_missing_df = parallelize_dataframe(subset_still_missing_df['cleaned_name'], apply_func_to_series, n_cores=4)
subset_still_missing_df
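Comparing every unmatched name against every RepRisk name is expensive. A common way to shrink the search, sketched here under the assumption that true matches share their first word, is to group candidates by a "blocking key" and only compare within the same group. The helpers `build_blocks` and `match_within_block` are hypothetical, and the sketch uses `difflib.get_close_matches` from the standard library instead of fuzzywuzzy.

```python
from collections import defaultdict
from difflib import get_close_matches

def build_blocks(names):
    """Group candidate names by their first word (the blocking key)."""
    blocks = defaultdict(list)
    for name in names:
        tokens = name.split() if name else []
        if tokens:
            blocks[tokens[0]].append(name)
    return blocks

def match_within_block(name, blocks, cutoff=0.9):
    """Return the closest candidate sharing the first word, or None."""
    tokens = name.split() if name else []
    if not tokens:
        return None
    candidates = blocks.get(tokens[0], [])
    hits = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Illustrative cleaned names
reprisk_names = [
    "delta airlines inc",
    "delta apparel inc",
    "inco ltd",
    "ashland global holdings inc",
]
blocks = build_blocks(reprisk_names)

first = match_within_block("delta air lines inc", blocks)
second = match_within_block("ashland inc", blocks)
print(first, second)
```

The trade-off is recall: a name whose first word differs between the two datasets can never be matched this way.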

