# Merging RepRisk and Markit

To compute the ratios we need, and to compare the two datasets, we first have to merge them.

A first idea was to use the ISIN columns available in both dataframes (`isin` in Markit, `primary_isin` in RepRisk), but it turns out that this column is poorly populated in the RepRisk dataset. We might therefore need to match companies on their names, which would require cleaning the names in both datasets.

Let's first explore the idea of merging on ISIN.

In [46]:
from pathlib import Path

import pandas as pd

import config

DATA_DIR = Path(config.DATA_DIR)
file_path = DATA_DIR / "pulled"

Let's have a look at RepRisk first.

In [47]:
RepRisk_company = pd.read_parquet(file_path / "reprisk_company.parquet")
RepRisk_company.head()

Unnamed: 0,reprisk_id,company_name,primary_isin,isins
0,10,Acer Inc,TW0002353000,US0044341065 | US0044342055 | TW0002353000
1,100,Rio Tinto PLC,GB0007188757,GB0007406639 | BRRIOTBDR007 | ARDEUT112638 | G...
2,1000,Terrane Metals Corp,CA88103A1084,CA88103A1167 | US88103A3068 | CA88103A1084 | C...
3,10000,RAK Properties PJSC,AER000601016,AER000601016
4,100000,BLUECOM Co Ltd,KR7033560004,KR7033560004


In [48]:
(RepRisk_company[['primary_isin']].isna().sum() / len(RepRisk_company)).to_frame("percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
primary_isin,85.03%


Of the more than 600k companies referenced in RepRisk, 85% have no primary ISIN.

Let's have a look at Markit now.

In [49]:
Markit = pd.read_parquet(file_path / "markit.parquet")
Markit_company = Markit[["isin", "instrumentname"]].drop_duplicates()
Markit_company.head()

Unnamed: 0,isin,instrumentname
0,DE0005552004,Deutsche Post Ag
1,US98956P1021,Zimmer Holdings Inc
2,US86764P1093,Sunoco Inc
3,US7901481009,St. Joe Co
4,US8265521018,Sigma-aldrich Corp


In [50]:
(Markit_company[['isin']].isna().sum() / len(Markit_company)).to_frame("percentage_missing_isin").style.format("{:.2%}")

Unnamed: 0,percentage_missing_isin
isin,5.33%


For Markit, only 5% of the companies have a missing ISIN.

We now check how many of the ISINs available in Markit also appear as a primary ISIN in RepRisk.

In [51]:
isin_intersection = Markit_company['isin'].dropna().isin(RepRisk_company['primary_isin'].dropna())

In [52]:
f"{isin_intersection.mean():.2%}"

'47.27%'

Only 47% of the ISINs in Markit appear as a primary ISIN in RepRisk, so ISIN alone is not enough to merge the two datasets. We will also have to match on company name.
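The check above compares Markit ISINs with `primary_isin` only. The `isins` column shown in the RepRisk preview holds several pipe-separated ISINs per company, so a lookup built from that column might raise the match rate; this has not been measured here. The sketch below illustrates the idea on toy data. The dataframes `reprisk` and `markit` and the name `isin_lookup` are assumptions made for the example, not objects from this notebook.

```python
import pandas as pd

# Toy stand-ins for RepRisk_company and Markit_company (illustrative values only).
reprisk = pd.DataFrame({
    "reprisk_id": [10, 100],
    "primary_isin": ["TW0002353000", "GB0007188757"],
    "isins": ["US0044341065 | TW0002353000", "GB0007406639 | GB0007188757"],
})
markit = pd.DataFrame({"isin": ["US0044341065", "GB0007188757", "US7901481009", None]})

# Build one row per (reprisk_id, ISIN): split the pipe-separated string,
# explode the resulting lists, then strip the surrounding whitespace.
isin_lookup = (
    reprisk.assign(isin=reprisk["isins"].str.split("|"))
    .explode("isin", ignore_index=True)
    .assign(isin=lambda df: df["isin"].str.strip())
    .dropna(subset=["isin"])[["reprisk_id", "isin"]]
    .drop_duplicates()
)

markit_isins = markit["isin"].dropna()
# Share of Markit ISINs found among primary ISINs only.
share_primary = markit_isins.isin(reprisk["primary_isin"].dropna()).mean()
# Share of Markit ISINs found among all listed ISINs.
share_any = markit_isins.isin(isin_lookup["isin"]).mean()
print(f"primary only: {share_primary:.2%}, any ISIN: {share_any:.2%}")
```

On real data, one ISIN could appear under more than one `reprisk_id`, so the lookup would need a uniqueness check before being used as a merge key.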

In [58]:
RepRisk_id_on_isin = Markit_company.merge(RepRisk_company[['reprisk_id', 'primary_isin']].dropna(), left_on="isin", right_on="primary_isin", how="left")
RepRisk_id_on_isin.head()

Unnamed: 0,isin,instrumentname,reprisk_id,primary_isin
0,DE0005552004,Deutsche Post Ag,3794,DE0005552004
1,US98956P1021,Zimmer Holdings Inc,182884,US98956P1021
2,US86764P1093,Sunoco Inc,978,US86764P1093
3,US7901481009,St. Joe Co,7502,US7901481009
4,US8265521018,Sigma-aldrich Corp,7620,US8265521018


In [59]:
(RepRisk_id_on_isin[['reprisk_id']].isna().sum() / len(RepRisk_id_on_isin)).to_frame("percentage_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_missing_reprisk_id
reprisk_id,55.24%


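Matching on ISIN only, 55% of the companies in Markit cannot be linked to a `reprisk_id` in RepRisk. This is higher than the 53% implied by the previous check because it also counts the Markit rows that have no ISIN at all. We will now try to match these unmatched companies on their name.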
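The left merge used here can also report its own match status. As a minimal sketch on toy data (the dataframes and the company "No Isin Co" are made up for the example), pandas' `indicator=True` adds a `_merge` column, and `validate="many_to_one"` raises an error if a key is duplicated on the right side, which would otherwise silently multiply rows and distort the percentages.

```python
import pandas as pd

# Toy stand-ins for Markit_company and RepRisk_company (illustrative values only).
markit = pd.DataFrame({
    "isin": ["DE0005552004", "US3371621018", None],
    "instrumentname": ["Deutsche Post Ag", "First Horizon National Corp", "No Isin Co"],
})
reprisk = pd.DataFrame({
    "reprisk_id": [3794, 999],
    "primary_isin": ["DE0005552004", None],
})

merged = markit.merge(
    reprisk[["reprisk_id", "primary_isin"]].dropna(),
    left_on="isin",
    right_on="primary_isin",
    how="left",
    validate="many_to_one",  # fails loudly if a primary_isin is duplicated
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Count matched and unmatched rows, then keep the unmatched ones for the next step.
match_counts = merged["_merge"].value_counts()
unmatched = merged.loc[merged["_merge"] == "left_only", ["isin", "instrumentname"]]
print(match_counts)
```

Whether `primary_isin` is actually unique in RepRisk is not established in this notebook; if it is not, the `validate` argument would make that visible.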

In [60]:
companies_missing_isin = RepRisk_id_on_isin[RepRisk_id_on_isin['reprisk_id'].isna()][['isin', 'instrumentname']]
companies_missing_isin.head()

Unnamed: 0,isin,instrumentname
14,US3371621018,First Horizon National Corp
15,US3199631041,First Data Corp
16,US2473611083,Delta Air Lines Inc
19,US0442041051,Ashland Inc
20,US1251291068,CDW COMPUTER CENTERS INC


In [61]:
RepRisk_id_on_company_name = companies_missing_isin.merge(RepRisk_company, left_on="instrumentname", right_on="company_name", how="left")
RepRisk_id_on_company_name.head()

Unnamed: 0,isin,instrumentname,reprisk_id,company_name,primary_isin,isins
0,US3371621018,First Horizon National Corp,,,,
1,US3199631041,First Data Corp,1524305.0,First Data Corp,US32008D1063,US32008D1063
2,US2473611083,Delta Air Lines Inc,,,,
3,US0442041051,Ashland Inc,,,,
4,US1251291068,CDW COMPUTER CENTERS INC,,,,


In [62]:
(RepRisk_id_on_company_name[['reprisk_id']].isna().sum() / len(RepRisk_id_on_company_name)).to_frame("percentage_still_missing_reprisk_id").style.format("{:.2%}")

Unnamed: 0,percentage_still_missing_reprisk_id
reprisk_id,85.13%


Merging on the raw company names matches only 15% of the companies that had no ISIN match. We will have to clean the company names in both datasets before merging on them.
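One possible shape for that cleaning step is sketched below. It is an assumption about how the names could be normalized, not the method used later in this project: the function `normalize_name`, the `LEGAL_SUFFIXES` set, the `name_key` column, and the toy `reprisk_id` values are all hypothetical.

```python
import re
import unicodedata

import pandas as pd

# Hypothetical list of legal-form suffixes; extend it after inspecting the real data.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "plc", "ag", "sa", "nv", "pjsc", "llc"}


def normalize_name(name):
    """Return a simplified matching key for a company name, or None if missing."""
    if not isinstance(name, str):
        return None
    # Strip accents, lowercase, and replace punctuation with spaces.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    tokens = re.sub(r"[^a-z0-9 ]", " ", ascii_name.lower()).split()
    # Drop trailing legal suffixes, but always keep at least one token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Toy data: the reprisk_id values are made up for the example.
markit = pd.DataFrame({"instrumentname": ["Sigma-aldrich Corp", "CDW COMPUTER CENTERS INC"]})
reprisk = pd.DataFrame({
    "reprisk_id": [1, 2],
    "company_name": ["Sigma-Aldrich Corporation", "Delta Air Lines Inc"],
})

# Merge on the normalized key instead of the raw name.
markit["name_key"] = markit["instrumentname"].map(normalize_name)
reprisk["name_key"] = reprisk["company_name"].map(normalize_name)
matched = markit.merge(reprisk[["reprisk_id", "name_key"]], on="name_key", how="left")
print(matched)
```

Stripping suffixes makes the match more permissive, so distinct companies can collapse onto the same key; any matches obtained this way would need to be reviewed.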

In [63]:
companies_still_missing_reprisk_id = RepRisk_id_on_company_name[RepRisk_id_on_company_name['reprisk_id'].isna()][['isin', 'instrumentname']]
companies_still_missing_reprisk_id.head()

Unnamed: 0,isin,instrumentname
0,US3371621018,First Horizon National Corp
2,US2473611083,Delta Air Lines Inc
3,US0442041051,Ashland Inc
4,US1251291068,CDW COMPUTER CENTERS INC
5,CA4532584022,Inco Ltd
