# Cleaning of a large NYSE dataset and fuzzy matching on company names from two different datasets.
Francis Peng <br>
BA Economics/Mathematics, Music - May 2023 <br>
Emory University Department of Economics

Parts of this notebook are work in progress; they are delimited by the markers **_Begin Work in Progress_** and **_End Work in Progress_**.

This notebook describes and executes the process of cleaning a large dataset of NYSE stock listings and matching company names across two different datasets. The rapidfuzz library is used to implement fuzzy matching, which is needed because the same company may appear differently in the two datasets. For example, a company might be listed as "X Corporation" in one dataset and "X Corp" in the other. These are the same company and should be matched as such.
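
To make the idea concrete, here is a minimal sketch of similarity scoring. It uses Python's standard-library `difflib` as a stand-in for rapidfuzz (an assumption made so the example runs without extra installs; rapidfuzz uses different scorers, so its scores will differ). The helper `similarity` and the second sample name are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return 100 * SequenceMatcher(None, a.upper(), b.upper()).ratio()


# Two spellings of the same company should score higher than unrelated names.
same = similarity("X Corporation", "X Corp")
different = similarity("X Corporation", "General Mills Inc")
print(round(same, 1), round(different, 1))
```

A fuzzy matcher does essentially this for every candidate name and keeps the highest-scoring one.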

This fuzzy matching process was developed for the purposes of the following working papers:

    Fohlin, Caroline (2020) “Crisis and Innovation:  The Transformation of the New York Stock Exchange from the Great War to the Great Depression,” working paper, Emory University.

    Fohlin, Caroline and Phoebe Lei (2020) “The Determinants of the Volume of Initial Public Offerings During the Great Depression and World War II,” working paper, Emory University.


To protect the integrity of these unpublished materials, the source datasets are not output in their entirety anywhere in the following notebook. Only the final result table has been displayed to show that the process does indeed work.

In [223]:
# Installing rapidfuzz.
# !pip install rapidfuzz

In [224]:
# importing needed libraries
import pandas as pd
import numpy as np
from rapidfuzz import process
from rapidfuzz import fuzz

**_Begin Work in Progress_**

# Cleaning up the large dataset of NYSE listings.
The large dataset (the CRSP dataset) contains companies listed on the NYSE.

The cleaning of this dataset aims to achieve the following: <br>
1. Standardize the naming conventions of companies.
2. Fill in missing company names.
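
The first goal can be illustrated on single strings. The sketch below is a plain-Python approximation of the replacements applied in this notebook's cleaning cells, not the notebook's own code: the helper `normalize_name` is hypothetical, and the word-boundary matching and whitespace collapsing are assumptions added for illustration (the notebook replaces plain substrings).

```python
import re

# Abbreviations mirroring the replacements used in this notebook.
ABBREVIATIONS = {
    "CORPORATION": "CO",
    "INCORPORATED": "INC",
    "COMPANY": "CO",
    "LIMITED": "LTD",
}


def normalize_name(name: str) -> str:
    """Uppercase, drop commas and periods, turn hyphens into spaces, abbreviate."""
    name = name.upper().replace(",", "").replace(".", "").replace("-", " ")
    for long_form, short_form in ABBREVIATIONS.items():
        # \b restricts the replacement to whole words (an assumption).
        name = re.sub(rf"\b{long_form}\b", short_form, name)
    return " ".join(name.split())  # collapse repeated whitespace


print(normalize_name("Great Northern Railway Company"))  # GREAT NORTHERN RAILWAY CO
print(normalize_name("General Mills, Inc."))             # GENERAL MILLS INC
```

Normalizing both datasets the same way means that differences in punctuation and suffix spelling no longer lower the match score.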

In [225]:
# Load the dataset
df = pd.read_stata('full_dataset.dta')

In [226]:
print(len(pd.unique(df['comnam']))) # unique company names
print(len(pd.unique(df['permco']))) # unique identifiers

2758
1574


In [227]:
# df.head() # taking a look

In [228]:
# Cleaning up company names
df['comnam'] = df['comnam'].str.upper()
df['comnam'] = df['comnam'].replace(
    regex={',':'', r'\.':'', '-':' '}).replace(
    regex={'CORPORATION':'CO', 'INCORPORATED':'INC', 'COMPANY':'CO', 'LIMITED':'LTD'}
)

In [229]:
print(len(pd.unique(df['comnam']))) 
print(len(pd.unique(df['permco'])))

2508
1574


In [230]:
# test = df.dropna(subset=['permco'])
# print(len(pd.unique(test['permco'])))
# print(len(test))
# print(len(pd.unique(test['comnam'])))

In [231]:
df['comnam'] = df['comnam'].replace('', np.nan) # replacing empty strings with null values
noblank = df.dropna(subset=['comnam']) # dropping null values

In [232]:
print(len(pd.unique(noblank['permco'])))

987


The output above shows that some permcos have no row with a comnam, i.e., they cannot be identified by company name. The code directly below counts these permcos and saves them to a CSV file.

In [233]:
a = pd.unique(noblank['permco'])
b = pd.unique(df['permco'])
print(len(np.setdiff1d(b, a)))
c = np.asarray(np.setdiff1d(b, a))
np.savetxt("permco_missing_comnam.csv", c, delimiter=",")

588
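
The same check can be reproduced on a toy table to see what it does. This is a minimal sketch with made-up values; only the column names `permco` and `comnam` come from the dataset, and the group-by variant is an alternative shown for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; values are made up for illustration.
toy = pd.DataFrame({
    "permco": [1, 1, 2, 2, 3],
    "comnam": ["ACME CO", np.nan, np.nan, np.nan, "ZETA INC"],
})

# For each permco, check whether at least one row carries a name.
has_name = toy["comnam"].notna().groupby(toy["permco"]).any()
unnamed_permcos = has_name[~has_name].index.to_numpy()

# Same answer via the set-difference approach used above.
named = pd.unique(toy.dropna(subset=["comnam"])["permco"])
via_setdiff = np.setdiff1d(pd.unique(toy["permco"]), named)

print(unnamed_permcos, via_setdiff)
```

Here permco 2 is the only identifier that never appears next to a name, so both approaches report it.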


In [234]:
# df[df['permco'] == 22195].head()

In [235]:
# pd.unique(df.loc[df['permco'] == 56225, 'comnam'])

This result will be dealt with later; for now, we continue with the cleaning. In particular, we fill in comnam for the permcos that are identifiable.

1. First, we build a dataframe containing only comnam and permco from the existing dataframe, keeping one name (the longest) per permco.
2. Then, we join this dataframe with the original dataframe on permco.
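
The two steps can be sketched on a toy table before running them on the full dataset. The values below are made up for illustration; the column names and the choice of the longest name per permco follow the cells in this section.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; values are made up for illustration.
toy = pd.DataFrame({
    "permco": [1, 1, 1, 2, 2],
    "comnam": ["ACME CO", np.nan, "ACME MANUFACTURING CO", "ZETA INC", np.nan],
})

# Step 1: one name per permco, preferring the longest recorded name.
lookup = toy[["comnam", "permco"]].dropna()
lookup = lookup.sort_values("comnam", key=lambda s: s.str.len(), ascending=False)
lookup = lookup.drop_duplicates(subset=["permco"], keep="first")
lookup = lookup.rename(columns={"comnam": "new_name"})

# Step 2: attach the chosen name to every row with the same permco.
filled = toy.join(lookup.set_index("permco"), on="permco")
print(filled)
```

Every row of a permco receives the same `new_name`, including rows whose original `comnam` was missing.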

In [236]:
# Step 1.
# Create a dataframe with only comnam and permco
working_df = df[['comnam', 'permco']].copy()

In [237]:
working_df.dropna(subset = ['comnam'], inplace = True)
working_df.dropna(subset = ['permco'], inplace = True)

In [238]:
working_df.drop_duplicates(subset = ['comnam'], inplace = True)

In [239]:
working_df = working_df.sort_values('comnam', key = lambda x:x.str.len(), ascending = False)

In [240]:
working_df = working_df.drop_duplicates(subset = ['permco'], keep = 'first' )

In [241]:
working_df = working_df.reset_index(drop = True)

In [242]:
len(working_df)

984

In [243]:
working_df = working_df.rename(columns={'comnam':'new_name'})
new_df = df.join(working_df.set_index('permco'), on='permco')

In [244]:
new_df.to_csv('full_dataset_with_comnam.csv')

**_End Work in Progress_**

# Fuzzy Matching

In [245]:
# Loading the two datasets
dataA = pd.read_excel('Data.xlsx')
# dataB = pd.read_excel('CRSP.xlsx')
dataB = new_df.copy()

### Cleaning Up the Data

In [246]:
dataB = dataB.rename(columns={'new_name':'company'})

In [247]:
# Dropping duplicates
dataAW = dataA.dropna(subset = ['Company Name']).drop_duplicates('Company Name', keep = 'first').reset_index(drop=True)

In [248]:
dataBW = dataB.dropna(subset = ['company']).drop_duplicates('company', keep = 'first').reset_index(drop=True)

In [249]:
# Replacing common terms with abbreviations, deleting punctuation, making everything uppercase for ease of reading.
dataAW['N_Company Name'] = dataAW['Company Name'].str.upper()
dataBW['N_company'] = dataBW['company'].str.upper()

dataAW['N_Company Name'] = dataAW['N_Company Name'].replace(
    regex={',':'', r'\.':'', '-':' '}).replace(
    regex={'CORPORATION':'CO', 'INCORPORATED':'INC', 'COMPANY':'CO', 'LIMITED':'LTD'}
)

dataBW['N_company'] = dataBW['N_company'].replace(
    regex={',':'', r'\.':'', '-':' '}).replace(
    regex={'CORPORATION':'CO', 'INCORPORATED':'INC', 'COMPANY':'CO', 'LIMITED':'LTD'}
)

#dataAW['N_Company Name'] = dataAW['N_Company Name'].replace(
#    regex={',':'', '\.':'', '-':' '}).replace(
#    regex={'CORPORATION':'', 'INCORPORATED':'', 'COMPANY':''}).replace(
#    regex={'LTD':'', 'CORP':'', 'INC':''}).replace(
#    regex={'CO':''}
#)

#dataBW['N_company'] = dataBW['N_company'].replace(
#    regex={',':'', '\.':'', '-':' '}).replace(
#    regex={'CORPORATION':'', 'INCORPORATED':'', 'COMPANY':''}).replace(
#    regex={'LTD':'', 'CORP':'', 'INC':''}).replace(
#    regex={'CO':''}
#)

dataAW = dataAW.dropna(subset = ['N_Company Name']).drop_duplicates('N_Company Name', keep = 'first').reset_index(drop=True)
dataBW = dataBW.dropna(subset = ['N_company']).drop_duplicates('N_company', keep = 'first').reset_index(drop=True)

In [254]:
# Matching using the rapidfuzz library
rows = []
for index, row in dataAW.iterrows():
    # extractOne returns (best match, score, key of the match)
    result = process.extractOne(row['N_Company Name'], dataBW['N_company'])
    #if result[1] >= 86:
    rows.append({'Match_Score':result[1], 'N_AWname':row['N_Company Name'], 'N_BWName':result[0], 'AWname':row['Company Name']})
# DataFrame.append was removed in pandas 2.0, so the rows are collected first
results_df = pd.DataFrame(rows, columns = ['Match_Score', 'N_AWname', 'N_BWName', 'AWname'])
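
The commented-out check in the loop suggests keeping only matches at or above a score of 86. A minimal sketch of applying such a cutoff after the fact is shown below; the scores and names are made up for illustration, and the value 86 is taken from that comment rather than being a tuned threshold.

```python
import pandas as pd

# Made-up scores for illustration; only the column names come from the notebook.
toy_results = pd.DataFrame({
    "Match_Score": [100.0, 90.5, 60.0],
    "N_AWname": ["SUN OIL CO", "A B C CO", "X Y Z INC"],
    "N_BWName": ["SUN OIL CO", "A B C CORP", "Q R S LTD"],
})

THRESHOLD = 86  # from the commented-out check; not a tuned cutoff

# Split matches into those accepted automatically and those needing a manual look.
accepted = toy_results[toy_results["Match_Score"] >= THRESHOLD]
needs_review = toy_results[toy_results["Match_Score"] < THRESHOLD]
print(len(accepted), len(needs_review))
```

Filtering after matching, rather than inside the loop, keeps the low-scoring pairs available for manual review.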

In [255]:
# Sorting values by match score given by rapidfuzz
results_final = results_df.sort_values(by = ['Match_Score'], ascending = False).drop_duplicates('N_AWname', keep = 'first').reset_index(drop=True)

In [256]:
results_final.head()

| | Match_Score | N_AWname | N_BWName | AWname |
|---|---|---|---|---|
| 0 | 100.0 | GENERAL ICE CREAM CORP | GENERAL ICE CREAM CORP | General Ice Cream Corp. |
| 1 | 100.0 | GREAT NORTHERN RAILWAY CO | GREAT NORTHERN RAILWAY CO | Great Northern Railway Company |
| 2 | 100.0 | GENERAL MILLS INC | GENERAL MILLS INC | General Mills, Inc. |
| 3 | 100.0 | NATIONAL BISCUIT CO | NATIONAL BISCUIT CO | National Biscuit Company |
| 4 | 100.0 | SUN OIL CO | SUN OIL CO | Sun Oil Co. |


In [257]:
# Writing to a csv file
output = results_final
output.to_csv('10_27_rapidfuzz.csv')