In [1]:
import pandas as pd

### Harmonizing Company Names

To clean company names in this dataset, Big Local News used an internal machine learning tool called the harmonizer. The harmonizer takes a dataset and a list of `stop_words` (words with little semantic value) and uses machine learning algorithms to clean and standardize the company names without giving much weight to the stop words.
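
The harmonizer is internal, so its exact cleaning rules are not shown here. As a rough illustration only, the sketch below strips punctuation and a small stop-word list from a company name before comparison. The function `strip_stop_words` and the `STOP_WORDS` set are assumptions made for this example and are not part of the harmonizer.

```python
import re

# Hypothetical stop-word list; the harmonizer's actual list is not shown in this notebook.
STOP_WORDS = {"LLC", "INC", "USA", "DBA", "THE", "GROUP"}

def strip_stop_words(name, stop_words=STOP_WORDS):
    """Uppercase a name, drop punctuation, and remove stop words."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", name.upper())
    tokens = [tok for tok in cleaned.split() if tok not in stop_words]
    return " ".join(tokens)

print(strip_stop_words("24 HOUR FITNESS, USA, INC."))  # 24 HOUR FITNESS
print(strip_stop_words("4LEAF, Inc."))                 # 4LEAF
```

With the stop words removed, spelling variants such as "4LEAF, Inc" and "4LEAF, Inc." reduce to the same string, which is the kind of normalization the `Company_harmonizer_cleaned` column reflects.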

For this data, we used the default threshold of 0.85. The threshold is a value between 0 and 1; values closer to 1 require a stricter match before two names are assigned the same ID. In other words, if the harmonizer judges two names to be at least 85% similar, it matches them.
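
To make the threshold concrete, here is a minimal sketch that uses Python's standard-library `difflib.SequenceMatcher` as a stand-in similarity measure. This is not the harmonizer's algorithm, and `similarity` and `assign_ids` are hypothetical helpers; the sketch only shows how a cutoff such as 0.85 decides whether two cleaned names share an ID.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 (stand-in metric, not the harmonizer's)."""
    return SequenceMatcher(None, a, b).ratio()

def assign_ids(names, threshold=0.85):
    """Greedily give each name the ID of the first earlier name it matches."""
    representatives = []  # first name seen for each ID
    ids = []
    for name in names:
        for idx, rep in enumerate(representatives):
            if similarity(name, rep) >= threshold:
                ids.append(idx)
                break
        else:
            # No existing representative is similar enough: start a new ID.
            representatives.append(name)
            ids.append(len(representatives) - 1)
    return ids

names = ["24 HOUR FITNESS", "24 HOUR FITNES", "JC PENNEY"]
print(assign_ids(names))                  # [0, 0, 1]
print(assign_ids(names, threshold=0.99))  # [0, 1, 2]
```

Raising the threshold toward 1 splits near-identical spellings into separate IDs, while lowering it merges more names at the risk of false matches.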

***Caveat:*** Machine learning algorithms for data cleaning (or for anything, really) are not perfect, and the harmonizer likely missed some company name matches. However, using a tool such as the harmonizer lets us share analysis that others can reproduce, provided they follow the same steps. Without it, re-creating a cleaned and standardized version of this data identical to ours would be needlessly arduous.

In [2]:
warn_harmonized = pd.read_csv('../data/harmonizing/clean_warn_data_harmonized.csv')
warn_harmonized.head()

Unnamed: 0,Notice Date,Effective Date,Received Date,Company,City,County,Employees,Layoff/Closure,County Orig,Year,Layoff/Closure clean,Population,City 2,Company_harmonizer_cleaned,Company_harmonizer_score,Company_harmonizer_id,Company_harmonizer_standardized
0,07/01/2020,09/04/2020,10/02/2020,"**CoreLogic Credco, LLC (Cancelled)",San Diego,San Diego County,137.0,Layoff Permanent,San Diego County,2020,layoff permanent,3338330.0,san diego,CORELOGIC CREDCO CANCELLED,0.0,0,"**CoreLogic Credco, LLC (Cancelled)"
1,07/17/2020,09/23/2020,10/02/2020,**JC Penney (Cancelled),San Bernardino,San Bernardino County,109.0,Closure Permanent,San Bernardino County,2020,closure permanent,2180085.0,san bernardino,JC PENNEY CANCELLED,59.462185,1,**JC Penney (Cancelled)
2,03/25/2020,03/25/2020,03/25/2020,1 Hotel West Hollywood,West Hollywood,Los Angeles County,223.0,Layoff Temporary,Los Angeles County,2020,layoff temporary,10039107.0,west hollywood,1 HOTEL WEST HOLLYWOOD,31.301587,2,1 Hotel West Hollywood
3,04/10/2020,03/18/2020,05/01/2020,1100 Group LLC The Star & Little Star Plaza,Alameda,Alameda County,53.0,Layoff Temporary,Alameda County,2020,layoff temporary,1671329.0,alameda,1100 STAR LITTLE STAR PLAZA,34.318841,3,1100 Group LLC The Star & Little Star Plaza
4,03/28/2020,03/17/2020,04/10/2020,"115 New Montgomery LLC, DBA The Bird",San Francisco,San Francisco County,18.0,Layoff Temporary,San Francisco County,2020,layoff temporary,881549.0,san francisco,115 NEW MONTGOMERY BIRD,35.15493,4,"115 New Montgomery LLC, DBA The Bird"


In [3]:
len(warn_harmonized)

6612

In [4]:
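# Flag every row that shares the same cleaned company name, city, county, employee count, and year.
key_cols = ['Company_harmonizer_cleaned', 'City 2', 'County', 'Employees', 'Year']
dupes = warn_harmonized[warn_harmonized.duplicated(subset=key_cols, keep=False)]
# sort_values returns a new DataFrame, so assign the result.
dupes = dupes.sort_values(by='Company', ascending=True)
dupes.head()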

Unnamed: 0,Notice Date,Effective Date,Received Date,Company,City,County,Employees,Layoff/Closure,County Orig,Year,Layoff/Closure clean,Population,City 2,Company_harmonizer_cleaned,Company_harmonizer_score,Company_harmonizer_id,Company_harmonizer_standardized
16,07/23/2020,07/23/2020,08/18/2020,"24 HOUR FITNESS, USA, INC.",Carlsbad,San Diego County,39.0,Layoff Temporary,San Diego County,2020,layoff temporary,3338330.0,carlsbad,24 HOUR FITNESS,27.438596,15,"24 HOUR FITNESS, USA, INC."
18,08/05/2020,09/09/2020,08/10/2020,"24 Hour Fitness, USA, Inc",Carlsbad,San Diego County,39.0,Layoff Type Unknown,San Diego County,2020,layoff type uncategorized,3338330.0,carlsbad,24 HOUR FITNESS,95.833333,16,24 Hour Fitness USA Inc.
30,04/03/2020,04/03/2020,04/17/2020,"4LEAF, Inc",Pleasanton,Alameda County,53.0,Layoff Temporary,Alameda County,2020,layoff temporary,1671329.0,pleasanton,4LEAF,36.623377,27,"4LEAF, Inc"
31,04/03/2020,04/03/2020,04/17/2020,"4LEAF, Inc.",Pleasanton,Alameda County,53.0,Layoff Temporary,Alameda County,2020,layoff temporary,1671329.0,pleasanton,4LEAF,100.0,27,"4LEAF, Inc"
81,06/08/2020,06/30/2020,06/15/2020,AEG LA Youth Soccer Academy LLC,Carson,Los Angeles County,3.0,Layoff Permanent,Los Angeles County,2020,layoff permanent,10039107.0,carson,AEG YOUTH SOCCER ACADEMY,35.972222,71,AEG LA Youth Soccer Academy LLC


In [5]:
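# Blank out every row flagged in `dupes` (matched on index label), then drop rows containing NaN.
# Note: this removes all copies of each duplicate, and also any row that already had a missing value.
cali_no_dupes = warn_harmonized[~warn_harmonized.isin(dupes)].dropna()
len(cali_no_dupes)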

6370
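
The mask-and-`dropna` pattern above removes every copy of a flagged row, along with any row that already contained a missing value. If the goal were instead to keep one copy of each duplicated notice, a sketch along these lines could be used. The toy frame below is made up for illustration; only the column names come from the data, and the row count on the real file would differ from the 6370 shown above.

```python
import pandas as pd

# Toy data (made up) using the same key columns as the notebook.
toy = pd.DataFrame({
    "Company_harmonizer_cleaned": ["4LEAF", "4LEAF", "JC PENNEY", "ACME"],
    "City 2": ["pleasanton", "pleasanton", "san bernardino", "fresno"],
    "County": ["Alameda County", "Alameda County", "San Bernardino County", "Fresno County"],
    "Employees": [53.0, 53.0, 109.0, None],
    "Year": [2020, 2020, 2020, 2020],
})

key_cols = ["Company_harmonizer_cleaned", "City 2", "County", "Employees", "Year"]

# keep=False flags every member of a duplicate group, as in the notebook.
all_copies = toy[toy.duplicated(subset=key_cols, keep=False)]

# keep="first" retains one row per group and leaves rows with missing values alone.
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(len(all_copies), len(deduped))  # 2 3
```

Which behavior is appropriate depends on whether repeated notices should be counted once or excluded entirely, so this is an alternative to consider and not a correction of the published figures.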

In [6]:
cali_no_dupes.to_csv('../data/analysis/finalized_warn_data.csv', index=False)