In [2]:
import pandas as pd

### Standardizing Company Names

To clean company names in this dataset, Big Local News used a tool called OpenRefine, which clusters text values based on similarities in spelling. Several clustering algorithms are available, such as `Key Collision - Fingerprint`, `Key Collision - metaphone3`, and `Nearest Neighbor - Levenshtein`.
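
As a rough illustration of how a key-collision method groups names, the sketch below approximates a fingerprint key in Python. It is not OpenRefine's implementation: `fingerprint` and `cluster_by_fingerprint` are hypothetical helper names, and the steps (ASCII folding, lowercasing, stripping punctuation, sorting unique tokens) are a simplified assumption about the method. The sample values come from the `Company` and `Company 2` columns shown later in this notebook.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Build a key that ignores case, punctuation, accents, and word order."""
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    # Delete punctuation, so "Tri-Mountain" becomes "trimountain".
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster_by_fingerprint(names):
    """Group names that share a key; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


sample_names = [
    "Weber Metals, Inc",
    "WEBER METALS INC",
    "Tri-Mountain",
    "TRIMOUNTAIN",
    "StoneTree Golf Club",
]
clusters = cluster_by_fingerprint(sample_names)
for key, members in clusters.items():
    print(key, "->", members)
```

Names that collide on the same key become candidates for merging. In OpenRefine, a person still reviews each cluster and chooses the spelling to keep.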

We used all of the available algorithms to standardize company names, replacing variant spellings of a company with a single unified spelling. The `Parent Company` column holds the original company names. The `Company Affiliation` column holds the names after applying the OpenRefine clustering algorithms, so that each company is spelled the same way throughout.

Many records in our data list a company that is doing business under a different name. For these records, we split the text on `DBA`, which stands for "Doing Business As". The text before `DBA` became the parent company name, stored in the `Company Affiliation` column, and the text after `DBA` went into the `DBA Name` column. Records without `DBA` were left untouched, so the same text appears in both columns before any clustering.
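
The split described above could be reproduced in pandas roughly as follows. This is a sketch, not the OpenRefine steps actually used: the two sample records are invented for illustration, and it assumes `DBA` always appears as a standalone upper-case token surrounded by whitespace.

```python
import pandas as pd

# Hypothetical records; only the column names come from the dataset.
df = pd.DataFrame(
    {"Parent Company": ["ACME HOLDINGS LLC DBA ACME MARKET", "BAY CLUB"]}
)

# Capture the text before the first standalone "DBA" and, if present, the text after it.
pattern = r"^(?P<before>.*?)(?:\s+DBA\s+(?P<after>.*))?$"
parts = df["Parent Company"].str.extract(pattern)

# Text before "DBA" is the parent company name.
df["Company Affiliation"] = parts["before"].str.strip()
# Text after "DBA" is the trade name; records without "DBA" keep the original text.
df["DBA Name"] = parts["after"].str.strip().fillna(df["Parent Company"])

print(df)
```

Using `str.extract` with an optional group always yields both columns, even when no record in the frame contains `DBA`.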

Beyond the cluster-based replacements in the `Company Affiliation` column, we also reviewed the text facet in OpenRefine and manually corrected any parent company names that did not appear in the initial clusters.

In [4]:
warn_refined = pd.read_csv('../data/open_refine/open_refine_exported_warn_data.csv')
warn_refined.head()

Unnamed: 0,Notice Date,Effective Date,Received Date,Company,City,County,Employees,Layoff/Closure,Company 2,Parent Company,Company Affiliation,DBA Name,County Orig,Year,Layoff/Closure clean,Population,City 2
0,06/09/2020,06/07/2020,07/01/2020,Bay Club Redondo Beach,Redondo Beach,Los Angeles County,102.0,Layoff Permanent,BAY CLUB,BAY CLUB,BAY CLUB,BAY CLUB,Los Angeles County,2020,layoff permanent,10039107.0,redondo beach
1,06/09/2020,06/07/2020,07/01/2020,Bay Club Rolling Hills,Rolling Hills Estates,Los Angeles County,64.0,Layoff Permanent,BAY CLUB ROLLING HILLS,BAY CLUB ROLLING HILLS,BAY CLUB,BAY CLUB ROLLING HILLS,Los Angeles County,2020,layoff permanent,10039107.0,rolling hills estates
2,06/09/2020,06/07/2020,07/01/2020,Bay Club Santa Monica,Santa Monica,Los Angeles County,82.0,Layoff Permanent,BAY CLUB SANTA MONICA,BAY CLUB SANTA MONICA,BAY CLUB,BAY CLUB SANTA MONICA,Los Angeles County,2020,layoff permanent,10039107.0,santa monica
3,06/19/2020,08/21/2020,07/01/2020,"Weber Metals, Inc",Paramount,Los Angeles County,169.0,Layoff Permanent,WEBER METALS INC,WEBER METALS INC,WEBER METALS INC,WEBER METALS INC,Los Angeles County,2020,layoff permanent,10039107.0,paramount
4,06/09/2020,06/07/2020,07/01/2020,StoneTree Golf Club,Novato,Marin County,32.0,Layoff Permanent,STONETREE GOLF CLUB,STONETREE GOLF CLUB,STONETREE GOLF CLUB,STONETREE GOLF CLUB,Marin County,2020,layoff permanent,258826.0,novato


In [5]:
len(warn_refined)

6708

In [6]:
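# keep=False flags every record in a duplicate group, not just the later copies.
dupe_cols = ['Company Affiliation', 'City 2', 'County', 'Employees', 'Year']
dupes = warn_refined[warn_refined.duplicated(subset=dupe_cols, keep=False)]
# sort_values returns a new DataFrame that is discarded here, so head() shows the original row order.
dupes.sort_values(by='Company', ascending=True)
dupes.head()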

Unnamed: 0,Notice Date,Effective Date,Received Date,Company,City,County,Employees,Layoff/Closure,Company 2,Parent Company,Company Affiliation,DBA Name,County Orig,Year,Layoff/Closure clean,Population,City 2
6,06/23/2020,06/23/2020,07/01/2020,The Freeman Company LLC,Anaheim,Orange County,29.0,Layoff Permanent,THE FREEMAN COMPANY LLC,THE FREEMAN COMPANY LLC,THE FREEMAN COMPANY LLC,THE FREEMAN COMPANY LLC,Orange County,2020,layoff permanent,3175692.0,anaheim
41,06/16/2020,03/31/2020,07/02/2020,PLAYERS CASINO,Ventura,Ventura County,183.0,Layoff Temporary,PLAYERS CASINO,PLAYERS CASINO,PLAYERS CASINO,PLAYERS CASINO,Ventura County,2020,layoff temporary,846006.0,ventura
44,04/16/2020,03/23/2020,07/03/2020,Tri-Mountain,Irwindale,Los Angeles County,59.0,Layoff Temporary,TRIMOUNTAIN,TRIMOUNTAIN,TRIMOUNTAIN,TRIMOUNTAIN,Los Angeles County,2020,layoff temporary,10039107.0,irwindale
54,06/30/2020,05/09/2020,07/03/2020,Golden Valley Health Centers,Ceres,Stanislaus County,2.0,Layoff Permanent,GOLDEN VALLEY HEALTH CENTERS,GOLDEN VALLEY HEALTH CENTERS,GOLDEN VALLEY HEALTH CENTERS,GOLDEN VALLEY HEALTH CENTERS,Stanislaus County,2020,layoff permanent,550660.0,ceres
63,07/01/2020,07/01/2020,07/06/2020,ELECTRO RENT,Canoga Park,Los Angeles County,25.0,Layoff Permanent,ELECTRO RENT,ELECTRO RENT,ELECTRO RENT,ELECTRO RENT,Los Angeles County,2020,layoff permanent,10039107.0,canoga park


In [7]:
len(dupes)

440
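
To make the effect of `keep=False` concrete, here is a small sketch on an invented frame (the values are hypothetical; only the column names echo the dataset). It contrasts `keep=False`, which flags every member of a duplicate group, with `keep='first'`, which leaves the first occurrence unflagged.

```python
import pandas as pd

# Hypothetical toy data: two exact repeats of A/10 and two of C/7.
toy = pd.DataFrame(
    {
        "Company Affiliation": ["A", "A", "B", "C", "C", "C"],
        "Employees": [10, 10, 5, 7, 7, 8],
    }
)
key_cols = ["Company Affiliation", "Employees"]

# Flags all copies in each duplicate group.
all_copies = toy.duplicated(subset=key_cols, keep=False)
# Flags only the second and later copies.
later_copies = toy.duplicated(subset=key_cols, keep="first")

print(toy.assign(all_copies=all_copies, later_copies=later_copies))
```

With `keep=False`, removing the flagged rows drops every record in a duplicate group, including what may be the one legitimate notice. Whether that is the desired outcome depends on the analysis.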

In [8]:
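# Masking with isin() replaces the duplicate rows with NaN, and dropna() then removes them.
# Note: dropna() also removes any non-duplicate row that has a missing value in any column.
cali_no_dupes = warn_refined[~warn_refined.isin(dupes)].dropna()
len(cali_no_dupes)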

6264
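
The `isin(...)` plus `dropna()` idiom has a side effect worth seeing on a small example: it also discards rows that were never duplicates but contain a missing value. The sketch below uses an invented frame (hypothetical values) and compares that idiom with filtering directly on the boolean mask from `duplicated`. The 4-row gap between 6708 - 440 = 6268 and the 6264 reported above would be consistent with this side effect, though the notebook does not confirm the cause.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are duplicates; row 3 is unique but has a missing value.
toy = pd.DataFrame(
    {
        "Company Affiliation": ["A", "A", "B", "C"],
        "City 2": ["x", "x", "y", "z"],
        "Employees": [10.0, 10.0, 5.0, np.nan],
    }
)
key_cols = ["Company Affiliation", "City 2", "Employees"]
is_dupe = toy.duplicated(subset=key_cols, keep=False)

# Filtering on the mask removes only the duplicate rows.
via_mask = toy[~is_dupe]

# The isin/dropna idiom removes the duplicates and any row containing NaN.
via_isin = toy[~toy.isin(toy[is_dupe])].dropna()

print(len(via_mask), len(via_isin))
```

Filtering on the mask keeps the intent explicit and leaves the handling of missing values as a separate, deliberate step.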

In [9]:
cali_no_dupes.to_csv('../data/analysis/finalized_warn_data.csv', index=False)