## How we got the California historical data - Big Local News

We manually downloaded the PDFs from the California WARN notices site: https://edd.ca.gov/Jobs_and_Training/Layoff_Services_WARN.htm

We only grabbed the historical data, which (as of 10/22/20) came to 6 PDF files with data ranging from 2014 to 2020.

After downloading the data, we used Tabula to convert each PDF to a CSV file.
Those exported CSV files are the ones we import below to create one large historical file.

See the `data/processed/` directory for all the data.

In [6]:
import pandas as pd
import numpy as np

### The code block below lists the California WARN data files from prior years that will be imported.

In [2]:
results = []

row_count = 0

filepaths = [
 '../data/processed/cali_2015_16.csv',
 '../data/processed/cali_2018_19.csv',
 '../data/processed/cali_2016_17.csv',
 '../data/processed/cali_2014_15.csv',
 '../data/processed/cali_2019_20.csv',
 '../data/processed/cali_2017_18.csv'
]

### Creating one dataset out of all past data

In the code block below:
- Each file is read in as a dataframe
- The variable `row_count` keeps track of the number of rows by adding the length of each dataframe
- Carriage returns in the column names are replaced with spaces
- The employee count column is renamed from `No. Of Employees` to `Employees` where needed, and its values are cast to the nullable integer type `Int64`
- Files that have no `County` column get one filled with `NaN` values
- After going through this process, each dataframe is appended to a list and the list is concatenated, which means all the data becomes one large dataframe
- On the newly created dataframe, spaces are removed from `Effective Date` and `Received Date`, and carriage returns and newlines are removed from `Company`, `City` and `Layoff/Closure`
- Lastly, we confirm the number of rows in our new dataframe equals the sum of rows of the smaller dataframes
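
To see these steps in isolation, here is a minimal sketch on two tiny made-up dataframes rather than the WARN files. The names `older`, `newer`, `frames` and `combined`, and all the values in them, are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for two Tabula exports (hypothetical values).
older = pd.DataFrame({
    "Company": ["Acme\rCorp"],
    "No. Of\rEmployees": [150],
})
newer = pd.DataFrame({
    "Company": ["Widget Co."],
    "Employees": [41],
    "County": ["Santa Clara"],
})

frames = []
for frame in (older, newer):
    frame = frame.copy()
    # Carriage returns in header names become spaces.
    frame.columns = frame.columns.str.replace("\r", " ", regex=False)
    # Use one name for the employee count column.
    frame = frame.rename(columns={"No. Of Employees": "Employees"})
    frame["Employees"] = frame["Employees"].astype("Int64")
    # Files without a County column get an empty one.
    if "County" not in frame.columns:
        frame["County"] = np.nan
    frames.append(frame)

combined = pd.concat(frames)
# Carriage returns inside cell values are dropped.
combined["Company"] = combined["Company"].str.replace("\r", "", regex=False)
```

Both frames end up with the same three columns, so the concatenated result has one consistent schema.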


In [3]:
for filepath in filepaths:
    data = pd.read_csv(filepath)
    
    row_count += len(data)
    
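    # replace carriage returns in column names with spaces
    # (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
    data.columns = data.columns.str.replace(r'\r', ' ', regex=True)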
    
    # rename employees column for consistency
    
    if "No. Of Employees" in data.columns:
        data.rename(columns={"No. Of Employees": "Employees"}, inplace = True)
        
    data['Employees'] = data['Employees'].astype('Int64')
    
    if "County" not in data.columns:
        data['County'] = np.nan
        
    results.append(data)
    
data = pd.concat(results)

data['Effective Date'] = data['Effective Date'].str.replace(' ', '')
data['Received Date'] = data['Received Date'].str.replace(' ', '')
# strip carriage returns and newlines left over from the PDF conversion
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
for column in ['Company', 'City', 'Layoff/Closure']:
    data[column] = data[column].str.replace(r'[\r\n]', '', regex=True)

# check to make sure we got all the data
row_count == len(data)

True
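
A matching row count would not catch a malformed date. As an optional extra check, which is our assumption and not a step in the original notebook, the cleaned date strings could be parsed with `pd.to_datetime`. The `MM/DD/YYYY` format is inferred from the rows this notebook displays, and the sample values and variable names here are made up.

```python
import pandas as pd

# Made-up values: one clean date, one with a stray space, one unparseable.
dates = pd.Series(["06/22/2015", "03/25/ 2016", "unknown"])

# Same space removal the notebook applies to the date columns.
cleaned = dates.str.replace(" ", "", regex=False)

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(cleaned, format="%m/%d/%Y", errors="coerce")

# Values that failed to parse and would need a manual look.
bad_rows = cleaned[parsed.isna()]
```

On the real data, the same pattern could be applied to `data['Effective Date']` and `data['Received Date']` to list any values that need manual review.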

In [4]:
data

Unnamed: 0,Notice Date,Effective Date,Received Date,Company,City,Employees,Layoff/Closure,County
0,06/22/2015,03/25/2016,07/01/2015,Maxim Integrated Product,San Jose,150,Closure Permanent,
1,06/30/2015,08/29/2015,07/01/2015,McGraw-Hill Education,Monterey,137,Layoff Unknown at this time,
2,06/30/2015,08/30/2015,07/01/2015,Long Beach Memorial Medical Center,Long Beach,90,Layoff Permanent,
3,07/01/2015,09/02/2015,07/01/2015,Leidos,El Segundo,72,Layoff Permanent,
4,07/01/2015,09/30/2016,07/01/2015,"Bosch Healthcare Systems, Inc.",Palo Alto,55,Closure Permanent,
...,...,...,...,...,...,...,...,...
640,06/27/2018,06/27/2018,06/29/2018,eBay Inc.,Brisbane,5,Layoff Permanent,San Mateo County
641,06/27/2018,06/27/2018,06/29/2018,eBay Inc.,San Francisco,41,Layoff Permanent,San Francisco
642,06/27/2018,06/27/2018,06/29/2018,eBay Inc.,San Jose,228,Layoff Permanent,Santa Clara
643,06/27/2018,08/31/2018,06/29/2018,California Medical Business Services,Arcadia,64,Closure Permanent,Los Angeles


In [5]:
data.to_csv('../data/processed/cali_warn_processed.csv', index=False)