# Processing U.S. Bureau of Labor Statistics Datasets

The `Local Area Unemployment Statistics (LAUS)` and `State and Metro Area Employment, Hours, & Earnings (SAE)` datasets were downloaded from the BLS website using the [SAE Databases One Screen option](https://www.bls.gov/sae/data/) and the [LAUS Databases One Screen option]().

The United States Office of Management and Budget (OMB) delineates metropolitan and micropolitan statistical areas according to published standards that are applied to Census Bureau data. The general concept of a metropolitan or micropolitan statistical area is that of a core area containing a substantial population nucleus, together with adjacent communities having a high degree of economic and social integration with that core. Current area delineations are based on OMB Bulletin No. 18-03 effective April 2018.

In [1]:
import json
import pandas as pd



In [2]:
with open('ei_intermediate_file_paths.json') as output_path_file:
    file_paths = json.load(output_path_file)

LAUS_INPUT_PATH = file_paths.get("raw_laus.csv")

SAE_INPUT_PATH = file_paths.get("raw_sae.csv")

BLS_STAGE1_OUTPUT_PATH = file_paths.get("stage1_bls_output.csv")
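Because `dict.get` returns `None` for a missing key, a typo in the JSON file would only surface later, inside `pd.read_excel` or `to_csv`. The following is a minimal sketch of a fail-fast lookup. The helper name `require_path` and the example path are hypothetical and not part of the original notebook.

```python
def require_path(paths: dict, key: str) -> str:
    """Return paths[key], raising a descriptive error if the key is absent or empty."""
    value = paths.get(key)
    if not value:
        raise KeyError(f"Missing or empty entry for {key!r} in the file path configuration")
    return value


# Illustrative configuration only; the real values live in ei_intermediate_file_paths.json.
example_paths = {"raw_laus.csv": "raw/laus.xlsx"}
laus_path = require_path(example_paths, "raw_laus.csv")
print(laus_path)
```

Swapping `file_paths.get(...)` for a helper like this would turn a silent `None` into an immediate, readable error.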

## Read in the Data

Note: `skiprows` is needed because the files downloaded from the BLS website contain descriptive rows above the column header. The original files are included in the `02_economic_impact_model/raw` directory.

In [3]:
laus_df = pd.read_excel(LAUS_INPUT_PATH,skiprows=10)

sae_df = pd.read_excel(SAE_INPUT_PATH,skiprows=12)
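To illustrate what `skiprows` does, here is a small sketch that uses an in-memory CSV instead of the actual BLS workbook. The preamble lines are invented for the example, and `pd.read_csv` stands in for `pd.read_excel` so that the snippet runs without an Excel engine. The `skiprows` argument behaves the same way in both readers.

```python
import io

import pandas as pd

# Synthetic stand-in for a BLS download: two preamble lines, then the real header.
raw_text = (
    "Series Id: EXAMPLE\n"
    "Area: Example area\n"
    "Year,Period,labor force\n"
    "2007,Jan,619054\n"
    "2007,Feb,614277\n"
)

# skiprows=2 discards the preamble so that the third line becomes the header row.
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(demo_df.columns.tolist())
```

If the BLS download layout changes, the hard-coded values (10 and 12) would need to be rechecked against the raw files.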

In [4]:
print("LAUS Dataset Columns")
print(laus_df.columns)

print("First 10 Rows of LAUS Dataset")
print(laus_df.head(10))

LAUS Dataset Columns
Index(['Year', 'Period', 'labor force', 'employment', 'unemployment',
       'unemployment rate'],
      dtype='object')
First 10 Rows of LAUS Dataset
   Year Period  labor force  employment  unemployment  unemployment rate
0  2007    Jan       619054      584073         34981                5.7
1  2007    Feb       614277      582072         32205                5.2
2  2007    Mar       616410      586497         29913                4.9
3  2007    Apr       614834      587479         27355                4.4
4  2007    May       617835      589760         28075                4.5
5  2007    Jun       629064      596036         33028                5.3
6  2007    Jul       631471      599001         32470                5.1
7  2007    Aug       622063      591348         30715                4.9
8  2007    Sep       621736      589636         32100                5.2
9  2007    Oct       623843      591794         32049                5.1


In [5]:
print("SAE Dataset Columns")
print(sae_df.columns)

print("First 10 Rows of SAE Dataset")
print(sae_df.head(10))

SAE Dataset Columns
Index(['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec'],
      dtype='object')
First 10 Rows of SAE Dataset
   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov  \
0  2007  34.0  34.4  35.1  34.8  34.6  35.2  34.9  34.7  35.2  34.9  34.3   
1  2008  33.4  34.4  34.7  34.5  34.5  35.6  35.0  35.3  35.1  34.9  34.9   
2  2009  34.5  34.9  34.4  34.0  33.8  34.3  34.3  34.8  34.1  34.1  34.4   
3  2010  34.9  34.7  35.2  35.3  35.3  35.5  35.4  35.7  35.4  35.5  35.3   
4  2011  34.7  34.5  35.7  35.5  35.7  35.7  35.5  35.8  34.6  35.4  34.3   
5  2012  34.4  34.1  34.3  34.9  34.4  34.6  34.9  35.1  35.5  35.3  35.5   
6  2013  34.2  35.2  35.0  34.3  34.7  35.2  34.6  34.8  35.4  35.0  35.0   
7  2014  34.8  35.4  35.9  35.4  35.2  35.6  35.4  35.5  35.1  35.1  35.8   
8  2015  34.6  34.8  34.9  35.2  35.0  35.2  35.4  36.3  35.2  35.5  35.7   
9  2016  35.2  34.8  34.9  35.3  36.0  35.7  35.9  3

## Process SAE Data

In [6]:
# Step 1: Convert SAE dataset to long format
sae_long = sae_df.melt(id_vars=['Year'], var_name='Month', value_name='curr_sae_hrs')
sae_long['Month'] = sae_long['Month'].str.strip()  # Strip whitespace from the month labels (the former column names)

In [7]:
print(sae_long.head(10))

   Year Month  curr_sae_hrs
0  2007   Jan          34.0
1  2008   Jan          33.4
2  2009   Jan          34.5
3  2010   Jan          34.9
4  2011   Jan          34.7
5  2012   Jan          34.4
6  2013   Jan          34.2
7  2014   Jan          34.8
8  2015   Jan          34.6
9  2016   Jan          35.2
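The wide-to-long reshaping can be checked on a two-year, two-month subset. This sketch reuses the January and February values for 2007 and 2008 shown in the SAE output above; the variable names `wide` and `long_df` are illustrative.

```python
import pandas as pd

# Two rows (years) and two month columns, taken from the SAE output above.
wide = pd.DataFrame({"Year": [2007, 2008], "Jan": [34.0, 33.4], "Feb": [34.4, 34.4]})

# melt stacks the month columns: one output row per (Year, Month) pair.
long_df = wide.melt(id_vars=["Year"], var_name="Month", value_name="curr_sae_hrs")
print(long_df)
```

`melt` emits all rows for the first month column before moving to the next, which is why the output above lists every January before any February.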


In [8]:
# Create a date column
sae_long['date'] = pd.to_datetime(sae_long['Year'].astype(str) + sae_long['Month'], format='%Y%b')
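The `%Y%b` format maps a string such as `2024Oct` to the first day of that month. The sketch below also shows one way to surface labels that do not parse. The `2024Annual` label is a hypothetical example of an unexpected value, not something observed in the source data. Note that `%b` depends on the locale, so this assumes English month abbreviations.

```python
import pandas as pd

labels = pd.Series(["2024Oct", "2024Sep", "2024Annual"])

# errors="coerce" turns labels that do not match the format into NaT instead of raising.
dates = pd.to_datetime(labels, format="%Y%b", errors="coerce")

# Collect the labels that failed to parse so they can be inspected.
unparsed = labels[dates.isna()]
print(dates)
print(unparsed.tolist())
```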


In [9]:
# Sort the DataFrame by the date column in descending order
sae_long = sae_long.sort_values(by='date', ascending=False)


In [10]:
print(sae_long.head(10))

     Year Month  curr_sae_hrs       date
215  2024   Dec           NaN 2024-12-01
197  2024   Nov           NaN 2024-11-01
179  2024   Oct          34.3 2024-10-01
161  2024   Sep          34.5 2024-09-01
143  2024   Aug          34.2 2024-08-01
125  2024   Jul          34.3 2024-07-01
107  2024   Jun          34.9 2024-06-01
89   2024   May          33.8 2024-05-01
71   2024   Apr          34.0 2024-04-01
53   2024   Mar          34.3 2024-03-01


In [11]:
# Step 1: Create a key for the current month-year and for the same month in the previous year
sae_long['key'] = sae_long['Year'].astype(str) + '-' + sae_long['Month']
sae_long['prev_key'] = (sae_long['Year'] - 1).astype(str) + '-' + sae_long['Month']

# Step 2: Rename columns in a copy of the dataframe for merging
sae_lookup = sae_long[['key', 'curr_sae_hrs']].rename(columns={
    'key': 'prev_key',
    'curr_sae_hrs': 'prev_yr_sae_hrs'
})

# Step 3: Merge the current dataframe with the lookup dataframe on 'prev_key'
sae_long = pd.merge(sae_long, sae_lookup, on='prev_key', how='left')

# Drop the temporary keys to clean up the dataframe
sae_long.drop(columns=['key', 'prev_key'], inplace=True)

# The result now includes the SAE hours for the same month of the previous year
print(sae_long.head(18))


    Year Month  curr_sae_hrs       date  prev_yr_sae_hrs
0   2024   Dec           NaN 2024-12-01             34.5
1   2024   Nov           NaN 2024-11-01             34.6
2   2024   Oct          34.3 2024-10-01             34.7
3   2024   Sep          34.5 2024-09-01             34.7
4   2024   Aug          34.2 2024-08-01             35.0
5   2024   Jul          34.3 2024-07-01             35.2
6   2024   Jun          34.9 2024-06-01             34.8
7   2024   May          33.8 2024-05-01             34.9
8   2024   Apr          34.0 2024-04-01             35.4
9   2024   Mar          34.3 2024-03-01             34.7
10  2024   Feb          34.2 2024-02-01             34.7
11  2024   Jan          33.5 2024-01-01             35.1
12  2023   Dec          34.5 2023-12-01             34.8
13  2023   Nov          34.6 2023-11-01             35.2
14  2023   Oct          34.7 2023-10-01             35.6
15  2023   Sep          34.7 2023-09-01             35.4
16  2023   Aug          35.0 20
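As an alternative to string keys, the same previous-year lookup could be expressed on the `date` column directly. This is a sketch of a different technique (a self-merge on dates shifted by `pd.DateOffset`), not the notebook's method. The three October values are taken from the output above.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-10-01", "2023-10-01", "2022-10-01"]),
    "curr_sae_hrs": [34.3, 34.7, 35.6],
})

# Shift each observation forward one year so that it lines up with the row
# for which it is the "previous year" value.
lookup = demo.assign(date=demo["date"] + pd.DateOffset(years=1)).rename(
    columns={"curr_sae_hrs": "prev_yr_sae_hrs"}
)

result = demo.merge(lookup, on="date", how="left")
print(result)
```

The earliest year has no predecessor, so its `prev_yr_sae_hrs` is `NaN`, as it would be with the key-based merge.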

## Process the LAUS Data

In [12]:
# Step 1: Create a key for the current month-year and for the same month in the previous year
laus_df['key'] = laus_df['Year'].astype(str) + '-' + laus_df['Period']
laus_df['prev_key'] = (laus_df['Year'] - 1).astype(str) + '-' + laus_df['Period']

# Step 2: Rename columns in a copy of the dataframe for merging
laus_lookup = laus_df[['key', 'labor force', 'unemployment rate']].rename(columns={
    'key': 'prev_key',
    'labor force': 'prev_yr_laus_labor_force',
    'unemployment rate': 'prev_yr_laus_unemployment_rate'
})

# Step 3: Merge the current dataframe with the lookup dataframe on 'prev_key'
laus_df = pd.merge(laus_df, laus_lookup, on='prev_key', how='left')

# Rename columns to match the desired output format
laus_df.rename(columns={
    'labor force': 'curr_laus_labor_force',
    'unemployment rate': 'curr_laus_unemployment_rate'
}, inplace=True)

# Step 4: Create a date column
laus_df['date'] = pd.to_datetime(laus_df['Year'].astype(str) + '-' + laus_df['Period'], format='%Y-%b')

# Step 5: Drop unnecessary columns
laus_df.drop(columns=['employment', 'unemployment', 'key', 'prev_key'], inplace=True)

# Step 6: Sort by date column in descending order
laus_df = laus_df.sort_values(by='date', ascending=False)

# Display the updated dataframe
print(laus_df.head(18))


     Year Period  curr_laus_labor_force  curr_laus_unemployment_rate  \
212  2024    Sep                 621528                          4.4   
211  2024    Aug                 622437                          4.4   
210  2024    Jul                 627202                          4.5   
209  2024    Jun                 622404                          4.6   
208  2024    May                 620054                          3.5   
207  2024    Apr                 621615                          3.2   
206  2024    Mar                 623679                          3.7   
205  2024    Feb                 617640                          3.5   
204  2024    Jan                 614328                          4.3   
203  2023    Dec                 616903                          3.7   
202  2023    Nov                 617394                          3.7   
201  2023    Oct                 618328                          4.1   
200  2023    Sep                 618474                         
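A left merge against a lookup table silently duplicates rows if the lookup contains repeated keys. As a hedged sketch, the `validate` argument of `pd.merge` can guard against this. The duplicated lookup below is constructed for illustration and does not reflect the actual LAUS file.

```python
import pandas as pd

left = pd.DataFrame({"prev_key": ["2023-Sep", "2023-Aug"]})

# A deliberately faulty lookup with the same key listed twice.
dup_lookup = pd.DataFrame({
    "prev_key": ["2023-Sep", "2023-Sep"],
    "prev_yr_laus_labor_force": [618474, 618474],
})

try:
    # many_to_one requires the keys in the right-hand table to be unique.
    pd.merge(left, dup_lookup, on="prev_key", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(duplicate_detected)
```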

## Join the SAE and LAUS Data

In [13]:
# Step 1: Merge the two DataFrames on the 'date' column
merged_df = pd.merge(
    sae_long,
    laus_df,
    on='date',
    how='outer'  # 'outer' for all dates
)

# Step 2: Select and reorder the columns as desired
merged_df = merged_df[[
    'date', 
    'curr_sae_hrs', 
    'curr_laus_labor_force', 
    'curr_laus_unemployment_rate', 
    'prev_yr_sae_hrs', 
    'prev_yr_laus_labor_force', 
    'prev_yr_laus_unemployment_rate'
]]

merged_df = merged_df.sort_values(by='date', ascending=False)

# Display the resulting merged DataFrame
print(merged_df.head(18))


          date  curr_sae_hrs  curr_laus_labor_force  \
215 2024-12-01           NaN                    NaN   
214 2024-11-01           NaN                    NaN   
213 2024-10-01          34.3                    NaN   
212 2024-09-01          34.5               621528.0   
211 2024-08-01          34.2               622437.0   
210 2024-07-01          34.3               627202.0   
209 2024-06-01          34.9               622404.0   
208 2024-05-01          33.8               620054.0   
207 2024-04-01          34.0               621615.0   
206 2024-03-01          34.3               623679.0   
205 2024-02-01          34.2               617640.0   
204 2024-01-01          33.5               614328.0   
203 2023-12-01          34.5               616903.0   
202 2023-11-01          34.6               617394.0   
201 2023-10-01          34.7               618328.0   
200 2023-09-01          34.7               618474.0   
199 2023-08-01          35.0               621427.0   
198 2023-0
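The outer merge keeps dates that appear in only one dataset, which is why October 2024 has SAE hours but no LAUS labor force. One way to list such dates is the `indicator` option of `pd.merge`, sketched here on a two-row subset of the values above. The frame names are illustrative.

```python
import pandas as pd

sae_demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-10-01", "2024-09-01"]),
    "curr_sae_hrs": [34.3, 34.5],
})
laus_demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-09-01"]),
    "curr_laus_labor_force": [621528],
})

# indicator=True adds a _merge column: left_only, right_only, or both.
check = pd.merge(sae_demo, laus_demo, on="date", how="outer", indicator=True)
sae_only = check.loc[check["_merge"] == "left_only", "date"]
print(sae_only.tolist())
```

The unmatched rows also explain why `curr_laus_labor_force` is printed as a float after the merge: the integer column is converted to float to hold `NaN`.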

In [14]:
# Step 3: Calculate the year-over-year percentage difference for each column
merged_df['pct_diff_sae_hrs'] = ((merged_df['curr_sae_hrs'] - merged_df['prev_yr_sae_hrs']) / merged_df['prev_yr_sae_hrs']) * 100
merged_df['pct_diff_laus_labor_force'] = ((merged_df['curr_laus_labor_force'] - merged_df['prev_yr_laus_labor_force']) / merged_df['prev_yr_laus_labor_force']) * 100
merged_df['pct_diff_laus_unemployment_rate'] = ((merged_df['curr_laus_unemployment_rate'] - merged_df['prev_yr_laus_unemployment_rate']) / merged_df['prev_yr_laus_unemployment_rate']) * 100

# Display the DataFrame with new percentage difference columns
print(merged_df.head(18))

          date  curr_sae_hrs  curr_laus_labor_force  \
215 2024-12-01           NaN                    NaN   
214 2024-11-01           NaN                    NaN   
213 2024-10-01          34.3                    NaN   
212 2024-09-01          34.5               621528.0   
211 2024-08-01          34.2               622437.0   
210 2024-07-01          34.3               627202.0   
209 2024-06-01          34.9               622404.0   
208 2024-05-01          33.8               620054.0   
207 2024-04-01          34.0               621615.0   
206 2024-03-01          34.3               623679.0   
205 2024-02-01          34.2               617640.0   
204 2024-01-01          33.5               614328.0   
203 2023-12-01          34.5               616903.0   
202 2023-11-01          34.6               617394.0   
201 2023-10-01          34.7               618328.0   
200 2023-09-01          34.7               618474.0   
199 2023-08-01          35.0               621427.0   
198 2023-0
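The three calculations share one formula, `(current - previous) / previous * 100`. For October 2024 SAE hours this gives (34.3 - 34.7) / 34.7 * 100, or roughly -1.15. The sketch below factors the formula into a hypothetical helper, `pct_diff`, and applies it to two rows taken from the output above.

```python
import pandas as pd


def pct_diff(curr: pd.Series, prev: pd.Series) -> pd.Series:
    """Percentage difference of curr relative to prev; NaN wherever either input is missing."""
    return (curr - prev) / prev * 100


demo = pd.DataFrame({
    "curr_sae_hrs": [34.3, None],      # Oct 2024, Dec 2024 (not yet published)
    "prev_yr_sae_hrs": [34.7, 34.5],
})
demo["pct_diff_sae_hrs"] = pct_diff(demo["curr_sae_hrs"], demo["prev_yr_sae_hrs"])
print(demo)
```

For the unemployment rate, this formula yields the relative change in the rate, not a percentage-point difference.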

In [15]:
# Write the DataFrame to a CSV file
merged_df.to_csv(BLS_STAGE1_OUTPUT_PATH, index=False)

print(f"Merged DataFrame has been saved to {BLS_STAGE1_OUTPUT_PATH}")

Merged DataFrame has been saved to intermediate/stage1-output/stage1_bls_output.csv
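CSV files do not store column types, so a later stage that reads this file would get `date` back as plain strings unless it asks for parsing. A minimal round-trip sketch follows, using an in-memory buffer in place of the real output path.

```python
import io

import pandas as pd

out = pd.DataFrame({"date": pd.to_datetime(["2024-10-01"]), "curr_sae_hrs": [34.3]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the file back.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```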
