In [1]:
import pandas as pd
import numpy as np


## Read in May 2023 OEWS Data

BLS OEWS publishes employment estimates by industry and SOC code at https://www.bls.gov/oes/tables.htm. Use the 'All data' file for employment estimates covering the U.S. economy.

In [2]:
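filepath = r'C:\Users\maurice\OneDrive\Government Data'
# Read NAICS as text so codes such as '000000' and '31-33' compare correctly as strings
df = pd.read_excel(filepath + '\\all_data_M_2023.xlsx', dtype={'NAICS': str})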

This analysis keeps only the most detailed SOC codes (`O_GROUP == 'detailed'`) within individual industries, so the cross-industry totals (NAICS 000000 and 000001) are dropped.

In [3]:
filtered = df[
    (df['NAICS'] != '000000')
    & (df['NAICS'] != '000001')
    & (df['O_GROUP'] == 'detailed')
]
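
As a rough illustration of what this filter keeps, the sketch below applies the same conditions to a small made-up table. The `toy` frame and its values are hypothetical, not OEWS data. It uses `isin` for the two excluded NAICS codes, which is equivalent to the two `!=` comparisons in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the OEWS table; all values are invented for illustration.
toy = pd.DataFrame({
    'NAICS':    ['000000', '000001', '11', '11', '31-33'],
    'OCC_CODE': ['11-1011', '11-1011', '11-0000', '11-1011', '11-1021'],
    'O_GROUP':  ['detailed', 'detailed', 'major', 'detailed', 'detailed'],
    'TOT_EMP':  [100, 90, 50, 10, 20],
})

excluded_naics = ['000000', '000001']

# Keep rows that are outside the excluded codes and are at the 'detailed' occupation level.
toy_filtered = toy[~toy['NAICS'].isin(excluded_naics) & (toy['O_GROUP'] == 'detailed')]
print(toy_filtered)
```

Only the last two rows survive: the first two are excluded by NAICS code, and the third is a 'major' occupation group rather than a detailed one.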

In [4]:
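keep_columns = ['NAICS', 'OCC_CODE', 'TOT_EMP']
# Unique industry/occupation pairs present in the filtered data
unique_soc_naics = filtered[['NAICS', 'OCC_CODE']].drop_duplicates()
# Inner merge on the pair keeps every filtered row, resets the index, and trims to the columns of interest
full_list = pd.merge(filtered, unique_soc_naics, on=['NAICS', 'OCC_CODE'], how='inner')[keep_columns]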

## Non-Standard NAICS

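Some OEWS industry codes are non-standard NAICS codes that mix digits and letters (for example, 5320A1). The following cell separates those rows from the rows with standard codes.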

In [5]:
key = ['NAICS', 'OCC_CODE']
# A NAICS code is non-standard if it is alphanumeric and contains both letters and digits
is_non_standard = full_list['NAICS'].astype(str).apply(
    lambda s: s.isalnum() and any(c.isalpha() for c in s) and any(c.isdigit() for c in s)
)
non_standard = full_list[is_non_standard]
standard_list = full_list[~is_non_standard]
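
To make the test for a non-standard code concrete, here is a minimal sketch of the same check applied to a handful of sample strings. The helper name `is_mixed_alnum` and the sample codes are assumptions for illustration, and `4A0000` is invented.

```python
def is_mixed_alnum(code):
    """Return True if the code is alphanumeric and contains both a letter and a digit."""
    s = str(code)
    return (
        s.isalnum()
        and any(c.isalpha() for c in s)
        and any(c.isdigit() for c in s)
    )

# Sample codes: plain numeric, hyphenated range, and mixed letter/digit codes.
samples = ['11', '31-33', '5320A1', '999000', '4A0000']
flags = {code: is_mixed_alnum(code) for code in samples}
print(flags)
```

Hyphenated codes such as `31-33` fail the `isalnum()` check, so they stay in the standard list and are handled separately.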

## Expand NAICS Code (if hyphenated)

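Some NAICS codes are hyphenated ranges (for example, 31-33). Expand each range into one row per code, so that 31-33 becomes three rows with NAICS 31, 32, and 33. Each new row copies the other values of the original row, including TOT_EMP.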

In [6]:
def expand_naics(df):
   # Create copy of dataframe
   expanded_df = df.copy()
   
   # Find rows with hyphens
   mask = expanded_df['NAICS'].str.contains('-', na=False)
   hyphen_rows = expanded_df[mask].copy()
   
   # Drop original hyphen rows
   expanded_df = expanded_df[~mask]
   
   # Expand hyphenated rows
   new_rows = []
   for idx, row in hyphen_rows.iterrows():
       start, end = map(int, row['NAICS'].split('-'))
       for naics in range(start, end + 1):
           new_row = row.copy()
           new_row['NAICS'] = str(naics)
           new_rows.append(new_row)
   
   # Combine original and expanded rows
   expanded_df = pd.concat([expanded_df, pd.DataFrame(new_rows)], ignore_index=True)
   
   return expanded_df

standard_list = expand_naics(standard_list)
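
The same expansion can also be sketched without an explicit row loop by using `DataFrame.explode`. This is a hypothetical alternative, not the function used in this notebook: `expand_naics_explode` and the toy data are made up for illustration, and unlike `expand_naics` it leaves expanded rows in their original position rather than appending them at the end.

```python
import pandas as pd

def expand_naics_explode(df):
    """Expand hyphenated NAICS ranges into one row per code using explode."""
    out = df.copy()

    def to_codes(code):
        # Leave missing values and non-hyphenated codes unchanged.
        if not isinstance(code, str) or '-' not in code:
            return [code]
        start, end = map(int, code.split('-'))
        return [str(n) for n in range(start, end + 1)]

    # Turn each NAICS value into a list of codes, then give each code its own row.
    out['NAICS'] = out['NAICS'].map(to_codes)
    return out.explode('NAICS', ignore_index=True)

# Hypothetical data, not OEWS values.
toy = pd.DataFrame({
    'NAICS':    ['11', '31-33', '44-45'],
    'OCC_CODE': ['11-1011', '11-1021', '11-2021'],
    'TOT_EMP':  [10, 20, 30],
})
expanded = expand_naics_explode(toy)
print(expanded)
```

Because TOT_EMP is copied to every expanded row, summing TOT_EMP across the expanded codes counts a range's employment once per code, so totals should be taken with that in mind.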

In [7]:
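# Government NAICS codes (99 and the 999xxx series) are grouped with the non-standard codes
government_naics = ['999000', '999001', '999100', '999101', '999200', '999201', '999300', '999301', '99']
government = standard_list[standard_list['NAICS'].isin(government_naics)]
# Remove government rows from the standard list and append them to the non-standard list
standard_list = standard_list[~standard_list['NAICS'].isin(government_naics)]
non_standard = pd.concat([government, non_standard])
non_standard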

,NAICS,OCC_CODE,TOT_EMP
6545,99,11-1011,25550
6546,99,11-1021,135720
6547,99,11-1031,32460
6548,99,11-2011,100
6549,99,11-2021,1820
...,...,...,...
82149,5320A1,53-7061,1710
82150,5320A1,53-7062,28440
82151,5320A1,53-7065,3570
82152,5320A1,53-7072,560
