## Option 1: Occupations with the Largest Projected Increase in Jobs by Share of Women in the Occupation

https://www.dol.gov/agencies/wb/data/high-demand-occupations

1. Is 'Employment, 2022' dependent on occupation (can pick 3-4)? (Chi-squared test)
2. Are there more women ('Employment, 2022') working as Nurse practitioners compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
3. Are there more women ('Employment, 2022') working as Software developers compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
4. Are there more women projected ('Employment, 2032') to work as Nurse practitioners compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
5. Are there more women projected ('Employment, 2032') to work as Software developers compared to data scientists? (proportion hypothesis testing, t-test and/or z-test)
6. Is median annual wage dependent on occupation? (Chi-squared test)
7. Compare median annual wage between two occupations. (proportion hypothesis testing)
8. Label coding careers vs. non-coding careers. Do women with coding careers have a higher median wage compared to non-coding careers? (means hypothesis testing, t-test or z-test)
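For the chi-squared questions, a test of independence could be sketched as below. The counts are made-up placeholders rather than values from the DOL file, and the (women, men) column split is an assumption about how the table would be built.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (illustrative counts only, NOT from the dataset):
# rows = occupations, columns = (women, men) employment in persons
observed = np.array([
    [230_000, 30_000],     # occupation A
    [60_000, 110_000],     # occupation B
    [320_000, 1_270_000],  # occupation C
])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the table of expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi2:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
```

The statistic scales with the size of the counts, so figures stored in thousands should probably be converted to persons before testing.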

Import Libraries 

In [42]:
import pandas as pd
import numpy as np
#pip install scipy
from scipy.stats import ttest_rel


Data Exploration

In [43]:


women = pd.read_csv('OccupationswithMostProjectedGrowth.csv')
print(women.head())
print(women['Measure Names'].unique())
print(women['Occupation'].unique())

# 'Employment change, 2022-32' --> frequency (in thousands)
# 'Employment, 2022' --> frequency (in thousands)
# 'Employment, 2032' --> frequency (in thousands)
# 'Median annual wage 2022' --> $
# 'Percent employment change, 2022-32' --> percentage
# 'Percent women 2022' --> proportion between 0 and 1 (e.g. 0.588 = 58.8%)

                        Measure Names                  Occupation  \
0          Employment change, 2022-32    Accountants and auditors   
1                    Employment, 2022    Accountants and auditors   
2                    Employment, 2032    Accountants and auditors   
3             Median annual wage 2022    Accountants and auditors   
4  Percent employment change, 2022-32    Accountants and auditors   

  Measure Values  
0           67.4  
1        1,538.4  
2        1,605.8  
3         78,000  
4            4.4  
['Employment change, 2022-32' 'Employment, 2022' 'Employment, 2032'
 'Median annual wage 2022' 'Percent employment change, 2022-32'
 'Percent women 2022']
['  Accountants and auditors' '  Animal caretakers'
 '  Computer and information systems managers'
 '  Computer systems analysts' '  Construction laborers'
 '  Cooks, restaurant' '  Data scientists' '  Financial managers'
 '  First-line supervisors of food preparation and serving workers'
 '  General and operations

In [44]:
print(women.head(10))

                        Measure Names                  Occupation  \
0          Employment change, 2022-32    Accountants and auditors   
1                    Employment, 2022    Accountants and auditors   
2                    Employment, 2032    Accountants and auditors   
3             Median annual wage 2022    Accountants and auditors   
4  Percent employment change, 2022-32    Accountants and auditors   
5                  Percent women 2022    Accountants and auditors   
6          Employment change, 2022-32           Animal caretakers   
7                    Employment, 2022           Animal caretakers   
8                    Employment, 2032           Animal caretakers   
9             Median annual wage 2022           Animal caretakers   

  Measure Values  
0           67.4  
1        1,538.4  
2        1,605.8  
3         78,000  
4            4.4  
5    0.587865715  
6           52.5  
7            339  
8          391.5  
9         29,530  


In [45]:
df_pivoted = women.pivot(index='Occupation', columns='Measure Names', values='Measure Values').reset_index()
df_pivoted.columns.name = None
df_pivoted.columns = [col if col != 'Measure Values' else 'Value' for col in df_pivoted.columns]


# Save the wide table; it is read back in as 'output_data.csv' further down
df_pivoted.to_csv('output_data.csv', index=False)

DATA CLEANING 

In [46]:
df_pivoted.columns

Index(['Occupation', 'Employment change, 2022-32', 'Employment, 2022',
       'Employment, 2032', 'Median annual wage 2022',
       'Percent employment change, 2022-32', 'Percent women 2022'],
      dtype='object')

In [47]:
# rename all columns

col= ("occupation","employment_change", "employment2022", "employment2032", "median_annual_wage_2022","employment_change_percet","percent_women")
df_pivoted.columns = col
df=df_pivoted.copy()


In [48]:
# clean and remove all the commas in the dataset

numeric_columns = df.columns.difference(["occupation"])
df[numeric_columns] = df[numeric_columns].replace(to_replace=r',', value='', regex=True)
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')
print(df)

                                           occupation  employment_change  \
0                            Accountants and auditors               67.4   
1                                   Animal caretakers               52.5   
2           Computer and information systems managers               86.0   
3                           Computer systems analysts               51.1   
4                               Construction laborers               61.9   
5                                   Cooks, restaurant              277.6   
6                                     Data scientists               59.4   
7                                  Financial managers              126.6   
8     First-line supervisors of food preparation a...               60.0   
9                     General and operations managers              147.3   
10            Heavy and tractor-trailer truck drivers               89.3   
11                Home health and personal care aides              804.6   
12          

In [49]:
print(df.head(10))

                                          occupation  employment_change  \
0                           Accountants and auditors               67.4   
1                                  Animal caretakers               52.5   
2          Computer and information systems managers               86.0   
3                          Computer systems analysts               51.1   
4                              Construction laborers               61.9   
5                                  Cooks, restaurant              277.6   
6                                    Data scientists               59.4   
7                                 Financial managers              126.6   
8    First-line supervisors of food preparation a...               60.0   
9                    General and operations managers              147.3   

   employment2022  employment2032  median_annual_wage_2022  \
0          1538.4          1605.8                    78000   
1           339.0           391.5                 

In [50]:
df.isna().sum()


occupation                  0
employment_change           0
employment2022              0
employment2032              0
median_annual_wage_2022     0
employment_change_percet    0
percent_women               4
dtype: int64

In [51]:
df1=df.copy()

In [52]:
df.dtypes

occupation                   object
employment_change           float64
employment2022              float64
employment2032              float64
median_annual_wage_2022       int64
employment_change_percet    float64
percent_women               float64
dtype: object

In [53]:
df["percent_women"] = df["percent_women"].fillna(df["percent_women"].mean())

print(df)

                                           occupation  employment_change  \
0                            Accountants and auditors               67.4   
1                                   Animal caretakers               52.5   
2           Computer and information systems managers               86.0   
3                           Computer systems analysts               51.1   
4                               Construction laborers               61.9   
5                                   Cooks, restaurant              277.6   
6                                     Data scientists               59.4   
7                                  Financial managers              126.6   
8     First-line supervisors of food preparation a...               60.0   
9                     General and operations managers              147.3   
10            Heavy and tractor-trailer truck drivers               89.3   
11                Home health and personal care aides              804.6   
12          

In [54]:
df.isna().sum()


occupation                  0
employment_change           0
employment2022              0
employment2032              0
median_annual_wage_2022     0
employment_change_percet    0
percent_women               0
dtype: int64

In [55]:
# has missing value imputation affected the variance of the data?

variance_before = df1["percent_women"].var()

variance_after_mean = df["percent_women"].var()

print(variance_before)
print(variance_after_mean)

# the variance drops by about 13% (0.0692 -> 0.0600)

0.06919941687358531
0.05997282795710727
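The size of the drop is predictable: mean-imputed values add nothing to the sum of squared deviations, so only the divisor of the sample variance changes. The sketch below illustrates this on synthetic data, assuming 27 observed and 4 missing values (the 4 matches the `isna()` count above; the total of 31 rows is an assumption).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for percent_women: 27 observed values plus 4 missing ones
rng = np.random.default_rng(0)
observed = pd.Series(rng.uniform(0, 1, size=27))
with_missing = pd.concat([observed, pd.Series([np.nan] * 4)], ignore_index=True)

var_before = with_missing.var()  # NaNs are skipped, ddof=1
imputed = with_missing.fillna(with_missing.mean())
var_after = imputed.var()

# Only the divisor changes:
# var_after / var_before = (n_observed - 1) / (n_total - 1)
ratio = var_after / var_before
print(ratio, (27 - 1) / (31 - 1))
```

Under that assumption the expected ratio is 26/30, about 0.867, which agrees with the ratio of the two variances printed above.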


**TESTING**

In [56]:
# mean, variance and skewness of the median annual wage

# The column is already numeric after cleaning; cast it to float in place
df["median_annual_wage_2022"] = df["median_annual_wage_2022"].astype(float)

# Calculate mean, variance, and skewness
# Note: np.var defaults to ddof=0 (population variance); Series.var() uses ddof=1
mean_wage = np.mean(df["median_annual_wage_2022"])
variance_wage = np.var(df["median_annual_wage_2022"])
skewness_wage = df["median_annual_wage_2022"].skew()

print("Mean:", mean_wage)
print("Variance:", variance_wage)
print("Skewness:", skewness_wage)


Mean: 74149.35483870968
Variance: 1449347631.8418317
Skewness: 0.613807060668165
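The variance printed here comes from `np.var`, which divides by n, whereas the pandas `.var()` used earlier for `percent_women` divides by n - 1. A small sketch of the difference, using only the five wage values visible in the printed rows of this notebook:

```python
import numpy as np
import pandas as pd

# Five median wages taken from the first rows printed in this notebook
wages = pd.Series([78000, 29530, 164070, 102240, 40750], dtype=float)
n = len(wages)

pop_var = np.var(wages.to_numpy())  # ddof=0: divide by n
sample_var = wages.var()            # ddof=1: divide by n - 1

# The two estimates differ by the factor n / (n - 1)
print(pop_var, sample_var, sample_var / pop_var)
```

Passing `ddof=1` to `np.var` would make the two agree.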


**Q1**. Is there a significant difference between employment in 2022 and projected employment in 2032?

H0: There is no significant difference between employment in 2022 and 2032  
H1: There is a significant difference

In [57]:
t_stat, p_value = ttest_rel(df["employment2022"], df["employment2032"])

print("t-statistic:", t_stat)
print("p-value:", p_value)


t-statistic: -1.8799174643550587
p-value: 0.06986200323491434


There is no statistically significant difference between employment in 2022 and 2032 at the 0.05 level, so we fail to reject the null hypothesis.

t-statistic: The t-statistic measures the size and direction of the difference between the means of the paired samples. In this case, the negative value indicates that, on average, the values in "employment2022" are lower than the values in "employment2032."

p-value: The p-value is the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true. In this case, the p-value is approximately 0.0699.

Decision: If you chose a significance level (alpha) of 0.05, the p-value (0.0699) is greater than alpha. Therefore, at the 0.05 significance level, you would fail to reject the null hypothesis.

Conclusion: There is not enough evidence to conclude that there is a significant difference between the means of "employment2022" and "employment2032" based on this paired t-test.
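To make the paired t-test less of a black box, the statistic can be rebuilt by hand from the per-occupation differences. The sketch below uses only the first five 2022/2032 pairs printed in this notebook, so its numbers will not match the full-data result above.

```python
import numpy as np
from scipy.stats import t, ttest_rel

# First five employment pairs printed in this notebook (thousands)
emp_2022 = np.array([1538.4, 339.0, 557.4, 531.4, 1418.6])
emp_2032 = np.array([1605.8, 391.5, 643.3, 582.6, 1480.5])

# Paired test = one-sample t-test on the differences
diff = emp_2022 - emp_2032
n = diff.size
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# Cross-check against scipy
t_scipy, p_scipy = ttest_rel(emp_2022, emp_2032)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Because the standard error depends on the spread of the differences, a few occupations with very large changes can widen it considerably.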

x / N = percent, where x is the number of women employed and N is total employment in the occupation.
x and percent are given.
To find N:
N = x / percent
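The rearrangement can be checked numerically, and once x and N are known for two occupations, the proportion questions reduce to a two-proportion z-test. The sketch below is a minimal version under stated assumptions: `total_from_share` and `two_proportion_ztest` are hypothetical helper names, and the counts passed to the z-test are illustrative, not taken from the dataset.

```python
import numpy as np
from scipy.stats import norm

def total_from_share(x, share):
    """Solve x / N = share for N."""
    return x / share

# Row 0 of the printed data: x = 1538.4 (thousands), share = 0.587865715
n_total = total_from_share(1538.4, 0.587865715)
print("N:", n_total)

def two_proportion_ztest(x1, n1, x2, n2):
    """One-sided z-test of H0: p1 = p2 against H1: p1 > p2 (pooled proportion)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, norm.sf(z)  # upper-tail p-value

# Hypothetical counts for illustration only (NOT from the dataset)
z, p = two_proportion_ztest(x1=880, n1=1000, x2=350, n2=1000)
print("z:", z, "p-value:", p)
```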

In [60]:
output_continued = pd.read_csv('output_data.csv')

output_continued['Employment, 2022'] = output_continued['Employment, 2022'].str.replace(',', '').astype(float)
output_continued['Employment, 2032'] = output_continued['Employment, 2032'].str.replace(',', '').astype(float)
output_continued['Median annual wage 2022'] = output_continued['Median annual wage 2022'].str.replace(',', '').astype(float)

output_continued['Employment change, 2022-32'] = output_continued['Employment change, 2022-32'].str.replace(',', '').astype(float)
output_continued['Percent employment change, 2022-32'] = output_continued['Percent employment change, 2022-32'].astype(float)
output_continued['Percent women 2022'] = output_continued['Percent women 2022'].astype(float)

# Occupation names carry leading spaces in the raw file; strip them so lookups match
output_continued['Occupation'] = output_continued['Occupation'].str.strip()
# N = x / percent (rows with a missing share give NaN)
output_continued['Total Employment, 2022'] = output_continued['Employment, 2022'] / output_continued['Percent women 2022']

output_continued['Percent non-women 2022'] = 1 - output_continued['Percent women 2022']
output_continued['Men employment, 2022'] = output_continued['Total Employment, 2022'] - output_continued['Employment, 2022']

update_colnames = {'Employment change, 2022-32': 'WomenEmploymentChange2022-23', 
                   'Employment, 2022': 'WomenEmployment2022',
                   'Employment, 2032': 'WomenEmployment2032',
                   'Median annual wage 2022': 'WomenMedianAnnual2022', 'Percent employment change, 2022-32': 'WomenPercentEmploymentChange2022-32',
                   'Percent women 2022': 'PercentWomen2022', 'Total Employment, 2022': 'WomenTotalEmployment2022',
                   'Percent non-women 2022': 'PercentNon-women2022', 'Men employment, 2022': 'MenEmployment2022'}
output_continued.rename(columns=update_colnames, inplace=True)

print(output_continued.head())

                                  Occupation  WomenEmploymentChange2022-23  \
0                   Accountants and auditors                          67.4   
1                          Animal caretakers                          52.5   
2  Computer and information systems managers                          86.0   
3                  Computer systems analysts                          51.1   
4                      Construction laborers                          61.9   

   WomenEmployment2022  WomenEmployment2032  WomenMedianAnnual2022  \
0               1538.4               1605.8                78000.0   
1                339.0                391.5                29530.0   
2                557.4                643.3               164070.0   
3                531.4                582.6               102240.0   
4               1418.6               1480.5                40750.0   

   WomenPercentEmploymentChange2022-32  PercentWomen2022  \
0                                  4.4          0.

STEM vs. non-STEM  
- compare wage  
- compare employment  

Computer vs. non-Computer  
- compare wage  
- compare employment  

Industry SOC major groups classification  
- https://www.bls.gov/soc/2018/major_groups.htm  

Employment STEM classification  
- https://www.bls.gov/soc/Attachment_C_STEM_2018.pdf  

Compare top three industries/SOC majors  

https://www.bls.gov/spotlight/2017/science-technology-engineering-and-mathematics-stem-occupations-past-present-and-future/home.htm
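One way the wage comparison between two groups could be set up is a one-sided Welch t-test on the group means. This is a sketch only: the `toy` frame, its column names, and its values are placeholders, and `alternative="greater"` assumes SciPy 1.6 or later.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy frame; labels and wages are illustrative placeholders
toy = pd.DataFrame({
    "STEM": ["yes", "yes", "yes", "no", "no", "no", "no"],
    "wage": [164070, 102240, 103500, 78000, 29530, 40750, 34110],
})

stem = toy.loc[toy["STEM"] == "yes", "wage"]
non_stem = toy.loc[toy["STEM"] == "no", "wage"]

# Welch's t-test (no equal-variance assumption);
# alternative="greater" tests H1: mean(stem) > mean(non_stem)
t_stat, p_value = ttest_ind(stem, non_stem, equal_var=False, alternative="greater")

print(toy.groupby("STEM")["wage"].mean())
print("t-statistic:", t_stat)
print("p-value:", p_value)
```

The same pattern would apply to the Computer vs. non-Computer split or to employment instead of wage.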


In [None]:

def get_stem(occupation):
    if occupation in [  'Computer and information systems managers', 'Software developers', 'Nurse practitioners', 
                      'Information security analysts', 'Medical and health services managers', 'Data scientists', 
                      'Computer systems analysts', 'Registered nurses']:
        return 'yes'
    else:
        return 'no'

output_continued['STEM'] = output_continued['Occupation'].apply(get_stem)

# Define a custom function to determine category
def get_SOC(occupation):
    if occupation in ['Project management specialists', 'Management analysts', 
                      'Market research analysts and marketing specialists', 'Human resources specialists']:
        return 'Business and Financial Operations Occupations'
    elif occupation == 'Substance abuse, behavioral disorder, and mental health counselors':
        return 'Community and Social Service Occupations'
    elif occupation in ['Software developers', 'Information security analysts', 'Data scientists', 
                        'Computer systems analysts']:
        return 'Computer and Mathematical Occupations'
    elif occupation == "Construction laborers":
        return 'Construction and Extraction Occupations'
    elif occupation == "Accountants and auditors":
        return 'Financial Specialists'
    elif occupation in ['First-line supervisors of food preparation and serving workers', 'Cooks, restaurant']:
        return 'Food Preparation and Serving Related Occupations'
    elif occupation in ['Nurse practitioners','Registered nurses']:
        return 'Healthcare Practitioners and Technical Occupations'
    elif occupation in ['Medical assistants', 'Nursing assistants', 'Home health and personal care aides', 'Animal caretakers']:
        return 'Healthcare Support Occupations'
    elif occupation in ['Industrial machinery mechanics', 'Maintenance and repair workers, general']:
        return 'Installation, Maintenance, and Repair Occupations'
    elif occupation == "Lawyers":
        return 'Legal Occupations'
    elif occupation in ['Computer and information systems managers', 'Financial managers', 
                        'Medical and health services managers', 'General and operations managers']:
        return 'Management Occupations'
    elif occupation in ['Heavy and tractor-trailer truck drivers', 'Light truck drivers',
                        'Laborers and freight, stock, and material movers, hand', 'Stockers and order fillers']:
        return 'Transportation and Material Moving Occupations'
    else:
        return 'Other'

output_continued['Industry_SOCmajor'] = output_continued['Occupation'].apply(get_SOC)

print(output_continued)
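A dictionary lookup is a possible alternative to the long `if`/`elif` chain, since each occupation appears exactly once as a key. The sketch below uses a hypothetical, abbreviated `soc_lookup` table and a toy frame; it is not the full mapping.

```python
import pandas as pd

# Hypothetical, abbreviated lookup; the remaining occupations would be added the same way
soc_lookup = {
    "Software developers": "Computer and Mathematical Occupations",
    "Data scientists": "Computer and Mathematical Occupations",
    "Lawyers": "Legal Occupations",
    "Construction laborers": "Construction and Extraction Occupations",
}

toy = pd.DataFrame({"Occupation": ["Data scientists", "Lawyers", "Cooks, restaurant"]})

# Occupations missing from the lookup map to NaN, which is then labelled 'Other'
toy["Industry_SOCmajor"] = toy["Occupation"].map(soc_lookup).fillna("Other")
print(toy)
```

Counting the rows labelled 'Other' afterwards is a quick way to spot occupation names that failed to match.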


In [None]:
output_continued.to_csv('output_data_v2.csv', index=False)