1. Descriptive Statistics
Task:
Summarize the dataset for Overall Engagement and Job Satisfaction.
Identify trends by analyzing demographic groups (age, gender, ethnicity, department) and other areas such as Work-Life Balance, Compensation Satisfaction, etc.

In [1]:
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv("employee_experience_survey_data.csv")

df.head()

,Name,Age Bracket,Gender,Ethnicity,Job Title,Department,Date Survey Completed,Job Satisfaction,Work-Life Balance,Management Support,Team Collaboration,Workload Fairness,Career Development Opportunities,Workplace Inclusivity,Company Communication,Compensation Satisfaction,Job Security,Overall Engagement
0,John Doe,25-34,Female,Asian,Product Manager,Product Development,2024-10-05,Disagree,Strongly Agree,Neutral,Neutral,Neutral,Disagree,Agree,Strongly Agree,Strongly Agree,Agree,Strongly Agree
1,Jane Smith,18-24,Female,Middle Eastern,Operations Manager,Sales,2024-10-07,Agree,Strongly Disagree,Strongly Agree,Neutral,Neutral,Disagree,Neutral,Neutral,Neutral,Agree,Neutral
2,Carlos Reyes,45-54,Female,Indian,UX Designer,Consulting,2024-10-08,Neutral,Strongly Disagree,Agree,Disagree,Strongly Agree,Agree,Agree,Strongly Disagree,Neutral,Strongly Agree,Strongly Agree
3,Emily Zhang,35-44,Male,Caucasian,UX Designer,HR,2024-10-07,Neutral,Agree,Agree,Strongly Disagree,Strongly Disagree,Strongly Agree,Agree,Strongly Disagree,Strongly Disagree,Agree,Neutral
4,Michael Johnson,18-24,Female,Caucasian,UX Designer,Product Development,2024-10-07,Agree,Strongly Agree,Disagree,Neutral,Strongly Disagree,Disagree,Disagree,Disagree,Strongly Disagree,Neutral,Disagree


In [2]:

# Map Likert scale responses to numeric values
likert_mapping = {
    'Strongly Disagree': 1,
    'Disagree': 2,
    'Neutral': 3,
    'Agree': 4,
    'Strongly Agree': 5
}

# Apply the mapping to all relevant columns
columns_to_convert = [
    'Job Satisfaction', 'Work-Life Balance', 'Management Support', 'Team Collaboration', 
    'Workload Fairness', 'Career Development Opportunities', 'Workplace Inclusivity', 
    'Company Communication', 'Compensation Satisfaction', 'Job Security', 'Overall Engagement'
]

for column in columns_to_convert:
    df[column] = df[column].map(likert_mapping)

# Check if conversion was successful
print(df[columns_to_convert].head())

# Descriptive statistics for Overall Engagement and Job Satisfaction
descriptive_stats = df[['Overall Engagement', 'Job Satisfaction']].describe()
print("Descriptive Statistics for Overall Engagement and Job Satisfaction:\n", descriptive_stats)

# Identify trends based on demographic factors
age_group_satisfaction = df.groupby('Age Bracket')['Job Satisfaction'].mean()
department_satisfaction = df.groupby('Department')['Job Satisfaction'].mean()

print("\nAverage Job Satisfaction by Age Group:\n", age_group_satisfaction)
print("\nAverage Job Satisfaction by Department:\n", department_satisfaction)


   Job Satisfaction  Work-Life Balance  Management Support  \
0                 2                  5                   3   
1                 4                  1                   5   
2                 3                  1                   4   
3                 3                  4                   4   
4                 4                  5                   2   

   Team Collaboration  Workload Fairness  Career Development Opportunities  \
0                   3                  3                                 2   
1                   3                  3                                 2   
2                   2                  5                                 4   
3                   1                  1                                 5   
4                   3                  1                                 2   

   Workplace Inclusivity  Company Communication  Compensation Satisfaction  \
0                      4                      5                          5   
1 
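
The task also names gender and ethnicity, which the cell above does not break out. A minimal sketch of the same group-mean approach follows, using a small hypothetical DataFrame (`toy`) and a hypothetical helper (`group_summary`) rather than the survey file. It reports the respondent count next to each mean, since group means over very few responses are easy to over-read.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Ethnicity": ["Asian", "Indian", "Asian", "Indian"],
    "Job Satisfaction": [2, 4, 3, 5],
    "Overall Engagement": [5, 3, 3, 4],
})

def group_summary(frame, group_col, value_cols):
    """Return the mean of each value column per group, plus the group size."""
    summary = frame.groupby(group_col)[value_cols].mean()
    summary["n"] = frame.groupby(group_col).size()
    return summary

value_cols = ["Job Satisfaction", "Overall Engagement"]
gender_summary = group_summary(toy, "Gender", value_cols)
ethnicity_summary = group_summary(toy, "Ethnicity", value_cols)

print(gender_summary)
print(ethnicity_summary)
```

On the real data, the same helper could be called with `df` and any of 'Age Bracket', 'Gender', 'Ethnicity', or 'Department' after the Likert mapping has been applied.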

In [3]:
from scipy import stats

# Subset data for IT and HR departments
it_satisfaction = df[df['Department'] == 'IT']['Job Satisfaction']
hr_satisfaction = df[df['Department'] == 'HR']['Job Satisfaction']

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(it_satisfaction, hr_satisfaction)

print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpret results
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in Job Satisfaction between IT and HR.")
else:
    print("Fail to reject the null hypothesis. No significant difference in Job Satisfaction between IT and HR.")


T-statistic: -1.7320508075688774, P-value: 0.33333333333333326
Fail to reject the null hypothesis. No significant difference in Job Satisfaction between IT and HR.
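
The reported p-value of 1/3 at t ≈ -1.73 is what a t-distribution with a single degree of freedom gives, which suggests only three IT and HR responses in total. With groups that small, and with ordinal Likert scores, the equal-variance t-test is fragile. The sketch below, on hypothetical scores (`group_a`, `group_b`), shows two commonly used alternatives from `scipy.stats`: Welch's t-test (`equal_var=False`) and the rank-based Mann-Whitney U test. It illustrates the calls and is not a re-analysis of the survey.

```python
import numpy as np
from scipy import stats

# Hypothetical Likert scores (1-5) for two departments; not taken from the survey file.
group_a = np.array([2, 3, 3, 4, 2, 3])
group_b = np.array([4, 4, 5, 3, 4, 5])

# Welch's t-test does not assume equal variances in the two groups.
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U compares ranks, which suits ordinal responses.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Group sizes: {len(group_a)} and {len(group_b)}")
print(f"Welch t-test: t={welch_t:.3f}, p={welch_p:.3f}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.3f}")
```

Printing the group sizes alongside the statistics makes it harder to miss a comparison that rests on two or three responses.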


In [4]:
# Correlation between Work-Life Balance and Overall Engagement
correlation = df['Work-Life Balance'].corr(df['Overall Engagement'])

print(f"Correlation Coefficient between Work-Life Balance and Overall Engagement: {correlation}")

# Interpret the result by magnitude, reporting the direction separately
strength = abs(correlation)
direction = "positive" if correlation > 0 else "negative"
if strength > 0.7:
    print(f"Strong {direction} correlation.")
elif strength > 0.5:
    print(f"Moderate {direction} correlation.")
elif strength > 0.3:
    print(f"Weak {direction} correlation.")
else:
    print("Little or no linear correlation.")


Correlation Coefficient between Work-Life Balance and Overall Engagement: -0.361040820409166
Weak negative correlation.
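
`Series.corr` defaults to Pearson's coefficient and returns no p-value, so the number above cannot by itself be called significant or not. For ordinal Likert items, a rank correlation is a common alternative. The sketch below uses `scipy.stats.spearmanr` on hypothetical paired responses (`work_life`, `engagement`) to show how the coefficient and its p-value could be obtained together.

```python
from scipy import stats

# Hypothetical paired Likert responses (1-5); not taken from the survey file.
work_life = [5, 1, 1, 4, 5, 3, 2, 4]
engagement = [5, 3, 5, 3, 2, 3, 4, 2]

# Spearman's rho works on ranks, so it only assumes the scale is ordered.
rho, p_value = stats.spearmanr(work_life, engagement)

print(f"Spearman rho: {rho:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("The rank correlation differs significantly from zero at the 5% level.")
else:
    print("No significant rank correlation at the 5% level.")
```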


# Recommendations and Business Implications
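Based on the results, recommendations could include:
Improvement areas: If certain departments (e.g., HR) have lower job satisfaction, the organization could focus on interventions to boost morale in those areas. The t-test above found no significant difference between IT and HR, so any gap between those two departments should be treated as tentative.
Correlation insights: A strong positive correlation between Work-Life Balance and Overall Engagement would justify the nonprofit prioritizing work-life balance initiatives to increase overall employee engagement. The coefficient computed above (about -0.36) is weak and negative, so this sample does not support that conclusion.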
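
One way to turn the "improvement areas" idea into a check is to compare each department's mean with the overall mean and list the departments that fall below it. This is a minimal sketch on a hypothetical DataFrame (`toy`); the overall mean as a cutoff is an assumption for illustration, not a rule taken from the analysis above.

```python
import pandas as pd

# Hypothetical toy data; department names and scores are illustrative only.
toy = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "Sales", "Sales"],
    "Job Satisfaction": [2, 3, 4, 5, 3, 4],
})

overall_mean = toy["Job Satisfaction"].mean()

# Mean score per department, lowest first.
dept_means = toy.groupby("Department")["Job Satisfaction"].mean().sort_values()

# Departments whose mean falls strictly below the overall mean.
below_average = dept_means[dept_means < overall_mean]

print(f"Overall mean: {overall_mean:.2f}")
print("Departments below the overall mean:")
print(below_average)
```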


In [5]:
df[columns_to_convert].head()

,Job Satisfaction,Work-Life Balance,Management Support,Team Collaboration,Workload Fairness,Career Development Opportunities,Workplace Inclusivity,Company Communication,Compensation Satisfaction,Job Security,Overall Engagement
0,2,5,3,3,3,2,4,5,5,4,5
1,4,1,5,3,3,2,3,3,3,4,3
2,3,1,4,2,5,4,4,1,3,5,5
3,3,4,4,1,1,5,4,1,1,4,3
4,4,5,2,3,1,2,2,2,1,3,2


In [6]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.read_csv('employee_experience_survey_data.csv')

# Map Likert scale responses to numeric values as done previously
likert_mapping = {
    'Strongly Disagree': 1,
    'Disagree': 2,
    'Neutral': 3,
    'Agree': 4,
    'Strongly Agree': 5
}

# Apply the mapping to all relevant columns
columns_to_convert = [
    'Job Satisfaction', 'Work-Life Balance', 'Management Support', 'Team Collaboration', 
    'Workload Fairness', 'Career Development Opportunities', 'Workplace Inclusivity', 
    'Company Communication', 'Compensation Satisfaction', 'Job Security', 'Overall Engagement'
]

for column in columns_to_convert:
    df[column] = df[column].map(likert_mapping)

# One-hot encode the categorical columns: Age Bracket, Gender, Ethnicity, Department
df_encoded = pd.get_dummies(df, columns=['Age Bracket', 'Gender', 'Ethnicity', 'Department'], drop_first=True)

# Drop non-numeric columns like 'Name', 'Date Survey Completed', 'Job Title'
df_encoded = df_encoded.drop(columns=['Name', 'Date Survey Completed', 'Job Title'])

# Ensure the data contains no non-numeric columns
print(df_encoded.dtypes)  # Check if all columns are numeric

# Drop any rows with missing values
df_encoded = df_encoded.dropna()

# Standardize the data for clustering
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_encoded)


Job Satisfaction                    int64
Work-Life Balance                   int64
Management Support                  int64
Team Collaboration                  int64
Workload Fairness                   int64
Career Development Opportunities    int64
Workplace Inclusivity               int64
Company Communication               int64
Compensation Satisfaction           int64
Job Security                        int64
Overall Engagement                  int64
Age Bracket_25-34                    bool
Age Bracket_35-44                    bool
Age Bracket_45-54                    bool
Gender_Male                          bool
Ethnicity_Asian                      bool
Ethnicity_Caucasian                  bool
Ethnicity_Hispanic                   bool
Ethnicity_Indian                     bool
Ethnicity_Middle Eastern             bool
Department_Design                    bool
Department_Finance                   bool
Department_HR                        bool
Department_IT                     

In [7]:
from sklearn.cluster import KMeans

# Set number of clusters (e.g., 3 groups of employees)
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model
kmeans.fit(df_scaled)

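# Add the cluster labels back to the original dataframe.
# Keep only the rows that survived dropna() in df_encoded so the lengths match.
df = df.loc[df_encoded.index].copy()
df['Cluster'] = kmeans.labels_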

# Check the distribution of employees across clusters
print(df['Cluster'].value_counts())


Cluster
1    10
2     3
0     2
Name: count, dtype: int64
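
The choice of three clusters above is given as an example. One common way to compare candidate values of k is the silhouette score from `sklearn.metrics`. The sketch below runs on synthetic, well-separated points (`X`) rather than the survey data, and `n_init=10` is an assumed setting. It shows the selection loop only and does not suggest what k would suit the survey.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

# Fit KMeans for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score indicates tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```

With only 15 respondents and clusters of 10, 3, and 2, any score computed on the survey data would itself be unstable, so it is better read as a rough guide than as a decision rule.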


In [8]:
# Check the data types of all columns in the DataFrame
print(df.dtypes)

# Keep only the numeric survey columns (the Likert items converted earlier).
# 'Cluster' is the grouping key, so it is excluded from the columns to average.
numeric_columns = df.select_dtypes(include=[np.number]).columns.drop('Cluster').tolist()

# Analyze the characteristics of each cluster by calculating the mean of each numeric variable per cluster
try:
    cluster_means = df.groupby('Cluster')[numeric_columns].mean()
    print("Cluster Means for Job Satisfaction and Related Factors:\n", cluster_means)
except Exception as e:
    print("Error during grouping and calculating means:", e)


Name                                object
Age Bracket                         object
Gender                              object
Ethnicity                           object
Job Title                           object
Department                          object
Date Survey Completed               object
Job Satisfaction                     int64
Work-Life Balance                    int64
Management Support                   int64
Team Collaboration                   int64
Workload Fairness                    int64
Career Development Opportunities     int64
Workplace Inclusivity                int64
Company Communication                int64
Compensation Satisfaction            int64
Job Security                         int64
Overall Engagement                   int64
Cluster                              int32
dtype: object
Cluster Means for Job Satisfaction and Related Factors:
          Job Satisfaction  Work-Life Balance  Management Support  \
Cluster                                     

In [9]:
# Drop rows where any numeric column holds a missing or non-numeric value
for col in numeric_columns:
    df = df[pd.to_numeric(df[col], errors='coerce').notna()]

# Recalculate cluster means after cleaning
cluster_means = df.groupby('Cluster')[numeric_columns].mean()
print("Cleaned Cluster Means:\n", cluster_means)


Cleaned Cluster Means:
          Job Satisfaction  Work-Life Balance  Management Support  \
Cluster                                                            
0                     2.5           2.000000            2.000000   
1                     3.1           3.200000            3.100000   
2                     3.0           4.666667            2.333333   

         Team Collaboration  Workload Fairness  \
Cluster                                          
0                       3.5                4.5   
1                       2.7                3.1   
2                       3.0                3.0   

         Career Development Opportunities  Workplace Inclusivity  \
Cluster                                                            
0                                4.500000                    3.5   
1                                3.000000                    3.3   
2                                2.333333                    4.0   

         Company Communication  Compensatio

In [10]:
# Generate recommendations based on cluster means.
# The loop variable is named `row` so it does not shadow the `stats` module imported from scipy.
for cluster_id, row in cluster_means.iterrows():
    print(f"\nRecommendations for Cluster {cluster_id}:")

    if row['Job Satisfaction'] < 3:
        print("- Focus on improving job satisfaction. Consider implementing career development programs or regular feedback sessions.")

    if row['Work-Life Balance'] < 3:
        print("- Enhance work-life balance, possibly by introducing flexible working hours or wellness programs.")

    if row['Compensation Satisfaction'] < 3:
        print("- Review and improve compensation packages to align with employee expectations.")

    if row['Management Support'] < 3:
        print("- Increase management training to ensure employees feel supported and valued.")



Recommendations for Cluster 0:
- Focus on improving job satisfaction. Consider implementing career development programs or regular feedback sessions.
- Enhance work-life balance, possibly by introducing flexible working hours or wellness programs.
- Increase management training to ensure employees feel supported and valued.

Recommendations for Cluster 1:
- Review and improve compensation packages to align with employee expectations.

Recommendations for Cluster 2:
- Increase management training to ensure employees feel supported and valued.


In [11]:
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assuming your data is already preprocessed and encoded as `df_encoded`
X = df_encoded.drop(columns=['Job Satisfaction'])  # Features
y = df_encoded['Job Satisfaction']  # Target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save the trained model and feature names
with open('job_satisfaction_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

with open('model_features.pkl', 'wb') as feature_file:
    pickle.dump(X_train.columns.tolist(), feature_file)

print("Model and feature names saved.")


Model and feature names saved.
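
Saving the feature names alongside the model matters because a new response, once one-hot encoded, will usually lack some of the dummy columns seen in training. The sketch below shows one way the saved list could be used at prediction time: reindex the new row to the training columns and fill the missing dummies with 0. It trains a tiny model on hypothetical data (`train`, `new_response`) and round-trips through `pickle` in memory instead of the files written above.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data; values are illustrative only.
train = pd.DataFrame({
    "Work-Life Balance": [5, 1, 1, 4, 5, 3],
    "Department": ["HR", "IT", "Sales", "HR", "IT", "Sales"],
    "Job Satisfaction": [2, 4, 3, 3, 4, 3],
})
X_train = pd.get_dummies(
    train.drop(columns=["Job Satisfaction"]), columns=["Department"], drop_first=True
)
y_train = train["Job Satisfaction"]
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_train, y_train)

# Round-trip the model and the feature list through pickle (in memory here).
loaded_model = pickle.loads(pickle.dumps(model))
feature_names = pickle.loads(pickle.dumps(X_train.columns.tolist()))

# A single new response: encoding it alone yields only the dummies it contains.
new_response = pd.DataFrame({"Work-Life Balance": [4], "Department": ["IT"]})
new_encoded = pd.get_dummies(new_response, columns=["Department"])

# Align to the training columns: missing dummies become 0, extras are dropped.
new_aligned = new_encoded.reindex(columns=feature_names, fill_value=0)

prediction = loaded_model.predict(new_aligned)
print(new_aligned)
print("Predicted Job Satisfaction:", prediction[0])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.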


In [12]:
df_encoded.head()

,Job Satisfaction,Work-Life Balance,Management Support,Team Collaboration,Workload Fairness,Career Development Opportunities,Workplace Inclusivity,Company Communication,Compensation Satisfaction,Job Security,...,Ethnicity_Hispanic,Ethnicity_Indian,Ethnicity_Middle Eastern,Department_Design,Department_Finance,Department_HR,Department_IT,Department_Operations,Department_Product Development,Department_Sales
0,2,5,3,3,3,2,4,5,5,4,...,False,False,False,False,False,False,False,False,True,False
1,4,1,5,3,3,2,3,3,3,4,...,False,False,True,False,False,False,False,False,False,True
2,3,1,4,2,5,4,4,1,3,5,...,False,True,False,False,False,False,False,False,False,False
3,3,4,4,1,1,5,4,1,1,4,...,False,False,False,False,False,True,False,False,False,False
4,4,5,2,3,1,2,2,2,1,3,...,False,False,False,False,False,False,False,False,True,False
