**This Notebook applies supervised machine learning to a global work-from-home survey dataset**

The prediction target is each employee's stated response to being required to return to their worksite after the end of the pandemic. The survey offered three options: comply and return, look for a job that allows working from home, or quit. The goal is to generate a model that can predict which of these three options a particular employee will choose.

Dee Weinacht (c) 2023

Data sourced from the Global Survey of Working Arrangements (G-SWA), used under the Creative Commons Attribution 4.0 International License:
Aksoy, Cevat Giray, Jose Maria Barrero, Nicholas Bloom, Steven J. Davis, Mathias Dolls and Pablo Zarate, 2022. “Working from Home Around the World,” Brookings Papers on Economic Activity.
https://wfhresearch.com/

---

## Initial Setup

In [1]:
import pandas as pd
import numpy as np

In [2]:
# original_data = pd.read_excel('G-SWA.xlsx')
clean_data = pd.read_excel('G-SWA Clean.xlsx')

## Data Wrangling

### Prepare data types for machine learning

Most machine learning algorithms require numeric datatypes.

List columns with string/object datatype:

In [3]:
for col in clean_data.columns:
    if clean_data[col].dtype == object:
        print(f'{col}: \n{clean_data[col].unique()}\n')

original_country: 
['Australia' 'Austria' 'Brazil' 'Canada' 'China' 'France' 'Germany'
 'Greece' 'Hungary' 'India' 'Italy' 'Japan' 'Korea' 'Malaysia'
 'Netherlands' 'Poland' 'Russia' 'Singapore' 'Spain' 'Sweden' 'Taiwan'
 'Turkey' 'UK' 'USA' 'Ukraine']

gender: 
['Female' 'Male' 'Other/Prefer not to say']

education: 
['Graduate' 'Tertiary' 'Secondary']

industry_job: 
['Education' 'Retail Trade' 'Information'
 'Professional & Business Services' 'Other' 'Construction'
 'Wholesale Trade' 'Health Care & Social Assistance' 'Real Estate'
 'Finance or Insurance' 'Mining' 'Government' 'Manufacturing'
 'Transportation or Warehousing' 'Hospitality & Food Services' 'Utilities'
 'Agriculture' 'Arts & Entertainment']

return_office: 
[' Look for a job to WFH 1-2 days' 'Quit job'
 'Comply and return to worksite']



'education' is an ordinal categorical datatype and is encoded directly as a numeric datatype:

In [4]:
education_dic = {'Secondary': 0, 'Tertiary': 1, 'Graduate': 2}
clean_data['education'] = clean_data['education'].replace(education_dic)

'gender', 'industry_job', and 'country' are nominal categorical data and are encoded using dummy variables:

In [5]:
clean_data['gender'] = clean_data['gender'].str.lower()
clean_data['country'] = clean_data['original_country'].str.lower()
clean_data.drop(labels='original_country', axis=1, inplace=True)
clean_data['industry_job'] = clean_data['industry_job'].str.lower()
clean_data['industry_job'] = clean_data['industry_job'].str.replace(' ', '_')
clean_data = pd.get_dummies(data=clean_data, columns=['gender', 'industry_job', 'country'], drop_first=True, dtype=np.int64)
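
As a small illustration of the two encodings used above, the sketch below applies them to a toy frame. The rows are made up for illustration and are not taken from the survey. With `drop_first=True`, the first category in sorted order becomes the implicit reference level and gets no column of its own.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from the G-SWA survey.
toy = pd.DataFrame({
    'education': ['Secondary', 'Graduate', 'Tertiary'],
    'gender': ['female', 'male', 'other/prefer not to say'],
})

# Ordinal feature: map the ordered levels onto integers.
toy['education'] = toy['education'].map({'Secondary': 0, 'Tertiary': 1, 'Graduate': 2})

# Nominal feature: one indicator column per category, minus the first one.
encoded = pd.get_dummies(toy, columns=['gender'], drop_first=True, dtype=np.int64)
print(encoded.columns.tolist())
```

Here 'female' is the dropped reference level, so a row with zeros in both remaining gender columns represents that category.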

In [6]:
for col in clean_data.columns:
    if clean_data[col].dtype == object:
        print(f'{col}: \n{clean_data[col].unique()}\n')

return_office: 
[' Look for a job to WFH 1-2 days' 'Quit job'
 'Comply and return to worksite']



The target variable is the only remaining object/string datatype, so the dataset is ready for use in a classifier.

Place target variable at the start of the dataframe:

In [7]:
clean_data.insert(loc=0, column='return_office', value=clean_data.pop('return_office'))

In [8]:
# Remove the stray leading space seen in ' Look for a job to WFH 1-2 days'
clean_data['return_office'] = clean_data['return_office'].str.strip()

In [9]:
clean_data['return_office'].value_counts()

Comply and return to worksite     9282
Look for a job to WFH 1-2 days    2875
Quit job                           317
Name: return_office, dtype: int64

The finalized machine learning dataset:

In [10]:
ml_data = clean_data
ml_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12474 entries, 0 to 12473
Data columns (total 60 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   return_office                                  12474 non-null  object 
 1   education                                      12474 non-null  int64  
 2   age                                            12474 non-null  int64  
 3   married                                        12474 non-null  int64  
 4   with_kids                                      12474 non-null  float64
 5   work_home_days_current                         12474 non-null  int64  
 6   work_home_days_employer                        12474 non-null  int64  
 7   work_home_days_employee                        12474 non-null  int64  
 8   WFH_value                                      12474 non-null  float64
 9   WFH_expectation                                124

In [11]:
ml_data.describe()

Unnamed: 0,education,age,married,with_kids,work_home_days_current,work_home_days_employer,work_home_days_employee,WFH_value,WFH_expectation,WFH_perception,...,country_poland,country_russia,country_singapore,country_spain,country_sweden,country_taiwan,country_turkey,country_uk,country_ukraine,country_usa
count,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,...,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0,12474.0
mean,1.242905,39.000561,0.779622,0.49531,2.622415,1.452862,2.503447,7.611432,7.267917,47.255491,...,0.026535,0.042168,0.064214,0.032307,0.039522,0.032067,0.050425,0.040805,0.039763,0.03848
std,0.763557,10.161728,0.414518,0.498251,2.089775,1.752797,1.758177,10.335732,12.015619,42.326908,...,0.160727,0.20098,0.245143,0.176822,0.194842,0.176184,0.218829,0.197846,0.195409,0.19236
min,0.0,20.0,0.0,0.0,0.0,0.0,0.0,-30.0,-25.0,-95.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,31.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,25.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,38.0,1.0,0.0,3.0,0.0,2.0,7.5,5.0,70.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,2.0,48.0,1.0,1.0,5.0,3.0,4.0,12.5,15.0,70.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2.0,59.0,1.0,1.0,5.0,5.0,5.0,30.0,25.0,95.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Save the machine learning dataset:

In [12]:
ml_data.to_excel('G-SWA ML.xlsx', index=False)

## Machine Learning
### Train-Test Split
For supervised machine learning, the dataset must be separated into X and y variables, with y being the dependent variable and X being the independent variables:

In [13]:
X = ml_data.drop('return_office', axis=1)
y = ml_data['return_office']

Next, the dataset must be split into training and testing sets in order to evaluate the performance of a generated model:

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=36)
y_train.value_counts()

Comply and return to worksite     6987
Look for a job to WFH 1-2 days    2135
Quit job                           233
Name: return_office, dtype: int64
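
The split above is a plain random split. One possible alternative, shown here only as a sketch on synthetic labels, is to pass `stratify` to `train_test_split` so that a rare class keeps the same share in both subsets. The names `X_demo` and `y_demo` and the class counts are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalance loosely resembling the survey target.
y_demo = np.array(['comply'] * 740 + ['look'] * 230 + ['quit'] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=36
)

# Each class keeps (almost exactly) its original share in both subsets.
for label in np.unique(y_demo):
    print(label, round((y_tr == label).mean(), 3), round((y_te == label).mean(), 3))
```

This does not change the imbalance itself; it only makes the test set's class mix a more reliable reflection of the full dataset.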

### Address Class Imbalance
Imbalance between the classification categories will cause difficulties for some machine learning algorithms. 

Use SMOTEENN from imbalanced-learn, which combines the synthetic minority oversampling technique (SMOTE) with edited nearest neighbor (ENN) cleaning, to oversample the minority classes and then remove samples that disagree with their neighbors:

In [16]:
# install imbalanced-learn, if not installed, using 'pip install imbalanced-learn'
from imblearn.combine import SMOTEENN

In [17]:
smote_enn = SMOTEENN(random_state=36, n_jobs=-1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)



In [18]:
y_resampled.value_counts()

Quit job                          6539
Look for a job to WFH 1-2 days    4538
Comply and return to worksite     2804
Name: return_office, dtype: int64

In [19]:
print(f'Original training set: {len(X_train)} --> Resampled training set: {len(X_resampled)}')

Original training set: 9355 --> Resampled training set: 13881


The resampled training data is far more balanced across the classes, although 'Quit job' is now the largest class and 'Comply and return to worksite' the smallest.
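
To give a feel for the oversampling half of SMOTEENN, here is a minimal NumPy sketch of SMOTE-style interpolation: each synthetic row is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. This is an illustration under simplifying assumptions, not the imbalanced-learn implementation, and the function name `smote_like_samples` is hypothetical. The ENN cleaning step is not shown.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=36):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    k = min(k, len(X_minority) - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)        # random base points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor per base point
    lam = rng.random((n_new, 1))                               # position along the segment
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because the new rows are interpolated, plain SMOTE-style sampling can produce fractional values in 0/1 indicator columns such as the dummy variables created earlier.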

## k-Nearest Neighbors Classifier

This classifier generates a prediction by evaluating the proximity of a datapoint to other datapoints.
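
The idea can be sketched in a few lines of NumPy: measure the distance from a new point to every training point, take the k closest, and let them vote. This is a simplified illustration rather than scikit-learn's implementation; the helper `knn_predict` and the toy data are assumptions made for the example. The default of `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = ['comply', 'comply', 'comply', 'quit', 'quit', 'quit']
print(knn_predict(X_toy, y_toy, [0.5, 0.5], k=3))  # comply
print(knn_predict(X_toy, y_toy, [5.5, 5.5], k=3))  # quit
```

Since the vote depends on raw distances, features measured on larger numeric ranges have more influence on which neighbors are chosen.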

In [20]:
from sklearn.neighbors import KNeighborsClassifier

Create a k-neighbors classifier, fit it to the training data, and generate a prediction:

In [21]:
knc = KNeighborsClassifier(n_jobs=-1)
knc.fit(X_train, y_train)
y_predict = knc.predict(X_test)



Import scoring metrics from scikit-learn and evaluate the model's accuracy:

In [22]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_predict))

[[2090  205    0]
 [ 603  137    0]
 [  76    8    0]]


In [23]:
print(classification_report(y_test, y_predict, zero_division=0))

                                precision    recall  f1-score   support

 Comply and return to worksite       0.75      0.91      0.83      2295
Look for a job to WFH 1-2 days       0.39      0.19      0.25       740
                      Quit job       0.00      0.00      0.00        84

                      accuracy                           0.71      3119
                     macro avg       0.38      0.37      0.36      3119
                  weighted avg       0.65      0.71      0.67      3119



The overall accuracy of 0.71 looks reasonable, but it mostly reflects the class imbalance: the model favors the majority class and never predicts 'Quit job'.
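
The arithmetic behind this observation can be checked directly from the confusion matrix printed above. The short sketch below recomputes the accuracy, the accuracy of always predicting the largest class, and the per-class recall; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class).
cm = np.array([[2090, 205, 0],
               [603, 137, 0],
               [76, 8, 0]])

accuracy = np.trace(cm) / cm.sum()
majority_baseline = cm.sum(axis=1).max() / cm.sum()  # always predict the largest class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 3))                   # 0.714
print(round(majority_baseline, 3))          # 0.736
print(recall_per_class.round(2).tolist())   # [0.91, 0.19, 0.0]
```

The k-nearest neighbors accuracy is slightly below what a constant majority-class prediction would achieve on this test set, which is why accuracy alone is a poor guide here.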


Refit the classifier using the resampled data that balances the classes and generate a new set of predictions:

In [24]:
knc.fit(X_resampled, y_resampled)
y_predict = knc.predict(X_test)



Evaluate the new model:

In [25]:
print(confusion_matrix(y_test, y_predict))

[[982 945 368]
 [152 447 141]
 [ 19  36  29]]


In [26]:
print(classification_report(y_test, y_predict))

                                precision    recall  f1-score   support

 Comply and return to worksite       0.85      0.43      0.57      2295
Look for a job to WFH 1-2 days       0.31      0.60      0.41       740
                      Quit job       0.05      0.35      0.09        84

                      accuracy                           0.47      3119
                     macro avg       0.41      0.46      0.36      3119
                  weighted avg       0.70      0.47      0.52      3119



While the model now predicts all three classes, its overall accuracy has dropped significantly, from 0.71 to 0.47.

## Multilayer Perceptron Classifier

A multilayer perceptron classifier is a feedforward artificial neural network that can generate classification predictions.
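
As a rough picture of what "feedforward" means, the sketch below pushes a few rows through one hidden layer with a logistic activation and a softmax output, then picks the most probable class. The layer sizes and random weights are purely illustrative assumptions; a fitted classifier would learn its weights from the training data.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(36)
n_features, n_hidden, n_classes = 4, 8, 3  # illustrative sizes only

# Randomly initialized weights stand in for trained parameters.
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

X_demo = rng.normal(size=(5, n_features))
hidden = logistic(X_demo @ W1 + b1)   # hidden layer with logistic activation
probs = softmax(hidden @ W2 + b2)     # one probability per class, per row
predicted = probs.argmax(axis=1)      # index of the most probable class
print(probs.shape, predicted)
```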

In [27]:
from sklearn.neural_network import MLPClassifier

Create a multilayer perceptron classifier, fit it to the original training data, and generate a prediction:

In [81]:
mlpc = MLPClassifier(solver='adam', activation='logistic', learning_rate='adaptive', learning_rate_init=0.0001, batch_size=32, random_state=36)
mlpc.fit(X_train, y_train)
y_predict = mlpc.predict(X_test)

Evaluate the model's performance:

In [82]:
print(confusion_matrix(y_test, y_predict))

[[2295    0    0]
 [ 740    0    0]
 [  84    0    0]]


In [83]:
print(classification_report(y_test, y_predict, zero_division=0))

                                precision    recall  f1-score   support

 Comply and return to worksite       0.74      1.00      0.85      2295
Look for a job to WFH 1-2 days       0.00      0.00      0.00       740
                      Quit job       0.00      0.00      0.00        84

                      accuracy                           0.74      3119
                     macro avg       0.25      0.33      0.28      3119
                  weighted avg       0.54      0.74      0.62      3119



Due to the class imbalance, this model predicts only the majority class. Its accuracy of 0.74 is simply the majority class's share of the test set, so the model offers no predictive value.

Refit the classifier on the resampled data, which addresses the class imbalance, and generate a new set of predictions:

In [84]:
mlpc.fit(X_resampled, y_resampled)
y_predict = mlpc.predict(X_test)

In [85]:
print(confusion_matrix(y_test, y_predict))

[[1092  972  231]
 [ 148  464  128]
 [  13   28   43]]


In [86]:
print(classification_report(y_test, y_predict))

                                precision    recall  f1-score   support

 Comply and return to worksite       0.87      0.48      0.62      2295
Look for a job to WFH 1-2 days       0.32      0.63      0.42       740
                      Quit job       0.11      0.51      0.18        84

                      accuracy                           0.51      3119
                     macro avg       0.43      0.54      0.40      3119
                  weighted avg       0.72      0.51      0.56      3119



Using the resampled training data, the MLP classifier still has relatively low accuracy (0.51), but it now offers predictions for each of the three classes.

## Results

Both of the machine learning algorithms used (k-nearest neighbors and multilayer perceptron classifier) struggled to handle the class imbalance of the training data. Trained on the original data, the multilayer perceptron predicted only the majority class and the k-nearest neighbors model never predicted the smallest class, neither of which is a useful predictive outcome. This is not surprising given the significant imbalance between the classes (233, 2,135, and 6,987 training examples): the majority class had approximately 30 times as many examples as the minority class.

To address this, the training data was resampled using a combined over- and under-sampling technique (SMOTEENN). This produced a much closer balance between the three classes, with the largest class approximately 2.3 times the size of the smallest. After resampling, both models offered predictions for each of the three classes. However, the overall precision and recall for both machine learning algorithms remained disappointing: accuracy was 0.47 and 0.51, weighted-average f1-scores were 0.52 and 0.56, and macro-average f1-scores were only 0.36 and 0.40.

Further tuning could improve the generated models. Additionally, neural network classifiers like the MLP, as well as distance-based classifiers like k-nearest neighbors, typically perform better when features are scaled, so adding feature scaling could yield better-performing models.
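
One way feature scaling might be added is sketched below, using a scikit-learn pipeline that standardizes the features before they reach the classifier. The example runs on synthetic data from `make_classification`, not on the survey data, and the hyperparameters are placeholders rather than tuned values, so the printed score says nothing about how the survey models would perform.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the survey features are not used here.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=36
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=36
)

# The scaler is fitted on the training data only, then reused at predict time.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=500, random_state=36),
)
scaled_mlp.fit(X_tr, y_tr)
print(round(scaled_mlp.score(X_te, y_te), 3))
```

Wrapping the scaler and classifier in one pipeline keeps test-set statistics out of the scaling step, and the same pattern works with `KNeighborsClassifier` in place of the MLP.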