**IN THE FOLLOWING SECTION, WE'LL APPLY THE LEARNINGS FROM THE MODULE 4 ASSIGNMENT OF OUR COURSE TO OUR GROUP PROJECT DATASET.**


To start, we'll import all the libraries that we may or may not need to perform our analysis:

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm, tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.preprocessing import StandardScaler

Then we'll import our dataset, and take a look at it:

In [4]:
# Loading the dataset
balance = pd.read_csv('Wellbeing_and_lifestyle_data_Kaggle.csv')
balance.head()

Unnamed: 0,Timestamp,FRUITS_VEGGIES,DAILY_STRESS,PLACES_VISITED,CORE_CIRCLE,SUPPORTING_OTHERS,SOCIAL_NETWORK,ACHIEVEMENT,DONATION,BMI_RANGE,...,SLEEP_HOURS,LOST_VACATION,DAILY_SHOUTING,SUFFICIENT_INCOME,PERSONAL_AWARDS,TIME_FOR_PASSION,WEEKLY_MEDITATION,AGE,GENDER,WORK_LIFE_BALANCE_SCORE
0,7/7/15,3,2,2,5,0,5,2,0,1,...,7,5,5,1,4,0,5,36 to 50,Female,609.5
1,7/7/15,2,3,4,3,8,10,5,2,2,...,8,2,2,2,3,2,6,36 to 50,Female,655.6
2,7/7/15,2,3,3,4,4,10,3,2,2,...,8,10,2,2,4,8,3,36 to 50,Female,631.6
3,7/7/15,3,3,10,3,10,7,2,5,2,...,5,7,5,1,5,2,0,51 or more,Female,622.7
4,7/7/15,5,1,3,3,10,4,2,4,2,...,7,0,0,2,8,1,5,51 or more,Female,663.9


A preliminary investigation reveals a few columns that will need to be prepped for our analysis:

- Timestamp is irrevelant for our purposes, and can be dropped altogether.
- Age Range should be cleaned up by converting each range category to a numerical value: Less than 20 = 1, 21 to 35 = 2, 36 to 50 = 3, 51 or more = 4 
- Gender can also be made numerical in the same way, we'll assign 0 = Female, 1 = Male.
- Row 10007 \(index 10005\) contains a date\-formatted entry for the DAILY\_STRESS column.  Since we don't know what that value should actually be, we'll drop the entire row as our sample size is large enough to not be impacted.
- Work\-life balance score is a calculated value, though we don't know how it is calculated.  We'll likely want to extract it so that it doesn't factor into our model's initial training, similar to how the "Region" column was excluded in the Module 4 assignment. We can then use it for testing or training our predictions at a later time.  Similar to Module 4, we'll assign "x" as the dataframe with the variables, and "y" as the dataframe with the actual results. 



In [5]:
# Dropping the Timestamp column
balance = balance.drop(columns=['Timestamp'])

# Replacing Age Ranges in the Age column
balance['AGE'] = balance['AGE'].replace({"Less than 20": 1, "21 to 35": 2, "36 to 50": 3, "51 or more": 4})

# Replacing "Female" with 0 and "Male" with 1 in the Gender column
balance['GENDER'] = balance['GENDER'].replace({"Female": 0, "Male": 1})

# Dropping index row 10005, which contains a non-numeric value
balance = balance.drop(balance.index[10005])

# Assigning x and y as the dataframes for variables and results
x = balance.drop(columns=['WORK_LIFE_BALANCE_SCORE'])
y = balance['WORK_LIFE_BALANCE_SCORE']


# Printing the shape of x and y:
print("Shape of x:", x.shape)
print("Shape of y:", y.shape)

Shape of x: (15971, 22)
Shape of y: (15971,)


The Work\-Life balance score is a calculated field based on the prior entry in that row, but we don't know how that calculation is done.



In [6]:
# Taking a look at x, to confirm everything looks as intended:
x.head()

Unnamed: 0,FRUITS_VEGGIES,DAILY_STRESS,PLACES_VISITED,CORE_CIRCLE,SUPPORTING_OTHERS,SOCIAL_NETWORK,ACHIEVEMENT,DONATION,BMI_RANGE,TODO_COMPLETED,...,LIVE_VISION,SLEEP_HOURS,LOST_VACATION,DAILY_SHOUTING,SUFFICIENT_INCOME,PERSONAL_AWARDS,TIME_FOR_PASSION,WEEKLY_MEDITATION,AGE,GENDER
0,3,2,2,5,0,5,2,0,1,6,...,0,7,5,5,1,4,0,5,3,0
1,2,3,4,3,8,10,5,2,2,5,...,5,8,2,2,2,3,2,6,3,0
2,2,3,3,4,4,10,3,2,2,2,...,5,8,10,2,2,4,8,3,3,0
3,3,3,10,3,10,7,2,5,2,3,...,0,5,7,5,1,5,2,0,4,0
4,5,1,3,3,10,4,2,4,2,5,...,0,7,0,0,2,8,1,5,4,0


In [7]:
# Creating a StandardScaler object
scaler = StandardScaler()

# Fitting the scaler to the dataframe x
scaler.fit(x)

# Transforming X to apply standardization
x_scaled = scaler.transform(x)

# Converting the scaled array back to a dataframe so as to calculate individual means and standard deviations
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)

# Calculating the means and standard deviations for each column
column_means = x_scaled.mean(axis=0)  
column_stds = x_scaled.std(axis=0)   

# Printing the means and std devs for each column, rounding to 6 digits
for i, (mean, std) in enumerate(zip(column_means, column_stds)):
    print(f"Column {i+1}: Mean = {mean:.6f}, Std Dev = {std:.6f}")

# Peeking at the head of the new dataframe
x_scaled.head()

Column 1: Mean = -0.000000, Std Dev = 1.000031
Column 2: Mean = 0.000000, Std Dev = 1.000031
Column 3: Mean = -0.000000, Std Dev = 1.000031
Column 4: Mean = 0.000000, Std Dev = 1.000031
Column 5: Mean = 0.000000, Std Dev = 1.000031
Column 6: Mean = 0.000000, Std Dev = 1.000031
Column 7: Mean = 0.000000, Std Dev = 1.000031
Column 8: Mean = -0.000000, Std Dev = 1.000031
Column 9: Mean = -0.000000, Std Dev = 1.000031
Column 10: Mean = 0.000000, Std Dev = 1.000031
Column 11: Mean = 0.000000, Std Dev = 1.000031
Column 12: Mean = -0.000000, Std Dev = 1.000031
Column 13: Mean = 0.000000, Std Dev = 1.000031
Column 14: Mean = -0.000000, Std Dev = 1.000031
Column 15: Mean = -0.000000, Std Dev = 1.000031
Column 16: Mean = -0.000000, Std Dev = 1.000031
Column 17: Mean = 0.000000, Std Dev = 1.000031
Column 18: Mean = 0.000000, Std Dev = 1.000031
Column 19: Mean = -0.000000, Std Dev = 1.000031
Column 20: Mean = 0.000000, Std Dev = 1.000031
Column 21: Mean = 0.000000, Std Dev = 1.000031
Column 22: Me

Unnamed: 0,FRUITS_VEGGIES,DAILY_STRESS,PLACES_VISITED,CORE_CIRCLE,SUPPORTING_OTHERS,SOCIAL_NETWORK,ACHIEVEMENT,DONATION,BMI_RANGE,TODO_COMPLETED,...,LIVE_VISION,SLEEP_HOURS,LOST_VACATION,DAILY_SHOUTING,SUFFICIENT_INCOME,PERSONAL_AWARDS,TIME_FOR_PASSION,WEEKLY_MEDITATION,AGE,GENDER
0,0.053599,-0.578819,-0.976294,-0.178965,-1.732407,-0.477572,-0.725985,-1.466473,-0.834683,0.096804,...,-1.161311,-0.035823,0.569258,0.773095,-1.63991,-0.553915,-1.218845,-0.40897,0.421337,-0.787468
1,-0.639548,0.152304,-0.372383,-0.883141,0.735331,1.142362,0.362618,-0.386266,1.198059,-0.28428,...,0.386206,0.798194,-0.243364,-0.347874,0.60979,-0.877598,-0.485987,-0.077447,0.421337,-0.787468
2,-0.639548,0.152304,-0.674338,-0.531053,-0.498538,1.142362,-0.363118,-0.386266,1.198059,-1.42753,...,0.386206,0.798194,1.923628,-0.347874,0.60979,-0.553915,1.712587,-1.072015,0.421337,-0.787468
3,0.053599,0.152304,1.439352,-0.883141,1.352265,0.170402,-0.725985,1.234043,1.198059,-1.046447,...,-1.161311,-1.703858,1.111006,0.773095,-1.63991,-0.230232,-0.485987,-2.066583,1.480214,-0.787468
4,1.439895,-1.309943,-0.674338,-0.883141,1.352265,-0.801558,-0.725985,0.69394,1.198059,-0.28428,...,-1.161311,-0.035823,-0.785113,-1.095186,0.60979,0.740817,-0.852416,-0.40897,1.480214,-0.787468


Now we can split the data out into Testing and Training sets, still using the strategy from  Assignment 4.1:



In [8]:
# Splitting the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=0)

# Printing the shapes of all four arrays
print("Shape of x_train:", x_train.shape)
print("Shape of x_test:", x_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of x_train: (11978, 22)
Shape of x_test: (3993, 22)
Shape of y_train: (11978,)
Shape of y_test: (3993,)


We can then standardize them using the Scaler method from Assignment 4.1:


In [9]:
# Fitting the scaler to the dataframes x_train and x_test
scaler.fit(x_train, x_test)

# Transforming x_train and x_test to apply standardization
x_train_scale = scaler.transform(x_train)
x_test_scale = scaler.transform(x_test)

# Converting the scaled arrays back into dataframes so as to calculate individual means and standard deviations
x_train_scale = pd.DataFrame(x_train_scale, columns=x_train.columns)
x_test_scale = pd.DataFrame(x_test_scale, columns=x_test.columns)

# Computing the means and the standard deviations for each column
x_train_column_means = x_train_scale.mean(axis=0)  
x_train_column_stds = x_train_scale.std(axis=0)   
x_test_column_means = x_test_scale.mean(axis=0)  
x_test_column_stds = x_test_scale.std(axis=0) 

# Printing the means and std devs for each column, rounding to 6 digits
for i, (mean, std) in enumerate(zip(column_means, column_stds)):
    print(f"Column {i+1}: Mean = {mean:.6f}, Std Dev = {std:.6f}")

Column 1: Mean = -0.000000, Std Dev = 1.000031
Column 2: Mean = 0.000000, Std Dev = 1.000031
Column 3: Mean = -0.000000, Std Dev = 1.000031
Column 4: Mean = 0.000000, Std Dev = 1.000031
Column 5: Mean = 0.000000, Std Dev = 1.000031
Column 6: Mean = 0.000000, Std Dev = 1.000031
Column 7: Mean = 0.000000, Std Dev = 1.000031
Column 8: Mean = -0.000000, Std Dev = 1.000031
Column 9: Mean = -0.000000, Std Dev = 1.000031
Column 10: Mean = 0.000000, Std Dev = 1.000031
Column 11: Mean = 0.000000, Std Dev = 1.000031
Column 12: Mean = -0.000000, Std Dev = 1.000031
Column 13: Mean = 0.000000, Std Dev = 1.000031
Column 14: Mean = -0.000000, Std Dev = 1.000031
Column 15: Mean = -0.000000, Std Dev = 1.000031
Column 16: Mean = -0.000000, Std Dev = 1.000031
Column 17: Mean = 0.000000, Std Dev = 1.000031
Column 18: Mean = 0.000000, Std Dev = 1.000031
Column 19: Mean = -0.000000, Std Dev = 1.000031
Column 20: Mean = 0.000000, Std Dev = 1.000031
Column 21: Mean = 0.000000, Std Dev = 1.000031
Column 22: Me

We have now verified that the means have all been scaled to nearly 0, and the standard deviations have all been scaled to nearly 1.



At this point, we must diverge slightly from the learnings of Assignment 4.1, since our dataset is a regression problem rather than a classification problem.  We can still use the same principles as shown in 4.1, but for this we will first attempt to use a Support Vector Regression rather than a classifier:

In [10]:

# We already have these libraries imported, but since we are using them here for the first itme we'll call them out explicitly again, just for fun: 
from sklearn import svm
from sklearn.metrics import mean_squared_error

# Used Chat Gpt 4.0 mini on 3/18/25 to identify the SMV regressor:
# Creating an SVM regressor and assigning it to clf1
regressor1 = svm.SVR()

# Fitting the SVM regressor with the training data
regressor1.fit(x_train_scale, y_train)

# Making predictions on the test data
predictions1 = regressor1.predict(x_test_scale)

# Testing predictions on the test data using mean squared error
mse = mean_squared_error(y_test, predictions1)

# Printing out the results of our prediction
print(mse)

71.81477735555181


**BELOW, ATTEMPTING TO TRAIN A MODEL ON THE DATA USING THE YOUTUBE VIDEO LINKED IN THE MODULE 5 REQUIRED READINGS.**


In [15]:
#Importing the necessary Keras libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import categorical_crossentropy

In [17]:
# Creating a sequential model with 22 dimensions (columns) in input_shape, two hidden layers with 16 and 32 neurons respectively, and an output layer with 1 units

model = Sequential([
    Dense(16, input_shape=(22, ), activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)
])

In [18]:
# Compiling the model, using Adam as the optimizer with a low learning rate (lr) of.0001, a loss function of mean_squared_error since its a regression, and a printed out result of mean_absolute error:

model.compile(Adam(learning_rate=.0001), loss='mean_squared_error', metrics=['mean_absolute_error'])

In [19]:
# Training the model by calling fit. "Batch_size" is how many pieces of data we want sent to the model at once, epochs=20 is how many times the data should run through the model. Shuffle=True shuffles the data each time it runs through an epoch, verbose=2 determines how much output we want to see.

model.fit(x_train_scale, y_train, batch_size=10, epochs=20, shuffle=True, verbose=2)

Epoch 1/20


1198/1198 - 5s - 4ms/step - loss: 442387.4062 - mean_absolute_error: 663.5671


Epoch 2/20


1198/1198 - 3s - 2ms/step - loss: 417619.7500 - mean_absolute_error: 644.3141


Epoch 3/20


1198/1198 - 5s - 4ms/step - loss: 354049.3125 - mean_absolute_error: 591.2015


Epoch 4/20


1198/1198 - 3s - 2ms/step - loss: 260982.0000 - mean_absolute_error: 500.4356


Epoch 5/20


1198/1198 - 5s - 5ms/step - loss: 168162.3750 - mean_absolute_error: 386.9439


Epoch 6/20


1198/1198 - 5s - 4ms/step - loss: 102981.7422 - mean_absolute_error: 287.9888


Epoch 7/20


1198/1198 - 5s - 4ms/step - loss: 69840.7578 - mean_absolute_error: 230.1426


Epoch 8/20


1198/1198 - 3s - 2ms/step - loss: 54469.7344 - mean_absolute_error: 199.9733


Epoch 9/20


1198/1198 - 5s - 4ms/step - loss: 43782.7344 - mean_absolute_error: 177.3384


Epoch 10/20


1198/1198 - 3s - 2ms/step - loss: 34477.8672 - mean_absolute_error: 155.8815


Epoch 11/20


1198/1198 - 5s - 4ms/step - loss: 26516.5117 - mean_absolute_error: 135.1990


Epoch 12/20


1198/1198 - 3s - 2ms/step - loss: 20004.0508 - mean_absolute_error: 116.0338


Epoch 13/20


1198/1198 - 3s - 2ms/step - loss: 14913.3730 - mean_absolute_error: 99.1776


Epoch 14/20


1198/1198 - 5s - 4ms/step - loss: 11080.7246 - mean_absolute_error: 84.7409


Epoch 15/20


1198/1198 - 3s - 2ms/step - loss: 8300.2324 - mean_absolute_error: 72.8080


Epoch 16/20


1198/1198 - 5s - 4ms/step - loss: 6329.8848 - mean_absolute_error: 63.4553


Epoch 17/20


1198/1198 - 3s - 2ms/step - loss: 4944.5898 - mean_absolute_error: 55.9410


Epoch 18/20


1198/1198 - 3s - 2ms/step - loss: 3967.5613 - mean_absolute_error: 50.0719


Epoch 19/20


1198/1198 - 3s - 2ms/step - loss: 3269.2473 - mean_absolute_error: 45.4482


Epoch 20/20


1198/1198 - 5s - 4ms/step - loss: 2749.2363 - mean_absolute_error: 41.6534


<keras.src.callbacks.history.History at 0x7f8f1d1aaf50>