In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [56]:
data = pd.read_csv('screen_time.csv')
data.head()

Unnamed: 0,Age,Gender,Screen Time Type,Day Type,Average Screen Time (hours),Sample Size
0,5,Male,Educational,Weekday,0.44,500
1,5,Male,Recreational,Weekday,1.11,500
2,5,Male,Total,Weekday,1.55,500
3,5,Male,Educational,Weekend,0.5,500
4,5,Male,Recreational,Weekend,1.44,500


A couple of notes:
- There are a lot of repeated entries. This is because Educational and Recreational sum to Total. If we're just trying to predict Average Screen Time, we should keep only the Total rows.
- Gender and Day Type are non-numeric, but they can easily be converted to 0/1 indicator columns, one per category. This is called One-Hot Encoding (mapping the categories to 0, 1, 2, etc. in a single column would be label encoding instead).
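
To make the difference between the two encodings concrete, here is a minimal sketch on a toy frame. The values and the names `toy`, `label_encoded`, `one_hot` and `one_hot_dropped` are made up for illustration and are not part of the screen-time data.

```python
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the two encodings
toy = pd.DataFrame({"Gender": ["Male", "Female", "Other", "Female"]})

# Integer (label) encoding: a single column, categories mapped to 0, 1, 2
label_encoded = toy["Gender"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=["Gender"])

# drop_first=True drops the first category, which becomes the implicit baseline
one_hot_dropped = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(label_encoded.tolist())
print(one_hot.columns.tolist())
print(one_hot_dropped.columns.tolist())
```

With `drop_first=True` a row of all-False indicators stands for the dropped category, which avoids feeding a linear model perfectly redundant columns.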

In [57]:
# Keep only the 'Total' rows, then one-hot encode the categorical columns
data = data[data['Screen Time Type'] == 'Total'].reset_index(drop=True)

data_encoded = pd.get_dummies(data, columns=['Gender', 'Day Type'], drop_first=True)
print(data_encoded.columns)

data_encoded.head()

Index(['Age', 'Screen Time Type', 'Average Screen Time (hours)', 'Sample Size',
       'Gender_Male', 'Gender_Other/Prefer not to say', 'Day Type_Weekend'],
      dtype='object')


Unnamed: 0,Age,Screen Time Type,Average Screen Time (hours),Sample Size,Gender_Male,Gender_Other/Prefer not to say,Day Type_Weekend
0,5,Total,1.55,500,True,False,False
1,5,Total,1.93,500,True,False,True
2,5,Total,1.45,500,False,False,False
3,5,Total,1.9,500,False,False,True
4,5,Total,1.5,500,False,True,False


In [58]:
# Columns like "Sample Size" and "Screen Time Type" are no longer useful, so we can just drop them
data_encoded.drop(columns=['Sample Size', 'Screen Time Type'], inplace=True)
data_encoded.head()

Unnamed: 0,Age,Average Screen Time (hours),Gender_Male,Gender_Other/Prefer not to say,Day Type_Weekend
0,5,1.55,True,False,False
1,5,1.93,True,False,True
2,5,1.45,False,False,False
3,5,1.9,False,False,True
4,5,1.5,False,True,False


In [59]:
X = data_encoded.drop(columns=['Average Screen Time (hours)'])
y = data_encoded['Average Screen Time (hours)']

# train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X:")
print(X.head())
print("\nY:")
print(y.head())

X:
   Age  Gender_Male  Gender_Other/Prefer not to say  Day Type_Weekend
0    5         True                           False             False
1    5         True                           False              True
2    5        False                           False             False
3    5        False                           False              True
4    5        False                            True             False

Y:
0    1.55
1    1.93
2    1.45
3    1.90
4    1.50
Name: Average Screen Time (hours), dtype: float64


In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# performance
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

Mean Squared Error: 0.053428322988865296
R^2 Score: 0.9883074754058151


In [61]:
# I just learned about feature scaling, let's give it a try and see if MSE improves for the LR model.
# It's a preprocessing technique: StandardScaler rescales each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

Mean Squared Error: 0.05342832298886539
R^2 Score: 0.9883074754058151


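This is a pretty good result. An R^2 of about 0.99 and an MSE of about 0.053 on the test set are good signs that the model fits well. Feature scaling had no noticeable effect: the MSE and R^2 are identical up to floating-point error. That makes sense, because plain least-squares linear regression gives the same predictions whether or not the features are standardized; only the coefficients change. I'll be sure to give it a try in a future project. We can try a few different models just for the sake of experiment.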
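
One way to check the scaled-versus-unscaled comparison independently of this dataset is to repeat it on synthetic data. The sketch below is only an illustration: the data is randomly generated, and the names `X_demo`, `y_demo`, `scaler_demo`, `raw_pred` and `scaled_pred` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the screen-time dataset): two features on different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(5, 15, 200), rng.integers(0, 2, 200)])
y_demo = 0.3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# Simple holdout: first 160 rows for fitting, last 40 for prediction
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr = y_demo[:160]

# Fit once on the raw features
raw_pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Fit again on standardized features (scaler fitted on the training rows only)
scaler_demo = StandardScaler().fit(X_tr)
scaled_pred = (
    LinearRegression()
    .fit(scaler_demo.transform(X_tr), y_tr)
    .predict(scaler_demo.transform(X_te))
)

# Check whether the two sets of predictions agree up to floating-point error
print(np.allclose(raw_pred, scaled_pred))
```

This holds for ordinary least squares specifically. Regularized models such as Ridge or Lasso, and distance-based models, are generally sensitive to feature scale.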

In [66]:
# another model I've heard of is the random forest.
# It will likely perform worse here, since the relationship looks close to linear
# (random forests do handle regression as well as classification, hence RandomForestRegressor)

from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# performance
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Mean Squared Error (Random Forest): {mse_rf}")
print(f"R^2 Score (Random Forest): {r2_rf}")

Mean Squared Error (Random Forest): 0.06746006571428617
R^2 Score (Random Forest): 0.9852366978156135


Seems like it performs slightly worse than the Linear Regression model, especially when comparing MSE (about 0.067 vs. 0.053). Its standalone performance is pretty good though.
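
The cells above report MSE and R^2 only. If MAE is also of interest, it can be computed with the same pattern. Here is a minimal sketch on made-up numbers, chosen so the arithmetic is easy to check by hand; `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the models above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values; the errors are 0.5, 0.0, 0.5 and 1.0
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 5.0])

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of |error|
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # mean of error**2

print(f"MAE: {mae_demo}")  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
print(f"MSE: {mse_demo}")  # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
```

MSE weights large errors more heavily than MAE does, so the two metrics can rank models differently when one model makes a few big misses.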

I wanted to try the Linear Regression because of the strong linear correlation between age and total screen time. I'm curious to see if the age parameter is significantly more important than the others.

In [None]:
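# feature importance: `model` was last fit on the standardized features,
# so the coefficients are on a comparable scale across features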
coefficients = model.coef_
feature_importance = pd.DataFrame({'Feature': X.columns, 'Coefficient': coefficients})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

print(feature_importance)

                          Feature  Coefficient
0                             Age     1.668452
3                Day Type_Weekend     0.501358
1                     Gender_Male     0.104132
2  Gender_Other/Prefer not to say     0.044301


As suspected, age is by far the most significant parameter: its coefficient is over triple that of the next most relevant parameter, Day Type. Because the model was fit on standardized features, the coefficients can be compared directly. Day Type being the second most important is pretty plausible too, since we saw in the EDA that weekends have very noticeably increased screen time.
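
Comparing coefficient sizes across features only makes sense when the features share a scale. The sketch below, on synthetic data with hypothetical names (`age_demo`, `weekend_demo`, `raw_model`, `std_model`), illustrates how raw and standardized coefficients relate. It is not a re-run of the analysis above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the screen-time dataset)
rng = np.random.default_rng(1)
age_demo = rng.integers(5, 16, 300).astype(float)     # ages 5..15
weekend_demo = rng.integers(0, 2, 300).astype(float)  # 0/1 indicator
X_demo = np.column_stack([age_demo, weekend_demo])
y_demo = 0.4 * age_demo + 1.0 * weekend_demo + rng.normal(0, 0.1, 300)

# Coefficients per raw unit (per year of age, per weekend flag)
raw_model = LinearRegression().fit(X_demo, y_demo)

# Coefficients per standard deviation of each feature
scaler_demo = StandardScaler().fit(X_demo)
std_model = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

print(raw_model.coef_)
print(std_model.coef_)

# Each standardized coefficient equals the raw coefficient times the feature's std
print(np.allclose(std_model.coef_, raw_model.coef_ * scaler_demo.scale_))
```

In this toy setup the raw coefficients rank the weekend flag above age, while the standardized ones rank age first, because age varies over a much wider range.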

In [None]:
# just for fun we can try XGBoost
# %pip install xgboost <- I didn't have it installed lol
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=100, random_state=42)
xgb.fit(X_train_scaled, y_train)

y_pred_xgb = xgb.predict(X_test_scaled)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"Mean Squared Error (XGBoost): {mse_xgb}")
print(f"R^2 Score (XGBoost): {r2_xgb}")

Mean Squared Error (XGBoost): 0.03371242713484535
R^2 Score (XGBoost): 0.9926222018331738


This result is pretty impressive: XGBoost's MSE (about 0.034) is well below the Linear Regression's (about 0.053). After some research, it seems that XGBoost is pretty solid for both regression and classification tasks. That being said, it's definitely more computationally intensive than linear regression.

In [None]:
# feature importance for XGBoost
xgb_feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(xgb_feature_importance)

                          Feature  Importance
0                             Age    0.841653
3                Day Type_Weekend    0.145493
1                     Gender_Male    0.010319
2  Gender_Other/Prefer not to say    0.002535


Pretty much the same result as we gathered from the EDA and the LR model: Age is the defining characteristic, followed by whether the data point was recorded on a weekend or a weekday.