# Social Media Habits, Productivity Loss, and Brain Rot
**Author: Clarissa Dominguez**


**Problem Statement:** Social media is known to affect a person’s productivity, and phrases like doom scrolling and brain rot have emerged to describe endless scrolling on social media. In this project, I will analyze how social media usage relates to both productivity and brain rot. Using two datasets, a brain rot questionnaire with yes/no responses and a ‘Time-Wasters on Social Media’ dataset, I will study how patterns in social media behavior, self-control, and addiction relate to reduced productivity, difficulty concentrating, and negative moods. My goal is to identify which specific usage patterns and demographic factors are most strongly associated with ‘doom scrolling’ on social media and with brain-rot-style outcomes.


In [125]:
import kagglehub

# Download latest version
brain_rot_path = kagglehub.dataset_download("ahmedtakleefalhasani/brain-rot-dataset")

print("Path to dataset files:", brain_rot_path)


# Download latest version
social_media_path = kagglehub.dataset_download("muhammadroshaanriaz/time-wasters-on-social-media")

print("Path to dataset files:", social_media_path)

Using Colab cache for faster access to the 'brain-rot-dataset' dataset.
Path to dataset files: /kaggle/input/brain-rot-dataset
Using Colab cache for faster access to the 'time-wasters-on-social-media' dataset.
Path to dataset files: /kaggle/input/time-wasters-on-social-media


In [126]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, classification_report, confusion_matrix
)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

brain_rot_file = f"{brain_rot_path}/Brain Rot Cases.csv"
social_media_file = f"{social_media_path}/Time-Wasters on Social Media.csv"

brain_rot_df = pd.read_csv(brain_rot_file)
social_media_df = pd.read_csv(social_media_file)
#print(social_media_df)
brain_rot_df.columns

Index(['Age', 'Gender', 'Educational Level',
       'How many hours do you spend daily on social media',
       'Do you prefer using technology for work or leisure?',
       'How many hours do you spend in front of computer screens or smartphones daily?',
       'Do you feel that you use social media excessively?',
       'Do social media platforms affect your social relationships?',
       'Have you noticed a change in your mood after using technology?',
       'How would you rate your level of focus while working or studying? On a scale from 1 to 5, where 1 = very poor and 5 = excellent.',
       'Do you find it difficult to concentrate when using technology (social media)?',
       'How often do you feel distracted while trying to work or study due to technology (social media)?',
       'How would you rate your level of concern regarding technology use, particularly social media? On a scale from 1 to 5, where 1 = very low and 5 = excellent.',
       'Do you believe that your use of 

# Preprocess and clean brain_rot data

-rename columns so they're easier to access in code

In [127]:
brain_rot = brain_rot_df.copy()  # copy so original isn't modified
brain_rot = brain_rot.rename(columns={
    'Age': 'age',
    'Gender': 'gender',
    'Educational Level': 'edu_level',
    'How many hours do you spend daily on social media': 'social_media_hours',
    'Do you prefer using technology for work or leisure?': 'tech_preference',
    'How many hours do you spend in front of computer screens or smartphones daily?': 'screen_time_hours',
    'Do you feel that you use social media excessively?': 'excessive_use',
    'Do social media platforms affect your social relationships?': 'affect_relationships',
    'Have you noticed a change in your mood after using technology?': 'mood_change',
    'How would you rate your level of focus while working or studying? On a scale from 1 to 5, where 1 = very poor and 5 = excellent.': 'focus_score',
    'Do you find it difficult to concentrate when using technology (social media)?': 'difficulty_concentrating',
    'How often do you feel distracted while trying to work or study due to technology (social media)?': 'distraction_freq',
    'How would you rate your level of concern regarding technology use, particularly social media? On a scale from 1 to 5, where 1 = very low and 5 = excellent.': 'concern_level',
    'Do you believe that your use of social media negatively affects y How would you rate your level of concern regarding technology use, particularly social media? On a scale from 1 to 5, where 1 = very low and 5 = excellent.our mental health?': 'bad_header_drop',
    'Do you believe that your use of social media negatively affects your mental health?': 'negative_mental_health',
    'What platforms do you primarily use?': 'platforms',
    '"Do you think that using certain applications can help reduce feelings of mental fatigue?"\n': 'apps_help_fatigue',
})

# Justifying Brain_rot patterns

First, I will construct brain_rot as a binary label that flags people who match multiple risk patterns related to heavy social media use (higher daily social media hours, longer screen time) and negative mental health indicators. A respondent is labeled 1 when at least three of the six patterns below are met, and 0 otherwise.

**Brain rot patterns:**

• Social media use ≥3 hours per day

• Screen time ≥ 4 hours per day

• Focus level ≤ 2 out of 5

• Reports frequent distraction

• Notices a mood change after using technology

• Thinks social media is bad for mental health

In [128]:
pattern_columns = [
    'social_media_hours',
    'screen_time_hours',
    'focus_score',
    'distraction_freq',
    'mood_change',
    'negative_mental_health'
]

br = brain_rot.dropna(subset=pattern_columns).copy()

# Create brain_rot label based on patterns
patterns_count = (
    (br['social_media_hours'] >= 3).astype(int) +
    (br['screen_time_hours'] >= 4).astype(int) +
    (br['focus_score'] <= 2).astype(int) +
    (br['distraction_freq'] >= 4).astype(int) +
    (br['mood_change'] == 1).astype(int) +
    (br['negative_mental_health'] == 1).astype(int)
)

br['brain_rot'] = (patterns_count >= 3).astype(int)

print(br['brain_rot'].value_counts())

brain_rot
0    1120
1     555
Name: count, dtype: int64


# Training Random Forest Classifier on Brain Rot dataset

After defining “brain rot” as a binary label, I trained a Random Forest classifier on the Brain Rot survey dataset. I used all survey responses as input features, including the six columns that define the label, one-hot encoded the categorical variables, and split the data into training and test sets using a stratified 80/20 split. The Random Forest model was trained with 200 trees using entropy as the splitting criterion.
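Because the brain_rot label is computed directly from survey columns, one check worth running is how accuracy changes when those label-defining columns are left out of the features. The sketch below illustrates the idea on synthetic data only; the `toy` frame, its values, and the `holdout_accuracy` helper are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey (hypothetical values, not the real data)
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    'social_media_hours': rng.integers(0, 8, n),
    'screen_time_hours': rng.integers(0, 12, n),
    'focus_score': rng.integers(1, 6, n),
    'age': rng.integers(18, 60, n),  # unrelated to the label by construction
})

# Rule-based label built from the first three columns
pattern_count = (
    (toy['social_media_hours'] >= 3).astype(int)
    + (toy['screen_time_hours'] >= 4).astype(int)
    + (toy['focus_score'] <= 2).astype(int)
)
toy['label'] = (pattern_count >= 2).astype(int)


def holdout_accuracy(feature_cols):
    """Train a Random Forest on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], toy['label'],
        test_size=0.2, random_state=42, stratify=toy['label']
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# Features that include the columns used to build the label
acc_with_rule_cols = holdout_accuracy(
    ['social_media_hours', 'screen_time_hours', 'focus_score', 'age']
)
# Features that exclude them
acc_without_rule_cols = holdout_accuracy(['age'])

print("With label-defining columns:", round(acc_with_rule_cols, 3))
print("Without label-defining columns:", round(acc_without_rule_cols, 3))
```

A large gap between the two numbers suggests the model is mostly recovering the labeling rule rather than finding independent predictors.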

In [129]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


X_br = br.drop(columns=['brain_rot'])
y_br = br['brain_rot']


X_br_encode = pd.get_dummies(X_br, drop_first=True)

x_train_br, x_test_br, y_train_br, y_test_br = train_test_split(
    X_br_encode, y_br,
    test_size=0.2,
    random_state=42,
    stratify=y_br
)


random_forest_br = RandomForestClassifier(
    n_estimators=200,
    criterion='entropy',
    n_jobs=-1,
    random_state=42
)

random_forest_br.fit(x_train_br, y_train_br)
pred_br = random_forest_br.predict(x_test_br)

print("Brain Rot - Accuracy:",
      accuracy_score(y_test_br, pred_br))
print("Confusion matrix:\n",
      confusion_matrix(y_test_br, pred_br))
print("\nClassification report:\n",
      classification_report(y_test_br, pred_br, digits=3))


importances_br = pd.Series(random_forest_br.feature_importances_, index=X_br_encode.columns)
print("\nTop 5 brain-rot random forest feature importances:")
print(importances_br.sort_values(ascending=False).head(5))

Brain Rot - Accuracy: 0.9970149253731343
Confusion matrix:
 [[224   0]
 [  1 110]]

Classification report:
               precision    recall  f1-score   support

           0      0.996     1.000     0.998       224
           1      1.000     0.991     0.995       111

    accuracy                          0.997       335
   macro avg      0.998     0.995     0.997       335
weighted avg      0.997     0.997     0.997       335


Top 5 brain-rot random forest feature importances:
social_media_hours        0.358591
mood_change               0.104767
focus_score               0.074597
screen_time_hours         0.073790
negative_mental_health    0.053072
dtype: float64


**Results:**

The model achieved very strong performance with an overall accuracy of 0.997. The confusion matrix shows that out of 335 test examples only one positive case was misclassified:

• True “no brain rot” (class 0): 224 correctly predicted, 0 misclassified

• True “brain rot” (class 1): 110 correctly predicted, 1 misclassified

This is reflected in the classification report: precision and recall are above 0.99 for both classes, and the F1-scores are 0.998 for class 0 and 0.995 for class 1.



The Random Forest “feature importance” measures how much each feature reduces impurity (here, entropy) across all trees. Higher values mean that the feature is used more often in informative splits. The most important feature was daily social media hours (0.359), which means respondents with higher daily social media usage are much more likely to be classified as having brain rot. All five top features are columns used to construct the label, so these importances largely reflect the labeling rule itself.
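Impurity-based importances are computed from the training data, so a common cross-check is permutation importance on the held-out set, which measures how much the test score drops when one feature is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the column names `signal`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 'signal' determines the label (hypothetical columns)
rng = np.random.default_rng(0)
n = 800
X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
y = (X['signal'] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (what feature_importances_ reports)
impurity_imp = pd.Series(model.feature_importances_, index=X.columns)

# Permutation importances measured on the test split
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=X.columns)

print(pd.DataFrame({'impurity': impurity_imp, 'permutation': perm_imp}))
```

When both rankings agree, the importance ordering is less likely to be an artifact of how the trees were grown.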

# Regression for ProductivityLoss

For the regression part of my project, I used the Time-Wasters on Social Media dataset to see how different usage patterns relate to productivity loss. I treated ProductivityLoss as a continuous variable and trained a Random Forest Regressor using a mix of behavioral and psychological features, such as Satisfaction, Addiction Level, Self Control, total time spent, and number of sessions. I first removed rows with missing values in the target and redundant columns like ID indicators, then one-hot encoded the categorical features and split the data into an 80/20 train/test split.

In [130]:
social_media_df.columns

Index(['UserID', 'Age', 'Gender', 'Location', 'Income', 'Debt',
       'Owns Property', 'Profession', 'Demographics', 'Platform',
       'Total Time Spent', 'Number of Sessions', 'Video ID', 'Video Category',
       'Video Length', 'Engagement', 'Importance Score', 'Time Spent On Video',
       'Number of Videos Watched', 'Scroll Rate', 'Frequency',
       'ProductivityLoss', 'Satisfaction', 'Watch Reason', 'DeviceType', 'OS',
       'Watch Time', 'Self Control', 'Addiction Level', 'CurrentActivity',
       'ConnectionType'],
      dtype='object')

In [131]:
from sklearn.ensemble import RandomForestRegressor

time_wasters = social_media_df.copy()  # copy so original isn't modified

target = 'ProductivityLoss'


print("Using target column for regression:", target)

time_wasters = time_wasters.dropna(subset=[target])


cols_to_drop = ['UserID', 'Video ID']
cols_to_drop = [c for c in cols_to_drop if c in time_wasters.columns]


X_reg = time_wasters.drop(columns=[target] + cols_to_drop)
y_reg = time_wasters[target]


numeric_cols_regression = X_reg.select_dtypes(exclude='object').columns
categorical_cols_regression = X_reg.select_dtypes(include='object').columns


X_regression_categorical = pd.get_dummies(X_reg[categorical_cols_regression], drop_first=True)
X_regression_numeric = X_reg[numeric_cols_regression]


X_regression_encode = pd.concat([X_regression_numeric, X_regression_categorical], axis=1)


x_train_reg, x_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_regression_encode, y_reg, test_size=0.2, random_state=42
)


random_forest_reg = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

random_forest_reg.fit(x_train_reg, y_train_reg)


pred_reg = random_forest_reg.predict(x_test_reg)


mse = mean_squared_error(y_test_reg, pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, pred_reg)

print(f"Regression – {target}: RMSE={rmse:.3f}, R²={r2:.3f}")


importances_reg = pd.Series(random_forest_reg.feature_importances_, index=X_regression_encode.columns)
print("\nTop 5 regression feature importances:")
print(importances_reg.sort_values(ascending=False).head(5))

Using target column for regression: ProductivityLoss
Regression – ProductivityLoss: RMSE=0.057, R²=0.999

Top 5 regression feature importances:
Satisfaction        0.405977
Addiction Level     0.304121
Self Control        0.289903
Age                 0.000000
Total Time Spent    0.000000
dtype: float64


**Results:**

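The model reached an RMSE of 0.057 and an R² of 0.999 on the test set, meaning it predicted the reported productivity loss scores almost perfectly. The feature importances show that nearly all of this comes from three columns: Satisfaction, Addiction Level, and Self Control have importances that sum to about 1.0, while Age and Total Time Spent contribute essentially nothing. This supports my original goal of using social media behavior and self-regulation measures to explain how strongly users feel their productivity is being affected.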
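Since the importances above concentrate on three columns, one way to confirm that the rest of the features add little is to refit on only those columns and compare the test R². The sketch below shows the comparison on synthetic data; the `driver_*` and `noise_*` columns, the target formula, and the `holdout_r2` helper are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends only on the three 'driver' columns
rng = np.random.default_rng(1)
n = 1500
toy = pd.DataFrame({
    'driver_a': rng.integers(0, 7, n),
    'driver_b': rng.integers(0, 7, n),
    'driver_c': rng.integers(0, 7, n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
toy['target'] = 9 - toy['driver_a'] + 0.5 * toy['driver_b'] - 0.5 * toy['driver_c']


def holdout_r2(feature_cols):
    """Fit a Random Forest regressor on the given columns and return test R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], toy['target'], test_size=0.2, random_state=1
    )
    model = RandomForestRegressor(n_estimators=100, random_state=1)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_all = holdout_r2(['driver_a', 'driver_b', 'driver_c', 'noise_1', 'noise_2'])
r2_drivers = holdout_r2(['driver_a', 'driver_b', 'driver_c'])

print(f"All features:      R^2 = {r2_all:.3f}")
print(f"Top three only:    R^2 = {r2_drivers:.3f}")
```

If the reduced model matches the full one, the target is effectively a function of those few columns, and the remaining features can be dropped without losing predictive power.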

# Classification on High Productivity Loss & High Addiction
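In addition to predicting a continuous productivity score, I framed two high-risk classification problems using the Time-Wasters on Social Media dataset. First, I defined a high productivity loss label by marking users at or above the 75th percentile of the ProductivityLoss distribution as 1 and everyone else as 0. Because many users share the same score at that threshold, this flags more than 25% of users (98 of the 200 test users for productivity loss). I did the same for Addiction Level to create a high addiction label (63 of the 200 test users).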

For each task, I removed the raw score used to define the label along with the ID columns (for the high addiction task, I also removed ProductivityLoss), one-hot encoded the remaining features such as demographics, usage patterns, satisfaction, self-control, and time of day, and trained a Random Forest classifier with a stratified 80/20 train/test split.
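A threshold of the form `score >= quantile(0.75)` only flags exactly 25% of rows when there are no ties at the cutoff. The small worked example below, using made-up integer scores, shows how ties can inflate the flagged share, and contrasts it with a rank-based cut that always selects a fixed fraction (at the cost of breaking ties arbitrarily).

```python
import pandas as pd

# Hypothetical scores with heavy ties: 20 ones, 30 fives, 50 eights
scores = pd.Series([1] * 20 + [5] * 30 + [8] * 50)

# Quantile-based label, as in `>= quantile(0.75)`
threshold = scores.quantile(0.75)
flagged_share = (scores >= threshold).mean()

# Rank-based alternative: take exactly the top 25% of rows,
# breaking ties by order of appearance
top_n = int(0.25 * len(scores))
rank_flag = scores.rank(method='first', ascending=False) <= top_n
rank_share = rank_flag.mean()

print("75th percentile:", threshold)             # 8.0
print("Share flagged by >= threshold:", flagged_share)  # 0.5
print("Share flagged by rank cut:", rank_share)          # 0.25
```

Checking `value_counts()` on the resulting label is a quick way to see which situation applies to a given column.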

In [132]:
tw_classification = time_wasters.copy()


loss_column = 'ProductivityLoss'
addiction_col = 'Addiction Level'
high_loss_threshold = tw_classification[loss_column].quantile(0.75)
tw_classification['high_prod_loss'] = (tw_classification[loss_column] >= high_loss_threshold).astype(int)



high_add_threshold = tw_classification[addiction_col].quantile(0.75)
tw_classification['high_addiction'] = (tw_classification[addiction_col] >= high_add_threshold).astype(int)



cols_to_exclude_loss = [
    loss_column,
    'high_prod_loss',
    'high_addiction',
    'UserID',
    'Video ID'
]

X_loss = tw_classification.drop(columns=[c for c in cols_to_exclude_loss if c in tw_classification.columns])
y_high_loss = tw_classification['high_prod_loss']


X_loss_encode = pd.get_dummies(X_loss, drop_first=True)

x_train_hl, x_test_hl, y_train_hl, y_test_hl = train_test_split(
    X_loss_encode, y_high_loss,
    test_size=0.2,
    random_state=42,
    stratify=y_high_loss
)

rf_high_loss = RandomForestClassifier(
    n_estimators=200,
    criterion='entropy',
    n_jobs=-1,
    random_state=42
)

rf_high_loss.fit(x_train_hl, y_train_hl)
pred_hl = rf_high_loss.predict(x_test_hl)

print("Classification - High Productivity Loss")
print("Accuracy:", accuracy_score(y_test_hl, pred_hl))
print("Confusion matrix:\n", confusion_matrix(y_test_hl, pred_hl))
print("\nClassification report:\n", classification_report(y_test_hl, pred_hl, digits=3))

importances_hl = pd.Series(rf_high_loss.feature_importances_, index=X_loss_encode.columns)
print("\nTop 5 features for high productivity loss:")
print(importances_hl.sort_values(ascending=False).head(5))


cols_to_exclude_add = [
    addiction_col,
    'high_addiction',
    'high_prod_loss',
    loss_column,
    'UserID',
    'Video ID'
]

X_addicition = tw_classification.drop(columns=[c for c in cols_to_exclude_add if c in tw_classification.columns])
y_high_add = tw_classification['high_addiction']

X_addicition_enc = pd.get_dummies(X_addicition, drop_first=True)

x_train_ha, x_test_ha, y_train_ha, y_test_ha = train_test_split(
    X_addicition_enc, y_high_add,
    test_size=0.2,
    random_state=42,
    stratify=y_high_add
)

rf_high_add = RandomForestClassifier(
    n_estimators=200,
    criterion='entropy',
    n_jobs=-1,
    random_state=42
)

rf_high_add.fit(x_train_ha, y_train_ha)
pred_ha = rf_high_add.predict(x_test_ha)

print("-----------------------------------------")

print("\nClassification - High Addiction ")
print("Accuracy:", accuracy_score(y_test_ha, pred_ha))
print("Confusion matrix:\n", confusion_matrix(y_test_ha, pred_ha))
print("\nClassification report:\n", classification_report(y_test_ha, pred_ha, digits=3))

importances_ha = pd.Series(rf_high_add.feature_importances_, index=X_addicition_enc.columns)
print("\nTop 5 features for high addiction:")
print(importances_ha.sort_values(ascending=False).head(5))

Classification - High Productivity Loss
Accuracy: 0.995
Confusion matrix:
 [[102   0]
 [  1  97]]

Classification report:
               precision    recall  f1-score   support

           0      0.990     1.000     0.995       102
           1      1.000     0.990     0.995        98

    accuracy                          0.995       200
   macro avg      0.995     0.995     0.995       200
weighted avg      0.995     0.995     0.995       200


Top 5 features for high productivity loss:
Self Control         0.308687
Satisfaction         0.291918
Addiction Level      0.232056
Frequency_Night      0.035381
Frequency_Morning    0.016136
dtype: float64
-----------------------------------------

Classification - High Addiction 
Accuracy: 1.0
Confusion matrix:
 [[137   0]
 [  0  63]]

Classification report:
               precision    recall  f1-score   support

           0      1.000     1.000     1.000       137
           1      1.000     1.000     1.000        63

    accuracy        

**Results:**

- The high productivity loss classifier reached 0.995 accuracy.

The model performed extremely well, reaching an accuracy of 0.995 on the test set and correctly classifying 199 out of 200 users. The confusion matrix shows only one high-loss user misclassified, and precision and recall for both classes are 0.99 or higher. Feature importance values indicate that Self Control, Satisfaction, and Addiction Level dominate the model, followed by much smaller contributions from usage timing features like Frequency_Night and Frequency_Morning. In conclusion, users who report low self-control, low satisfaction, and higher addiction scores, especially when they tend to use social media at night, are the ones most likely to fall into the high productivity-loss group.


- The high addiction classifier reached 100% accuracy.

This model's performance is even better. It achieved 100% accuracy on the test data, with perfect precision and recall for both the high and low addiction classes. Again, the most important features are Satisfaction and Self Control, with nighttime and evening usage (Frequency_Night, late Watch Time windows) also contributing to the decision. Income and engagement show up, but with relatively small importance values. These results are consistent with the high productivity loss classifier.

# Clustering demographic features
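Lastly, I clustered users based only on demographic and device features (age, gender, location, income, debt, property ownership, profession, platform, device type, OS, and connection type). I standardized the numeric columns, one-hot encoded the categorical ones, and ran K-means with three clusters. I ended up with three groups that looked different on paper but were surprisingly similar in terms of risk.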



In [133]:
demographic_features = [
    'Age',
    'Gender',
    'Location',
    'Income',
    'Debt',
    'Owns Property',
    'Profession',
    'Demographics',
    'Platform',
    'DeviceType',
    'OS',
    'ConnectionType'
]


demographic_features = [c for c in demographic_features if c in tw_classification.columns]

tw_demographics = tw_classification[demographic_features].copy()


num_demograhic_cols = tw_demographics.select_dtypes(exclude='object').columns.tolist()
cat_demographic_cols = tw_demographics.select_dtypes(include='object').columns.tolist()


scaler = StandardScaler()
tw_demographics_num_scaled = pd.DataFrame(
    scaler.fit_transform(tw_demographics[num_demograhic_cols]),
    columns=num_demograhic_cols,
    index=tw_demographics.index
)


tw_demographics_cat_enc = pd.get_dummies(tw_demographics[cat_demographic_cols], drop_first=True)


X_demo_cluster = pd.concat([tw_demographics_num_scaled, tw_demographics_cat_enc], axis=1)


kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10)
tw_classification['demo_cluster'] = kmeans_demo.fit_predict(X_demo_cluster)


summary_cols = ['ProductivityLoss', 'high_prod_loss', 'high_addiction']
summary_cols = [c for c in summary_cols if c in tw_classification.columns]

print("Cluster-wise averages (outcomes + key demographics):")
print(
    tw_classification.groupby('demo_cluster')[summary_cols + [
        'Age',
        'Income',
        'Debt'
    ]].mean().round(2)
)

print("\nCluster sizes:")
print(tw_classification['demo_cluster'].value_counts())

print("\nNumeric in demographic clustering:", num_demograhic_cols)
print("Categorical in demographic clustering:", cat_demographic_cols)
print("\nGender distribution by demographic cluster:")
print(tw_classification.groupby('demo_cluster')['Gender'].value_counts(normalize=True))

Cluster-wise averages (outcomes + key demographics):
              ProductivityLoss  high_prod_loss  high_addiction    Age  \
demo_cluster                                                            
0                         5.30            0.50            0.29  39.83   
1                         5.00            0.46            0.33  41.55   
2                         5.14            0.51            0.33  41.53   

                Income  Debt  
demo_cluster                  
0             76305.71   1.0  
1             34874.55   0.0  
2             75694.02   1.0  

Cluster sizes:
demo_cluster
1    401
0    325
2    274
Name: count, dtype: int64

Numeric in demographic clustering: ['Age', 'Income', 'Debt', 'Owns Property']
Categorical in demographic clustering: ['Gender', 'Location', 'Profession', 'Demographics', 'Platform', 'DeviceType', 'OS', 'ConnectionType']

Gender distribution by demographic cluster:
demo_cluster  Gender
0             Male      0.498462
              Female    

**Results:**

One cluster had higher-income users with debt, another had lower-income users with no debt, and the third looked similar to the higher-income/debt group. However, across all three clusters, the average productivity loss stayed around 5, and the share of users flagged as high productivity loss (about 46–51%) and high addiction (about 29–33%) was almost the same.

The gender breakdown was also very similar in every cluster. These results suggest that demographic differences alone don’t really separate high-risk from low-risk users in this dataset. This lines up with my earlier findings: psychological factors like self-control, satisfaction, and addiction level, along with when people are online, are much more important for identifying who is actually at risk than demographics by themselves.
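The clustering above fixes the number of clusters at three. One common way to sanity-check that choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs from `make_blobs`, not on the Time-Wasters data; the centers and the range of k are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Fit K-means for several k and record the mean silhouette score of each
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_blobs)
    silhouette_by_k[k] = silhouette_score(X_blobs, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)

for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

Low silhouette scores at every k would be another sign that the demographic features do not form well-separated groups.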

**Conclusion:**

To conclude, this project ties my idea of “brain rot” and productivity loss back to the actual data. Using the Brain Rot survey, I was able to build a classifier that almost perfectly separated people with and without the brain-rot pattern I defined. The most important signals were not random: spending more hours on social media, longer overall screen time, feeling worse after using technology, and reporting low focus all lined up with the brain-rot label.

On the Time-Wasters dataset, the regression and classification models told a similar story. Productivity loss was predicted almost perfectly from just a few self-reported factors: satisfaction with social media, addiction level, and self-control. When I turned the problem into a high-risk classification, the models for high productivity loss and high addiction both reached very high accuracy, and again the main drivers were self-control, satisfaction, addiction level, and late night usage, not just the time spent. The clustering results reinforced this: demographic clusters (age, income, debt, gender) showed very similar risk levels across groups, in contrast to the behavioral and psychological features that drove the supervised models.

Overall, these results support my original idea that “brain rot” and productivity loss are less about who you are and more about how you use social media and how in control you feel. Heavy usage combined with low self-control, low satisfaction, and late night scrolling is what really shows up as risky in the models, which suggests that interventions should probably focus on habits and self-regulation rather than just telling people to look at their screen time total.

**Resources:**

https://www.kaggle.com/datasets/ahmedtakleefalhasani/brain-rot-dataset


https://www.kaggle.com/datasets/muhammadroshaanriaz/time-wasters-on-social-media/data
