<a href="https://colab.research.google.com/github/WajdAlsuhaymi/IT326-DataMining-Group2/blob/main/Reports/Phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 1 – Data Selection

## Project Goal
This project explores the relationship between social media engagement and emotional well-being. Through data analysis, it examines how factors such as time spent, purpose of use, and users’ demographic characteristics relate to psychological outcomes, with the aim of raising awareness about the psychological effects of social media.

## Dataset Source
Kaggle: [Social Media Usage and Emotional Well-being](https://www.kaggle.com/datasets/emirhanai/social-media-usage-and-emotional-well-being)


In [1]:
import pandas as pd


url = "https://raw.githubusercontent.com/WajdAlSuhaymi/IT326-DataMining-Group2/main/Dataset/Raw_dataset.csv"

# Read the file
df = pd.read_csv(url)

# Show the first 5 rows as a sanity check
df.head()


Unnamed: 0,User_ID,Age,Gender,Platform,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Dominant_Emotion
0,,,,,,,,,,
1,1.0,25.0,Female,Instagram,120.0,3.0,45.0,10.0,12.0,Happiness
2,,,,,,,,,,
3,2.0,30.0,Male,Twitter,90.0,5.0,20.0,25.0,30.0,Anger
4,,,,,,,,,,


In [2]:
print("Instances (rows), Features (cols):", df.shape)

summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "n_missing": df.isna().sum()
}).sort_index()

display(summary)  # attribute dtypes + number of unique values + missing values
print("\nTop columns with missing values:")
display(summary.sort_values("n_missing", ascending=False).head(10))


Instances (rows), Features (cols): (2004, 10)


Unnamed: 0,dtype,n_unique,n_missing
Age,object,19,1003
Comments_Received_Per_Day,float64,30,1004
Daily_Usage_Time (minutes),float64,30,1004
Dominant_Emotion,object,6,1004
Gender,object,18,1004
Likes_Received_Per_Day,float64,49,1004
Messages_Sent_Per_Day,float64,29,1004
Platform,object,7,1004
Posts_Per_Day,float64,8,1004
User_ID,object,1001,1003



Top columns with missing values:


Unnamed: 0,dtype,n_unique,n_missing
Comments_Received_Per_Day,float64,30,1004
Daily_Usage_Time (minutes),float64,30,1004
Messages_Sent_Per_Day,float64,29,1004
Dominant_Emotion,object,6,1004
Gender,object,18,1004
Likes_Received_Per_Day,float64,49,1004
Posts_Per_Day,float64,8,1004
Platform,object,7,1004
Age,object,19,1003
User_ID,object,1001,1003
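
The summary above reports about 1,004 missing values in nearly every column out of 2,004 rows, and the preview shows completely blank rows alternating with populated ones. A minimal sketch of how such rows could be dropped, assuming they carry no information; the toy frame `raw` and its values are illustrative and not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern seen above: blank rows interleaved with real ones.
raw = pd.DataFrame({
    "User_ID": [np.nan, "1", np.nan, "2", np.nan],
    "Platform": [np.nan, "Instagram", np.nan, "Twitter", np.nan],
    "Posts_Per_Day": [np.nan, 3.0, np.nan, 5.0, np.nan],
})

# how="all" removes a row only when every column is missing,
# so partially filled rows are kept for later inspection.
cleaned = raw.dropna(how="all").reset_index(drop=True)

print("Before:", raw.shape, "After:", cleaned.shape)
```

Using `how="all"` rather than the default `how="any"` is a deliberate choice here: it avoids discarding rows that are only partly incomplete.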


In [3]:
# Candidate label names; note the exact capitalization of 'Dominant_Emotion'.
possible_labels = ["Dominant_Emotion", "label", "target", "class", "diagnosis", "outcome"]
print("Columns:", list(df.columns)[:20], "...")

label_col = df.columns[-1]  # last column, here 'Dominant_Emotion'

if label_col in possible_labels:
    # dropna=False keeps NaN as its own group, so it is included in the count below
    vc = df[label_col].value_counts(dropna=False).to_frame("count")
    print(f"\nLabel column: {label_col}")
    display(vc)
    print("Number of classes:", vc.shape[0])
else:
    raise ValueError(f"Label column '{label_col}' not found. Choose one of: {possible_labels}")


Columns: ['User_ID', 'Age', 'Gender', 'Platform', 'Daily_Usage_Time (minutes)', 'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day', 'Dominant_Emotion'] ...

Label column: Dominant_Emotion


Dominant_Emotion,count
NaN,1004
Happiness,200
Neutral,200
Anxiety,170
Sadness,160
Boredom,140
Anger,130


Number of classes: 7
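
The count of 7 above includes the missing-value group, because `value_counts(dropna=False)` keeps NaN as its own row; the table lists six named emotions. A small sketch of the difference on a toy Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy label column with two missing entries.
labels = pd.Series(["Happiness", "Anger", np.nan, "Happiness", np.nan, "Neutral"])

# NaN is counted as its own group when dropna=False.
groups_with_nan = labels.value_counts(dropna=False).shape[0]

# nunique() ignores NaN by default, giving the number of real classes.
distinct_classes = labels.nunique()

print("Groups including NaN:", groups_with_nan)
print("Distinct classes:", distinct_classes)
```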


In [4]:
print("Sample rows:")
display(df.sample(5, random_state=42))

# Quick numeric summary of the numeric columns
num_cols = df.select_dtypes(include=['number']).columns
if len(num_cols):
    print("\nNumeric columns describe():")
    display(df[num_cols].describe().T)
else:
    print("\nNo numeric columns detected.")


Sample rows:


Unnamed: 0,User_ID,Age,Gender,Platform,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Dominant_Emotion
585,293.0,24.0,Male,Telegram,75.0,3.0,37.0,16.0,22.0,Neutral
1284,,,,,,,,,,
1851,924.0,33.0,Non-binary,Instagram,190.0,8.0,105.0,36.0,50.0,Happiness
495,248.0,26.0,Non-binary,Facebook,75.0,2.0,30.0,12.0,18.0,Anxiety
1901,949.0,32.0,Male,Instagram,130.0,5.0,80.0,28.0,31.0,Neutral



Numeric columns describe():


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Daily_Usage_Time (minutes),1000.0,95.95,38.850442,40.0,65.0,85.0,120.0,200.0
Posts_Per_Day,1000.0,3.321,1.914582,1.0,2.0,3.0,4.0,8.0
Likes_Received_Per_Day,1000.0,39.898,26.393867,5.0,20.0,33.0,55.0,110.0
Comments_Received_Per_Day,1000.0,15.611,8.819493,2.0,8.0,14.0,22.0,40.0
Messages_Sent_Per_Day,1000.0,22.56,8.516274,8.0,17.75,22.0,28.0,50.0
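
`Age` does not appear in the numeric summary because it was read as `object` dtype (see the dtype table earlier), which usually means at least one value is not a plain number. A hedged sketch of one way to coerce such a column and list the offending entries; the sample values below are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical Age values: numbers stored as text, one stray string, one missing entry.
ages = pd.Series(["25", "30", "unknown", None, "22"], name="Age")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
age_num = pd.to_numeric(ages, errors="coerce")

# Entries that were present in the original but failed to parse.
unparsed = ages[age_num.isna() & ages.notna()]

print(age_num.dtype)
print("Unparsed values:", unparsed.tolist())
```

Inspecting `unparsed` before overwriting the column makes it possible to decide whether those rows should be repaired or dropped.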


In [5]:
has_size = (df.shape[0] >= 500) and (df.shape[1] >= 10)
# Column names are case-sensitive: the label is 'Dominant_Emotion'
has_label = ('Dominant_Emotion' in df.columns)

print(">=500 rows and >=10 columns ?", has_size)
print("Has label column ?", has_label)


>=500 rows and >=10 columns ? True
Has label column ? True
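
Membership tests on `df.columns` are case-sensitive, so `'Dominant_emotion'` and `'Dominant_Emotion'` are treated as different names. A minimal sketch of a case-insensitive lookup; `find_column` is a hypothetical helper written for illustration, not part of pandas:

```python
import pandas as pd

def find_column(frame, name):
    """Return the actual column whose name matches `name` ignoring case, or None."""
    lookup = {str(col).lower(): col for col in frame.columns}
    return lookup.get(name.lower())

# Empty demo frame with the same capitalization as the dataset's label column.
demo = pd.DataFrame(columns=["User_ID", "Dominant_Emotion"])

print(find_column(demo, "Dominant_emotion"))
print(find_column(demo, "label"))
```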


In [6]:
import pandas as pd
import numpy as np


# url = "https://raw.githubusercontent.com/WajdAlSuhaymi/IT326-DataMining-Group2/main/Dataset/Raw_dataset.csv"
# df = pd.read_csv(url)

pd.set_option('display.max_columns', None)
print("Loaded shape:", df.shape)
df.head(3)


Loaded shape: (2004, 10)


Unnamed: 0,User_ID,Age,Gender,Platform,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Dominant_Emotion
0,,,,,,,,,,
1,1.0,25.0,Female,Instagram,120.0,3.0,45.0,10.0,12.0,Happiness
2,,,,,,,,,,
