# Multimodal Depression Detection Dataset - Exploration

This notebook explores the structure and characteristics of the multimodal depression detection dataset.

## Dataset Overview

The dataset contains **9 modalities** of data from multiple users:
1. **app_usage** - Mobile/app usage patterns
2. **calendar** - Calendar events and schedule data
3. **call_log** - Phone call records
4. **SMS** - Text messaging data
5. **sensing** - Sensor data (activity, audio, Bluetooth, GPS, WiFi, etc.)
6. **EMA** - Ecological Momentary Assessment (self-reported data)
7. **dinning** - Dining/meal information
8. **education** - Education-related data (classes, grades, deadlines, piazza)
9. **survey** - Survey responses (PHQ-9, Big Five, stress, loneliness, etc.)

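**Depression Ground Truth**: PHQ-9 (Patient Health Questionnaire-9) responses in `survey/PHQ-9.csv`. The file stores item-level answers as text (for example, "Not at all"), so a total score has to be derived from them.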

In [1]:
import pandas as pd
import numpy as np
import os
import json
from pathlib import Path

# Define dataset paths
dataset_path = Path(r'c:\Users\arnol\Documents\Dev\multimodal-depression-detection\data\raw\dataset')

# Check dataset structure
print("=" * 80)
print("DATASET DIRECTORY STRUCTURE")
print("=" * 80)

for modality_dir in sorted(dataset_path.iterdir()):
    if modality_dir.is_dir():
        file_count = len(list(modality_dir.glob('*')))
        print(f"\n{modality_dir.name}/")
        print(f"  Files: {file_count}")
        # Show first few files
        for i, file in enumerate(sorted(modality_dir.glob('*'))[:3]):
            print(f"    - {file.name}")
        if file_count > 3:
            print(f"    ... and {file_count - 3} more files")

DATASET DIRECTORY STRUCTURE

app_usage/
  Files: 49
    - running_app_u00.csv
    - running_app_u01.csv
    - running_app_u02.csv
    ... and 46 more files

calendar/
  Files: 28
    - calendar_u00.csv
    - calendar_u01.csv
    - calendar_u02.csv
    ... and 25 more files

call_log/
  Files: 49
    - call_log_u00.csv
    - call_log_u01.csv
    - call_log_u02.csv
    ... and 46 more files

dinning/
  Files: 31
    - u01.txt
    - u02.txt
    - u04.txt
    ... and 28 more files

education/
  Files: 5
    - class.csv
    - class_info.json
    - deadlines.csv
    ... and 2 more files

EMA/
  Files: 2
    - EMA_definition.json
    - response

sensing/
  Files: 10
    - activity
    - audio
    - bluetooth
    ... and 7 more files

sms/
  Files: 49
    - sms_u00.csv
    - sms_u01.csv
    - sms_u02.csv
    ... and 46 more files

survey/
  Files: 8
    - BigFive.csv
    - FlourishingScale.csv
    - LonelinessScale.csv
    ... and 5 more files


## 1. Depression Ground Truth (PHQ-9)

In [2]:
# Load PHQ-9 survey (depression ground truth)
phq9_path = dataset_path / 'survey' / 'PHQ-9.csv'
phq9_df = pd.read_csv(phq9_path)

print("PHQ-9 Dataset (Depression Ground Truth)")
print(f"Shape: {phq9_df.shape}")
print(f"\nColumn Names:\n{phq9_df.columns.tolist()}")
print(f"\nFirst few rows:")
print(phq9_df.head())
print(f"\nData Types:")
print(phq9_df.dtypes)
print(f"\nUnique 'type' values (pre/post assessment):")
print(phq9_df['type'].value_counts())
print(f"\nUsers in PHQ-9: {phq9_df['uid'].nunique()}")
print(f"\nUser IDs: {sorted(phq9_df['uid'].unique())}")

PHQ-9 Dataset (Depression Ground Truth)
Shape: (84, 12)

Column Names:
['uid', 'type', 'Little interest or pleasure in doing things', 'Feeling down, depressed, hopeless.', 'Trouble falling or staying asleep, or sleeping too much.', 'Feeling tired or having little energy', 'Poor appetite or overeating', 'Feeling bad about yourself or that you are a failure or have let yourself or your family down', 'Trouble concentrating on things, such as reading the newspaper or watching television', 'Moving or speaking so slowly that other people could have noticed. Or the opposite being so figety or restless that you have been moving around a lot more than usual', 'Thoughts that you would be better off dead, or of hurting yourself', 'Response']

First few rows:
   uid type Little interest or pleasure in doing things  \
0  u00  pre                                  Not at all   
1  u01  pre                                Several days   
2  u02  pre                     More than half the days   
3  u03

## 2. Modality: App Usage

In [3]:
# Analyze app_usage modality
# Sort so that app_usage_files[0] is the same file on every run and platform
app_usage_files = sorted((dataset_path / 'app_usage').glob('*.csv'))
print(f"App Usage Files: {len(app_usage_files)} files")

# Load a sample file
sample_app = pd.read_csv(app_usage_files[0])
print(f"\nSample app_usage file: {app_usage_files[0].name}")
print(f"Shape: {sample_app.shape}")
print(f"Columns: {sample_app.columns.tolist()}")
print(f"\nFirst few rows:")
print(sample_app.head())
print(f"\nData info:")
print(f"  - Unique devices: {sample_app['device'].nunique()}")
print(f"  - Timestamp range: {sample_app['timestamp'].min()} to {sample_app['timestamp'].max()}")
print(f"  - Number of apps tracked: {sample_app['RUNNING_TASKS_topActivity_mPackage'].nunique()}")

App Usage Files: 49 files

Sample app_usage file: running_app_u00.csv
Shape: (59879, 10)
Columns: ['id', 'device', 'timestamp', 'RUNNING_TASKS_baseActivity_mClass', 'RUNNING_TASKS_baseActivity_mPackage', 'RUNNING_TASKS_id', 'RUNNING_TASKS_numActivities', 'RUNNING_TASKS_numRunning', 'RUNNING_TASKS_topActivity_mClass', 'RUNNING_TASKS_topActivity_mPackage']

First few rows:
                                        id  \
0  a7d09e48-5bbd-4672-9f55-ea34bf4862ee-38   
1  a7d09e48-5bbd-4672-9f55-ea34bf4862ee-38   
2  a7d09e48-5bbd-4672-9f55-ea34bf4862ee-38   
3  a7d09e48-5bbd-4672-9f55-ea34bf4862ee-38   
4  a7d09e48-5bbd-4672-9f55-ea34bf4862ee-38   

                                 device   timestamp  \
0  1977b545-a88f-4903-a7ae-2c434de4be49  1364100683   
1  1977b545-a88f-4903-a7ae-2c434de4be49  1364100683   
2  1977b545-a88f-4903-a7ae-2c434de4be49  1364100683   
3  1977b545-a88f-4903-a7ae-2c434de4be49  1364100683   
4  1977b545-a88f-4903-a7ae-2c434de4be49  1364100683   

                  

## 3. Modality: Call Log

In [4]:
# Analyze call_log modality
# Sort so that call_log_files[0] is the same file on every run and platform
call_log_files = sorted((dataset_path / 'call_log').glob('*.csv'))
print(f"Call Log Files: {len(call_log_files)} files")

# Load a sample file
sample_call = pd.read_csv(call_log_files[0])
print(f"\nSample call_log file: {call_log_files[0].name}")
print(f"Shape: {sample_call.shape}")
print(f"Columns: {sample_call.columns.tolist()}")
print(f"\nFirst few rows:")
print(sample_call.head())
print(f"\nData info:")
if sample_call.shape[0] > 0:
    # The call-type column in this file is named 'CALLS_type', not 'type'
    print(f"  - Call types: {sample_call['CALLS_type'].unique() if 'CALLS_type' in sample_call.columns else 'N/A'}")

Call Log Files: 49 files

Sample call_log file: call_log_u00.csv
Shape: (2203, 11)
Columns: ['id', 'device', 'timestamp', 'CALLS__id', 'CALLS_date', 'CALLS_duration', 'CALLS_name', 'CALLS_number', 'CALLS_numberlabel', 'CALLS_numbertype', 'CALLS_type']

First few rows:
                                        id  \
0  8b67cbb5-cfa2-465b-b5af-31ee2c6606a6-40   
1  9c0354e8-4f4f-451a-9358-efe17df0d26c-22   
2  9c0354e8-4f4f-451a-9358-efe17df0d26c-22   
3  9c0354e8-4f4f-451a-9358-efe17df0d26c-22   
4  9c0354e8-4f4f-451a-9358-efe17df0d26c-22   

                                 device   timestamp  CALLS__id    CALLS_date  \
0  1977b545-a88f-4903-a7ae-2c434de4be49  1364099483        NaN           NaN   
1  1977b545-a88f-4903-a7ae-2c434de4be49  1364077455      213.0  1.364077e+12   
2  1977b545-a88f-4903-a7ae-2c434de4be49  1364077455      212.0  1.364077e+12   
3  1977b545-a88f-4903-a7ae-2c434de4be49  1364077455      211.0  1.364075e+12   
4  1977b545-a88f-4903-a7ae-2c434de4be49  1364077455   
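
The `timestamp` values in the output above look like Unix epoch seconds, while `CALLS_date` appears to be in milliseconds (about 1.36e12). A minimal sketch of converting both, assuming those units hold; the small frame below is hypothetical and only reuses values shown in the output:

```python
import pandas as pd

# Hypothetical frame reusing two values from the call_log output above
calls = pd.DataFrame({
    "timestamp": [1364099483, 1364077455],      # assumed: epoch seconds
    "CALLS_date": [float("nan"), 1.364077e12],  # assumed: epoch milliseconds
})

# Convert each column with its own unit; missing values become NaT
calls["timestamp_utc"] = pd.to_datetime(calls["timestamp"], unit="s", utc=True)
calls["call_time_utc"] = pd.to_datetime(calls["CALLS_date"], unit="ms", utc=True)

print(calls[["timestamp_utc", "call_time_utc"]])
```

If the units are right, both columns should land in March 2013; dates far outside the expected collection period would suggest a different unit.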

## 4. Modality: SMS

In [5]:
# Analyze SMS modality
# Sort so that sms_files[0] is the same file on every run and platform
sms_files = sorted((dataset_path / 'sms').glob('*.csv'))
print(f"SMS Files: {len(sms_files)} files")

# Load a sample file
sample_sms = pd.read_csv(sms_files[0])
print(f"\nSample SMS file: {sms_files[0].name}")
print(f"Shape: {sample_sms.shape}")
print(f"Columns: {sample_sms.columns.tolist()}")
if sample_sms.shape[0] > 0:
    print(f"\nFirst few rows:")
    print(sample_sms.head())
else:
    print("File is empty")

SMS Files: 49 files

Sample SMS file: sms_u00.csv
Shape: (1864, 16)
Columns: ['id', 'device', 'timestamp', 'MESSAGES_address', 'MESSAGES_body', 'MESSAGES_date', 'MESSAGES_locked', 'MESSAGES_person', 'MESSAGES_protocol', 'MESSAGES_read', 'MESSAGES_reply_path_present', 'MESSAGES_service_center', 'MESSAGES_status', 'MESSAGES_subject', 'MESSAGES_thread_id', 'MESSAGES_type']

First few rows:
                                        id  \
0  8b67cbb5-cfa2-465b-b5af-31ee2c6606a6-37   
1  9c0354e8-4f4f-451a-9358-efe17df0d26c-20   
2  9c0354e8-4f4f-451a-9358-efe17df0d26c-20   
3  9c0354e8-4f4f-451a-9358-efe17df0d26c-20   
4  9c0354e8-4f4f-451a-9358-efe17df0d26c-20   

                                 device   timestamp  \
0  1977b545-a88f-4903-a7ae-2c434de4be49  1364099482   
1  1977b545-a88f-4903-a7ae-2c434de4be49  1363969836   
2  1977b545-a88f-4903-a7ae-2c434de4be49  1363969836   
3  1977b545-a88f-4903-a7ae-2c434de4be49  1363969836   
4  1977b545-a88f-4903-a7ae-2c434de4be49  1363969836   

  

## 5. Modality: Sensing

In [6]:
# Analyze Sensing modality
sensing_dir = dataset_path / 'sensing'
sensing_subdirs = [d for d in sensing_dir.iterdir() if d.is_dir()]
print(f"Sensing Data Subdirectories: {len(sensing_subdirs)}")
for subdir in sorted(sensing_subdirs):
    file_count = len(list(subdir.glob('*')))
    print(f"  - {subdir.name}: {file_count} files")

# Sample sensing data
activity_files = sorted((sensing_dir / 'activity').glob('*.csv'))  # sorted for a deterministic sample
if activity_files:
    sample_activity = pd.read_csv(activity_files[0])
    print(f"\nSample activity file: {activity_files[0].name}")
    print(f"Shape: {sample_activity.shape}")
    print(f"Columns: {sample_activity.columns.tolist()}")
    if sample_activity.shape[0] > 0:
        print(f"\nFirst few rows:")
        print(sample_activity.head())

Sensing Data Subdirectories: 10
  - activity: 49 files
  - audio: 49 files
  - bluetooth: 49 files
  - conversation: 49 files
  - dark: 49 files
  - gps: 49 files
  - phonecharge: 49 files
  - phonelock: 49 files
  - wifi: 49 files
  - wifi_location: 49 files

Sample activity file: activity_u00.csv
Shape: (461991, 2)
Columns: ['timestamp', ' activity inference']

First few rows:
    timestamp   activity inference
0  1364356801                    0
1  1364356804                    0
2  1364356807                    0
3  1364356809                    0
4  1364356992                    0
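
The column list above shows a leading space in `' activity inference'`, so `sample_activity['activity inference']` would raise a `KeyError`. One possible fix is to strip header whitespace right after loading. The inline CSV below is a hypothetical stand-in for `activity_u00.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for a sensing/activity file, with the same header quirk
raw = "timestamp, activity inference\n1364356801,0\n1364356804,0\n"
activity = pd.read_csv(io.StringIO(raw))

# Strip surrounding whitespace from every column name
activity.columns = activity.columns.str.strip()

print(activity.columns.tolist())
```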


## 6. Modality: EMA (Ecological Momentary Assessment)

In [7]:
# Analyze EMA modality
ema_dir = dataset_path / 'EMA'

# Load EMA definition
with open(ema_dir / 'EMA_definition.json', 'r') as f:
    ema_def = json.load(f)
    
print(f"EMA Definition: {len(ema_def)} question categories")
for item in ema_def[:3]:  # Show first 3 categories
    print(f"  - {item['name']}: {len(item['questions'])} questions")

# List EMA response files
ema_response_dir = ema_dir / 'response'
ema_response_files = list(ema_response_dir.glob('*.csv'))
print(f"\nEMA Response Files: {len(ema_response_files)} files")

if ema_response_files:
    sample_ema = pd.read_csv(ema_response_files[0])
    print(f"\nSample EMA response file: {ema_response_files[0].name}")
    print(f"Shape: {sample_ema.shape}")
    print(f"Columns: {sample_ema.columns.tolist()[:10]}...")  # Show first 10 columns

EMA Definition: 25 question categories
  - Social: 2 questions
  - Class: 5 questions
  - Comment: 2 questions

EMA Response Files: 0 files
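
The count of 0 above only shows that no `*.csv` files sit directly in `EMA/response`; the responses may be nested in subdirectories or stored in another format. A small sketch that tallies file extensions recursively follows. The temporary directory layout is invented for illustration and is not a claim about the real dataset, and `count_suffixes` is a hypothetical helper:

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_suffixes(root: Path) -> Counter:
    """Count files under root (recursively) by file extension."""
    return Counter(p.suffix for p in root.rglob("*") if p.is_file())

# Hypothetical layout, built only so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    response_dir = Path(tmp) / "response"
    (response_dir / "Social").mkdir(parents=True)
    (response_dir / "Social" / "Social_u00.json").write_text("[]")
    (response_dir / "Social" / "Social_u01.json").write_text("[]")

    top_level_csv = list(response_dir.glob("*.csv"))
    suffix_counts = count_suffixes(response_dir)

print(len(top_level_csv), dict(suffix_counts))
```

On the real data, `count_suffixes(ema_dir / 'response')` would show which extension and nesting depth to glob for.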


## 7. Data Coverage Summary

In [8]:
# Create a summary of data coverage across modalities
modality_coverage = {}

# App usage
app_files = list((dataset_path / 'app_usage').glob('*.csv'))
users_app = [f.stem.replace('running_app_', '') for f in app_files]
modality_coverage['app_usage'] = len(users_app)

# Call log
call_files = list((dataset_path / 'call_log').glob('*.csv'))
users_call = [f.stem.replace('call_log_', '') for f in call_files]
modality_coverage['call_log'] = len(users_call)

# SMS
sms_files = list((dataset_path / 'sms').glob('*.csv'))
users_sms = [f.stem.replace('sms_', '') for f in sms_files]
modality_coverage['sms'] = len(users_sms)

# Calendar
cal_files = list((dataset_path / 'calendar').glob('*.csv'))
users_cal = [f.stem.replace('calendar_', '') for f in cal_files]
modality_coverage['calendar'] = len(users_cal)

# Survey
phq9_users = set(phq9_df['uid'].unique())
modality_coverage['survey_phq9'] = len(phq9_users)

# Sensing
activity_files = list((dataset_path / 'sensing' / 'activity').glob('*.csv'))
users_sensing = [f.stem.replace('activity_', '') for f in activity_files]
modality_coverage['sensing'] = len(users_sensing)

# Dinning
dinning_files = list((dataset_path / 'dinning').glob('*.txt'))
users_dinning = [f.stem for f in dinning_files]
modality_coverage['dinning'] = len(users_dinning)

print("\n" + "=" * 60)
print("DATA COVERAGE BY MODALITY")
print("=" * 60)
for modality, count in sorted(modality_coverage.items(), key=lambda x: -x[1]):
    print(f"{modality:20} : {count:3} users")

print("\n" + "=" * 60)
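# Use a union (|) so every user seen in at least one of these modalities is counted;
# the original intersection (&) counted only users present in all three
print("TOTAL UNIQUE USERS: {}\n".format(len(set(users_app) | set(users_call) | set(users_sms))))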


DATA COVERAGE BY MODALITY
app_usage            :  49 users
call_log             :  49 users
sms                  :  49 users
sensing              :  49 users
survey_phq9          :  46 users
dinning              :  31 users
calendar             :  28 users

TOTAL UNIQUE USERS: 49
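
The per-modality counts above do not show which users are missing from which modality. One way to see that, sketched here with made-up user IDs, is a boolean user-by-modality table. `coverage_table` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def coverage_table(users_by_modality: dict) -> pd.DataFrame:
    """Return a boolean table: one row per user, one column per modality."""
    all_users = sorted(set().union(*users_by_modality.values()))
    return pd.DataFrame(
        {name: [u in users for u in all_users]
         for name, users in users_by_modality.items()},
        index=all_users,
    )

# Made-up IDs for illustration; the notebook's users_app, users_cal, etc. would go here
example = {
    "app_usage": {"u00", "u01", "u02"},
    "calendar": {"u00", "u02"},
    "dinning": {"u01", "u02"},
}
table = coverage_table(example)
complete_users = table.index[table.all(axis=1)].tolist()

print(table)
print("Users present in every modality:", complete_users)
```

Filtering to `complete_users` gives the subset that can be used when a model needs every modality at once.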



## Key Findings

**Dataset Composition:**

- **49 users** in total; how many of them have data varies by modality
- **9 modalities** total with the following coverage:
  - **Full coverage (49 users):** App usage, Call logs, SMS, Sensing data
  - **Partial coverage:** PHQ-9 survey (46 users), Dinning (31 users), Calendar (28 users)
  - **Structured data:** Education (shared class data), EMA (25 question categories)

**Depression Ground Truth:**

- **PHQ-9 survey** provides the depression ground truth as item-level text responses (84 rows, 12 columns)
- Pre and post assessments are available, distinguished by the `type` column
- Covers 9 depression symptoms (interest, mood, sleep, energy, appetite, self-worth, concentration, psychomotor symptoms, suicidal ideation)
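
Because the PHQ-9 file holds text answers rather than numbers, a total score has to be computed before it can serve as a label. A minimal sketch using the standard PHQ-9 item scoring (0 to 3 per item, total 0 to 27) follows. The "Nearly every day" option is assumed from the standard questionnaire since it does not appear in the truncated output, and `phq9_total` plus the toy column names are hypothetical:

```python
import pandas as pd

# Standard PHQ-9 item scoring; "Nearly every day" is assumed from the
# standard questionnaire and was not visible in the notebook output
RESPONSE_SCORES = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}

def phq9_total(df: pd.DataFrame, item_columns: list) -> pd.Series:
    """Map text answers to 0-3 and sum across items (NaN if any item is unmapped)."""
    scored = df[item_columns].apply(lambda col: col.str.strip().map(RESPONSE_SCORES))
    return scored.sum(axis=1, skipna=False)

# Hypothetical two-item frame; the real file has nine item columns
toy = pd.DataFrame({
    "uid": ["u00", "u01"],
    "type": ["pre", "pre"],
    "item_1": ["Not at all", "Several days"],
    "item_2": ["More than half the days", "Nearly every day"],
})
toy["phq9_total"] = phq9_total(toy, ["item_1", "item_2"])

print(toy[["uid", "type", "phq9_total"]])
```

With the real file, the item columns would be `phq9_df.columns[2:11]`, assuming the column order printed in section 1. Rows whose total comes out as NaN point to answers that differ from the four assumed strings.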

**Modality Details:**

1. **App Usage** - Tracks running Android apps with timestamps
2. **Call Log** - Phone call records
3. **SMS** - Text messaging data
4. **Sensing** - 10 types: activity, audio, Bluetooth, conversation, darkness, GPS, phone charge, phone lock, WiFi, WiFi location
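5. **EMA** - Self-reported data with 25 question categories defined (Social, Class, Comment, etc.); the `*.csv` glob in section 6 found no response files, so the responses themselves were not inspected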
6. **Calendar** - Calendar events (28 users)
7. **Dinning** - Meal/dining data (31 users)
8. **Education** - Classes, grades, deadlines, Piazza posts (shared dataset)
9. **Survey** - Multiple validated scales: PHQ-9, Big Five, Flourishing, Loneliness, Stress, PSQI, PANAS, VR-12