# DSA210 Project: Sleep Data Analysis

This notebook presents the complete analysis pipeline for the DSA210 project, following the provided project guidelines: motivation, data loading, exploratory data analysis, statistical tests, machine learning models, findings, and future work.

## Motivation

- Sleep is essential for cognitive performance and academic success.
- Analyzing personal sleep data in relation to study habits can reveal insights for optimizing study routines.

## Data Source

The data is loaded from `2 - Sleep_Study_Data.xlsx`, which contains daily records of sleep duration, sleep quality, study duration, and study quality. Ensure this file is in the same directory as this notebook.
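
Because loading depends on the file sitting next to the notebook, a small existence check can give a clearer error than the one raised by the Excel reader. The sketch below is illustrative only: `resolve_data_file` is a hypothetical helper, not part of the project, and the demo uses a temporary directory with a placeholder file name rather than the real workbook.

```python
import tempfile
from pathlib import Path


def resolve_data_file(name, base_dir="."):
    """Return the path to `name` inside `base_dir`, or raise a readable error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(base_dir) / name
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected '{name}' in {Path(base_dir).resolve()}; "
            "place it in the same directory as the notebook."
        )
    return path


# Demo in a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.xlsx").touch()  # placeholder file, not real data
    found = resolve_data_file("example.xlsx", tmp)
    print(f"Found: {found.name}")
```

In the notebook, the returned path could be passed to `pd.read_excel` in place of the bare file name.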

## Setup

Import required libraries and load the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load data
file_path = '2 - Sleep_Study_Data.xlsx'
data = pd.read_excel(file_path)

# Rename columns for convenience
data = data.rename(columns={
    'Sleep Time - (Calculation)': 'SleepHours',
    'Sleep Quality - (1-10)': 'SleepQuality',
    'Study Duration - (Calculation)': 'StudyHours',
    'Study Quality - (1-10)': 'StudyQuality'
})

# Convert Date column to datetime
data['Date'] = pd.to_datetime(data['Date'])

## Data Preprocessing

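Fill missing values in each numeric column with that column's mean. The filled values go into a copy (`filled`); the original `data` frame is left unchanged so that the time-series plot can still show gaps.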

In [None]:
# Fill missing values
num_cols = ['SleepHours', 'SleepQuality', 'StudyHours', 'StudyQuality']
filled = data.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())
filled.head()
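
Mean imputation keeps each column's mean intact but shrinks its spread, because every filled value sits exactly at the centre. The dependency-free sketch below illustrates this on made-up sleep durations (the numbers are hypothetical, not taken from the project data); it may be worth counting the missing entries in the real columns before relying on the correlations that follow.

```python
import statistics

# Hypothetical sleep durations in hours; None marks a day with no record.
sleep_hours = [7.0, None, 6.0, 8.0, None, 7.0]

observed = [v for v in sleep_hours if v is not None]
col_mean = statistics.mean(observed)

# Mean imputation: every gap receives the same value, the observed mean.
imputed = [col_mean if v is None else v for v in sleep_hours]

n_missing = len(sleep_hours) - len(observed)
std_observed = statistics.stdev(observed)
std_imputed = statistics.stdev(imputed)

print(f"missing values: {n_missing}")
print(f"mean before: {col_mean:.2f}, after: {statistics.mean(imputed):.2f}")
print(f"std before: {std_observed:.3f}, after: {std_imputed:.3f}")
```

The more values are filled, the more the standard deviation is understated, which can in turn distort correlation estimates.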

## Exploratory Data Analysis

### Correlation Matrix

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(filled[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

### Sleep Quality vs Study Quality

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='SleepQuality', y='StudyQuality', data=filled)
plt.title('Sleep Quality vs Study Quality')
plt.xlabel('Sleep Quality (1-10)')
plt.ylabel('Study Quality (1-10)')
plt.tight_layout()
plt.show()

### Sleep Quality vs Study Hours (Box Plot)

In [None]:
plt.figure(figsize=(6,6))
sns.boxplot(x=filled['SleepQuality'].round().astype(int), y='StudyHours', data=filled)
plt.title('Sleep Quality vs Study Hours')
plt.xlabel('Sleep Quality (1-10)')
plt.ylabel('Study Hours')
plt.tight_layout()
plt.show()

### Distribution of Sleep and Study Duration

In [None]:
plt.figure(figsize=(6,6))
sns.histplot(filled['SleepHours'], bins=10, kde=True)
plt.title('Distribution of Sleep Hours')
plt.xlabel('Sleep Duration (Hours)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

plt.figure(figsize=(6,6))
sns.histplot(filled['StudyHours'], bins=10, kde=True)
plt.title('Distribution of Study Hours')
plt.xlabel('Study Duration (Hours)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

### Sleep and Study Over Time

In [None]:
ts = data.set_index('Date')[['SleepHours','StudyHours']].sort_index()
full_idx = pd.date_range(ts.index.min(), ts.index.max(), freq='D')
ts_full = ts.reindex(full_idx)

plt.figure(figsize=(10,5))
plt.plot(ts_full.index, ts_full['SleepHours'], marker='o', label='Sleep Hours')
plt.plot(ts_full.index, ts_full['StudyHours'], marker='s', label='Study Hours')
plt.title('Sleep and Study Hours Over Time')
plt.xlabel('Date')
plt.ylabel('Hours')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Feature Engineering: Sleep Efficiency & Study Productivity

In [None]:
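# Create ratio features
filled['SleepEfficiency'] = filled['SleepQuality'] / filled['SleepHours']
# Days with zero study hours would divide by zero and yield inf,
# so treat them as missing for this ratio.
study_hours = filled['StudyHours'].replace(0, float('nan'))
filled['StudyProductivity'] = filled['StudyQuality'] / study_hours

# Display the first few rows of the new features
filled[['SleepEfficiency', 'StudyProductivity']].head()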

### Exploring New Features: Scatterplots

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='SleepEfficiency', y='StudyQuality', data=filled)
plt.title('Sleep Efficiency vs Study Quality')
plt.xlabel('Sleep Efficiency (Quality / Hours)')
plt.ylabel('Study Quality (1-10)')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='StudyProductivity', y='SleepQuality', data=filled)
plt.title('Study Productivity vs Sleep Quality')
plt.xlabel('Study Productivity (Quality / Hours)')
plt.ylabel('Sleep Quality (1-10)')
plt.tight_layout()
plt.show()

## Statistical Analysis

### Pearson Correlation Test between Sleep Duration and Study Duration

In [None]:
corr, p_value = pearsonr(filled['SleepHours'], filled['StudyHours'])
print(f"Correlation: {corr:.3f}")
print(f"P-value: {p_value:.3f}")
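
To make explicit what the correlation coefficient measures, here is a minimal hand-rolled sketch of Pearson's r: the covariance of the two series divided by the product of their standard deviations. The function `pearson_r` and the sample values are illustrative assumptions; the notebook itself relies on `scipy.stats.pearsonr`, which also returns the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only.
sleep = [6.0, 7.0, 8.0, 5.5, 7.5]
study = [2.0, 3.0, 3.5, 1.5, 4.0]
print(f"r = {pearson_r(sleep, study):.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear association; the p-value then indicates how surprising the observed r would be if the true correlation were zero.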

## Machine Learning Models

### Model 1: Predict Study Quality from Sleep Quality

In [None]:
X1 = filled[['SleepQuality']]
y = filled['StudyQuality']
model1 = LinearRegression()
model1.fit(X1, y)
y_pred1 = model1.predict(X1)
r2_1 = r2_score(y, y_pred1)
mae_1 = mean_absolute_error(y, y_pred1)
print(f"Model 1 - R²: {r2_1:.3f}, MAE: {mae_1:.3f}")

plt.figure(figsize=(6,6))
sns.regplot(x='SleepQuality', y='StudyQuality', data=filled, scatter_kws={'s':50}, line_kws={'color':'red'})
plt.title('Sleep Quality vs Study Quality Regression')
plt.xlabel('Sleep Quality (1-10)')
plt.ylabel('Study Quality (1-10)')
plt.tight_layout()
plt.show()
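
The R² and MAE above are computed on the same rows used to fit the model, so they describe in-sample fit. A hedged sketch of a hold-out evaluation follows: it fits a one-feature least-squares line on the first 70% of days and scores it on the rest. The helpers `fit_line` and `r2_and_mae`, the 70/30 split, and the sample ratings are assumptions made for illustration; in the notebook, `sklearn.model_selection.train_test_split` with `LinearRegression` would be the usual route.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_mae(ys, preds):
    """Coefficient of determination and mean absolute error."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    mae = sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)
    return 1 - ss_res / ss_tot, mae


# Hypothetical daily ratings (1-10), in chronological order.
sleep_quality = [6, 7, 5, 8, 7, 6, 9, 5, 7, 8]
study_quality = [5, 7, 5, 8, 6, 6, 8, 4, 7, 7]

# Chronological split: earlier days train the model, later days test it.
split = len(sleep_quality) * 7 // 10
x_train, x_test = sleep_quality[:split], sleep_quality[split:]
y_train, y_test = study_quality[:split], study_quality[split:]

slope, intercept = fit_line(x_train, y_train)
test_preds = [slope * x + intercept for x in x_test]
r2_test, mae_test = r2_and_mae(y_test, test_preds)
print(f"Hold-out R²: {r2_test:.3f}, MAE: {mae_test:.3f}")
```

With a small single-participant dataset the hold-out scores will be noisy, but a large gap between in-sample and hold-out R² is still a useful warning sign of overfitting.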

### Model 2: Predict Study Quality from Sleep Duration

In [None]:
X2 = filled[['SleepHours']]
model2 = LinearRegression()
model2.fit(X2, y)
y_pred2 = model2.predict(X2)
r2_2 = r2_score(y, y_pred2)
mae_2 = mean_absolute_error(y, y_pred2)
print(f"Model 2 - R²: {r2_2:.3f}, MAE: {mae_2:.3f}")

plt.figure(figsize=(6,6))
sns.regplot(x='SleepHours', y='StudyQuality', data=filled, scatter_kws={'s':50}, line_kws={'color':'red'})
plt.title('Sleep Duration vs Study Quality Regression')
plt.xlabel('Sleep Duration (Hours)')
plt.ylabel('Study Quality (1-10)')
plt.tight_layout()
plt.show()

### Model 3: Predict Study Quality using Sleep Efficiency

In [None]:
X3 = filled[['SleepEfficiency']]
y3 = filled['StudyQuality']

model3 = LinearRegression()
model3.fit(X3, y3)

y_pred3 = model3.predict(X3)
print(f"Model 3 - R²: {r2_score(y3, y_pred3):.3f}, MAE: {mean_absolute_error(y3, y_pred3):.3f}")

plt.figure(figsize=(6,6))
sns.regplot(x='SleepEfficiency', y='StudyQuality', data=filled, scatter_kws={'s':50}, line_kws={'color':'red'})
plt.title('Regression: Sleep Efficiency vs Study Quality')
plt.xlabel('Sleep Efficiency')
plt.ylabel('Study Quality (1-10)')
plt.tight_layout()
plt.show()

## Findings

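- Sleep duration vs study duration: the Pearson test reports the correlation coefficient and p-value. Fill in both values here after running the cells.
- Sleep quality vs study quality: the scatter plot and the Model 1 regression line show the strength and direction of the relationship.
- The three regression models report R² and MAE as measures of predictive power. Both metrics are computed on the same data used for fitting, so they describe in-sample fit rather than performance on unseen days.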


## Limitations and Future Work

- Data from a single participant limits generalizability.
- Additional factors (e.g., stress, nutrition) could improve models.
- Future work: collect multi-participant data, explore advanced machine learning techniques.