## MLDC Mapping

1. **Problem Definition** → Emotion detection from EEG
2. **Data Collection** → Load EEG CSVs from `dataset/features_raw.csv`
3. **Data Processing** → Missing-value imputation, outlier clipping, scaling
4. **EDA** → Shape, samples, statistics
5. **Feature Engineering** → Raw EEG channels (baseline)
6. **Model Selection** → Linear + Logistic Regression
7. **Deployment** → Exported to web UI in this project


### Dataset Used (`dataset/features_raw.csv`)
- Each row is one time sample with one value per EEG channel (32 channels).
- No real labels included → we generate **synthetic** valence/arousal/dominance.


## 1) Problem Definition
We want to predict emotion dimensions from EEG: **valence**, **arousal**, **dominance**.


## 2) Imports


In [None]:
import warnings

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

# Silence floating-point RuntimeWarnings; non-finite values are cleaned explicitly in preprocessing
warnings.filterwarnings('ignore', category=RuntimeWarning)
np.seterr(all='ignore')


## 3) Data Loading
Load the EEG CSV file, in which each column holds one channel.


In [None]:
data = pd.read_csv("../dataset/features_raw.csv")
# Drop the empty trailing column (pandas names it 'Unnamed: 32') if it exists
if 'Unnamed: 32' in data.columns:
    data = data.drop(columns=['Unnamed: 32'])
data.head()
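
The label step later in this notebook indexes channels by name (`Fp1`, `Fp2`, `F3`, `F4`, `C3`, `C4`, `P3`, `P4`, `Pz`), so it can help to check for them right after loading. The sketch below is one possible check, not part of the original notebook: `missing_channels` and `REQUIRED_CHANNELS` are hypothetical names, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Channels that the label-creation step refers to by name.
REQUIRED_CHANNELS = ["Fp1", "Fp2", "F3", "F4", "C3", "C4", "P3", "P4", "Pz"]

def missing_channels(df, required=REQUIRED_CHANNELS):
    """Return the required channel names absent from df (hypothetical helper)."""
    return [name for name in required if name not in df.columns]

# Toy frame standing in for the real CSV; 'Pz' is left out on purpose.
toy = pd.DataFrame({name: [0.0, 1.0] for name in REQUIRED_CHANNELS if name != "Pz"})
absent = missing_channels(toy)
print("Missing channels:", absent)
```

On the real data, a non-empty result would mean the valence or dominance heuristics cannot be computed as written.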


## 4) EDA (Basic Exploration)


In [None]:
data.shape


In [None]:
data.describe().loc[["mean", "std"]].head()
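
Since the preprocessing step replaces infinite values and fills NaNs, counting them first shows how much cleaning is actually needed. This is a small sketch on a toy frame; `count_bad_values` is a hypothetical helper, not something defined elsewhere in this project.

```python
import numpy as np
import pandas as pd

def count_bad_values(df):
    """Per-column counts of NaN and +/-inf entries (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    n_nan = numeric.isna().sum()      # NaN entries per column
    n_inf = np.isinf(numeric).sum()   # +inf or -inf entries per column
    return pd.DataFrame({"nan": n_nan, "inf": n_inf})

# Toy frame standing in for the EEG data.
toy = pd.DataFrame({"F3": [1.0, np.nan, 3.0], "F4": [np.inf, 2.0, -np.inf]})
report = count_bad_values(toy)
print(report)
```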


## 5) Preprocessing
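- Replace infinite values and fill missing values (column mean, then zero)
- Clip outliers to the 1st–99th percentile per column
- Standardize features and clip the result to [-5, 5]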


In [None]:
# Treat +/-inf as missing so it is imputed below (avoids numeric overflow)
data = data.replace([np.inf, -np.inf], np.nan)
# First fill with column means, then zero-fill any remaining NaNs
try:
    data = data.fillna(data.mean(numeric_only=True))
except TypeError:
    data = data.fillna(data.mean())
data = data.fillna(0)

# Clip outliers per column (1st–99th percentile)
try:
    lower = data.quantile(0.01, numeric_only=True)
    upper = data.quantile(0.99, numeric_only=True)
except TypeError:
    lower = data.quantile(0.01)
    upper = data.quantile(0.99)

data = data.clip(lower=lower, upper=upper, axis=1)

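X = data.to_numpy(dtype=float)
# Note: the scaler is fit on all rows (before the train/test split),
# so the scaling statistics include the future test rows.
scaler = StandardScaler()
X = scaler.fit_transform(X)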

# Ensure no NaN/inf remains, then clip to stable range
X = np.nan_to_num(X, nan=0.0, posinf=5.0, neginf=-5.0)
X = np.clip(X, -5, 5)
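
Because `StandardScaler` is fit on every row here, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on random stand-in data; the names `X_raw`, `demo_scaler`, `X_tr`, and `X_te` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=10.0, scale=3.0, size=(200, 4))  # stand-in for EEG rows

# Split first, so the test rows never influence the scaling statistics.
X_tr_raw, X_te_raw = train_test_split(X_raw, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr_raw)  # mean/std come from training rows only
X_te = demo_scaler.transform(X_te_raw)      # test rows reuse the training statistics

# Same stabilizing clip as in the cell above, applied after scaling.
X_tr = np.clip(X_tr, -5, 5)
X_te = np.clip(X_te, -5, 5)
print(X_tr.shape, X_te.shape)
```

The same reasoning would apply to the percentile clipping, which is also computed over all rows.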


## 6) Feature Selection / Creation
We use **raw scaled EEG channels** as baseline features.


## 7) Create Synthetic Valence/Arousal/Dominance Labels
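Because real emotion labels are missing, we create **pseudo-labels** from simple heuristics computed on the EEG channels themselves: frontal asymmetry for valence, mean absolute amplitude across all channels for arousal, and mean absolute central/parietal amplitude for dominance. Each score is scaled to 1–9 and then median-split into High/Low.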


In [None]:
# Helper: min-max scale a signal to the 1–9 range (non-finite values become 0; a constant signal maps to 5)
def scale_1_9(x):
    x = np.asarray(x, dtype=float)
    x = np.nan_to_num(x, nan=0.0, posinf=0.0, neginf=0.0)
    min_v = x.min()
    max_v = x.max()
    denom = max_v - min_v
    if denom == 0:
        return np.full_like(x, 5.0, dtype=float)
    scaled = (x - min_v) / denom * 8 + 1
    return np.clip(scaled, 1, 9)

# Valence: frontal asymmetry (right - left)
valence_raw = (data['F4'] + data['Fp2']) - (data['F3'] + data['Fp1'])

# Arousal: overall absolute activity
arousal_raw = data.abs().mean(axis=1)

# Dominance: central + parietal activity (simple heuristic)
dom_channels = ['C3','C4','P3','P4','Pz']
dominance_raw = data[dom_channels].abs().mean(axis=1)

# Scale to 1–9
y_val = scale_1_9(valence_raw)
y_ar = scale_1_9(arousal_raw)
y_dom = scale_1_9(dominance_raw)

# Final safety cleanup
y_val = np.nan_to_num(y_val, nan=5.0, posinf=9.0, neginf=1.0)
y_ar = np.nan_to_num(y_ar, nan=5.0, posinf=9.0, neginf=1.0)
y_dom = np.nan_to_num(y_dom, nan=5.0, posinf=9.0, neginf=1.0)

# Binary labels (High vs Low)
y_val_bin = (y_val > np.median(y_val)).astype(int)
y_ar_bin = (y_ar > np.median(y_ar)).astype(int)
y_dom_bin = (y_dom > np.median(y_dom)).astype(int)
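
One consequence of building labels from the features is worth seeing directly: the valence pseudo-label is a linear combination of four channels followed by min-max scaling, so it is an affine function of those channels and a linear model can reproduce it almost exactly. The sketch below illustrates this on random stand-in data (the array `chans` is not the real EEG data), so a low error here says nothing about emotion prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Random stand-in for four frontal channels; columns are Fp1, Fp2, F3, F4.
chans = rng.normal(size=(500, 4))
fp1, fp2, f3, f4 = chans.T

# Same recipe as the valence label: right-minus-left asymmetry, min-max scaled to 1..9.
raw = (f4 + fp2) - (f3 + fp1)
y = (raw - raw.min()) / (raw.max() - raw.min()) * 8 + 1

# The label is affine in the inputs, so ordinary least squares fits it to rounding error.
model = LinearRegression().fit(chans, y)
mse = mean_squared_error(y, model.predict(chans))
print("MSE on a label that is affine in the features:", mse)
```

In the notebook itself the fit will be slightly less exact, because the features are additionally standardized and clipped to [-5, 5].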


## 8) Train/Test Split


In [None]:
# One call splits every array with the same shuffled row indices,
# so the features and all six label vectors stay aligned.
(X_train, X_test,
 yv_train, yv_test,
 ya_train, ya_test,
 yd_train, yd_test,
 yv_train_b, yv_test_b,
 ya_train_b, ya_test_b,
 yd_train_b, yd_test_b) = train_test_split(
    X, y_val, y_ar, y_dom, y_val_bin, y_ar_bin, y_dom_bin,
    test_size=0.2, random_state=42)

# The binary models use the same feature split
X_train_b, X_test_b = X_train, X_test


## 9) Linear Regression (Intensity Scores)


In [None]:
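# One estimator is refit for each target; every fit() call replaces the previous fit,
# so only the dominance model remains in `lin` after this cell.
lin = LinearRegression()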

# Valence
lin.fit(X_train, yv_train)
pred_v = lin.predict(X_test)
print('Valence MSE:', mean_squared_error(yv_test, pred_v))

# Arousal
lin.fit(X_train, ya_train)
pred_a = lin.predict(X_test)
print('Arousal MSE:', mean_squared_error(ya_test, pred_a))

# Dominance
lin.fit(X_train, yd_train)
pred_d = lin.predict(X_test)
print('Dominance MSE:', mean_squared_error(yd_test, pred_d))
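
scikit-learn's `LinearRegression` also accepts a 2-D target, so the three intensity scores could be fit in a single call that keeps all coefficients in one model. The following is a minimal sketch on random stand-in data; `X_demo`, `Y_demo`, `W`, and `multi` are illustrative names, and the toy targets only stand in for valence, arousal, and dominance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
# Three toy targets: linear in the features plus a little noise.
W = rng.normal(size=(5, 3))
Y_demo = X_demo @ W + 0.1 * rng.normal(size=(300, 3))

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=42)

# A single fit handles all three target columns at once.
multi = LinearRegression().fit(X_tr, Y_tr)

# multioutput="raw_values" returns one MSE per target column.
per_target_mse = mean_squared_error(Y_te, multi.predict(X_te), multioutput="raw_values")
for name, mse in zip(["valence", "arousal", "dominance"], per_target_mse):
    print(f"{name} MSE: {mse:.4f}")
```

With this layout, `multi.coef_` has one row per target, which may be more convenient if the model is later exported.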


## 10) Logistic Regression (High vs Low)


In [None]:
# As with `lin`, each fit() call replaces the previous fit
log = LogisticRegression(max_iter=1000)

# Valence High/Low
log.fit(X_train_b, yv_train_b)
pred_vb = log.predict(X_test_b)
print('Valence Accuracy:', accuracy_score(yv_test_b, pred_vb))

# Arousal High/Low
log.fit(X_train_b, ya_train_b)
pred_ab = log.predict(X_test_b)
print('Arousal Accuracy:', accuracy_score(ya_test_b, pred_ab))

# Dominance High/Low
log.fit(X_train_b, yd_train_b)
pred_db = log.predict(X_test_b)
print('Dominance Accuracy:', accuracy_score(yd_test_b, pred_db))
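
Accuracy is easier to interpret next to a trivial baseline. Because the High/Low labels come from a median split, a classifier that always predicts the majority class should land near 50%. The sketch below compares such a baseline with logistic regression on random stand-in data; `X_demo`, `y_demo`, and the toy `score` signal are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
score = X_demo[:, 0] - X_demo[:, 1]              # toy continuous signal
y_demo = (score > np.median(score)).astype(int)  # median split, as in the notebook

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc_base = accuracy_score(y_te, baseline.predict(X_te))
acc_clf = accuracy_score(y_te, clf.predict(X_te))
print("Baseline accuracy:", acc_base)
print("Logistic accuracy:", acc_clf)
```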
