**Inputs**: Voltage measured between an electrode on the forehead and 2 electrodes in the ears of each of 10 different students.
* SubjectID - The ID of the student
* VideoID - The ID of the video shown to the student
* Attention - Level of engagement the student had while watching the video (measured by MindSet)
* Mediation - Level of calmness the student had while watching the video (measured by MindSet)
* Raw - The raw EEG signal measured
* Delta - Millivolts of delta wave activity measured. The delta band covers 1-3 Hz
* Theta - Millivolts of theta wave activity measured. The theta band covers 4-7 Hz
* Alpha1 - Millivolts of alpha1 wave activity measured. Alpha1 is the lower part of the 8-11 Hz alpha band
* Alpha2 - Millivolts of alpha2 wave activity measured. Alpha2 is the higher part of the 8-11 Hz alpha band
* Beta1 - Millivolts of beta1 wave activity measured. Beta1 is the lower part of the 12-29 Hz beta band
* Beta2 - Millivolts of beta2 wave activity measured. Beta2 is the higher part of the 12-29 Hz beta band
* Gamma1 - Millivolts of gamma1 wave activity measured. Gamma1 is the lower part of the 30-100 Hz gamma band
* Gamma2 - Millivolts of gamma2 wave activity measured. Gamma2 is the higher part of the 30-100 Hz gamma band
* predefinedlabel - Whether the subject is expected to be confused.

**Output**: Each student self-rated their confusion level on a 1-7 scale after watching each video, with 1 = least confusing and 7 = most confusing. These ratings were further normalized into binary confused / not confused labels.
* user-definedlabeln - Whether the subject is actually confused.

This is a binary classification problem.
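
The step from 1-7 ratings to binary labels can be sketched as follows. The threshold actually used for this dataset is not stated here, so the median split below is an assumption, and `binarize_ratings` is a hypothetical helper rather than part of the dataset tooling.

```python
from statistics import median

def binarize_ratings(ratings, threshold=None):
    """Map 1-7 self-ratings to 0 (not confused) or 1 (confused).

    Assumption: ratings strictly above the threshold count as confused,
    and the threshold defaults to the median of the given ratings.
    """
    if threshold is None:
        threshold = median(ratings)
    return [1 if r > threshold else 0 for r in ratings]

# Hypothetical ratings for six videos (not taken from the dataset)
ratings = [1, 2, 3, 5, 6, 7]
labels = binarize_ratings(ratings)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A fixed cut-off (for example `threshold=4`) is an equally plausible choice; the point is only that the target column ends up with two classes.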


# **Installs & Imports**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

# **Read in data**

In [29]:
df = pd.read_csv('./EEG_data.csv')
# X is the feature matrix (every column except the target) and y is the target vector
X = df.drop(columns=["user-definedlabeln"])
y = df["user-definedlabeln"]
df.head()

Unnamed: 0,SubjectID,VideoID,Attention,Mediation,Raw,Delta,Theta,Alpha1,Alpha2,Beta1,Beta2,Gamma1,Gamma2,predefinedlabel,user-definedlabeln
0,0.0,0.0,56.0,43.0,278.0,301963.0,90612.0,33735.0,23991.0,27946.0,45097.0,33228.0,8293.0,0.0,0.0
1,0.0,0.0,40.0,35.0,-50.0,73787.0,28083.0,1439.0,2240.0,2746.0,3687.0,5293.0,2740.0,0.0,0.0
2,0.0,0.0,47.0,48.0,101.0,758353.0,383745.0,201999.0,62107.0,36293.0,130536.0,57243.0,25354.0,0.0,0.0
3,0.0,0.0,47.0,57.0,-5.0,2012240.0,129350.0,61236.0,17084.0,11488.0,62462.0,49960.0,33932.0,0.0,0.0
4,0.0,0.0,44.0,53.0,-8.0,1005145.0,354328.0,37102.0,88881.0,45307.0,99603.0,44790.0,29749.0,0.0,0.0
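
Before modeling, it can help to count missing values and look at the class balance, since the majority-class share is the accuracy a trivial classifier would reach. A minimal sketch on a small made-up frame; the values in `toy` are illustrative and not taken from `EEG_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "Attention": [56.0, 40.0, np.nan, 47.0, 44.0],
    "user-definedlabeln": [0.0, 0.0, 1.0, 1.0, 1.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Share of each class; the largest share is the majority-class baseline
class_share = toy["user-definedlabeln"].value_counts(normalize=True)
majority_baseline = class_share.max()

print(missing_per_column)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The same two calls can be run on `df` directly to see how many rows `dropna` would remove and what accuracy counts as better than guessing.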


# **LogisticRegression**

In [28]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Drop training rows with missing values, then keep only the matching labels
X_train = X_train.dropna()
y_train = y_train.loc[X_train.index]

# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Make predictions and evaluate performance
y_pred_log_reg = log_reg.predict(X_test)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)

print(f"Logistic Regression Accuracy: {accuracy_log_reg:.4f}")

Logistic Regression Accuracy: 0.5989


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# **Neural network**

In [13]:
# Normalize the data for the neural network
# (the scaler is fit on every row of X, including the rows in X_test)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert the data to PyTorch tensors
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y.values, dtype=torch.long)

# Create a DataLoader for batching
# (all rows of X are used for training, so the X_test rows evaluated below are not held out)
train_data = TensorDataset(X_tensor, y_tensor)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Define the neural network architecture
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size=64, output_size=2):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)

# Initialize the model
input_size = X_scaled.shape[1]  # Number of features
model = NeuralNetwork(input_size)

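# Define loss function and optimizer
# (nn.CrossEntropyLoss applies log-softmax internally, so here it receives softmax outputs rather than raw logits)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)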

# Train the neural network
epochs = 20
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    if (epoch+1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(train_loader):.4f}")

# Test the model
with torch.no_grad():
    model.eval()
    X_test_tensor = torch.tensor(scaler.transform(X_test), dtype=torch.float32)
    y_pred_nn = model(X_test_tensor).argmax(dim=1)
    accuracy_nn = (y_pred_nn == torch.tensor(y_test.values)).float().mean()

print(f"Neural Network Accuracy: {accuracy_nn:.4f}")

Epoch [5/20], Loss: 0.6298
Epoch [10/20], Loss: 0.6010
Epoch [15/20], Loss: 0.5808
Epoch [20/20], Loss: 0.5652
Neural Network Accuracy: 0.7482
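
The cell above trains on every row of `X`, so the rows in `X_test` are also seen during training, and the model applies `Softmax` before `nn.CrossEntropyLoss`, which already applies log-softmax itself. A hedged sketch of a variant that trains on the training split only and returns raw logits; it uses synthetic data, and `LogitNetwork`, `X_demo`, and the other `demo_` names are illustrative, not from the notebook.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-in for X and y (not the real EEG data); 14 features is arbitrary
X_demo = rng.normal(size=(400, 14)).astype(np.float32)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(np.int64)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split
demo_scaler = StandardScaler()
X_tr_t = torch.tensor(demo_scaler.fit_transform(X_tr), dtype=torch.float32)
X_te_t = torch.tensor(demo_scaler.transform(X_te), dtype=torch.float32)
y_tr_t = torch.tensor(y_tr, dtype=torch.long)
y_te_t = torch.tensor(y_te, dtype=torch.long)

demo_loader = DataLoader(TensorDataset(X_tr_t, y_tr_t), batch_size=64, shuffle=True)

class LogitNetwork(nn.Module):
    """Same layer sizes as NeuralNetwork, but forward() returns raw logits."""

    def __init__(self, input_size, hidden_size=64, output_size=2):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

demo_model = LogitNetwork(X_tr_t.shape[1])
demo_criterion = nn.CrossEntropyLoss()  # expects logits; applies log-softmax internally
demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.001)

for epoch in range(20):
    demo_model.train()
    for inputs, labels in demo_loader:
        demo_optimizer.zero_grad()
        loss = demo_criterion(demo_model(inputs), labels)
        loss.backward()
        demo_optimizer.step()

# Evaluate on rows the model never saw during training
demo_model.eval()
with torch.no_grad():
    logits = demo_model(X_te_t)
    demo_accuracy = (logits.argmax(dim=1) == y_te_t).float().mean().item()

print(f"Held-out accuracy on synthetic data: {demo_accuracy:.4f}")
```

Taking `argmax` over logits gives the same predicted class as taking it over softmax probabilities, so no softmax layer is needed for prediction either.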


# **Feature engineering**


In [24]:
# 1. Compute the IQR for each feature
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Filter out rows that have an outlier in any feature
X_no_outliers = X[~((X < lower_bound) | (X > upper_bound)).any(axis=1)]
y_no_outliers = y.loc[X_no_outliers.index]

# 3. Select a subset of features (dropping Delta, Beta2, and Gamma1)
X_no_outliers = X_no_outliers.drop(columns=["Delta", "Beta2", "Gamma1"])

# 4. Keep only the rows for a single subject (SubjectID == 1.0)
X_no_outliers = X_no_outliers[X_no_outliers["SubjectID"] == 1.0]
y_no_outliers = y_no_outliers.loc[X_no_outliers.index]

# Split and train with the cleaned data
X_train_no_outliers, X_test_no_outliers, y_train_no_outliers, y_test_no_outliers = train_test_split(X_no_outliers, y_no_outliers, test_size=0.3, random_state=42)
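
The IQR rule used above can be traced on a tiny made-up frame. The values in `toy` are illustrative only; the filtering expression is the same one applied to `X`.

```python
import pandas as pd

# Made-up values; 400.0 is the only point far from the rest
toy = pd.DataFrame({
    "Attention": [40.0, 44.0, 47.0, 47.0, 56.0, 400.0],
    "Theta": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A value is an outlier if it lies more than 1.5 * IQR outside [Q1, Q3]
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is dropped if any of its columns holds an outlier
is_outlier = (toy < lower) | (toy > upper)
toy_clean = toy[~is_outlier.any(axis=1)]

print(upper["Attention"])        # 67.25
print(toy_clean.index.tolist())  # [0, 1, 2, 3, 4]
```

Because a single outlying column removes the whole row, applying the rule across many columns at once can discard a large share of the data; comparing `len(X_no_outliers)` with `len(X)` shows how much remains.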


In [30]:
# Retrain the neural network model

# Normalize the data for the neural network
scaler = StandardScaler()
X_train_no_outliers_scaled = scaler.fit_transform(X_train_no_outliers)  # Fit only on training data with no outliers
X_test_no_outliers_scaled = scaler.transform(X_test_no_outliers)  # Transform the test data using the same scaler


# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train_no_outliers_scaled, dtype=torch.float32)  # Use scaled training data
y_train_tensor = torch.tensor(y_train_no_outliers.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_no_outliers_scaled, dtype=torch.float32)  # Use scaled test data
y_test_tensor = torch.tensor(y_test_no_outliers.values, dtype=torch.long)

# Create a DataLoader for batching
train_data = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Define the neural network architecture (same as before)
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size=64, output_size=2):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.softmax(x)

# Initialize the model
input_size = X_train_no_outliers_scaled.shape[1]  # Number of features
model = NeuralNetwork(input_size)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the neural network
epochs = 20
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    if (epoch+1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(train_loader):.4f}")

# Test the model
with torch.no_grad():
    model.eval()
    y_pred_nn = model(X_test_tensor).argmax(dim=1)
    accuracy_nn = (y_pred_nn == y_test_tensor).float().mean()

print(f"Neural Network Accuracy: {accuracy_nn:.4f}")

Epoch [5/20], Loss: 0.4994
Epoch [10/20], Loss: 0.4434
Epoch [15/20], Loss: 0.4222
Epoch [20/20], Loss: 0.4077
Neural Network Accuracy: 0.8738
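
With a random row-wise split, rows from the same video can land in both the training and the test set, which may make test accuracy look better than it would be on unseen videos. One way to check this, sketched here on made-up data, is to split by `VideoID` with scikit-learn's `GroupShuffleSplit`. The frame `toy` and the label vector `toy_y` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Made-up frame: 10 videos with 20 rows each (not the real EEG data)
toy = pd.DataFrame({
    "VideoID": np.repeat(np.arange(10), 20),
    "Attention": rng.normal(50, 10, size=200),
})
toy_y = pd.Series(np.repeat(np.arange(10) % 2, 20))  # one label per video

# Every row of a given video goes to the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy_y, groups=toy["VideoID"]))

train_videos = set(toy["VideoID"].iloc[train_idx])
test_videos = set(toy["VideoID"].iloc[test_idx])

print(f"Train videos: {sorted(train_videos)}")
print(f"Test videos: {sorted(test_videos)}")
```

On the notebook's data the equivalent call would pass `X_no_outliers["VideoID"]` as `groups`, and the resulting accuracy can then be compared with the row-wise figure above.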
