# Bilingualism EEG Prediction

**Project**: Sandia Datathon - Predicting Spanish Bilingualism using ERP Signals.

---

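This notebook explores whether participants can be classified as Spanish bilinguals from ERP signals in the 300–500 ms window. We train a Random Forest, address class imbalance with SMOTE, and then compare Random Forest, Gradient Boosting, and Logistic Regression using cross-validation.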

In [53]:
import os

import pandas as pd
from imblearn.over_sampling import SMOTE  # needed for the SMOTE cell; provided by the imbalanced-learn package
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Load participant metadata
metadata = pd.read_csv("metadata.csv")
# Map each participant ID to 1 (Spanish bilingual) or 0 (not)
label_lookup = metadata.set_index("participant")["spanish"].astype(int).to_dict()



## 2. ERP Feature Extraction
For each participant's EEG file, we compute the mean amplitude at each electrode over the 300–500 ms window (N400), a latency range associated with language processing.
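
The mapping from a latency window to row indices depends on the sampling rate and on where each epoch starts, neither of which is stated in this notebook. The sketch below shows the conversion under assumed values (a 256 Hz sampling rate and an epoch starting 100 ms before stimulus onset). Both parameters are hypothetical and should be checked against the dataset's documentation.

```python
def ms_to_row(t_ms, epoch_start_ms, sfreq_hz):
    """Convert a latency in milliseconds to the nearest sample (row) index."""
    return round((t_ms - epoch_start_ms) * sfreq_hz / 1000)


# Assumed recording parameters (not taken from the dataset):
# epoch starts 100 ms before stimulus onset, sampled at 256 Hz.
start_row = ms_to_row(300, epoch_start_ms=-100, sfreq_hz=256)
end_row = ms_to_row(500, epoch_start_ms=-100, sfreq_hz=256)
print(start_row, end_row)
```

With these assumed values the helper yields rows 102 and 154. That agreement does not confirm the real recording parameters.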

In [54]:
# Folder where CSVs are extracted
erp_folder = "EEG_Measurements"

# ERP row range for 300–500ms - key window for analysis (N400)
start_row, end_row = 102, 154

# Store results from each EEG file
feature_rows = []

for file in os.listdir(erp_folder):
    if not file.endswith(".csv"):
        continue
    # Keep only the two Spanish-English file types relevant to Spanish bilingualism
    if "spanish-english_translation" not in file and "spanish-english_unrelated" not in file:
        continue

    participant_id = int(file.split("_")[-1].split(".")[0])
    condition = "translation" if "translation" in file else "unrelated"
    word = file.split("_")[0].lower()

    df = pd.read_csv(os.path.join(erp_folder, file))
    df_window = df.iloc[start_row:end_row]  # rows in the target window (end_row is exclusive)
    avg_vals = df_window.mean(numeric_only=True)  # mean over time samples, one value per electrode

    row = {
        "participant": participant_id,
        "word": word,
        "condition": condition,
        "label": label_lookup.get(participant_id)
    }
    # Add the per-electrode mean amplitudes as features
    row.update(avg_vals.to_dict())
    feature_rows.append(row)

# Build the file-level DataFrame (one row per EEG file)
eeg_df = pd.DataFrame(feature_rows)
eeg_df


index,participant,word,condition,label,Fp1,Fpz,Fp2,F7,F3,Fz,...,CP6,P7,P3,Pz,P4,P8,POz,O1,Oz,O2
0,26,neck,translation,0,5.210872,2.048470,5.510847,2.388379,-0.412696,0.359448,...,2.227675,-2.022466,-1.663015,-1.767832,1.023924,2.051119,-1.598112,-3.911904,-0.480249,0.934310
1,30,chair,translation,0,1.747779,5.249679,-0.987821,13.110565,8.886251,8.683909,...,-0.354170,10.013846,4.719447,9.903816,0.822147,1.759042,1.399389,4.263879,-5.136157,-1.396459
2,37,lawyer,translation,0,-3.108701,-2.396498,-5.202152,0.764026,-1.856718,1.432957,...,-1.046628,-3.551156,-2.447772,-1.504248,-0.295588,1.050383,-2.344182,-2.752326,-2.273498,-2.131297
3,23,lawyer,translation,0,3.680699,2.403862,2.741656,1.000903,6.301402,6.010365,...,8.186695,9.558172,10.150296,11.268355,11.346663,7.420594,12.731699,5.942609,10.044575,9.149056
4,24,chair,translation,0,-0.381249,5.825562,6.655709,2.727421,-0.094796,3.416468,...,4.217539,2.855266,6.912092,6.669544,4.366648,2.912742,8.222605,3.941076,3.686921,3.455947
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1591,15,chair,translation,1,-7.261914,5.334718,-1.602899,-2.966757,-0.272582,-0.563816,...,-2.132245,8.551892,2.276207,-0.068733,1.393488,1.055554,2.253133,-0.765688,0.687861,2.420011
1592,38,flour,unrelated,0,0.419329,1.514714,2.164843,0.028895,1.717166,4.057281,...,2.864235,2.182354,2.507636,0.814641,3.199654,3.970851,2.739810,4.146101,3.845253,4.984962
1593,10,flour,unrelated,1,5.368442,5.700513,2.662180,2.766340,5.322301,4.792988,...,-1.079783,-0.614263,3.309904,2.961649,2.700342,-0.672318,3.435069,1.571088,2.245423,0.416753
1594,29,chair,translation,0,3.713223,3.206290,2.330899,4.901309,4.208454,2.074045,...,-0.474346,2.440704,4.269938,5.507411,2.686140,3.056172,2.572192,0.694999,1.009748,4.541374
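
The filename parsing in the extraction loop relies on string splitting, which silently accepts unexpected names. A stricter alternative is sketched below with a regular expression. The filename layout `<word>_spanish-english_<condition>_<participant>.csv` is an assumption inferred from the loop, and `parse_erp_filename` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumed layout: <word>_spanish-english_<condition>_<participant>.csv
FILENAME_RE = re.compile(
    r"^(?P<word>[^_]+)_spanish-english_"
    r"(?P<condition>translation|unrelated)_(?P<participant>\d+)\.csv$"
)


def parse_erp_filename(name):
    """Return word, condition and participant ID, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return {
        "word": match.group("word").lower(),
        "condition": match.group("condition"),
        "participant": int(match.group("participant")),
    }


# Example with a made-up filename
print(parse_erp_filename("Neck_spanish-english_translation_26.csv"))
print(parse_erp_filename("notes.txt"))
```

Returning `None` for non-matching names makes it easy to skip or log unexpected files instead of raising inside the loop.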


## 3. Data Exploration
We check class distribution and look for missing data.

In [55]:
eeg_df['label'].value_counts()

label
0    1208
1     388
Name: count, dtype: int64

In [56]:
eeg_df['condition'].value_counts()

condition
translation    1091
unrelated       505
Name: count, dtype: int64

In [57]:
# Drop rows with any missing value, then confirm that none remain
eeg_df = eeg_df.dropna()
eeg_df.isnull().sum()

participant    0
word           0
condition      0
label          0
Fp1            0
Fpz            0
Fp2            0
F7             0
F3             0
Fz             0
F4             0
F8             0
FC5            0
FC1            0
FC2            0
FC6            0
T7             0
C3             0
Cz             0
C4             0
T8             0
CP5            0
CP1            0
CP2            0
CP6            0
P7             0
P3             0
Pz             0
P4             0
P8             0
POz            0
O1             0
Oz             0
O2             0
dtype: int64

Next we collapse the data to one row per participant: the mean of each electrode's ERP feature across that participant's files, plus the target label.

In [58]:
# Define the EEG electrodes (exclude metadata columns)
electrode_cols = [col for col in eeg_df.columns if col not in ['participant', 'word', 'condition', 'label']]

# Group by participant, take the mean of all EEG signals
agg_df = eeg_df.groupby('participant')[electrode_cols + ['label']].mean().reset_index()

# Check output
print(agg_df.shape)
print(agg_df['label'].value_counts())


(40, 32)
label
0.0    30
1.0    10
Name: count, dtype: int64
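
The aggregation step can be illustrated on toy data. This is a minimal sketch with made-up values and only two electrodes. It also shows one way to keep the `participant` identifier out of the feature matrix, since `reset_index()` turns it back into an ordinary column.

```python
import pandas as pd

# Toy file-level data: two participants, two electrodes (values are made up)
trials = pd.DataFrame({
    "participant": [1, 1, 2, 2],
    "label": [0, 0, 1, 1],
    "Fz": [1.0, 3.0, -2.0, 0.0],
    "Pz": [0.5, 1.5, 4.0, 6.0],
})

electrodes = ["Fz", "Pz"]

# One row per participant: mean of each electrode, label carried along
per_participant = (
    trials.groupby("participant")[electrodes + ["label"]].mean().reset_index()
)

# Select only electrode columns as features so the ID is not used by a model
X = per_participant[electrodes]
y = per_participant["label"].astype(int)
print(per_participant)
```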


## 4. Train-Test Split
We split the 40 participants into training (75%) and test (25%) sets, stratified by label so that both sets contain bilingual participants.

In [59]:


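# Note: agg_df still contains the 'participant' ID column, so it is passed to the model as a feature here.
# Dropping it as well (columns=['participant', 'label']) would restrict X to the electrode features.
X = agg_df.drop(columns='label')
y = agg_df['label']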

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=42)
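
To see what `stratify=y` does with this class balance, the sketch below repeats the split on placeholder data with the same 30/10 label counts. The feature values are dummies, and only the label counts mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same class balance as the aggregated data: 30 non-bilingual, 10 bilingual
y = np.array([0] * 30 + [1] * 10)
X = np.arange(40).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The test set has 10 participants; stratification keeps the share of
# positives close to 25%, which cannot be matched exactly with 10 samples.
print(len(y_te), int(y_te.sum()))
```

Because only two or three bilingual participants land in the test set, every test metric for class 1 moves in large steps.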


## 5. Initial Model Training
We train a Random Forest classifier and evaluate the performance.

In [60]:


model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[8 0]
 [2 0]]
              precision    recall  f1-score   support

         0.0       0.80      1.00      0.89         8
         1.0       0.00      0.00      0.00         2

    accuracy                           0.80        10
   macro avg       0.40      0.50      0.44        10
weighted avg       0.64      0.80      0.71        10
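
The per-class scores in the report follow directly from the confusion matrix. The sketch below recomputes them by hand with a hypothetical helper, `per_class_metrics`. It treats an undefined ratio (zero denominator) as 0.0, which matches the values printed above.

```python
def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class; cm[i][j] = true class i predicted as j."""
    tp = cm[cls][cls]
    fp = sum(cm[row][cls] for row in range(len(cm))) - tp
    fn = sum(cm[cls]) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined when the class is never predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


cm = [[8, 0], [2, 0]]  # confusion matrix from the Random Forest above
print(per_class_metrics(cm, 0))
print(per_class_metrics(cm, 1))
```

Class 1 is never predicted, so its precision is 0/0. That is why it appears as 0.00 in the report.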





## 6. Handling Class Imbalance with SMOTE
Only 10 of the 40 participants are Spanish bilinguals, and the baseline model predicted the majority class (non-bilingual) for every test participant. We apply SMOTE to oversample the minority class in the training set and retrain the model.
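
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea on made-up 2-D points. It is not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical name.

```python
import numpy as np


def smote_like_sample(X_min, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))  # pick a random minority point
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # k nearest, skipping the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)


# Made-up minority samples in two dimensions
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_min)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points. Oversampling therefore adds no information beyond what the few real bilingual participants already provide.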

In [61]:

# Separate features and labels
X = agg_df.drop('label', axis=1)
y = agg_df['label']

# Split into train/test (stratified to preserve class ratio in test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# Apply SMOTE to training set only
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_bal, y_train_bal)

# Predict and evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[4 4]
 [1 1]]
              precision    recall  f1-score   support

         0.0       0.80      0.50      0.62         8
         1.0       0.20      0.50      0.29         2

    accuracy                           0.50        10
   macro avg       0.50      0.50      0.45        10
weighted avg       0.68      0.50      0.55        10



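With SMOTE the model now predicts both classes: it identifies 1 of the 2 bilingual test participants (recall 0.50). Overall accuracy drops from 0.80 to 0.50, however, because 4 of the 8 non-bilingual participants are now misclassified. With only 10 test participants, these estimates are very noisy.

## 7. Model Comparison
We compare three classifiers using 5-fold cross-validation and macro-averaged F1 on all 40 participants (without SMOTE).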

In [62]:

X = agg_df.drop(columns='label')
y = agg_df['label']

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression(random_state=42, max_iter=1000))
    ])
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
    print(f"{name}: Avg F1 (macro) = {scores.mean():.3f} ± {scores.std():.3f}")


Random Forest: Avg F1 (macro) = 0.390 ± 0.047
Gradient Boosting: Avg F1 (macro) = 0.351 ± 0.053
Logistic Regression: Avg F1 (macro) = 0.452 ± 0.182


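Of the three models, Logistic Regression has the highest mean macro F1 (0.452), but also the largest spread across folds (±0.182), and all three models score below 0.5. We treat it as our best candidate for predicting whether a participant outside our dataset is a Spanish bilingual, though with 40 participants the differences between models may not be meaningful.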
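
The large spread across folds is easier to understand by looking at what each fold contains. When `cv=5` is passed with a classifier, `cross_val_score` uses stratified folds. The sketch below reproduces that split on placeholder data with the same 30/10 label counts (the features are dummies).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same class balance as the aggregated data
y = np.array([0] * 30 + [1] * 10)
X = np.zeros((40, 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
fold_positives = []
for _, test_idx in skf.split(X, y):
    fold_sizes.append(len(test_idx))
    fold_positives.append(int(y[test_idx].sum()))

print(fold_sizes, fold_positives)
```

Each test fold holds 8 participants, of whom 2 are bilingual. A single bilingual participant classified differently therefore changes class-1 recall in that fold by 0.5.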

## 8. Conclusion
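- ERP signals in the 300–500 ms window may carry information about Spanish bilingualism, but the evidence here is weak: the best cross-validated macro F1 was 0.452.
- Class imbalance was a major issue. SMOTE let the Random Forest detect bilingual participants (test recall rose from 0.00 to 0.50), at the cost of overall accuracy (0.80 to 0.50).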
- Future steps: try other classifiers, feature selection, or dimensionality reduction.
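
As one possible starting point for the dimensionality-reduction idea, the sketch below puts PCA in front of Logistic Regression inside a single pipeline, so the projection is refit on each training fold. The data are random stand-ins with the same shape as the electrode features (40 × 30). The choice of 5 components is arbitrary, and the resulting score says nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 40 x 30 electrode feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = np.array([0] * 30 + [1] * 10)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),  # arbitrary number of components
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Scaling and PCA are fit inside each training fold, avoiding leakage
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
print(f"Avg F1 (macro) = {scores.mean():.3f} ± {scores.std():.3f}")
```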