# Emotion Prediction Notebook

This notebook guides you through building and evaluating an emotion prediction model from text data. It includes:

- **Data Preparation:** Loading a CSV of texts with emotion labels, balancing the dataset, and generating text embeddings.
- **Model Development:** Initializing the model, defining its architecture, training it, and evaluating its performance.
- **Prediction & Visualization:** Running predictions on sample texts and visualizing the results using polar charts.




----

### Imports and Initial Setup
This section imports the required libraries and modules, including utilities for data handling, embedding generation, and model evaluation. It also defines a helper function and the Pydantic data models, and sets the paths, filenames, and parameters for the dataset and model configuration. Key tasks include:
- Defining `prep_data`, which reads a CSV file with emotion labels and texts and builds a balanced dataset by sampling an equal number of examples per label.
- Defining data models (`Run`, `Report`, `Task`) to structure the data.
- Setting file paths for intermediate outputs (embeddings, labels) and mapping conditions and colors.
- Loading previously saved tasks and reports from pickle files.
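
The balancing step can be tried on a toy DataFrame first. The following is a minimal sketch of the same group-then-sample idea used by `prep_data`; the column names `text` and `label` match the notebook, while the `balance` helper and the toy data are made up for illustration.

```python
import pandas as pd


def balance(df: pd.DataFrame, samples: int, seed: int = 42) -> pd.DataFrame:
    """Return about `samples` rows with the same number of rows per label."""
    per_label = samples // df["label"].nunique()
    # GroupBy.sample draws `n` rows from every group; without replacement it
    # raises ValueError if any group has fewer than `n` rows.
    return df.groupby("label").sample(n=per_label, random_state=seed)


# Toy data (hypothetical): three labels with 6, 4, and 2 examples.
toy = pd.DataFrame({
    "text": [f"text {i}" for i in range(12)],
    "label": [0] * 6 + [1] * 4 + [2] * 2,
})

balanced = balance(toy, samples=6)
print(balanced["label"].value_counts().to_dict())
```

Because of the integer division, the result can contain fewer than `samples` rows when `samples` is not a multiple of the number of labels.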

In [None]:
#####
# Imports
#####
from pydantic import BaseModel
from typing import List, Optional
import os
import pandas as pd
import pickle
import numpy as np
import scipy.stats as st
import plotly.graph_objects as go
import utils
from importlib import reload
from utils import generate_embedding
import filter  # project module that provides EmotionPredictor; note that this name shadows the built-in filter()


def prep_data(training_data, samples):
    """Load the emotion CSV and return label-balanced lists of texts and labels."""
    df = pd.read_csv(training_data)

    # Determine the number of unique labels.
    num_labels = df['label'].nunique()

    # Compute the number of samples per label.
    samples_per_label = samples // num_labels

    # Sample the same number of rows from each label group to get a balanced dataset.
    # Raises ValueError if any label has fewer than samples_per_label rows.
    balanced_df = df.groupby('label', group_keys=False).apply(
        lambda group: group.sample(n=samples_per_label, random_state=42)
    )

    # Extract texts and labels.
    balanced_texts = balanced_df['text'].tolist()
    balanced_labels = balanced_df['label'].tolist()
    return balanced_texts, balanced_labels

#####
# Datamodels
#####
class Run(BaseModel):
    run_number: int
    report: str
    embeddings: Optional[List[float]] = None

class Report(BaseModel):
    name: str
    run: List[Run]
    condition: List[str]

class Task(BaseModel):
    task_name: str
    description: str
    embeddings: Optional[List[float]] = None

#####
# Paths & Naming
#####
total_samples = 500
embedded_reports = "data/interim/reports.pkl"
embedded_tasks = "data/interim/tasks.pkl"
training_data = "data/raw/emotions.csv"
task_names = ["Gehen","Schreibtisch","Tisch"]
embeddings_path = f"data/model/input/emotions_embeddings_{total_samples}.pkl"
labels_path = f"data/model/labels/emotions_labels_{total_samples}.pkl"
conditions_map = {
    1: "complete",
    2: "incomplete",
    3: "interrupted"
}
colors = [
    "rgba(255, 0, 0, 0.6)", # red
    "rgba(0, 255, 0, 0.6)", # green
    "rgba(0, 0, 255, 0.6)" # blue
    ]
# Define emotion names (order must match the predictor's output).
emotion_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
with open(embedded_tasks, "rb") as f:
    tasks_data = pickle.load(f)
tasks = [Task(**data) for data in tasks_data]

with open(embedded_reports, "rb") as f:
    report_data = pickle.load(f)
reports = [Report(**data) for data in report_data]


----

### Data Preparation and Embedding Generation
This cell builds the balanced texts and labels with the `prep_data` helper defined above. If the embeddings file already exists, it is loaded from disk; otherwise embeddings are generated in batches with `utils.batch_generate_embeddings` and saved. The labels are cached to disk in the same way:
- Reading and balancing the dataset from the CSV file.
- Generating text embeddings if they are not already computed.
- Saving the generated embeddings and labels to disk for future use.
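
The load-or-compute pattern described above can be factored into a small helper. This is a sketch only: `load_or_compute` and `fake_embeddings` are hypothetical names, and the sketch is synchronous, whereas the notebook awaits `utils.batch_generate_embeddings`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled object from `path`, or compute it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    # Create the target directory if it does not exist yet.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_embeddings():
    """Stand-in for an expensive embedding call; records each invocation."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache", "embeddings.pkl")
    first = load_or_compute(cache_path, fake_embeddings)   # computes and saves
    second = load_or_compute(cache_path, fake_embeddings)  # loads from disk
```

Note that a cache keyed only by file path will silently return stale data if the input texts change, so the cache file should be deleted (or the filename changed) whenever the dataset or sample size changes.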

In [None]:
balanced_texts, balanced_labels = prep_data(training_data=training_data,samples=total_samples)
if os.path.exists(embeddings_path):
    with open(embeddings_path, "rb") as f:
        emotions_embeddings = pickle.load(f)
else:
    # Generate embeddings for the balanced texts.
    emotions_embeddings = await utils.batch_generate_embeddings(balanced_texts)
    with open(embeddings_path, "wb") as f:
        pickle.dump(emotions_embeddings, f)

if os.path.exists(labels_path):
    with open(labels_path, "rb") as f:
        emotions_labels = pickle.load(f)
else:
    # No cached labels yet: use the freshly balanced labels and save them.
    emotions_labels = balanced_labels
    with open(labels_path, "wb") as f:
        pickle.dump(emotions_labels, f)

----


### Model Initialization, Architecture Definition, Training, and Evaluation
Here, the emotion prediction model is set up. The workflow includes:
- Instantiating the `EmotionPredictor` with the computed embeddings and corresponding labels.
- Defining the model architecture with specified hidden layer sizes.
- Training the model using a predefined number of epochs and learning rate.
- Evaluating the model's performance on a test set after training.
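
The internals of `EmotionPredictor` are not shown in this notebook. As a rough illustration of what a network with three hidden layers of sizes 500, 150, and 25 and six output classes might compute, here is a forward pass in plain NumPy. The ReLU activations, the softmax output, the random initialization, and the input dimension of 8 are all assumptions made for this sketch, not details taken from the `filter` module.

```python
import numpy as np


def init_mlp(sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """Run a batch through the hidden layers (ReLU) and a softmax output layer."""
    h = x
    for weights, bias in params[:-1]:
        h = np.maximum(h @ weights + bias, 0.0)  # ReLU
    weights, bias = params[-1]
    logits = h @ weights + bias
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


# Input size 8 is a placeholder for the real embedding dimension.
params = init_mlp([8, 500, 150, 25, 6])
probs = forward(params, np.ones((1, 8)))
print(probs.shape)
```

Each output row sums to 1, which is why the later cells can treat it as a probability distribution over the six emotions.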

In [None]:
# Initialize the EmotionPredictor with the embeddings and labels.
predictor = filter.EmotionPredictor(emotions_embeddings, balanced_labels)

# Define the model architecture.
predictor.define_model(hidden_size1=500, hidden_size2=150, hidden_size3=25, model_path="./emotions_model.pth")

# Train the model.
predictor.train_model(epochs=200, lr=0.01)

# Evaluate the model on the test set.
predictor.evaluate_model()


----

### Running a Prediction Example
This segment demonstrates how to use the trained model to predict emotion probabilities for a single text report:
- An embedding from a specific run in the reports data is extracted.
- The predictor uses this embedding to generate probabilities for each emotion.
- The predicted probabilities are printed alongside the corresponding text report, mapping each value to a particular emotion (e.g., sadness, joy, love, etc.).
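
The mapping from output positions to emotion names can also be written without hard-coded indices. This is a small sketch with made-up probabilities; it assumes the prediction has shape `(1, 6)` and that the order matches `emotion_names`.

```python
emotion_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Made-up values standing in for predictor.predict(embedding); assumed shape (1, 6).
probabilities = [[0.05, 0.60, 0.10, 0.05, 0.15, 0.05]]

# Pair each emotion name with its probability.
scores = dict(zip(emotion_names, probabilities[0]))
top_emotion = max(scores, key=scores.get)

for name, p in scores.items():
    print(f"{name}: {p:.2f}")
print("most likely:", top_emotion)
```

Pairing names and values with `zip` keeps the printout correct if the list of emotions is ever reordered or extended.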

In [None]:

embedding = reports[2].run[1].embeddings
probabilities = predictor.predict(embedding)
print(f"""
Predicted probabilities for each emotion for the following text:
{reports[2].run[1].report}

sadness: {probabilities[0][0]}, 
joy: {probabilities[0][1]},
love: {probabilities[0][2]},
anger: {probabilities[0][3]},
fear: {probabilities[0][4]},
surprise: {probabilities[0][5]}  
""")

----


### Aggregating Predictions by Condition
In this cell, the code iterates over the report data to collect and aggregate the emotion prediction probabilities based on different conditions (e.g., "complete", "incomplete", "interrupted"):
- For each report, the corresponding condition is identified.
- For each run within a report, the prediction probabilities are computed, provided the run has an embedding and a non-empty report text.
- These probabilities are grouped in a dictionary keyed by the condition, setting the stage for later analysis and visualization.
- An array of emotion names is prepared and the conditions are sorted for consistency.
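
The grouping logic above can be sketched on toy data with `collections.defaultdict`, which removes the need to check whether a key already exists. The `toy_reports` structure is a hypothetical stand-in for the `Report` objects, with two-element vectors in place of real predictions.

```python
from collections import defaultdict

# Hypothetical stand-ins for reports: (condition, per-run probability vectors).
toy_reports = [
    ("complete", [[0.7, 0.3], [0.6, 0.4]]),
    ("interrupted", [[0.2, 0.8]]),
    ("complete", [[0.5, 0.5]]),
]

# Missing keys start out as empty lists automatically.
grouped = defaultdict(list)
for condition, runs in toy_reports:
    grouped[f"Condition {condition}"].extend(runs)

# Sort the keys so later plots always use the same condition order.
sorted_keys = sorted(grouped)
print(sorted_keys)
```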


In [None]:
# Dictionary to collect probability distributions per condition.
# Key: condition (e.g. "Condition complete"), Value: list of probability arrays for each run.
condition_probabilities = {}

for report in reports:
    # Group each report under a single condition: the first entry in its condition list.
    cond_key = f"Condition {report.condition[0]}"
    if cond_key not in condition_probabilities:
        condition_probabilities[cond_key] = []
        
    for run in report.run:
        if run.embeddings is not None and run.report.strip() != "":
            emb = np.array(run.embeddings)
            probs = predictor.predict(emb)  # This returns a probability distribution
            # Append the probabilities to the list for this condition
            condition_probabilities[cond_key].append(probs)
angles = emotion_names + [emotion_names[0]]
sorted_conditions = sorted(condition_probabilities.keys())

----


### Visualizing Mean and Variability of Emotion Predictions
The final section creates a polar chart visualization using Plotly to display the average emotion probabilities and their variability (standard deviation) for each condition:
- Subplots are generated for each condition.
- For each condition, the mean and standard deviation of the probabilities across runs are calculated.
- A shaded polygon (±1 standard deviation) is added to the plot for each condition to visually represent the variability.
- The mean probabilities are plotted as a line with markers on a polar coordinate system.
- The layout of the polar axes is adjusted to have a consistent range, ensuring that the visualization is clear and comparable across conditions.
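
The statistics and the polygon construction can be checked separately from the plotting. The sketch below uses made-up predictions for three placeholder categories and builds the same arrays that are later passed to `go.Scatterpolar`; the helper name `close_polygon` is hypothetical.

```python
import numpy as np

# Made-up predictions: 2 runs x 3 categories.
runs = np.array([
    [0.1, 0.5, 0.4],
    [0.3, 0.3, 0.4],
])

mean = runs.mean(axis=0)
std = runs.std(axis=0)  # NumPy default is the population std (ddof=0)


def close_polygon(values):
    """Repeat the first value at the end so the polar trace closes."""
    return np.append(values, values[0])


upper = close_polygon(mean + std)
lower = close_polygon(mean - std)

labels = ["a", "b", "c"]
theta = labels + labels[:1]

# Shaded band: walk along the upper bound, then back along the lower bound.
fill_r = np.concatenate([upper, lower[::-1]])
fill_theta = theta + theta[::-1]
```

With only a few runs per condition, `mean - std` can drop below zero; the fixed radial range of `[0, 1]` used in the chart simply cuts that part of the band off.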


In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define emotion names (order must match the predictor's output).
emotion_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
# Create angles; repeat the first emotion to close the polygon.
angles = emotion_names + [emotion_names[0]]
sorted_conditions = sorted(condition_probabilities.keys())

# Create a subplot with one polar chart per condition.
fig = make_subplots(
    rows=1, cols=len(sorted_conditions),
    specs=[[{'type': 'polar'}] * len(sorted_conditions)],
    subplot_titles=[f"{cond}" for cond in sorted_conditions]
)

# Define one color per condition (the palette repeats if there are more than three).
palette = ["red", "blue", "green"]
colors = {cond: palette[idx % len(palette)] for idx, cond in enumerate(sorted_conditions)}

for i, cond in enumerate(sorted_conditions):
    # Convert the list of probability arrays to a NumPy array.
    data = np.array(condition_probabilities[cond])

    # predictor.predict appears to return a 2-D array per run (it is indexed as
    # probabilities[0][...] in the prediction example), so data has shape
    # (n_runs, k, n_emotions). Keep only the first row: (n_runs, n_emotions).
    if data.ndim == 3:
        data = data[:, 0, :]

    # Compute mean and standard deviation for each emotion.
    mean_values = data.mean(axis=0)   # shape: (n_emotions,)
    std_values = data.std(axis=0)     # shape: (n_emotions,)

    # Close the polygons by appending the first value at the end.
    mean_closed = np.append(mean_values, mean_values[0])
    # Compute the upper and lower bounds for the fill.
    upper_bound = mean_values + std_values
    lower_bound = mean_values - std_values
    # Close the bounds.
    upper_closed = np.append(upper_bound, upper_bound[0])
    lower_closed = np.append(lower_bound, lower_bound[0])
    
    # Build a polygon for the shaded area (upper bound then reversed lower bound).
    fill_r = np.concatenate([upper_closed, lower_closed[::-1]])
    fill_theta = np.concatenate([angles, angles[::-1]])
    
    # Add the shaded area for ±1 standard deviation.
    fig.add_trace(go.Scatterpolar(
        r=fill_r,
        theta=fill_theta,
        fill='toself',
        fillcolor=colors.get(cond, "black"),
        opacity=0.2,
        line=dict(color='rgba(0,0,0,0)'),
        showlegend=False,
        name=f'{cond} Std'
    ), row=1, col=i+1)
    
    # Add the mean line.
    fig.add_trace(go.Scatterpolar(
        r=mean_closed,
        theta=angles,
        mode='lines+markers',
        name=f'{cond} Mean',
        line=dict(color=colors.get(cond, "black"))
    ), row=1, col=i+1)

# Use the same fixed radial range for every subplot so the charts are comparable.
for i in range(1, len(sorted_conditions) + 1):
    polar_id = f"polar{i}" if i > 1 else "polar"
    fig.update_layout({
        polar_id: dict(
            radialaxis=dict(
                range=[0, 1],  # Adjust this range if needed.
                autorange=False
            )
        )
    })

fig.show()


----