# **Analyzing Task-Report Similarities Using Text Embeddings and Cosine Similarity**

## **Description**
This notebook processes task descriptions and experimental reports, extracts text embeddings, and computes cosine similarity scores between tasks and reports. The similarity scores are grouped by experimental conditions and visualized using Kernel Density Estimation (KDE) to explore patterns in textual similarity.

## **Rationale**
The goal of this analysis is to assess the relationship between task descriptions and reports generated during experiments. By embedding textual data and using cosine similarity, we can quantify how closely a report aligns with predefined tasks under different experimental conditions. This approach enables:
- Automated comparison of structured and unstructured text data.
- Identification of condition-dependent text similarities.
- Visualization of similarity distributions for interpretability.

The workflow follows a structured approach:
1. **Data Loading**: Extracts task descriptions and experimental reports from `.docx` files.
2. **Data Processing**: Cleans and organizes text, identifying experimental runs.
3. **Embedding Generation**: Converts text into vector embeddings using OpenAI's `text-embedding-large` model.
4. **Similarity Computation**: Measures cosine similarity between task and report embeddings.
5. **Analysis & Visualization**: Groups similarities by experimental condition and visualizes distributions.


### **Imports and Data Models**
This cell initializes the notebook by importing essential libraries, defining helper functions, and structuring data models.  
- **Imports:** Standard-library modules (`os`, `re`, `pickle`), scientific packages (`numpy`, `scipy.stats`, `pandas`), and further dependencies (`pydantic`, `docx` from the `python-docx` package, `plotly.graph_objects`, and a custom `utils` module).  
- **Helper Function:** Implements `cosine_similarity()`, which returns the cosine of the angle between two vectors, or `0.0` if either vector has zero norm.  
- **Data Models:**  
  - `Run`: Represents an experimental run with a run number, report text, and optional embeddings.  
  - `Report`: Defines a collection of runs associated with a specific report name and conditions.  
  - `Task`: Stores task-related information, including a name, description, and optional embeddings.  
- **Paths & Naming:** Defines paths for raw and interim data storage, task names, a mapping of condition labels, and colors for visualization.

In [18]:
#####
# Imports
#####
from pydantic import BaseModel
from typing import List, Optional
import os
from docx import Document
import re
import utils
import pandas as pd
import pickle
import numpy as np
import scipy.stats as st
import plotly.graph_objects as go

#####
# Helper functions
#####
def cosine_similarity(vec1, vec2):
    """Compute the cosine similarity between two vectors."""
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    norm1 = np.linalg.norm(v1)
    norm2 = np.linalg.norm(v2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return np.dot(v1, v2) / (norm1 * norm2)

#####
# Datamodels
#####
class Run(BaseModel):
    run_number: int
    report: str
    embeddings: Optional[List[float]] = None

class Report(BaseModel):
    name: str
    run: List[Run]
    condition: List[str]

class Task(BaseModel):
    task_name: str
    description: str
    embeddings: Optional[List[float]] = None

#####
# Paths & Naming
#####

condition_path = "data/raw/conditions.xlsx"
task_directory = "data/raw/tasks"
report_directory = "data/raw/reports"
embedded_reports = "data/interim/reports.pkl"
embedded_tasks = "data/interim/tasks.pkl"
task_names = ["Gehen","Schreibtisch","Tisch"]
conditions_map = {
    1: "complete",
    2: "incomplete",
    3: "interrupted"
}
colors = [
    "rgba(255, 0, 0, 0.6)", # red
    "rgba(0, 255, 0, 0.6)", # green
    "rgba(0, 0, 255, 0.6)" # blue
    ]
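
As a quick sanity check of the helper's behaviour, the following is a minimal, dependency-free sketch of the same formula (dot product divided by the product of the norms). The name `cosine_similarity_plain` and the toy vectors are illustrative assumptions and not part of the notebook.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Pure-Python counterpart of the notebook's cosine_similarity() helper."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    # Same convention as the helper: a zero vector yields a similarity of 0.0
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Vectors pointing the same way score close to 1, orthogonal vectors score 0
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
zero_vector = cosine_similarity_plain([0.0, 0.0], [1.0, 1.0])
```

Because the score depends only on direction, scaling a vector (for example, a longer text with a larger-norm embedding) does not change the result.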


-----

### **Load and Filter Condition Data**
- Reads an Excel file (`conditions.xlsx`) containing experimental condition data.  
- Filters the dataset to include only entries where the `Experimentator` is `"Maren"` and the `Metric` is `"full_text"`.  
- Creates a list of records (`condition_dict`), one dictionary per row with the keys `Proband` (participant), `Task`, and `Condition`.

In [None]:
df = pd.read_excel(condition_path, sheet_name='table incl full_text')
df_restricted = df[(df['Experimentator'] == 'Maren') & (df['Metric'] == 'full_text')]
condition_dict = df_restricted[['Proband', 'Task', 'Condition']].to_dict(orient='records')
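
Despite its name, `condition_dict` is a list with one dictionary per spreadsheet row. The sketch below shows that shape and the per-participant lookup applied to it later in the notebook. The participant numbers, task assignments, and the helper name `conditions_for` are made up for illustration; the real values come from `conditions.xlsx`.

```python
# Hypothetical records in the shape produced by to_dict(orient="records")
condition_dict = [
    {"Proband": 21, "Task": "Gehen", "Condition": 1},
    {"Proband": 21, "Task": "Schreibtisch", "Condition": 3},
    {"Proband": 21, "Task": "Tisch", "Condition": 2},
    {"Proband": 22, "Task": "Gehen", "Condition": 2},
]
conditions_map = {1: "complete", 2: "incomplete", 3: "interrupted"}

def conditions_for(proband, records):
    """Return the condition codes of one participant, in row order."""
    return [row["Condition"] for row in records if row["Proband"] == proband]

codes = conditions_for(21, condition_dict)
labels = [conditions_map[code] for code in codes]
```

The lookup preserves row order, so the order of rows in the sheet determines which condition ends up at which list position.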

----


### **Load and Process Task Descriptions**
- Lists all `.docx` files in the `task_directory`.  
- Iterates through each task document, extracts its text, and stores it in a `Task` object.  
- The `Task` objects are collected in a list; each is paired with a name from `task_names` by position, so the file order must match the order of `task_names`.


In [None]:
# List all .docx files in the directory, sorted because os.listdir returns entries in arbitrary order
task_files = sorted(f for f in os.listdir(task_directory) if f.endswith('.docx'))
tasks = []
# Iterate over each file and extract text
for idx, file in enumerate(task_files, 0):
    file_path = os.path.join(task_directory, file)
    doc = Document(file_path)
    # Join paragraphs with newlines so text at paragraph boundaries does not run together
    text = "\n".join(para.text for para in doc.paragraphs)
    # Pair the document with its task name by position
    task = Task(task_name=task_names[idx], description=text)
    tasks.append(task)
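
Pairing documents with names by position fails silently if the two lists differ in length or order. The following sketch illustrates the pairing on plain lists of paragraph strings and adds an explicit length check. The function `build_task_records`, the dictionary layout, and the paragraph texts are assumptions for illustration only.

```python
def build_task_records(paragraphs_per_file, names):
    """Pair each document's paragraphs with a task name by position."""
    if len(paragraphs_per_file) != len(names):
        raise ValueError("number of task files and task names must match")
    return [
        {"task_name": name, "description": "\n".join(paragraphs)}
        for name, paragraphs in zip(names, paragraphs_per_file)
    ]

# Made-up paragraph texts standing in for doc.paragraphs of two files
docs = [["Walk to the door.", "Return to the chair."], ["Tidy the desk."]]
records = build_task_records(docs, ["Gehen", "Schreibtisch"])
```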


----

### **Load and Process Reports**
- Scans `report_directory` for subdirectories whose names start with `"M"` and collects the `.docx` report files they contain.  
- Extracts the participant identifier (`Proband`) as the first run of digits in the file path.  
- Iterates through each document, identifying experimental runs using specific markers (e.g., `"INT-Pb21-W1"`).  
- Splits the text into runs and maps them to their corresponding participant conditions using `condition_dict`.  
- Constructs a `Report` object with the extracted runs and associated conditions, appending it to a list.

In [None]:
# List all subdirectories in the report directory
subdirs = [d for d in os.listdir(report_directory) if os.path.isdir(os.path.join(report_directory, d))]
# List all .docx files in the subdirectories that start with "M"
report_files = [
    f"{report_directory}/{subdir}/{f}" # File path
    for subdir in subdirs # Subdirectory
    for f in os.listdir(os.path.join(report_directory, subdir)) # File
    if f.endswith('.docx') and subdir.startswith("M") # Filter
    ]
reports = []
for file_path in report_files:
    doc = Document(file_path)
    report_name = re.search(r'\d+', file_path).group()
    runs = []
    current_run_text = []
    current_run_number = None
    for para in doc.paragraphs:
        text = para.text.strip()
        # Check if the paragraph is a run marker (e.g., "INT-Pb21-W1")
        if text.startswith("INT-") and "-W" in text:
            # If there's an active run, save it before starting a new one
            if current_run_number is not None:
                raw_text = "\n".join(current_run_text).strip()
                runs.append(Run(
                    run_number=current_run_number,
                    report=raw_text, 
                ))
            # Extract the run number from the marker using regex
            match = re.search(r'-W(\d+)', text)
            current_run_number = int(match.group(1)) if match else None
            current_run_text = []  # Reset the run text accumulator
        else:
            # Otherwise, accumulate text for the current run
            current_run_text.append(text)
    
    # Add the last run if it exists
    if current_run_number is not None and current_run_text:
        runs.append(Run(
            run_number=current_run_number,
            report="\n".join(current_run_text).strip()
        ))
    conditions = [entry["Condition"] for entry in condition_dict if entry["Proband"] == int(report_name)]
    # Create the Report object with the full condition labels of the first three entries
    report = Report(name=report_name, run=runs, condition=[conditions_map[num] for num in conditions[:3]])
    reports.append(report)
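
The run-splitting logic can be tried without any `.docx` files. The sketch below applies the same marker rule to a plain list of paragraph strings; the function name `split_runs`, the tuple output, and the sample paragraphs are illustrative assumptions.

```python
import re

RUN_MARKER = re.compile(r"-W(\d+)")

def split_runs(paragraphs):
    """Group paragraph texts into (run_number, text) pairs by run markers."""
    runs = []
    current_number = None
    current_text = []
    for raw in paragraphs:
        text = raw.strip()
        if text.startswith("INT-") and "-W" in text:
            # Close the previous run before starting a new one
            if current_number is not None:
                runs.append((current_number, "\n".join(current_text).strip()))
            match = RUN_MARKER.search(text)
            current_number = int(match.group(1)) if match else None
            current_text = []
        else:
            current_text.append(text)
    # Flush the final run, which has no marker after it
    if current_number is not None and current_text:
        runs.append((current_number, "\n".join(current_text).strip()))
    return runs

paragraphs = [
    "Header text before any marker",
    "INT-Pb21-W1",
    "First sentence.",
    "Second sentence.",
    "INT-Pb21-W2",
    "Only sentence.",
]
runs = split_runs(paragraphs)
```

Note that any text before the first marker is discarded, because the accumulator is reset when the first marker is seen.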



----

### **Generate and Store Text Embeddings**
- Extracts task descriptions and asynchronously generates their embeddings using `utils.batch_generate_embeddings()`.  
- Iterates through each `Report`, processing its runs to generate embeddings for the run text.  
- Saves the processed `reports` and `tasks` as pickled files (`reports.pkl`, `tasks.pkl`) for later use.
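
The helper `utils.batch_generate_embeddings()` is not shown in this notebook. As a rough sketch of how such a helper could batch requests concurrently while keeping results in input order, consider the following. Everything here is an assumption: the name `batch_embed_sketch`, the injected `embed_fn` callable, the batch size, and the `fake_embed` stand-in, which replaces a real embedding API call.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature: an async function mapping a batch of texts to vectors
EmbedFn = Callable[[List[str]], Awaitable[List[List[float]]]]

async def batch_embed_sketch(
    texts: List[str], embed_fn: EmbedFn, batch_size: int = 2
) -> List[List[float]]:
    """Embed texts in concurrent batches and return vectors in input order."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    # asyncio.gather returns results in the order of the awaitables passed in
    results = await asyncio.gather(*(embed_fn(batch) for batch in batches))
    return [vector for batch in results for vector in batch]

async def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one 2-d vector per text."""
    await asyncio.sleep(0)
    return [[float(len(text)), 1.0] for text in batch]

vectors = asyncio.run(batch_embed_sketch(["a", "bb", "ccc"], fake_embed))
```

Inside a notebook, which already runs an event loop, the call would be `await batch_embed_sketch(...)` rather than `asyncio.run(...)`.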


In [None]:
# Process Task embeddings
task_texts = [task.description for task in tasks]
# Generate embeddings asynchronously for all tasks at once
tasks_embeddings = await utils.batch_generate_embeddings(task_texts)
# Update each Task with its corresponding embedding
for task, emb in zip(tasks, tasks_embeddings):
    task.embeddings = emb

# Process Report Run embeddings
for report in reports:
    run_texts = [run.report for run in report.run]
    if run_texts:  # Only process if there are runs in the report
        runs_embeddings = await utils.batch_generate_embeddings(run_texts)
        for run, emb in zip(report.run, runs_embeddings):
            run.embeddings = emb

# Cache the embedded reports and tasks
with open(embedded_reports, "wb") as f:
    pickle.dump([report.model_dump() for report in reports], f)

with open(embedded_tasks, "wb") as f:
    pickle.dump([task.model_dump() for task in tasks], f)


----

### **Compute Similarity Scores**
- Iterates through reports and their corresponding runs.  
- Computes cosine similarity scores between each run's embeddings and task embeddings.  
- Groups the scores by the condition that `report.condition` lists at the task's index, pooling scores across tasks and participants.

In [None]:
# Pool similarity scores across all tasks, grouped by the condition assigned to each task.
similarity_by_condition = {}

for report in reports:
    for run in report.run:
        # Skip runs with empty text or missing embeddings.
        if not run.report.strip() or run.embeddings is None:
            continue
        for idx, task in enumerate(tasks):
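            # Skip tasks without embeddings or without a matching condition entry.
            if task.embeddings is None or idx >= len(report.condition):
                continue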
            score = cosine_similarity(run.embeddings, task.embeddings)
            similarity_by_condition.setdefault(report.condition[idx], []).append(score)
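
To make the grouping concrete, here is a toy version of the same triple loop on made-up two-dimensional unit vectors, for which the dot product equals the cosine similarity. The data layout (plain dictionaries and lists) and all values are illustrative assumptions.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

# Made-up 2-d unit embeddings; real embeddings have far more dimensions
task_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reports = [
    {"condition": ["complete", "interrupted"], "runs": [[1.0, 0.0], [0.0, 1.0]]},
    {"condition": ["interrupted", "complete"], "runs": [[1.0, 0.0]]},
]

similarity_by_condition = {}
for report in reports:
    for run_embedding in report["runs"]:
        for idx, task_embedding in enumerate(task_embeddings):
            score = dot(run_embedding, task_embedding)
            # The task index selects the condition label for this score
            similarity_by_condition.setdefault(report["condition"][idx], []).append(score)
```

Every run contributes one score per task, so each condition's list mixes scores from runs that match the task with scores from runs that do not.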



----

### **Visualize Similarity Distributions**
- Initializes a Plotly figure to visualize the cosine similarity distributions by condition.  
- Uses Gaussian Kernel Density Estimation (KDE) to estimate similarity score distributions.  
- Min-max scales each curve to a peak height of 0.8, so the curves are comparable in shape but no longer integrate to 1.  
- Assigns colors to each condition and plots filled curves representing similarity distributions.  
- Displays the final interactive figure.

In [19]:
fig = go.Figure()
# Define colors for conditions.
color_map = {cond:colors[i] for i, cond in enumerate(conditions_map.values())}
# For each condition, compute the KDE and add a filled trace.
for cond, sim_list in similarity_by_condition.items():
    # gaussian_kde needs at least two data points to estimate a bandwidth
    if len(sim_list) < 2:
        continue
    kde = st.gaussian_kde(sim_list)
    x_vals = np.linspace(0, 1, 200)
    y_vals = kde(x_vals)
    # Min-max scale the density to a peak height of 0.8 for visual consistency
    y_range = np.max(y_vals) - np.min(y_vals)
    if y_range == 0:
        continue  # flat curve on [0, 1]; nothing meaningful to draw
    y_vals_norm = (y_vals - np.min(y_vals)) / y_range * 0.8
    
    fig.add_trace(go.Scatter(
        x=x_vals,
        y=y_vals_norm,
        mode="lines",
        line_shape="spline",
        fill="tozeroy",
        line=dict(color=color_map.get(cond), width=2),
        fillcolor=color_map.get(cond),
        name=cond
    ))

# Update layout.
fig.update_layout(
    title="Cosine Similarity Distributions by Condition",
    xaxis_title="Cosine Similarity",
    yaxis_title="Normalized Density",
    template="plotly_white",
    width=800,
    height=600,
    margin=dict(l=50, r=50, t=100, b=50)
)
fig.update_xaxes(range=[0, 1])
fig.show()
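
To show what the KDE and the scaling step compute, the sketch below evaluates a one-dimensional Gaussian KDE by hand, using Scott's rule for the bandwidth (the default rule in `scipy.stats.gaussian_kde`), and then applies the same min-max scaling. The function names and the similarity scores are made up for illustration.

```python
import math
import statistics

def gaussian_kde_1d(data, xs):
    """Evaluate a 1-d Gaussian KDE with Scott's bandwidth at the points xs."""
    n = len(data)
    # Scott's rule in one dimension: bandwidth = sample std * n ** (-1/5)
    bandwidth = statistics.stdev(data) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm
        for x in xs
    ]

def scale_to_peak(ys, peak=0.8):
    """Min-max scale ys so the lowest value is 0 and the highest is peak."""
    low, high = min(ys), max(ys)
    return [(y - low) / (high - low) * peak for y in ys]

scores = [0.42, 0.45, 0.47, 0.51, 0.55]  # made-up similarity scores
xs = [i / 199 for i in range(200)]        # 200 evenly spaced points on [0, 1]
density = gaussian_kde_1d(scores, xs)
scaled = scale_to_peak(density)
```

The unscaled `density` integrates to roughly 1 over the grid, whereas `scaled` only preserves the shape of the curve. The y-axis of the figure therefore supports comparing where the conditions peak, not how much probability mass they carry.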
