# RL Pipeline Notebook
This notebook runs the analysis pipeline using the translated Python modules.

## New Output Structure
All processed data and results are now saved to the `python_output` directory, organized as follows:
- `trial_data/`: Processed trial data at various stages
- `estimation_data/`: Processed estimation data
- `analysis_results/`: Statistical analysis results and plots
- `reports/`: Summary reports and metadata in JSON format

Each module will save its outputs to these directories, making it easier to inspect the results at each stage of the pipeline.
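
As a rough illustration of this layout, the sketch below creates the four subdirectories under a base folder. The helper name `make_output_dirs` and the constant `SUBDIRS` are hypothetical and are not taken from the pipeline modules; only the directory names come from the list above.

```python
from pathlib import Path

# Subdirectory names taken from the list above.
# SUBDIRS and make_output_dirs are hypothetical names used for illustration.
SUBDIRS = ("trial_data", "estimation_data", "analysis_results", "reports")


def make_output_dirs(base):
    """Create base/<subdir> for each pipeline stage and return the paths."""
    base = Path(base)
    paths = {}
    for name in SUBDIRS:
        path = base / name
        # parents/exist_ok make the call safe to repeat on every run
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths
```

Calling `make_output_dirs("python_output")` would return a dictionary of `Path` objects keyed by subdirectory name, which each module could use when saving its outputs.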

In [38]:
from importlib import import_module
setup = import_module('00_setup')
print('LOCAL_DATA_DIR:', setup.LOCAL_DATA_DIR)
print('STRESS_DATA_DIR:', setup.STRESS_DATA_DIR)


LOCAL_DATA_DIR: /Users/edeneldar/Documents/RL/RL_Maggie/Data
STRESS_DATA_DIR: /Users/edeneldar/Documents/RL/RL_Maggie/data_healthy_Noa
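
The cell above uses `import_module` because a module name that starts with a digit, such as `00_setup`, is not a valid Python identifier and cannot be loaded with a plain `import` statement. The sketch below shows the same pattern on a throwaway module; the module name `00_demo_setup` and the helper `load_numbered_module` are hypothetical.

```python
import sys
import tempfile
from importlib import import_module
from pathlib import Path


def load_numbered_module(directory, name):
    """Import a module whose file name starts with a digit."""
    directory = str(directory)
    if directory not in sys.path:
        # The module's folder must be on sys.path for import_module to find it
        sys.path.insert(0, directory)
    return import_module(name)


# Demonstration with a temporary module; '00_demo_setup' is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "00_demo_setup.py").write_text("LOCAL_DATA_DIR = 'Data'\n")
    demo = load_numbered_module(tmp, "00_demo_setup")
    print("LOCAL_DATA_DIR:", demo.LOCAL_DATA_DIR)
```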


In [39]:
# Optional: copy raw files from shared drive if available
# copy_mod = import_module('RL_Maggie.python.01_copy_raw_files')
# copy_mod.copy_files_from_rl(setup.RAW_SHARED_DIR, setup.LOCAL_DATA_DIR)


## Trial-Level Data

In [40]:
# Import modules for trial data processing
fibro_mod = import_module('02_trial_etl_fibro')
stress_mod = import_module('03_trial_etl_stress')

# Get trial data for both groups
res_fibro = fibro_mod.get_trial_data_healthy_fibro(setup.LOCAL_DATA_DIR)
res_stress = stress_mod.get_trial_data_healthy_stress()

# Print some basic information about the results
print(f"Fibro participants: {len(res_fibro['healthy_participants'])} healthy, {len(res_fibro['fibro_participants'])} fibro")
print(f"Stress participants: {len(res_stress['participants_with_7_blocks'])} total")
print(f"Non-learners: {len(res_fibro['non_learners'])} fibro/healthy, {len(res_stress['non_learners'])} stress")


Empty file: /Users/edeneldar/Documents/RL/RL_Maggie/Data/sub_901_Reversal_2024-02-19_16h00.26.975.csv
Fibro participants: 31 healthy, 161 fibro
Stress participants: 56 total
Non-learners: 37 fibro/healthy, 46 stress


In [41]:
merge_mod = import_module('04_trial_merge_clean')
full_trial_data = merge_mod.full_trial_data
full_trial_data_learners = merge_mod.full_trial_data_learners
all_non_learners = merge_mod.all_non_learners
print('Trial rows:', len(full_trial_data))


Trial rows: 11373


In [42]:
# Load and run trial analysis
try:
    analysis_mod = import_module('05_trial_analysis')
    print("\nAnalysis completed successfully")
except Exception as e:
    print(f"\nWarning: Analysis encountered an issue: {e}")
    print("You may need to check the balance of your data or modify the analysis approach.")



Analysis completed successfully


## Estimation-Level Data

In [35]:
# Import estimation ETL module with error handling
try:
    est_etl = import_module('06_estimation_etl')
    # Access data if available
    if hasattr(est_etl, 'full_estimation_data') and hasattr(est_etl, 'full_estimation_data_clean'):
        full_estimation_data = est_etl.full_estimation_data
        full_estimation_data_clean = est_etl.full_estimation_data_clean
        print('Estimation rows:', len(full_estimation_data))
    else:
        print('Warning: Estimation data not available')
except Exception as e:
    print(f'Error loading estimation data: {e}')
    print('Creating empty estimation dataframes to continue pipeline')
    import pandas as pd
    full_estimation_data = pd.DataFrame()
    full_estimation_data_clean = pd.DataFrame()


Estimation rows: 6587


In [36]:
# Load estimation analysis with error handling
try:
    estimation_analysis_mod = import_module('07_estimation_analysis')
    print('Estimation analysis completed successfully')
except Exception as e:
    print(f'Error in estimation analysis: {e}')
    print('Try running the notebook again to verify if the fixes for estimation_etl worked')


Estimation analysis completed successfully


## Questionnaire Analysis

In [37]:
try:
    import_module('08_questionnaire_analysis')
except FileNotFoundError as e:
    print('Questionnaire file missing:', e)


ModuleNotFoundError: No module named '08_questionnaire_analysis'
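
The error above is not caught by the cell because `ModuleNotFoundError` is a subclass of `ImportError`, not of `FileNotFoundError`. A minimal sketch of a cell that handles both cases follows; the helper name `try_import` is hypothetical.

```python
from importlib import import_module


def try_import(name):
    """Import a pipeline module, returning None if it or its data is missing."""
    try:
        return import_module(name)
    except ModuleNotFoundError as e:
        # The module file itself (or one of its imports) could not be found
        print('Module missing:', e)
    except FileNotFoundError as e:
        # The module was found but a data file it reads is absent
        print('Questionnaire file missing:', e)
    return None


questionnaire_mod = try_import('08_questionnaire_analysis')
```

With this pattern the notebook keeps running and `questionnaire_mod` is `None` when the questionnaire step is unavailable.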

## Note on Fixed Issues

The following fixes were applied to the pipeline code:

1. Fixed indentation errors in `03_trial_etl_stress.py`
2. Added proper import for `STRESS_DATA_DIR` from the setup module
3. Fixed the code that identifies participants with 7 blocks in both stress and fibro modules
4. Modified the ANOVA analysis in `05_trial_analysis.py` to handle unbalanced data:
   - Added diagnostics to check data balance
   - Implemented a fallback to standard ANOVA if repeated measures ANOVA fails
   - Added better error handling and reporting
5. Enhanced error handling in `06_estimation_etl.py` to handle missing columns:
   - Added robust error handling for missing 'high_prob_image_file' and other required columns
   - Added detailed error messages to identify problematic files
   - Modified the notebook to continue pipeline execution even if estimation data processing fails
6. Fixed the boolean column access in `07_estimation_analysis.py`:
   - Changed direct boolean column access to string-based column access
   - Added proper error handling for missing data
   - Added checks to ensure sufficient data for ANOVA analysis
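
The balance diagnostic mentioned in item 4 can be illustrated with a small standard-library sketch. It assumes one row per observation and treats a design as balanced when every participant has exactly one observation in every cell. The function name `check_balance` and the column names `participant` and `block` are hypothetical; the actual diagnostics in `05_trial_analysis.py` are not shown in this notebook.

```python
from collections import Counter
from itertools import product


def check_balance(rows, subject_key, factor_keys):
    """Return {subject: [cells with a count other than 1]}.

    rows is a list of dicts, one per observation. An empty result
    means every subject has exactly one observation per cell.
    """
    # All observed levels of each within-subject factor
    levels = [sorted({r[k] for r in rows}) for k in factor_keys]
    expected_cells = list(product(*levels))
    counts = Counter(
        (r[subject_key], tuple(r[k] for k in factor_keys)) for r in rows
    )
    problems = {}
    for subject in sorted({r[subject_key] for r in rows}):
        bad = [c for c in expected_cells if counts[(subject, c)] != 1]
        if bad:
            problems[subject] = bad
    return problems


# Hypothetical data: participant 2 has no observation in block 2
rows = [
    {"participant": 1, "block": 1},
    {"participant": 1, "block": 2},
    {"participant": 2, "block": 1},
]
problems = check_balance(rows, "participant", ["block"])
print(problems)
```

A non-empty result is the kind of signal that could trigger the fallback from repeated measures ANOVA to standard ANOVA described above.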

These changes improve the robustness of the pipeline by preventing errors related to participant indexing, indentation, unbalanced data in statistical analyses, missing columns in input files, and unsafe boolean column access.
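
As an illustration of the missing-column handling, the sketch below reads only the header row of a CSV file and reports which required columns are absent. The column `high_prob_image_file` comes from the note above; the second column name and the helper `missing_columns` are hypothetical placeholders, and this is not the code used in `06_estimation_etl.py`.

```python
import csv

# 'high_prob_image_file' is named in the note above;
# 'participant' is a hypothetical placeholder for the other required columns.
REQUIRED_COLUMNS = ("high_prob_image_file", "participant")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # An empty file has no header row, so every column counts as missing
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

A caller could skip any file for which `missing_columns` returns a non-empty list and include the file name in the error message, so that problematic files are easy to identify.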

In [None]:
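# Explore the output directory structure
from importlib import import_module
from pathlib import Path

# Wrap OUTPUT_DIR in Path because the helper below relies on Path methods
setup = import_module('00_setup')
OUTPUT_DIR = Path(setup.OUTPUT_DIR)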

def print_directory_contents(directory, indent=''):
    """Print the contents of a directory in a tree-like format"""
    if not directory.exists():
        print(f"{indent}Directory does not exist: {directory}")
        return
        
    print(f"{indent}📁 {directory.name}/")
    indent += '  '
    
    # Get all files and directories
    items = list(directory.iterdir())
    
    # Sort directories first, then files
    dirs = sorted([item for item in items if item.is_dir()])
    files = sorted([item for item in items if item.is_file()])
    
    # Print directories
    for d in dirs:
        print_directory_contents(d, indent)
    
    # Print files
    for f in files:
        file_size = f.stat().st_size / 1024  # Size in KB
        print(f"{indent}📄 {f.name} ({file_size:.1f} KB)")

# Print the output directory structure
print("Output Directory Structure:")
print_directory_contents(OUTPUT_DIR)

## Output Files Overview

### Trial Data Files
- `fibro_healthy_trial_data.csv`: Trial data for fibromyalgia and healthy control participants
- `stress_trial_data.csv`: Trial data for healthy stress participants
- `full_trial_data.csv`: Combined trial data from all groups
- `full_trial_data_learners.csv`: Trial data excluding non-learners

### Estimation Data Files
- `full_estimation_data.csv`: Complete estimation data from all participants
- `full_estimation_data_clean.csv`: Cleaned estimation data (excludes non-learners and low-variability participants)
- `participant_estimation_variability.csv`: Variability measures for each participant's estimation responses

### Analysis Results
- `anova_input_data.csv`: Data prepared for ANOVA analysis
- `standard_anova_results.csv`: Results from standard ANOVA analysis
- `rm_anova_results.csv`: Results from repeated measures ANOVA (when available)
- `choice_summary_by_group.csv`: Summary statistics of choice behavior by group, block, and pair type
- `estimation_summary_by_group.csv`: Summary statistics of estimation behavior by group, block, and pair type

### Reports (JSON files)
- `data_structure.json`: Information about the structure of the data (participants, blocks, etc.)
- `anova_results.json`: Results from ANOVA analyses in a structured format
- `participant_summary.json`: Summary statistics about participants in each group
- `all_non_learners.json`: List of non-learner participants

### Plots
- `choice_plot_*.png`: Plots of choice behavior for each group
- `estimation_plot_*.png`: Plots of estimation behavior for each group and pair type
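
One way to verify that the files listed above were written is sketched below. The mapping of file names to subdirectories is an assumption based on the section headings (the notebook does not state where each file is saved), and the helper `find_missing_outputs` is hypothetical. Plots are left out because their file names contain wildcards.

```python
from pathlib import Path

# Assumed mapping of subdirectory -> expected files, based on the overview above.
EXPECTED_FILES = {
    "trial_data": [
        "fibro_healthy_trial_data.csv",
        "stress_trial_data.csv",
        "full_trial_data.csv",
        "full_trial_data_learners.csv",
    ],
    "estimation_data": [
        "full_estimation_data.csv",
        "full_estimation_data_clean.csv",
        "participant_estimation_variability.csv",
    ],
    "analysis_results": [
        "anova_input_data.csv",
        "standard_anova_results.csv",
        "rm_anova_results.csv",
        "choice_summary_by_group.csv",
        "estimation_summary_by_group.csv",
    ],
    "reports": [
        "data_structure.json",
        "anova_results.json",
        "participant_summary.json",
        "all_non_learners.json",
    ],
}


def find_missing_outputs(output_dir, expected=EXPECTED_FILES):
    """Return 'subdir/file' strings for expected files that do not exist."""
    output_dir = Path(output_dir)
    return [
        f"{subdir}/{name}"
        for subdir, names in expected.items()
        for name in names
        if not (output_dir / subdir / name).is_file()
    ]
```

Note that `rm_anova_results.csv` is only produced when repeated measures ANOVA is available, so its absence is not necessarily an error.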

## Generating an Output Report

After running the pipeline, you can generate a detailed report of all output files using the `output_report.py` script:

```python
import importlib
report_mod = importlib.import_module('output_report')
```

Or from the command line:

```bash
cd /Users/edeneldar/Documents/RL
python RL_Maggie/python/output_report.py
```

This will:
1. Scan all files in the `python_output` directory
2. Generate a detailed JSON report of each file (size, content, modification time)
3. Save the report to `python_output/reports/output_report.json`
4. Print a summary of all files to the console
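
The steps above can be sketched roughly as follows. This is not the content of `output_report.py`, which is not shown in this notebook; the function names and the report fields (`file`, `size_kb`, `modified`) are assumptions chosen for illustration.

```python
import json
from datetime import datetime
from pathlib import Path


def build_output_report(output_dir):
    """Collect the size and modification time of every file under output_dir."""
    output_dir = Path(output_dir)
    entries = []
    for path in sorted(output_dir.rglob("*")):
        if not path.is_file():
            continue
        stat = path.stat()
        entries.append({
            "file": path.relative_to(output_dir).as_posix(),
            "size_kb": round(stat.st_size / 1024, 1),
            "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(
                timespec="seconds"
            ),
        })
    return {"n_files": len(entries), "files": entries}


def save_output_report(output_dir, report_name="output_report.json"):
    """Write the report to <output_dir>/reports/<report_name> and return its path."""
    report = build_output_report(output_dir)
    reports_dir = Path(output_dir) / "reports"
    reports_dir.mkdir(parents=True, exist_ok=True)
    target = reports_dir / report_name
    target.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return target
```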

This report is useful for:
- Verifying that all expected output files were created
- Checking file sizes and row counts
- Finding the most recently modified files
- Documenting the analysis outputs