# PE4MOVE Data Preparation Pipeline

This notebook prepares the PE4MOVE dataset for machine learning analysis by:
1. Loading and exploring the raw data
2. Identifying intervention and control groups
3. Filtering participants with complete T1 (follow-up) data
4. Creating derived variables (motivation, self-monitoring, MVPA improvement)
5. Cleaning and selecting relevant attributes
6. Exporting separate CSV files for intervention and control groups

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")

## 2. Load Dataset

Load the PE4MOVE dataset and display basic information about its structure.

In [2]:
# Load the dataset
df = pd.read_csv('data/PE4MOVE_6MWT.csv')

print(f"Dataset shape: {df.shape}")
print(f"Participants: {len(df):,}")
print(f"Variables: {df.shape[1]}")
print(f"\nFirst few columns: {df.columns[:10].tolist()}")

Dataset shape: (3193, 381)
Participants: 3,193
Variables: 381

First few columns: ['Age', 'Sex', 'MVPA_Frequency_T0', 'MVPA_d0', 'MVPA_d1', 'MVPA_d2', 'MVPA_d3', 'MVPA_d4', 'MVPA_d5', 'MVPA_d6']


## 3. Identify Intervention and Control Groups

The dataset contains a `Group_Final` variable that indicates whether each participant was in Group A or Group B. We need to determine which group received the intervention by examining changes in MVPA (Moderate-to-Vigorous Physical Activity) frequency from T0 (baseline) to T1 (follow-up).

In [3]:
# Check group distribution
print("Group Distribution:")
print(df['Group_Final'].value_counts())
print(f"\nMissing group assignments: {df['Group_Final'].isna().sum()}")

# Compare MVPA changes between groups to identify intervention group
print("\n" + "="*70)
print("DETERMINING INTERVENTION vs CONTROL GROUP")
print("="*70)

for group in ['A', 'B']:
    group_data = df[df['Group_Final'] == group]
    
    # Use only paired data (participants with both T0 and T1)
    paired_mask = group_data['MVPA_Frequency_T0'].notna() & group_data['MVPA_Frequency_T1'].notna()
    paired_data = group_data[paired_mask]
    
    t0_mean = paired_data['MVPA_Frequency_T0'].mean()
    t1_mean = paired_data['MVPA_Frequency_T1'].mean()
    change = t1_mean - t0_mean
    change_pct = (change / t0_mean * 100) if t0_mean > 0 else 0
    
    print(f"\nGroup {group} (n={paired_mask.sum()} paired):")
    print(f"  MVPA_Frequency_T0: {t0_mean:.2f}")
    print(f"  MVPA_Frequency_T1: {t1_mean:.2f}")
    print(f"  Change: {change:+.2f} ({change_pct:+.1f}%)")

print("\n" + "="*70)
print("CONCLUSION: Both groups increase; Group A shows the larger gain → treated as INTERVENTION GROUP")
print("            Group B shows the smaller gain → treated as CONTROL GROUP")
print("="*70)

Group Distribution:
Group_Final
A    2095
B    1098
Name: count, dtype: int64

Missing group assignments: 0

DETERMINING INTERVENTION vs CONTROL GROUP

Group A (n=1007 paired):
  MVPA_Frequency_T0: 3.20
  MVPA_Frequency_T1: 3.49
  Change: +0.29 (+8.9%)

Group B (n=763 paired):
  MVPA_Frequency_T0: 3.05
  MVPA_Frequency_T1: 3.26
  Change: +0.21 (+6.9%)

CONCLUSION: Both groups increase; Group A shows the larger gain → treated as INTERVENTION GROUP
            Group B shows the smaller gain → treated as CONTROL GROUP
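
The paired comparison above can also be written without the explicit loop. The sketch below is illustrative only: `paired_mvpa_change` is a hypothetical helper name, and the `toy` frame holds made-up values rather than PE4MOVE data. It keeps rows that have both time points and then aggregates per group.

```python
import numpy as np
import pandas as pd

def paired_mvpa_change(df, group_col="Group_Final",
                       t0="MVPA_Frequency_T0", t1="MVPA_Frequency_T1"):
    """Per-group mean T0, mean T1 and mean change, using paired rows only."""
    # Keep participants that have both a baseline and a follow-up value
    paired = df.dropna(subset=[t0, t1])
    summary = paired.groupby(group_col).agg(
        n_paired=(t0, "count"),
        t0_mean=(t0, "mean"),
        t1_mean=(t1, "mean"),
    )
    summary["change"] = summary["t1_mean"] - summary["t0_mean"]
    return summary

# Made-up values for illustration (not PE4MOVE data)
toy = pd.DataFrame({
    "Group_Final": ["A", "A", "A", "B", "B", "B"],
    "MVPA_Frequency_T0": [3.0, 4.0, 2.0, 3.0, 5.0, np.nan],
    "MVPA_Frequency_T1": [4.0, np.nan, 3.0, 3.0, 4.0, 2.0],
})
summary = paired_mvpa_change(toy)
print(summary)
```

Because both groups can move in the same direction, a label inferred from mean change alone is best confirmed against the study's allocation records.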


## 4. Split Dataset into Intervention and Control Groups

Based on the analysis above, we split the dataset into:
- **Intervention group** (Group A): Received the PE4MOVE program
- **Control group** (Group B): Did not receive the intervention

In [4]:
# Split into intervention and control groups
df_intervention = df[df['Group_Final'] == 'A'].copy()
df_control = df[df['Group_Final'] == 'B'].copy()

print(f"Intervention group (Group A): {len(df_intervention)} participants")
print(f"Control group (Group B): {len(df_control)} participants")

Intervention group (Group A): 2095 participants
Control group (Group B): 1098 participants


## 5. Filter Participants with Complete T1 Data

For our analysis, we only include participants who have complete follow-up (T1) data for MVPA frequency. This ensures we can measure the outcome of interest.

In [5]:
# Filter intervention group for complete T1 data
print("INTERVENTION GROUP:")
print(f"  Original: {len(df_intervention)} participants")
print(f"  Missing MVPA_Frequency_T1: {df_intervention['MVPA_Frequency_T1'].isna().sum()}")

df_intervention_clean = df_intervention[df_intervention['MVPA_Frequency_T1'].notna()].copy()
print(f"  After filtering: {len(df_intervention_clean)} participants")
print(f"  Retention rate: {len(df_intervention_clean)/len(df_intervention)*100:.1f}%")

# Filter control group for complete T1 data
print("\nCONTROL GROUP:")
print(f"  Original: {len(df_control)} participants")
print(f"  Missing MVPA_Frequency_T1: {df_control['MVPA_Frequency_T1'].isna().sum()}")

df_control_clean = df_control[df_control['MVPA_Frequency_T1'].notna()].copy()
print(f"  After filtering: {len(df_control_clean)} participants")
print(f"  Retention rate: {len(df_control_clean)/len(df_control)*100:.1f}%")

INTERVENTION GROUP:
  Original: 2095 participants
  Missing MVPA_Frequency_T1: 1088
  After filtering: 1007 participants
  Retention rate: 48.1%

CONTROL GROUP:
  Original: 1098 participants
  Missing MVPA_Frequency_T1: 335
  After filtering: 763 participants
  Retention rate: 69.5%


### 5.1 Remove "Prefer not to say" Values

Before computing any derived scores, we replace the numeric codes for "prefer not to say" with NaN in the affected columns. The code differs by column (3, 6, 8, or 11), and replacing it ensures that these responses are excluded from all later calculations.

In [6]:
def clean_prefer_not_to_say_values(df):
    """
    Replace 'prefer not to say' values with NaN for all relevant columns.
    Different columns use different numeric codes for this response.
    """
    
    # Define the mapping of columns to their "prefer not to say" values
    prefer_not_to_say_mapping = {
        # Value = 3
        'Sex': 3,
        
        # Value = 6
        'Leisure_Exercise_T0': 6,
        'Leisure_Exercise_T1': 6,
        'YAP_sedentary_general_T0': 6,
        'YAP_sedentary_general_T1': 6,
        'Leisure_PA_T0': 6,
        'Leisure_PA_T1': 6,
        'PE_hours_T0': 6,
        'PE_hours_T1': 6,
        'Extracurricular_Session_Coach_T0': 6,
        'Extracurricular_Session_Coach_T1': 6,
        'Extracurricular_Session_School_T0': 6,
        'Extracurricular_Session_School_T1': 6,
        
        # Value = 8
        'MVPA_Frequency_T0': 8,
        'MVPA_Frequency_T1': 8,
        'MVPA_Usual_Week_T0': 8,
        'MVPA_Usual_Week_T1': 8,
        
        # Value = 11
        'COVID_impact_T0': 11,
        'COVID_impact_T1': 11,
    }
    
    # Add all Self_Monitoring columns (value = 6)
    for i in range(1, 5):
        prefer_not_to_say_mapping[f'Self_Monitoring_{i}_T0'] = 6
        prefer_not_to_say_mapping[f'Self_Monitoring_{i}_T1'] = 6
    
    # Add all Motivation-related columns (value = 6)
    motiv_types = ['Instrinsic', 'Identified', 'Extrinsic', 'Introjected']
    for motiv_type in motiv_types:
        for i in range(1, 5):
            prefer_not_to_say_mapping[f'Motiv_{motiv_type}_{i}_T0'] = 6
            prefer_not_to_say_mapping[f'Motiv_{motiv_type}_{i}_T1'] = 6
    
    # Add all Amotivation columns (value = 6)
    for i in range(1, 5):
        prefer_not_to_say_mapping[f'Amotivation_{i}_T0'] = 6
        prefer_not_to_say_mapping[f'Amotivation_{i}_T1'] = 6
    
    # Replace the values
    total_replaced = 0
    replacements_by_value = {}
    
    for col, pref_value in prefer_not_to_say_mapping.items():
        if col in df.columns:
            count = (df[col] == pref_value).sum()
            if count > 0:
                df[col] = df[col].replace(pref_value, np.nan)
                total_replaced += count
                if pref_value not in replacements_by_value:
                    replacements_by_value[pref_value] = 0
                replacements_by_value[pref_value] += count
    
    # Print summary
    print(f"Cleaned 'prefer not to say' values:")
    for value, count in sorted(replacements_by_value.items()):
        print(f"  Value {value}: {count} occurrences replaced with NaN")
    print(f"  Total: {total_replaced} values replaced with NaN\n")
    
    return df

# Clean both datasets
print("INTERVENTION GROUP:")
df_intervention_clean = clean_prefer_not_to_say_values(df_intervention_clean)

print("CONTROL GROUP:")
df_control_clean = clean_prefer_not_to_say_values(df_control_clean)


INTERVENTION GROUP:
Cleaned 'prefer not to say' values:
  Value 6: 4112 occurrences replaced with NaN
  Value 8: 65 occurrences replaced with NaN
  Value 11: 121 occurrences replaced with NaN
  Total: 4298 values replaced with NaN

CONTROL GROUP:
Cleaned 'prefer not to say' values:
  Value 6: 2479 occurrences replaced with NaN
  Value 8: 48 occurrences replaced with NaN
  Value 11: 67 occurrences replaced with NaN
  Total: 2594 values replaced with NaN
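
The function above modifies the frame it receives. As a hedged alternative, the sketch below returns a cleaned copy and a per-column count instead. `replace_sentinels` is a hypothetical name, and the `toy` frame is made up; only the column names and codes are taken from the mapping above.

```python
import numpy as np
import pandas as pd

def replace_sentinels(df, sentinel_by_column):
    """Return (copy of df with each column's sentinel set to NaN, counts per column)."""
    out = df.copy()
    counts = {}
    for col, code in sentinel_by_column.items():
        if col not in out.columns:
            continue  # tolerate columns that are absent from this extract
        hits = out[col] == code
        counts[col] = int(hits.sum())
        out[col] = out[col].mask(hits)  # NaN where the sentinel code was found
    return out, counts

# Made-up values: 8 and 11 are the "prefer not to say" codes for these columns
toy = pd.DataFrame({
    "MVPA_Frequency_T0": [3, 8, 5],
    "COVID_impact_T0": [11, 2, 6],
})
cleaned, counts = replace_sentinels(
    toy, {"MVPA_Frequency_T0": 8, "COVID_impact_T0": 11, "Not_A_Column": 6}
)
print(counts)
```

Note that the value 6 in `COVID_impact_T0` is left alone, since that column's code is 11; this is why the mapping is kept per column rather than as one global list of codes.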



## 6. Create Derived Variables

### 6.1 Motivation Scores

We create overall motivation scores based on Self-Determination Theory:
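- **Formula**: `((Intrinsic + Identified) / 2) - ((Extrinsic + Introjected + Amotivation) / 3)`, where each term is the sum of that subscale's four items
- A participant with any missing item (including a cleaned "prefer not to say" response) receives a NaN score, because NaN propagates through the sums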
- Higher scores indicate more autonomous (self-determined) motivation
- Created for both T0 (baseline) and T1 (follow-up)

In [7]:
def create_motivation_scores(df):
    """Create overall motivation scores from individual components."""
    
    # Motivation_T0
    df['Motivation_T0'] = (
        (df['Motiv_Instrinsic_1_T0'] + df['Motiv_Instrinsic_2_T0'] + 
         df['Motiv_Instrinsic_3_T0'] + df['Motiv_Instrinsic_4_T0'] +
         df['Motiv_Identified_1_T0'] + df['Motiv_Identified_2_T0'] + 
         df['Motiv_Identified_3_T0'] + df['Motiv_Identified_4_T0']) / 2 -
        (df['Motiv_Extrinsic_1_T0'] + df['Motiv_Extrinsic_2_T0'] + 
         df['Motiv_Extrinsic_3_T0'] + df['Motiv_Extrinsic_4_T0'] +
         df['Motiv_Introjected_1_T0'] + df['Motiv_Introjected_2_T0'] + 
         df['Motiv_Introjected_3_T0'] + df['Motiv_Introjected_4_T0'] +
         df['Amotivation_1_T0'] + df['Amotivation_2_T0'] + 
         df['Amotivation_3_T0'] + df['Amotivation_4_T0']) / 3
    )
    
    # Motivation_T1
    df['Motivation_T1'] = (
        (df['Motiv_Instrinsic_1_T1'] + df['Motiv_Instrinsic_2_T1'] + 
         df['Motiv_Instrinsic_3_T1'] + df['Motiv_Instrinsic_4_T1'] +
         df['Motiv_Identified_1_T1'] + df['Motiv_Identified_2_T1'] + 
         df['Motiv_Identified_3_T1'] + df['Motiv_Identified_4_T1']) / 2 -
        (df['Motiv_Extrinsic_1_T1'] + df['Motiv_Extrinsic_2_T1'] + 
         df['Motiv_Extrinsic_3_T1'] + df['Motiv_Extrinsic_4_T1'] +
         df['Motiv_Introjected_1_T1'] + df['Motiv_Introjected_2_T1'] + 
         df['Motiv_Introjected_3_T1'] + df['Motiv_Introjected_4_T1'] +
         df['Amotivation_1_T1'] + df['Amotivation_2_T1'] + 
         df['Amotivation_3_T1'] + df['Amotivation_4_T1']) / 3
    )
    
    print(f"Created Motivation_T0: mean={df['Motivation_T0'].mean():.2f}, std={df['Motivation_T0'].std():.2f}")
    print(f"Created Motivation_T1: mean={df['Motivation_T1'].mean():.2f}, std={df['Motivation_T1'].std():.2f}")
    
    return df

# Create motivation scores for both groups
print("INTERVENTION GROUP:")
df_intervention_clean = create_motivation_scores(df_intervention_clean)

print("\nCONTROL GROUP:")
df_control_clean = create_motivation_scores(df_control_clean)

INTERVENTION GROUP:
Created Motivation_T0: mean=7.08, std=4.60
Created Motivation_T1: mean=6.18, std=5.20

CONTROL GROUP:
Created Motivation_T0: mean=6.80, std=4.61
Created Motivation_T1: mean=6.19, std=5.33
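
The two near-identical expressions in `create_motivation_scores` could be generated from the column-name pattern. The following is a sketch under that assumption, not the notebook's own implementation: `item_columns` and `motivation_score` are hypothetical helpers, and the `toy` frame is made up. `skipna=False` is used so that a missing item yields NaN, matching the behavior of adding the columns with `+`.

```python
import numpy as np
import pandas as pd

# Prefixes follow the dataset's own spelling ("Instrinsic")
AUTONOMOUS = ["Motiv_Instrinsic", "Motiv_Identified"]
CONTROLLED = ["Motiv_Extrinsic", "Motiv_Introjected", "Amotivation"]

def item_columns(prefixes, wave, n_items=4):
    """Build names such as 'Motiv_Identified_3_T0' for each prefix and item."""
    return [f"{p}_{i}_{wave}" for p in prefixes for i in range(1, n_items + 1)]

def motivation_score(df, wave):
    """(sum of autonomous items) / 2 - (sum of controlled items) / 3."""
    autonomous = df[item_columns(AUTONOMOUS, wave)].sum(axis=1, skipna=False)
    controlled = df[item_columns(CONTROLLED, wave)].sum(axis=1, skipna=False)
    return autonomous / 2 - controlled / 3

# Made-up responses: row 0 is complete, row 1 has one missing item
cols = item_columns(AUTONOMOUS + CONTROLLED, "T0")
toy = pd.DataFrame([[5] * 8 + [2] * 12, [4] * 20], columns=cols, dtype=float)
toy.loc[1, "Amotivation_4_T0"] = np.nan

scores = motivation_score(toy, "T0")
print(scores)  # row 0: 40/2 - 24/3 = 12.0, row 1: NaN
```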


### 6.2 Self-Monitoring Scores

We create overall self-monitoring scores by averaging the 4 individual self-monitoring items for both T0 and T1.

In [8]:
def create_self_monitoring_scores(df):
    """Create overall self-monitoring scores from individual items."""
    
    # Self_Monitoring_T0: average of the 4 T0 items
    df['Self_Monitoring_T0'] = (
        df['Self_Monitoring_1_T0'] + df['Self_Monitoring_2_T0'] + 
        df['Self_Monitoring_3_T0'] + df['Self_Monitoring_4_T0']
    ) / 4
    
    # Self_Monitoring_T1: average of the 4 T1 items
    df['Self_Monitoring_T1'] = (
        df['Self_Monitoring_1_T1'] + df['Self_Monitoring_2_T1'] + 
        df['Self_Monitoring_3_T1'] + df['Self_Monitoring_4_T1']
    ) / 4
    
    print(f"Created Self_Monitoring_T0: mean={df['Self_Monitoring_T0'].mean():.2f}, std={df['Self_Monitoring_T0'].std():.2f}")
    print(f"Created Self_Monitoring_T1: mean={df['Self_Monitoring_T1'].mean():.2f}, std={df['Self_Monitoring_T1'].std():.2f}")
    
    return df

# Create self-monitoring scores for both groups
print("INTERVENTION GROUP:")
df_intervention_clean = create_self_monitoring_scores(df_intervention_clean)

print("\nCONTROL GROUP:")
df_control_clean = create_self_monitoring_scores(df_control_clean)

INTERVENTION GROUP:
Created Self_Monitoring_T0: mean=3.29, std=1.21
Created Self_Monitoring_T1: mean=3.35, std=1.17

CONTROL GROUP:
Created Self_Monitoring_T0: mean=3.17, std=1.23
Created Self_Monitoring_T1: mean=3.30, std=1.21
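
One detail worth making explicit is how missing items are treated. The sketch below (hypothetical helper `self_monitoring_score`, made-up values) contrasts a strict average, which is NaN when any item is missing and so matches the sum-then-divide code above, with pandas' default row mean, which silently averages over whichever items are present.

```python
import numpy as np
import pandas as pd

def self_monitoring_score(df, wave, n_items=4):
    """Strict mean of the items: NaN if any item is missing."""
    cols = [f"Self_Monitoring_{i}_{wave}" for i in range(1, n_items + 1)]
    return df[cols].mean(axis=1, skipna=False)

# Made-up responses: the second participant skipped item 2
toy = pd.DataFrame({
    "Self_Monitoring_1_T0": [1.0, 2.0],
    "Self_Monitoring_2_T0": [2.0, np.nan],
    "Self_Monitoring_3_T0": [3.0, 4.0],
    "Self_Monitoring_4_T0": [4.0, 5.0],
})

strict = self_monitoring_score(toy, "T0")  # NaN for the second participant
lenient = toy.mean(axis=1)                 # averages the 3 available items
print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```

Which variant is appropriate depends on the scoring rules of the questionnaire; the strict form is the one consistent with the rest of this notebook.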


### 6.3 MVPA_Improvement Scores

Now we calculate MVPA improvement scores by finding the difference in MVPA frequency from T0 to T1. This calculation happens AFTER cleaning "prefer not to say" values, so those responses are properly excluded.

In [9]:
def calculate_mvpa_improvement(df):
    """
    Calculate MVPA_Improvement as the change from T0 to T1.
    
    MVPA_Improvement = MVPA_Frequency_T1 - MVPA_Frequency_T0
    
    Positive values indicate improvement (increased MVPA frequency)
    Negative values indicate decline (decreased MVPA frequency)
    
    Note: Value 8 ("prefer not to say") has been replaced with NaN before this calculation.
    """
    
    df['MVPA_Improvement'] = df['MVPA_Frequency_T1'] - df['MVPA_Frequency_T0']
    
    # Summary statistics (excluding NaN values)
    improvement = df['MVPA_Improvement']
    valid_improvements = improvement.dropna()
    n_improved = (valid_improvements > 0).sum()
    n_declined = (valid_improvements < 0).sum()
    n_unchanged = (valid_improvements == 0).sum()
    n_missing = improvement.isna().sum()
    
    print(f"MVPA_Improvement statistics:")
    print(f"  Valid calculations: {len(valid_improvements)} participants")
    print(f"  Missing (due to NaN in T0 or T1): {n_missing} participants")
    print(f"  Mean change: {improvement.mean():.2f}")
    print(f"  Std deviation: {improvement.std():.2f}")
    print(f"  Min: {improvement.min():.2f}")
    print(f"  Max: {improvement.max():.2f}")
    print(f"\n  Participants improved: {n_improved} ({n_improved/len(valid_improvements)*100:.1f}%)")
    print(f"  Participants declined: {n_declined} ({n_declined/len(valid_improvements)*100:.1f}%)")
    print(f"  Participants unchanged: {n_unchanged} ({n_unchanged/len(valid_improvements)*100:.1f}%)")
    
    return df

# Calculate MVPA_Improvement for intervention group (AFTER cleaning value 8)
print("INTERVENTION GROUP:")
df_intervention_clean = calculate_mvpa_improvement(df_intervention_clean)

# Calculate MVPA_Improvement for control group (AFTER cleaning value 8)
print("\nCONTROL GROUP:")
df_control_clean = calculate_mvpa_improvement(df_control_clean)


INTERVENTION GROUP:
MVPA_Improvement statistics:
  Valid calculations: 971 participants
  Missing (due to NaN in T0 or T1): 36 participants
  Mean change: 0.31
  Std deviation: 1.71
  Min: -7.00
  Max: 7.00

  Participants improved: 407 (41.9%)
  Participants declined: 263 (27.1%)
  Participants unchanged: 301 (31.0%)

CONTROL GROUP:
MVPA_Improvement statistics:
  Valid calculations: 741 participants
  Missing (due to NaN in T0 or T1): 22 participants
  Mean change: 0.25
  Std deviation: 1.80
  Min: -7.00
  Max: 7.00

  Participants improved: 302 (40.8%)
  Participants declined: 215 (29.0%)
  Participants unchanged: 224 (30.2%)
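
If the improved / declined / unchanged split is needed later as a column rather than as printed counts, it can be derived from the sign of the change. This is a minimal sketch with a hypothetical helper name (`change_direction`) and made-up values; missing changes stay missing.

```python
import numpy as np
import pandas as pd

def change_direction(improvement):
    """Label each change as improved / declined / unchanged; NaN stays NaN."""
    labels = {1.0: "improved", -1.0: "declined", 0.0: "unchanged"}
    return np.sign(improvement).map(labels)

# Made-up changes (T1 minus T0), including one missing value
toy_change = pd.Series([2.0, -1.0, 0.0, np.nan, 3.0])
direction = change_direction(toy_change)

# Shares are computed over valid values only, as in the summary above
shares = direction.value_counts(normalize=True)
print(shares)
```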


## 7. Select Final Variables

Now we select only the 38 variables needed for ML analysis:
- Demographics and group label (Age, Sex, Group_Final)
- Physical activity measures (MVPA frequency, leisure activities, usual week)
- Sedentary behavior (YAP_sedentary_general for T0 and T1)
- Anthropometric data (weight and height at T0 and T1)
- Fitness tests (6-minute walk, standing long jump, handgrip strength)
- School-related PA (PE hours, extracurricular sessions)
- COVID impact
- **Derived aggregate scores** (Motivation_T0, Motivation_T1, Self_Monitoring_T0, Self_Monitoring_T1)
- Outcome measure (MVPA_Improvement)

In [10]:
def clean_dataset(df):
    """Select only the required columns for final analysis and re-check 'prefer not to say' codes."""
    
    # Define the exact columns we want to keep (38 columns total)
    required_columns = [
        'Age', 'Sex', 'MVPA_Frequency_T0', 'Leisure_Exercise_T0',
        'YAP_sedentary_general_T0', 'Leisure_PA_T0', 'MVPA_Usual_Week_T0', 'Group_Final',
        'Weight_kg_T0', 'Weight_kg_T1', 'Height_cm_T0', 'Height_cm_T1',
        'MVPA_Frequency_T1', 'MVPA_Usual_Week_T1', 'Leisure_Exercise_T1',
        'PE_hours_T0', 'PE_hours_T1',
        'Extracurricular_Session_Coach_T0', 'Extracurricular_Session_Coach_T1',
        'Extracurricular_Session_School_T0', 'Extracurricular_Session_School_T1',
        'Leisure_PA_T1', 'YAP_sedentary_general_T1',
        'COVID_impact_T0', 'COVID_impact_T1',
        'SixMW_T0', 'SixMW_T1', 'SLJ_T0', 'SLJ_T1',
        'HG_Right_T0', 'HG_Left_T0', 'HG_Right_T1', 'HG_Left_T1','MVPA_Improvement', 
        'Motivation_T0', 'Motivation_T1',
        'Self_Monitoring_T0', 'Self_Monitoring_T1'
    ]
    
    # Check which required columns exist in the dataframe
    available_columns = [col for col in required_columns if col in df.columns]
    missing_columns = [col for col in required_columns if col not in df.columns]
    
    if missing_columns:
        print(f"⚠️  Warning: {len(missing_columns)} required columns not found in dataset:")
        for col in missing_columns:
            print(f"     - {col}")
    
    # Select only available columns
    df_cleaned = df[available_columns].copy()
    
    # Safety pass: replace any remaining value 8 with NaN in these columns.
    # Section 5.1 already cleaned the per-column codes (8 for the MVPA_* columns,
    # 6 for the others), so this loop is normally a no-op.
    score_columns = [
        'MVPA_Frequency_T0', 'MVPA_Frequency_T1', 
        'Leisure_Exercise_T0', 'Leisure_Exercise_T1',
        'MVPA_Usual_Week_T0', 'MVPA_Usual_Week_T1',
        'Leisure_PA_T0', 'Leisure_PA_T1',
        'YAP_sedentary_general_T0', 'YAP_sedentary_general_T1'
    ]
    
    replaced_count = 0
    for col in score_columns:
        if col in df_cleaned.columns:
            count_8s = (df_cleaned[col] == 8).sum()
            if count_8s > 0:
                df_cleaned[col] = df_cleaned[col].replace(8, np.nan)
                replaced_count += count_8s
                print(f"  Replaced {count_8s} value(s) of 8 with NaN in {col}")
    
    if replaced_count > 0:
        print(f"  Total: {replaced_count} 'prefer not to say' values (8) replaced with NaN")
    
    print(f"\nDataset shape: {df.shape} → {df_cleaned.shape}")
    print(f"Selected {len(available_columns)} columns out of {len(required_columns)} required")
    
    return df_cleaned

# Clean both datasets
print("INTERVENTION GROUP:")
df_intervention_final = clean_dataset(df_intervention_clean)

print("\nCONTROL GROUP:")
df_control_final = clean_dataset(df_control_clean)

INTERVENTION GROUP:

Dataset shape: (1007, 386) → (1007, 38)
Selected 38 columns out of 38 required

CONTROL GROUP:

Dataset shape: (763, 386) → (763, 38)
Selected 38 columns out of 38 required
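
A small guard can keep the documented column count and the selection list in sync, and can fail loudly instead of warning when a column is absent. The sketch below is an assumption about how such a check might look; `select_columns` and its `expected_count` parameter are hypothetical, and the `toy` frame is made up.

```python
import pandas as pd

def select_columns(df, required, expected_count=None):
    """Return a copy with exactly the required columns, or raise."""
    if expected_count is not None and len(required) != expected_count:
        raise ValueError(
            f"Expected {expected_count} columns, but the list has {len(required)}"
        )
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return df[list(required)].copy()

toy = pd.DataFrame({"Age": [12], "Sex": [1], "Extra": [0]})

# Count and columns agree: selection succeeds
selected = select_columns(toy, ["Age", "Sex"], expected_count=2)

# Documented count disagrees with the list: the mismatch is reported
count_error = None
try:
    select_columns(toy, ["Age", "Sex"], expected_count=3)
except ValueError as exc:
    count_error = str(exc)
print(count_error)
```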


## 8. Final Dataset Summary

Review the final cleaned datasets before export.

In [11]:
print("="*70)
print("FINAL DATASET SUMMARY")
print("="*70)

print("\nINTERVENTION GROUP:")
print(f"  Participants: {len(df_intervention_final)}")
print(f"  Variables: {df_intervention_final.shape[1]}")
print(f"\n  Key derived variables (aggregated):")
print(f"    ✓ Motivation_T0 (from motivation components)")
print(f"    ✓ Motivation_T1 (from motivation components)")
print(f"    ✓ Self_Monitoring_T0 (average of 4 items)")
print(f"    ✓ Self_Monitoring_T1 (average of 4 items)")
print(f"\n  Sample of other variables:")
for var in ['Age', 'Sex', 'MVPA_Frequency_T0', 'MVPA_Frequency_T1', 
            'YAP_sedentary_general_T0', 'YAP_sedentary_general_T1',
            'BMI_T0', 'SixMW_T0', 'SixMW_T1', 'MVPA_Improvement']:
    if var in df_intervention_final.columns:
        print(f"    ✓ {var}")

print("\nCONTROL GROUP:")
print(f"  Participants: {len(df_control_final)}")
print(f"  Variables: {df_control_final.shape[1]}")
print(f"\n  Key derived variables (aggregated):")
print(f"    ✓ Motivation_T0 (from motivation components)")
print(f"    ✓ Motivation_T1 (from motivation components)")
print(f"    ✓ Self_Monitoring_T0 (average of 4 items)")
print(f"    ✓ Self_Monitoring_T1 (average of 4 items)")
print(f"\n  Sample of other variables:")
for var in ['Age', 'Sex', 'MVPA_Frequency_T0', 'MVPA_Frequency_T1', 
            'YAP_sedentary_general_T0', 'YAP_sedentary_general_T1',
            'BMI_T0', 'SixMW_T0', 'SixMW_T1', 'MVPA_Improvement']:
    if var in df_control_final.columns:
        print(f"    ✓ {var}")

FINAL DATASET SUMMARY

INTERVENTION GROUP:
  Participants: 1007
  Variables: 38

  Key derived variables (aggregated):
    ✓ Motivation_T0 (from motivation components)
    ✓ Motivation_T1 (from motivation components)
    ✓ Self_Monitoring_T0 (average of 4 items)
    ✓ Self_Monitoring_T1 (average of 4 items)

  Sample of other variables:
    ✓ Age
    ✓ Sex
    ✓ MVPA_Frequency_T0
    ✓ MVPA_Frequency_T1
    ✓ YAP_sedentary_general_T0
    ✓ YAP_sedentary_general_T1
    ✓ SixMW_T0
    ✓ SixMW_T1
    ✓ MVPA_Improvement

CONTROL GROUP:
  Participants: 763
  Variables: 38

  Key derived variables (aggregated):
    ✓ Motivation_T0 (from motivation components)
    ✓ Motivation_T1 (from motivation components)
    ✓ Self_Monitoring_T0 (average of 4 items)
    ✓ Self_Monitoring_T1 (average of 4 items)

  Sample of other variables:
    ✓ Age
    ✓ Sex
    ✓ MVPA_Frequency_T0
    ✓ MVPA_Frequency_T1
    ✓ YAP_sedentary_general_T0
    ✓ YAP_sedentary_general_T1
    ✓ SixMW_T0
    ✓ SixMW_T1
    ✓ MVPA_Improvement

## 9. Export Final Datasets

Export the cleaned datasets to separate CSV files for further analysis.

In [12]:
# Export intervention group
intervention_filename = 'data/intervention_group_clean.csv'
df_intervention_final.to_csv(intervention_filename, index=False)
print(f"✅ Intervention group exported to: {intervention_filename}")
print(f"   {len(df_intervention_final)} participants, {df_intervention_final.shape[1]} variables")

# Export control group
control_filename = 'data/control_group_clean.csv'
df_control_final.to_csv(control_filename, index=False)
print(f"\n✅ Control group exported to: {control_filename}")
print(f"   {len(df_control_final)} participants, {df_control_final.shape[1]} variables")

print("\n" + "="*70)
print("DATA PREPARATION COMPLETE!")
print("="*70)
print("\nNext steps:")
print("  1. Use intervention_group_clean.csv for intervention analysis")
print("  2. Use control_group_clean.csv for control analysis")
print("  3. Compare outcomes between groups for effectiveness evaluation")

✅ Intervention group exported to: data/intervention_group_clean.csv
   1007 participants, 38 variables

✅ Control group exported to: data/control_group_clean.csv
   763 participants, 38 variables

DATA PREPARATION COMPLETE!

Next steps:
  1. Use intervention_group_clean.csv for intervention analysis
  2. Use control_group_clean.csv for control analysis
  3. Compare outcomes between groups for effectiveness evaluation
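
After exporting, it can be worth reading the file back to confirm that shape, column order, and missing values survived the round trip. The sketch below assumes nothing about the real files: `export_and_verify` is a hypothetical helper, the frame is made up, and a temporary directory stands in for `data/`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def export_and_verify(df, path):
    """Write df to CSV, read it back, and check that the contents match."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    # dtypes may differ after a CSV round trip, so only values are compared
    pd.testing.assert_frame_equal(
        df.reset_index(drop=True), reloaded, check_dtype=False
    )
    return reloaded

# Made-up rows, including a missing outcome value
toy = pd.DataFrame({
    "Group_Final": ["A", "A"],
    "MVPA_Improvement": [1.0, np.nan],
    "Age": [12, 13],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "intervention_group_clean.csv")
    reloaded = export_and_verify(toy, path)

print(reloaded.shape)
```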
