# MoCA Data Concordance Check

This notebook compares two MoCA (Montreal Cognitive Assessment) datasets value by value to verify that the 75 verified participants have identical data in both files.

**Files being compared:**
- **File 1 (CSV):** `moca_wide_filtered.csv` - Contains 106 participants
- **File 2 (Excel):** `2026-01-28_Moca_structural_verified.xlsx` - Contains 75 participants (verified subset)

**What we're checking:**
- All 75 participants from File 2 exist in File 1
- Ages match across all timepoints (T0, T2, T4, T6, T8, T10)
- MoCA raw scores match across all timepoints
- MoCA corrected scores match across timepoints T0 to T8 (the column mapping in Section 5 defines no corrected-score pair for T10)

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

## 2. Load Data Files

In [2]:
# Define file paths
file1_path = r"C:\Users\okkam\Desktop\labo\article 2\Longitudinal_Multimodal_Data_CIMAQ\Moca\Moca_updated\moca_wide_filtered.csv"
file2_path = r"C:\Users\okkam\Desktop\labo\article 2\Longitudinal_Multimodal_Data_CIMAQ\2026-01-28_Moca_structural_verified.xlsx"

# Load File 1 (CSV)
print("Loading File 1 (CSV)...")
df_csv = pd.read_csv(file1_path)
print(f"✓ Loaded {len(df_csv)} participants from CSV file")

# Load File 2 (Excel) - the column header is on row 45 of the sheet (header=44 in 0-based indexing); data rows follow it
print("\nLoading File 2 (Excel)...")
df_excel = pd.read_excel(file2_path, header=44)
print(f"✓ Loaded {len(df_excel)} participants from Excel file")

# Display basic info
print(f"\n{'='*60}")
print("File 1 (CSV) Structure:")
print(f"  Shape: {df_csv.shape}")
print(f"  Participants: {df_csv['id_participant'].nunique()}")
print(f"\nFile 2 (Excel) Structure:")
print(f"  Shape: {df_excel.shape}")
print(f"  Participants: {df_excel['PSCID'].nunique()}")

Loading File 1 (CSV)...
✓ Loaded 106 participants from CSV file

Loading File 2 (Excel)...
✓ Loaded 75 participants from Excel file

File 1 (CSV) Structure:
  Shape: (106, 19)
  Participants: 106

File 2 (Excel) Structure:
  Shape: (75, 25)
  Participants: 75
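The `header=44` argument tells pandas which 0-based row holds the column names; rows above it are discarded and data starts on the row after it. The sketch below illustrates the same semantics on a small in-memory CSV, since `pd.read_csv` and `pd.read_excel` both accept a 0-based `header` index. The file contents are made up for illustration and only the column names mirror the Excel file.

```python
import io
import pandas as pd

# Toy file: two metadata rows, then the header row, then two data rows.
# All values are invented for illustration.
raw_text = (
    "note,illustrative only\n"
    "source,toy data\n"
    "PSCID,Age_T0\n"
    "1000001,70.5\n"
    "1000002,68.0\n"
)

# header=2 -> the third row (0-based index 2) supplies the column names.
toy = pd.read_csv(io.StringIO(raw_text), header=2)

print(list(toy.columns))
print(len(toy))
```

If the participant count printed after loading looks off by one, the `header` index is a reasonable first thing to check.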


## 3. Verify Participant Overlap

In [3]:
# Get unique participant IDs from both files
participants_csv = set(df_csv['id_participant'].values)
participants_excel = set(df_excel['PSCID'].values)

# Find overlap
common_participants = participants_csv.intersection(participants_excel)
missing_participants = participants_excel - participants_csv

print(f"{'='*60}")
print("PARTICIPANT OVERLAP ANALYSIS")
print(f"{'='*60}")
print(f"Participants in CSV:   {len(participants_csv)}")
print(f"Participants in Excel: {len(participants_excel)}")
print(f"Common participants:   {len(common_participants)}")
print(f"Missing from CSV:      {len(missing_participants)}")

if missing_participants:
    print(f"\n⚠ Warning: {len(missing_participants)} participants in Excel are missing from CSV:")
    print(sorted(missing_participants))
else:
    print("\n✓ All Excel participants found in CSV file!")

PARTICIPANT OVERLAP ANALYSIS
Participants in CSV:   106
Participants in Excel: 75
Common participants:   75
Missing from CSV:      0

✓ All Excel participants found in CSV file!
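The set comparison above only works if both files store IDs with the same type. If one file parsed IDs as integers and the other as text, the intersection would silently come out empty. A minimal sketch of a defensive normalization step follows; `normalize_ids` is a hypothetical helper, not part of the notebook, and the IDs are toy values.

```python
import pandas as pd

def normalize_ids(series: pd.Series) -> set:
    """Hypothetical helper: return IDs as a set of whitespace-stripped strings."""
    return set(series.dropna().astype(str).str.strip())

ids_a = pd.Series([101, 102, 103])            # parsed as integers
ids_b = pd.Series(["101", " 102", "103 "])    # parsed as text, with stray spaces

naive_overlap = set(ids_a) & set(ids_b)                      # types differ, nothing matches
safe_overlap = normalize_ids(ids_a) & normalize_ids(ids_b)   # compared as clean strings

print(len(naive_overlap), len(safe_overlap))
```

Note that this sketch assumes integer or string IDs; IDs read as floats would stringify as `"101.0"` and need to be cast to integers first.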


## 4. Merge Datasets for Comparison

In [4]:
# Merge the two datasets on participant ID
merged = df_csv.merge(
    df_excel, 
    left_on='id_participant', 
    right_on='PSCID', 
    how='inner',
    suffixes=('_csv', '_excel')
)

print(f"Merged dataset contains {len(merged)} participants")
print(f"\nColumn mapping:")
print(f"  CSV participant ID:   'id_participant'")
print(f"  Excel participant ID: 'PSCID'")
print(f"\nFirst few participants in merged dataset:")
print(merged[['id_participant', 'PSCID']].head())

Merged dataset contains 75 participants

Column mapping:
  CSV participant ID:   'id_participant'
  Excel participant ID: 'PSCID'

First few participants in merged dataset:
   id_participant    PSCID
0         3002498  3002498
1         3025432  3025432
2         3100205  3100205
3         3123186  3123186
4         3149469  3149469
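An inner merge silently multiplies rows when an ID appears more than once in either file, which would inflate the comparison counts. As a sketch of one possible safeguard, pandas' `validate="one_to_one"` option makes `merge` raise `pandas.errors.MergeError` in that case. The data below is a toy example, not taken from the real files.

```python
import pandas as pd

left = pd.DataFrame({"id_participant": [1, 2, 3],
                     "age_Initiale": [70.0, 65.5, 80.1]})
# Participant 2 appears twice on the right-hand side.
right = pd.DataFrame({"PSCID": [1, 2, 2],
                      "Age_T0": [70.0, 65.5, 65.5]})

try:
    left.merge(right, left_on="id_participant", right_on="PSCID",
               how="inner", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    # Raised because the merge key is not unique in `right`.
    duplicate_detected = True

print(duplicate_detected)
```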


## 5. Define Column Mappings

Map column names between the two files:
- **Ages:** `age_Initiale` → `Age_T0`, `age_suivi-2ans` → `Age_T2`, etc.
- **Raw MoCA Scores:** `moca_score_total_30_Initiale` → `Moca_raw_T0`, etc.
- **Corrected MoCA Scores:** `moca_score_total_plus_scolarite_30_Initiale` → `Moca_cor_T0`, etc.

In [5]:
# Define column pairs to compare between CSV and Excel files
column_comparisons = {
    'Age': [
        ('age_Initiale', 'Age_T0'),
        ('age_suivi-2ans', 'Age_T2'),
        ('age_suivi-4ans', 'Age_T4'),
        ('age_suivi-6ans', 'Age_T6'),
        ('age_suivi-8ans', 'Age_T8'),
        ('age_suivi-10ans', 'Age_T10'),
    ],
    'MoCA Raw Scores': [
        ('moca_score_total_30_Initiale', 'Moca_raw_T0'),
        ('moca_score_total_30_suivi-2ans', 'Moca_raw_T2'),
        ('moca_score_total_30_suivi-4ans', 'Moca_raw_T4'),
        ('moca_score_total_30_suivi-6ans', 'Moca_raw_T6'),
        ('moca_score_total_30_suivi-8ans', 'Moca_raw_T8'),
        ('moca_score_total_30_suivi-10ans', 'Moca_raw_T10'),
    ],
    'MoCA Corrected Scores': [
        ('moca_score_total_plus_scolarite_30_Initiale', 'Moca_cor_T0'),
        ('moca_score_total_plus_scolarite_30_suivi-2ans', 'Moca_cor_T2'),
        ('moca_score_total_plus_scolarite_30_suivi-4ans', 'Moca_cor_T4'),
        ('moca_score_total_plus_scolarite_30_suivi-6ans', 'Moca_cor_T6'),
        ('moca_score_total_plus_scolarite_30_suivi-8ans', 'Moca_cor_T8'),
    ],
}

print("Column mappings defined:")
for category, pairs in column_comparisons.items():
    print(f"\n{category}: {len(pairs)} timepoints")

Column mappings defined:

Age: 6 timepoints

MoCA Raw Scores: 6 timepoints

MoCA Corrected Scores: 5 timepoints
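Because the column names follow a regular pattern, the same mapping could also be generated rather than typed by hand, which makes omissions easier to spot. The sketch below is one way to do that; `VISITS` and `build_pairs` are hypothetical names introduced here, and the corrected-score T10 pair is skipped only to mirror the hand-written mapping above.

```python
# CSV visit label -> Excel timepoint suffix, following the mapping above.
VISITS = {
    "Initiale": "T0",
    "suivi-2ans": "T2",
    "suivi-4ans": "T4",
    "suivi-6ans": "T6",
    "suivi-8ans": "T8",
    "suivi-10ans": "T10",
}

def build_pairs(csv_prefix, excel_prefix, skip=()):
    """Hypothetical helper: pair '<csv_prefix>_<visit>' with '<excel_prefix>_<Tn>'."""
    return [
        (f"{csv_prefix}_{visit}", f"{excel_prefix}_{tp}")
        for visit, tp in VISITS.items()
        if tp not in skip
    ]

generated = {
    "Age": build_pairs("age", "Age"),
    "MoCA Raw Scores": build_pairs("moca_score_total_30", "Moca_raw"),
    # T10 is skipped to match the hand-written mapping, which has no such pair.
    "MoCA Corrected Scores": build_pairs(
        "moca_score_total_plus_scolarite_30", "Moca_cor", skip=("T10",)
    ),
}

print({category: len(pairs) for category, pairs in generated.items()})
```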


## 6. Perform Concordance Analysis

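For each column pair, every participant's values fall into exactly one of four categories:
- **Matches:** Both files have a value and the values agree (numeric values within a 0.01 tolerance)
- **Mismatches:** Both files have a value, but the values differ
- **Both Missing:** Neither file has a value (NaN in both)
- **One Missing:** Only one file has a value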

In [6]:
# Initialize tracking variables
total_comparisons = 0
total_matches = 0
total_mismatches = 0
total_both_missing = 0
total_one_missing = 0
mismatched_details = []

print(f"{'='*80}")
print("CONCORDANCE ANALYSIS RESULTS")
print(f"{'='*80}")

# Iterate through each category and column pair
for category, pairs in column_comparisons.items():
    print(f"\n--- {category} ---")
    
    for col_csv, col_excel in pairs:
        # Skip the pair if either column is absent from the merged data
        if col_csv not in merged.columns or col_excel not in merged.columns:
            print(f"  {col_csv:50s} vs {col_excel:20s}: COLUMN NOT FOUND")
            continue
        
        # Initialize counters for this column pair
        matches = 0
        mismatches = 0
        both_missing = 0
        one_missing = 0
        
        # Compare values for each participant
        for idx, row in merged.iterrows():
            val_csv = row[col_csv]
            val_excel = row[col_excel]
            total_comparisons += 1
            
            # Check if both values are missing (NaN)
            if pd.isna(val_csv) and pd.isna(val_excel):
                both_missing += 1
                total_both_missing += 1
            
            # Check if only one value is missing
            elif pd.isna(val_csv) or pd.isna(val_excel):
                one_missing += 1
                total_one_missing += 1
            
            # Compare numeric values with small tolerance for floating point
            # (np.number also covers NumPy scalar types such as np.int64)
            elif isinstance(val_csv, (int, float, np.number)) and isinstance(val_excel, (int, float, np.number)):
                if abs(val_csv - val_excel) < 0.01:  # Allow 0.01 tolerance
                    matches += 1
                    total_matches += 1
                else:
                    mismatches += 1
                    total_mismatches += 1
                    mismatched_details.append({
                        'participant': row['id_participant'],
                        'column_pair': f"{col_csv} vs {col_excel}",
                        'csv_value': val_csv,
                        'excel_value': val_excel,
                        'difference': abs(val_csv - val_excel)
                    })
            
            # Compare other value types
            elif val_csv == val_excel:
                matches += 1
                total_matches += 1
            else:
                mismatches += 1
                total_mismatches += 1
                mismatched_details.append({
                    'participant': row['id_participant'],
                    'column_pair': f"{col_csv} vs {col_excel}",
                    'csv_value': val_csv,
                    'excel_value': val_excel,
                    'difference': 'N/A'
                })
        
        # Determine status
        status = "OK - MATCH" if mismatches == 0 else "ERROR - MISMATCH"
        
        # Print results for this column pair
        print(f"  {col_csv:50s} vs {col_excel:20s}")
        print(f"    Matches: {matches:3d} | Mismatches: {mismatches:3d} | "
              f"Both Missing: {both_missing:3d} | One Missing: {one_missing:3d} | [{status}]")

CONCORDANCE ANALYSIS RESULTS

--- Age ---
  age_Initiale                                       vs Age_T0              
    Matches:  75 | Mismatches:   0 | Both Missing:   0 | One Missing:   0 | [OK - MATCH]
  age_suivi-2ans                                     vs Age_T2              
    Matches:  69 | Mismatches:   0 | Both Missing:   6 | One Missing:   0 | [OK - MATCH]
  age_suivi-4ans                                     vs Age_T4              
    Matches:  58 | Mismatches:   0 | Both Missing:  17 | One Missing:   0 | [OK - MATCH]
  age_suivi-6ans                                     vs Age_T6              
    Matches:  46 | Mismatches:   0 | Both Missing:  29 | One Missing:   0 | [OK - MATCH]
  age_suivi-8ans                                     vs Age_T8              
    Matches:  36 | Mismatches:   0 | Both Missing:  39 | One Missing:   0 | [OK - MATCH]
  age_suivi-10ans                                    vs Age_T10             
    Matches:   5 | Mismatches:   0 | Both Missing:  70 | One Missing:   0 | [OK - MATCH]
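The row-by-row loop above can also be expressed with column-wise pandas operations. The following is a minimal sketch, assuming both columns are numeric and share the same index (as columns of the merged table do); `classify_pair` is a hypothetical helper and the sample values are invented.

```python
import numpy as np
import pandas as pd

def classify_pair(a: pd.Series, b: pd.Series, tol: float = 0.01) -> dict:
    """Count the four outcome categories for two aligned numeric columns."""
    both_missing = a.isna() & b.isna()
    one_missing = a.isna() ^ b.isna()      # exactly one side is NaN
    both_present = a.notna() & b.notna()
    close = (a - b).abs() < tol            # False wherever either side is NaN
    return {
        "matches": int((both_present & close).sum()),
        "mismatches": int((both_present & ~close).sum()),
        "both_missing": int(both_missing.sum()),
        "one_missing": int(one_missing.sum()),
    }

csv_col = pd.Series([70.0, 65.5, np.nan, np.nan, 81.0])
excel_col = pd.Series([70.0, 66.5, np.nan, 72.0, 81.004])

result = classify_pair(csv_col, excel_col)
print(result)
```

Unlike the loop, this version does not handle non-numeric values; it is meant only to show how the four categories partition the rows.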

## 7. Summary Statistics

In [7]:
print(f"\n{'='*80}")
print("SUMMARY STATISTICS")
print(f"{'='*80}")
# Use a denominator of 1 when nothing was compared, to avoid dividing by zero
denom = max(total_comparisons, 1)
print(f"Total comparisons performed:  {total_comparisons:4d}")
print(f"Total matches:                {total_matches:4d} ({100*total_matches/denom:.1f}%)")
print(f"Total mismatches:             {total_mismatches:4d} ({100*total_mismatches/denom:.1f}%)")
print(f"Both values missing:          {total_both_missing:4d} ({100*total_both_missing/denom:.1f}%)")
print(f"One value missing:            {total_one_missing:4d} ({100*total_one_missing/denom:.1f}%)")

# Calculate concordance rate (matches / non-missing comparisons)
non_missing_comparisons = total_matches + total_mismatches
if non_missing_comparisons > 0:
    concordance_rate = 100 * total_matches / non_missing_comparisons
    print(f"\nConcordance rate (non-missing values): {concordance_rate:.1f}%")

print(f"{'='*80}")


SUMMARY STATISTICS
Total comparisons performed:  1275
Total matches:                 862 (67.6%)
Total mismatches:                0 (0.0%)
Both values missing:           413 (32.4%)
One value missing:               0 (0.0%)

Concordance rate (non-missing values): 100.0%
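The totals can be cross-checked by hand: 75 merged participants times 17 column pairs (6 age, 6 raw, 5 corrected) gives 1275 comparisons, which also confirms that no pair was skipped as "COLUMN NOT FOUND". The short check below reproduces the reported percentages from the counts printed above.

```python
n_participants = 75
n_column_pairs = 6 + 6 + 5   # age + raw + corrected pairs from the mapping

# Counts copied from the summary output above.
matches, mismatches, both_missing, one_missing = 862, 0, 413, 0

total = n_participants * n_column_pairs
assert total == matches + mismatches + both_missing + one_missing

pct_matches = round(100 * matches / total, 1)
pct_both_missing = round(100 * both_missing / total, 1)
# Concordance is computed over comparisons where both values are present.
concordance = round(100 * matches / (matches + mismatches), 1)

print(total, pct_matches, pct_both_missing, concordance)
```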


## 8. Final Verdict

In [8]:
if total_mismatches == 0:
    print("\n" + "="*80)
    print("✓✓✓ VERIFICATION SUCCESSFUL ✓✓✓")
    print("="*80)
    print("\n100% CONCORDANCE ACHIEVED!")
    print(f"\nAll {total_matches} non-missing values match perfectly between the two files.")
    print(f"The {len(merged)} participants in the Excel file have complete data integrity.")
    print("\n" + "="*80)
else:
    print("\n" + "="*80)
    print("✗✗✗ VERIFICATION FAILED ✗✗✗")
    print("="*80)
    print(f"\nDiscrepancies found: {total_mismatches} mismatches detected")
    print("\nFirst 20 mismatches:")
    for i, detail in enumerate(mismatched_details[:20], 1):
        print(f"\n  {i}. Participant {detail['participant']}")
        print(f"     Column pair: {detail['column_pair']}")
        print(f"     CSV value:   {detail['csv_value']}")
        print(f"     Excel value: {detail['excel_value']}")
        if detail['difference'] != 'N/A':
            print(f"     Difference:  {detail['difference']}")
    
    if len(mismatched_details) > 20:
        print(f"\n  ... and {len(mismatched_details) - 20} more mismatches")
    print("\n" + "="*80)


✓✓✓ VERIFICATION SUCCESSFUL ✓✓✓

100% CONCORDANCE ACHIEVED!

All 862 non-missing values match perfectly between the two files.
The 75 participants in the Excel file have complete data integrity.
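The verdict above depends only on `total_mismatches`, so a value present in one file but missing in the other would not cause a failure. That does not matter for this run, where "One value missing" is 0, but a stricter criterion may be preferable in general. The sketch below shows one possible variant; `verdict` and its `strict` flag are hypothetical and not part of the notebook.

```python
def verdict(total_mismatches: int, total_one_missing: int, strict: bool = True) -> str:
    """Hypothetical helper: return 'PASS' or 'FAIL'.

    In strict mode, a value present in only one file counts as a discrepancy.
    """
    discrepancies = total_mismatches + (total_one_missing if strict else 0)
    return "PASS" if discrepancies == 0 else "FAIL"

this_run = verdict(0, 0)                    # counts from the summary above
strict_case = verdict(0, 3)                 # invented counts: three one-sided values
lenient_case = verdict(0, 3, strict=False)  # same counts, notebook-style criterion

print(this_run, strict_case, lenient_case)
```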

