# Exploratory Data Analysis (EDA)
## Predicting AI Acceptance in Mental Health Interventions through Self-Determination Theory

This notebook creates two clean, analysis-ready datasets:
1. **CN_all_coalesced.csv** - China data with all composite scores
2. **merged.csv** - Combined USA + China data ready for hypothesis testing

### Hypotheses:
- **H1**: Main Effect - SDT predicts AI Acceptance
- **H2**: Attitudinal Moderation - AI attitudes moderate SDT → AI Acceptance relationship
- **H3**: Cross-Cultural Moderation - Effects are stronger in China than in the USA
- **H4**: Mediation by Epistemic Trust - Epistemic Trust mediates SDT → AI Acceptance

## 0. Library Import

In [125]:
from __future__ import annotations

import warnings
from pathlib import Path
from typing import List, Dict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

warnings.filterwarnings("ignore")

plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")

DATA_DIR = Path("data")
MERGED_DIR = DATA_DIR / "merged"
CHINA_DIR = DATA_DIR / "china"
USA_DIR = DATA_DIR / "usa"

MERGED_DIR.mkdir(parents=True, exist_ok=True)

## 1. Load and Prepare China Data

China has three separate files that need to be coalesced:
- CN_all.csv - Combined sample
- CN_client.csv - Client-specific items
- CN_therapist.csv - Therapist-specific items

In [144]:
cn_all = pd.read_csv(CHINA_DIR / "CN_all.csv")
cn_client = pd.read_csv(CHINA_DIR / "CN_client.csv")
cn_therapist = pd.read_csv(CHINA_DIR / "CN_therapist.csv")

print(f"CN_all: {cn_all.shape}")
print(f"CN_client: {cn_client.shape}")
print(f"CN_therapist: {cn_therapist.shape}")

CN_all: (485, 514)
CN_client: (216, 565)
CN_therapist: (269, 565)
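
Because the following steps join on `ID`, it may be worth confirming first that `ID` is unique within each file and that the client and therapist files split the combined sample between them (216 + 269 = 485). The sketch below runs on toy frames; `check_id_partition` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def check_id_partition(all_df, client_df, therapist_df, key="ID"):
    """Return simple integrity checks on the join key (hypothetical helper)."""
    all_ids = set(all_df[key])
    client_ids = set(client_df[key])
    ther_ids = set(therapist_df[key])
    return {
        "unique_in_all": all_df[key].is_unique,
        "unique_in_client": client_df[key].is_unique,
        "unique_in_therapist": therapist_df[key].is_unique,
        "no_overlap": client_ids.isdisjoint(ther_ids),
        "covers_all": (client_ids | ther_ids) == all_ids,
    }

# Toy data standing in for CN_all / CN_client / CN_therapist
toy_all = pd.DataFrame({"ID": [1, 2, 3, 4]})
toy_client = pd.DataFrame({"ID": [1, 2]})
toy_therapist = pd.DataFrame({"ID": [3, 4]})

checks = check_id_partition(toy_all, toy_client, toy_therapist)
print(checks)
```

If any of these checks came back false on the real files, the left joins in the next section could duplicate rows or leave participants without role-specific data.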


### 1.1 Merge Client and Therapist Data into CN_all

The client and therapist files contain columns that are missing from CN_all. We'll merge them in on `ID`.

In [145]:
# Find columns unique to client/therapist files
all_cols = set(cn_all.columns)
client_cols = set(cn_client.columns)
ther_cols = set(cn_therapist.columns)

# Columns that exist in client/therapist but not in all
missing_from_all = sorted(list((client_cols | ther_cols) - all_cols))

print(f"Found {len(missing_from_all)} columns to add from client/therapist files")
print(f"Examples: {missing_from_all[:5]}")

Found 53 columns to add from client/therapist files
Examples: ['AI_use_sum', 'AIavatar_CO_mean', 'AIavatar_EOU_mean', 'AIavatar_HC_mean', 'AIavatar_HM_mean']


In [146]:
# Get columns that overlap between client and therapist
overlap_cols = sorted(list(set(missing_from_all).intersection(client_cols).intersection(ther_cols)))
print(f"{len(overlap_cols)} columns exist in both client and therapist files")

53 columns exist in both client and therapist files


In [147]:
# Prepare client and therapist data for merging
# Add suffixes to distinguish source
client_extra_cols = [c for c in missing_from_all if c in cn_client.columns]
therapist_extra_cols = [c for c in missing_from_all if c in cn_therapist.columns]

# Create separate dataframes with just ID and the extra columns
client_extra = cn_client[["ID"] + client_extra_cols].copy()
therapist_extra = cn_therapist[["ID"] + therapist_extra_cols].copy()

# Add suffixes to overlapping columns
client_extra = client_extra.rename(columns={c: f"{c}_client" for c in client_extra_cols})
therapist_extra = therapist_extra.rename(columns={c: f"{c}_therapist" for c in therapist_extra_cols})

# Merge into cn_all
cn_augmented = cn_all.merge(client_extra, on="ID", how="left")
cn_augmented = cn_augmented.merge(therapist_extra, on="ID", how="left")

print(f"Augmented CN_all from {cn_all.shape} to {cn_augmented.shape}")

Augmented CN_all from (485, 514) to (485, 620)
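
The row count staying at 485 suggests that the joins did not duplicate participants. A more explicit guard, sketched here on toy data, is to pass `validate` and `indicator` to `DataFrame.merge`. The frames and the `score_client` column are illustrative only.

```python
import pandas as pd

base = pd.DataFrame({"ID": [1, 2, 3], "age": [30, 41, 25]})
extra = pd.DataFrame({"ID": [1, 3], "score_client": [4.5, 3.0]})

# validate raises pandas.errors.MergeError if ID repeats on either side;
# indicator adds a `_merge` column showing where each row was matched.
out = base.merge(extra, on="ID", how="left", validate="one_to_one", indicator=True)

print(out["_merge"].value_counts())
print(out.shape)
```

Rows marked `left_only` are participants without a match in the extra file, which is the expected state for therapists when merging client columns and vice versa.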


In [148]:
cn_augmented.head()

Unnamed: 0,ID,responsetime,workinmh,workinmh_text,mh_service,receive_mh_service,age,gender,ethnicity,province,...,chatbot_CO_mean_therapist,chatbot_EOU_mean_therapist,chatbot_HC_mean_therapist,chatbot_HM_mean_therapist,chatbot_PPR_mean_therapist,chatbot_SE_mean_therapist,chatbot_TQE_mean_therapist,chatbot_mean_therapist,stagesAI_mean_therapist,therapist_filter_therapist
0,380.0,3853.0,2.0,,1.0,2.0,10.0,3.0,0.0,陕西省,...,3.0,3.5,2.75,4.0,2.5,3.5,3.666667,3.27381,4.428395,0.0
1,269.0,1778.0,2.0,,1.0,1.0,17.0,3.0,0.0,上海市,...,5.0,5.0,2.0,5.0,1.0,5.0,5.0,4.0,5.0,0.0
2,73.0,1810.0,2.0,,1.0,2.0,2.0,3.0,0.0,黑龙江省,...,2.0,3.0,3.0,3.0,3.0,2.0,3.0,2.714286,3.352469,0.0
3,544.0,2001.0,1.0,,2.0,2.0,7.0,3.0,0.0,北京市,...,,,,,,,,,,
4,415.0,1523.0,2.0,,1.0,1.0,1.0,3.0,0.0,内蒙古自治区,...,,,,,,,,,,


### 1.2 Coalesce Client/Therapist Columns

For columns that exist in both client and therapist versions, we'll coalesce them into single unified columns.

In [149]:
# Create coalesced versions of overlapping columns
cn_coalesced = cn_augmented.copy()

# Get list of base names that have both _client and _therapist versions
client_suffix_cols = [c for c in cn_coalesced.columns if c.endswith("_client")]
therapist_suffix_cols = [c for c in cn_coalesced.columns if c.endswith("_therapist")]

# Find base names that exist in both (strip only the trailing suffix,
# so a name containing "_client" or "_therapist" elsewhere is left intact)
client_bases = {c[: -len("_client")] for c in client_suffix_cols}
therapist_bases = {c[: -len("_therapist")] for c in therapist_suffix_cols}
common_bases = sorted(client_bases & therapist_bases)

# For each paired column, create a unified version
for base in common_bases:
    client_col = f"{base}_client"
    therapist_col = f"{base}_therapist"
    
    # Coalesce: use client value if available, otherwise therapist value
    cn_coalesced[base] = cn_coalesced[client_col].fillna(cn_coalesced[therapist_col])

print(f"Created {len(common_bases)} coalesced columns")
print(f"Examples: {common_bases[:5]}")

Created 53 coalesced columns
Examples: ['AI_use_sum', 'AIavatar_CO_mean', 'AIavatar_EOU_mean', 'AIavatar_HC_mean', 'AIavatar_HM_mean']
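
The `fillna` coalesce prefers the client value whenever both versions are present. Since each participant should appear in only one of the two source files, it may be worth checking that no row actually has both. A minimal sketch on a toy frame, with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "chatbot_mean_client": [3.0, np.nan, np.nan],
    "chatbot_mean_therapist": [np.nan, 4.0, np.nan],
})

client = toy["chatbot_mean_client"]
therapist = toy["chatbot_mean_therapist"]

# Rows where both sources hold a value would be resolved in favour of the client value
both_present = client.notna() & therapist.notna()

# combine_first gives the same result as client.fillna(therapist) here
toy["chatbot_mean"] = client.combine_first(therapist)

print(int(both_present.sum()))
print(toy["chatbot_mean"].tolist())
```

A non-zero count of `both_present` rows on the real data would mean the coalesced column hides a conflict between the two files.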


### 1.3 Add role_label Variable

Identify whether each participant is a therapist or client based on which version has data.

In [150]:
# Determine role based on which file the participant came from
# Check if they have data in client-specific or therapist-specific columns
if 'therapist' in cn_coalesced.columns:
    # If there's already a therapist column, use it
    cn_coalesced['role_label'] = cn_coalesced['therapist'].map({1: 'therapist', 0: 'client'})
else:
    # Infer from which suffixed columns have data
    # Check a sample client-only column to see who has data
    sample_client_col = f"{common_bases[0]}_client" if common_bases else None
    sample_ther_col = f"{common_bases[0]}_therapist" if common_bases else None
    
    if sample_client_col and sample_ther_col:
        cn_coalesced['role_label'] = 'unknown'
        cn_coalesced.loc[cn_coalesced[sample_client_col].notna(), 'role_label'] = 'client'
        cn_coalesced.loc[cn_coalesced[sample_ther_col].notna(), 'role_label'] = 'therapist'

if 'role_label' in cn_coalesced.columns:
    print("Role distribution:")
    print(cn_coalesced['role_label'].value_counts())
else:
    print("Could not determine role_label")

Role distribution:
role_label
therapist    269
client       216
Name: count, dtype: int64


### 1.4 Compute Composite Scores for China

Calculate mean scores for all multi-item scales.

In [151]:
# 1. TENS_Life_mean (Self-Determination - 9 items; the first 6 are reverse-coded)
tens_items = ['TENS_Life_1r', 'TENS_Life_2r', 'TENS_Life_3r', 'TENS_Life_4r', 'TENS_Life_5r', 'TENS_Life_6r',
              'TENS_Life_7', 'TENS_Life_8', 'TENS_Life_9']
tens_available = [c for c in tens_items if c in cn_coalesced.columns]
if tens_available:
    cn_coalesced['TENS_Life_mean'] = cn_coalesced[tens_available].mean(axis=1)
    print(f"TENS_Life_mean: {len(tens_available)}/9 items, Mean={cn_coalesced['TENS_Life_mean'].mean():.2f}")

# 2. ET_mean (Epistemic Trust - 15 items)
et_items = [f'ET_{i}' for i in range(1, 16)]
et_available = [c for c in et_items if c in cn_coalesced.columns]
if et_available:
    cn_coalesced['ET_mean'] = cn_coalesced[et_available].mean(axis=1)
    print(f"ET_mean: {len(et_available)}/15 items, Mean={cn_coalesced['ET_mean'].mean():.2f}")

# 3. SSRPH_mean (Stigma - 5 items)
ssrph_items = [f'SSRPH_{i}' for i in range(1, 6)]
ssrph_available = [c for c in ssrph_items if c in cn_coalesced.columns]
if ssrph_available:
    cn_coalesced['SSRPH_mean'] = cn_coalesced[ssrph_available].mean(axis=1)
    print(f"SSRPH_mean: {len(ssrph_available)}/5 items, Mean={cn_coalesced['SSRPH_mean'].mean():.2f}")

# 4. PHQ5_mean (Depression - 5 items)
phq_items = [f'PHQ5_{i}' for i in range(1, 6)]
phq_available = [c for c in phq_items if c in cn_coalesced.columns]
if phq_available:
    cn_coalesced['PHQ5_mean'] = cn_coalesced[phq_available].mean(axis=1)
    print(f"PHQ5_mean: {len(phq_available)}/5 items, Mean={cn_coalesced['PHQ5_mean'].mean():.2f}")

# 5. GAAIS_mean (General AI Attitudes - using existing GAAIS_pos and GAAIS_neg if available)
if 'GAAIS_pos' in cn_coalesced.columns and 'GAAIS_neg' in cn_coalesced.columns:
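    # Overall is the average of the positive and the reversed negative subscale.
    # Assumes a 1-7 response scale and that GAAIS_neg is not already reverse-scored.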
    cn_coalesced['GAAIS_mean'] = (cn_coalesced['GAAIS_pos'] + (8 - cn_coalesced['GAAIS_neg'])) / 2
    print(f"GAAIS_mean: computed from pos/neg subscales, Mean={cn_coalesced['GAAIS_mean'].mean():.2f}")
elif 'GAAIS_mean' not in cn_coalesced.columns:
    # Compute from individual items
    gaais_items = [f'GAAIS_{i}' for i in range(1, 11)]
    gaais_available = [c for c in gaais_items if c in cn_coalesced.columns]
    if gaais_available:
        cn_coalesced['GAAIS_mean'] = cn_coalesced[gaais_available].mean(axis=1)
        print(f"GAAIS_mean: {len(gaais_available)} items, Mean={cn_coalesced['GAAIS_mean'].mean():.2f}")

# 6. UTAUT_AI_mean (AI Acceptance - 26 items)
# Prefer the reverse-coded version of an item (suffix "r") when it exists
utaut_ai_items_with_r = []
for i in range(1, 27):
    if f'UTAUT_AI{i}r' in cn_coalesced.columns:
        utaut_ai_items_with_r.append(f'UTAUT_AI{i}r')
    elif f'UTAUT_AI{i}' in cn_coalesced.columns:
        utaut_ai_items_with_r.append(f'UTAUT_AI{i}')

if utaut_ai_items_with_r:
    cn_coalesced['UTAUT_AI_mean'] = cn_coalesced[utaut_ai_items_with_r].mean(axis=1)
    print(f"UTAUT_AI_mean: {len(utaut_ai_items_with_r)} items, Mean={cn_coalesced['UTAUT_AI_mean'].mean():.2f}")

print("All composite scores computed for China")

TENS_Life_mean: 9/9 items, Mean=4.35
ET_mean: 15/15 items, Mean=4.70
SSRPH_mean: 5/5 items, Mean=2.30
PHQ5_mean: 5/5 items, Mean=1.53
GAAIS_mean: computed from pos/neg subscales, Mean=4.57
UTAUT_AI_mean: 26 items, Mean=3.64
All composite scores computed for China
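
The six composites above repeat one pattern: keep the items that exist, then take the row mean. One way to factor this out is sketched below. `composite_mean` and its `min_valid` parameter are hypothetical; the notebook applies no such threshold, and because `DataFrame.mean(axis=1)` skips missing values by default, a row with a single answered item still receives a score.

```python
import numpy as np
import pandas as pd

def composite_mean(df, items, min_valid=1):
    """Row-wise mean over the items present in df (hypothetical helper).

    Rows with fewer than `min_valid` answered items get NaN.
    """
    available = [c for c in items if c in df.columns]
    if not available:
        return pd.Series(np.nan, index=df.index)
    scores = df[available].mean(axis=1)            # skips NaN by default
    n_answered = df[available].notna().sum(axis=1)
    return scores.where(n_answered >= min_valid)

toy = pd.DataFrame({
    "SSRPH_1": [1.0, 2.0, np.nan],
    "SSRPH_2": [3.0, np.nan, np.nan],
    "SSRPH_3": [2.0, np.nan, np.nan],
})
items = [f"SSRPH_{i}" for i in range(1, 6)]  # SSRPH_4 and SSRPH_5 are absent from the toy frame

toy["SSRPH_mean"] = composite_mean(toy, items)
toy["SSRPH_mean_strict"] = composite_mean(toy, items, min_valid=2)
print(toy[["SSRPH_mean", "SSRPH_mean_strict"]])
```

Whether to require a minimum number of answered items is a scoring decision that depends on each instrument's manual, so the threshold here is only a placeholder.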


### 1.5 Add Country Label and Export China Data

In [152]:
# Add Country column
cn_coalesced['Country'] = 'China'

# Export to CSV
output_path = CHINA_DIR / "CN_all_coalesced.csv"
cn_coalesced.to_csv(output_path, index=False)

print("CHINA DATA COMPLETE")
print(f"Saved: {output_path}")
print(f"Shape: {cn_coalesced.shape}")
print(f"Participants: {len(cn_coalesced)}")
print(f"Variables: {len(cn_coalesced.columns)}")

CHINA DATA COMPLETE
Saved: data/china/CN_all_coalesced.csv
Shape: (485, 679)
Participants: 485
Variables: 679


In [153]:
# Show key composites
key_composites = ['TENS_Life_mean', 'ET_mean', 'SSRPH_mean', 'PHQ5_mean', 'GAAIS_mean', 'UTAUT_AI_mean']
available_composites = [c for c in key_composites if c in cn_coalesced.columns]
print(f"Key composite scores ({len(available_composites)}):")
for comp in available_composites:
    n_valid = cn_coalesced[comp].notna().sum()
    mean = cn_coalesced[comp].mean()
    print(f"{comp:20} {n_valid:3} valid, M={mean:.2f}")

Key composite scores (6):
TENS_Life_mean       485 valid, M=4.35
ET_mean              485 valid, M=4.70
SSRPH_mean           485 valid, M=2.30
PHQ5_mean            485 valid, M=1.53
GAAIS_mean           485 valid, M=4.57
UTAUT_AI_mean        485 valid, M=3.64


## 2. Load and Prepare USA Data

In [154]:
# Load USA data
usa_all = pd.read_csv(USA_DIR / "USA_all.csv", low_memory=False)
print(f"USA_all: {usa_all.shape}")

USA_all: (1857, 624)


### 2.1 Standardize USA Column Names

USA uses different naming conventions. We'll standardize key variables to match China.

In [155]:
# Create mapping dictionary for USA column renaming
usa_renamed = usa_all.copy()

# Rename demographics to match China
rename_map = {}
if 'Age' in usa_renamed.columns and 'age' not in usa_renamed.columns:
    rename_map['Age'] = 'age'
if 'Gender' in usa_renamed.columns and 'gender' not in usa_renamed.columns:
    rename_map['Gender'] = 'gender'
if 'Edu' in usa_renamed.columns and 'edu' not in usa_renamed.columns:
    rename_map['Edu'] = 'edu'

# Rename PHQ_5 to PHQ5
phq_rename = {}
for i in range(1, 6):
    if f'PHQ_5_{i}' in usa_renamed.columns:
        phq_rename[f'PHQ_5_{i}'] = f'PHQ5_{i}'
rename_map.update(phq_rename)

if 'PHQ_5_mean' in usa_renamed.columns:
    rename_map['PHQ_5_mean'] = 'PHQ5_mean'

# Apply renaming
if rename_map:
    usa_renamed = usa_renamed.rename(columns=rename_map)
    print(f"Renamed {len(rename_map)} USA columns to match China conventions")
    print(f"Examples: {list(rename_map.items())[:5]}")

Renamed 9 USA columns to match China conventions
Examples: [('Age', 'age'), ('Gender', 'gender'), ('Edu', 'edu'), ('PHQ_5_1', 'PHQ5_1'), ('PHQ_5_2', 'PHQ5_2')]


### 2.2 Compute Composite Scores for USA

In [158]:
# 1. TENS_Life_mean (USA has non-reversed versions)
tens_items_usa = [f'TENS_Life_{i}' for i in range(1, 10)]
tens_available_usa = [c for c in tens_items_usa if c in usa_renamed.columns]
if tens_available_usa:
    # Need to reverse items 1-6 for USA (assuming 7-point scale)
    usa_reversed = usa_renamed.copy()
    for i in range(1, 7):
        col = f'TENS_Life_{i}'
        if col in usa_reversed.columns:
            usa_reversed[f'TENS_Life_{i}r'] = 8 - usa_reversed[col]
    
    # Now compute mean with reversed versions
    tens_for_mean = [f'TENS_Life_{i}r' for i in range(1, 7)] + [f'TENS_Life_{i}' for i in range(7, 10)]
    tens_for_mean = [c for c in tens_for_mean if c in usa_reversed.columns]
    usa_renamed['TENS_Life_mean'] = usa_reversed[tens_for_mean].mean(axis=1)
    print(f"TENS_Life_mean: {len(tens_available_usa)} items (reversed 1-6), Mean={usa_renamed['TENS_Life_mean'].mean():.2f}")

# 2. ET_mean
et_items = [f'ET_{i}' for i in range(1, 16)]
et_available = [c for c in et_items if c in usa_renamed.columns]
if et_available:
    usa_renamed['ET_mean'] = usa_renamed[et_available].mean(axis=1)
    print(f"ET_mean: {len(et_available)}/15 items, Mean={usa_renamed['ET_mean'].mean():.2f}")

# 3. SSRPH_mean
ssrph_items = [f'SSRPH_{i}' for i in range(1, 6)]
ssrph_available = [c for c in ssrph_items if c in usa_renamed.columns]
if ssrph_available:
    usa_renamed['SSRPH_mean'] = usa_renamed[ssrph_available].mean(axis=1)
    print(f"SSRPH_mean: {len(ssrph_available)}/5 items, Mean={usa_renamed['SSRPH_mean'].mean():.2f}")

# 4. PHQ5_mean (if not already present)
if 'PHQ5_mean' not in usa_renamed.columns:
    phq_items = [f'PHQ5_{i}' for i in range(1, 6)]
    phq_available = [c for c in phq_items if c in usa_renamed.columns]
    if phq_available:
        usa_renamed['PHQ5_mean'] = usa_renamed[phq_available].mean(axis=1)
        print(f"PHQ5_mean: {len(phq_available)}/5 items, Mean={usa_renamed['PHQ5_mean'].mean():.2f}")
else:
    print(f"PHQ5_mean: already exists, Mean={usa_renamed['PHQ5_mean'].mean():.2f}")

# 5. GAAIS_mean
if 'GAAIS_pos' in usa_renamed.columns and 'GAAIS_neg' in usa_renamed.columns:
    usa_renamed['GAAIS_mean'] = (usa_renamed['GAAIS_pos'] + (8 - usa_renamed['GAAIS_neg'])) / 2
    print(f"GAAIS_mean: computed from pos/neg subscales, Mean={usa_renamed['GAAIS_mean'].mean():.2f}")
elif 'GAAIS_mean' not in usa_renamed.columns:
    gaais_items = [f'GAAIS_{i}' for i in range(1, 11)]
    gaais_available = [c for c in gaais_items if c in usa_renamed.columns]
    if gaais_available:
        usa_renamed['GAAIS_mean'] = usa_renamed[gaais_available].mean(axis=1)
        print(f"GAAIS_mean: {len(gaais_available)} items, Mean={usa_renamed['GAAIS_mean'].mean():.2f}")

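# 6. UTAUT - USA has a different structure (UTAUT_1_X, UTAUT_2_X, UTAUT_3_X)
# We average across all UTAUT item columns for simplicity. Note that this composite
# is built from a different item set than China's 26 UTAUT_AI items.
utaut_cols = [c for c in usa_renamed.columns if c.startswith('UTAUT_') and '_' in c[6:]]
# Filter out validation columns and the composite itself (so re-running the cell is safe)
utaut_cols = [c for c in utaut_cols if 'validation' not in c.lower() and c != 'UTAUT_AI_mean']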
if utaut_cols:
    usa_renamed['UTAUT_AI_mean'] = usa_renamed[utaut_cols].mean(axis=1)
    print(f"UTAUT_AI_mean: {len(utaut_cols)} items, Mean={usa_renamed['UTAUT_AI_mean'].mean():.2f}")

print("All USA composite scores computed")

TENS_Life_mean: 9 items (reversed 1-6), Mean=4.88
ET_mean: 15/15 items, Mean=4.15
SSRPH_mean: 5/5 items, Mean=0.93
PHQ5_mean: already exists, Mean=1.39
GAAIS_mean: computed from pos/neg subscales, Mean=4.62
UTAUT_AI_mean: 87 items, Mean=5.10
All USA composite scores computed
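
The `8 - x` reversal used for the TENS items and the GAAIS negative subscale is the 7-point case of the general rule `reversed = low + high - x`. The sketch below makes the scale bounds explicit and rejects out-of-range values. `reverse_score` is a hypothetical helper, and the bounds passed to it are assumptions that should be checked against the codebook.

```python
import numpy as np
import pandas as pd

def reverse_score(values, low, high):
    """Reverse-code a Likert item on a scale from `low` to `high` (hypothetical helper)."""
    values = pd.Series(values, dtype="float64")
    out_of_range = values.notna() & ~values.between(low, high)
    if out_of_range.any():
        raise ValueError(f"{int(out_of_range.sum())} value(s) outside [{low}, {high}]")
    return (low + high) - values

item = pd.Series([1, 4, 7, np.nan])
reversed_item = reverse_score(item, low=1, high=7)  # same as 8 - item
print(reversed_item.tolist())
```

An explicit range check of this kind would flag a country file that codes the same item on a 0-based or 5-point scale, where `8 - x` would silently produce wrong values.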


### 2.3 Add Country Label for USA

In [159]:
# Add Country column
usa_renamed['Country'] = 'USA'

print("USA DATA COMPLETE")
print(f"USA data processed")
print(f"Shape: {usa_renamed.shape}")
print(f"Participants: {len(usa_renamed)}")
print(f"Variables: {len(usa_renamed.columns)}")

USA DATA COMPLETE
USA data processed
Shape: (1857, 630)
Participants: 1857
Variables: 630


In [160]:
# Show key composites
key_composites = ['TENS_Life_mean', 'ET_mean', 'SSRPH_mean', 'PHQ5_mean', 'GAAIS_mean', 'UTAUT_AI_mean']
available_composites = [c for c in key_composites if c in usa_renamed.columns]
print(f"Key composite scores ({len(available_composites)}):")
for comp in available_composites:
    n_valid = usa_renamed[comp].notna().sum()
    mean = usa_renamed[comp].mean()
    print(f"{comp:20} {n_valid:4} valid, M={mean:.2f}")

Key composite scores (6):
TENS_Life_mean       1618 valid, M=4.88
ET_mean              1620 valid, M=4.15
SSRPH_mean           1607 valid, M=0.93
PHQ5_mean            1621 valid, M=1.39
GAAIS_mean           1726 valid, M=4.62
UTAUT_AI_mean        1677 valid, M=5.10


## 3. Merge USA and China Data

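Create a single merged dataset from the columns that exist in both countries, plus any key variable (here `ID`) that is not among the common columns. Such a variable is filled with NaN for the country that lacks it.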

In [161]:
# Find common columns
cn_cols = set(cn_coalesced.columns)
usa_cols = set(usa_renamed.columns)
common_cols = sorted(list(cn_cols & usa_cols))

print(f"Common columns: {len(common_cols)}")
print(f"China-only columns: {len(cn_cols - usa_cols)}")
print(f"USA-only columns: {len(usa_cols - cn_cols)}")

Common columns: 263
China-only columns: 416
USA-only columns: 367


In [162]:
# Ensure key variables are in common columns
key_vars = ['ID', 'Country', 'age', 'gender', 'edu', 
            'TENS_Life_mean', 'ET_mean', 'SSRPH_mean', 'PHQ5_mean', 'GAAIS_mean', 'UTAUT_AI_mean',
            'GAAIS_pos', 'GAAIS_neg']
key_vars_available = [v for v in key_vars if v in common_cols]

print(f"Key variables in common: {len(key_vars_available)}/{len(key_vars)}")
print(f"Available: {key_vars_available}")

missing_key = [v for v in key_vars if v not in common_cols and v not in ['ID']]
if missing_key:
    print(f"Missing from common: {missing_key}")

Key variables in common: 12/13
Available: ['Country', 'age', 'gender', 'edu', 'TENS_Life_mean', 'ET_mean', 'SSRPH_mean', 'PHQ5_mean', 'GAAIS_mean', 'UTAUT_AI_mean', 'GAAIS_pos', 'GAAIS_neg']


In [163]:
# Start with common columns
merge_cols = common_cols.copy()

# Add any missing key variables that exist in either dataset
for var in key_vars:
    if var not in merge_cols:
        if var in cn_cols or var in usa_cols:
            merge_cols.append(var)

merge_cols = sorted(list(set(merge_cols)))

print(f"Final merge columns: {len(merge_cols)}")

Final merge columns: 264


In [164]:
# Create subsets with only merge columns (fill missing with NaN)
cn_subset = cn_coalesced.reindex(columns=merge_cols)
usa_subset = usa_renamed.reindex(columns=merge_cols)

print(f"China subset: {cn_subset.shape}")
print(f"USA subset: {usa_subset.shape}")

China subset: (485, 264)
USA subset: (1857, 264)
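
`reindex(columns=...)` keeps only the listed columns, in the listed order, and creates an all-NaN column for any name the frame lacks. That is how a key variable missing from one dataset can still enter the merge. A toy illustration follows; the frames and the `State` column are made up.

```python
import pandas as pd

cn_toy = pd.DataFrame({"ID": [380, 269], "age": [10, 17], "province": ["A", "B"]})
usa_toy = pd.DataFrame({"age": [34, 52], "State": ["NY", "CA"]})

merge_cols = ["ID", "age"]  # `ID` is absent from usa_toy

cn_sub = cn_toy.reindex(columns=merge_cols)
usa_sub = usa_toy.reindex(columns=merge_cols)  # `ID` becomes an all-NaN column

stacked = pd.concat([cn_sub, usa_sub], axis=0, ignore_index=True)
print(stacked)
```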


In [166]:
# Concatenate the datasets
merged = pd.concat([cn_subset, usa_subset], axis=0, ignore_index=True)

print("MERGED DATASET CREATED")
print(f"Merged dataset created")
print(f"Shape: {merged.shape}")
print(f"Total participants: {len(merged)}")
print(f"Variables: {len(merged.columns)}")

MERGED DATASET CREATED
Merged dataset created
Shape: (2342, 264)
Total participants: 2342
Variables: 264


In [167]:
# Show country breakdown
print(f"Country distribution:")
print(merged['Country'].value_counts().to_string())

Country distribution:
Country
USA      1857
China     485


### 3.1 Export Merged Dataset

In [168]:
# Export to CSV
output_path = MERGED_DIR / "merged.csv"
merged.to_csv(output_path, index=False)

print("MERGED DATASET EXPORTED")
print(f"Saved to: {output_path}")

MERGED DATASET EXPORTED
Saved to: data/merged/merged.csv


In [169]:
# Show summary of key variables
print(f"Key composite scores summary:")
key_composites = ['TENS_Life_mean', 'ET_mean', 'SSRPH_mean', 'PHQ5_mean', 'GAAIS_mean', 'UTAUT_AI_mean']
for comp in key_composites:
    if comp in merged.columns:
        total_valid = merged[comp].notna().sum()
        total = len(merged)
        pct = total_valid / total * 100
        mean = merged[comp].mean()
        print(f"{comp:20} {total_valid:4}/{total} ({pct:5.1f}%) valid, M={mean:.2f}")

Key composite scores summary:
TENS_Life_mean       2103/2342 ( 89.8%) valid, M=4.76
ET_mean              2105/2342 ( 89.9%) valid, M=4.27
SSRPH_mean           2092/2342 ( 89.3%) valid, M=1.25
PHQ5_mean            2106/2342 ( 89.9%) valid, M=1.42
GAAIS_mean           2211/2342 ( 94.4%) valid, M=4.61
UTAUT_AI_mean        2162/2342 ( 92.3%) valid, M=4.78


## 4. Data Quality Summary

In [170]:
# Summary by country
print("DATA QUALITY SUMMARY BY COUNTRY")

for country in ['China', 'USA']:
    df_country = merged[merged['Country'] == country]
    print(f" {country} (N={len(df_country)})")
    
    for comp in key_composites:
        if comp in df_country.columns:
            n_valid = df_country[comp].notna().sum()
            mean = df_country[comp].mean()
            std = df_country[comp].std()
            print(f"{comp:20} N={n_valid:4}, M={mean:.2f}, SD={std:.2f}")

DATA QUALITY SUMMARY BY COUNTRY
 China (N=485)
TENS_Life_mean       N= 485, M=4.35, SD=0.99
ET_mean              N= 485, M=4.70, SD=0.91
SSRPH_mean           N= 485, M=2.30, SD=0.93
PHQ5_mean            N= 485, M=1.53, SD=0.99
GAAIS_mean           N= 485, M=4.57, SD=0.69
UTAUT_AI_mean        N= 485, M=3.64, SD=0.55
 USA (N=1857)
TENS_Life_mean       N=1618, M=4.88, SD=1.00
ET_mean              N=1620, M=4.15, SD=0.80
SSRPH_mean           N=1607, M=0.93, SD=0.76
PHQ5_mean            N=1621, M=1.39, SD=1.05
GAAIS_mean           N=1726, M=4.62, SD=0.82
UTAUT_AI_mean        N=1677, M=5.10, SD=1.38
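
Some composites differ noticeably between countries (SSRPH 2.30 vs 0.93, UTAUT_AI 3.64 vs 5.10), and the UTAUT composite is built from 26 items in China but 87 columns in the USA. Before pooling, it may help to compare the observed range of each composite per country, since differing bounds can indicate that the response scales were coded differently. A minimal sketch on toy data, with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["China", "China", "China", "USA", "USA", "USA"],
    "SSRPH_mean": [1.0, 2.4, 4.0, 0.0, 0.8, 3.0],
})

# Observed minimum and maximum per country; differing bounds hint at different codings
ranges = toy.groupby("Country")["SSRPH_mean"].agg(["min", "max"])
print(ranges)

same_bounds = ranges["min"].nunique() == 1 and ranges["max"].nunique() == 1
print("Same observed bounds in both countries:", same_bounds)
```

Observed ranges are only a hint, not proof of the coding, so any mismatch would still need to be resolved against the survey documentation.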


## 5. Complete Cases for Hypotheses

In [171]:
# Check complete cases for each hypothesis
print("COMPLETE CASES FOR HYPOTHESES")

hypothesis_vars = {
    'H1: SDT to AI Acceptance': ['TENS_Life_mean', 'UTAUT_AI_mean', 'age', 'gender'],
    'H2: AI Attitude Moderation': ['TENS_Life_mean', 'UTAUT_AI_mean', 'GAAIS_mean'],
    'H3: Cross-Cultural': ['TENS_Life_mean', 'UTAUT_AI_mean', 'Country'],
    'H4: ET Mediation': ['TENS_Life_mean', 'ET_mean', 'UTAUT_AI_mean']
}

for hyp, vars_list in hypothesis_vars.items():
    available_vars = [v for v in vars_list if v in merged.columns]
    complete = merged[available_vars].dropna()
    
    print(f"{hyp}")
    print(f"Variables: {', '.join(available_vars)}")
    print(f"Complete cases: {len(complete):4} / {len(merged)} ({len(complete)/len(merged)*100:.1f}%)")
    
    # Breakdown by country
    if 'Country' in complete.columns:
        for country in ['China', 'USA']:
            n = len(complete[complete['Country'] == country])
            print(f"{country}: {n:4}")

COMPLETE CASES FOR HYPOTHESES
H1: SDT to AI Acceptance
Variables: TENS_Life_mean, UTAUT_AI_mean, age, gender
Complete cases: 2096 / 2342 (89.5%)
H2: AI Attitude Moderation
Variables: TENS_Life_mean, UTAUT_AI_mean, GAAIS_mean
Complete cases: 2096 / 2342 (89.5%)
H3: Cross-Cultural
Variables: TENS_Life_mean, UTAUT_AI_mean, Country
Complete cases: 2096 / 2342 (89.5%)
China:  485
USA: 1611
H4: ET Mediation
Variables: TENS_Life_mean, ET_mean, UTAUT_AI_mean
Complete cases: 2096 / 2342 (89.5%)
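
The country breakdown above is printed only for H3, because `Country` is a column of `complete` only when it is listed among the hypothesis variables. One alternative, sketched here on toy data with made-up values, is to build a completeness mask on the full frame and group it by country:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["China", "China", "USA", "USA", "USA"],
    "TENS_Life_mean": [4.1, 4.5, 5.0, np.nan, 4.8],
    "ET_mean": [4.7, 4.2, np.nan, 4.0, 4.1],
    "UTAUT_AI_mean": [3.6, 3.9, 5.2, 5.0, np.nan],
})

h4_vars = ["TENS_Life_mean", "ET_mean", "UTAUT_AI_mean"]

# Boolean mask of rows with no missing value on the hypothesis variables
is_complete = toy[h4_vars].notna().all(axis=1)

# Count complete rows per country without needing Country in the variable list
per_country = is_complete.groupby(toy["Country"]).sum()
print(per_country)
```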


## 🎉 Analysis-Ready Datasets Created!

Two clean datasets have been created:

1. **`data/china/CN_all_coalesced.csv`** - China data with:
   - Client and therapist data merged
   - All composite scores computed
   - Ready for China-specific analyses

2. **`data/merged/merged.csv`** - Combined USA + China data with:
   - Variables common to both countries (plus `ID`)
   - All key composite scores
   - Ready for hypothesis testing (H1-H4)

### Key Variables Available:
- **TENS_Life_mean** - Self-Determination (SDT)
- **UTAUT_AI_mean** - AI Acceptance
- **GAAIS_mean** - General AI Attitudes
- **ET_mean** - Epistemic Trust  
- **SSRPH_mean** - Stigma
- **PHQ5_mean** - Depression
- **Country** - USA / China
- **age, gender, edu** - Demographics

### Next Steps:
Use `merged.csv` for all hypothesis testing!