# =============================================================================
# DATA VALIDATION AND CLEANING SCRIPT v3.2 (GROUP-SPECIFIC IMPUTATION)
# Meta Political Content Research - Italian Parliamentarians Dataset
# =============================================================================
#
# This script validates, cleans, and prepares the combined Italian political
# accounts dataset for analysis. It ensures data integrity after multiple
# merges and creates analysis-ready datasets for RQ1-RQ4.
#
# CHANGE FROM v3.1: Group-specific view imputation method
#   - OLD (v3.1): Pooled ratio-weighted (ratio=1.31 for all groups)
#   - NEW (v3.2): Group-specific ratios derived from each group's distribution
#   - Parameters derived dynamically via power law fit to 101-500 view range
#   - Addresses heterogeneity: Extremists skew toward very low views, Prominent toward the 100-view threshold
#   - Sensitivity: Maximum 0.002% difference between pooled and group-specific
#
# MODIFICATION: Complete Weeks Only
#   - Weekly aggregation now only includes complete weeks
#   - Partial weeks at the start and end of the study period are excluded
#   - This ensures consistent temporal comparison across all groups
#
# STUDY TIME-FRAME:
#   - Start date: 2021-01-01 (Pre-policy baseline)
#   - End date:   2025-11-30 (Post-reversal period)
#   - Posts outside this range are excluded from analysis
#
# CRITICAL CHECKS PERFORMED:
#   1. Merge integrity: Each surface.id assigned to exactly ONE main_list
#   2. Duplicate detection: No duplicate post IDs in dataset
#   3. Metadata consistency: Account info consistent across posts
#   4. Temporal filtering: Posts within study time-frame only
#   5. NA handling: Views imputed, engagement NAs removed
#
# INPUT:
#   - Combined dataset from combine_datasets_*.R scripts
#   - Location: combined_datasets/political_posts_*.rds
#   - Compatible with v2 (3 groups) and v3.2 (4 groups with MP split)
#
# OUTPUT DATASETS (saved to cleaned_data/):
#   1. cleaned_posts_*.rds       - All posts with analysis variables
#   2. accounts_summary_*.rds    - Account-level aggregated statistics
#   3. accounts_both_periods_*.rds - Accounts active pre AND post policy
#   4. posts_both_periods_*.rds  - Posts from balanced panel accounts
#   5. weekly_aggregation_*.rds  - Weekly time series for RQ1 (COMPLETE WEEKS ONLY)
#   6. monthly_aggregation_*.rds - Monthly time series for RQ1
#   7. account_period_*.rds      - Account-level by period (pre/post)
#   8. surface_info_*.rds/csv    - Comprehensive account metadata
#
# OUTPUT FOR MCL API QUERIES (to fetch additional account info):
#   - surface_ids_for_api_*.txt  - One ID per line
#   - surface_ids_for_api_*.csv  - IDs with names and list assignments
#   - surface_ids_r_vector_*.R   - Copy-paste ready R vector
#   - surface_ids_[LIST]_*.txt   - Separate files by main_list
#
# NA HANDLING STRATEGY:
#   - Views (NA):        Imputed with GROUP-SPECIFIC ratio-weighted method
#                        (parameters derived dynamically from each group's distribution)
#   - Reactions (NA):    Posts REMOVED - no documented threshold
#   - Shares (NA):       Posts REMOVED - no documented threshold  
#   - Comments (NA):     Posts REMOVED - no documented threshold
#   - Content_type (NA): Labeled 'mcl_unsupported_attachment' - MCL does not
#                        support this attachment type
#
# MERGE INTEGRITY RESOLUTION OPTIONS:
#   - STOP:               Halt execution if issues found, save diagnostics
#   - REMOVE:             Remove all posts from affected accounts
#   - MAJORITY:           Assign account to list with most posts
#   - PRIORITY (default): Use priority order (MPs_Reelected > MPs_New >
#                         Prominent_Politicians > Extremists)
#
# RESEARCH QUESTIONS SUPPORTED:
#   - RQ1: When/extent of reach reduction → weekly/monthly_aggregation
#   - RQ2: Engagement-reach moderation   → cleaned_posts
#   - RQ3: Equal effects across spectrum → accounts_both_periods
#   - RQ4: Experience effects (v3.2)     → accounts_both_periods (MPs only)
#
# REQUIREMENTS:
#   - R packages: tidyverse, lubridate
#   - Input: Combined dataset from combine_datasets_*.R
#
# USAGE:
#   1. Ensure combined dataset exists in combined_datasets/
#   2. (Optional) Modify study_start_date / study_end_date if needed
#   3. (Optional) Modify multi_list_resolution if integrity issues expected
#   4. Run script: source("data_validation_cleaning_v3.R")
#   5. Check console output for validation results
#   6. Load cleaned data: readRDS("cleaned_data/cleaned_posts_*.rds")
#
# VERSION HISTORY:
#   v1.0 - Initial validation script
#   v2.0 - Added NA removal (instead of imputation) for engagement metrics
#   v3.0 - Added comprehensive merge integrity checks
#        - Added surface info dataset with account metadata
#        - Added MCL API query file outputs
#        - Added automatic version detection (3 vs 4 groups)
#   v3.1 - Added study time-frame filtering (2021-01-01 to 2025-11-30)
#   v3.1-REFINED - Updated view imputation to ratio-weighted method
#                  based on empirical power law extrapolation (pooled ratio=1.31)
#   v3.2 - GROUP-SPECIFIC imputation: derives separate parameters for each group
#        - Changed default conflict resolution to PRIORITY
#        - Parameters: Extremists (α=-0.36, ratio=1.65), Prominent (α=0.33, ratio=0.66),
#          MPs_Reelected (α=0.14, ratio=0.82), MPs_New (α=0.03, ratio=0.94)
#   v3.2-COMPLETE-WEEKS - Added complete weeks filtering for weekly aggregation
#        - Partial weeks at start/end of study period are excluded
#        - Ensures consistent temporal comparison
#
# AUTHOR: [Your name]
# DATE: [Current date]
# PROJECT: Meta Political Content Moderation Research
# =============================================================================

library(tidyverse)
library(lubridate)

# Load shared utilities and configuration
source("scripts/utils.R")
config <- load_config("IT")

# =============================================================================
# CONFIGURATION
# =============================================================================

# ┌─────────────────────────────────────────────────────────────────────────┐
# │ STUDY TIME-FRAME                                                        │
# │ Posts outside this range will be excluded from analysis                 │
# └─────────────────────────────────────────────────────────────────────────┘

study_start_date <- config$study_period$start_date  # Pre-policy baseline
study_end_date   <- config$study_period$end_date    # Post-reversal period

# ┌─────────────────────────────────────────────────────────────────────────┐
# │ COMPLETE WEEKS CONFIGURATION                                            │
# │ Only complete weeks will be included in weekly_aggregation              │
# └─────────────────────────────────────────────────────────────────────────┘

# Weeks run Sunday-Saturday: lubridate's floor_date(x, "week") starts weeks on
# Sunday by default (week_start = 7), not on the ISO 8601 Monday.
# The helper functions below derive complete weeks from these boundaries.

# ┌─────────────────────────────────────────────────────────────────────────┐
# │ GROUP-SPECIFIC IMPUTATION CONFIGURATION                                 │
# │ Parameters derived dynamically from each group's distribution           │
# └─────────────────────────────────────────────────────────────────────────┘

IMPUTATION_BIN_WIDTH <- 50      # Bin width for power law fitting
IMPUTATION_MAX_VIEW <- 500      # Maximum view count for fitting
IMPUTATION_MIN_BINS <- 3        # Minimum non-empty bins for reliable fit
IMPUTATION_MIN_POSTS <- 100     # Minimum posts near threshold for fitting
POOLED_FALLBACK_RATIO <- 1.0    # Fallback ratio if group fit fails (uniform)

# =============================================================================
# HELPER FUNCTIONS FOR COMPLETE WEEKS
# =============================================================================

#' Calculate the first complete week start date
#' 
#' Given a study start date, returns the start of the first complete week.
#' If the study starts on a Sunday (week start), that date is returned.
#' Otherwise, returns the following Sunday.
#' 
#' @param start_date The study start date
#' @return The start date of the first complete week
#' 
get_first_complete_week_start <- function(start_date) {
  # floor_date gives the start of the week containing the date (Sunday)
  week_start <- floor_date(start_date, "week")
  
  # If study starts exactly on a Sunday, that week is complete from our perspective
  if (week_start == start_date) {
    return(start_date)
  } else {
    # Otherwise, the first complete week starts next Sunday
    return(week_start + days(7))
  }
}

#' Calculate the last complete week end date
#' 
#' Given a study end date, returns the end (Saturday) of the last complete week.
#' A complete week ends on Saturday (day before Sunday).
#' 
#' @param end_date The study end date
#' @return The end date of the last complete week (Saturday)
#' 
get_last_complete_week_end <- function(end_date) {
  # Get the start of the week containing end_date
  week_start <- floor_date(end_date, "week")
  
  # The end of that week is Saturday (week_start + 6 days)
  week_end <- week_start + days(6)
  
  # end_date lies within this week, so the week is complete only if end_date is its Saturday
  if (end_date >= week_end) {
    return(week_end)
  } else {
    # Otherwise, the last complete week is the previous week
    return(week_end - days(7))
  }
}

#' Check if a week is complete within the study period
#' 
#' @param week_start The start date of the week (Sunday)
#' @param study_start The study start date
#' @param study_end The study end date
#' @return TRUE if the week is complete, FALSE otherwise
#' 
is_complete_week <- function(week_start, study_start, study_end) {
  week_end <- week_start + days(6)  # Saturday
  return(week_start >= study_start & week_end <= study_end)
}
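
# -----------------------------------------------------------------------------
# Illustrative example (not run when the script is sourced): a minimal sketch of
# how the three helpers above behave for the documented study period. It assumes
# lubridate's default week start (Sunday, i.e. the "lubridate.week.start" option
# is unset or 7) and that dates are Date objects. The ex_* names are
# hypothetical and used only here.
# -----------------------------------------------------------------------------
if (FALSE) {
  library(lubridate)

  ex_start <- as.Date("2021-01-01")  # a Friday
  ex_end   <- as.Date("2025-11-30")  # a Sunday

  # First complete week begins on the first Sunday on or after the start date
  ex_first <- get_first_complete_week_start(ex_start)  # 2021-01-03
  # Last complete week ends on the last Saturday on or before the end date
  ex_last  <- get_last_complete_week_end(ex_end)       # 2025-11-29

  # The week containing the start date begins before the study period,
  # so it is partial; the first complete week passes the check
  is_complete_week(floor_date(ex_start, "week"), ex_start, ex_end)  # FALSE
  is_complete_week(ex_first, ex_start, ex_end)                      # TRUE
}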

# =============================================================================
# GROUP-SPECIFIC IMPUTATION FUNCTIONS (v3.2)
# =============================================================================
# Each group's imputation parameters are derived from its own near-threshold
# distribution via power law extrapolation. This addresses heterogeneity:
#   - Extremists: Mass concentrated at very low views (ratio ~1.65)
#   - Prominent Politicians: Mass concentrated closer to 100 (ratio ~0.66)
#   - MPs: Between these extremes (ratio ~0.82-0.94)
# =============================================================================

#' Derive imputation ratio for a group from near-threshold distribution
#' 
#' Fits power law to bin counts and extrapolates to estimate ratio of
#' posts in the 1-50 vs 51-100 view range.
#' 
#' @param views Vector of observed (non-NA) views for the group
#' @param bin_width Width of bins for fitting (default 50)
#' @param max_view Maximum view count to include in fit (default 500)
#' @param min_bins Minimum non-empty bins required (default 3)
#' @param min_posts Minimum posts near threshold required (default 100)
#' @param fallback_ratio Ratio to use if fit fails (default 1.0 = uniform)
#' @return List with ratio, alpha, r_squared, and reliability flag
#' 
derive_imputation_ratio <- function(views, 
                                    bin_width = IMPUTATION_BIN_WIDTH, 
                                    max_view = IMPUTATION_MAX_VIEW,
                                    min_bins = IMPUTATION_MIN_BINS,
                                    min_posts = IMPUTATION_MIN_POSTS,
                                    fallback_ratio = POOLED_FALLBACK_RATIO) {
  
  # Filter to near-threshold views (101 to max_view)
  near_threshold <- views[views > 100 & views <= max_view]
  n_near <- length(near_threshold)
  
  # Check if enough data
  if (n_near < min_posts) {
    return(list(
      ratio = fallback_ratio,
      alpha = NA,
      r_squared = NA,
      n_near_threshold = n_near,
      reliable = FALSE,
      reason = paste("Insufficient posts near threshold:", n_near, "<", min_posts)
    ))
  }
  
  # Create bins
  breaks <- seq(100, max_view, by = bin_width)
  bins <- cut(near_threshold, breaks = breaks, include.lowest = TRUE)
  bin_counts <- table(bins)
  bin_mids <- breaks[-length(breaks)] + bin_width / 2
  
  # Filter out zero counts
  valid <- bin_counts > 0
  n_valid_bins <- sum(valid)
  
  if (n_valid_bins < min_bins) {
    return(list(
      ratio = fallback_ratio,
      alpha = NA,
      r_squared = NA,
      n_near_threshold = n_near,
      reliable = FALSE,
      reason = paste("Insufficient non-empty bins:", n_valid_bins, "<", min_bins)
    ))
  }
  
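  # Fit power law on the log-log scale: log(count) ~ log(bin_mid).
  # The slope is the exponent alpha; unname() drops the coefficient label so
  # it does not propagate into the returned ratio.
  fit <- lm(log(as.numeric(bin_counts[valid])) ~ log(bin_mids[valid]))
  alpha <- unname(coef(fit)[2])
  r_squared <- summary(fit)$r.squared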
  
  # Derive ratio using integral of power law
  a <- alpha + 1
  
  if (abs(a) > 1e-10) {
    integral_1_50 <- (50^a - 1^a) / a
    integral_50_100 <- (100^a - 50^a) / a
  } else {
    integral_1_50 <- log(50) - log(1)
    integral_50_100 <- log(100) - log(50)
  }
  
  ratio <- integral_1_50 / integral_50_100
  
  # Ensure ratio is positive and reasonable
  if (is.na(ratio) || ratio <= 0 || ratio > 10) {
    return(list(
      ratio = fallback_ratio,
      alpha = alpha,
      r_squared = r_squared,
      n_near_threshold = n_near,
      reliable = FALSE,
      reason = paste("Derived ratio out of range:", round(ratio, 3))
    ))
  }
  
  return(list(
    ratio = ratio,
    alpha = alpha,
    r_squared = r_squared,
    n_near_threshold = n_near,
    reliable = TRUE,
    reason = "Successfully derived from data"
  ))
}
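
# Worked example (not run when sourced): a sketch of the closed-form step in
# derive_imputation_ratio(). For a fitted density proportional to x^alpha, the
# mass on [lo, hi] is (hi^a - lo^a) / a with a = alpha + 1, so the ratio of the
# 1-50 mass to the 50-100 mass is (50^a - 1) / (100^a - 50^a). Plugging in the
# rounded alphas from the version history approximately reproduces the
# documented ratios (to within about 0.01, because the alphas are rounded).
# ex_ratio_from_alpha is a hypothetical helper name used only here.
if (FALSE) {
  ex_ratio_from_alpha <- function(alpha) {
    a <- alpha + 1
    if (abs(a) > 1e-10) {
      (50^a - 1) / (100^a - 50^a)
    } else {
      log(50) / (log(100) - log(50))  # limiting case alpha = -1
    }
  }

  ex_alphas <- c(Extremists = -0.36, Prominent_Politicians = 0.33,
                 MPs_Reelected = 0.14, MPs_New = 0.03)
  round(sapply(ex_alphas, ex_ratio_from_alpha), 2)

  # alpha = 0 (flat density) gives a ratio just under 1: 49 / 50 = 0.98
  ex_ratio_from_alpha(0)
}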

#' Generate imputed values using ratio-weighted method
#' 
#' @param n Number of values to generate
#' @param ratio Ratio of posts in 1-50 vs 51-100 (>1 means more low values)
#' @param seed Random seed (NULL for no seed setting within function)
#' @return Vector of imputed integer values in [1, 100]
#' 
generate_imputed_values <- function(n, ratio, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  
  if (n == 0) return(integer(0))
  
  p_low <- ratio / (ratio + 1)
  in_low_half <- runif(n) < p_low
  
  values <- integer(n)
  n_low <- sum(in_low_half)
  n_high <- n - n_low
  
  if (n_low > 0) {
    values[in_low_half] <- sample(1:50, n_low, replace = TRUE)
  }
  if (n_high > 0) {
    values[!in_low_half] <- sample(51:100, n_high, replace = TRUE)
  }
  
  return(values)
}
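
# Illustrative example (not run when sourced): a sketch of how the ratio maps
# to sampling probabilities. With ratio = 1.65, p_low = 1.65 / 2.65 (about
# 0.62), so roughly 62% of imputed values should fall in 1-50. The seed and
# the ex_* names are arbitrary choices for this example.
if (FALSE) {
  ex_vals <- generate_imputed_values(10000, ratio = 1.65, seed = 42)

  range(ex_vals)        # stays within 1 and 100
  mean(ex_vals <= 50)   # close to 1.65 / (1.65 + 1)

  # ratio = 1 reduces to a uniform draw over 1-100
  mean(generate_imputed_values(10000, ratio = 1, seed = 42) <= 50)  # close to 0.5
}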

#' Compute summary statistics for imputed values
#' 
#' @param ratio Ratio used for imputation
#' @param n_samples Number of samples for estimation (default 100000)
#' @return List with mean, median, sd, pct_under_50
#' 
compute_imputation_stats <- function(ratio, n_samples = 100000) {
  set.seed(12345)  # Fixed seed for consistent stats reporting (note: resets the global RNG state)
  values <- generate_imputed_values(n_samples, ratio)
  
  list(
    mean = round(mean(values), 1),
    median = median(values),
    sd = round(sd(values), 1),
    pct_under_50 = round(100 * mean(values <= 50), 1),
    p_low = round(100 * ratio / (ratio + 1), 1)
  )
}

# =============================================================================

cat("\n")
cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n")
cat("DATA VALIDATION AND CLEANING (v3.2 - GROUP-SPECIFIC IMPUTATION)\n")
cat("WITH COMPLETE WEEKS FILTERING\n")
cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# ============================================================================
# STEP 1: LOAD DATA
# ============================================================================

cat("STEP 1: Loading combined dataset...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Find and load most recent combined dataset
data_file <- find_most_recent_file("combined_datasets", "political_posts_.*\\.rds$")

if (is.null(data_file)) {
  stop("No combined dataset found in combined_datasets/ directory.\n",
       "Please run the dataset combination script first.")
}

cat("Loading:", basename(data_file), "\n")
cat("File date:", format(file.mtime(data_file), "%Y-%m-%d %H:%M:%S"), "\n\n")

data <- readRDS(data_file)

cat("✓ Loaded data\n")
cat("  Total posts:", nrow(data), "\n")
cat("  Columns:", ncol(data), "\n")
cat("  Unique surface.id:", n_distinct(data$surface.id), "\n")
cat("  Unique post id:", n_distinct(data$id), "\n\n")

# ============================================================================
# STEP 2: MERGE INTEGRITY VALIDATION (CRITICAL)
# ============================================================================

cat("STEP 2: Validating merge integrity (CRITICAL CHECKS)...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

integrity_errors <- list()
integrity_warnings <- list()

# ---------------------------------------------------------------------------
# CHECK 2.1: Each surface.id must be assigned to exactly ONE main_list
# ---------------------------------------------------------------------------

cat("CHECK 2.1: Verifying each account (surface.id) has exactly ONE main_list...\n\n")

surface_list_mapping <- data %>%
  group_by(surface.id) %>%
  summarise(
    n_lists = n_distinct(main_list),
    lists = paste(sort(unique(main_list)), collapse = ", "),
    n_posts = n(),
    surface_name = first(surface.name),
    surface_username = first(surface.username),
    .groups = "drop"
  )

# Accounts with multiple list assignments
multi_list_accounts <- surface_list_mapping %>%
  filter(n_lists > 1) %>%
  arrange(desc(n_lists), desc(n_posts))

if (nrow(multi_list_accounts) > 0) {
  cat("✗ CRITICAL ERROR: Found", nrow(multi_list_accounts), 
      "accounts assigned to MULTIPLE lists!\n\n")
  
  cat("Accounts with multiple list assignments:\n")
  cat("-" %>% rep(60) %>% paste0(collapse = ""), "\n")
  
  # Print details
  for (i in 1:min(nrow(multi_list_accounts), 50)) {
    acc <- multi_list_accounts[i, ]
    cat(sprintf("\n  Account %d:\n", i))
    cat(sprintf("    surface.id: %s\n", acc$surface.id))
    cat(sprintf("    name: %s\n", acc$surface_name))
    cat(sprintf("    username: %s\n", acc$surface_username))
    cat(sprintf("    assigned to: %s\n", acc$lists))
    cat(sprintf("    total posts: %d\n", acc$n_posts))
    
    # Show post distribution across lists for this account
    post_dist <- data %>%
      filter(surface.id == acc$surface.id) %>%
      count(main_list) %>%
      mutate(info = sprintf("%s: %d posts", main_list, n))
    
    cat(sprintf("    distribution: %s\n", paste(post_dist$info, collapse = ", ")))
  }
  
  if (nrow(multi_list_accounts) > 50) {
    cat("\n  ... and", nrow(multi_list_accounts) - 50, "more accounts\n")
  }
  
  cat("\n")
  
  # Store for later handling
  integrity_errors$multi_list_accounts <- multi_list_accounts
  
  # Calculate total affected posts
  affected_posts <- data %>%
    filter(surface.id %in% multi_list_accounts$surface.id) %>%
    nrow()
  
  cat("Total posts affected:", affected_posts, "\n\n")
  
} else {
  cat("✓ PASSED: All", nrow(surface_list_mapping), 
      "accounts are assigned to exactly ONE list\n\n")
}
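
# Illustrative example (not run when sourced): the same group-and-count logic
# as CHECK 2.1 on a toy data frame. The toy values are made up; account "a2"
# appears under two lists and is the only one flagged.
if (FALSE) {
  library(dplyr)

  ex_posts <- tibble::tibble(
    surface.id = c("a1", "a1", "a2", "a2", "a2"),
    main_list  = c("MPs_New", "MPs_New", "Extremists", "MPs_New", "Extremists")
  )

  ex_posts %>%
    group_by(surface.id) %>%
    summarise(
      n_lists = n_distinct(main_list),
      lists   = paste(sort(unique(main_list)), collapse = ", "),
      n_posts = n(),
      .groups = "drop"
    ) %>%
    filter(n_lists > 1)
  # one row: surface.id "a2", n_lists 2, lists "Extremists, MPs_New", n_posts 3
}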

# ---------------------------------------------------------------------------
# CHECK 2.2: Duplicate post IDs
# ---------------------------------------------------------------------------

cat("CHECK 2.2: Checking for duplicate post IDs...\n\n")

duplicate_posts <- data %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(id)

if (nrow(duplicate_posts) > 0) {
  n_dup_ids <- n_distinct(duplicate_posts$id)
  cat("✗ ERROR: Found", nrow(duplicate_posts), "rows with", n_dup_ids, "duplicate post IDs!\n\n")
  
  # Show examples
  cat("Examples of duplicate posts:\n")
  dup_examples <- duplicate_posts %>%
    group_by(id) %>%
    slice_head(n = 2) %>%
    ungroup() %>%
    head(20) %>%
    dplyr::select(id, surface.id, surface.name, main_list, date, statistics.views)
  
  print(dup_examples)
  cat("\n")
  
  integrity_errors$duplicate_posts <- duplicate_posts
  
} else {
  cat("✓ PASSED: All", n_distinct(data$id), "post IDs are unique\n\n")
}
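
# Illustrative example (not run when sourced): duplicate detection as in
# CHECK 2.2, plus the keep-first resolution applied later in this step, on
# made-up toy data.
if (FALSE) {
  library(dplyr)

  ex_posts <- tibble::tibble(
    id               = c("p1", "p2", "p2", "p3"),
    statistics.views = c(120, 300, 305, NA)
  )

  # Rows whose id occurs more than once (both copies of "p2")
  ex_posts %>% group_by(id) %>% filter(n() > 1) %>% ungroup()

  # distinct() keeps the first row per id, so the 300-view copy of "p2" survives
  ex_posts %>% distinct(id, .keep_all = TRUE)
}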

# ---------------------------------------------------------------------------
# CHECK 2.3: Consistency of account metadata across posts
# ---------------------------------------------------------------------------

cat("CHECK 2.3: Checking consistency of account metadata across posts...\n\n")

# Check if surface.name and surface.username are consistent for each surface.id
metadata_consistency <- data %>%
  group_by(surface.id) %>%
  summarise(
    n_names = n_distinct(surface.name, na.rm = TRUE),
    n_usernames = n_distinct(surface.username, na.rm = TRUE),
    names = paste(unique(na.omit(surface.name)), collapse = " | "),
    usernames = paste(unique(na.omit(surface.username)), collapse = " | "),
    .groups = "drop"
  )

inconsistent_names <- metadata_consistency %>%
  filter(n_names > 1)

inconsistent_usernames <- metadata_consistency %>%
  filter(n_usernames > 1)

if (nrow(inconsistent_names) > 0) {
  cat("⚠ WARNING: Found", nrow(inconsistent_names), 
      "accounts with inconsistent surface.name values\n")
  cat("  (This may be due to name changes over time - usually not critical)\n")
  
  if (nrow(inconsistent_names) <= 10) {
    print(inconsistent_names %>% dplyr::select(surface.id, n_names, names))
  } else {
    print(head(inconsistent_names %>% dplyr::select(surface.id, n_names, names), 10))
    cat("  ... and", nrow(inconsistent_names) - 10, "more\n")
  }
  cat("\n")
  
  integrity_warnings$inconsistent_names <- inconsistent_names
} else {
  cat("✓ PASSED: surface.name is consistent for all accounts\n")
}

if (nrow(inconsistent_usernames) > 0) {
  cat("⚠ WARNING: Found", nrow(inconsistent_usernames), 
      "accounts with inconsistent surface.username values\n")
  
  if (nrow(inconsistent_usernames) <= 10) {
    print(inconsistent_usernames %>% dplyr::select(surface.id, n_usernames, usernames))
  } else {
    print(head(inconsistent_usernames %>% dplyr::select(surface.id, n_usernames, usernames), 10))
    cat("  ... and", nrow(inconsistent_usernames) - 10, "more\n")
  }
  cat("\n")
  
  integrity_warnings$inconsistent_usernames <- inconsistent_usernames
} else {
  cat("✓ PASSED: surface.username is consistent for all accounts\n")
}
cat("\n")

# ---------------------------------------------------------------------------
# CHECK 2.4: Sub-list consistency within main_list
# ---------------------------------------------------------------------------

cat("CHECK 2.4: Checking sub_list consistency within main_list...\n\n")

if ("sub_list" %in% names(data)) {
  sublist_consistency <- data %>%
    group_by(surface.id, main_list) %>%
    summarise(
      n_sublists = n_distinct(sub_list, na.rm = TRUE),
      sublists = paste(unique(na.omit(sub_list)), collapse = " | "),
      .groups = "drop"
    ) %>%
    filter(n_sublists > 1)
  
  if (nrow(sublist_consistency) > 0) {
    cat("⚠ WARNING: Found", nrow(sublist_consistency), 
        "accounts with inconsistent sub_list within main_list\n")
    print(head(sublist_consistency, 10))
    cat("\n")
    
    integrity_warnings$inconsistent_sublists <- sublist_consistency
  } else {
    cat("✓ PASSED: sub_list is consistent within main_list for all accounts\n\n")
  }
} else {
  cat("ℹ sub_list column not found - skipping check\n\n")
}

# ---------------------------------------------------------------------------
# CHECK 2.5: Cross-reference post counts with unique accounts
# ---------------------------------------------------------------------------

cat("CHECK 2.5: Cross-referencing counts...\n\n")

# Summary by main_list
list_summary <- data %>%
  group_by(main_list) %>%
  summarise(
    n_posts = n(),
    n_accounts = n_distinct(surface.id),
    posts_per_account = round(n_posts / n_accounts, 1),
    .groups = "drop"
  )

cat("Posts and accounts by main_list:\n")
print(list_summary)
cat("\n")

# Check if totals match
total_unique_accounts <- n_distinct(data$surface.id)
sum_accounts_by_list <- sum(list_summary$n_accounts)

if (!is.null(integrity_errors$multi_list_accounts)) {
  # Each multi-list account is counted once per list, i.e. (n_lists - 1) extra times
  expected_overcounting <- sum(integrity_errors$multi_list_accounts$n_lists - 1)
  cat("Note: Sum of accounts by list (", sum_accounts_by_list, 
      ") > unique accounts (", total_unique_accounts, ")\n", sep = "")
  cat("A difference of", expected_overcounting, "is expected due to", 
      nrow(integrity_errors$multi_list_accounts), 
      "accounts appearing in multiple lists\n\n")
} else {
  if (sum_accounts_by_list == total_unique_accounts) {
    cat("✓ PASSED: Account counts are consistent (sum by list = unique total)\n\n")
  } else {
    cat("⚠ Unexpected: Sum by list (", sum_accounts_by_list, 
        ") != unique total (", total_unique_accounts, ")\n\n", sep = "")
  }
}

# ---------------------------------------------------------------------------
# INTEGRITY CHECK SUMMARY & DECISION
# ---------------------------------------------------------------------------

cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n")
cat("MERGE INTEGRITY CHECK SUMMARY\n")
cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

has_critical_errors <- length(integrity_errors) > 0

if (has_critical_errors) {
  cat("✗ CRITICAL ERRORS FOUND:\n\n")
  
  if (!is.null(integrity_errors$multi_list_accounts)) {
    cat("  - ", nrow(integrity_errors$multi_list_accounts), 
        " accounts assigned to multiple lists\n", sep = "")
  }
  
  if (!is.null(integrity_errors$duplicate_posts)) {
    cat("  - ", n_distinct(integrity_errors$duplicate_posts$id), 
        " duplicate post IDs\n", sep = "")
  }
  
  cat("\n")
  cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n")
  cat("RESOLUTION OPTIONS\n")
  cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n\n")
  
  # ---------------------------------------------------------------------------
  # HANDLE MULTI-LIST ACCOUNTS
  # ---------------------------------------------------------------------------
  
  if (!is.null(integrity_errors$multi_list_accounts)) {
    cat("For accounts assigned to multiple lists, choose a resolution:\n\n")
    cat("  Option 1: STOP - Fix the merge script and re-run\n")
    cat("  Option 2: REMOVE - Remove all posts from affected accounts\n")
    cat("  Option 3: MAJORITY - Assign account to list with most posts\n")
    cat("  Option 4: PRIORITY - Use priority order (MPs > Prominent > Extremists)\n")
    cat("\n")
    
    # Set default resolution here (can be changed)
    # PRIORITY is recommended: assigns to most specific/relevant category
    multi_list_resolution <- "PRIORITY"  # Options: "STOP", "REMOVE", "MAJORITY", "PRIORITY"
    
    cat("Current setting: multi_list_resolution = '", multi_list_resolution, "'\n\n", sep = "")
    
    if (multi_list_resolution == "STOP") {
      # Save diagnostic info before stopping
      diagnostic_file <- paste0("merge_integrity_errors_", 
                                format(Sys.time(), "%Y%m%d_%H%M%S"), ".rds")
      saveRDS(integrity_errors, diagnostic_file)
      cat("Diagnostic information saved to:", diagnostic_file, "\n\n")
      
      stop("STOPPING: Critical merge integrity errors detected.\n",
           "Please review the errors above and fix the merge script.\n",
           "Alternatively, change 'multi_list_resolution' to handle automatically.")
      
    } else if (multi_list_resolution == "REMOVE") {
      cat("Removing all posts from accounts with multiple list assignments...\n")
      
      accounts_to_remove <- integrity_errors$multi_list_accounts$surface.id
      n_posts_before <- nrow(data)
      
      data <- data %>%
        filter(!(surface.id %in% accounts_to_remove))
      
      n_posts_removed <- n_posts_before - nrow(data)
      cat("✓ Removed", n_posts_removed, "posts from", 
          length(accounts_to_remove), "accounts\n\n")
      
    } else if (multi_list_resolution == "MAJORITY") {
      cat("Assigning accounts to the list with most posts...\n")
      
      # For each multi-list account, determine majority list
      majority_assignments <- data %>%
        filter(surface.id %in% integrity_errors$multi_list_accounts$surface.id) %>%
        count(surface.id, main_list) %>%
        group_by(surface.id) %>%
        slice_max(n, n = 1, with_ties = FALSE) %>%
        ungroup() %>%
        dplyr::select(surface.id, assigned_list = main_list)
      
      # Update main_list for affected accounts
      data <- data %>%
        left_join(majority_assignments, by = "surface.id") %>%
        mutate(
          main_list = if_else(!is.na(assigned_list), assigned_list, main_list)
        ) %>%
        dplyr::select(-assigned_list)
      
      cat("✓ Reassigned", nrow(majority_assignments), "accounts to majority list\n")
      cat("  New distribution:\n")
      print(data %>% 
              filter(surface.id %in% integrity_errors$multi_list_accounts$surface.id) %>%
              count(main_list))
      cat("\n")
      
    } else if (multi_list_resolution == "PRIORITY") {
      cat("Assigning accounts using priority order...\n")
      cat("  Priority: MPs_Reelected > MPs_New > Prominent_Politicians > Extremists\n")
      cat("  (MPs = core sample; Prominent = established figures; Extremists = catch-all)\n\n")
      
      # Define priority (lower number = higher priority)
      # MPs are the core research sample
      # Prominent_Politicians are established political figures
      # Extremists is the broadest catch-all category
      priority_order <- c("MPs_Reelected" = 1, "MPs_New" = 2, "MPs" = 3, 
                          "Prominent_Politicians" = 4, "Extremists" = 5)
      
      # For each multi-list account, use highest priority list
      priority_assignments <- data %>%
        filter(surface.id %in% integrity_errors$multi_list_accounts$surface.id) %>%
        distinct(surface.id, main_list) %>%
        # as.character() ensures lookup by name even if main_list is a factor
        mutate(priority = unname(priority_order[as.character(main_list)])) %>%
        group_by(surface.id) %>%
        slice_min(priority, n = 1, with_ties = FALSE) %>%
        ungroup() %>%
        dplyr::select(surface.id, assigned_list = main_list)
      
      # Log each assignment
      cat("Assigning accounts to highest-priority list:\n")
      for (i in 1:nrow(priority_assignments)) {
        acc_id <- priority_assignments$surface.id[i]
        acc_info <- integrity_errors$multi_list_accounts %>%
          filter(surface.id == acc_id)
        assigned <- priority_assignments$assigned_list[i]
        cat(sprintf("  %s: %s → %s\n", 
                    acc_info$surface_name, acc_info$lists, assigned))
      }
      cat("\n")
      
      # Update main_list for affected accounts
      data <- data %>%
        left_join(priority_assignments, by = "surface.id") %>%
        mutate(
          main_list = if_else(!is.na(assigned_list), assigned_list, main_list)
        ) %>%
        dplyr::select(-assigned_list)
      
      cat("✓ Reassigned", nrow(priority_assignments), "accounts using priority\n")
      cat("  New distribution:\n")
      print(data %>% 
              filter(surface.id %in% integrity_errors$multi_list_accounts$surface.id) %>%
              count(main_list))
      cat("\n")
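    } else {
      stop("Unknown multi_list_resolution: '", multi_list_resolution, "'.\n",
           "Use \"STOP\", \"REMOVE\", \"MAJORITY\", or \"PRIORITY\".")
    }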
  }
  
  # ---------------------------------------------------------------------------
  # HANDLE DUPLICATE POSTS
  # ---------------------------------------------------------------------------
  
  if (!is.null(integrity_errors$duplicate_posts)) {
    cat("Handling duplicate post IDs...\n")
    
    n_before <- nrow(data)
    
    # Keep first occurrence of each duplicate
    data <- data %>%
      distinct(id, .keep_all = TRUE)
    
    n_removed <- n_before - nrow(data)
    cat("✓ Removed", n_removed, "duplicate posts (kept first occurrence)\n\n")
  }
  
} else {
  cat("✓ ALL MERGE INTEGRITY CHECKS PASSED\n\n")
}
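
# Illustrative example (not run when sourced): the MAJORITY and PRIORITY rules
# on a made-up account with 3 Extremists posts and 1 MPs_New post. The two
# rules disagree here, which is why the choice of multi_list_resolution
# matters. The ex_* names are hypothetical.
if (FALSE) {
  library(dplyr)

  ex_posts <- tibble::tibble(
    surface.id = "a2",
    main_list  = c("Extremists", "Extremists", "Extremists", "MPs_New")
  )
  ex_priority <- c("MPs_Reelected" = 1, "MPs_New" = 2, "MPs" = 3,
                   "Prominent_Politicians" = 4, "Extremists" = 5)

  # MAJORITY: list with the most posts -> "Extremists"
  ex_posts %>%
    count(surface.id, main_list) %>%
    group_by(surface.id) %>%
    slice_max(n, n = 1, with_ties = FALSE) %>%
    ungroup()

  # PRIORITY: list with the lowest priority number -> "MPs_New".
  # as.character() guards against main_list being a factor, in which case
  # indexing would use the integer codes instead of the names.
  ex_posts %>%
    distinct(surface.id, main_list) %>%
    mutate(priority = unname(ex_priority[as.character(main_list)])) %>%
    group_by(surface.id) %>%
    slice_min(priority, n = 1, with_ties = FALSE) %>%
    ungroup()
}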

# Warnings summary
if (length(integrity_warnings) > 0) {
  cat("Warnings (non-critical):\n")
  for (warning_name in names(integrity_warnings)) {
    cat("  - ", warning_name, ": ", nrow(integrity_warnings[[warning_name]]), " cases\n", sep = "")
  }
  cat("\n")
}

# ---------------------------------------------------------------------------
# FINAL VERIFICATION: Re-check that each surface.id now has exactly ONE list
# ---------------------------------------------------------------------------

cat("FINAL VERIFICATION: Re-checking list assignments...\n")

final_check <- data %>%
  group_by(surface.id) %>%
  summarise(n_lists = n_distinct(main_list), .groups = "drop") %>%
  filter(n_lists > 1)

if (nrow(final_check) > 0) {
  stop("FATAL: After resolution, ", nrow(final_check), 
       " accounts still have multiple list assignments!")
} else {
  cat("✓ VERIFIED: All", n_distinct(data$surface.id), 
      "accounts now have exactly ONE list assignment\n\n")
}

# ============================================================================
# STEP 3: CRITICAL FIELD VALIDATION
# ============================================================================

cat("STEP 3: Validating critical fields...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Check for required fields
required_fields <- c("id", "surface.id", "main_list", "date", "creation_time",
                     "statistics.views", "statistics.reaction_count", 
                     "statistics.share_count", "statistics.comment_count")

missing_fields <- setdiff(required_fields, names(data))
if (length(missing_fields) > 0) {
  stop("Missing required fields: ", paste(missing_fields, collapse = ", "))
}

cat("✓ All required fields present\n\n")

# Validate surface.id
na_surface_id <- sum(is.na(data$surface.id))
if (na_surface_id > 0) {
  cat("⚠ WARNING:", na_surface_id, "posts with NA surface.id\n")
  cat("These will be removed.\n\n")
} else {
  cat("✓ No NA values in surface.id\n\n")
}

# Validate date
na_date <- sum(is.na(data$date))
if (na_date > 0) {
  cat("⚠ WARNING:", na_date, "posts with NA date\n")
  cat("These will be removed.\n\n")
} else {
  cat("✓ No NA values in date\n\n")
}

# Validate main_list
cat("Main list distribution (after integrity checks):\n")
main_list_counts <- data %>% count(main_list)
print(main_list_counts)
cat("\n")

# Valid main_lists - flexible to handle both v3.1 and v3.2 outputs
valid_main_lists_pattern <- "^(MPs.*|Prominent_Politicians|Extremists)$"

invalid_lists <- data %>%
  filter(!grepl(valid_main_lists_pattern, main_list)) %>%
  distinct(main_list) %>%
  pull(main_list)

if (length(invalid_lists) > 0) {
  cat("⚠ WARNING: Invalid main_list values found:\n")
  print(invalid_lists)
  cat("\nExpected pattern: MPs.*, Prominent_Politicians, or Extremists\n")
  cat("These will be removed.\n\n")
}

# ============================================================================
# STEP 4: REMOVE INVALID RECORDS AND APPLY TIME-FRAME FILTER
# ============================================================================

cat("STEP 4: Removing invalid records and applying time-frame filter...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Report study time-frame
cat("Study time-frame:\n")
cat("  Start date:", format(study_start_date, "%Y-%m-%d"), "\n")
cat("  End date:  ", format(study_end_date, "%Y-%m-%d"), "\n\n")

n_before <- nrow(data)

# Check posts outside time-frame BEFORE filtering
posts_before_start <- sum(data$date < study_start_date, na.rm = TRUE)
posts_after_end <- sum(data$date > study_end_date, na.rm = TRUE)
posts_outside_timeframe <- posts_before_start + posts_after_end

if (posts_outside_timeframe > 0) {
  cat("Posts outside study time-frame:\n")
  cat("  Before", format(study_start_date, "%Y-%m-%d"), ":", posts_before_start, "\n")
  cat("  After", format(study_end_date, "%Y-%m-%d"), " :", posts_after_end, "\n")
  cat("  Total to exclude:", posts_outside_timeframe, "\n\n")
  
  # Show date range of excluded posts
  if (posts_before_start > 0) {
    early_dates <- data %>% 
      filter(date < study_start_date) %>% 
      summarise(min = min(date), max = max(date))
    cat("  Early posts range:", format(early_dates$min, "%Y-%m-%d"), "to", 
        format(early_dates$max, "%Y-%m-%d"), "\n")
  }
  if (posts_after_end > 0) {
    late_dates <- data %>% 
      filter(date > study_end_date) %>% 
      summarise(min = min(date), max = max(date))
    cat("  Late posts range:", format(late_dates$min, "%Y-%m-%d"), "to", 
        format(late_dates$max, "%Y-%m-%d"), "\n")
  }
  cat("\n")
}

# Apply all filters including time-frame
data <- data %>%
  filter(
    !is.na(surface.id),
    !is.na(date),
    grepl(valid_main_lists_pattern, main_list),
    date >= study_start_date,
    date <= study_end_date
  )

n_removed <- n_before - nrow(data)
n_removed_other <- n_removed - posts_outside_timeframe

cat("Records removed:\n")
cat("  Invalid (NA surface.id, NA date, invalid list):", n_removed_other, "\n")
cat("  Outside study time-frame:", posts_outside_timeframe, "\n")
cat("  Total removed:", n_removed, "\n")
cat("Remaining posts:", nrow(data), "\n\n")

# ============================================================================
# STEP 5: TEMPORAL COVERAGE VALIDATION
# ============================================================================

cat("STEP 5: Validating temporal coverage...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Overall date range
date_range <- data %>%
  summarise(
    min_date = min(date, na.rm = TRUE),
    max_date = max(date, na.rm = TRUE),
    span_days = as.numeric(difftime(max(date), min(date), units = "days"))
  )

cat("Overall date range:\n")
cat("  First post:", format(date_range$min_date, "%Y-%m-%d"), "\n")
cat("  Last post:", format(date_range$max_date, "%Y-%m-%d"), "\n")
cat("  Total span:", date_range$span_days, "days\n\n")

# Date range by main list
cat("Date range by main list:\n")
date_by_list <- data %>%
  group_by(main_list) %>%
  summarise(
    first_post = min(date),
    last_post = max(date),
    span_days = as.numeric(difftime(last_post, first_post, units = "days")),
    n_posts = n(),
    .groups = "drop"
  )
print(date_by_list)
cat("\n")

# Check for gaps
cat("Checking for temporal gaps...\n")
daily_posts <- data %>%
  count(date) %>%
  arrange(date)

# Find gaps > 30 days
gaps <- daily_posts %>%
  mutate(
    next_date = lead(date),
    gap_days = as.numeric(difftime(next_date, date, units = "days"))
  ) %>%
  filter(gap_days > 30) %>%
  dplyr::select(date, next_date, gap_days)

if (nrow(gaps) > 0) {
  cat("⚠ WARNING: Found", nrow(gaps), "gaps > 30 days:\n")
  print(gaps)
  cat("\n")
} else {
  cat("✓ No major temporal gaps detected\n\n")
}

# ============================================================================
# STEP 6: KEY META POLICY DATES
# ============================================================================

cat("STEP 6: Defining Meta policy periods...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Define key policy dates
policy_dates <- tibble(
  event = c(
    "Pre-Policy",
    "Initial Announcement",
    "US Tests Begin",
    "Global Expansion 1",
    "Reduced Engagement Weight Test",
    "Global Implementation",
    "Refinements (Survey Signals)",
    "User Control Setting",
    "Policy Reversal",
    "Updated Approach"
  ),
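  # Combine Date objects with c(): mixing character strings and Dates in one
  # c() call coerces the Dates to numbers, which as.Date() then turns into NA
  date = c(
    as.Date("2020-01-01"),
    as.Date("2021-02-10"),
    as.Date(config$policy_timeline$initial_tests),
    as.Date(config$policy_timeline$european_expansion),
    as.Date("2022-05-24"),
    as.Date(config$policy_timeline$global_implementation),
    as.Date(config$policy_timeline$engagement_deemphasis),
    as.Date(config$policy_timeline$user_controls),
    as.Date(config$policy_timeline$reversal_announcement),
    as.Date("2025-05-28")
  )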
)

cat("Key policy dates:\n")
print(policy_dates)
cat("\n")

# Check data coverage for key periods (key date taken from config, not hard-coded)
key_date <- config$policy_timeline$global_implementation
cat("Data coverage around key date (", format(as.Date(key_date), "%Y-%m-%d"),
    " global implementation):\n", sep = "")

coverage_check <- data %>%
  mutate(
    period = case_when(
      date < key_date - 180 ~ "6mo+ before",
      date >= key_date - 180 & date < key_date - 90 ~ "3-6mo before",
      date >= key_date - 90 & date < key_date - 30 ~ "1-3mo before",
      date >= key_date - 30 & date < key_date ~ "1mo before",
      date >= key_date & date < key_date + 30 ~ "1mo after",
      date >= key_date + 30 & date < key_date + 90 ~ "1-3mo after",
      date >= key_date + 90 & date < key_date + 180 ~ "3-6mo after",
      date >= key_date + 180 ~ "6mo+ after"
    )
  ) %>%
  count(period, main_list) %>%
  pivot_wider(names_from = main_list, values_from = n, values_fill = 0)

print(coverage_check)
cat("\n")

# ============================================================================
# STEP 7: ENGAGEMENT METRICS VALIDATION AND HANDLING
# ============================================================================

cat("STEP 7: Validating and handling engagement metrics...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Set seed for reproducibility of random imputation
set.seed(42)
cat("Random seed set to 42 for reproducibility\n\n")

# Check for NA values in key metrics
engagement_fields <- c("statistics.views", "statistics.reaction_count",
                       "statistics.share_count", "statistics.comment_count")

cat("NA values in engagement metrics (BEFORE any handling):\n")
for (field in engagement_fields) {
  na_count <- sum(is.na(data[[field]]))
  pct <- round(100 * na_count / nrow(data), 2)
  cat(sprintf("  %-30s: %8d (%5.2f%%)\n", field, na_count, pct))
}
cat("\n")

# ---------------------------------------------------------------------------
# STEP 7a: Handle NA Views - GROUP-SPECIFIC RATIO-WEIGHTED IMPUTATION
# ---------------------------------------------------------------------------

cat("--- Handling NA Views (GROUP-SPECIFIC METHOD) ---\n\n")

cat("IMPUTATION METHOD: Group-Specific Ratio-Weighted\n")
cat("  Each group's parameters derived from its own near-threshold distribution\n")
cat("  Power law fit to bin counts (101-", IMPUTATION_MAX_VIEW, " views)\n", sep = "")
cat("  Addresses heterogeneity across political groups\n\n")
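
# Illustrative sketch only (not called by this script): one way a power law
# could be fitted to near-threshold bin counts, using ordinary least squares
# on the log-log scale. The real logic lives in derive_imputation_ratio(),
# defined elsewhere. The function name, the bin midpoints, the OLS fit, and
# the default bin_width of 50 are assumptions made for illustration; the
# default max_view of 500 follows the 101-500 range named in the header.
sketch_power_law_alpha <- function(views, bin_width = 50, max_view = 500) {
  # Keep only observed views in the near-threshold window (101..max_view)
  v <- views[!is.na(views) & views > 100 & views <= max_view]
  breaks <- seq(100, max_view, by = bin_width)
  counts <- as.numeric(table(cut(v, breaks = breaks)))  # (100,150], (150,200], ...
  mids <- head(breaks, -1) + bin_width / 2
  keep <- counts > 0                                    # log(0) is undefined
  fit <- lm(log(counts[keep]) ~ log(mids[keep]))
  list(
    alpha = -unname(coef(fit)[2]),                      # counts ~ mids^(-alpha)
    r_squared = summary(fit)$r.squared,
    n_near_threshold = length(v)
  )
}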

# Document NA views by group BEFORE imputation
cat("NA views by group (BEFORE imputation):\n")
na_views_by_group <- data %>%
  group_by(main_list) %>%
  summarise(
    total_posts = n(),
    na_views = sum(is.na(statistics.views)),
    pct_na = round(100 * na_views / total_posts, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_na))

print(na_views_by_group)
cat("\n")

# Check for pre-2017 posts
pre_2017_posts <- sum(data$date < as.Date("2017-01-01"), na.rm = TRUE)
if (pre_2017_posts > 0) {
  cat("⚠ NOTE:", pre_2017_posts, "posts are from before 2017-01-01\n")
  cat("  MCL documentation states view counts are not available for posts before this date.\n\n")
}

# Determine reason for NA views
data <- data %>%
  mutate(
    views_na_reason = case_when(
      !is.na(statistics.views) ~ "not_na",
      date < as.Date("2017-01-01") ~ "pre_2017",
      TRUE ~ "under_threshold"
    ),
    views_imputed = views_na_reason == "under_threshold",
    views_pre_2017 = views_na_reason == "pre_2017"
  )

# Report NA reasons
cat("NA views breakdown by reason:\n")
na_reason_summary <- data %>%
  count(views_na_reason) %>%
  mutate(pct = round(100 * n / sum(n), 2))
print(na_reason_summary)
cat("\n")

# Count posts to impute
n_to_impute <- sum(data$views_imputed)
n_pre_2017 <- sum(data$views_pre_2017)

if (n_to_impute > 0) {
  cat("ℹ IMPUTING", n_to_impute, "posts with NA views (≤100 views threshold)\n")
  cat("  Rationale: MCL confirmed NA means ≤100 views (censoring threshold)\n")
  cat("  Method: GROUP-SPECIFIC ratio-weighted based on each group's distribution\n\n")
  
  # Set master seed for reproducibility
  set.seed(42)
  cat("  Master random seed: 42\n\n")
  
  # Derive parameters for each group
  groups <- unique(data$main_list)
  group_imputation_params <- list()
  
  cat("Deriving imputation parameters by group:\n")
  cat("-" %>% rep(60) %>% paste0(collapse = ""), "\n\n")
  
  for (g in groups) {
    # Get observed views for this group
    group_views <- data %>%
      filter(main_list == g, !is.na(statistics.views)) %>%
      pull(statistics.views)
    
    # Derive ratio from near-threshold distribution
    params <- derive_imputation_ratio(
      views = group_views,
      bin_width = IMPUTATION_BIN_WIDTH,
      max_view = IMPUTATION_MAX_VIEW,
      min_bins = IMPUTATION_MIN_BINS,
      min_posts = IMPUTATION_MIN_POSTS,
      fallback_ratio = POOLED_FALLBACK_RATIO
    )
    
    # Compute expected imputation statistics
    imp_stats <- compute_imputation_stats(params$ratio)
    
    # Store parameters
    group_imputation_params[[g]] <- list(
      group = g,
      n_censored = sum(data$main_list == g & data$views_imputed),
      ratio = params$ratio,
      alpha = params$alpha,
      r_squared = params$r_squared,
      n_near_threshold = params$n_near_threshold,
      reliable = params$reliable,
      expected_mean = imp_stats$mean,
      expected_median = imp_stats$median,
      expected_pct_under_50 = imp_stats$pct_under_50
    )
    
    # Print summary
    cat(sprintf("%s:\n", g))
    cat(sprintf("  Posts to impute: %d\n", group_imputation_params[[g]]$n_censored))
    cat(sprintf("  Near threshold (101-%d): %d posts\n", IMPUTATION_MAX_VIEW, params$n_near_threshold))
    
    if (params$reliable) {
      cat(sprintf("  Power law exponent (α): %.3f\n", params$alpha))
      cat(sprintf("  R-squared: %.3f\n", params$r_squared))
      cat(sprintf("  Derived ratio (1-50 vs 51-100): %.2f\n", params$ratio))
      # %.1f rather than %d: sprintf() errors on %d when the median is fractional
      cat(sprintf("  → Expected imputed mean: %.1f, median: %.1f\n", imp_stats$mean, imp_stats$median))
    } else {
      cat(sprintf("  ⚠ Using fallback ratio: %.2f (pooled fallback)\n", params$ratio))
    }
    cat("\n")
  }
  
  # Apply imputation by group
  cat("Applying imputation by group:\n")
  cat("-" %>% rep(40) %>% paste0(collapse = ""), "\n")
  
  imputation_log <- list()
  
  for (g in groups) {
    params <- group_imputation_params[[g]]
    
    # Get indices of censored posts in this group
    idx <- which(data$main_list == g & data$views_imputed)
    n_group_impute <- length(idx)
    
    if (n_group_impute > 0) {
      # Generate imputed values (master seed already set)
      imputed_values <- generate_imputed_values(n_group_impute, params$ratio, seed = NULL)
      
      # Apply to data
      data$statistics.views[idx] <- imputed_values
      
      # Log results
      imputation_log[[g]] <- list(
        n_imputed = n_group_impute,
        ratio_used = params$ratio,
        actual_mean = round(mean(imputed_values), 1),
        actual_median = median(imputed_values),
        actual_pct_under_50 = round(100 * mean(imputed_values <= 50), 1)
      )
      
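      # %.1f for the median: with an even number of values it can be fractional,
      # and sprintf() errors on %d for non-integer doubles
      cat(sprintf("  %s: %d posts (ratio=%.2f, mean=%.1f, median=%.1f)\n",
                  g, n_group_impute, params$ratio, mean(imputed_values), median(imputed_values)))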
    } else {
      imputation_log[[g]] <- list(n_imputed = 0)
    }
  }
  
  cat("\n✓ Imputed", n_to_impute, "view counts using group-specific method\n\n")
  
  # Summary table
  cat("Group-specific imputation summary:\n")
  imp_summary <- tibble(
    Group = sapply(group_imputation_params, function(x) x$group),
    N_Imputed = sapply(group_imputation_params, function(x) x$n_censored),
    Alpha = sapply(group_imputation_params, function(x) round(x$alpha, 3)),
    Ratio = sapply(group_imputation_params, function(x) round(x$ratio, 2)),
    Exp_Mean = sapply(group_imputation_params, function(x) x$expected_mean)
  )
  print(imp_summary)
  cat("\n")
  
} else {
  data$views_imputed <- FALSE
  group_imputation_params <- list()
  imputation_log <- list()
  cat("✓ No views to impute\n\n")
}
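
# Illustrative sketch only (not called by this script): a ratio-weighted draw
# for censored view counts. It assumes `ratio` is the expected number of posts
# with 1-50 views per post with 51-100 views, and that values are uniform
# within each half. generate_imputed_values(), defined elsewhere, is the
# function actually used above and may differ in detail.
sketch_ratio_weighted_draw <- function(n, ratio) {
  p_low <- ratio / (1 + ratio)                  # probability of the 1-50 half
  in_low <- runif(n) < p_low
  offset <- sample.int(50L, n, replace = TRUE)  # uniform position within a half
  ifelse(in_low, offset, 50L + offset)          # integers in 1..100
}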

# ---------------------------------------------------------------------------
# STEP 7b: Handle NA Engagement Metrics - REMOVE POSTS
# ---------------------------------------------------------------------------

cat("--- Handling NA Engagement Metrics (REMOVAL) ---\n\n")

n_before_engagement_removal <- nrow(data)

# Count NAs before removal
na_reactions <- sum(is.na(data$statistics.reaction_count))
na_shares <- sum(is.na(data$statistics.share_count))
na_comments <- sum(is.na(data$statistics.comment_count))

cat("Posts with NA engagement metrics (BEFORE removal):\n")
cat("  NA reactions:", na_reactions, sprintf("(%.2f%%)\n", 100*na_reactions/nrow(data)))
cat("  NA shares:", na_shares, sprintf("(%.2f%%)\n", 100*na_shares/nrow(data)))
cat("  NA comments:", na_comments, sprintf("(%.2f%%)\n", 100*na_comments/nrow(data)))
cat("\n")

# Document by group before removal
na_engagement_by_group <- data %>%
  group_by(main_list) %>%
  summarise(
    total_posts = n(),
    na_reactions = sum(is.na(statistics.reaction_count)),
    na_shares = sum(is.na(statistics.share_count)),
    na_comments = sum(is.na(statistics.comment_count)),
    posts_any_na = sum(is.na(statistics.reaction_count) | 
                       is.na(statistics.share_count) | 
                       is.na(statistics.comment_count)),
    pct_any_na = round(100 * posts_any_na / total_posts, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_any_na))

cat("NA engagement by group (BEFORE removal):\n")
print(na_engagement_by_group)
cat("\n")

# REMOVE posts with NA engagement
cat("⚠ REMOVING posts with NA engagement metrics...\n")

data <- data %>%
  filter(
    !is.na(statistics.reaction_count),
    !is.na(statistics.share_count),
    !is.na(statistics.comment_count)
  )

n_engagement_removed <- n_before_engagement_removal - nrow(data)

cat("✓ Removed", n_engagement_removed, "posts with NA engagement metrics\n")
cat("  Remaining posts:", nrow(data), "\n\n")

# Update n_removed
n_removed <- n_removed + n_engagement_removed

# ---------------------------------------------------------------------------
# STEP 7c: Final Validation Checks
# ---------------------------------------------------------------------------

cat("--- Final Validation Checks ---\n\n")

# Check for negative values
negative_checks <- data %>%
  summarise(
    negative_views = sum(statistics.views < 0, na.rm = TRUE),
    negative_reactions = sum(statistics.reaction_count < 0, na.rm = TRUE),
    negative_shares = sum(statistics.share_count < 0, na.rm = TRUE),
    negative_comments = sum(statistics.comment_count < 0, na.rm = TRUE)
  )

if (sum(negative_checks) > 0) {
  cat("⚠ WARNING: Found negative values - setting to 0\n")
  data <- data %>%
    mutate(
      # No na.rm here: pmax(..., na.rm = TRUE) would silently turn NA views into 0
      statistics.views = pmax(statistics.views, 0),
      statistics.reaction_count = pmax(statistics.reaction_count, 0),
      statistics.share_count = pmax(statistics.share_count, 0),
      statistics.comment_count = pmax(statistics.comment_count, 0)
    )
} else {
  cat("✓ No negative values found\n")
}

# Check NA content_type
na_content_type <- sum(is.na(data$content_type))
if (na_content_type > 0) {
  # MCL message: "This content isn't available right now. Content Library does not support this type of attachment."
  data <- data %>%
    mutate(content_type = replace_na(content_type, "mcl_unsupported_attachment"))
  cat("✓ Replaced", na_content_type, "NA content_type with 'mcl_unsupported_attachment'\n")
  cat("  (Original MCL message: 'Content Library does not support this type of attachment')\n")
} else {
  cat("✓ No NA values in content_type\n")
}
cat("\n")

# Final NA check
cat("NA values in engagement metrics (AFTER handling):\n")
for (field in engagement_fields) {
  na_count <- sum(is.na(data[[field]]))
  cat(sprintf("  %-30s: %d\n", field, na_count))
}
cat("\n")

# ============================================================================
# STEP 8: CREATE ANALYSIS VARIABLES
# ============================================================================

cat("STEP 8: Creating analysis-ready variables...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

data <- data %>%
  mutate(
    # Policy period
    policy_period = case_when(
      date < config$policy_timeline$global_implementation ~ "Pre-Policy",
      date >= config$policy_timeline$global_implementation ~ "Post-Policy"
    ),
    
    days_since_policy = as.numeric(difftime(date, config$policy_timeline$global_implementation, units = "days")),
    
    policy_phase = case_when(
      date < as.Date("2021-02-10") ~ "Baseline",
      date >= as.Date("2021-02-10") & date < config$policy_timeline$european_expansion ~ "Announcement",
      date >= config$policy_timeline$european_expansion & date < config$policy_timeline$global_implementation ~ "Testing",
      date >= config$policy_timeline$global_implementation & date < config$policy_timeline$engagement_deemphasis ~ "Implementation",
      date >= config$policy_timeline$engagement_deemphasis & date < config$policy_timeline$reversal_announcement ~ "Refinement",
      date >= config$policy_timeline$reversal_announcement ~ "Reversal"
    ),
    
    year_month = floor_date(date, "month"),
    week = floor_date(date, "week"),
    quarter = quarter(date, with_year = TRUE),
    
    total_engagement = statistics.reaction_count + statistics.share_count + statistics.comment_count,
    engagement_rate = if_else(statistics.views > 0, total_engagement / statistics.views, 0),
    
    log_views = log1p(statistics.views),
    log_reactions = log1p(statistics.reaction_count),
    log_shares = log1p(statistics.share_count),
    log_comments = log1p(statistics.comment_count),
    log_total_engagement = log1p(total_engagement)
  )

cat("✓ Created all analysis variables\n\n")

# ============================================================================
# STEP 9: ACCOUNT-LEVEL AGGREGATIONS
# ============================================================================

cat("STEP 9: Creating account-level dataset...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

accounts <- data %>%
  group_by(surface.id, surface.name, surface.username, main_list) %>%
  summarise(
    n_posts = n(),
    n_posts_pre = sum(policy_period == "Pre-Policy"),
    n_posts_post = sum(policy_period == "Post-Policy"),
    n_views_imputed = sum(views_imputed),
    pct_views_imputed = round(100 * n_views_imputed / n_posts, 2),
    first_post = min(date),
    last_post = max(date),
    days_active = as.numeric(difftime(last_post, first_post, units = "days")),
    sub_list = first(sub_list),
    total_views = sum(statistics.views, na.rm = TRUE),
    total_reactions = sum(statistics.reaction_count, na.rm = TRUE),
    total_shares = sum(statistics.share_count, na.rm = TRUE),
    total_comments = sum(statistics.comment_count, na.rm = TRUE),
    total_engagement = sum(total_engagement, na.rm = TRUE),
    avg_views = mean(statistics.views, na.rm = TRUE),
    avg_reactions = mean(statistics.reaction_count, na.rm = TRUE),
    avg_shares = mean(statistics.share_count, na.rm = TRUE),
    avg_comments = mean(statistics.comment_count, na.rm = TRUE),
    avg_engagement_rate = mean(engagement_rate, na.rm = TRUE),
    median_views = median(statistics.views, na.rm = TRUE),
    median_reactions = median(statistics.reaction_count, na.rm = TRUE),
    avg_views_pre = mean(statistics.views[policy_period == "Pre-Policy"], na.rm = TRUE),
    avg_views_post = mean(statistics.views[policy_period == "Post-Policy"], na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    reach_change = avg_views_post - avg_views_pre,
    reach_change_pct = if_else(avg_views_pre > 0,
                               100 * (avg_views_post - avg_views_pre) / avg_views_pre,
                               NA_real_),
    active_pre = n_posts_pre >= 10,
    active_post = n_posts_post >= 10,
    active_both = active_pre & active_post
  )

cat("✓ Created account-level summary\n")
cat("  Unique accounts:", nrow(accounts), "\n\n")

# FINAL CHECK: Verify no account appears in multiple rows
account_check <- accounts %>%
  group_by(surface.id) %>%
  filter(n() > 1) %>%
  ungroup()

if (nrow(account_check) > 0) {
  stop("FATAL: Account-level data has duplicate surface.ids!")
} else {
  cat("✓ VERIFIED: Each account appears exactly once in accounts dataset\n\n")
}

cat("Accounts by main list:\n")
print(accounts %>% count(main_list))
cat("\n")

# ---------------------------------------------------------------------------
# CREATE COMPREHENSIVE SURFACE INFO DATASET
# ---------------------------------------------------------------------------

cat("Creating comprehensive surface info dataset...\n\n")

# Extract all unique surface-level fields from the post data
# These are fields that describe the account/page itself, not individual posts

# First, identify all surface.* columns in the data
surface_columns <- names(data)[grepl("^surface\\.", names(data))]
cat("  Found", length(surface_columns), "surface-level columns:\n")
cat("   ", paste(surface_columns, collapse = ", "), "\n\n")

# Create surface info by taking the last non-NA value for each surface.id.
# Note: last() follows the current row order, so this is the most recent value
# only if `data` is sorted by date.
surface_info <- data %>%
  # Group by surface.id to get one row per account
  group_by(surface.id) %>%
  # For each surface field, take the last non-NA value (in row order)
  summarise(
    # Core identifiers
    surface.name = last(na.omit(surface.name)),
    surface.username = last(na.omit(surface.username)),
    
    # List assignments (should be unique after integrity checks)
    main_list = first(main_list),
    sub_list = first(sub_list),
    
    # Additional surface fields if they exist (dynamic)
    across(
      any_of(setdiff(surface_columns, c("surface.id", "surface.name", "surface.username"))),
      ~last(na.omit(.))
    ),
    
    # Computed metadata from posts
    n_posts_total = n(),
    n_posts_pre_policy = sum(policy_period == "Pre-Policy"),
    n_posts_post_policy = sum(policy_period == "Post-Policy"),
    
    # Temporal coverage
    first_post_date = min(date),
    last_post_date = max(date),
    days_active = as.numeric(difftime(max(date), min(date), units = "days")),
    
    # Activity spans policy change?
    spans_policy_change = (min(date) < config$policy_timeline$global_implementation) & 
                          (max(date) >= config$policy_timeline$global_implementation),
    
    # Engagement summary (lifetime)
    total_views = sum(statistics.views, na.rm = TRUE),
    total_reactions = sum(statistics.reaction_count, na.rm = TRUE),
    total_shares = sum(statistics.share_count, na.rm = TRUE),
    total_comments = sum(statistics.comment_count, na.rm = TRUE),
    total_engagement = sum(total_engagement, na.rm = TRUE),
    
    # Average engagement per post
    avg_views_per_post = mean(statistics.views, na.rm = TRUE),
    avg_reactions_per_post = mean(statistics.reaction_count, na.rm = TRUE),
    avg_shares_per_post = mean(statistics.share_count, na.rm = TRUE),
    avg_comments_per_post = mean(statistics.comment_count, na.rm = TRUE),
    
    # Median engagement (more robust)
    median_views_per_post = median(statistics.views, na.rm = TRUE),
    median_reactions_per_post = median(statistics.reaction_count, na.rm = TRUE),
    
    # Engagement rate
    avg_engagement_rate = mean(engagement_rate, na.rm = TRUE),
    
    # Pre vs Post policy comparison
    avg_views_pre_policy = mean(statistics.views[policy_period == "Pre-Policy"], na.rm = TRUE),
    avg_views_post_policy = mean(statistics.views[policy_period == "Post-Policy"], na.rm = TRUE),
    avg_reactions_pre_policy = mean(statistics.reaction_count[policy_period == "Pre-Policy"], na.rm = TRUE),
    avg_reactions_post_policy = mean(statistics.reaction_count[policy_period == "Post-Policy"], na.rm = TRUE),
    
    # Content type breakdown
    n_content_types = n_distinct(content_type),
    primary_content_type = names(sort(table(content_type), decreasing = TRUE))[1],
    
    # Data quality flags
    n_views_imputed = sum(views_imputed),
    pct_views_imputed = round(100 * sum(views_imputed) / n(), 2),
    
    .groups = "drop"
  ) %>%
  # Add computed change metrics
  mutate(
    reach_change = avg_views_post_policy - avg_views_pre_policy,
    reach_change_pct = if_else(
      !is.na(avg_views_pre_policy) & avg_views_pre_policy > 0,
      100 * (avg_views_post_policy - avg_views_pre_policy) / avg_views_pre_policy,
      NA_real_
    ),
    reactions_change = avg_reactions_post_policy - avg_reactions_pre_policy,
    reactions_change_pct = if_else(
      !is.na(avg_reactions_pre_policy) & avg_reactions_pre_policy > 0,
      100 * (avg_reactions_post_policy - avg_reactions_pre_policy) / avg_reactions_pre_policy,
      NA_real_
    )
  ) %>%
  # Reorder columns for clarity
  dplyr::select(
    # Identifiers first
    surface.id, surface.name, surface.username,
    # List assignments
    main_list, sub_list,
    # Any other surface.* columns
    starts_with("surface."),
    # Activity metrics
    n_posts_total, n_posts_pre_policy, n_posts_post_policy,
    first_post_date, last_post_date, days_active, spans_policy_change,
    # Engagement totals
    total_views, total_reactions, total_shares, total_comments, total_engagement,
    # Per-post averages
    avg_views_per_post, avg_reactions_per_post, avg_shares_per_post, avg_comments_per_post,
    median_views_per_post, median_reactions_per_post, avg_engagement_rate,
    # Pre/post comparison
    avg_views_pre_policy, avg_views_post_policy, reach_change, reach_change_pct,
    avg_reactions_pre_policy, avg_reactions_post_policy, reactions_change, reactions_change_pct,
    # Content info
    n_content_types, primary_content_type,
    # Data quality
    n_views_imputed, pct_views_imputed,
    # Anything else
    everything()
  )

cat("✓ Created surface info dataset\n")
cat("  Total surfaces:", nrow(surface_info), "\n")
cat("  Columns:", ncol(surface_info), "\n\n")

# Verify one-to-one mapping with main_list
surface_list_check <- surface_info %>%
  group_by(surface.id) %>%
  filter(n() > 1)

if (nrow(surface_list_check) > 0) {
  stop("FATAL: Surface info has duplicate surface.ids!")
} else {
  cat("✓ VERIFIED: Each surface appears exactly once in surface_info\n\n")
}

# Summary by main_list
cat("Surface info summary by main_list:\n")
surface_summary <- surface_info %>%
  group_by(main_list) %>%
  summarise(
    n_surfaces = n(),
    n_spanning_policy = sum(spans_policy_change),
    avg_posts = round(mean(n_posts_total), 1),
    avg_total_views = round(mean(total_views), 0),
    avg_reach_change_pct = round(mean(reach_change_pct, na.rm = TRUE), 1),
    .groups = "drop"
  )
print(surface_summary)
cat("\n")

# ---------------------------------------------------------------------------
# CREATE SURFACE ID LIST FOR MCL API QUERIES
# ---------------------------------------------------------------------------

cat("Creating surface ID list for MCL API queries...\n\n")

# Create a simple list of surface.ids for use with the MCL API script
surface_ids_for_api <- surface_info %>%
  dplyr::select(surface.id, surface.name, main_list) %>%
  arrange(main_list, surface.name)

cat("  Total surface IDs for API query:", nrow(surface_ids_for_api), "\n")
cat("  By main_list:\n")
print(surface_ids_for_api %>% count(main_list))
cat("\n")

accounts_both_periods <- accounts %>% filter(active_both)

cat("Accounts active in BOTH periods (≥10 posts each):\n")
print(accounts_both_periods %>% count(main_list))
cat("\n")

# ============================================================================
# STEP 10: DATA QUALITY CHECKS
# ============================================================================

cat("STEP 10: Final data quality checks...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

# Check for duplicates
duplicates <- data %>%
  group_by(id) %>%
  filter(n() > 1) %>%
  ungroup()

if (nrow(duplicates) > 0) {
  cat("⚠ WARNING: Removing", n_distinct(duplicates$id), "duplicate post IDs\n")
  data <- data %>% distinct(id, .keep_all = TRUE)
} else {
  cat("✓ No duplicate posts\n")
}

# Period coverage
cat("\nPosts per main list in each period:\n")
period_coverage <- data %>%
  count(main_list, policy_period) %>%
  pivot_wider(names_from = policy_period, values_from = n, values_fill = 0) %>%
  mutate(Total = rowSums(across(-main_list)))
print(period_coverage)
cat("\n")

# ============================================================================
# STEP 11: CREATE ANALYSIS-READY DATASETS
# ============================================================================

cat("STEP 11: Creating analysis-ready datasets...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

cleaned_data <- data

data_both_periods <- data %>%
  filter(surface.id %in% accounts_both_periods$surface.id)

# ---------------------------------------------------------------------------
# COMPLETE WEEKS FILTERING FOR WEEKLY AGGREGATION
# ---------------------------------------------------------------------------

cat("--- Complete Weeks Filtering ---\n\n")

# Calculate complete week boundaries
first_complete_week <- get_first_complete_week_start(study_start_date)
last_complete_week_end <- get_last_complete_week_end(study_end_date)
last_complete_week_start <- last_complete_week_end - days(6)
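
# Optional sanity check (a hedged sketch; check_complete_week_window() is a
# hypothetical helper, not part of the original pipeline). It asserts only that
# the complete-week window lies inside the study time-frame and spans a whole
# number of 7-day blocks. It does not verify which weekday a week starts on.
check_complete_week_window <- function(first_start, last_end,
                                       study_start, study_end) {
  first_start <- as.Date(first_start)
  last_end    <- as.Date(last_end)
  study_start <- as.Date(study_start)
  study_end   <- as.Date(study_end)
  # Inclusive number of days covered by the complete-week window
  span_days <- as.integer(last_end - first_start) + 1L
  stopifnot(
    first_start >= study_start,
    last_end <= study_end,
    first_start <= last_end,
    span_days %% 7L == 0L
  )
  # Return the number of calendar weeks in the window, invisibly
  invisible(span_days %/% 7L)
}
# Usage within this script (uncomment to enable):
# check_complete_week_window(first_complete_week, last_complete_week_end,
#                            study_start_date, study_end_date)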

cat("Study time-frame:\n")
cat("  Start date:", format(study_start_date, "%Y-%m-%d"), 
    "(", weekdays(study_start_date), ")\n")
cat("  End date:  ", format(study_end_date, "%Y-%m-%d"), 
    "(", weekdays(study_end_date), ")\n\n")

cat("Complete weeks boundaries:\n")
cat("  First complete week starts:", format(first_complete_week, "%Y-%m-%d"), 
    "(", weekdays(first_complete_week), ")\n")
cat("  Last complete week ends:   ", format(last_complete_week_end, "%Y-%m-%d"), 
    "(", weekdays(last_complete_week_end), ")\n\n")

# Calculate how many weeks are excluded
all_weeks <- data %>%
  distinct(week) %>%
  arrange(week)

complete_weeks <- all_weeks %>%
  filter(week >= first_complete_week & week <= last_complete_week_start)

incomplete_weeks_start <- all_weeks %>%
  filter(week < first_complete_week)

incomplete_weeks_end <- all_weeks %>%
  filter(week > last_complete_week_start)

n_total_weeks <- nrow(all_weeks)
n_complete_weeks <- nrow(complete_weeks)
n_incomplete_start <- nrow(incomplete_weeks_start)
n_incomplete_end <- nrow(incomplete_weeks_end)

cat("Week counts:\n")
cat("  Total unique weeks in data:", n_total_weeks, "\n")
cat("  Complete weeks:            ", n_complete_weeks, "\n")
cat("  Partial weeks at start:    ", n_incomplete_start, "\n")
cat("  Partial weeks at end:      ", n_incomplete_end, "\n\n")

if (n_incomplete_start > 0) {
  cat("Partial weeks at START (excluded from weekly_aggregation):\n")
  for (i in seq_len(nrow(incomplete_weeks_start))) {
    w <- incomplete_weeks_start$week[i]
    w_end <- w + days(6)
    posts_in_week <- sum(data$week == w, na.rm = TRUE)
    cat(sprintf("  %s to %s (%d posts)\n", 
                format(w, "%Y-%m-%d"), format(w_end, "%Y-%m-%d"), posts_in_week))
  }
  cat("\n")
}

if (n_incomplete_end > 0) {
  cat("Partial weeks at END (excluded from weekly_aggregation):\n")
  for (i in seq_len(nrow(incomplete_weeks_end))) {
    w <- incomplete_weeks_end$week[i]
    w_end <- w + days(6)
    posts_in_week <- sum(data$week == w, na.rm = TRUE)
    cat(sprintf("  %s to %s (%d posts)\n", 
                format(w, "%Y-%m-%d"), format(w_end, "%Y-%m-%d"), posts_in_week))
  }
  cat("\n")
}

# Count posts excluded from weekly aggregation
posts_in_incomplete_weeks <- data %>%
  filter(week < first_complete_week | week > last_complete_week_start) %>%
  nrow()

cat("Posts in partial weeks:", posts_in_incomplete_weeks, 
    sprintf("(%.2f%% of total)\n", 100 * posts_in_incomplete_weeks / nrow(data)))
cat("These posts remain in cleaned_posts but are excluded from weekly_aggregation\n\n")

# Create weekly data with ONLY complete weeks
weekly_data <- data %>%
  # Filter to complete weeks only
  filter(week >= first_complete_week & week <= last_complete_week_start) %>%
  group_by(week, main_list) %>%
  summarise(
    n_posts = n(),
    n_accounts = n_distinct(surface.id),
    n_imputed = sum(views_imputed),
    pct_imputed = round(100 * n_imputed / n_posts, 2),
    total_views = sum(statistics.views, na.rm = TRUE),
    avg_views = mean(statistics.views, na.rm = TRUE),
    median_views = median(statistics.views, na.rm = TRUE),
    avg_reactions = mean(statistics.reaction_count, na.rm = TRUE),
    avg_shares = mean(statistics.share_count, na.rm = TRUE),
    avg_comments = mean(statistics.comment_count, na.rm = TRUE),
    avg_engagement_rate = mean(engagement_rate, na.rm = TRUE),
    policy_period = first(policy_period),
    policy_phase = first(policy_phase),
    .groups = "drop"
  ) %>%
  # Add flag indicating this is a complete week
  mutate(
    week_end = week + days(6),
    is_complete_week = TRUE
  )

cat("✓ Created weekly aggregation with COMPLETE WEEKS ONLY\n")
cat("  Weekly data rows:", nrow(weekly_data), "\n")
cat("  Date range:", format(min(weekly_data$week), "%Y-%m-%d"), "to", 
    format(max(weekly_data$week_end), "%Y-%m-%d"), "\n")
cat("  Unique complete weeks:", n_distinct(weekly_data$week), "\n\n")
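
# Optional check (a hedged sketch; find_mixed_period_weeks() is a hypothetical
# helper, not part of the original pipeline). The weekly aggregation labels each
# week with first(policy_period), which assumes all posts in a week share one
# period. This helper lists weeks where that assumption fails, for example a
# week that straddles a policy date.
find_mixed_period_weeks <- function(week, period) {
  # Count distinct period labels per week; as.character() turns Date keys
  # into readable "YYYY-MM-DD" group names
  n_periods <- tapply(as.character(period), as.character(week),
                      function(p) length(unique(p)))
  names(n_periods)[n_periods > 1]
}
# Usage within this script (uncomment to enable):
# mixed_weeks <- find_mixed_period_weeks(data$week, data$policy_period)
# if (length(mixed_weeks) > 0) {
#   cat("Weeks spanning more than one policy_period:", mixed_weeks, "\n")
# }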

# Monthly aggregation (no filtering needed: the study period starts on the first
# day of a month and ends on the last day of a month, so every month is complete)
monthly_data <- data %>%
  group_by(year_month, main_list) %>%
  summarise(
    n_posts = n(),
    n_accounts = n_distinct(surface.id),
    n_imputed = sum(views_imputed),
    pct_imputed = round(100 * n_imputed / n_posts, 2),
    total_views = sum(statistics.views, na.rm = TRUE),
    avg_views = mean(statistics.views, na.rm = TRUE),
    median_views = median(statistics.views, na.rm = TRUE),
    avg_reactions = mean(statistics.reaction_count, na.rm = TRUE),
    avg_shares = mean(statistics.share_count, na.rm = TRUE),
    avg_comments = mean(statistics.comment_count, na.rm = TRUE),
    avg_engagement_rate = mean(engagement_rate, na.rm = TRUE),
    policy_period = first(policy_period),
    policy_phase = first(policy_phase),
    .groups = "drop"
  )

account_period_data <- data %>%
  group_by(surface.id, surface.name, main_list, policy_period) %>%
  summarise(
    n_posts = n(),
    n_views_imputed = sum(views_imputed),
    pct_views_imputed = round(100 * n_views_imputed / n_posts, 2),
    total_views = sum(statistics.views, na.rm = TRUE),
    avg_views = mean(statistics.views, na.rm = TRUE),
    median_views = median(statistics.views, na.rm = TRUE),
    avg_reactions = mean(statistics.reaction_count, na.rm = TRUE),
    avg_shares = mean(statistics.share_count, na.rm = TRUE),
    avg_comments = mean(statistics.comment_count, na.rm = TRUE),
    avg_engagement_rate = mean(engagement_rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = policy_period,
    values_from = c(n_posts, n_views_imputed, pct_views_imputed, total_views,
                    avg_views, median_views, avg_reactions, 
                    avg_shares, avg_comments, avg_engagement_rate),
    names_sep = "_"
  )

cat("✓ Created 6 analysis-ready datasets:\n")
cat("  1. cleaned_data:", nrow(cleaned_data), "posts\n")
cat("  2. data_both_periods:", nrow(data_both_periods), "posts\n")
cat("  3. weekly_data:", nrow(weekly_data), "rows (COMPLETE WEEKS ONLY)\n")
cat("  4. monthly_data:", nrow(monthly_data), "rows\n")
cat("  5. account_period_data:", nrow(account_period_data), "accounts\n")
cat("  6. surface_info:", nrow(surface_info), "surfaces (comprehensive account metadata)\n\n")

# ============================================================================
# STEP 12: SAVE CLEANED DATASETS
# ============================================================================

cat("STEP 12: Saving cleaned datasets...\n")
cat("-" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

if (!dir.exists("cleaned_data")) {
  dir.create("cleaned_data")
}

timestamp <- format(Sys.time(), "%Y%m%d_%H%M%S")

saveRDS(cleaned_data, paste0("cleaned_data/cleaned_posts_", timestamp, ".rds"))
saveRDS(accounts, paste0("cleaned_data/accounts_summary_", timestamp, ".rds"))
saveRDS(accounts_both_periods, paste0("cleaned_data/accounts_both_periods_", timestamp, ".rds"))
saveRDS(data_both_periods, paste0("cleaned_data/posts_both_periods_", timestamp, ".rds"))
saveRDS(weekly_data, paste0("cleaned_data/weekly_aggregation_", timestamp, ".rds"))
saveRDS(monthly_data, paste0("cleaned_data/monthly_aggregation_", timestamp, ".rds"))
saveRDS(account_period_data, paste0("cleaned_data/account_period_", timestamp, ".rds"))

# Save surface info dataset (comprehensive account metadata)
saveRDS(surface_info, paste0("cleaned_data/surface_info_", timestamp, ".rds"))
cat("  ✓ surface_info_", timestamp, ".rds\n", sep = "")

# Save surface info as CSV for easy viewing
write.csv(surface_info, 
          paste0("cleaned_data/surface_info_", timestamp, ".csv"),
          row.names = FALSE, fileEncoding = "UTF-8")
cat("  ✓ surface_info_", timestamp, ".csv\n", sep = "")

# Save surface IDs for MCL API queries (multiple formats for convenience)
# Format 1: Simple text file with one ID per line
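# as.character() guards against a non-character ID column, which writeLines() rejects
writeLines(as.character(surface_ids_for_api$surface.id), 
           paste0("cleaned_data/surface_ids_for_api_", timestamp, ".txt"))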
cat("  ✓ surface_ids_for_api_", timestamp, ".txt (one ID per line)\n", sep = "")

# Format 2: CSV with ID, name, and list for reference
write.csv(surface_ids_for_api,
          paste0("cleaned_data/surface_ids_for_api_", timestamp, ".csv"),
          row.names = FALSE, fileEncoding = "UTF-8")
cat("  ✓ surface_ids_for_api_", timestamp, ".csv\n", sep = "")

# Format 3: R vector format (can be copy-pasted into MCL API script)
r_vector_file <- paste0("cleaned_data/surface_ids_r_vector_", timestamp, ".R")
r_vector_content <- paste0(
  "# Surface IDs for MCL API script\n",
  "# Generated: ", Sys.time(), "\n",
  "# Total IDs: ", nrow(surface_ids_for_api), "\n",
  "# Copy this vector to the 'account_ids' parameter in the MCL API script\n\n",
  "account_ids <- c(\n",
  paste0('  "', surface_ids_for_api$surface.id, '"', collapse = ",\n"),
  "\n)\n"
)
writeLines(r_vector_content, r_vector_file)
cat("  ✓ surface_ids_r_vector_", timestamp, ".R (copy-paste ready)\n", sep = "")

# Format 4: By main_list (separate files for targeted API queries)
for (list_name in unique(surface_ids_for_api$main_list)) {
  list_ids <- surface_ids_for_api %>% 
    filter(main_list == list_name) %>% 
    pull(surface.id)
  
  safe_list_name <- gsub("[^a-zA-Z0-9_]", "_", list_name)
  list_file <- paste0("cleaned_data/surface_ids_", safe_list_name, "_", timestamp, ".txt")
  writeLines(as.character(list_ids), list_file)
  cat("  ✓ surface_ids_", safe_list_name, "_", timestamp, ".txt (", 
      length(list_ids), " IDs)\n", sep = "")
}

# Save metadata
mp_groups <- unique(data$main_list[grepl("^MPs", data$main_list)])
total_views_imputed <- sum(data$views_imputed)
pct_views_imputed <- round(100 * total_views_imputed / nrow(data), 2)

# Calculate actual imputed values summary for metadata
imputed_values_in_data <- data$statistics.views[data$views_imputed]

metadata <- list(
  timestamp = Sys.time(),
  source_file = basename(data_file),
  script_version = "v3.2_group_specific_imputation_complete_weeks",
  dataset_version = if (length(mp_groups) > 1) "v3.2_mp_split" else "v2_or_v3.1_single_mp",
  mp_groups = mp_groups,
  all_main_lists = unique(data$main_list),
  n_posts_original = n_before,
  n_posts_cleaned = nrow(cleaned_data),
  n_removed = n_removed,
  n_accounts = nrow(accounts),
  n_accounts_both_periods = nrow(accounts_both_periods),
  n_surfaces = nrow(surface_info),
  date_range = date_range,
  policy_dates = policy_dates,
  
  study_timeframe = list(
    start_date = study_start_date,
    end_date = study_end_date,
    description = "Posts outside this range were excluded from analysis"
  ),
  
  complete_weeks = list(
    enabled = TRUE,
    first_complete_week_start = first_complete_week,
    last_complete_week_end = last_complete_week_end,
    n_total_weeks = n_total_weeks,
    n_complete_weeks = n_complete_weeks,
    n_partial_weeks_start = n_incomplete_start,
    n_partial_weeks_end = n_incomplete_end,
    posts_in_partial_weeks = posts_in_incomplete_weeks,
    description = "Weekly aggregation only includes complete weeks (Sunday to Saturday)"
  ),
  
  integrity_checks = list(
    multi_list_accounts_found = !is.null(integrity_errors$multi_list_accounts),
    duplicate_posts_found = !is.null(integrity_errors$duplicate_posts),
    resolution_applied = if (has_critical_errors) multi_list_resolution else "none_needed",
    warnings = names(integrity_warnings)
  ),
  
  na_handling = list(
    views = list(
      method = "Group-specific ratio-weighted",
      description = "Power law fit to each group's near-threshold distribution (101-500 views)",
      configuration = list(
        bin_width = IMPUTATION_BIN_WIDTH,
        max_view = IMPUTATION_MAX_VIEW,
        min_bins = IMPUTATION_MIN_BINS,
        min_posts = IMPUTATION_MIN_POSTS,
        fallback_ratio = POOLED_FALLBACK_RATIO
      ),
      seed = 42,
      n_imputed = total_views_imputed,
      pct_imputed = pct_views_imputed,
      group_parameters = group_imputation_params,
      imputation_log = imputation_log,
      imputed_summary = list(
        mean = round(mean(imputed_values_in_data), 1),
        median = median(imputed_values_in_data),
        sd = round(sd(imputed_values_in_data), 1),
        # Share of imputed values at or below 50 views (the comparison is <=, not <)
        pct_under_50 = round(100 * mean(imputed_values_in_data <= 50), 1)
      ),
      sensitivity = "Max 0.002% difference between pooled and group-specific",
      by_group = na_views_by_group
    ),
    engagement = list(
      method = "REMOVAL",
      n_posts_removed = n_engagement_removed,
      by_group = na_engagement_by_group
    )
  ),
  
  surface_info = list(
    description = "Comprehensive account/surface metadata extracted from posts",
    n_surfaces = nrow(surface_info),
    columns = names(surface_info),
    by_main_list = as.data.frame(surface_info %>% count(main_list)),
    api_query_files = c(
      paste0("surface_ids_for_api_", timestamp, ".txt"),
      paste0("surface_ids_for_api_", timestamp, ".csv"),
      paste0("surface_ids_r_vector_", timestamp, ".R")
    )
  ),
  
  data_flags = c("views_imputed", "views_pre_2017", "views_na_reason")
)

saveRDS(metadata, paste0("cleaned_data/metadata_", timestamp, ".rds"))

cat("✓ Saved all datasets to cleaned_data/\n\n")

# ============================================================================
# STEP 13: SUMMARY REPORT
# ============================================================================

cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n")
cat("DATA VALIDATION AND CLEANING SUMMARY (v3.2 - GROUP-SPECIFIC IMPUTATION)\n")
cat("WITH COMPLETE WEEKS FILTERING\n")
cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n\n")

cat("STUDY TIME-FRAME:\n")
cat("  Start date:", format(study_start_date, "%Y-%m-%d"), "\n")
cat("  End date:  ", format(study_end_date, "%Y-%m-%d"), "\n")
cat("  Posts outside this range were excluded\n\n")

cat("COMPLETE WEEKS FILTERING:\n")
cat("  First complete week:", format(first_complete_week, "%Y-%m-%d"), "to", 
    format(first_complete_week + days(6), "%Y-%m-%d"), "\n")
cat("  Last complete week: ", format(last_complete_week_start, "%Y-%m-%d"), "to", 
    format(last_complete_week_end, "%Y-%m-%d"), "\n")
cat("  Total complete weeks:", n_complete_weeks, "\n")
cat("  Partial weeks excluded:", n_incomplete_start + n_incomplete_end, "\n")
cat("  Posts in partial weeks:", posts_in_incomplete_weeks, 
    sprintf("(%.2f%%)\n", 100 * posts_in_incomplete_weeks / nrow(cleaned_data)))
cat("  NOTE: Partial week posts remain in cleaned_posts but excluded from weekly_aggregation\n\n")

cat("VIEW IMPUTATION METHOD (GROUP-SPECIFIC):\n")
cat("  Method: Group-specific ratio-weighted based on power law extrapolation\n")
cat("  Each group's parameters derived from its own near-threshold distribution\n\n")
cat("  Group-specific parameters:\n")
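for (g in names(group_imputation_params)) {
  p <- group_imputation_params[[g]]
  # alpha may be NA; sprintf() prints NA values as "NA", so no ifelse() is needed.
  # The median is rounded for display because %d rejects non-integer doubles.
  cat(sprintf("    %s: α=%.3f, ratio=%.2f → mean=%.1f, median=%d\n",
              g, p$alpha, p$ratio, 
              p$expected_mean, as.integer(round(p$expected_median))))
}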
cat("\n")
cat("  Total imputed:", total_views_imputed, sprintf("(%.1f%%)\n", pct_views_imputed))
cat("  Sensitivity: Max 0.002% difference between pooled and group-specific\n\n")

cat("MERGE INTEGRITY:\n")
if (has_critical_errors) {
  cat("  ⚠ Issues found and resolved\n")
  if (!is.null(integrity_errors$multi_list_accounts)) {
    cat("    - Multi-list accounts: ", nrow(integrity_errors$multi_list_accounts), 
        " (resolution: ", multi_list_resolution, ")\n", sep = "")
  }
  if (!is.null(integrity_errors$duplicate_posts)) {
    cat("    - Duplicate posts: ", n_distinct(integrity_errors$duplicate_posts$id), "\n", sep = "")
  }
} else {
  cat("  ✓ All checks passed - no issues found\n")
}
cat("\n")

cat("DATA CLEANING:\n")
cat("  Original posts:", n_before, "\n")
cat("  Posts removed (invalid + NA engagement):", n_removed, "\n")
cat("  Final posts:", nrow(cleaned_data), "\n")
cat("  Unique accounts:", nrow(accounts), "\n")
cat("  Accounts in both periods:", nrow(accounts_both_periods), "\n")
cat("  Surface info records:", nrow(surface_info), "\n\n")

cat("KEY VERIFICATION:\n")
cat("  ✓ Each surface.id assigned to exactly ONE main_list\n")
cat("  ✓ No duplicate post IDs\n")
cat("  ✓ All posts within study time-frame (", format(study_start_date, "%Y-%m-%d"), 
    " to ", format(study_end_date, "%Y-%m-%d"), ")\n", sep = "")
cat("  ✓ All engagement metrics are non-NA\n")
cat("  ✓ Surface info dataset has one row per account\n")
cat("  ✓ View imputation uses group-specific ratio-weighted method\n")
cat("  ✓ Weekly aggregation includes only COMPLETE WEEKS\n\n")

cat("METHODS SECTION TEXT:\n")
cat("-" %>% rep(60) %>% paste0(collapse = ""), "\n")
cat("
View counts at or below 100 are censored in the Meta Content Library API 
and returned as missing values. We imputed these censored values using a 
group-specific ratio-weighted approach. For each political group, we fit 
a power law model to bin counts in the 101-500 view range and extrapolated 
to estimate the distribution below the censoring threshold. This yielded 
group-specific parameters reflecting meaningful differences in how content 
falls below the visibility threshold. Extremists' censored posts were 
concentrated at very low view counts (ratio ~1.65), while Prominent 
Politicians' censored posts were concentrated closer to the 100-view 
threshold (ratio ~0.66). MPs fell between these extremes (ratio ~0.82-0.94). 
Sensitivity analysis showed that the maximum difference between the 
group-specific and pooled imputation approaches was only 0.002%, indicating 
that results are robust to the choice between the two.

For time series analysis, we aggregated posts into weekly intervals. To ensure 
consistent temporal comparison, only complete weeks (Sunday through Saturday) 
were included in the weekly aggregation. Partial weeks at the beginning and 
end of the study period were excluded from the weekly time series but remained 
in the post-level dataset for other analyses.
\n")
cat("-" %>% rep(60) %>% paste0(collapse = ""), "\n\n")
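
# Optional cross-check (a hedged sketch; summarise_group_ratios() is a
# hypothetical helper, not part of the original pipeline). The methods text
# above quotes fixed ratios, while group_imputation_params is derived from the
# data on each run. Printing the fitted ratios next to the text makes it easy
# to confirm the quoted values still match. Assumes each element of the
# parameter list carries a numeric $ratio, as used in the summary loop above.
summarise_group_ratios <- function(params) {
  ratios <- vapply(params, function(p) as.numeric(p$ratio), numeric(1))
  sprintf("%s (ratio ~%.2f)", names(ratios), ratios)
}
# Usage within this script (uncomment to enable):
# cat("Fitted ratios this run:",
#     paste(summarise_group_ratios(group_imputation_params), collapse = "; "), "\n\n")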

cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n")
cat("✓ DATA VALIDATION AND CLEANING COMPLETE (v3.2 - COMPLETE WEEKS)\n")
cat("=" %>% rep(80) %>% paste0(collapse = ""), "\n\n")
