## Extracting and Cleaning Contact Data for Higher-Order Networks

In addition to the datasets discussed earlier, an important class of higher-order networks arises from **proximity and face-to-face contact data**. Such data are typically collected with RFID sensors or Bluetooth devices and record **timestamped pairwise interactions** between individuals. Unlike datasets in which groups are defined explicitly (e.g., emails or biochemical compounds), contact data contain no group labels, so building a hypergraph from them requires methodological choices about how to infer **group interactions** and higher-order structure.

The typical workflow to extract a higher-order representation from contact data involves two main steps:

1. **Temporal aggregation:**  
   Interactions are aggregated over time windows of fixed length $\Delta t$. For each interval $[t, t + \Delta t]$, a snapshot $W(t)$ is created, encoding all pairwise interactions within that window.

2. **Clique-based hyperedge construction:**  
   Within each snapshot $W(t)$, maximal cliques are identified and used to construct **hyperedges** that represent higher-order co-presence among individuals.
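
The two steps above can be sketched on toy data as follows. This is a minimal illustration, not the implementation behind `extract_networks` or `extract_cliques`: the function name `snapshot_cliques`, the `(t, i, j)` contact format, and the half-open windows $[k\Delta t, (k+1)\Delta t)$ are assumptions made for the example. Maximal cliques are found with `networkx.find_cliques`.

```python
from collections import defaultdict

import networkx as nx


def snapshot_cliques(contacts, dt):
    """Return {window index: list of maximal cliques (as frozensets)}.

    `contacts` is an iterable of (timestamp_in_seconds, i, j) tuples
    (a hypothetical input format); `dt` is the window length in seconds.
    """
    # Step 1: temporal aggregation. Window k collects every contact whose
    # timestamp falls in [k * dt, (k + 1) * dt).
    edges_by_window = defaultdict(list)
    for t, i, j in contacts:
        if i != j:  # ignore self-contacts
            edges_by_window[int(t // dt)].append((i, j))

    # Step 2: maximal cliques of each snapshot graph become hyperedges.
    cliques_by_window = {}
    for k, edges in sorted(edges_by_window.items()):
        snapshot = nx.Graph()
        snapshot.add_edges_from(edges)  # repeated contacts collapse to one edge
        cliques_by_window[k] = [frozenset(c) for c in nx.find_cliques(snapshot)]
    return cliques_by_window


# Hypothetical toy contacts: (timestamp in seconds, person, person)
toy_contacts = [(0, "a", "b"), (60, "b", "c"), (400, "a", "c"), (650, "a", "b")]

fine = snapshot_cliques(toy_contacts, dt=300)    # 5-minute windows
coarse = snapshot_cliques(toy_contacts, dt=600)  # 10-minute windows

print(fine)
print(coarse)
```

With 5-minute windows the three pairs `a-b`, `b-c`, and `a-c` fall into different snapshots and remain pairwise, whereas the 10-minute window places them in the same snapshot and yields the single three-node clique `{a, b, c}`. This is how the choice of $\Delta t$ shapes the inferred group structure.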

To focus on the most significant interactions, we assign to each hyperedge $E_i$ a **weight** $\Omega_{E_i}$ corresponding to its frequency across all snapshots. Filtering strategies can then be applied, for example:

- Removing the bottom 10% of hyperedges by weight $\Omega_{E_i}$, or  
- Setting a threshold $C$ and retaining only hyperedges with $\Omega_{E_i} \geq C$.  
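
The weighting and the two filters can be sketched as below. The helper names (`clique_frequencies`, `filter_by_threshold`, `drop_lowest_fraction`) and the toy input are hypothetical and are not taken from `contact_data_process_func`. The threshold comparison is written as inclusive (`>=`), matching the comment in the extraction loop further down; replace it with `>` if a strict threshold is intended.

```python
from collections import Counter


def clique_frequencies(cliques_by_window):
    """Weight of a hyperedge = number of snapshots in which it appears."""
    weights = Counter()
    for cliques in cliques_by_window.values():
        weights.update(set(cliques))  # count each clique once per snapshot
    return weights


def filter_by_threshold(weights, c):
    """Keep hyperedges whose weight is at least c (inclusive, by assumption)."""
    return {edge: w for edge, w in weights.items() if w >= c}


def drop_lowest_fraction(weights, fraction=0.10):
    """Drop the lowest-weight `fraction` of hyperedges.

    Ties at the cut-off are broken arbitrarily (by insertion order).
    """
    ranked = sorted(weights.items(), key=lambda item: item[1])
    n_drop = int(len(ranked) * fraction)
    return dict(ranked[n_drop:])


# Hypothetical per-snapshot cliques, e.g. produced by a clique-extraction step
toy_cliques = {
    0: [frozenset("ab"), frozenset("bc")],
    1: [frozenset("abc")],
    2: [frozenset("ab")],
    3: [frozenset("ab"), frozenset("cd")],
}

weights = clique_frequencies(toy_cliques)
kept_by_threshold = filter_by_threshold(weights, c=2)
kept_by_rank = drop_lowest_fraction(weights, fraction=0.25)

print(weights)
print(kept_by_threshold)
print(kept_by_rank)
```

The keys of the filtered dictionary are the hyperedges of the static hypergraph. In this toy case only `{a, b}` survives the threshold `c=2`, so raising $C$ quickly prunes rare, mostly larger, groups.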

After filtering, the remaining interactions are aggregated into a **single static hypergraph**. The resulting hypergraph exhibits varying levels of **inter-order overlap**, which depend on both the chosen time window $\Delta t$ and the filtering threshold $C$.

To illustrate how different parameters affect the resulting higher-order structure, we apply the **maximal-clique-with-overlap method** to real-world proximity data from the **Copenhagen Networks Study**. This dataset provides **high-resolution Bluetooth proximity data** among university students (reference: https://www.nature.com/articles/s41597-019-0325-x).

We construct hypergraphs by varying:

- The **time window** $\Delta t \in \{5, 10, 15, 30\}$ minutes, and  
- The **minimum contact threshold** $C \in \{1, 3, 5, 10\}$,  

and compute the corresponding **inter-order overlap matrix** $A$ for each parameter combination.


In [3]:
from overlap_func import * 
import numpy as np
import networkx as nx
import pandas as pd
import random
import collections
import matplotlib.pyplot as plt
import os
import json
from time import time
from contact_data_process_func import *

In [5]:
# ------------------------------------------------------------
# Parameters and Directories
# ------------------------------------------------------------

dataset_dir = 'contact_data_raw/'       # Directory containing the raw contact (pairwise interaction) data
out_dir = 'processed_contact_data/'     # Directory to save the processed clique data
days = 10                               # Number of days of data to process
Tmax = days * 24 * 3600                 # Maximum timestamp (in seconds) for analysis period

dataset = 't_edges_dbm74'               # Name of the dataset file to be processed

# Define time windows Δt (in minutes) for temporal aggregation
time_windows = [5, 10, 15, 30]

# Define minimum frequency thresholds C for filtering hyperedges
thrs = [1, 3, 5, 10]


# ------------------------------------------------------------
# Extraction Pipeline
# ------------------------------------------------------------

# Loop through each time window (Δt) and threshold (C)
for n_minutes in time_windows:
    for thr in thrs:
        print(f'Processing time window: {n_minutes} min | Threshold: {thr}')
        
        # Step 1: Aggregate pairwise interactions into temporal snapshots
        # aggs holds one interaction network per Δt window
        # (Steps 1-3 do not take thr as an argument; they are simply repeated for each threshold)
        aggs = extract_networks(
            dataset_dir, dataset, n_minutes, 
            original_nets=False, tmax=Tmax
        )
        
        # Step 2: Extract maximal cliques from each temporal snapshot
        # These represent potential higher-order co-presence groups
        cliques = extract_cliques(aggs)
        
        # Step 3: Compute frequency (weights) of each clique across time windows
        ws = clique_weights(cliques)
        
        # Step 4: Save weighted cliques, filtering by threshold 'thr'
        # Only cliques with weight ≥ thr are retained
        save_cliques(ws, out_dir, dataset, n_minutes, thr=thr)
        
        # --------------------------------------------------------
        # Optional (commented-out exploratory steps):
        # --------------------------------------------------------
        # Inspect clique weights:
        # print(ws)
        
        # Filter non-maximal cliques if desired:
        # maximal_cliques = clean_non_maximal(ws)
        # print(dataset, thrs)


Processing time window: 5 min | Threshold: 1
last time 2418900.0
Processing time window: 5 min | Threshold: 3
last time 2418900.0
Processing time window: 5 min | Threshold: 5
last time 2418900.0
Processing time window: 5 min | Threshold: 10
last time 2418900.0
Processing time window: 10 min | Threshold: 1
last time 2418600.0
Processing time window: 10 min | Threshold: 3
last time 2418600.0
Processing time window: 10 min | Threshold: 5
last time 2418600.0
Processing time window: 10 min | Threshold: 10
last time 2418600.0
Processing time window: 15 min | Threshold: 1
last time 2418300.0
Processing time window: 15 min | Threshold: 3
last time 2418300.0
Processing time window: 15 min | Threshold: 5
last time 2418300.0
Processing time window: 15 min | Threshold: 10
last time 2418300.0
Processing time window: 30 min | Threshold: 1
last time 2417400.0
Processing time window: 30 min | Threshold: 3
last time 2417400.0
Processing time window: 30 min | Threshold: 5
last time 2417400.0
Processing 