# Homework 3: Mining Data Streams

Implementation of the TRIÈST algorithms (BASE and IMPR) as described in the provided paper "TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size".

## Overview of the Approach

Following the paper, we implement two of its algorithms:

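TRIÈST-BASE: Uses standard reservoir sampling to maintain a sample of at most M edges. Whenever an edge enters or leaves the sample, the global and local triangle counters are incremented or decremented by the number of triangles that edge closes within the sample. The final estimate rescales the sampled count by a factor that depends on t and M. It is unbiased but has higher variance than TRIÈST-IMPR.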
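
The BASE update rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `src/TriestBase.py`: the class name `BaseSketch` and its attributes are hypothetical, only the global counter is tracked, and the stream is assumed to contain each undirected edge once. The scaling factor is the one given in the paper, max(1, t(t-1)(t-2) / (M(M-1)(M-2))).

```python
import random
from collections import defaultdict
from itertools import combinations


class BaseSketch:
    """Illustrative TRIEST-BASE global counter (hypothetical, not src.TriestBase)."""

    def __init__(self, m, seed=None):
        self.m = m                    # reservoir capacity M (assumed >= 3)
        self.t = 0                    # number of edges seen so far
        self.tau = 0                  # triangles currently inside the sample
        self.sample = set()           # sampled edges stored as (u, v) with u < v
        self.adj = defaultdict(set)   # adjacency lists of the sampled subgraph
        self.rng = random.Random(seed)

    def _update(self, u, v, delta):
        # Every common neighbour of u and v in the sample forms one triangle.
        self.tau += delta * len(self.adj[u] & self.adj[v])

    def _add(self, u, v):
        self.sample.add((u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)
        self._update(u, v, +1)

    def _remove(self, u, v):
        self.sample.discard((u, v))
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self._update(u, v, -1)

    def process_edge(self, u, v):
        self.t += 1
        if self.t <= self.m:
            self._add(u, v)
        elif self.rng.random() < self.m / self.t:
            # Evict a uniformly random sampled edge, then insert the new one.
            # Building a tuple is O(M) per eviction; acceptable for a sketch.
            old = self.rng.choice(tuple(self.sample))
            self._remove(*old)
            self._add(u, v)

    def estimate(self):
        t, m = self.t, self.m
        xi = max(1.0, t * (t - 1) * (t - 2) / (m * (m - 1) * (m - 2)))
        return xi * self.tau


# Usage: with M larger than the stream, every edge is kept and the count is exact.
sketch = BaseSketch(m=100, seed=0)
for u, v in combinations(range(5), 2):   # the 10 edges of a 5-clique
    sketch.process_edge(u, v)
print(sketch.estimate())
```

Note that the counter is decremented on eviction as well as incremented on insertion; omitting the decrement leaves the sampled count too high and inflates the rescaled estimate.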

TRIÈST-IMPR: An improved version that updates the counters unconditionally for every incoming edge, before the sampling decision, using weighted increments, and never decrements them when an edge is evicted from the sample. It yields estimates with lower variance.
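
The IMPR variant can be sketched in the same spirit. Again this is an illustration under assumptions rather than the code in `src/TriestImpr.py`: `ImprSketch` is a hypothetical name, only the global counter is kept, and each undirected edge is assumed to arrive once. The weight used is the one from the paper, max(1, (t-1)(t-2) / (M(M-1))).

```python
import random
from collections import defaultdict
from itertools import combinations


class ImprSketch:
    """Illustrative TRIEST-IMPR global counter (hypothetical, not src.TriestImpr)."""

    def __init__(self, m, seed=None):
        self.m = m                    # reservoir capacity M (assumed >= 2)
        self.t = 0                    # number of edges seen so far
        self.tau = 0.0                # running weighted triangle estimate
        self.sample = set()           # sampled edges stored as (u, v) with u < v
        self.adj = defaultdict(set)   # adjacency lists of the sampled subgraph
        self.rng = random.Random(seed)

    def _add(self, u, v):
        self.sample.add((u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def process_edge(self, u, v):
        self.t += 1
        t, m = self.t, self.m
        # The weighted increment happens first, whether or not the edge is kept.
        eta = max(1.0, (t - 1) * (t - 2) / (m * (m - 1)))
        self.tau += eta * len(self.adj[u] & self.adj[v])
        if t <= m:
            self._add(u, v)
        elif self.rng.random() < m / t:
            a, b = self.rng.choice(tuple(self.sample))
            self.sample.discard((a, b))
            self.adj[a].discard(b)
            self.adj[b].discard(a)    # note: no counter decrement on eviction
            self._add(u, v)

    def estimate(self):
        # The counter already is the estimate; no final rescaling is applied.
        return self.tau


# Usage: with M larger than the stream, all weights are 1 and the count is exact.
sketch = ImprSketch(m=100, seed=0)
for u, v in combinations(range(5), 2):   # the 10 edges of a 5-clique
    sketch.process_edge(u, v)
print(sketch.estimate())
```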


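Note on Data: The implementation below ingests web-Stanford.txt. The file describes a directed graph, so each edge is treated as undirected for triangle counting, as defined in the preliminaries of the paper. Reciprocal directed edges (u, v) and (v, u) collapse to the same undirected edge after canonicalization. The loader does not deduplicate them, so any such pair reaches the algorithms twice, whereas TRIÈST assumes each edge is inserted only once.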
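
One way to turn a directed edge list into a stream of distinct undirected edges is sketched below. The helper name `canonical_edges` is hypothetical and not part of the notebook. Its `seen` set grows with the number of distinct edges, so this is an offline preprocessing step and not something that fits the fixed-memory streaming setting.

```python
def canonical_edges(pairs):
    """Yield each undirected edge once as (min, max), skipping self-loops and repeats."""
    seen = set()
    for u, v in pairs:
        if u == v:
            continue                      # self-loops cannot be part of a triangle
        edge = (u, v) if u < v else (v, u)
        if edge not in seen:              # drops (v, u) when (u, v) was already seen
            seen.add(edge)
            yield edge


# Usage: (2, 1) duplicates (1, 2) and (3, 3) is a self-loop.
directed = [(1, 2), (2, 1), (2, 3), (3, 3), (3, 1)]
print(list(canonical_edges(directed)))    # [(1, 2), (2, 3), (1, 3)]
```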

## Imports and Setup

In [1]:
import time
import os
import random
# Import the algorithms from the local files
from src.TriestBase import TriestBase
from src.TriestImpr import TriestImpr

# Configuration
FILE_PATH = 'data/web-Stanford.txt'
MEMORY_SIZE_M = 10000  # Fixed memory size M

## Stream Processing Function

In [2]:
def load_stream_and_run(filepath, algo_base, algo_impr, limit=None):
    """
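    Read an edge-list file as a stream and feed each edge (u, v) to both algorithms.

    Comment lines ('#'), malformed lines, and self-loops are skipped. Processing
    stops after `limit` edges when `limit` is given. Returns the number of edges
    fed to the algorithms, or None if the file is not found.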
    """
    edge_count = 0
    start_time = time.time()
    
    print(f"Reading stream from {filepath}...")
    
    try:
        with open(filepath, 'r') as f:
            for line in f:
                # Skip comments
                if line.startswith('#'):
                    continue
                
                parts = line.split()
                if len(parts) < 2:
                    continue
                
                try:
                    u, v = int(parts[0]), int(parts[1])
                except ValueError:
                    continue # Skip malformed lines
                
                # Ignore self-loops as per standard graph stream definitions
                if u == v:
                    continue
                    
                # Canonicalize the edge so that u < v (TRIÈST works on undirected graphs).
                # Reciprocal pairs (u, v) and (v, u) are NOT deduplicated here.
                if u > v:
                    u, v = v, u
                
                # Feed stream to both algorithms
                algo_base.process_edge(u, v)
                algo_impr.process_edge(u, v)
                
                edge_count += 1
                if edge_count % 100000 == 0:
                    print(f"Processed {edge_count} edges...")
                
                if limit is not None and edge_count >= limit:
                    break
                    
    except FileNotFoundError:
        print(f"Error: File {filepath} not found.")
        return None
        
    duration = time.time() - start_time
    print(f"\n--- Processing Complete in {duration:.2f} seconds ---")
    return edge_count
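
A small input with a known triangle count is useful for checking the loader and both estimators before running on the full dataset. The sketch below writes a 5-clique in the same whitespace-separated format and counts its triangles exactly. The helper names `write_clique` and `exact_triangles` and the file name `clique5.txt` are hypothetical, and the exact counter keeps the whole graph in memory, so it only suits small graphs.

```python
import os
import tempfile
from itertools import combinations


def write_clique(path, n):
    """Write the edges of a complete graph on n nodes, one 'u v' pair per line."""
    with open(path, "w") as f:
        f.write("# hypothetical test graph: clique\n")
        for u, v in combinations(range(n), 2):
            f.write(f"{u}\t{v}\n")


def exact_triangles(path):
    """Count triangles exactly by intersecting neighbour sets (small graphs only)."""
    adj = {}
    with open(path) as f:
        for line in f:
            if line.startswith("#"):
                continue
            parts = line.split()
            if len(parts) < 2:
                continue
            u, v = int(parts[0]), int(parts[1])
            if u != v:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
    # Visit each undirected edge once (u < v); every triangle is then counted
    # once per edge, i.e. three times in total.
    per_edge = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return per_edge // 3


path = os.path.join(tempfile.mkdtemp(), "clique5.txt")
write_clique(path, 5)
print(exact_triangles(path))   # a 5-clique has C(5, 3) = 10 triangles
```

With M at least as large as the number of edges in the file, both estimators should reproduce this exact count.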

## Execution and Results

In [6]:
print(f"Initializing Algorithms with Memory M = {MEMORY_SIZE_M}")
t_base = TriestBase(MEMORY_SIZE_M)
t_impr = TriestImpr(MEMORY_SIZE_M)

# Run the simulation
total_edges = load_stream_and_run(FILE_PATH, t_base, t_impr)

if total_edges is not None:
    print("=" * 40)
    print("Final Statistics:")
    # The stream position t is tracked by the reservoir object
    print(f"Total Edges Streamed (t): {t_base.reservoir.t}")
    print(f"Reservoir Size (M):       {MEMORY_SIZE_M}")
    print("-" * 40)
    
    # Get Estimations
    est_base = int(t_base.get_estimation())
    est_impr = int(t_impr.get_estimation())
    
    print(f"TRIEST-BASE Estimated Global Triangles: {est_base}")
    print(f"TRIEST-IMPR Estimated Global Triangles: {est_impr}")
    
    # Validation for Dummy Graph
    if FILE_PATH == "test_graph.txt":
        print("-" * 40)
        print("Ground Truth (Clique 5): 10")

Initializing Algorithms with Memory M = 10000
Reading stream from data/web-Stanford.txt...
Processed 100000 edges...
Processed 200000 edges...
Processed 300000 edges...
Processed 400000 edges...
Processed 500000 edges...
Processed 600000 edges...
Processed 700000 edges...
Processed 800000 edges...
Processed 900000 edges...
Processed 1000000 edges...
Processed 1100000 edges...
Processed 1200000 edges...
Processed 1300000 edges...
Processed 1400000 edges...
Processed 1500000 edges...
Processed 1600000 edges...
Processed 1700000 edges...
Processed 1800000 edges...
Processed 1900000 edges...
Processed 2000000 edges...
Processed 2100000 edges...
Processed 2200000 edges...
Processed 2300000 edges...

--- Processing Complete in 2.95 seconds ---
Final Statistics:
Total Edges Streamed (t): 2312497
Reservoir Size (M):       10000
----------------------------------------
TRIEST-BASE Estimated Global Triangles: 173342234974
TRIEST-IMPR Estimated Global Triangles: 17087518
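
The two estimates above differ by roughly four orders of magnitude, which is worth investigating since both estimators target the same quantity. Because BASE reports the sampled triangle count multiplied by max(1, t(t-1)(t-2) / (M(M-1)(M-2))), dividing the printed estimate by that factor recovers the sampled count the run must have held. The snippet below is only a diagnostic sketch using the numbers printed above; the function name `xi` is hypothetical and nothing here inspects the actual implementation in `src`.

```python
def xi(t, m):
    """TRIEST-BASE scaling factor: max(1, t(t-1)(t-2) / (M(M-1)(M-2)))."""
    return max(1.0, t * (t - 1) * (t - 2) / (m * (m - 1) * (m - 2)))


# Values copied from the run output above.
t, m = 2312497, 10000
est_base, est_impr = 173342234974, 17087518

scale = xi(t, m)
implied_tau = est_base / scale        # sampled triangle count implied by BASE
ratio = est_base / est_impr           # how far apart the two estimates are

print(f"xi(t)                     ~ {scale:.3e}")
print(f"implied sampled triangles ~ {implied_tau:.0f}")
print(f"BASE / IMPR ratio         ~ {ratio:.0f}")
```

If the implied sampled count looks implausibly large for a reservoir of M edges, the counter updates on insertion and eviction and the duplicate edges in the stream are reasonable places to look first.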


## Questions

### 1. What were the challenges you faced when implementing the algorithm?



### 2. Can the algorithm be easily parallelized? If yes, how? If not, why? Explain.

### 3. Does the algorithm work for unbounded graph streams? Explain.

### 4. Does the algorithm support edge deletions? If not, what modification would it need? Explain.