# FORGE CSV Catalog Inspection Notebook

This notebook provides data quality analysis and inspection tools for FORGE seismic catalog CSV files. It performs duplicate detection and data validation on the merged catalog datasets.

## Purpose
- **Data Quality Assessment**: Check for duplicate events in seismic catalogs
- **Catalog Validation**: Ensure data integrity across multiple CSV files
- **Preprocessing Analysis**: Identify potential issues before association with DAS data

## Dataset Information
- **Source**: FORGE 16A Stimulation Catalogues (April 2024)
- **File Format**: Multiple CSV files with FORGE prefix
- **Key Field**: 'Source' column (event identifier)

## Workflow Overview
1. Load all FORGE CSV files from the catalog directory
2. Merge multiple catalog files into a unified dataset
3. Analyze the 'Source' column for uniqueness
4. Identify and examine duplicate entries
5. Provide detailed duplicate analysis
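
The five steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's own code: `load_catalogs` and `duplicated_sources` are hypothetical helper names, and the CSV files are tiny synthetic stand-ins written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_catalogs(folder, prefix="FORGE"):
    """Load every '<prefix>*.csv' file in `folder` and concatenate them."""
    files = sorted(Path(folder).glob(f"{prefix}*.csv"))  # sorted for a reproducible order
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)


def duplicated_sources(df, key="Source"):
    """Return all rows whose `key` value occurs more than once (missing keys excluded)."""
    known = df[df[key].notna()]
    return known[known.duplicated(subset=key, keep=False)]


# Tiny synthetic stand-in for the catalog directory (not real FORGE data)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "FORGE_a.csv").write_text("Source,MomMag\n43,-0.92\n47,-1.09\n")
    Path(tmp, "FORGE_b.csv").write_text("Source,MomMag\n43,0.10\n56,-0.73\n")
    Path(tmp, "notes.csv").write_text("Source,MomMag\n99,0.0\n")  # skipped: wrong prefix
    merged = load_catalogs(tmp)

dupes = duplicated_sources(merged)
print(f"Source unique: {merged['Source'].is_unique}")
print(dupes)
```

Using `keep=False` returns every row of a repeated ID, which makes it easier to compare the conflicting entries side by side.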

**Author**: Danilo Dordevic  
**Last Updated**: August 2025  
**Related Scripts**: `associate_catalog_dataset.py`, `check_similarity.py`

In [1]:
# ================================================================================================
# SECTION 1: DATA LOADING AND INITIAL PROCESSING
# ================================================================================================

import pandas as pd
import numpy as np
import os

# Define the folder path containing FORGE seismic catalog CSV files
# This directory contains multiple CSV files from the 16A stimulation monitoring campaign
folder_path = 'GES16Aand16BStimulationMonitoringApril2024/16AStimulationCatalogues'

print(f"📁 Target directory: {folder_path}")
print(f"📊 Loading FORGE catalog CSV files...")

# Load all CSV files from the folder that match the FORGE naming convention
# Filter criteria:
# - File extension must be .csv
# - Filename must start with 'FORGE' (official catalog files)
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv') and f.startswith('FORGE')]

print(f"📋 Found {len(csv_files)} FORGE CSV files:")
for i, file in enumerate(csv_files, 1):
    print(f"   {i}. {file}")

# Load each CSV file into a pandas DataFrame and store in a list
# This approach allows us to inspect individual files if needed before merging
df_list = [pd.read_csv(os.path.join(folder_path, file)) for file in csv_files]

# Merge all CSV files into a single comprehensive dataframe
# ignore_index=True ensures continuous indexing across all merged data
merged_df = pd.concat(df_list, ignore_index=True)

print(f"\n📈 Data Loading Summary:")
print(f"   • Total events loaded: {len(merged_df)}")
print(f"   • Number of columns: {len(merged_df.columns)}")
print(f"   • Data shape: {merged_df.shape}")

# Initial uniqueness check for the 'Source' column
# The 'Source' column should contain unique event identifiers
unique_sources = merged_df['Source'].is_unique
print(f"\n🔍 Initial Data Quality Check:")
print(f"   • Column 'Source' contains unique values: {unique_sources}")

if not unique_sources:
    print("   ⚠️  WARNING: Duplicate source IDs detected! Further analysis required.")
else:
    print("   ✅ All source IDs are unique - data quality looks good.")

📁 Target directory: GES16Aand16BStimulationMonitoringApril2024/16AStimulationCatalogues
📊 Loading FORGE catalog CSV files...
📋 Found 20 FORGE CSV files:
   1. FORGE16aApril24BackgroundStage 3R.csv
   2. FORGE16aApril24BackgroundStage 8b.csv
   3. FORGE16aApril24BackgroundStage 9D.csv
   4. FORGE16aApril24BackgroundStage 6.csv
   5. FORGE16aApril24BackgroundStage 7.csv
   6. FORGE16aApril24BackgroundStage 9B.csv
   7. FORGE16aApril24BackgroundStage 5.csv
   8. FORGE16aApril24BackgroundStage 4.csv
   9. FORGE16aApril24BackgroundStage 9C.csv
   10. FORGE16aApril24BackgroundStage 9.csv
   11. FORGE16aApril24BackgroundStage 8.csv
   12. FORGE16aApril24BackgroundStage 7C.csv
   13. FORGE16aApril24BackgroundStage 7b.csv
   14. FORGE16aApril24BackgroundStage 10.csv
   15. FORGE16aApril24BackgroundStage 10B.csv
   16. FORGE16aApril24BackgroundPost Stim D.csv
   17. FORGE16aApril24BackgroundStage 10C.csv
   18. FORGE16aApril24BackgroundPost Stim A.csv
   19. FORGE16aApril24BackgroundPost Stim B.
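
Because the merge discards the information about which file each row came from, it can help to record provenance while loading. The sketch below assumes a hypothetical `source_file` column and uses in-memory stand-ins instead of the real catalog files; it also sorts the file names, since `os.listdir()` returns entries in arbitrary order.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two catalog files (not real FORGE data)
fake_files = {
    "FORGE_stage_A.csv": "Source,MomMag\n43,-0.92\n47,-1.09\n",
    "FORGE_stage_B.csv": "Source,MomMag\n43,0.10\n",
}

frames = []
for name in sorted(fake_files):  # stable order, unlike os.listdir()
    df = pd.read_csv(io.StringIO(fake_files[name]))
    df["source_file"] = name  # hypothetical provenance column
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)

# For each Source value, count the number of distinct files it appears in
files_per_source = merged.groupby("Source")["source_file"].nunique()
print(files_per_source[files_per_source > 1])
```

A Source value that appears in more than one file points to an ID collision between files rather than a repeated row within one file.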

In [2]:
# ================================================================================================
# SECTION 2: DATA OVERVIEW AND STRUCTURE INSPECTION
# ================================================================================================

print("📊 MERGED CATALOG DATASET OVERVIEW")
print("=" * 50)
print(f"Dataset contains {len(merged_df)} seismic events from {len(csv_files)} CSV files")
print(f"Columns available: {list(merged_df.columns)}")
print("\n🔍 Data Preview:")

# Display the merged dataframe for visual inspection
# The rendered table shows the column layout and sample values (data types are available via merged_df.dtypes)
merged_df

📊 MERGED CATALOG DATASET OVERVIEW
Dataset contains 2742 seismic events from 20 CSV files
Columns available: ['Source', ' Trig Date ', '    Trig Time   ', 'Origin Date', '   Origin Time  ', '    Profile  ', ' Status', ' Cluster', '      Y     ', '      X     ', '    Depth   ', ' MomMag', '   PGV  ', ' Stage', ' P S/N  ', ' S S/N  ', ' Quality', '   Error  ', ' Location   ', ' rms Noise', 'Matched File']

🔍 Data Preview:


|  | Source | Trig Date | Trig Time | Origin Date | Origin Time | Profile | Status | Cluster | Y | X | ... | MomMag | PGV | Stage | P S/N | S S/N | Quality | Error | Location | rms Noise | Matched File |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.0 | 3/4/2024 | 10:37:26.708000 | 3/4/2024 | 10:37:26.520487 | Primary | 2.0 | 0.0 | -1150.0 | 3300.0 | ... | -0.92 | 0.50 | 3 | 15.4 | 36.5 | 1.1 | 1000000.0 | L2 msmx | 6.70 |  |
| 1 | 47.0 | 3/4/2024 | 10:37:59.106000 | 3/4/2024 | 10:37:58.875683 | Primary | 2.0 | 0.0 | -1070.0 | 3500.0 | ... | -1.09 | 0.48 | 3 | 13.7 | 33.2 | 1.1 | 1000000.0 | L2 msmx | 6.73 |  |
| 2 | 56.0 | 3/4/2024 | 10:38:40.201750 | 3/4/2024 | 10:38:40.012058 | Primary | 2.0 | 0.0 | -940.0 | 3530.0 | ... | -0.73 | 0.58 | 3 | 23.0 | 48.7 | 1.1 | 1000000.0 | L2 msmx | 6.62 |  |
| 3 | 1096.0 | 3/4/2024 | 11:57:33.040750 | 3/4/2024 | 11:57:32.886600 | Primary | 2.0 | 0.0 | -1400.0 | 3050.0 | ... | 0.04 | 167.81 | 3 | 19.6 | 41.8 | 2.7 | 1000000.0 | L2 msmx | 9.12 |  |
| 4 | 1180.0 | 3/4/2024 | 12:02:10.987250 | 3/4/2024 | 12:02:10.838082 | Primary | 2.0 | 0.0 | -1250.0 | 2990.0 | ... | 0.20 | 145.39 | 3 | 20.2 | 57.3 | 2.1 | 1000000.0 | L2 msmx | 10.51 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2737 | 7719.0 | 8/4/2024 | 09:11:03.612500 | 8/4/2024 | 09:11:03.451317 | Primary | 2.0 | 0.0 | -410.0 | 2100.0 | ... | 0.38 | 187.75 | Post Stim | 113.5 | 194.4 | 3.2 | 1000000.0 | L2 msmx | 13.52 | 16B_StrainRate_20240408T151052+0000_44120.h5 |
| 2738 | 9510.0 | 8/4/2024 | 10:36:46.647250 | 8/4/2024 | 10:36:46.448315 | Primary | 2.0 | 0.0 | -2670.0 | 2310.0 | ... | 0.26 | 187.75 | Post Stim | 59.9 | 165.6 | 1.7 | 1000000.0 | L2 msmx | 8.07 | 16B_StrainRate_20240408T163640+0000_44549.h5 |
| 2739 | 9836.0 | 8/4/2024 | 10:52:05.686250 | 8/4/2024 | 10:52:05.721511 | Primary | 2.0 | 0.0 | -2590.0 | 2240.0 | ... | 0.25 | 187.75 | Post Stim | 6.3 | 17.8 | 1.7 | 1000000.0 | L2 msmx | 7.92 | 16B_StrainRate_20240408T165204+0000_44626.h5 |
| 2740 | 1150.0 | 8/4/2024 | 04:09:29.483500 | 8/4/2024 | 04:09:29.552265 | Primary | 4.0 | 0.0 | -140.0 | 2260.0 | ... | -0.34 | 40.55 | Post Stim | 8.4 | 18.6 | 2.1 | 1000000.0 | L2 Check | 8.03 | 16B_StrainRate_20240408T100928+0000_42613.h5 |
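
The column list printed above shows that most headers carry leading or trailing spaces (for example `' Trig Date '`), so a lookup such as `merged_df['Trig Date']` would fail. The sketch below strips the header padding and combines date and time into one timestamp. It assumes the dates are day/month/year, which the April 2024 timestamps in the `Matched File` names suggest but the catalog itself does not state; `trig_datetime` is a hypothetical column name and the rows are a synthetic sample.

```python
import io

import pandas as pd

# Synthetic two-row sample that mimics the padded headers (not a real catalog file)
raw = io.StringIO(
    "Source, Trig Date ,    Trig Time   \n"
    "43,3/4/2024,10:37:26.708000\n"
    "7719,8/4/2024,09:11:03.612500\n"
)
df = pd.read_csv(raw)

# Remove leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()

# Combine date and time into one timestamp; day/month/year order is an assumption
df["trig_datetime"] = pd.to_datetime(
    df["Trig Date"].str.strip() + " " + df["Trig Time"].str.strip(),
    format="%d/%m/%Y %H:%M:%S.%f",
)

print(df.columns.tolist())
print(df["trig_datetime"])
```

Passing an explicit `format` avoids silent month/day swaps that automatic date inference can introduce for ambiguous values such as `3/4/2024`.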


In [3]:
# ================================================================================================
# SECTION 3: SOURCE COLUMN ANALYSIS
# ================================================================================================

print("🔍 ANALYZING SOURCE COLUMN VALUES")
print("=" * 40)

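# Convert the 'Source' column to a NumPy array for the checks below
# Source holds numeric event identifiers; a float64 dtype usually means pandas found
# missing values (NaN) or decimal-formatted IDs in at least one file
# Caveat: if NaN is present, np.min/np.max return nan (np.nanmin/np.nanmax skip NaN)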
source_array = merged_df['Source'].to_numpy()

print(f"📊 Source Column Statistics:")
print(f"   • Data type: {source_array.dtype}")
print(f"   • Total values: {len(source_array)}")
print(f"   • Min value: {np.min(source_array)}")
print(f"   • Max value: {np.max(source_array)}")
print(f"   • Unique values: {len(np.unique(source_array))}")

print(f"\n📋 First 20 Source values:")
print(source_array[:20])

print(f"\n📋 Complete Source array:")
# Display the full array of source values for detailed inspection
source_array

🔍 ANALYZING SOURCE COLUMN VALUES
📊 Source Column Statistics:
   • Data type: float64
   • Total values: 2742
   • Min value: nan
   • Max value: nan
   • Unique values: 2357

📋 First 20 Source values:
[  43.   47.   56. 1096. 1180. 1455. 1623. 1852. 1905. 1949. 2119. 2304.
 2358. 2384. 2569. 2663. 2672. 2732. 2782. 2824.]

📋 Complete Source array:


array([  43.,   47.,   56., ..., 9836., 1150., 5892.], shape=(2742,))
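
The `nan` minimum and maximum above indicate that at least one Source value is missing. A NaN-aware version of the same statistics could look like the sketch below, which runs on a small synthetic Series rather than the real catalog.

```python
import numpy as np
import pandas as pd

# Synthetic Source column with one missing ID (not real FORGE data)
source = pd.Series([43.0, 47.0, np.nan, 43.0, 56.0], name="Source")
source_array = source.to_numpy()

n_missing = int(source.isna().sum())  # rows without an ID
min_id = np.nanmin(source_array)      # skips NaN, unlike np.min
max_id = np.nanmax(source_array)      # skips NaN, unlike np.max
n_unique = source.nunique()           # dropna=True by default, so NaN is not counted

print(f"missing: {n_missing}, min: {min_id}, max: {max_id}, unique: {n_unique}")
```

Reporting the number of missing IDs separately keeps them from being mistaken for duplicates in the later sections.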

In [4]:
# ================================================================================================
# SECTION 4: UNIQUENESS VALIDATION AND DUPLICATE DETECTION
# ================================================================================================

print("🔍 COMPREHENSIVE UNIQUENESS ANALYSIS")
print("=" * 45)

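# Uniqueness check using np.unique, which sorts the array and returns its distinct values
# Caveat: all NaN entries collapse into a single unique value (NumPy >= 1.21), so rows with a
# missing Source are counted as duplicates of each other in the statistics below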
unique_values = np.unique(source_array)
total_values = len(source_array)
unique_count = len(unique_values)

# Calculate duplicate statistics
is_unique = unique_count == total_values
duplicate_count = total_values - unique_count
duplicate_percentage = (duplicate_count / total_values) * 100 if total_values > 0 else 0

print(f"📊 Uniqueness Statistics:")
print(f"   • Total Source values: {total_values}")
print(f"   • Unique Source values: {unique_count}")
print(f"   • Duplicate instances: {duplicate_count}")
print(f"   • Duplicate percentage: {duplicate_percentage:.2f}%")
print(f"   • All values unique: {is_unique}")

if is_unique:
    print("\n✅ RESULT: All Source values are unique - No duplicates found!")
    print("   → Data quality is excellent for event identification")
    print("   → Safe to proceed with DAS data association")
else:
    print(f"\n⚠️  RESULT: {duplicate_count} duplicate instances detected!")
    print("   → Further investigation required before data association")
    print("   → Duplicate analysis will follow in next section")

print(f"\n🎯 Final Assessment: All values in the numpy array are unique: {is_unique}")

🔍 COMPREHENSIVE UNIQUENESS ANALYSIS
📊 Uniqueness Statistics:
   • Total Source values: 2742
   • Unique Source values: 2357
   • Duplicate instances: 385
   • Duplicate percentage: 14.04%
   • All values unique: False

⚠️  RESULT: 385 duplicate instances detected!
   → Further investigation required before data association
   → Duplicate analysis will follow in next section

🎯 Final Assessment: All values in the numpy array are unique: False
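
The figure `total - unique` counts surplus rows, and it mixes repeated IDs with missing ones. The sketch below separates these quantities using pandas on a synthetic Series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic Source column: ID 43 occurs three times, two IDs are missing
source = pd.Series([43.0, 47.0, np.nan, 43.0, 56.0, np.nan, 43.0], name="Source")

n_missing = int(source.isna().sum())
known = source.dropna()  # analyse only rows that actually have an ID

counts = known.value_counts()          # occurrences per ID
repeated_ids = counts[counts > 1]      # IDs that occur more than once
surplus_rows = int(known.duplicated(keep="first").sum())  # rows beyond the first occurrence
rows_involved = int(known.duplicated(keep=False).sum())   # every row of a repeated ID

print(f"missing IDs:    {n_missing}")
print(f"repeated IDs:   {len(repeated_ids)}")
print(f"surplus rows:   {surplus_rows}")
print(f"rows involved:  {rows_involved}")
```

"Surplus rows" is the number that would be removed by keeping one row per ID, while "rows involved" is the number that needs review.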


In [5]:
# ================================================================================================
# SECTION 5: DETAILED DUPLICATE ANALYSIS AND DETECTION
# ================================================================================================

print("🔍 DETAILED DUPLICATE DETECTION ALGORITHM")
print("=" * 50)

# Initialize dictionary to track value occurrences and their positions
# Key: Source value, Value: List of indices where this value appears
duplicates = {}

print("🔄 Processing Source values to identify duplicates...")

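# Build a map from each Source value to the list of row indices where it occurs
# Caveat: NaN != NaN, so every NaN becomes its own dictionary key here; this is why the
# unique count reported below can exceed the np.unique count from Section 4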
for index, value in enumerate(source_array):
    if value in duplicates:
        # Value already seen - add this index to existing list
        duplicates[value].append(index)
    else:
        # First occurrence - create new entry with single index
        duplicates[value] = [index]

print(f"📊 Processing complete. Analyzed {len(source_array)} values.")

# Filter to keep only values that appear more than once (true duplicates)
# This removes all unique values, leaving only problematic duplicates
original_count = len(duplicates)
duplicates = {key: value for key, value in duplicates.items() if len(value) > 1}
duplicate_source_count = len(duplicates)

print(f"\n📈 Duplicate Detection Results:")
print(f"   • Total unique Source values: {original_count}")
print(f"   • Source values with duplicates: {duplicate_source_count}")
print(f"   • Percentage of values with duplicates: {(duplicate_source_count/original_count)*100:.2f}%")

if duplicate_source_count == 0:
    print("\n✅ EXCELLENT: No duplicate Source values found!")
    print("   → All events have unique identifiers")
    print("   → Data is ready for association processing")
else:
    print(f"\n⚠️  ATTENTION: Found {duplicate_source_count} Source values with duplicates!")
    print("\n📋 Detailed Duplicate Report:")
    print("-" * 60)
    
    for i, (value, indices) in enumerate(duplicates.items(), 1):
        occurrence_count = len(indices)
        print(f"   {i}. Source ID: {value}")
        print(f"      • Appears {occurrence_count} times")
        print(f"      • At row indices: {indices}")
        print(f"      • Row numbers: {[idx + 1 for idx in indices]}")  # Convert to 1-based indexing
        print()

print("🎯 Analysis Summary:")
if duplicate_source_count > 0:
    total_duplicate_events = sum(len(indices) for indices in duplicates.values())
    print(f"   • Total events involved in duplicates: {total_duplicate_events}")
    print(f"   • Impact on dataset: {(total_duplicate_events/len(source_array))*100:.2f}% of all events")
    print("   • Recommendation: Review duplicate events before proceeding")
else:
    print("   • Dataset integrity: PASSED")
    print("   • Ready for next processing step")

🔍 DETAILED DUPLICATE DETECTION ALGORITHM
🔄 Processing Source values to identify duplicates...
📊 Processing complete. Analyzed 2742 values.

📈 Duplicate Detection Results:
   • Total unique Source values: 2418
   • Source values with duplicates: 297
   • Percentage of values with duplicates: 12.28%

⚠️  ATTENTION: Found 297 Source values with duplicates!

📋 Detailed Duplicate Report:
------------------------------------------------------------
   1. Source ID: 43.0
      • Appears 2 times
      • At row indices: [0, 1317]
      • Row numbers: [1, 1318]

   2. Source ID: 47.0
      • Appears 2 times
      • At row indices: [1, 506]
      • Row numbers: [2, 507]

   3. Source ID: 1096.0
      • Appears 2 times
      • At row indices: [3, 2567]
      • Row numbers: [4, 2568]

   4. Source ID: 1180.0
      • Appears 2 times
      • At row indices: [4, 551]
      • Row numbers: [5, 552]

   5. Source ID: 1455.0
      • Appears 2 times
      • At row indices: [5, 383]
      • Row numbers: [6,
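
The dictionary-based count (2418) is higher than the `np.unique` count from Section 4 (2357). One explanation consistent with the `nan` statistics in Section 3 is that the loop stores every NaN under its own key, because NaN never compares equal to itself. The sketch below reproduces that effect on synthetic data and shows a NaN-safe alternative built on `groupby`; it is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Synthetic Source column with two repeated IDs and two missing IDs
source = pd.Series([43.0, 47.0, np.nan, 43.0, np.nan, 47.0, 56.0], name="Source")

# Manual loop as in the cell above: every NaN ends up under its own key
naive = {}
for index, value in enumerate(source.to_numpy()):
    naive.setdefault(value, []).append(index)

# NaN-safe variant: drop missing IDs, then group the remaining row labels by ID
known = source.dropna()
groups = known.groupby(known).groups  # maps each ID to an Index of row labels
duplicates = {float(sid): rows.tolist() for sid, rows in groups.items() if len(rows) > 1}

print(len(naive))       # 5 keys: 43.0, 47.0, 56.0 and one key per NaN
print(known.nunique())  # 3 distinct non-missing IDs
print(duplicates)       # {43.0: [0, 3], 47.0: [1, 5]}
```

Dropping missing IDs before grouping keeps the occurrence map limited to real identifiers, so the unique count matches across methods.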

In [6]:
# ================================================================================================
# SECTION 6: SPECIFIC DUPLICATE INVESTIGATION
# ================================================================================================

print("🔍 DETAILED EXAMINATION OF SPECIFIC DUPLICATE CASE")
print("=" * 55)

# Example investigation: Source ID 9675
# This section demonstrates how to examine specific duplicate cases in detail
target_source_id = 9675

print(f"🎯 Investigating Source ID: {target_source_id}")
print("-" * 40)

# Filter the merged dataframe to show all rows with the target Source ID
duplicate_entries = merged_df[merged_df['Source'] == target_source_id]

print(f"📊 Analysis Results for Source ID {target_source_id}:")
print(f"   • Number of occurrences: {len(duplicate_entries)}")

if len(duplicate_entries) > 0:
    print(f"   • Row indices in dataset: {duplicate_entries.index.tolist()}")
    print(f"   • This represents {(len(duplicate_entries)/len(merged_df))*100:.4f}% of total events")
    
    if len(duplicate_entries) > 1:
        print(f"\n⚠️  DUPLICATE DETECTED: Source ID {target_source_id} appears {len(duplicate_entries)} times")
        print("\n📋 Detailed comparison of duplicate entries:")
        print("   → Check if these are truly identical events or data entry errors")
        print("   → Look for differences in timestamps, locations, or magnitudes")
        print("   → Consider which entry to keep for data association")
    else:
        print(f"\n✅ UNIQUE: Source ID {target_source_id} appears only once (as expected)")
else:
    print(f"\n❌ NOT FOUND: Source ID {target_source_id} does not exist in the dataset")

print(f"\n📄 Complete record(s) for Source ID {target_source_id}:")
print("=" * 80)

# Display the full record(s) for detailed inspection
# This allows manual review of the duplicate entries to determine the cause
duplicate_entries

🔍 DETAILED EXAMINATION OF SPECIFIC DUPLICATE CASE
🎯 Investigating Source ID: 9675
----------------------------------------
📊 Analysis Results for Source ID 9675:
   • Number of occurrences: 2
   • Row indices in dataset: [724, 1068]
   • This represents 0.0729% of total events

⚠️  DUPLICATE DETECTED: Source ID 9675 appears 2 times

📋 Detailed comparison of duplicate entries:
   → Check if these are truly identical events or data entry errors
   → Look for differences in timestamps, locations, or magnitudes
   → Consider which entry to keep for data association

📄 Complete record(s) for Source ID 9675:


|  | Source | Trig Date | Trig Time | Origin Date | Origin Time | Profile | Status | Cluster | Y | X | ... | MomMag | PGV | Stage | P S/N | S S/N | Quality | Error | Location | rms Noise | Matched File |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 724 | 9675.0 | 6/4/2024 | 12:13:26.605250 | 6/4/2024 | 12:13:26.390741 | Primary | 2.0 | 0.0 | -1720.0 | 2290.0 | ... | -0.07 | 129.48 | 9 | 4.9 | 8.7 | 2.5 | 1000000.0 | L2 msmx | 9.98 |  |
| 1068 | 9675.0 | 6/4/2024 | 17:49:58.150250 | 6/4/2024 | 17:49:57.924236 | Primary | 2.0 | 0.0 | -2000.0 | 2390.0 | ... | 0.19 | 187.75 | 9 | 9.7 | 43.7 | 1.9 | 1000000.0 | L2 msmx | 7.4 |  |
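
To see at a glance which fields differ between rows that share a Source ID, the comparison can be automated. The sketch below rebuilds a reduced version of the two rows above (only four of the columns) and lists the columns whose values are not identical.

```python
import pandas as pd

# Reduced copy of the two rows shown above (subset of columns only)
duplicate_entries = pd.DataFrame(
    {
        "Source": [9675.0, 9675.0],
        "Origin Time": ["12:13:26.390741", "17:49:57.924236"],
        "Stage": ["9", "9"],
        "MomMag": [-0.07, 0.19],
    },
    index=[724, 1068],
)

# Keep the columns that hold more than one distinct value across the rows
differing = [
    col for col in duplicate_entries.columns
    if duplicate_entries[col].nunique(dropna=False) > 1
]

print(f"Columns that differ: {differing}")
print(duplicate_entries[differing])
```

Here the origin time and magnitude differ, which suggests two separate events that share an identifier rather than one event recorded twice.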


## 📋 Analysis Summary and Recommendations

### Key Findings
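The 'Source' column does not uniquely identify events in the merged catalog. Some Source values are missing (Section 3 reports `nan` for both the minimum and the maximum), and many are repeated. In the one case examined in detail (Source ID 9675), the two rows differ in origin time, location, and magnitude, so they appear to be distinct events that share an ID rather than one record entered twice.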

### Data Quality Metrics
- **Dataset Size**: 2,742 events across 20 catalog files
- **Uniqueness**: 2,357 distinct Source values according to `np.unique`, leaving 385 surplus rows (14.04%); 297 Source values occur more than once
- **Data Integrity**: Not ready for DAS association until repeated and missing Source IDs are resolved

### Potential Issues Identified
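1. **Duplicate Source IDs**: Events with identical identifiers
2. **Missing Source IDs**: Rows where 'Source' is NaN, which `np.unique` and the dictionary-based count treat differently (2,357 vs. 2,418 unique values)
3. **Data Entry Errors**: Possible mistakes in catalog compilation
4. **File Merging Issues**: Source IDs may be unique only within a single file, so concatenating files can produce collisions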

### Recommended Next Steps

#### If No Duplicates Found ✅
- Proceed with DAS data association using `associate_catalog_dataset.py`
- Dataset is ready for temporal matching with DAS files
- No additional preprocessing required

#### If Duplicates Found ⚠️
1. **Manual Review**: Examine duplicate entries for differences in:
   - Trigger times and dates
   - Event magnitudes and depths
   - Spatial coordinates
   - Data source/origin

2. **Data Cleaning Options**:
   - Remove exact duplicates (identical in all fields)
   - Keep most recent or most accurate entry
   - Merge information if entries are complementary
   - Flag for expert review if entries differ significantly

3. **Quality Control**:
   - Use `check_similarity.py` to compare cleaned vs. original data
   - Verify that cleaning doesn't introduce new issues
   - Document all changes made to the dataset
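
The cleaning options above could be combined roughly as follows. This is a sketch under stated assumptions: the data is synthetic, `needs_review` and `event_key` are hypothetical column names, and building a composite key from Source and origin timestamp is one possible choice that the notebook does not prescribe.

```python
import pandas as pd

# Synthetic catalog: rows 0-1 are exact duplicates, rows 2-3 share an ID but differ
catalog = pd.DataFrame(
    {
        "Source": [43, 43, 9675, 9675],
        "Origin Date": ["3/4/2024", "3/4/2024", "6/4/2024", "6/4/2024"],
        "Origin Time": [
            "10:37:26.520487", "10:37:26.520487",
            "12:13:26.390741", "17:49:57.924236",
        ],
        "MomMag": [-0.92, -0.92, -0.07, 0.19],
    }
)

# Step 1: remove rows that are identical in every field
deduped = catalog.drop_duplicates().reset_index(drop=True)

# Step 2: flag rows that still share a Source ID for expert review
deduped["needs_review"] = deduped.duplicated(subset="Source", keep=False)

# Step 3 (assumption): build a composite key that stays unique across distinct events
deduped["event_key"] = (
    deduped["Source"].astype(str) + "_" + deduped["Origin Date"] + "T" + deduped["Origin Time"]
)

print(deduped[["Source", "needs_review", "event_key"]])
print(f"event_key unique: {deduped['event_key'].is_unique}")
```

Flagging instead of deleting keeps the conflicting rows available, so every removal can be documented as the quality-control step requires.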

### Integration with FORGE Pipeline
- **Input to**: `associate_catalog_dataset.py` (catalog-DAS association)
- **Validation with**: `check_similarity.py` (data integrity verification)
- **Prerequisites**: Clean, unique Source IDs for reliable association
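
The uniqueness prerequisite can be enforced with a small guard that runs before the association step. `check_association_ready` is a hypothetical helper, not a function from `associate_catalog_dataset.py`.

```python
import pandas as pd


def check_association_ready(df, key="Source"):
    """Raise ValueError if `key` has missing or repeated values; otherwise return df."""
    n_missing = int(df[key].isna().sum())
    n_repeated = int(df[key].dropna().duplicated().sum())
    if n_missing or n_repeated:
        raise ValueError(
            f"{key!r} is not a valid key: {n_missing} missing and {n_repeated} repeated value(s)"
        )
    return df


# Synthetic example: one repeated ID and one missing ID
catalog = pd.DataFrame({"Source": [43.0, 43.0, None, 56.0]})
try:
    check_association_ready(catalog)
except ValueError as err:
    print(f"Blocked: {err}")
```

Failing loudly at this point prevents a repeated ID from silently matching one DAS file to several catalog rows.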

### Technical Notes
- Analysis uses pandas and NumPy
- All files are read into memory and concatenated, which is adequate for catalogs of this size (thousands of events)
- Compatible with FORGE April 2024 dataset structure