## Pure Signal Generators
### ARFIMA Processes

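ARFIMA (Autoregressive Fractionally Integrated Moving Average) processes are characterized by the fractional differencing parameter `d`. For 0 < d < 0.5 the process is stationary and shows long-range dependence (LRD), and `d` relates to the Hurst exponent through H = d + 0.5.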

In [1]:
import sys
from pathlib import Path
import numpy as np
sys.path.append(str(Path.cwd() / "src"))

from data_processing.synthetic_generator import SyntheticDataGenerator

# Initialize generator
generator = SyntheticDataGenerator(random_state=42)

# Generate ARFIMA with different d values using the new convenience method
arfima_weak = generator.generate_arfima(n=1000, d=0.1)  # Weak LRD
arfima_medium = generator.generate_arfima(n=1000, d=0.3)  # Medium LRD
arfima_strong = generator.generate_arfima(n=1000, d=0.4)  # Strong LRD

print(f"Weak LRD (d=0.1): {len(arfima_weak)} points")
print(f"Medium LRD (d=0.3): {len(arfima_medium)} points")
print(f"Strong LRD (d=0.4): {len(arfima_strong)} points")



Ensured directory exists: data
Ensured directory exists: data\raw
Ensured directory exists: data\processed
Ensured directory exists: data\metadata
Weak LRD (d=0.1): 1000 points
Medium LRD (d=0.3): 1000 points
Strong LRD (d=0.4): 1000 points


**Parameter Guide:**

- `n`: Number of data points
- `d`: Fractional differencing parameter (0 < d < 0.5)
- `ar_params`: Optional AR parameters (list of floats)
- `ma_params`: Optional MA parameters (list of floats)
- `sigma`: Noise standard deviation (default: 1.0)
- `random_state`: For reproducibility
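
The internals of `generate_arfima` are not shown in this tutorial. As a rough illustration of what `d` does, the sketch below approximates an ARFIMA(0, d, 0) series by filtering white noise with the truncated MA(∞) weights of (1 − B)^(−d). The function names, the `burn_in` argument, and the truncation are assumptions made for illustration, not the library's implementation.

```python
import numpy as np

def fractional_ma_weights(d, n_weights):
    """First n_weights MA(infinity) weights of the filter (1 - B)^(-d)."""
    psi = np.empty(n_weights)
    psi[0] = 1.0
    for k in range(1, n_weights):
        # Recursion: psi_k = psi_{k-1} * (k - 1 + d) / k
        psi[k] = psi[k - 1] * (k - 1 + d) / k
    return psi

def simulate_arfima_0d0(n, d, sigma=1.0, burn_in=500, seed=None):
    """Approximate ARFIMA(0, d, 0) by filtering white noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=n + burn_in)
    psi = fractional_ma_weights(d, n + burn_in)
    # Keep the causal part of the convolution, then drop the burn-in samples
    filtered = np.convolve(noise, psi)[: n + burn_in]
    return filtered[burn_in:]

series = simulate_arfima_0d0(n=1000, d=0.3, seed=42)
print(f"Generated {len(series)} points")
print("First weights:", fractional_ma_weights(0.3, 3))
```

The weights decay hyperbolically rather than geometrically, which is what gives the series its long memory.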

### Fractional Brownian Motion (fBm)

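Fractional Brownian motion (fBm) generalizes Brownian motion through the Hurst exponent H: H = 0.5 recovers ordinary Brownian motion, H > 0.5 gives persistent (positively correlated) increments, and H < 0.5 gives anti-persistent (negatively correlated) increments.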

In [2]:
# Generate fBm with different Hurst exponents using the new convenience method
fbm_anti = generator.generate_fbm(n=1000, hurst=0.3)  # Anti-persistent
fbm_random = generator.generate_fbm(n=1000, hurst=0.5)  # Random walk
fbm_persistent = generator.generate_fbm(n=1000, hurst=0.7)  # Persistent

print(f"Anti-persistent fBm (H=0.3): {len(fbm_anti)} points")
print(f"Random walk fBm (H=0.5): {len(fbm_random)} points")
print(f"Persistent fBm (H=0.7): {len(fbm_persistent)} points")

Anti-persistent fBm (H=0.3): 1000 points
Random walk fBm (H=0.5): 1000 points
Persistent fBm (H=0.7): 1000 points


**Parameter Guide:**

- `n`: Number of data points
- `hurst`: Hurst exponent (0 < H < 1)
- `random_state`: For reproducibility
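
How `generate_fbm` builds its paths is not shown here. One textbook construction, sketched below under that caveat, draws fractional Gaussian noise from its exact covariance with a Cholesky factor and then cumulatively sums it. The helper names are hypothetical, and the O(n³) factorization makes this practical only for short series.

```python
import numpy as np

def fgn_autocovariance(hurst, n):
    """Autocovariance of unit-variance fGn at lags 0..n-1."""
    k = np.arange(n, dtype=float)
    return 0.5 * (np.abs(k + 1) ** (2 * hurst)
                  - 2 * np.abs(k) ** (2 * hurst)
                  + np.abs(k - 1) ** (2 * hurst))

def simulate_fbm_cholesky(n, hurst, seed=None):
    """Return (fbm_path, fgn_increments) from an exact Cholesky construction."""
    rng = np.random.default_rng(seed)
    gamma = fgn_autocovariance(hurst, n)
    # Toeplitz covariance matrix: entry (i, j) equals gamma[|i - j|]
    idx = np.arange(n)
    cov = gamma[np.abs(idx[:, None] - idx[None, :])]
    chol = np.linalg.cholesky(cov)
    fgn = chol @ rng.standard_normal(n)
    # The path starts at the first increment, i.e. fbm[0] is B(1), not B(0) = 0
    return np.cumsum(fgn), fgn

fbm_path, fgn_increments = simulate_fbm_cholesky(n=256, hurst=0.7, seed=42)
print(f"fBm path: {len(fbm_path)} points")
```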

### Fractional Gaussian Noise (fGn)

Fractional Gaussian noise (fGn) is the stationary sequence of increments of fBm, so cumulatively summing fGn yields fBm.

In [3]:
# Generate fGn with different Hurst exponents using the new convenience method
fgn_anti = generator.generate_fgn(n=1000, hurst=0.3)
fgn_random = generator.generate_fgn(n=1000, hurst=0.5)
fgn_persistent = generator.generate_fgn(n=1000, hurst=0.7)

print(f"Anti-persistent fGn (H=0.3): {len(fgn_anti)} points")
print(f"Random fGn (H=0.5): {len(fgn_random)} points")
print(f"Persistent fGn (H=0.7): {len(fgn_persistent)} points")

Anti-persistent fGn (H=0.3): 1000 points
Random fGn (H=0.5): 1000 points
Persistent fGn (H=0.7): 1000 points
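
The labels "anti-persistent", "random", and "persistent" refer to the sign of the correlation between neighbouring fGn values. For unit-variance fGn the lag-1 autocorrelation has the closed form ρ(1) = 2^(2H − 1) − 1, which the short sketch below evaluates for the three Hurst exponents used above (the function name is illustrative).

```python
def fgn_lag1_autocorrelation(hurst):
    """Theoretical lag-1 autocorrelation of fGn: 2**(2H - 1) - 1."""
    return 2.0 ** (2.0 * hurst - 1.0) - 1.0

for hurst in (0.3, 0.5, 0.7):
    rho = fgn_lag1_autocorrelation(hurst)
    print(f"H={hurst}: lag-1 autocorrelation = {rho:+.4f}")
```

The value is negative for H < 0.5, exactly zero at H = 0.5, and positive for H > 0.5.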


## Data Contaminators
### Polynomial Trends

Add polynomial trends to simulate real-world non-stationarities.

In [4]:
from data_processing.synthetic_generator import DataContaminator

# Initialize contaminator
contaminator = DataContaminator(random_state=42)

# Add different polynomial trends
linear_trend = contaminator.add_polynomial_trend(arfima_medium, degree=1, amplitude=0.1)
quadratic_trend = contaminator.add_polynomial_trend(arfima_medium, degree=2, amplitude=0.05)
cubic_trend = contaminator.add_polynomial_trend(arfima_medium, degree=3, amplitude=0.02)

print(f"Original signal variance: {np.var(arfima_medium):.4f}")
print(f"Linear trend variance: {np.var(linear_trend):.4f}")
print(f"Quadratic trend variance: {np.var(quadratic_trend):.4f}")

Original signal variance: 0.0011
Linear trend variance: 0.0011
Quadratic trend variance: 0.0011


**Parameter Guide:**

- `signal`: Input time series
- `degree`: Polynomial degree (1=linear, 2=quadratic, etc.)
- `amplitude`: Trend strength relative to signal
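
The exact scaling used by `add_polynomial_trend` is not documented in this tutorial. The sketch below shows one plausible reading of "relative to signal", assuming the trend rises from 0 to `amplitude` times the signal's standard deviation. The function name and the scaling rule are assumptions.

```python
import numpy as np

def add_polynomial_trend_sketch(signal, degree=1, amplitude=0.1):
    """Add t**degree on [0, 1], scaled by amplitude * std(signal) (assumed scaling)."""
    signal = np.asarray(signal, dtype=float)
    t = np.linspace(0.0, 1.0, len(signal))
    trend = t ** degree  # rises from 0 at the start to 1 at the end
    return signal + amplitude * np.std(signal) * trend

rng = np.random.default_rng(42)
base = rng.normal(0.0, 0.03, size=1000)
with_trend = add_polynomial_trend_sketch(base, degree=1, amplitude=0.1)
print(f"Variance before: {np.var(base):.6f}, after: {np.var(with_trend):.6f}")
```

Under this assumed scaling, a small `amplitude` changes the variance only slightly, so printing more decimals (or plotting the difference) is a better check than comparing variances rounded to four places.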

### Periodicity

Add periodic components to simulate seasonal patterns.

In [5]:
# Add periodicity (note: frequency is a positional argument, not keyword)
periodic_signal = contaminator.add_periodicity(arfima_medium, 50, amplitude=0.2)
seasonal_signal = contaminator.add_periodicity(arfima_medium, 100, amplitude=0.15)

print(f"Periodic signal variance: {np.var(periodic_signal):.4f}")
print(f"Seasonal signal variance: {np.var(seasonal_signal):.4f}")

Periodic signal variance: 0.0011
Seasonal signal variance: 0.0011


**Parameter Guide:**

- `signal`: Input time series
- `frequency`: Period length in number of points (despite the name, this is a period, not a frequency); pass it positionally
- `amplitude`: Periodic component strength
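
As a minimal sketch of what a periodic contaminator might do, the code below adds a sine wave whose period is given in samples. The scaling by the signal's standard deviation and the function name are assumptions, since the implementation of `add_periodicity` is not shown here.

```python
import numpy as np

def add_periodicity_sketch(signal, period, amplitude=0.2):
    """Add a sine wave with the given period in samples (assumed scaling)."""
    signal = np.asarray(signal, dtype=float)
    t = np.arange(len(signal))
    seasonal = np.sin(2.0 * np.pi * t / period)
    return signal + amplitude * np.std(signal) * seasonal

rng = np.random.default_rng(42)
base = rng.normal(0.0, 0.03, size=1000)
periodic = add_periodicity_sketch(base, period=50, amplitude=0.2)
added = periodic - base  # the seasonal component alone
print(f"Peak of added component: {np.max(np.abs(added)):.6f}")
```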

### Outliers

Add outliers to test the robustness of analysis methods.

In [6]:
# Add different types of outliers
outlier_signal = contaminator.add_outliers(arfima_medium, fraction=0.02, magnitude=4.0)
spike_signal = contaminator.add_outliers(arfima_medium, fraction=0.01, magnitude=6.0)

print(f"Outlier signal variance: {np.var(outlier_signal):.4f}")
print(f"Spike signal variance: {np.var(spike_signal):.4f}")

Outlier signal variance: 0.0013
Spike signal variance: 0.0012


**Parameter Guide:**

- `signal`: Input time series
- `fraction`: Proportion of points to convert to outliers
- `magnitude`: Outlier strength in standard deviations
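
One way to read the `fraction` and `magnitude` parameters is sketched below: pick a random subset of points and shift each by `magnitude` standard deviations with a random sign. This is an illustrative assumption about the behaviour, not the code behind `add_outliers`.

```python
import numpy as np

def add_outliers_sketch(signal, fraction=0.02, magnitude=4.0, seed=None):
    """Shift a random fraction of points by +/- magnitude * std(signal)."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    contaminated = signal.copy()
    n_outliers = int(round(fraction * len(signal)))
    # Choose distinct positions so no point is shifted twice
    idx = rng.choice(len(signal), size=n_outliers, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_outliers)
    contaminated[idx] += signs * magnitude * np.std(signal)
    return contaminated, idx

rng = np.random.default_rng(0)
base = rng.normal(0.0, 0.03, size=1000)
spiked, idx = add_outliers_sketch(base, fraction=0.02, magnitude=4.0, seed=42)
print(f"Modified {len(idx)} of {len(base)} points")
```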

### Heavy Tails

Add heavy-tailed noise for non-Gaussian processes.

In [7]:
# Add heavy-tailed noise
heavy_tail_signal = contaminator.add_heavy_tails(arfima_medium, df=2.0, fraction=0.15)
cauchy_signal = contaminator.add_heavy_tails(arfima_medium, df=1.0, fraction=0.1)

print(f"Heavy tail signal variance: {np.var(heavy_tail_signal):.4f}")
print(f"Cauchy signal variance: {np.var(cauchy_signal):.4f}")

Heavy tail signal variance: 0.0011
Cauchy signal variance: 0.0012


**Parameter Guide:**

- `signal`: Input time series
- `df`: Degrees of freedom for the Student's t-distribution (lower = heavier tails; `df=1` is the Cauchy distribution)
- `fraction`: Proportion of points to replace with heavy-tailed noise
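
The sketch below illustrates the replacement idea with NumPy's Student's t sampler. The scaling by the signal's standard deviation and the function name are assumptions, and `add_heavy_tails` may work differently.

```python
import numpy as np

def add_heavy_tails_sketch(signal, df=2.0, fraction=0.15, seed=None):
    """Replace a random fraction of points with scaled Student's t draws."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    contaminated = signal.copy()
    n_replace = int(round(fraction * len(signal)))
    idx = rng.choice(len(signal), size=n_replace, replace=False)
    # Assumed scaling: t draws are multiplied by the original standard deviation
    contaminated[idx] = np.std(signal) * rng.standard_t(df, size=n_replace)
    return contaminated, idx

rng = np.random.default_rng(0)
base = rng.normal(0.0, 0.03, size=1000)
heavy, idx = add_heavy_tails_sketch(base, df=2.0, fraction=0.15, seed=42)
print(f"Replaced {len(idx)} of {len(base)} points")
```

The t-distribution has infinite variance for `df` ≤ 2, so sample variances of such signals are unstable from run to run and are a weak summary of the contamination.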

## Advanced Generation
### Comprehensive Dataset Generation

Generate a complete set of synthetic datasets for comprehensive testing.

In [8]:
# Generate comprehensive dataset
comprehensive_dataset = generator.generate_comprehensive_dataset(
    n=1000,
    save=True,
)

print("Generated datasets:")
print(f"Clean signals: {len(comprehensive_dataset['clean_signals'])}")
print(f"Contaminated signals: {len(comprehensive_dataset['contaminated_signals'])}")
print(f"Irregular signals: {len(comprehensive_dataset['irregular_signals'])}")

Generating comprehensive synthetic dataset...
Saved synthetic data: data\raw\arfima_d0.1.csv
Saved metadata: data\metadata\arfima_d0.1_metadata.json
Saved synthetic data: data\raw\arfima_d0.2.csv
Saved metadata: data\metadata\arfima_d0.2_metadata.json
Saved synthetic data: data\raw\arfima_d0.3.csv
Saved metadata: data\metadata\arfima_d0.3_metadata.json
Saved synthetic data: data\raw\arfima_d0.4.csv
Saved metadata: data\metadata\arfima_d0.4_metadata.json
Saved synthetic data: data\raw\fbm_H0.3.csv
Saved metadata: data\metadata\fbm_H0.3_metadata.json
Saved synthetic data: data\raw\fbm_H0.5.csv
Saved metadata: data\metadata\fbm_H0.5_metadata.json
Saved synthetic data: data\raw\fbm_H0.7.csv
Saved metadata: data\metadata\fbm_H0.7_metadata.json
Saved synthetic data: data\raw\fgn_H0.3.csv
Saved metadata: data\metadata\fgn_H0.3_metadata.json
Saved synthetic data: data\raw\fgn_H0.5.csv
Saved metadata: data\metadata\fgn_H0.5_metadata.json
Saved synthetic data: data\raw\fgn_H0.7.csv
Saved metadat

**Note**: With `save=True`, `generate_comprehensive_dataset` writes its files to the default data directory. To use a different location, initialize `SyntheticDataGenerator` with a custom `data_root` parameter:

In [9]:
# Initialize with custom data root
generator = SyntheticDataGenerator(data_root="custom_data", random_state=42)

# Generate comprehensive dataset
comprehensive_dataset = generator.generate_comprehensive_dataset(n=1000, save=True)

Ensured directory exists: custom_data
Ensured directory exists: custom_data\raw
Ensured directory exists: custom_data\processed
Ensured directory exists: custom_data\metadata
Generating comprehensive synthetic dataset...
Saved synthetic data: custom_data\raw\arfima_d0.1.csv
Saved metadata: custom_data\metadata\arfima_d0.1_metadata.json
Saved synthetic data: custom_data\raw\arfima_d0.2.csv
Saved metadata: custom_data\metadata\arfima_d0.2_metadata.json
Saved synthetic data: custom_data\raw\arfima_d0.3.csv
Saved metadata: custom_data\metadata\arfima_d0.3_metadata.json
Saved synthetic data: custom_data\raw\arfima_d0.4.csv
Saved metadata: custom_data\metadata\arfima_d0.4_metadata.json
Saved synthetic data: custom_data\raw\fbm_H0.3.csv
Saved metadata: custom_data\metadata\fbm_H0.3_metadata.json
Saved synthetic data: custom_data\raw\fbm_H0.5.csv
Saved metadata: custom_data\metadata\fbm_H0.5_metadata.json
Saved synthetic data: custom_data\raw\fbm_H0.7.csv
Saved metadata: custom_data\metadata\f

### Custom Signal Generation

For more control, use the underlying pure generator directly.

In [10]:
# Access the pure generator for advanced usage
pure_generator = generator.pure_generator

# Generate ARFIMA with custom parameters
custom_arfima = pure_generator.generate_arfima(
    n=1000, 
    d=0.25, 
    ar_params=[0.3, -0.1], 
    ma_params=[0.2], 
    sigma=0.8
)

print(f"Custom ARFIMA: {len(custom_arfima)} points")
print(f"AR parameters: [0.3, -0.1]")
print(f"MA parameters: [0.2]")

Custom ARFIMA: 1000 points
AR parameters: [0.3, -0.1]
MA parameters: [0.2]
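
When choosing custom `ar_params` and `ma_params`, it can help to check stationarity and invertibility first. The sketch below tests whether the roots of the AR and MA polynomials lie outside the unit circle. The sign conventions in the comments are assumptions and may not match the library's, and the helper names are hypothetical.

```python
import numpy as np

def roots_outside_unit_circle(coeffs_low_to_high):
    """True if every root of the polynomial lies strictly outside the unit circle."""
    # np.roots expects the highest-degree coefficient first
    roots = np.roots(np.asarray(coeffs_low_to_high, dtype=float)[::-1])
    return bool(np.all(np.abs(roots) > 1.0))

def is_stationary(ar_params):
    # Assumed convention: x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + noise,
    # so the AR polynomial is 1 - phi_1 z - ... - phi_p z^p
    return roots_outside_unit_circle(np.r_[1.0, -np.asarray(ar_params, dtype=float)])

def is_invertible(ma_params):
    # Assumed convention: MA polynomial 1 + theta_1 z + ... + theta_q z^q
    return roots_outside_unit_circle(np.r_[1.0, np.asarray(ma_params, dtype=float)])

print("AR [0.3, -0.1] stationary:", is_stationary([0.3, -0.1]))
print("MA [0.2] invertible:", is_invertible([0.2]))
```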


## Data Quality and Validation
### Signal Properties

Check the statistical properties of generated signals.

In [11]:
import numpy as np

# First, let's create a contaminated signal for comparison
from data_processing.synthetic_generator import DataContaminator

# Initialize contaminator
contaminator = DataContaminator(random_state=42)

# Create a contaminated version of our ARFIMA signal
contaminated_signal = contaminator.add_polynomial_trend(arfima_medium, degree=1, amplitude=0.1)
contaminated_signal = contaminator.add_periodicity(contaminated_signal, 50, amplitude=0.2)
contaminated_signal = contaminator.add_outliers(contaminated_signal, fraction=0.02, magnitude=3.0)

def analyze_signal_properties(signal, name):
    """Analyze basic properties of a generated signal."""
    print(f"\n{name} Properties:")
    print(f"  Length: {len(signal)}")
    print(f"  Mean: {np.mean(signal):.4f}")
    print(f"  Std: {np.std(signal):.4f}")
    print(f"  Min: {np.min(signal):.4f}")
    print(f"  Max: {np.max(signal):.4f}")
    print(f"  Variance: {np.var(signal):.4f}")

# Analyze different signal types
signals = {
    "ARFIMA (d=0.3)": arfima_medium,
    "fBm (H=0.7)": fbm_persistent,
    "fGn (H=0.7)": fgn_persistent,
    "Contaminated": contaminated_signal
}

for name, signal in signals.items():
    analyze_signal_properties(signal, name)


ARFIMA (d=0.3) Properties:
  Length: 1000
  Mean: 0.0000
  Std: 0.0327
  Min: -0.1189
  Max: 0.0889
  Variance: 0.0011

fBm (H=0.7) Properties:
  Length: 1000
  Mean: 5.6135
  Std: 5.6212
  Min: -6.8601
  Max: 14.5699
  Variance: 31.5976

fGn (H=0.7) Properties:
  Length: 1000
  Mean: -0.0055
  Std: 0.0221
  Min: -0.0828
  Max: 0.0611
  Variance: 0.0005

Contaminated Properties:
  Length: 1000
  Mean: 0.0009
  Std: 0.0352
  Min: -0.1669
  Max: 0.2502
  Variance: 0.0012


### Long-Range Dependence Validation

Verify that generated signals exhibit the expected long-range dependence.

In [12]:
from analysis.dfa_analysis import dfa
from analysis.rs_analysis import rs_analysis

def validate_lrd(signal, name):
    """Validate long-range dependence properties."""
    print(f"\n{name} LRD Validation:")
    
    try:
        # DFA analysis
        scales, flucts, dfa_summary = dfa(signal, order=1)
        # DFA gives alpha, convert to Hurst: H = alpha/2
        dfa_hurst = dfa_summary.alpha / 2
        print(f"  DFA Alpha: {dfa_summary.alpha:.3f}")
        print(f"  DFA Hurst (H = α/2): {dfa_hurst:.3f}")
        
        # R/S analysis
        scales_rs, rs_values, rs_summary = rs_analysis(signal)
        print(f"  R/S Hurst: {rs_summary.hurst:.3f}")
        
        # Check consistency between DFA and R/S
        hurst_diff = abs(dfa_hurst - rs_summary.hurst)
        if hurst_diff < 0.1:
            print(f"  ✓ Hurst estimates consistent (diff: {hurst_diff:.3f})")
        else:
            print(f"  ⚠ Hurst estimates differ (diff: {hurst_diff:.3f})")
            
    except Exception as e:
        print(f"  ✗ Analysis failed: {e}")

# Validate all signals
for name, signal in signals.items():
    validate_lrd(signal, name)


ARFIMA (d=0.3) LRD Validation:
  DFA Alpha: 0.720
  DFA Hurst (H = α/2): 0.360
  R/S Hurst: 0.764
  ⚠ Hurst estimates differ (diff: 0.404)

fBm (H=0.7) LRD Validation:
  DFA Alpha: 1.669
  DFA Hurst (H = α/2): 0.835
  R/S Hurst: 1.011
  ⚠ Hurst estimates differ (diff: 0.177)

fGn (H=0.7) LRD Validation:
  DFA Alpha: 0.738
  DFA Hurst (H = α/2): 0.369
  R/S Hurst: 0.755
  ⚠ Hurst estimates differ (diff: 0.386)

Contaminated LRD Validation:
  DFA Alpha: 0.709
  DFA Hurst (H = α/2): 0.354
  R/S Hurst: 0.755
  ⚠ Hurst estimates differ (diff: 0.400)


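**Important Note**: Different analysis methods return different measures:

- **DFA**: Returns `alpha` (scaling exponent). For stationary, noise-like input (fGn, ARFIMA) α estimates H directly, and for integrated, walk-like input (fBm) α ≈ H + 1. The `H = α/2` conversion in the cell above follows neither relation, which is why its converted values (0.35 to 0.37) fall far below the R/S estimates while the raw `alpha` values (0.71 to 0.74) agree with them for the noise-like signals.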
- **R/S**: Returns `hurst` directly (Hurst exponent)
- **MFDFA**: Returns `hq` array (generalized Hurst exponents for different q values)
- **Wavelet**: Returns `hurst` directly (Hurst exponent)
- **Spectral**: Returns `hurst` directly (Hurst exponent)
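
To make the relation between the DFA exponent and H concrete, here is a minimal first-order DFA in NumPy, applied to white noise (H = 0.5, noise-like) and to its cumulative sum (H = 0.5, walk-like). It is a simplified sketch with hypothetical names and fixed scales, not the project's `dfa` function.

```python
import numpy as np

def dfa_alpha(signal, scales, order=1):
    """Estimate the DFA scaling exponent alpha from non-overlapping windows."""
    signal = np.asarray(signal, dtype=float)
    profile = np.cumsum(signal - signal.mean())  # integrate the centered series
    fluctuations = []
    for s in scales:
        n_windows = len(profile) // s
        windows = profile[: n_windows * s].reshape(n_windows, s).T  # (s, n_windows)
        # Least-squares polynomial detrending of every window at once
        design = np.vander(np.arange(s, dtype=float), order + 1)
        coeffs, *_ = np.linalg.lstsq(design, windows, rcond=None)
        residuals = windows - design @ coeffs
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    # alpha is the slope of log F(s) against log s
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

def hurst_from_alpha(alpha, walk_like):
    """alpha ~ H for noise-like input, alpha ~ H + 1 for walk-like input."""
    return alpha - 1.0 if walk_like else alpha

rng = np.random.default_rng(42)
white_noise = rng.standard_normal(8192)
random_walk = np.cumsum(white_noise)
scales = [16, 32, 64, 128, 256]

print(f"alpha (white noise): {dfa_alpha(white_noise, scales):.3f}")
print(f"alpha (random walk): {dfa_alpha(random_walk, scales):.3f}")
```

Both series have H = 0.5, yet their `alpha` values differ by about one, so the conversion has to depend on whether the input is noise-like or walk-like.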

## Data Storage and Management
### Saving Generated Data

Save generated datasets for later use.

In [13]:
# Save individual signals
np.save("data/raw/arfima_medium.npy", arfima_medium)
np.save("data/raw/fbm_persistent.npy", fbm_persistent)

# Save comprehensive dataset
import pickle
with open("data/raw/comprehensive_dataset.pkl", "wb") as f:
    pickle.dump(comprehensive_dataset, f)

print("Data saved successfully!")

Data saved successfully!


### Loading Saved Data

In [14]:
# Load individual signals
loaded_arfima = np.load("data/raw/arfima_medium.npy")
loaded_fbm = np.load("data/raw/fbm_persistent.npy")

# Load comprehensive dataset
with open("data/raw/comprehensive_dataset.pkl", "rb") as f:
    loaded_comprehensive = pickle.load(f)

print(f"Loaded ARFIMA: {len(loaded_arfima)} points")
print(f"Loaded comprehensive dataset: {len(loaded_comprehensive['clean_signals'])} clean signals")

Loaded ARFIMA: 1000 points
Loaded comprehensive dataset: 10 clean signals
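
Pickle files can execute arbitrary code when loaded, so they are best reserved for data you produced yourself. For collections of plain arrays, one alternative is a compressed `.npz` archive with a JSON sidecar for the generation parameters. The helper names and file layout below are illustrative assumptions, not part of the project's API.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_signals(npz_path, signals, metadata):
    """Write arrays to a compressed .npz archive and parameters to a JSON sidecar."""
    npz_path = Path(npz_path)
    np.savez_compressed(npz_path, **signals)
    npz_path.with_suffix(".json").write_text(json.dumps(metadata, indent=2))

def load_signals(npz_path):
    """Read back the arrays and the metadata written by save_signals."""
    npz_path = Path(npz_path)
    with np.load(npz_path) as archive:  # allow_pickle defaults to False
        signals = {name: archive[name] for name in archive.files}
    metadata = json.loads(npz_path.with_suffix(".json").read_text())
    return signals, metadata

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "signals.npz"
    original = {"noise_a": np.arange(5.0), "noise_b": np.ones(3)}
    save_signals(target, original, {"n": 5, "seed": 42})
    restored, info = load_signals(target)

print(sorted(restored), info)
```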


## Best Practices
### Reproducibility

- Always set `random_state` for reproducible results
- Document all generation parameters
- Use version control for generation scripts
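
As a small illustration of the first two points, the sketch below draws from a local NumPy `Generator` seeded from a recorded parameter dictionary, so the result does not depend on global random state. Whether `SyntheticDataGenerator` uses a local generator internally is not stated in this tutorial, and the function name here is hypothetical.

```python
import numpy as np

def generate_noise(n, seed):
    """Draw n standard normal samples from a local, seeded generator."""
    rng = np.random.default_rng(seed)  # independent of np.random's global state
    return rng.standard_normal(n)

params = {"n": 1000, "seed": 42}  # store this dictionary alongside the data
first = generate_noise(**params)
np.random.seed(0)  # unrelated global seeding elsewhere in the session
second = generate_noise(**params)

print("Identical across runs:", np.array_equal(first, second))
```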

### Data Quality

- Generate enough data points (at least 500 recommended)
- Validate statistical properties
- Test with different contamination levels

### Performance

- Use batch generation for large datasets
- Save intermediate results
- Monitor memory usage for very long series

### Validation

- Always validate generated signals with analysis methods
- Compare with theoretical expectations
- Test robustness with contaminated data

## Troubleshooting

### Common Issues

**Issue**: Generated signals don't show expected LRD
**Solution**: Check that the parameters are in the LRD range (0 < d < 0.5 for ARFIMA, H > 0.5 for fBm and fGn) and use a sufficiently long series (at least 500 points)

**Issue**: Memory errors with large datasets
**Solution**: Generate data in smaller batches or use streaming approaches

**Issue**: Inconsistent results between runs
**Solution**: Ensure `random_state` is set and check for changes to global random state

**Issue**: Contamination not visible
**Solution**: Increase amplitude parameters and check signal-to-noise ratios

**Issue**: `TypeError: ArmaProcess.generate_sample() got an unexpected keyword argument 'random_state'`
**Solution**: This issue has been fixed in the latest version. The method now properly handles reproducibility by setting the numpy random seed before calling `generate_sample()`. If you encounter this error, please update to the latest version.

**Issue**: `TypeError: generate_comprehensive_dataset() got an unexpected keyword argument 'data_root'`
**Solution**: The `generate_comprehensive_dataset()` method doesn't accept a `data_root` parameter. Use the constructor to set the data root: `SyntheticDataGenerator(data_root="custom_path", random_state=42)`.

### Recent Fixes Applied

The following issues have been resolved in recent updates:

1. **ArmaProcess Parameter Error**: Fixed `random_state` parameter issue in ARFIMA generation
2. **Method Parameter Validation**: Corrected parameter lists for all generation methods
3. **Import Path Updates**: Updated all import statements to match current codebase structure
4. **Tutorial Accuracy**: All code examples now work with the current implementation

### Getting Help

If you encounter issues not covered here:

1. **Check the project documentation**
2. **Review the API reference**
3. **Run the demo scripts**: `python scripts/demo_synthetic_data.py`
4. **Create an issue on GitHub** with:
    - Error message and traceback
    - Code that caused the error
    - Your system information (Python version, OS)
    - Expected vs. actual behavior

## Next Steps

- **Tutorial 3**: Learn advanced analysis methods
- **Tutorial 4**: Understand statistical validation techniques
- **Tutorial 5**: Create comprehensive visualizations
- **Tutorial 6**: Submit your own models and datasets