# Indicator Association Analysis

This notebook analyzes which technical indicators are most associated with the **trade** class (vs hold) using:

- **Chi-Square Test**: Statistical dependence
- **Mutual Information**: Information content
- **Point-Biserial Correlation**: Correlation between each indicator signal and the binary target (equivalent to the phi coefficient when both are binary)
- **Fisher's Exact Test**: Precise 2x2 contingency
- **Apriori Association Rules**: Indicator combinations

Results can be exported as a **feature mask** for LSTM feature selection.
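As a rough illustration of how the correlation and chi-square measures relate, the sketch below uses hypothetical toy arrays and assumes the indicator columns are 0/1 signals. It is not the implementation inside `IndicatorAssociationAnalyzer`, whose internals are not shown in this notebook. For two binary variables, the Pearson (point-biserial) correlation equals the phi coefficient, and the uncorrected chi-square statistic is `n * phi**2`.

```python
import numpy as np

# Hypothetical toy data: x is a 0/1 indicator signal, y is 1 for "trade" and 0 for "hold".
x = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1])

# Pearson correlation of two binary vectors (the point-biserial case).
r = np.corrcoef(x, y)[0, 1]

# Phi coefficient from the 2x2 contingency counts.
a = int(np.sum((x == 1) & (y == 1)))  # signal on, trade
b = int(np.sum((x == 1) & (y == 0)))  # signal on, hold
c = int(np.sum((x == 0) & (y == 1)))  # signal off, trade
d = int(np.sum((x == 0) & (y == 0)))  # signal off, hold
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Uncorrected Pearson chi-square statistic for the same table.
chi2 = len(x) * phi ** 2

print(f"r={r:.4f}  phi={phi:.4f}  chi2={chi2:.4f}")
```

The two correlation values agree, which is why a single 2x2 table is enough to derive several of the measures listed above.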

In [1]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from crypto_analysis import IndicatorAssociationAnalyzer

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline



## 1. Load Data

In [2]:
# Load the DOGE dataset
df = pd.read_csv('doge.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['tradeable'].value_counts())
print(f"\nTrade percentage: {(df['tradeable'] == 'trade').mean():.2%}")

Dataset shape: (2388, 124)

Class distribution:
tradeable
hold     2052
trade     336
Name: count, dtype: int64

Trade percentage: 14.07%


In [3]:
# Show indicator columns
indicator_cols = [col for col in df.columns if col.endswith(('_gs_entry', '_gs_exit'))]
print(f"Number of indicator columns: {len(indicator_cols)}")
print(f"\nFirst 10 indicators:")
for col in indicator_cols[:10]:
    print(f"  - {col}")

Number of indicator columns: 114

First 10 indicators:
  - TRIX_gs_entry
  - TRIX_gs_exit
  - BOP_gs_entry
  - BOP_gs_exit
  - MOM_gs_entry
  - MOM_gs_exit
  - ROC_gs_entry
  - ROC_gs_exit
  - AROON_gs_entry
  - AROON_gs_exit


## 2. Initialize Analyzer

In [4]:
# Create analyzer
analyzer = IndicatorAssociationAnalyzer(
    df=df,
    target_col='tradeable',
    positive_class='trade'
)

# Print quick summary
analyzer.print_summary(top_k=15)



INDICATOR ASSOCIATION ANALYSIS SUMMARY

Dataset: 2388 samples
  - Trade: 336
  - Hold: 2052

Total features: 114
  - Chi² significant (p<0.05): 67
  - Fisher significant (p<0.05): 75
  - Lift > 1.0: 54
  - Lift > 1.5: 19

Top 15 Trade-Associated Indicators:
------------------------------------------------------------
   1. ATR_gs_entry              score=1.0000 lift=3.21
   2. WILLR_gs_exit             score=0.7119 lift=1.83
   3. STOCH_gs_exit             score=0.7091 lift=1.80
   4. STOCHF_gs_exit            score=0.6852 lift=1.77
   5. CMO_gs_exit               score=0.6622 lift=1.89
   6. CCI_gs_exit               score=0.6593 lift=1.70
   7. RSI_gs_exit               score=0.6491 lift=1.61
   8. STOCHRSI_gs_exit          score=0.5805 lift=1.49
   9. ULTOSC_gs_exit            score=0.5460 lift=1.70
  10. HT_TRENDLINE_gs_entry     score=0.5331 lift=2.28
  11. EMA_gs_exit               score=0.5187 lift=2.51
  12. MACDFIX_gs_entry          score=0.5157 lift=2.32
  13. STOCHRSI_gs_en

## 3. Statistical Analysis

In [None]:
# Compute all statistics
stats_df = analyzer.compute_all_statistics()

# Display top 20 features
print("Top 20 Trade-Associated Indicators:")
stats_df[['feature', 'composite_score', 'chi2_statistic', 'mutual_info', 'lift', 'trade_support']].head(20)

In [None]:
# Chi-Square results
chi2_df = analyzer.compute_chi_square()
print(f"\nChi-Square Significant Features (p < 0.05): {sum(chi2_df['chi2_pvalue'] < 0.05)}")
print(f"Chi-Square Highly Significant (p < 0.01): {sum(chi2_df['chi2_pvalue'] < 0.01)}")
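For intuition about what the chi-square and Fisher p-values measure, here is a minimal pure-Python sketch on a single 2x2 table. The function names are hypothetical, the chi-square is the uncorrected Pearson statistic, and the Fisher test is one-sided (testing for over-representation of trades when the signal fires). The analyzer may use a continuity correction or a two-sided test, so its numbers need not match this sketch.

```python
from math import comb

def chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def fisher_one_sided(a, b, c, d):
    """P(at least `a` co-occurrences) under the hypergeometric null of independence."""
    row1 = a + b          # samples where the signal fires
    col1 = a + c          # samples labelled trade
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Toy table: rows = signal on/off, columns = trade/hold.
a, b, c, d = 3, 1, 1, 5
print(f"chi2={chi2_2x2(a, b, c, d):.4f}  fisher_p={fisher_one_sided(a, b, c, d):.4f}")
```

Fisher's test is exact for small cell counts, which is one plausible reason the two tests flag different numbers of features in the summary above.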

In [None]:
# Mutual Information results
mi_df = analyzer.compute_mutual_information()
print(f"\nMutual Information Statistics:")
print(f"  Mean: {mi_df['mutual_info'].mean():.4f}")
print(f"  Max:  {mi_df['mutual_info'].max():.4f}")
print(f"  Min:  {mi_df['mutual_info'].min():.4f}")
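Mutual information between a binary indicator and the binary target can be computed directly from empirical frequencies. The sketch below is an assumption about the general technique, not the body of `compute_mutual_information()`. The helper name `binary_mutual_info` is hypothetical, and the result is in nats.

```python
import numpy as np

def binary_mutual_info(x, y):
    """Mutual information (in nats) between two 0/1 arrays, from empirical frequencies."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = np.mean((x == xv) & (y == yv))  # joint frequency
            p_x = np.mean(x == xv)                 # marginal of x
            p_y = np.mean(y == yv)                 # marginal of y
            if p_xy > 0:                           # empty cells contribute nothing
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return float(mi)

# A signal identical to the target carries the target's full entropy;
# an independent signal carries none.
print(binary_mutual_info([0, 1, 0, 1], [0, 1, 0, 1]))
print(binary_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]))
```

Unlike lift, mutual information is symmetric in direction: a signal that reliably marks hold periods scores as high as one that marks trades.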

## 4. Visualizations

In [None]:
# Feature importance plot
analyzer.plot_feature_importance(top_k=25, figsize=(14, 10))
plt.show()

In [None]:
# Chi-Square distribution
analyzer.plot_chi_square_distribution(figsize=(12, 5))
plt.show()

In [None]:
# Lift analysis
analyzer.plot_lift_analysis(figsize=(14, 5))
plt.show()

## 5. Association Rule Mining

Find indicator combinations that predict the **trade** class.
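The rule metrics reported below (support, confidence, lift) follow the usual association-rule definitions. The sketch that follows evaluates one hand-picked rule on a hypothetical toy frame. `rule_metrics` and the column names are illustrative only, and the analyzer's exact definitions (for example, whether support counts the antecedent alone) may differ.

```python
import pandas as pd

def rule_metrics(frame, antecedent, target_col="tradeable", positive="trade"):
    """Support, confidence and lift for the rule: all antecedent signals fire -> positive class."""
    fires = frame[list(antecedent)].eq(1).all(axis=1)   # every antecedent signal is on
    is_pos = frame[target_col].eq(positive)
    both = fires & is_pos
    support = both.mean()                                # share of all rows matching the rule
    confidence = both.sum() / fires.sum() if fires.any() else 0.0
    lift = confidence / is_pos.mean()                    # confidence relative to the base rate
    return float(support), float(confidence), float(lift)

# Hypothetical toy data with two indicator signals.
toy = pd.DataFrame({
    "A_gs_entry": [1, 1, 1, 0, 0, 0, 1, 0],
    "B_gs_exit":  [1, 1, 0, 0, 1, 0, 1, 0],
    "tradeable":  ["trade", "trade", "hold", "hold", "hold", "hold", "hold", "hold"],
})
support, confidence, lift = rule_metrics(toy, ["A_gs_entry", "B_gs_exit"])
print(f"support={support:.4f}, confidence={confidence:.3f}, lift={lift:.2f}")
```

A lift above 1.0 means trades are more frequent when the combination fires than in the dataset overall, which is what the `min_lift=1.0` threshold filters for.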

In [None]:
# Mine association rules
rules = analyzer.mine_association_rules(
    min_support=0.005,
    min_confidence=0.2,
    min_lift=1.0,
    max_antecedent_len=3,
    max_rules=500,
    top_k_features=100,
    max_itemsets=900000  # default, can increase if needed
)


print(f"Found {len(rules)} association rules\n")

# Display top rules
print("Top 15 Association Rules (by lift):")
print("-" * 80)
for i, rule in enumerate(rules[:15]):
    ant_str = " + ".join(rule.antecedent)
    print(f"{i+1:2d}. {ant_str}")
    print(f"    -> trade  (support={rule.support:.4f}, confidence={rule.confidence:.3f}, lift={rule.lift:.2f})")
    print()

In [None]:
# Association network visualization
try:
    analyzer.plot_association_network(min_lift=1.2, max_rules=40, figsize=(14, 12))
    plt.show()
except Exception as e:
    print(f"Network plot failed (is networkx installed?): {e}")

## 6. Entry vs Exit Signal Analysis

Compare entry signals vs exit signals for each indicator.

In [None]:
# Group by indicator (combining entry/exit)
indicator_scores = {}

for _, row in stats_df.iterrows():
    feat = row['feature']
    # Extract indicator name (remove _gs_entry or _gs_exit suffix)
    if '_gs_entry' in feat:
        ind_name = feat.replace('_gs_entry', '')
        signal_type = 'entry'
    elif '_gs_exit' in feat:
        ind_name = feat.replace('_gs_exit', '')
        signal_type = 'exit'
    else:
        continue
    
    if ind_name not in indicator_scores:
        indicator_scores[ind_name] = {'entry': 0, 'exit': 0}
    indicator_scores[ind_name][signal_type] = row['composite_score']

# Convert to DataFrame
ind_df = pd.DataFrame(indicator_scores).T
ind_df['combined'] = ind_df['entry'] + ind_df['exit']
ind_df['entry_dominant'] = ind_df['entry'] > ind_df['exit']
ind_df = ind_df.sort_values('combined', ascending=False)

print("Top 20 Indicators (entry + exit combined score):")
ind_df.head(20)

In [None]:
# Plot entry vs exit
fig, ax = plt.subplots(figsize=(10, 10))

top_20 = ind_df.head(20)
colors = ['green' if x else 'red' for x in top_20['entry_dominant']]

ax.scatter(top_20['entry'], top_20['exit'], c=colors, s=100, alpha=0.7)

# Add labels
for idx, row in top_20.iterrows():
    ax.annotate(idx, (row['entry'], row['exit']), fontsize=8, ha='center')

# Diagonal line
max_val = max(top_20['entry'].max(), top_20['exit'].max())
ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.3)

ax.set_xlabel('Entry Signal Score')
ax.set_ylabel('Exit Signal Score')
ax.set_title('Entry vs Exit Signal Trade Association\n(Green=Entry dominant, Red=Exit dominant)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Complete Analysis Result

In [None]:
# Run complete analysis
result = analyzer.analyze_all(
    min_support=0.005,
    min_confidence=0.2,
    min_lift=1.0,
    max_antecedent_len=3
)

print(f"Analysis complete!")
print(f"  - Feature statistics: {len(result.feature_statistics)} features")
print(f"  - Association rules: {len(result.association_rules)} rules")
print(f"  - Top indicators: {len(result.top_trade_indicators)}")
print(f"  - Indicator combinations: {len(result.indicator_combinations)}")

In [None]:
# Top trade indicators
print("Top 20 Trade-Associated Indicators:")
for i, ind in enumerate(result.top_trade_indicators[:20]):
    print(f"  {i+1:2d}. {ind}")

In [None]:
# Best indicator combinations
print("\nBest Indicator Combinations (from association rules):")
for i, combo in enumerate(result.indicator_combinations[:15]):
    print(f"  {i+1:2d}. {' + '.join(combo)}")

## 8. Export for LSTM

Export the results as a feature mask for `LSTMMetaheuristicOptimizer`.
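A percentile-based mask can be sketched as follows. This assumes `percentile` acts as a lower cutoff on the composite score, which is consistent with `percentile=50.0` being described as the top 50% in this notebook. The helper `percentile_mask` is hypothetical and is not the body of `get_feature_mask()`.

```python
import numpy as np

def percentile_mask(scores, percentile):
    """Return a 0/1 mask keeping scores at or above the given percentile."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, percentile)  # score value at the cutoff
    return (scores >= threshold).astype(int)

# Hypothetical composite scores for eight features.
scores = np.array([1.0, 0.71, 0.66, 0.52, 0.30, 0.12, 0.05, 0.01])

mask_50 = percentile_mask(scores, 50.0)  # keeps the upper half
mask_30 = percentile_mask(scores, 30.0)  # lower cutoff, so more features survive
print(mask_50.tolist(), int(mask_30.sum()))
```

Under this reading, a lower `percentile` value keeps more features, not fewer.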

In [None]:
# Generate feature mask (top 50% of features)
feature_mask = analyzer.get_feature_mask(percentile=50.0)

print(f"Feature mask shape: {feature_mask.shape}")
print(f"Selected features: {feature_mask.sum()} / {len(feature_mask)}")
print(f"Selection ratio: {feature_mask.mean():.2%}")

In [None]:
# Export full results for LSTM
lstm_export = analyzer.export_for_lstm(percentile=50.0, top_k=30)

print("LSTM Export Contents:")
for key, value in lstm_export.items():
    if isinstance(value, list):
        print(f"  {key}: {len(value)} items")
    elif isinstance(value, dict):
        print(f"  {key}: {len(value)} entries")
    else:
        print(f"  {key}: {value}")

In [None]:
# Show selected features
print("\nSelected Features (mask=1):")
selected_features = [f for f, m in zip(lstm_export['feature_columns'], lstm_export['feature_mask']) if m == 1]
for i, feat in enumerate(selected_features[:30]):
    print(f"  {i+1:2d}. {feat}")

if len(selected_features) > 30:
    print(f"  ... and {len(selected_features) - 30} more")

In [None]:
# Save export to JSON
import json

export_path = 'indicator_association_export.json'
with open(export_path, 'w') as f:
    json.dump(lstm_export, f, indent=2)

print(f"Export saved to: {export_path}")

## 9. Generate Report

In [None]:
# Generate markdown report
analyzer.generate_report('indicator_association_report.md', format='markdown', top_k=30)

# Generate JSON report
analyzer.generate_report('indicator_association_report.json', format='json', top_k=30)

## 10. Usage with LSTMMetaheuristicOptimizer

Example of how to use the feature mask with LSTM optimization:

In [None]:
# # Example usage (uncomment to run)

# from crypto_analysis import LSTMMetaheuristicOptimizer

# # Run LSTM optimization on filtered features
# optimizer = LSTMMetaheuristicOptimizer(
#     df=df,
#     model_type='cnn_lstm',
#     pop_size=15,              # +2: Better exploration
#     iterations=100,            # +20: Still improving, allow more time
#     n_workers=25,
#     np_neighbors=2,           # +1: Better neighborhood effect
#     pf_max=0.25,              # +0.03: More exploration (params diverse)
#     epochs_per_eval=100,
#     checkpoint_interval=5,
#     elitist_selection=False,
#     elitist_constant=0.15,    # -0.02: Slightly more exploration
#     verbose=True,
#     enable_logging=True,
# )

# result = optimizer.optimize()
# result

iter:-1 indv:7 fitness:0.0267 features:63
iter:-1 indv:3 fitness:0.3736 features:55
iter:-1 indv:10 fitness:0.0541 features:62
iter:-1 indv:24 fitness:0.0699 features:60
iter:-1 indv:20 fitness:0.3159 features:59
iter:-1 indv:5 fitness:0.0000 features:61
iter:-1 indv:19 fitness:0.0013 features:57
iter:-1 indv:23 fitness:0.0141 features:61
iter:-1 indv:17 fitness:0.0053 features:61
iter:-1 indv:16 fitness:0.2277 features:70
iter:-1 indv:9 fitness:0.0000 features:58
iter:-1 indv:15 fitness:0.1462 features:59
iter:-1 indv:18 fitness:0.0555 features:62
iter:-1 indv:6 fitness:0.0016 features:62
iter:-1 indv:0 fitness:0.0680 features:54
iter:-1 indv:14 fitness:0.0644 features:51
iter:-1 indv:4 fitness:0.0000 features:55
iter:-1 indv:22 fitness:0.0833 features:63
iter:-1 indv:8 fitness:0.0000 features:60
iter:-1 indv:12 fitness:0.0347 features:57
iter:-1 indv:1 fitness:0.3124 features:57
iter:-1 indv:21 fitness:0.2727 features:64
iter:-1 indv:11 fitness:0.1745 features:67
iter:-1 indv:2 fitne

In [None]:
# Example usage

from crypto_analysis import LSTMMetaheuristicOptimizer

# Get feature mask: keep features scoring above the 30th percentile (roughly the top 70%)
feature_mask = analyzer.get_feature_mask(percentile=30.0)

# Use mask to filter features before LSTM training
feature_cols = analyzer.feature_cols
selected_cols = [col for col, mask in zip(feature_cols, feature_mask) if mask == 1]

# Create filtered dataset
lstm_df = df[['date', 'tradeable'] + selected_cols].copy()

# Run LSTM optimization on filtered features
optimizer = LSTMMetaheuristicOptimizer(
    df=lstm_df,
    model_type='cnn_lstm',
    pop_size=20,              # +2: Better exploration
    iterations=100,            # +20: Still improving, allow more time
    n_workers=25,
    np_neighbors=2,           # +1: Better neighborhood effect
    pf_max=0.25,              # +0.03: More exploration (params diverse)
    epochs_per_eval=100,
    checkpoint_interval=5,
    elitist_selection=False,
    elitist_constant=0.15,    # -0.02: Slightly more exploration
    verbose=True,
    enable_logging=True,
)

result = optimizer.optimize()
result

LSTMMetaheuristicOptimizer (APO) initialized:
  - Model type: cnn_lstm
  - DataFrame mode: binary
  - Feature columns: 82
  - Hyperparameters: 13
  - Total dimension: 95
  - Population size: 20
  - Iterations: 100
  - Workers: 25
  - APO np_neighbors: 2
  - APO pf_max: 0.27
  - Elitist selection: False

Starting APO Metaheuristic Optimization
Run ID: 7fcb89b9

Evaluating initial population...


iter:-1 indv:10 fitness:0.0000 features:44
iter:-1 indv:17 fitness:0.0017 features:39
iter:-1 indv:15 fitness:0.0104 features:37
iter:-1 indv:16 fitness:0.0276 features:42
iter:-1 indv:18 fitness:0.0916 features:42
iter:-1 indv:14 fitness:0.0543 features:32
iter:-1 indv:13 fitness:0.0000 features:41
iter:-1 indv:8 fitness:0.0000 features:34
iter:-1 indv:6 fitness:0.0020 features:40
iter:-1 indv:11 fitness:0.0052 features:37
iter:-1 indv:9 fitness:0.1172 features:38
iter:-1 indv:0 fitness:0.0000 features:37
iter:-1 indv:1 fitness:0.0119 features:42
iter:-1 indv:12 fitness:0.0000 features:43
iter:-1 indv:2 fitness:0.0381 features:43
iter:-1 indv:19 fitness:0.0015 features:40
iter:-1 indv:4 fitness:0.0036 features:44
iter:-1 indv:7 fitness:0.0000 features:38
iter:-1 indv:3 fitness:0.0020 features:42
iter:-1 indv:5 fitness:0.0000 features:35

------------------------------------------------------------------------
INIT - Best Individual Per-Class Metrics (fitness: 0.117220)
---------------

## Summary

This analysis identified:
1. **Individual indicators** most associated with the trade class (via chi-square, MI, lift)
2. **Indicator combinations** that predict the trade class (via association rules)
3. A **feature mask** for LSTM feature selection

Key outputs:
- `indicator_association_export.json`: Feature mask and scores for LSTM
- `indicator_association_report.md`: Human-readable analysis report
- `indicator_association_report.json`: JSON version of the analysis report