# Sentiment-Price Impact Analysis Pipeline

## Overview

This notebook documents the complete analysis pipeline that combines **sentiment data** with **stock price movements** to identify predictive signals.

### Research Question
> **Does aggregated news sentiment contain predictive information about future stock price movements?**

### Pipeline Components
1. **Data Aggregation** - Sentiment consensus metrics and hourly price data
2. **Feature Engineering** - Daily price bars, returns, volatility
3. **Statistical Analysis** - Correlation, regression, classification
4. **Causal Testing** - Granger causality, Monte Carlo validation
5. **Multi-ticker Analysis** - Cross-ticker pattern identification
6. **Trading Signals** - Actionable recommendations based on model strength

---

## Data Flow

```
Sentiment Data (daily)          Price Data (hourly)
       ↓                               ↓
   Aggregates              Hourly → Daily Conversion
  (sent_mean,                    (OHLCV bars)
   sent_std,                           ↓
   attention,                  Calculate Returns
   bull_bear)                       & Range
       ↓                               ↓
  Merge on Date  ←--------→        Alignment
       ↓
Aligned Dataset (T → T+1)
       ↓
Split into Train/Test
       ↓
Train 7 Models
       ↓
Generate Results
```
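
The flow above could be sketched in pandas roughly as follows. The column names (`open`, `high`, `low`, `close`, `volume`), the helper names `to_daily_bars` and `align_next_day`, and the synthetic data are illustrative assumptions, not the pipeline's actual code.

```python
import numpy as np
import pandas as pd


def to_daily_bars(hourly: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly OHLCV rows (DatetimeIndex) into daily bars with return and range."""
    daily = hourly.resample("D").agg(
        {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
    ).dropna(subset=["close"])  # drop calendar days without any bars
    daily["ret"] = daily["close"].pct_change()  # close-to-close return (decimal)
    daily["range"] = (daily["high"] - daily["low"]) / daily["open"]
    return daily


def align_next_day(sentiment: pd.DataFrame, daily: pd.DataFrame) -> pd.DataFrame:
    """Pair sentiment on day T with the return and range of the next available bar."""
    targets = daily[["ret", "range"]].shift(-1).add_suffix("_next")
    return sentiment.join(targets, how="inner").dropna()


# Synthetic demo: 5 days with 7 hourly bars each (all values are made up)
days = pd.date_range("2024-01-01", periods=5, freq="D")
idx = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in days for h in range(9, 16)])
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 0.5, len(idx)).cumsum()
hourly = pd.DataFrame(
    {"open": close - 0.1, "high": close + 0.3, "low": close - 0.3,
     "close": close, "volume": 1_000},
    index=idx,
)
sentiment = pd.DataFrame({"sent_mean": rng.normal(0, 1, len(days))}, index=days)

daily = to_daily_bars(hourly)
aligned = align_next_day(sentiment, daily)
print(aligned)
```

The last day drops out of `aligned` because it has no next-day target, which is the expected cost of the T → T+1 shift.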

---

## 1. Correlation Analysis (Pearson & Spearman)

### Purpose
**First diagnostic check**: Are sentiment and price movements even related?

### What it measures

**Pearson Correlation** ($\rho$)
- Linear relationship strength
- Sensitive to magnitude
- Formula: $\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$

**Spearman Correlation** ($\rho_s$)
- Rank-based, captures nonlinear monotonic relationships
- Robust to outliers
- Better suited to monotonic relationships that are not linear; it does not detect non-monotonic patterns (e.g., U-shaped)

### Interpretation

| Absolute correlation | Interpretation |
|----------------------|----------------|
| Below 0.05 | Negligible - No usable predictive value |
| 0.05 to below 0.15 | Weak - Some signal may be present |
| 0.15 to below 0.30 | Moderate - Meaningful predictive association |
| 0.30 and above | Strong - Highly predictive |

### Key Insight
If Spearman exceeds Pearson by more than 0.05 in absolute value, the relationship is likely monotonic but **nonlinear**, and tree-based models may capture it better than linear ones.
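
To illustrate the difference between the two coefficients, the sketch below uses `scipy.stats` on synthetic data. The series and the exponential transform are made up for the demonstration and are not pipeline output. A monotonic but strongly curved relationship yields a Spearman coefficient of 1 (up to floating point), while Pearson falls short of it.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.normal(size=500)        # stand-in for a sentiment feature at T
y_monotone = np.exp(2 * x)      # strictly increasing in x, but far from linear

# Both functions return (coefficient, p-value)
pearson_r, pearson_p = pearsonr(x, y_monotone)
spearman_r, spearman_p = spearmanr(x, y_monotone)

print(f"Pearson  = {pearson_r:.3f}")
print(f"Spearman = {spearman_r:.3f}")
print("Gap exceeds 0.05:", abs(spearman_r) - abs(pearson_r) > 0.05)
```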

---

## 2. Ridge Regression (Economic Impact)

### Purpose
**Quantify impact**: How much does each sentiment variable move prices?

### Model Specification

$$r_{t+1} = \beta_0 + \beta_1 \text{sent\_mean}_t + \beta_2 \text{sent\_std}_t + \beta_3 \text{attention}_t + \beta_4 \text{bull\_bear}_t + \epsilon$$

Where:
- $r_{t+1}$ = next-day return, as a decimal fraction (0.01 = 1%)
- $\beta_i$ = impact coefficient (change in next-day return per unit change in the feature)
- $\epsilon$ = residual error

### Why Ridge Regression?
- **Interpretability**: Coefficients show direct economic impact
- **Regularization**: L2 penalty handles multicollinearity
- **Baseline**: Linear baseline for comparison with nonlinear models

### Interpreting Coefficients

Example: $\beta_{\text{sent\_mean}} = -0.00145$

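> A +1 unit increase in sentiment mean (today) is associated with a **0.145 percentage-point lower return (tomorrow)**, holding other variables constant (with returns as decimal fractions, $-0.00145 = -0.145\%$).

**Important**: This is an association, not a causal estimate, and the L2 penalty shrinks coefficients toward zero.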

### Regression tells us:
- ✓ Direction of influence (positive/negative)
- ✓ Relative importance of features
- ✗ Nonlinear relationships
- ✗ True causality
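
A minimal sketch of this regression with scikit-learn's `Ridge`, on synthetic data. The penalty strength `alpha=1.0`, the made-up "true" coefficients, and the noise level are assumptions for illustration; the source does not state the values the pipeline uses.

```python
import numpy as np
from sklearn.linear_model import Ridge

features = ["sent_mean", "sent_std", "attention", "bull_bear"]
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                          # synthetic features at T
true_beta = np.array([-0.00145, 0.0, 0.0005, 0.0])     # made-up coefficients
y = X @ true_beta + rng.normal(scale=0.01, size=400)   # synthetic return at T+1 (decimal)

split = int(0.7 * len(X))                              # chronological split, no shuffling
model = Ridge(alpha=1.0).fit(X[:split], y[:split])

for name, beta in zip(features, model.coef_):
    # Multiply by 100 to read a decimal-return coefficient in percentage points
    print(f"{name:>10}: {beta:+.5f}  ({100 * beta:+.3f} pct. points per unit)")
print("out-of-sample R^2:", model.score(X[split:], y[split:]))
```

If the features are on very different scales, standardizing them before fitting makes the coefficient magnitudes comparable, since the L2 penalty is scale-sensitive.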

---

## 3. Logistic Regression (Direction Prediction)

### Purpose
**Can sentiment predict UP vs DOWN days?**

### Model Specification

$$\hat{p}_{\text{UP}} = \frac{1}{1 + e^{-(\beta_0 + \sum \beta_i x_i)}}$$

Target variable:
$$y_t = \begin{cases} 1 & \text{if } r_{t+1} > 0 \text{ (UP day)} \\ 0 & \text{if } r_{t+1} \leq 0 \text{ (DOWN day)} \end{cases}$$

### Metrics

**Accuracy**
- Fraction of correct predictions
- $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total}}$
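- Baseline: 50% for balanced classes; more generally, the majority-class rate (always predicting the more frequent direction)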
- Threshold for "real signal": >53%

**ROC AUC Score**
- Probability that model ranks a true UP day above a DOWN day
- Baseline: 0.50
- Weak signal: 0.50–0.60
- Strong signal: >0.60

### Why compare both metrics?
- **Accuracy**: Overall correctness
- **AUC**: Ranking quality (useful for portfolio construction)
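
Both metrics can be computed along these lines with scikit-learn. The data are synthetic, the default `LogisticRegression` settings are an assumption, and the majority-class rate is included as a reference point for accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))                                   # synthetic features at T
next_ret = 0.004 * X[:, 0] + rng.normal(scale=0.01, size=600)   # synthetic return at T+1
y = (next_ret > 0).astype(int)                                  # 1 = UP day, 0 = DOWN day

split = int(0.7 * len(X))                                       # chronological split
clf = LogisticRegression().fit(X[:split], y[:split])

p_up = clf.predict_proba(X[split:])[:, 1]                       # estimated P(UP)
acc = accuracy_score(y[split:], (p_up > 0.5).astype(int))
auc = roc_auc_score(y[split:], p_up)
# Accuracy obtained by always predicting the more frequent class in the test set
majority = max(y[split:].mean(), 1 - y[split:].mean())

print(f"accuracy={acc:.3f}  auc={auc:.3f}  majority-class baseline={majority:.3f}")
```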

---

## 4. Random Forest Classifier (Nonlinear Effects)

### Purpose
**Do sentiment effects involve interactions or nonlinear relationships?**

### Model Specification
- **N trees**: 300
- **Max depth**: 5 (prevents overfitting)
- **Train/test split**: 70/30 (time-aware)

### Algorithm
1. Draw a bootstrap sample of the training data for each tree
2. Grow a decision tree on each sample, considering a random subset of features at each split
3. Predict by aggregating across trees (majority vote or averaged class probabilities)

### Detecting Nonlinear Effects

| Comparison | Interpretation |
|-----------|----------------|
| RF Accuracy ≈ Logistic | Linear effects dominate |
| RF Accuracy >> Logistic | Nonlinear interactions present |
| RF AUC >> Logistic AUC | Tree model captures important patterns |

### Advantages over Linear Models
- Automatic feature interaction detection
- Handles outliers better
- More flexible decision boundaries
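
The comparison can be sketched as below, using the stated settings (300 trees, depth 5, 70/30 chronological split). The synthetic target depends on the product of two features, an interaction chosen purely for illustration, so a linear model has little to work with while the forest can pick it up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
# Made-up interaction: direction is driven by the product of features 0 and 1
y = ((X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=1000)) > 0).astype(int)

split = int(0.7 * len(X))  # time-aware split: first 70% train, last 30% test
rf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=0)
rf.fit(X[:split], y[:split])
logit = LogisticRegression().fit(X[:split], y[:split])

auc_rf = roc_auc_score(y[split:], rf.predict_proba(X[split:])[:, 1])
auc_logit = roc_auc_score(y[split:], logit.predict_proba(X[split:])[:, 1])
print(f"RF AUC={auc_rf:.3f}  Logistic AUC={auc_logit:.3f}")
```

On real data the gap is usually far smaller than in this constructed case, so a modest difference should be checked against sampling noise before it is read as evidence of nonlinearity.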

---

## 5. Lagged Returns (Control Variables)

### Purpose
**Isolate sentiment effect from price momentum**

### Why Control for Past Returns?

Returns can exhibit **autocorrelation**: past returns may carry some information about future returns.

Without lagged returns in the model:
- Sentiment appears more predictive than it actually is
- Model captures momentum, not sentiment signal
- Overstates predictive power

### Implementation

Add 3 lags of returns:
$$r_{t+1} = f(\text{sent}_t, r_{t-1}, r_{t-2}, r_{t-3})$$

### Interpretation
If sentiment remains significant after adding lags:
> Sentiment contains information **beyond price momentum**
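
A small pandas sketch of building the lag features, following the indexing in the formula above ($r_{t-1}, r_{t-2}, r_{t-3}$ as controls, $r_{t+1}$ as target). The helper name `add_return_lags`, the column names, and the toy numbers are hypothetical.

```python
import pandas as pd


def add_return_lags(daily: pd.DataFrame, n_lags: int = 3) -> pd.DataFrame:
    """Return a copy with ret_lag1..ret_lagN (r_{t-1}..r_{t-N}) and ret_next (r_{t+1})."""
    out = daily.copy()
    for k in range(1, n_lags + 1):
        out[f"ret_lag{k}"] = out["ret"].shift(k)   # return k days before row t
    out["ret_next"] = out["ret"].shift(-1)         # prediction target
    return out.dropna()                            # drop rows lacking a lag or a target


# Toy daily frame (values are made up)
daily = pd.DataFrame(
    {
        "sent_mean": [0.2, -0.1, 0.4, 0.0, 0.3, -0.2],
        "ret": [0.01, -0.02, 0.03, 0.00, 0.015, -0.005],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)
with_lags = add_return_lags(daily)
print(with_lags)
```

Each lag costs one row at the start of the sample and the target costs one at the end, so six days leave two usable rows here.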

---

## 6. Granger Causality Test

### Purpose
**Does past sentiment add forecasting power for returns beyond past returns alone?** (Econometric test)

### Hypothesis Test

**Null hypothesis (H₀)**
> Sentiment does **not** improve price prediction beyond past returns alone.

### Two Models Compared

**Model 1 (Returns only)**
$$r_t = c + \sum_{i=1}^{3} \alpha_i r_{t-i} + u_t$$

**Model 2 (Returns + Sentiment)**
$$r_t = c + \sum_{i=1}^{3} \alpha_i r_{t-i} + \sum_{j=1}^{3} \gamma_j \text{sent}_{t-j} + u_t$$

### Test Statistic

$$F = \frac{(\text{RSS}_1 - \text{RSS}_2) / q}{\text{RSS}_2 / (n - k)}$$

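- $\text{RSS}_1$, $\text{RSS}_2$ = residual sums of squares of Model 1 and Model 2
- q = number of restrictions (sentiment lags; here 3)
- n = number of observations; k = number of estimated parameters in Model 2 (here 7: constant, 3 return lags, 3 sentiment lags)
- p-value computed from the $F(q,\, n-k)$ distribution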

### Interpretation

| p-value | Conclusion |
|---------|-----------|
| < 0.05 | Reject H₀: Sentiment **does** Granger-cause returns (adds forecasting power) |
| ≥ 0.05 | Fail to reject H₀: No evidence of Granger causality |

### Important Clarification
- **Granger causality ≠ true causality**
- Means: sentiment provides **incremental forecasting power**
- Does not mean: sentiment directly causes price changes
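
The F-test above can be written out by hand with NumPy and SciPy, as in this sketch. The function name `granger_f_test` and the synthetic series are assumptions; the pipeline may use a library routine instead.

```python
import numpy as np
from scipy.stats import f as f_dist


def granger_f_test(ret: np.ndarray, sent: np.ndarray, n_lags: int = 3):
    """F-test of H0: all sentiment-lag coefficients are zero (Model 1 vs. Model 2)."""
    n = len(ret) - n_lags                       # usable observations
    y = ret[n_lags:]
    # Column i-1 holds the series lagged by i periods, aligned with y
    ret_lags = np.column_stack([ret[n_lags - i: len(ret) - i] for i in range(1, n_lags + 1)])
    sent_lags = np.column_stack([sent[n_lags - i: len(sent) - i] for i in range(1, n_lags + 1)])
    const = np.ones((n, 1))

    def rss(X: np.ndarray) -> float:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
        resid = y - X @ beta
        return float(resid @ resid)

    X1 = np.hstack([const, ret_lags])                  # Model 1: returns only
    X2 = np.hstack([const, ret_lags, sent_lags])       # Model 2: returns + sentiment
    rss1, rss2 = rss(X1), rss(X2)
    q, k = n_lags, X2.shape[1]
    F = ((rss1 - rss2) / q) / (rss2 / (n - k))
    return F, float(f_dist.sf(F, q, n - k))            # upper-tail p-value


# Synthetic check: returns respond to yesterday's sentiment by construction
rng = np.random.default_rng(3)
sent = rng.normal(size=500)
ret = rng.normal(scale=0.01, size=500)
ret[1:] += 0.005 * sent[:-1]

F, p = granger_f_test(ret, sent)
print(f"F={F:.2f}  p={p:.4g}")
```

statsmodels also provides `grangercausalitytests` in `statsmodels.tsa.stattools`, which runs comparable tests over a range of lag orders.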

---

## 7. Volatility (Range) Regression

### Purpose
**Does sentiment affect price uncertainty, not just direction?**

### Target Variable

$$\text{range}_{t+1} = \frac{\text{High}_{t+1} - \text{Low}_{t+1}}{\text{Open}_{t+1}}$$

Measures intraday price movement as % of opening price.

### Model

$$\text{range}_{t+1} = \beta_0 + \beta_1 \text{sent\_mean}_t + \beta_2 \text{sent\_std}_t + \dots + \epsilon$$

### Why Volatility Matters

Sentiment often impacts:
- **Disagreement** among traders (pushes prices wider)
- **Uncertainty** (increases bid-ask spreads)
- **Trading activity** (higher volume = larger moves)

**Finding**: Volatility effects are often **stronger** than return effects.

### Interpretation

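Example: $\beta_{\text{sent\_std}} = +0.0032$

> A +1 unit increase in sentiment disagreement (today) is associated with a **0.32 percentage-point wider intraday range (tomorrow)**, holding other variables constant.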
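
As a sketch, the range target and its regression might look like this. Plain least squares via `numpy.linalg.lstsq` is used here for brevity, which may differ from the estimator in the pipeline, and the synthetic OHLC values are constructed so that the range widens with `sent_std`.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
sent_mean = rng.normal(size=n)                 # synthetic features at T
sent_std = rng.uniform(0.1, 1.0, size=n)

# Synthetic next-day bars: half-width grows with disagreement (made-up relationship)
open_next = np.full(n, 100.0)
half_width = 100.0 * (0.01 + 0.0032 * sent_std) / 2 + rng.normal(scale=0.05, size=n)
high_next = open_next + half_width
low_next = open_next - half_width

# Target: intraday range as a fraction of the open
range_next = (high_next - low_next) / open_next

# OLS with intercept: range_next ~ 1 + sent_mean + sent_std
X = np.column_stack([np.ones(n), sent_mean, sent_std])
beta, *_ = np.linalg.lstsq(X, range_next, rcond=None)
print(f"beta_sent_mean={beta[1]:+.5f}  beta_sent_std={beta[2]:+.5f}")
```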

---

## 8. Monte Carlo Permutation Test

### Purpose
**Is the signal real or just random noise?**

### Question
> If sentiment were completely random, how often would we observe an effect this strong?

### Algorithm

**Step 1**: Compute real effect
- Train: $r_{t+1} = \beta \cdot \text{sent}_t + \epsilon$
- Record: $\beta_{\text{real}}$

**Step 2**: Shuffle and repeat (N=1000 iterations)
```
for i in 1 to 1000:
    Shuffle(sentiment_values)
    Train model on (shuffled_sentiment, returns)
    Record β_shuffled[i]
```

**Step 3**: Compute p-value
$$p\text{-value} = \frac{\#\{|\beta_{\text{shuffled}}| \geq |\beta_{\text{real}}|\}}{N}$$

### Interpretation

| p-value | Conclusion |
|---------|-----------|
| < 0.05 | Effect is **statistically significant** (unlikely to arise by chance) |
| 0.05–0.20 | Weak evidence (borderline) |
| > 0.20 | Effect is **consistent with random noise** |
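
The three steps can be sketched in NumPy as follows. The helper names are hypothetical, the data are synthetic, and the slope includes an intercept, which is an assumption beyond the single-coefficient formula in Step 1.

```python
import numpy as np


def slope(x: np.ndarray, y: np.ndarray) -> float:
    """OLS slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def permutation_p_value(sent, next_ret, n_iter: int = 1000, seed: int = 0):
    """Two-sided permutation p-value for the sentiment slope."""
    rng = np.random.default_rng(seed)
    beta_real = slope(sent, next_ret)                       # Step 1: real effect
    beta_shuffled = np.array(                               # Step 2: shuffled effects
        [slope(rng.permutation(sent), next_ret) for _ in range(n_iter)]
    )
    # Step 3: share of shuffles at least as extreme as the real effect
    p_value = float(np.mean(np.abs(beta_shuffled) >= abs(beta_real)))
    return beta_real, p_value


rng = np.random.default_rng(5)
sent = rng.normal(size=300)
ret_signal = 0.005 * sent + rng.normal(scale=0.01, size=300)  # signal built in
ret_noise = rng.normal(scale=0.01, size=300)                  # no relationship

beta_sig, p_sig = permutation_p_value(sent, ret_signal)
beta_noise, p_noise = permutation_p_value(sent, ret_noise)
print(f"signal: beta={beta_sig:+.5f}, p={p_sig:.3f}")
print(f"noise:  beta={beta_noise:+.5f}, p={p_noise:.3f}")
```

With N shuffles the smallest nonzero p-value is 1/N, so some practitioners report (count + 1) / (N + 1) to avoid a p-value of exactly zero.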

### Why Monte Carlo?
- **Non-parametric**: No assumption about the distribution of returns (shuffling does assume the observations are exchangeable)
- **Robust**: Works for complex relationships
- **Direct**: Answers "Could this happen by chance?"

---

## 9. Multi-Ticker Analysis Pipeline

### Comprehensive Analyzer

After `consensus_price_comparison.py` has run for all tickers, the `ComprehensiveAnalyzer` aggregates the per-ticker results.

### Analyses Performed

1. **Correlation Summary**
   - Mean/std of Pearson correlations across tickers
   - Consensus on signal strength
   
2. **Regression Consensus**
   - Average coefficients per feature
   - Identification of consistent predictors
   
3. **Model Performance Metrics**
   - Average accuracy and AUC across tickers
   - Assessment of signal reliability
   
4. **Granger Causality Summary**
   - How many tickers show significant causality?
   - Consistency of predictive power
   
5. **Monte Carlo Summary**
   - What % of tickers have statistically significant effects?
   - Real vs. random signal
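
The interface of `ComprehensiveAnalyzer` is not shown here, so the following is only a standalone pandas sketch of this kind of aggregation. The tickers, column names, and numbers are placeholders invented for the example, not results.

```python
import pandas as pd

# Placeholder per-ticker results (all names and values are made up)
results = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC", "DDD"],
        "pearson": [0.08, -0.02, 0.12, 0.04],
        "accuracy": [0.54, 0.49, 0.56, 0.51],
        "auc": [0.55, 0.50, 0.58, 0.52],
        "granger_p": [0.03, 0.40, 0.01, 0.20],
        "mc_p": [0.04, 0.60, 0.02, 0.15],
    }
).set_index("ticker")

summary = {
    "pearson_mean": results["pearson"].mean(),
    "pearson_std": results["pearson"].std(),
    "accuracy_mean": results["accuracy"].mean(),
    "auc_mean": results["auc"].mean(),
    # Number of tickers with a significant Granger test at the 5% level
    "granger_significant": int((results["granger_p"] < 0.05).sum()),
    # Percentage of tickers with a significant Monte Carlo test
    "mc_significant_pct": 100 * float((results["mc_p"] < 0.05).mean()),
}
for key, value in summary.items():
    print(f"{key:>22}: {value}")
```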

### Trading Signal Generation

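Signals combine multiple factors. Each bracketed term counts as 1 when that factor clears its threshold and 0 otherwise, so the score ranges from 0 to 3:

$$\text{Signal\_Strength} = \mathbb{1}[\text{Correlation}] + \mathbb{1}[\text{Classification Accuracy}] + \mathbb{1}[\text{Granger Significance}]$$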

| Strength | Quality | Action |
|----------|---------|--------|
| 0 factors | WEAK | Monitor, do not trade |
| 1 factor | WEAK | Use with caution |
| 2 factors | MODERATE | Consider for portfolio |
| 3 factors | STRONG | High confidence signal |
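
A possible scoring function for the table above is sketched below. The thresholds are borrowed from the earlier sections (absolute correlation of at least 0.05, accuracy above 53%, Granger p-value below 0.05); whether the pipeline uses exactly these cut-offs is an assumption, and the function name is hypothetical.

```python
def signal_strength(
    pearson: float,
    accuracy: float,
    granger_p: float,
    min_abs_corr: float = 0.05,   # assumed cut-off for the correlation factor
    min_accuracy: float = 0.53,   # assumed cut-off for the accuracy factor
    alpha: float = 0.05,          # assumed significance level for Granger
) -> tuple[int, str]:
    """Count how many of the three factors pass and map the count to a quality label."""
    factors = [
        abs(pearson) >= min_abs_corr,
        accuracy > min_accuracy,
        granger_p < alpha,
    ]
    score = sum(factors)
    quality = {0: "WEAK", 1: "WEAK", 2: "MODERATE", 3: "STRONG"}[score]
    return score, quality


print(signal_strength(0.12, 0.56, 0.01))
print(signal_strength(0.08, 0.51, 0.03))
print(signal_strength(0.01, 0.50, 0.50))
```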

---

## 10. Output Files

The pipeline generates comprehensive outputs in `backend/analysis_results/`:

### Text Report
**File**: `analysis_report_TIMESTAMP.txt`
- Full numerical results
- Interpretation of each model
- Statistical summaries

### CSV Export
**File**: `analysis_summary_TIMESTAMP.csv`
- One row per ticker
- All correlation, regression, and accuracy metrics
- Quick reference format

### JSON Export
**File**: `analysis_detailed_TIMESTAMP.json`
- Complete results for programmatic access
- All model outputs preserved
- Enables downstream analysis

### Visualizations
- **Correlation heatmap**: Sentiment-return correlations by feature and ticker
- **Regression coefficients**: Economic impact by feature
- **Model performance**: Accuracy/AUC comparison
- **Granger causality**: Lag structure visualization

---

## 11. Complete Workflow

```
RUN: python backend/scripts/consensus_price_comparison.py

├─ For each ticker:
│  ├─ [1] Load sentiment aggregates (daily)
│  ├─ [2] Load hourly prices
│  ├─ [3] Aggregate to daily bars
│  ├─ [4] Add lagged returns (control variables)
│  ├─ [5] Align sentiment T → price T+1
│  ├─ [6] Correlation Analysis (Pearson & Spearman)
│  ├─ [7] Ridge Regression (impact quantification)
│  ├─ [8] Logistic Regression (direction prediction)
│  ├─ [9] Random Forest (nonlinear detection)
│  ├─ [10] Granger Causality (incremental info)
│  ├─ [11] Volatility Regression (uncertainty)
│  ├─ [12] Monte Carlo Test (significance)
│  └─ Collect all results
│
└─ RUN: ComprehensiveAnalyzer
   ├─ [1] Aggregate correlations across tickers
   ├─ [2] Consensus on regression coefficients
   ├─ [3] Compare model performance
   ├─ [4] Statistical significance testing
   ├─ [5] Generate trading signals
   ├─ [6] Create visualizations
   ├─ [7] Export CSV/JSON/TXT reports
   └─ Generate executive summary

OUTPUT: backend/analysis_results/ directory with all reports & visualizations
```

---

## 12. Model Decision Tree

### When to use each model:

**Correlation Analysis**
- First step, always run
- If negligible → stop here, no signal
- If significant → continue

**Ridge Regression**
- Quantify economic impact
- Understand relative importance
- Create baseline predictions

**Logistic Regression**
- Binary classification baseline
- Compare vs Random Forest
- Portfolio construction signals

**Random Forest**
- Detect complex patterns
- Identify interactions
- Improve over linear baseline?

**Granger Causality**
- Test statistical significance
- Incremental information?
- Formal hypothesis test

**Monte Carlo Test**
- Validate all findings
- p-value < 0.05 = effect unlikely to be chance
- Final gate before trading

**Volatility Regression**
- Complementary analysis
- Risk/uncertainty measurement
- Portfolio volatility forecasts

---

## 13. Key Takeaways

1. **No single model answers all questions** - Use multiple models for triangulation
2. **Correlation ≠ prediction** - A significant correlation is a starting point, not proof of out-of-sample predictive power
3. **Control for momentum** - Lagged returns are critical control variables
4. **Nonlinearity matters** - Compare linear vs tree-based model performance
5. **Validate everything** - Monte Carlo testing separates signal from noise
6. **Multi-ticker consensus** - Real signals should generalize across stocks
7. **Trading signals require multiple confirmations** - Not just one model
8. **Volatility often stronger than returns** - Consider both dimensions
9. **Granger ≠ causality** - Means "predictive precedence" only
10. **Results are probabilistic** - All models capture associations, not absolute truths