# Hull Tactical Market Prediction - Final Project Report
## Kaggle Competition-Based Machine Learning Project
### Student: Divit Pratap Singh
### Repository: https://github.com/dps08/kaggle-market-prediction
### Individual Project (Solo)

---



## 1. Dataset and Descriptions

### 1.1 What dataset did you use? How many samples? Labeled? Unlabeled? Features?

**Dataset:** Hull Tactical Market Prediction Dataset (Kaggle Competition)
- **Total Samples:** 9,021 trading days
- **Labeled:** Yes, all samples are labeled with target variables
- **Base Features:** 94 features
- **Engineered Features:** 238 features (after feature engineering)
- **Target Variables:** 3 (forward_returns, risk_free_rate, market_forward_excess_returns)

### 1.2 What type of data, and what is your data about?

**Data Type:** Time-series financial market data

**Data Description:**
This dataset contains daily S&P 500 market data spanning multiple decades. The features capture market dynamics, economic indicators, and sentiment metrics, and are used to predict the daily excess return of the S&P 500 index.

### 1.3 Data Distribution

**Target Variable Distribution:**
- **Mean Excess Return:** 0.000053
- **Standard Deviation:** 0.010558
- **Positive Return Days:** 51.68%
- **Negative Return Days:** 48.32%
- **Distribution:** Near-random, approximately 50-50 split (consistent with Efficient Market Hypothesis)

### 1.4 Brief description of features and data link

**Feature Categories (94 base features):**

1. **Market Dynamics (M1-M18):** Technical indicators, price movements, trading volume
2. **Economic Indicators (E1-E20):** GDP, inflation, unemployment, consumer confidence
3. **Interest Rates (I1-I9):** Federal funds rate, treasury yields, yield curves
4. **Price/Valuation (P1-P13):** P/E ratios, market valuations, earnings metrics
5. **Volatility (V1-V13):** VIX, realized volatility, volatility indices
6. **Sentiment (S1-S12):** Market sentiment, investor surveys, behavioral indicators
7. **Momentum (MOM1-MOM3):** Price momentum, trend indicators
8. **Dummy/Binary (D1-D9):** Categorical market regime indicators

**Data Source:** https://www.kaggle.com/competitions/hull-tactical-market-prediction

### 1.5 Data Type Analysis

**Categorical Variables:**
- D1-D9: Binary dummy variables (0 or 1)
- Represent market regimes, trading days, special events

**Ordinal Variables:**
- None explicitly, but some features like sentiment scores have implicit ordering

**Continuous Numerical Variables:**
- M1-M18, E1-E20, I1-I9, P1-P13, V1-V13, S1-S12, MOM1-MOM3
- Measured on different scales, so scaling is required for the linear model (Ridge); tree-based models are insensitive to monotonic feature scaling

**Missing Data:** 15.57% overall (varies by feature and time period)


## 2. Data Pre-processing

### 2.1 What pre-processing techniques did you apply?

1. **Missing Value Imputation:** Forward-fill then median imputation
2. **Feature Scaling:** RobustScaler (median and IQR-based)
3. **Feature Engineering:** Lagged features, rolling statistics, interactions
4. **Time-based Splitting:** Chronological train/validation/test split (70/15/15)
5. **Outlier Handling:** RobustScaler handles outliers better than StandardScaler

### 2.2 How did you handle missing values?

**Strategy:**
1. **Forward Fill:** For time-series continuity (carries last known value forward)
2. **Median Imputation:** For remaining missing values after forward fill
3. **Rationale:** Median is robust to outliers, forward fill preserves temporal patterns

**Implementation:**
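```python
# Forward fill: carry the last known value forward in time.
# df.ffill() replaces fillna(method='ffill'), which newer pandas releases deprecate.
df = df.ffill()
# Median imputation for values still missing (e.g. leading NaNs before the first observation)
df = df.fillna(df.median(numeric_only=True))
```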

### 2.3 Handling categorical variables

**Strategy:**
- D1-D9 are already binary encoded (0/1)
- No additional encoding needed
- These features were kept as-is in the model

### 2.4 Normalization/Standardization

**Method:** RobustScaler

**Why RobustScaler:**
- Uses median and IQR instead of mean and standard deviation
- Robust to outliers (financial data has extreme events like crashes)
- Better for financial time series with fat tails

**Formula:**
```
X_scaled = (X - median(X)) / IQR(X)
```
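The formula can be checked against scikit-learn directly. The following is a minimal sketch on synthetic data (the array, the injected extreme value, and the 700-row cut-off are illustrative assumptions, not the project's data). It fits the scaler on the earlier rows only, so that later rows are scaled with statistics from the past.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 3))
X[10, 0] = 50.0  # inject one extreme value, similar to a crash day

# Fit on the earlier rows only, then reuse the fitted statistics on later rows.
split = 700  # hypothetical cut-off for this sketch
X_train, X_later = X[:split], X[split:]

scaler = RobustScaler()  # default quantile_range=(25.0, 75.0), i.e. the IQR
X_train_scaled = scaler.fit_transform(X_train)
X_later_scaled = scaler.transform(X_later)

# Manual version of X_scaled = (X - median(X)) / IQR(X)
median = np.median(X_train, axis=0)
q1, q3 = np.percentile(X_train, [25, 75], axis=0)
manual = (X_train - median) / (q3 - q1)
```

Because the median and IQR barely move when a single extreme value is added, the injected outlier has little effect on how the remaining rows are scaled.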

### 2.5 Handling outliers and skewed distributions

**Outlier Strategy:**
1. **RobustScaler:** Automatically handles outliers in scaling
2. **No Removal:** Outliers represent real market events (crashes, rallies)
3. **Winsorization:** Applied to target variable using MAD criterion
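
One way MAD-based winsorization could look is sketched below. The function name `winsorize_mad` and the cut-off of 4 MADs are assumptions for illustration; the report does not state the threshold that was actually used. In a time-series setting the bounds would normally be estimated on the training window only.

```python
import numpy as np

def winsorize_mad(y, n_mads=4.0):
    """Clip values to median +/- n_mads * MAD (n_mads is a hypothetical choice)."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    mad = np.median(np.abs(y - med))  # median absolute deviation
    lower, upper = med - n_mads * mad, med + n_mads * mad
    return np.clip(y, lower, upper)

# Synthetic daily returns with two extreme days added by hand
rng = np.random.default_rng(42)
returns = rng.normal(0.0, 0.01, size=1000)
returns[0] = -0.094
returns[1] = 0.109

clipped = winsorize_mad(returns)
```

The extreme days stay in the sample (no rows are removed); only their magnitude is capped.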

**Skewness Handling:**
- Financial returns are naturally skewed
- RobustScaler handles skewness better than StandardScaler
- No log transformation applied (returns can be negative)

### 2.6 Handling imbalanced dataset

**Class Balance:**
- Positive days: 51.68%
- Negative days: 48.32%
- **Conclusion:** The sign of the target is roughly balanced, so no resampling is needed

**Note:** This is a regression task, not classification, so SMOTE is not applicable

### 2.7 Dimensionality reduction

**Techniques Used:**
1. **Feature Selection:** LightGBM feature importance
2. **No PCA:** Wanted interpretable features for financial domain
3. **Correlation-based removal:** Removed features with >0.95 correlation

**Result:** 94 base features → 238 engineered features → kept all for model


## 3. Exploratory Data Analysis (EDA)

### 3.1 Key insights from data visualization

**Findings:**
1. **Returns Distribution:** Roughly bell-shaped and centered at 0, with a slight negative skew (-0.23) and fat tails (kurtosis 8.45)
2. **Volatility Clustering:** Periods of high volatility (2008, 2020) clearly visible
3. **Feature Correlations:** Economic indicators highly correlated with each other
4. **Seasonality:** No strong seasonal patterns in daily returns
5. **Regime Changes:** Clear shifts during crisis periods

### 3.2 Strong correlations and multicollinearity

**High Correlation Pairs:**
- E1-E3 (Economic indicators): r > 0.85
- I2-I7 (Interest rates): r > 0.90
- V1-V13 (Volatility measures): r > 0.75

**Multicollinearity:**
- VIF analysis showed some features with VIF > 10
- Decision: Keep all features as tree-based models handle multicollinearity well
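
For reference, a VIF can be computed by regressing each feature on all the others and taking 1 / (1 - R²). The sketch below uses plain NumPy on synthetic columns; the helper name `vif` and the data are illustrative assumptions rather than the analysis code used for this report.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(42)
a = rng.standard_normal(500)
b = a + 0.1 * rng.standard_normal(500)  # nearly collinear with a
c = rng.standard_normal(500)            # independent of both
vifs = vif(np.column_stack([a, b, c]))
```

In this toy example the two collinear columns land well above the VIF > 10 rule of thumb, while the independent column stays close to 1.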

### 3.3 Data distribution verification

**Methods:**
1. **Histograms:** Checked distribution shape for each feature
2. **Q-Q Plots:** Assessed normality of returns
3. **Box Plots:** Identified outliers in each feature
4. **Rolling Statistics:** Verified stationarity

### 3.4 Statistical summaries

**Target Variable (market_forward_excess_returns):**
- Mean: 0.000053
- Median: 0.000124
- Std Dev: 0.010558
- Min: -0.094
- Max: 0.109
- Skewness: -0.23
- Kurtosis: 8.45 (fat tails)

### 3.5 Trends, seasonality, anomalies

**Trends:**
- No long-term trend in daily returns (stationary)
- Volatility shows regime shifts

**Seasonality:**
- No strong day-of-week effects
- Month-end effects minimal

**Anomalies:**
- 2008 Financial Crisis: extreme negative returns
- 2020 COVID Crash: rapid volatility spike
- Flash crashes: isolated extreme days


## 4. Feature Engineering

### 4.1 New features created and their impact

**Engineered Features (94 → 238):**

**1. Lagged Features (111 features):**
- Lags: 1, 5, 20 days
- Captures trader behavior patterns
- Most important: E3_lag1 (1.11% importance)

**2. Rolling Statistics (30 features):**
- 5-day and 20-day moving averages
- 5-day and 20-day standard deviations
- Smooths noise, captures momentum

**3. Interaction Features (3 features):**
- Volatility × Sentiment: Fear drives volatility
- Momentum × Volatility: Trend strength
- Economic × Interest Rate: Policy interaction

**Impact:**
- IC improved from 0.045 (baseline) to 0.068 (ensemble)
- Top features: 50.96% lagged, 15.13% rolling stats
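
The three feature families can be sketched with pandas as follows. The helper `add_features` and the `_std` suffix are hypothetical; the `_lag`, `_ma` and `_x_` naming follows the feature names quoted in this report. All transforms look backward only, so row *t* never uses values from after *t*.

```python
import numpy as np
import pandas as pd

def add_features(df, cols, lags=(1, 5, 20), windows=(5, 20)):
    """Append lagged values and trailing rolling statistics for the given columns."""
    new = {}
    for col in cols:
        for lag in lags:
            new[f"{col}_lag{lag}"] = df[col].shift(lag)      # value from `lag` days earlier
        for w in windows:
            roll = df[col].rolling(window=w, min_periods=w)  # trailing window ending at t
            new[f"{col}_ma{w}"] = roll.mean()
            new[f"{col}_std{w}"] = roll.std()
    return pd.concat([df, pd.DataFrame(new, index=df.index)], axis=1)

# Synthetic stand-ins for three of the base columns
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["E3", "V9", "S11"])

feats = add_features(df, cols=["E3"])
feats["V9_x_S11"] = feats["V9"] * feats["S11"]  # volatility x sentiment interaction
```

The first rows of each lagged or rolling column are NaN until enough history exists, so they need the same imputation (or trimming) as other missing values.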

### 4.2 Filtering-based feature selection

**Methods Used:**

**1. Variance Threshold:**
- Removed features with variance < 0.001
- Result: 0 features removed (all had sufficient variance)

**2. Correlation Threshold:**
- Removed features with correlation > 0.95 with another feature
- Result: Kept one from each correlated pair

**3. Mutual Information:**
- Not used (computationally expensive for 238 features)
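
The two filters that were applied could be sketched as below. The function names and the synthetic frame are assumptions for illustration; the thresholds (0.001 and 0.95) are the ones stated above. For each highly correlated pair the earlier column is kept and the later one is dropped.

```python
import numpy as np
import pandas as pd

def drop_low_variance(df, threshold=0.001):
    """Keep columns whose variance is at least `threshold`."""
    return df.loc[:, df.var() >= threshold]

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair with |correlation| above `threshold`."""
    corr = df.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strictly upper triangle
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(42)
a = rng.standard_normal(500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.standard_normal(500),  # near-duplicate of a
    "c": rng.standard_normal(500),             # independent
    "d": np.ones(500),                         # constant, zero variance
})

step1 = drop_low_variance(demo)
kept, dropped = drop_highly_correlated(step1)
```

Running the variance filter first also removes constant columns, whose correlations would otherwise be undefined.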

### 4.3 Embedding-based feature selection

**Not Applicable:**
- This is tabular time-series data, not text/images
- Word2Vec, autoencoders not relevant
- Used tree-based importance instead

### 4.4 Wrapper-based feature selection

**Method: LightGBM Feature Importance (Embedded Method)**

**Process:**
1. Train LightGBM on all 238 features
2. Calculate feature importance (gain-based)
3. Rank features by importance

**Top 10 Features:**
1. E3_lag1: 1.11%
2. S4_lag20: 1.02%
3. V9_x_S11: 0.95% (interaction)
4. E3_ma5: 0.92%
5. P8: 0.91%
6. P11_lag1: 0.87%
7. E16_lag20: 0.84%
8. E3_lag5: 0.80%
9. V11_lag5: 0.80%
10. E2_lag1: 0.76%

### 4.5 Insights from feature selection

**Key Insights:**
1. **Lagged features dominate:** 50.96% of total importance
2. **Economic indicators critical:** E3 (economic factor) appears in top 10 three times
3. **Short lags appear most often:** lag1 features take three of the top-10 slots, versus two each for lag5 and lag20
4. **Interactions add value:** V9_x_S11 in top 3
5. **Sentiment matters:** S4_lag20 is 2nd most important
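
Family-level shares such as "lagged features account for X% of importance" can be obtained by grouping per-feature importance by naming pattern. The sketch below uses only the top five values listed above, so the shares it produces describe that five-feature subset and are not the report's full 238-feature figures. The helper `feature_family` is hypothetical; in practice the input would be the gain importances exported from the trained model.

```python
import pandas as pd

def feature_family(name):
    """Map a feature name to its family using the naming scheme in this report."""
    if "_x_" in name:
        return "interaction"
    if "_lag" in name:
        return "lagged"
    if "_ma" in name or "_std" in name:
        return "rolling"
    return "base"

# Top five importances quoted in Section 4.4 (percent of total gain)
importance = pd.Series({
    "E3_lag1": 1.11,
    "S4_lag20": 1.02,
    "V9_x_S11": 0.95,
    "E3_ma5": 0.92,
    "P8": 0.91,
})

share = importance / importance.sum() * 100.0   # renormalize within this subset
by_family = share.groupby(feature_family).sum() # the function is applied to each index label
```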

### 4.6 Feature count: start vs retained

**Feature Evolution:**
- **Original:** 94 base features
- **After Engineering:** 238 features
- **Retained for Training:** 238 features (all)

**Rationale:** Tree-based models handle high dimensionality well, all features add value

### 4.7 Redundant/irrelevant features removed

**Removed Features:**
- None removed after engineering
- All 238 features showed non-zero importance

**Why keep all:**
- XGBoost/LightGBM automatically handle feature selection
- Regularization (alpha=0.1, lambda=1.0) prevents overfitting
- Ensemble benefits from diverse features


## 5. Training and Testing Process

### 5.1 Model Selection - Three Categories

**Category: Regression Supervised Learning**

**Three Models Selected:**
1. **Ridge Regression** (Linear, L2 regularization)
2. **XGBoost Regressor** (Gradient boosting, tree-based)
3. **LightGBM Regressor** (Gradient boosting, optimized)

### 5.2 Rationale for model selection

**Ridge Regression:**
- Baseline linear model
- Handles multicollinearity well (L2 penalty)
- Interpretable coefficients
- Fast training

**XGBoost:**
- State-of-the-art for tabular data
- Handles non-linear relationships
- Built-in regularization
- Robust to outliers in feature values (splits depend on ordering, not magnitude)

**LightGBM:**
- Faster than XGBoost on large datasets
- Better handling of categorical features
- Leaf-wise growth strategy
- Diverse from XGBoost for ensemble

### 5.3 Dataset split

**Split Method: Time-based Sequential Split**

**Rationale:**
- Financial data has temporal dependency
- Random split would cause data leakage
- Must train on past, validate/test on future

**Split:**
- **Train:** 70% (indices 0-6313, n=6314)
- **Validation:** 15% (indices 6314-7666, n=1353)
- **Test:** 15% (indices 7667-9020, n=1354)
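
The index ranges above follow from truncating the cumulative fractions to integers. A minimal sketch (the function name `chronological_split` is an assumption) that reproduces them for 9,021 rows:

```python
def chronological_split(n_samples, train_frac=0.70, val_frac=0.15):
    """Return index ranges for a sequential train/validation/test split."""
    train_end = int(n_samples * train_frac)
    val_end = int(n_samples * (train_frac + val_frac))
    train = range(0, train_end)
    val = range(train_end, val_end)
    test = range(val_end, n_samples)
    return train, val, test

train_idx, val_idx, test_idx = chronological_split(9021)
sizes = (len(train_idx), len(val_idx), len(test_idx))
```

Because the ranges are contiguous and ordered, every validation and test row comes strictly after every training row.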

### 5.4 Three models developed

**Model 1: Ridge Regression**
```python
Ridge(alpha=1000.0)
```

**Model 2: XGBoost**
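```python
from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    max_depth=4,
    learning_rate=0.01,
    n_estimators=500,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42  # seed stated in the Reproducibility section
)
```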

**Model 3: LightGBM**
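```python
from lightgbm import LGBMRegressor

lgb_model = LGBMRegressor(
    max_depth=4,
    learning_rate=0.01,
    n_estimators=500,
    subsample=0.8,  # row sampling; LightGBM applies it only when subsample_freq > 0
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42  # seed stated in the Reproducibility section
)
```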

### 5.5 Hyperparameter meanings

**Ridge:**
- **alpha=1000.0:** L2 regularization strength (high = more shrinkage)

**XGBoost/LightGBM:**
- **max_depth=4:** Maximum tree depth (controls complexity)
- **learning_rate=0.01:** Shrinkage applied to each tree's contribution (lower = slower convergence, usually better generalization when paired with more rounds)
- **n_estimators=500:** Number of boosting rounds
- **subsample=0.8:** Row sampling ratio (prevents overfitting)
- **colsample_bytree=0.8:** Column sampling ratio per tree
- **reg_alpha=0.1:** L1 penalty on leaf weights (pushes small leaf values toward zero)
- **reg_lambda=1.0:** L2 penalty on leaf weights (shrinkage)

### 5.6 Initial hyperparameter values

**Ridge:**
- alpha: Tested [1, 10, 100, 1000]
- Selected: 1000 based on cross-validation

**XGBoost/LightGBM:**
- max_depth: Started at 4 (domain knowledge: shallow for noisy data)
- learning_rate: 0.01 (typical for 500 estimators)
- subsample: 0.8 (standard for financial data)
- reg_lambda: 1.0 (moderate regularization)

### 5.7 Initial predictions and results

**Initial Results (Validation Set):**

| Model | IC | p-value | RMSE |
|-------|---------|---------|-------|
| Ridge | 0.0451 | 0.097 | 0.0100 |
| XGBoost | 0.0612 | 0.024 | 0.0118 |
| LightGBM | 0.0613 | 0.024 | 0.0117 |

**Key Observations:**
- Tree models outperform the linear model on validation IC
- XGBoost and LightGBM perform similarly
- All ICs are statistically significant at α=0.10; only the tree models are also significant at α=0.05


## 6. Hyperparameter Tuning & Model Optimization

### 6.1 Hyperparameter tuning techniques

**Two Techniques Used:**

**Technique 1: Grid Search Cross-Validation**
```python
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.02],
    'n_estimators': [300, 500],
    'subsample': [0.7, 0.8],
    'reg_alpha': [0.05, 0.1],
    'reg_lambda': [0.5, 1.0]
}
```

**Technique 2: Early Stopping (Built-in LightGBM)**
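```python
import lightgbm as lgb

# lgb_model is the LGBMRegressor configured in Section 5.4
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],          # validation data monitored each round
    callbacks=[lgb.early_stopping(50)]  # stop after 50 rounds without improvement
)
```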

**Rationale:**
- **GridSearchCV:** Exhaustive search that finds the best combination within the specified grid (not necessarily the global optimum)
- **Early Stopping:** Prevents overfitting, saves computation time
- Complementary: GridSearch finds structure, Early Stopping optimizes iterations
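
For time-ordered data, the grid search needs a splitter that respects chronology and a scorer that matches the primary metric. The sketch below shows one way to wire this up with `TimeSeriesSplit` and a Spearman-based scorer. It uses Ridge and synthetic data to stay lightweight, and the helper `ic_score` is hypothetical; it is not the exact search configuration used for the results in this report.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def ic_score(y_true, y_pred):
    """Information Coefficient: Spearman rank correlation of predictions and targets."""
    return spearmanr(y_true, y_pred)[0]

# Synthetic data with a weak linear signal in the first column
rng = np.random.default_rng(42)
X = rng.standard_normal((600, 10))
y = 0.1 * X[:, 0] + rng.standard_normal(600)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [1, 10, 100, 1000]},  # the alpha grid listed in Section 5.6
    scoring=make_scorer(ic_score, greater_is_better=True),
    cv=TimeSeriesSplit(n_splits=5),            # each fold trains on the past only
)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

Passing an integer to `cv` would instead use ordinary K-fold splits, which mix past and future rows and can leak information in a time series.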

### 6.2 Best hyperparameter values

**Ridge (GridSearchCV):**
- **alpha:** 1000.0

**XGBoost (GridSearchCV):**
- **max_depth:** 4
- **learning_rate:** 0.01
- **n_estimators:** 500
- **subsample:** 0.8
- **colsample_bytree:** 0.8
- **reg_alpha:** 0.1
- **reg_lambda:** 1.0

**LightGBM (Early Stopping):**
- **max_depth:** 4
- **learning_rate:** 0.01
- **n_estimators:** 387 (stopped early from 500)
- **subsample:** 0.8
- **colsample_bytree:** 0.8
- **reg_alpha:** 0.1
- **reg_lambda:** 1.0

### 6.3 Performance metrics for comparison

**Metrics Used:**

**Primary Metric:**
- **IC (Information Coefficient):** Spearman rank correlation
  - Why: Standard in quantitative finance for ranking predictions
  - Target: IC > 0.05, used in this project as the threshold for a tradeable signal

**Secondary Metrics:**
- **MAE:** Mean Absolute Error (interpretable scale)
- **RMSE:** Root Mean Squared Error (penalizes large errors)
- **R²:** Coefficient of determination (variance explained)
- **Directional Accuracy:** % of correctly predicted directions
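
These metrics can be computed together as in the sketch below. The function name `evaluate` and the sign-match definition of directional accuracy are assumptions. The toy example uses predictions that rank every day correctly but are three times too large, which shows how IC can be perfect while R² is negative.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(y_true, y_pred):
    """Return IC (Spearman), its p-value, MAE, RMSE, R² and directional accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ic, p_value = spearmanr(y_pred, y_true)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "ic": float(ic),
        "p_value": float(p_value),
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
        "dir_acc": float(np.mean(np.sign(y_pred) == np.sign(y_true))),
    }

rng = np.random.default_rng(42)
y_true = rng.normal(0.0, 0.01, size=500)
y_pred = 3.0 * y_true  # perfect ordering, wrong scale
metrics = evaluate(y_true, y_pred)
```

R² penalizes errors in magnitude, whereas IC depends only on ordering, which is why the two can disagree in sign.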

### 6.4 Model performance comparison

**Validation Set Results:**

| Model | IC | p-value | MAE | RMSE | R² | Dir Acc |
|-------|---------|---------|--------|--------|--------|----------|
| Ridge | 0.0451 | 0.097 | 0.0066 | 0.0100 | -0.020 | 50.5% |
| XGBoost | 0.0612 | 0.024 | 0.0080 | 0.0118 | -0.402 | 52.1% |
| LightGBM | 0.0613 | 0.024 | 0.0076 | 0.0117 | -0.365 | 52.1% |
| **Ensemble** | **0.0683** | **0.012** | **0.0076** | **0.0115** | **-0.326** | **52.7%** |

**Test Set Results:**

| Model | IC | p-value |
|-------|---------|----------|
| Ridge | 0.0714 | 0.009 |
| XGBoost | 0.0592 | 0.030 |
| LightGBM | 0.0548 | 0.044 |
| **Ensemble** | **0.0631** | **0.020** |

### 6.5 Overfitting/Underfitting

**Analysis:**

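**Little Evidence of Overfitting:**
- Ensemble validation IC (0.0683) similar to test IC (0.0631)
- Gap = 0.0052 (acceptable for financial data)
- R² negative on validation for every model (expected for near-random daily returns)

**Evidence:**
- Train R² ≈ Val R² ≈ Test R² for the tree-based models and the ensemble (around -0.3 to -0.4 on validation); Ridge is closer to zero (-0.020)
- Ensemble IC stable across validation and test periods; individual models shift between periods (Ridge rises from 0.0451 to 0.0714, XGBoost and LightGBM fall slightly)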

**Underfitting Check:**
- IC > 0.05 target achieved
- Statistically significant (p < 0.05)
- Conclusion: Model captures signal without overfitting

### 6.6 Regularization techniques

**L1 Regularization (Lasso-style):**
- **reg_alpha=0.1** in XGBoost/LightGBM
- Penalizes the absolute size of leaf weights, pushing weak leaf outputs toward zero
- Reduces model complexity

**L2 Regularization (Ridge):**
- **alpha=1000.0** in Ridge
- **reg_lambda=1.0** in XGBoost/LightGBM
- Shrinks coefficients (Ridge) and leaf weights (boosted trees), reducing overfitting

**Other Techniques:**
- **Subsample=0.8:** Row sampling (like bagging)
- **Colsample_bytree=0.8:** Feature sampling per tree
- **Max_depth=4:** Limits tree complexity
- **Early Stopping:** Stops training when validation metric plateaus

### 6.7 Cross-validation vs without

**Time Series Cross-Validation:**
- Used TimeSeriesSplit with 5 folds
- Each fold: train on past, validate on future

**Results Comparison:**

| Method | Mean IC | Std IC |
|--------|---------|--------|
| Single Split | 0.0683 | - |
| 5-Fold CV | 0.0641 | 0.0089 |

**Difference:**
- CV gives a more robust estimate (fold-to-fold standard deviation of 0.0089, which measures spread across folds rather than a confidence interval)
- Single split slightly optimistic
- CV confirms model generalizes across different time periods
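
The mean and standard deviation of fold-level IC can be produced with a loop like the one below. This is a sketch on synthetic data with a Ridge model standing in for the actual models; the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.standard_normal((600, 5))
y = 0.2 * X[:, 0] + rng.standard_normal(600)

# Each fold trains on an expanding window of past rows and validates on the next block.
folds = list(TimeSeriesSplit(n_splits=5).split(X))

fold_ics = []
for train_idx, val_idx in folds:
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    ic = spearmanr(model.predict(X[val_idx]), y[val_idx])[0]
    fold_ics.append(ic)

mean_ic = float(np.mean(fold_ics))
std_ic = float(np.std(fold_ics, ddof=1))  # sample standard deviation across folds
```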

### 6.8 Effect of hyperparameter tuning

**Comparison: Two Tuning Techniques**

**Before Tuning (Default Parameters):**
- XGBoost IC: 0.0521
- LightGBM IC: 0.0518

**After GridSearchCV:**
- XGBoost IC: 0.0612 (+17.5%)
- Optimal: max_depth=4, learning_rate=0.01

**After Early Stopping:**
- LightGBM IC: 0.0613 (+18.3%)
- Stopped at iteration 387 (saved 113 iterations)

**Key Differences:**
- **GridSearchCV:** Searches structural parameters exhaustively within the grid, at the cost of longer training time
- **Early Stopping:** Faster, limits overfitting, and reached a comparable validation IC (0.0613 vs 0.0612)
- **Best Approach:** Combine both (Grid search structure + early stopping iterations)

### 6.9 Results before vs after tuning

**Summary:**

| Stage | IC (Val) | Improvement vs. Ridge baseline |
|-------|----------|-------------|
| Baseline (Ridge) | 0.0451 | - |
| Default XGB/LGB | 0.0520 | +15.3% |
| After Tuning | 0.0612 | +35.7% |
| Ensemble | 0.0683 | +51.4% |

**Impact:**
- Hyperparameter tuning critical for performance
- Ensemble of tuned models > single tuned model
- Tuning raised the tree models' validation IC to statistical significance at the 5% level (p = 0.024); the ensemble reaches p = 0.012


## 7. Model Evaluation and Analysis

### 7.1 Key takeaways from results

**Major Findings:**

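1. **IC = 0.068 Achieved:** Exceeds the 0.05 target on validation, statistically significant (p=0.012)
2. **Negative R² Expected:** Daily returns are near-random, so the predictions carry more squared error than simply predicting the mean return
3. **Ranking > Prediction:** IC positive while R² negative shows ranking ability
4. **Lagged Features Critical:** 51% of feature importance from lags
5. **Ensemble Wins on Validation:** Combining models improves validation IC by about 11% over the best single model (0.0683 vs 0.0613); on the test set Ridge has the highest IC (0.0714 vs 0.0631 for the ensemble)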

### 7.2 Interesting patterns discovered

**Pattern 1: Economic Lag-1 Dominance**
- E3_lag1 is most important feature (1.11%)
- Economic indicators with 1-day lag consistently top-ranked
- May suggest that markets react to economic data with a 1-day delay, although an importance share of about 1% is weak evidence on its own

**Pattern 2: Volatility-Sentiment Interaction**
- V9_x_S11 interaction in top 3 features
- Fear (sentiment) amplifies volatility impact
- Consistent with behavioral finance theory

**Pattern 3: Long-term Sentiment Matters**
- S4_lag20 (20-day lagged sentiment) is 2nd most important
- Market sentiment has lasting impact beyond short-term

### 7.3 Kaggle leaderboard comparison

**My Results:**
- Kaggle Score: 3.489 (modified Sharpe ratio)
- Rank: Top 30% of 4,105 participants

**Top Leaderboard:**
- Top Score: 17.0
- My Score: 3.489

**Gap Analysis:**
- Prediction quality good (IC=0.068)
- Allocation strategy needs improvement
- Top scorers likely optimize portfolio allocation, not just predictions

### 7.4 Impact of feature engineering strategies

**Ablation Study:**

| Features | IC (Val) | Improvement |
|----------|----------|-------------|
| Base only (94) | 0.0451 | Baseline |
| + Lagged (205) | 0.0589 | +30.6% |
| + Rolling (235) | 0.0647 | +43.5% |
| + Interactions (238) | 0.0683 | +51.4% |

**Conclusion:**
- Lagged features: Biggest single improvement (+30.6% over base)
- Rolling statistics: Added about 13 percentage points on top (+43.5% cumulative)
- Interactions: Final boost of about 8 percentage points (+51.4% cumulative)

### 7.5 Biggest challenges and solutions

**Challenge 1: Near-Random Daily Returns**
- Problem: 51.68% vs 48.32% positive/negative days
- Solution: Focus on ranking (IC) not prediction accuracy (R²)

**Challenge 2: Time-Series Leakage**
- Problem: Easy to accidentally use future data
- Solution: Strict chronological splits; engineered features use only backward-looking information (lags and trailing rolling windows)

**Challenge 3: High Dimensionality**
- Problem: 238 features, 9021 samples
- Solution: Regularization (L1+L2), tree-based models handle it

**Challenge 4: Outliers**
- Problem: Market crashes create extreme values
- Solution: RobustScaler, winsorization of target

### 7.6 Model outperformance analysis

**Best Model: Ensemble**

**Why Ensemble Wins:**
1. **Diversity:** XGBoost (level-wise tree growth by default) + LightGBM (leaf-wise growth) capture different patterns
2. **Error Averaging:** Errors that are not perfectly correlated partially cancel out
3. **Robustness:** Ensemble handles different market regimes better
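
The error-averaging argument can be illustrated with a small simulation. This sketch assumes an equal-weight average (which Section 8.2 implies) of two hypothetical predictors whose errors are independent; the signal and noise levels are made up for illustration and are not the project's predictions.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between predictions and target."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(42)
n = 5000
signal = rng.normal(0.0, 0.01, size=n)           # quantity both models try to predict
pred_a = signal + rng.normal(0.0, 0.01, size=n)  # model A: signal plus independent error
pred_b = signal + rng.normal(0.0, 0.01, size=n)  # model B: signal plus independent error

ensemble = (pred_a + pred_b) / 2.0               # equal-weight average

rmse_a = rmse(pred_a, signal)
rmse_b = rmse(pred_b, signal)
rmse_ens = rmse(ensemble, signal)
```

With fully independent errors of equal size, averaging two models cuts the error standard deviation by a factor of about √2. Real models trained on the same data have correlated errors, so the gain is smaller.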

**Individual Model Comparison:**
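- **Ridge:** Linear, misses non-linear patterns; lowest validation IC (0.0451) but highest test IC (0.0714)
- **XGBoost:** Validation IC 0.0612, essentially tied with LightGBM (0.0613); slightly overfits
- **LightGBM:** Faster, similar performance to XGBoost
- **Ensemble:** Combines strengths, validation IC +11.6% over XGBoost (0.0683 vs 0.0612)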

### 7.7 Dataset biases and their effects

**Bias 1: Survivorship Bias**
- S&P 500 only includes surviving companies
- May underestimate tail risk

**Bias 2: Recency Effect**
- Recent data (2020s) overrepresented in validation/test
- Model may be optimized for current regime

**Bias 3: Missing Data Bias**
- Earlier years (1990s) have more missing features
- Forward-fill may propagate stale information

**Impact on Predictions:**
- Model works better in recent, complete data periods
- Performance may degrade in extreme regimes not seen in training


## 8. Conclusions

### 8.1 What would you do differently?

**Improvements:**

1. **More Feature Engineering:**
   - Add regime-switching features (bull/bear market indicators)
   - Create macro factors (PCA on economic indicators)
   - Add cross-sectional features (sector rotations)

2. **Advanced Models:**
   - Try CatBoost (better categorical handling)
   - Neural networks (LSTM for time series)
   - Stacking ensemble (meta-learner on top)

3. **Better Validation:**
   - Walk-forward optimization (monthly retraining)
   - Out-of-sample testing on different time periods
   - Stress testing on crisis periods

4. **Allocation Optimization:**
   - Optimize for Sharpe ratio directly
   - Kelly criterion for position sizing
   - Dynamic allocation based on prediction confidence

### 8.2 How could the model be further improved?

**Short-term Improvements:**
1. **Hyperparameter tuning:** Bayesian optimization instead of GridSearch
2. **Feature selection:** SHAP values for better interpretability
3. **Ensemble weights:** Learn optimal weights instead of equal weighting

**Medium-term Improvements:**
1. **Alternative data:** Add social media sentiment, news sentiment
2. **Higher frequency:** Use intraday data if available
3. **Regime detection:** Different models for bull/bear markets

**Long-term Improvements:**
1. **Reinforcement learning:** Learn optimal trading policy
2. **Online learning:** Update model daily with new data
3. **Multi-asset:** Extend to bonds, commodities, currencies

### 8.3 Experiments with more data/resources

**With More Data:**
1. **Longer history:** Train on 50+ years instead of the roughly 36 years covered by 9,021 trading days
2. **Higher frequency:** Hourly or minute-level predictions
3. **More assets:** Cross-sectional predictions across stocks
4. **Alternative data:** Satellite imagery, credit card data, web traffic

**With More Compute:**
1. **Deep learning:** Transformer models for time series
2. **Larger ensembles:** 50+ models instead of 3
3. **Extensive tuning:** Grid search over 1000+ combinations
4. **Monte Carlo:** Simulate 10,000 different market scenarios

### 8.4 Ethical considerations

**Data Privacy:**
- All data is public market data (no privacy concerns)
- No personal information used

**Market Impact:**
- Model designed for institutional use (not retail)
- Could contribute to market efficiency
- Risk: If widely adopted, signal may disappear (alpha decay)

**Fairness:**
- No demographic biases (market data only)
- Accessible to institutional investors only (requires infrastructure)

**Transparency:**
- Black-box models (XGBoost/LightGBM) hard to explain
- SHAP values provide some interpretability
- Should disclose limitations to users

**Financial Stability:**
- Algorithmic trading can amplify volatility
- Model should include circuit breakers
- Risk management critical


## Team Contributions

### Individual Project

**Team Member:** Divit Pratap Singh (Solo)

**Contributions:**
- Data collection and preprocessing: 100%
- Exploratory data analysis: 100%
- Feature engineering: 100%
- Model development: 100%
- Hyperparameter tuning: 100%
- Model evaluation: 100%
- Report writing: 100%
- Code documentation: 100%

**Total:** Individual project, all work completed independently.


## Reproducibility

### Code Repository
**Location:** https://github.com/dps08/kaggle-market-prediction (local working copy: `/Users/divitpratapsingh/project-market-prediction/`)

### Key Files:
- `notebooks/market_prediction_analysis.ipynb` - Main analysis notebook
- `results/final_best_model.csv` - Model performance results
- `results/feature_importance.csv` - Feature rankings
- `results/all_model_results.csv` - Comprehensive results

### Random Seeds:
All models trained with `random_state=42` for reproducibility

### Dependencies:
```
python==3.11
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
xgboost==2.0.0
lightgbm==4.0.0
scipy==1.11.2
```
