# Getting Started with Stock Market Prediction

This notebook provides a quick introduction to the repository and demonstrates basic usage of different approaches.

## What You'll Learn

1. Loading and exploring stock market data
2. Basic feature engineering
3. Training a simple model
4. Evaluating predictions
5. Overview of different approaches

## Prerequisites

Make sure you've installed all dependencies:
```bash
pip install -r requirements.txt
```

**Note**: This notebook adds the project root to the Python path so that modules under `src/` can be imported. This assumes you run it from the `notebooks/` directory; no other setup is needed.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')

print("Libraries imported successfully!")

Libraries imported successfully!


In [1]:
# Setup: Add project root to Python path
import sys
from pathlib import Path

# Get the project root directory (parent of notebooks/)
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")
print("✓ Python path configured successfully!")

Project root: /home/d/Documents/stock_market
✓ Python path configured successfully!


## 1. Loading Stock Market Data

Let's start by loading some stock price data. We'll use Apple (AAPL) as an example.

In [2]:
from src.utils.data_loader import load_stock_data

# Load Apple stock data
data = load_stock_data('AAPL', start='2020-01-01', end='2023-12-31')

print(f"Data shape: {data.shape}")
print(f"\nFirst few rows:")
data.head()

Downloading AAPL data from 2020-01-01 to 2023-12-31...
Successfully loaded 1006 rows of data
Data shape: (1006, 7)

First few rows:


| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-01-02 | 71.545882 | 72.598884 | 71.292296 | 72.538506 | 135480400 | 0.0 | 0.0 |
| 2020-01-03 | 71.765674 | 72.594063 | 71.608692 | 71.833298 | 146322800 | 0.0 | 0.0 |
| 2020-01-06 | 70.954218 | 72.444351 | 70.703042 | 72.405708 | 118387200 | 0.0 | 0.0 |
| 2020-01-07 | 72.415337 | 72.671341 | 71.845369 | 72.065147 | 108872000 | 0.0 | 0.0 |
| 2020-01-08 | 71.768086 | 73.526303 | 71.768086 | 73.224411 | 132079200 | 0.0 | 0.0 |


In [3]:
# Basic statistics
print("Basic Statistics:")
print(data.describe())


### Understanding the Data

- **Open**: Price at market open
- **High**: Highest price during the day
- **Low**: Lowest price during the day
- **Close**: Price at market close
- **Volume**: Number of shares traded
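- **Dividends**: Cash dividend paid per share on that date (0.0 on most days)
- **Stock Splits**: Split ratio applied on that date (0.0 when there is no split)

The loaded data has no separate **Adj Close** column, so the rest of this notebook uses **Close**. If the loader returns prices already adjusted for splits and dividends (as yfinance's `Ticker.history` does by default), **Close** is comparable over time; otherwise, prefer an adjusted closing price for modeling.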

In [None]:
# Visualize stock price
from src.utils.visualization import plot_stock_price

plot_stock_price(data, title='Apple (AAPL) Stock Price')

## 2. Feature Engineering

Raw prices alone aren't always the best features. Let's add technical indicators that capture trends, momentum, and volatility.

In [3]:
from src.utils.preprocessing import calculate_technical_indicators, calculate_returns

# Remember the column count before adding features
n_original_cols = data.shape[1]

# Calculate returns (percentage changes)
data = calculate_returns(data, periods=[1, 5, 10])

# Add technical indicators
data = calculate_technical_indicators(data)

print(f"Data shape after feature engineering: {data.shape}")
print(f"\nNew columns: {data.shape[1] - n_original_cols} features added")
print(f"\nSample of new features:")
data[['Close', 'Returns_1', 'RSI', 'MACD', 'SMA_20', 'BB_Position']].tail()

Data shape after feature engineering: (1006, 44)

New columns: 37 features added

Sample of new features:


| Date | Close | Returns_1 | RSI | MACD | SMA_20 | BB_Position |
|---|---|---|---|---|---|---|
| 2023-12-22 | 191.974686 | -0.005548 | 59.2462 | 2.635696 | 192.042602 | 0.494118 |
| 2023-12-26 | 191.429306 | -0.002841 | 49.031989 | 2.306987 | 192.204234 | 0.429476 |
| 2023-12-27 | 191.528442 | 0.000518 | 52.291459 | 2.031071 | 192.340578 | 0.422936 |
| 2023-12-28 | 191.954849 | 0.002226 | 47.920446 | 1.825766 | 192.549313 | 0.438336 |
| 2023-12-29 | 190.913651 | -0.005424 | 40.185206 | 1.561049 | 192.67723 | 0.302953 |
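
The repository helpers hide how these columns are computed. As a rough idea only, here is a minimal pandas sketch of three of them on synthetic prices. The function name `add_basic_indicators`, the window lengths, the 2-standard-deviation band width, and the simple-average RSI variant are all assumptions for illustration; the code in `src/utils/preprocessing` may use different formulas.

```python
import numpy as np
import pandas as pd

def add_basic_indicators(df, window=20, rsi_window=14):
    """Hypothetical helper: add a few common indicators to a frame with a 'Close' column."""
    out = df.copy()
    close = out["Close"]

    # One-day simple return: P_t / P_{t-1} - 1
    out["Returns_1"] = close.pct_change(1)

    # Simple moving average and Bollinger Bands (mean +/- 2 rolling standard deviations)
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    upper, lower = sma + 2 * std, sma - 2 * std
    out[f"SMA_{window}"] = sma
    # Position of the close inside the band: 0 = lower band, 1 = upper band
    out["BB_Position"] = (close - lower) / (upper - lower)

    # RSI from rolling average gains and losses (simple-average variant)
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    out["RSI"] = 100 - 100 / (1 + gain / loss)
    return out

# Synthetic random-walk prices, used only so the sketch runs without a download
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250)))},
    index=pd.bdate_range("2023-01-02", periods=250),
)
features = add_basic_indicators(prices)
print(features.tail())
```

Each rolling indicator is undefined for its first `window - 1` rows, which is why the notebook later drops rows containing NaN values.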


In [None]:
# Visualize some technical indicators
from src.utils.visualization import plot_bollinger_bands

plot_bollinger_bands(data, n_days=200, title='AAPL with Bollinger Bands')

## 3. Training a Simple Model

Let's train a Random Forest classifier to predict whether the stock will go up or down tomorrow.

In [4]:
from src.utils.preprocessing import (
    create_lagged_features, 
    create_target_variable,
    train_test_split_temporal
)

# Add lagged features (values from 1, 2, 3, and 5 trading days earlier)
data = create_lagged_features(
    data,
    columns=['Close', 'Volume', 'RSI', 'MACD'],
    lags=[1, 2, 3, 5]
)

# Create target: Will price go up tomorrow? (1=Yes, 0=No)
data = create_target_variable(data, target_type='direction', horizon=1)

# Remove rows with NaN values
data = data.dropna()

print(f"Clean data shape: {data.shape}")
print(f"\nTarget distribution:")
print(data['Target'].value_counts())
print(f"Up days: {data['Target'].mean()*100:.1f}%")

Clean data shape: (807, 61)

Target distribution:
Target
1    418
0    389
Name: count, dtype: int64
Up days: 51.8%
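
To show what lagging and the direction target mean in practice, the sketch below builds both with `shift` on a toy series. The helper name `add_lags_and_target` and the `_lag_` column naming are assumptions; the repository's `create_lagged_features` and `create_target_variable` may name or compute things differently.

```python
import pandas as pd

def add_lags_and_target(df, columns, lags, horizon=1):
    """Hypothetical helper: lagged copies of columns plus an up/down target."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            # shift(lag) moves past values forward, so row t sees the value from t - lag
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)

    future_close = out["Close"].shift(-horizon)
    # 1 if the close `horizon` rows ahead is higher than the current close, else 0.
    # Rows with no future price stay NaN so that dropna() removes them.
    out["Target"] = (future_close > out["Close"]).astype(float).where(future_close.notna())
    return out

toy = pd.DataFrame({"Close": [10.0, 11.0, 10.5, 10.8, 10.7, 11.2]})
toy = add_lags_and_target(toy, columns=["Close"], lags=[1, 2]).dropna()
print(toy)
```

Lags look backward and the target looks forward, so the first rows (no history) and the last row (no future price) are the ones lost to `dropna()`.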


In [6]:
# Prepare features and target
feature_cols = [col for col in data.columns if col not in ['Target']]
X = data[feature_cols]
y = data['Target']

# Split data (preserving temporal order!)
train_data, val_data, test_data = train_test_split_temporal(
    data, test_size=0.2, validation_size=0.1
)

X_train = train_data[feature_cols]
y_train = train_data['Target']
X_test = test_data[feature_cols]
y_test = test_data['Target']

Data split:
  Train: 564 samples (69.9%)
  Validation: 81 samples (10.0%)
  Test: 162 samples (20.1%)
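
A temporal split slices the rows in order instead of shuffling them, so the model never trains on days that come after the ones it is tested on. The following is a minimal sketch under that assumption; `temporal_split` is a hypothetical stand-in, not the repository's `train_test_split_temporal`, although this cut-point rule happens to give the same sizes as the output above for 807 rows.

```python
import pandas as pd

def temporal_split(df, test_size=0.2, validation_size=0.1):
    """Hypothetical helper: split rows by position, oldest first."""
    n = len(df)
    train_end = int(n * (1 - test_size - validation_size))
    val_end = int(n * (1 - test_size))
    # Oldest rows train the model, the next block validates, the newest block tests
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

frame = pd.DataFrame({"x": range(807)}, index=pd.bdate_range("2020-10-01", periods=807))
train, val, test = temporal_split(frame)
print(len(train), len(val), len(test))
print(train.index.max() < val.index.min() < test.index.min())
```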


In [7]:
from src.classical_ml.random_forest import RandomForestPredictor

# Train Random Forest classifier
model = RandomForestPredictor(
    task='classification',
    n_estimators=100,
    max_depth=10
)

model.fit(X_train, y_train, feature_names=feature_cols)

Training Random Forest classification model...
Features: 60, Samples: 564
Model trained successfully!
Training classification score: 1.0000


<src.classical_ml.random_forest.RandomForestPredictor at 0x73b9610bd760>

## 4. Evaluating the Model

In [8]:
from src.utils.evaluation import evaluate_classification

# Make predictions
predictions = model.predict(X_test)

# Evaluate
metrics = evaluate_classification(y_test.values, predictions)

Classification Metrics:
----------------------------------------
Accuracy:  0.4753 (47.53%)
Precision: 0.8571
Recall:    0.0667
F1 Score:  0.1237

Confusion Matrix:
[[71  1]
 [84  6]]

Classification Report:
              precision    recall  f1-score   support

        Down       0.46      0.99      0.63        72
          Up       0.86      0.07      0.12        90

    accuracy                           0.48       162
   macro avg       0.66      0.53      0.37       162
weighted avg       0.68      0.48      0.35       162



In [None]:
# Feature importance
model.plot_feature_importance(top_n=15)

## 5. Overview of Approaches

This repository covers many approaches. Here's a quick guide:

### Traditional Statistical Methods
- **ARIMA**: Best for short-term forecasting, interpretable
- **GARCH**: Focuses on volatility modeling
- Start here if you want interpretable results

### Classical Machine Learning  
- **Random Forest**: Robust, good baseline
- **XGBoost/LightGBM**: Gradient-boosted trees, often strong on tabular features
- **SVM**: Good for classification
- Start here for solid performance without deep learning complexity

### Deep Learning
- **LSTM**: Captures long-term patterns in sequences
- **GRU**: Simpler alternative to LSTM
- **CNN-LSTM**: Combines pattern detection with sequences
- Use when you have lots of data and computational resources

### Modern Approaches
- **Transformers**: Attention-based sequence models, typically data-hungry
- **Reinforcement Learning**: Learns complete trading strategies
- Use for cutting-edge research or when other methods plateau

## Next Steps

1. **Explore other notebooks**:
   - `02_statistical_methods.ipynb` - Traditional time series
   - `03_classical_ml.ipynb` - More ML algorithms
   - `04_deep_learning.ipynb` - Neural networks
   - `05_transformers.ipynb` - Modern architectures
   - `06_reinforcement_learning.ipynb` - RL trading agents

2. **Read the GLOSSARY.md** for financial term definitions

3. **Experiment with different**:
   - Stocks and time periods
   - Feature engineering approaches
   - Model architectures
   - Evaluation metrics

4. **Important reminders**:
   - Past performance ≠ future results
   - Always validate thoroughly
   - Consider transaction costs
   - This is for education, not financial advice!
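
To make the transaction cost reminder concrete, here is a small sketch that charges a fixed proportional cost every time a position changes. The function name `net_strategy_returns`, the 0.1% cost per trade, and the toy returns are all assumptions chosen for illustration.

```python
import numpy as np

def net_strategy_returns(asset_returns, positions, cost_per_trade=0.001):
    """Hypothetical helper: per-period strategy returns after trading costs.

    positions[t] is the position held during period t (1 = long, 0 = flat).
    """
    asset_returns = np.asarray(asset_returns, dtype=float)
    positions = np.asarray(positions, dtype=float)
    gross = positions * asset_returns
    # A trade happens whenever the position changes; entering from flat counts too
    trades = np.abs(np.diff(positions, prepend=0.0))
    return gross - trades * cost_per_trade

returns = [0.01, -0.005, 0.02, 0.0]
positions = [1, 0, 1, 1]
gross_total = float(np.sum(np.asarray(positions) * np.asarray(returns)))
net_total = float(net_strategy_returns(returns, positions).sum())
print(f"Gross: {gross_total:.4f}, net of costs: {net_total:.4f}")
```

A strategy that switches position often pays this cost often, so a small edge in directional accuracy can disappear once costs are included.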

## Quick Comparison of Models

Let's try a few different approaches and compare them:

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from src.utils.evaluation import compare_models

# Candidate models, all trained on the same temporal split
models = {
    'Random Forest': RandomForestPredictor(task='classification', n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
}

results = {}

# Use `clf` as the loop variable so the `model` trained earlier is not overwritten
for name, clf in models.items():
    print(f"\nTraining {name}...")
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)

    # Score each model on the held-out test set
    results[name] = {
        'Accuracy': accuracy_score(y_test, predictions),
        'Precision': precision_score(y_test, predictions),
        'Recall': recall_score(y_test, predictions),
        'F1': f1_score(y_test, predictions)
    }

# Compare
comparison_df = compare_models(results)
print("\n" + "="*60)
print("Model Comparison:")
print("="*60)
print(comparison_df)

In [None]:
# Visualize comparison
comparison_df.plot(kind='bar', figsize=(12, 6))
plt.title('Model Performance Comparison')
plt.xlabel('Model')
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Conclusion

You've now:
- ✅ Loaded stock market data
- ✅ Created technical indicators
- ✅ Trained machine learning models
- ✅ Evaluated predictions
- ✅ Compared different approaches

This is just the beginning! Explore the other notebooks to dive deeper into specific methods.

**Remember**: Stock market prediction is extremely challenging. Use these tools for learning and always validate thoroughly before considering any real-world application.