# Improved Stock Price Prediction using LSTM

This notebook is a corrected and extended version of an LSTM-based stock price prediction model. It fixes the problems found in the original version; the changes are listed below and summarized again at the end.

## Key Improvements:
- Fixed variable scope issues
- Proper data preprocessing
- Complete LSTM model implementation
- Enhanced technical indicators
- Better error handling
- Comprehensive evaluation metrics

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import warnings
from datetime import datetime, timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# TensorFlow and Keras imports
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
sns.set_style('whitegrid')

print("All libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")

In [None]:
# Fetch stock data
def fetch_stock_data(symbols, years=3):
    """
    Fetch stock data for multiple symbols.
    """
    print("Fetching stock data...")
    end_date = datetime.now()
    start_date = end_date - timedelta(days=years*365)
    
    stock_data = {}
    company_names = {
        'AAPL': 'Apple Inc.',
        'GOOGL': 'Alphabet Inc.',
        'MSFT': 'Microsoft Corporation',
        'AMZN': 'Amazon.com Inc.',
        'TSLA': 'Tesla Inc.',
        'META': 'Meta Platforms Inc.',
        'NVDA': 'NVIDIA Corporation'
    }
    
    for symbol in symbols:
        try:
            print(f"Fetching data for {symbol}...")
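            data = yf.download(symbol, start=start_date, end=end_date)
            # Recent yfinance releases can return MultiIndex columns even for a
            # single ticker; flatten them so data['Close'] is a plain Series
            if isinstance(data.columns, pd.MultiIndex):
                data.columns = data.columns.get_level_values(0)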
            if not data.empty:
                data['Symbol'] = symbol
                data['Company'] = company_names.get(symbol, symbol)
                stock_data[symbol] = data
                print(f"Successfully fetched {len(data)} records for {symbol}")
            else:
                print(f"Warning: No data found for {symbol}")
        except Exception as e:
            print(f"Error fetching data for {symbol}: {str(e)}")
            
    return stock_data

# List of stocks to analyze
stocks = ['AAPL', 'GOOGL', 'MSFT', 'AMZN']

# Fetch the data
stock_data = fetch_stock_data(stocks, years=3)

# Display basic info about the fetched data
for symbol, data in stock_data.items():
    print(f"\n{symbol} data shape: {data.shape}")
    print(f"Date range: {data.index.min().date()} to {data.index.max().date()}")

In [None]:
# Display the data for Apple (AAPL) as an example
if 'AAPL' in stock_data:
    print("Apple Stock Data (First 5 rows):")
    print(stock_data['AAPL'].head())
    
    print("\nApple Stock Data (Last 5 rows):")
    print(stock_data['AAPL'].tail())
    
    print("\nBasic Statistics:")
    print(stock_data['AAPL'][['Open', 'High', 'Low', 'Close', 'Volume']].describe())

In [None]:
def add_technical_indicators(data):
    """
    Add technical indicators to the stock data.
    """
    df = data.copy()
    
    # Moving averages
    df['MA_7'] = df['Close'].rolling(window=7).mean()
    df['MA_21'] = df['Close'].rolling(window=21).mean()
    df['MA_50'] = df['Close'].rolling(window=50).mean()
    
    # Exponential moving averages
    df['EMA_12'] = df['Close'].ewm(span=12).mean()
    df['EMA_26'] = df['Close'].ewm(span=26).mean()
    
    # MACD
    df['MACD'] = df['EMA_12'] - df['EMA_26']
    df['MACD_Signal'] = df['MACD'].ewm(span=9).mean()
    df['MACD_Histogram'] = df['MACD'] - df['MACD_Signal']
    
    # RSI
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))
    
    # Bollinger Bands
    df['BB_Middle'] = df['Close'].rolling(window=20).mean()
    bb_std = df['Close'].rolling(window=20).std()
    df['BB_Upper'] = df['BB_Middle'] + (bb_std * 2)
    df['BB_Lower'] = df['BB_Middle'] - (bb_std * 2)
    df['BB_Width'] = df['BB_Upper'] - df['BB_Lower']
    df['BB_Position'] = (df['Close'] - df['BB_Lower']) / df['BB_Width']
    
    # Volume indicators
    df['Volume_MA'] = df['Volume'].rolling(window=20).mean()
    df['Volume_Ratio'] = df['Volume'] / df['Volume_MA']
    
    # Price features
    df['High_Low_Pct'] = (df['High'] - df['Low']) / df['Close'] * 100
    df['Price_Change'] = df['Close'].pct_change()
    df['Price_Range'] = df['High'] - df['Low']
    
    # Volatility
    df['Volatility'] = df['Price_Change'].rolling(window=20).std()
    
    return df

# Add technical indicators to Apple data
if 'AAPL' in stock_data:
    aapl_enhanced = add_technical_indicators(stock_data['AAPL'])
    print("Enhanced Apple data with technical indicators:")
    print(f"Shape: {aapl_enhanced.shape}")
    print("\nNew columns added:")
    new_columns = [col for col in aapl_enhanced.columns if col not in stock_data['AAPL'].columns]
    print(new_columns)
    
    # Display sample data
    print("\nSample technical indicators (last 5 rows):")
    print(aapl_enhanced[['Close', 'MA_7', 'MA_21', 'RSI', 'MACD', 'BB_Position']].tail())
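
The RSI in `add_technical_indicators` averages gains and losses with a simple 14-day rolling mean. Wilder's original formulation instead smooths them recursively with a factor of 1/period. A minimal sketch of that variant is below, for comparison only. The function name `wilder_rsi` and the synthetic price series are illustrative and not part of the notebook, and the exponential average here is seeded with the first observation rather than with a simple average of the first 14 values.

```python
import numpy as np
import pandas as pd


def wilder_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (hypothetical helper, not used above)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves, zeros elsewhere
    loss = (-delta).clip(lower=0)    # magnitude of negative moves
    # alpha = 1/period with adjust=False applies the recursive update
    # avg_t = ((period - 1) * avg_{t-1} + x_t) / period
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
rsi = wilder_rsi(prices)
print(rsi.dropna().describe())
```

Swapping this in for the rolling-mean version changes the feature values, so the model would need to be retrained.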

In [None]:
# Visualize the stock data with moving averages
if 'AAPL' in stock_data:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Stock price with moving averages
    ax1 = axes[0, 0]
    ax1.plot(aapl_enhanced.index, aapl_enhanced['Close'], label='Close Price', linewidth=2)
    ax1.plot(aapl_enhanced.index, aapl_enhanced['MA_7'], label='7-day MA', alpha=0.7)
    ax1.plot(aapl_enhanced.index, aapl_enhanced['MA_21'], label='21-day MA', alpha=0.7)
    ax1.plot(aapl_enhanced.index, aapl_enhanced['MA_50'], label='50-day MA', alpha=0.7)
    ax1.set_title('AAPL Stock Price with Moving Averages')
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Price ($)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: RSI
    ax2 = axes[0, 1]
    ax2.plot(aapl_enhanced.index, aapl_enhanced['RSI'], label='RSI', color='orange')
    ax2.axhline(y=70, color='r', linestyle='--', alpha=0.7, label='Overbought (70)')
    ax2.axhline(y=30, color='g', linestyle='--', alpha=0.7, label='Oversold (30)')
    ax2.set_title('RSI (Relative Strength Index)')
    ax2.set_xlabel('Date')
    ax2.set_ylabel('RSI')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: MACD
    ax3 = axes[1, 0]
    ax3.plot(aapl_enhanced.index, aapl_enhanced['MACD'], label='MACD', color='blue')
    ax3.plot(aapl_enhanced.index, aapl_enhanced['MACD_Signal'], label='Signal', color='red')
    ax3.bar(aapl_enhanced.index, aapl_enhanced['MACD_Histogram'], label='Histogram', alpha=0.3, color='green')
    ax3.set_title('MACD')
    ax3.set_xlabel('Date')
    ax3.set_ylabel('MACD')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Volume
    ax4 = axes[1, 1]
    ax4.bar(aapl_enhanced.index, aapl_enhanced['Volume'], alpha=0.7, color='purple')
    ax4.plot(aapl_enhanced.index, aapl_enhanced['Volume_MA'], label='Volume MA', color='red', linewidth=2)
    ax4.set_title('Volume with Moving Average')
    ax4.set_xlabel('Date')
    ax4.set_ylabel('Volume')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
def prepare_lstm_data(data, lookback_window=60, test_size=0.2, target_column='Close'):
    """
    Prepare data for LSTM model training.
    """
    print(f"Preparing LSTM data with lookback window: {lookback_window}")
    
    # Add technical indicators
    enhanced_data = add_technical_indicators(data)
    
    # Select features for prediction
    feature_columns = [
        'Open', 'High', 'Low', 'Close', 'Volume',
        'MA_7', 'MA_21', 'MA_50', 'EMA_12', 'EMA_26',
        'MACD', 'MACD_Signal', 'RSI', 'BB_Position',
        'Volume_Ratio', 'High_Low_Pct', 'Volatility'
    ]
    
    # Remove columns with too many NaN values
    available_columns = []
    for col in feature_columns:
        if col in enhanced_data.columns:
            # Check if column has enough valid data
            valid_ratio = enhanced_data[col].notna().sum() / len(enhanced_data)
            if valid_ratio > 0.8:  # At least 80% valid data
                available_columns.append(col)
    
    print(f"Using {len(available_columns)} features: {available_columns}")
    
    # Create dataset with selected features
    dataset = enhanced_data[available_columns].copy()
    
    # Forward fill missing values and then drop remaining NaN
    # (fillna(method='ffill') is deprecated in recent pandas; use ffill())
    dataset = dataset.ffill().dropna()
    
    if len(dataset) < lookback_window + 50:  # Need minimum data
        raise ValueError(f"Not enough data after cleaning. Got {len(dataset)} rows, need at least {lookback_window + 50}")
    
    print(f"Dataset shape after cleaning: {dataset.shape}")
    
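    # Scale the data. Fit the scaler on the training rows only so that
    # test-set minima and maxima do not leak into training.
    n_train_rows = int((len(dataset) - lookback_window) * (1 - test_size)) + lookback_window
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(dataset.values[:n_train_rows])
    scaled_data = scaler.transform(dataset.values)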
    
    # Create sequences for LSTM
    X, y = [], []
    target_idx = available_columns.index(target_column)
    
    for i in range(lookback_window, len(scaled_data)):
        X.append(scaled_data[i-lookback_window:i])
        y.append(scaled_data[i, target_idx])
    
    X, y = np.array(X), np.array(y)
    
    # Split into train and test sets
    split_idx = int(len(X) * (1 - test_size))
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    print(f"Training data shape: {X_train.shape}")
    print(f"Testing data shape: {X_test.shape}")
    
    return X_train, X_test, y_train, y_test, scaler, available_columns, dataset

# Prepare the data
if 'AAPL' in stock_data:
    try:
        X_train, X_test, y_train, y_test, scaler, feature_columns, dataset = prepare_lstm_data(
            stock_data['AAPL'], 
            lookback_window=60, 
            test_size=0.2
        )
        print("\nData preparation completed successfully!")
    except Exception as e:
        print(f"Error in data preparation: {str(e)}")
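
The loop in `prepare_lstm_data` turns a 2-D array of scaled features into overlapping windows: each sample holds the previous `lookback_window` rows, and its target is the target column of the row that follows. The toy example below reproduces that windowing on a 10-row array so the resulting shapes are easy to check. `make_sequences` is an illustrative name, not a function from the notebook.

```python
import numpy as np


def make_sequences(values, lookback, target_idx):
    """Build (samples, lookback, features) windows and next-step targets."""
    X, y = [], []
    for i in range(lookback, len(values)):
        X.append(values[i - lookback:i])   # rows i-lookback .. i-1
        y.append(values[i, target_idx])    # target column at row i
    return np.array(X), np.array(y)


# Toy data: 10 time steps, 2 features (column 0 plays the role of 'Close')
toy = np.arange(20, dtype=float).reshape(10, 2)
X, y = make_sequences(toy, lookback=3, target_idx=0)

print(X.shape)  # (7, 3, 2)
print(y.shape)  # (7,)
print(y[0])     # 6.0, the first column of row 3
```

With 10 rows and a lookback of 3, the first three rows can only serve as history, which is why 7 samples remain.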

In [None]:
def build_lstm_model(input_shape, lstm_units=(50, 50, 50), dropout_rate=0.2, learning_rate=0.001):
    """
    Build and compile the LSTM model.
    """
    print("Building LSTM model...")
    
    model = Sequential()
    
    # First LSTM layer
    model.add(LSTM(units=lstm_units[0], 
                  return_sequences=True, 
                  input_shape=input_shape))
    model.add(Dropout(dropout_rate))
    model.add(BatchNormalization())
    
    # Additional LSTM layers
    for i in range(1, len(lstm_units)):
        return_sequences = i < len(lstm_units) - 1
        model.add(LSTM(units=lstm_units[i], return_sequences=return_sequences))
        model.add(Dropout(dropout_rate))
        if return_sequences:
            model.add(BatchNormalization())
    
    # Dense layers
    model.add(Dense(25, activation='relu'))
    model.add(Dropout(dropout_rate/2))
    model.add(Dense(1))
    
    # Compile model
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, 
                 loss='mean_squared_error',
                 metrics=['mae'])
    
    print(f"Model built with {model.count_params():,} parameters")
    print("\nModel Architecture:")
    model.summary()
    
    return model

# Build the model
if 'X_train' in locals():
    input_shape = (X_train.shape[1], X_train.shape[2])
    model = build_lstm_model(
        input_shape=input_shape,
        lstm_units=[64, 32, 16],
        dropout_rate=0.2,
        learning_rate=0.001
    )
else:
    print("Training data not available. Please run the data preparation cell first.")
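
`model.summary()` reports a parameter count per layer. For a Keras LSTM layer with bias, that count is 4 × units × (input_dim + units + 1): one set of input weights, recurrent weights, and biases for each of the four gates. The sketch below applies the formula to the `[64, 32, 16]` stack used above. It assumes all 17 candidate features pass the NaN filter, which depends on the data, and the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM layer with bias: 4 gates x (input + recurrent + bias)."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Kernel plus bias of a fully connected layer."""
    return input_dim * units + units


def batchnorm_params(features):
    """gamma, beta, moving mean, moving variance (the last two are non-trainable)."""
    return 4 * features


n_features = 17            # assumption: every candidate feature is kept
lstm_units = [64, 32, 16]

total = 0
prev = n_features
for i, units in enumerate(lstm_units):
    total += lstm_params(prev, units)
    # build_lstm_model adds BatchNormalization after every LSTM layer except the last
    if i < len(lstm_units) - 1:
        total += batchnorm_params(units)
    prev = units
total += dense_params(prev, 25) + dense_params(25, 1)

print(total)  # 37379 under these assumptions
```

If the printed total differs from `model.count_params()`, the most likely cause is a different number of input features.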

In [None]:
# Train the model
if 'model' in locals() and 'X_train' in locals():
    print("Training LSTM model...")
    
    # Callbacks
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-7),
        ModelCheckpoint('best_stock_model.h5', monitor='val_loss', save_best_only=True)
    ]
    
    # Train model
    history = model.fit(
        X_train, y_train,
        epochs=50,  # Reduced for demonstration
        batch_size=32,
        validation_split=0.1,
        callbacks=callbacks,
        verbose=1
    )
    
    print("\nTraining completed!")
    
    # Plot training history
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Loss plot
    axes[0].plot(history.history['loss'], label='Training Loss')
    axes[0].plot(history.history['val_loss'], label='Validation Loss')
    axes[0].set_title('Model Loss')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # MAE plot
    axes[1].plot(history.history['mae'], label='Training MAE')
    axes[1].plot(history.history['val_mae'], label='Validation MAE')
    axes[1].set_title('Model MAE')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('MAE')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("Model or training data not available. Please run previous cells first.")
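
`EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)` stops training once validation loss has failed to improve for 10 consecutive epochs and rolls the weights back to the best epoch. The pure-Python sketch below imitates only that patience bookkeeping on a made-up loss curve. It is a simplified illustration, not the Keras implementation, which also supports options such as `min_delta`.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out; training used every epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses, with a short patience so the effect is visible
losses = [1.0, 0.8, 0.9, 0.85, 0.95]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Here the loss last improves at epoch 1, so two non-improving epochs later (epoch 3) training stops and the epoch-1 weights would be restored.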

In [None]:
# Make predictions and evaluate the model
if 'model' in locals() and 'X_test' in locals():
    print("Making predictions...")
    
    # Make predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    # Inverse transform predictions (only for the target column)
    # Create dummy arrays with the right shape for inverse transform
    dummy_train = np.zeros((train_pred.shape[0], len(feature_columns)))
    dummy_test = np.zeros((test_pred.shape[0], len(feature_columns)))
    
    # Place predictions in the target column position
    target_idx = feature_columns.index('Close')
    dummy_train[:, target_idx] = train_pred.flatten()
    dummy_test[:, target_idx] = test_pred.flatten()
    
    # Inverse transform back to dollar prices. Despite the `_scaled` suffix,
    # the variables below hold unscaled prices.
    train_pred_scaled = scaler.inverse_transform(dummy_train)[:, target_idx]
    test_pred_scaled = scaler.inverse_transform(dummy_test)[:, target_idx]
    
    # Also inverse transform actual values
    dummy_train_actual = np.zeros((len(y_train), len(feature_columns)))
    dummy_test_actual = np.zeros((len(y_test), len(feature_columns)))
    
    dummy_train_actual[:, target_idx] = y_train
    dummy_test_actual[:, target_idx] = y_test
    
    y_train_scaled = scaler.inverse_transform(dummy_train_actual)[:, target_idx]
    y_test_scaled = scaler.inverse_transform(dummy_test_actual)[:, target_idx]
    
    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train_scaled, train_pred_scaled))
    test_rmse = np.sqrt(mean_squared_error(y_test_scaled, test_pred_scaled))
    train_mae = mean_absolute_error(y_train_scaled, train_pred_scaled)
    test_mae = mean_absolute_error(y_test_scaled, test_pred_scaled)
    train_r2 = r2_score(y_train_scaled, train_pred_scaled)
    test_r2 = r2_score(y_test_scaled, test_pred_scaled)
    
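    # Heuristic "accuracy": 100 minus RMSE as a percentage of the mean actual price.
    # This is a relative-error score, not a classification accuracy.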
    test_accuracy = max(0, 100 - (test_rmse / np.mean(y_test_scaled) * 100))
    
    print(f"\n=== Model Evaluation Results ===")
    print(f"Train RMSE: ${train_rmse:.2f}")
    print(f"Test RMSE: ${test_rmse:.2f}")
    print(f"Test MAE: ${test_mae:.2f}")
    print(f"Test R²: {test_r2:.4f}")
    print(f"Test Accuracy: {test_accuracy:.2f}%")
    
    print("\nPredictions completed successfully!")
else:
    print("Model or test data not available. Please run previous cells first.")
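
The dummy-array approach above works because `MinMaxScaler` scales each column independently. The same result for a single column can be obtained directly from that column's fitted minimum and maximum, which scikit-learn exposes as `scaler.data_min_` and `scaler.data_max_`. The sketch below shows the arithmetic with plain NumPy on made-up prices. `inverse_minmax_column` is a hypothetical helper and assumes `feature_range=(0, 1)`, as used in this notebook.

```python
import numpy as np


def inverse_minmax_column(scaled, col_min, col_max):
    """Undo (x - min) / (max - min) scaling for one column."""
    scaled = np.asarray(scaled, dtype=float)
    return scaled * (col_max - col_min) + col_min


# Made-up closing prices standing in for the fitted 'Close' column
close = np.array([150.0, 155.0, 160.0, 170.0])
col_min, col_max = close.min(), close.max()
scaled = (close - col_min) / (col_max - col_min)

restored = inverse_minmax_column(scaled, col_min, col_max)
print(np.allclose(restored, close))  # True

# With the notebook's objects this would look like (not run here):
# idx = feature_columns.index('Close')
# prices = inverse_minmax_column(test_pred.flatten(),
#                                scaler.data_min_[idx], scaler.data_max_[idx])
```

This avoids allocating a zero-filled array with one column per feature for every inverse transform.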

In [None]:
# Visualize the results
if 'test_pred_scaled' in locals():
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Training and Test Predictions
    ax1 = axes[0, 0]
    ax1.plot(y_train_scaled, label='Training Actual', alpha=0.7)
    ax1.plot(train_pred_scaled, label='Training Predicted', alpha=0.7)
    ax1.plot(range(len(y_train_scaled), len(y_train_scaled) + len(y_test_scaled)), 
            y_test_scaled, label='Test Actual', alpha=0.7)
    ax1.plot(range(len(y_train_scaled), len(y_train_scaled) + len(test_pred_scaled)), 
            test_pred_scaled, label='Test Predicted', alpha=0.7)
    ax1.set_title('AAPL Stock Price Prediction')
    ax1.set_xlabel('Time')
    ax1.set_ylabel('Price ($)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Test Set Zoom
    ax2 = axes[0, 1]
    ax2.plot(y_test_scaled, label='Actual', linewidth=2, color='blue')
    ax2.plot(test_pred_scaled, label='Predicted', linewidth=2, alpha=0.8, color='red')
    ax2.set_title('Test Set Predictions (Zoomed)')
    ax2.set_xlabel('Time')
    ax2.set_ylabel('Price ($)')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Plot 3: Residuals
    ax3 = axes[1, 0]
    residuals = y_test_scaled - test_pred_scaled
    ax3.scatter(test_pred_scaled, residuals, alpha=0.6)
    ax3.axhline(y=0, color='r', linestyle='--')
    ax3.set_title('Residual Plot')
    ax3.set_xlabel('Predicted Price ($)')
    ax3.set_ylabel('Residuals ($)')
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Metrics Bar Chart
    ax4 = axes[1, 1]
    metric_names = ['RMSE', 'MAE', 'R²', 'Accuracy (%)']
    metric_values = [test_rmse, test_mae, test_r2, test_accuracy]
    
    colors = ['red', 'orange', 'green', 'blue']
    bars = ax4.bar(metric_names, metric_values, color=colors, alpha=0.7)
    ax4.set_title('Model Performance Metrics')
    ax4.set_ylabel('Value')
    
    # Add value labels on bars
    for bar, value in zip(bars, metric_values):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'{value:.2f}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Additional residual analysis
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.hist(residuals, bins=30, alpha=0.7, color='purple', edgecolor='black')
    plt.title('Residual Distribution')
    plt.xlabel('Residual Value ($)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.plot(residuals, color='purple', alpha=0.7)
    plt.title('Residuals Over Time')
    plt.xlabel('Time')
    plt.ylabel('Residual Value ($)')
    plt.axhline(y=0, color='r', linestyle='--')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("Predictions not available. Please run the prediction cell first.")

In [None]:
# Create a comparison dataframe for easier analysis
if 'test_pred_scaled' in locals():
    # Create a dataframe with actual vs predicted values
    comparison_df = pd.DataFrame({
        'Actual': y_test_scaled,
        'Predicted': test_pred_scaled,
        'Residual': y_test_scaled - test_pred_scaled,
        'Absolute_Error': np.abs(y_test_scaled - test_pred_scaled),
        'Percentage_Error': np.abs((y_test_scaled - test_pred_scaled) / y_test_scaled) * 100
    })
    
    print("=== Prediction Analysis ===")
    print(f"\nFirst 10 predictions:")
    print(comparison_df.head(10).round(2))
    
    print(f"\nLast 10 predictions:")
    print(comparison_df.tail(10).round(2))
    
    print(f"\nSummary Statistics:")
    print(comparison_df.describe().round(2))
    
    # Find best and worst predictions
    best_prediction_idx = comparison_df['Absolute_Error'].idxmin()
    worst_prediction_idx = comparison_df['Absolute_Error'].idxmax()
    
    print(f"\nBest Prediction (lowest absolute error):")
    print(f"Actual: ${comparison_df.loc[best_prediction_idx, 'Actual']:.2f}")
    print(f"Predicted: ${comparison_df.loc[best_prediction_idx, 'Predicted']:.2f}")
    print(f"Error: ${comparison_df.loc[best_prediction_idx, 'Absolute_Error']:.2f}")
    
    print(f"\nWorst Prediction (highest absolute error):")
    print(f"Actual: ${comparison_df.loc[worst_prediction_idx, 'Actual']:.2f}")
    print(f"Predicted: ${comparison_df.loc[worst_prediction_idx, 'Predicted']:.2f}")
    print(f"Error: ${comparison_df.loc[worst_prediction_idx, 'Absolute_Error']:.2f}")
else:
    print("Predictions not available. Please run the prediction cell first.")
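
Error figures for price-level forecasts are easier to interpret next to a naive baseline that predicts each day's close as the previous day's close. Daily prices move little relative to their level, so this baseline typically scores well on RMSE and R² too, and a model adds value only if it beats it. A minimal sketch with made-up numbers follows. `persistence_baseline` is an illustrative name, and in the notebook the input would be `y_test_scaled`.

```python
import numpy as np


def persistence_baseline(actual):
    """RMSE and MAE of predicting each value with the one before it."""
    actual = np.asarray(actual, dtype=float)
    errors = actual[1:] - actual[:-1]   # actual minus previous-day "prediction"
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae


# Made-up prices, for illustration only
rmse, mae = persistence_baseline([10.0, 12.0, 11.0, 13.0])
print(f"Baseline RMSE: {rmse:.3f}, MAE: {mae:.3f}")  # Baseline RMSE: 1.732, MAE: 1.667
```

Comparing `test_rmse` and `test_mae` against these baseline values shows whether the LSTM is learning more than "tomorrow looks like today".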

In [None]:
# Save the model and scaler for future use
if 'model' in locals() and 'scaler' in locals():
    import joblib
    
    # Save the model
    model.save('improved_stock_lstm_model.h5')
    print("Model saved as 'improved_stock_lstm_model.h5'")
    
    # Save the scaler and feature columns
    model_data = {
        'scaler': scaler,
        'feature_columns': feature_columns,
        'lookback_window': 60
    }
    joblib.dump(model_data, 'model_data.pkl')
    print("Scaler and feature information saved as 'model_data.pkl'")
    
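    print("\n=== Model Training and Evaluation Complete ===")
    # The metrics exist only if the evaluation cell has been run
    if 'test_accuracy' in locals():
        print(f"Final Test Accuracy: {test_accuracy:.2f}%")
        print(f"Final Test RMSE: ${test_rmse:.2f}")
        print(f"Final Test R²: {test_r2:.4f}")
    else:
        print("Evaluation metrics not available. Run the evaluation cell to compute them.")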
else:
    print("Model or scaler not available for saving.")
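
The cell above writes two files, but the notebook never reads them back. A rough sketch of reloading them for inference is below. It assumes the file names used above and that TensorFlow and joblib are installed. `load_artifacts` and `latest_window` are hypothetical helpers, and `compile=False` is passed because only prediction is needed.

```python
import numpy as np


def load_artifacts(model_path="improved_stock_lstm_model.h5",
                   data_path="model_data.pkl"):
    """Load the saved Keras model and the scaler bundle written by the cell above."""
    # Imported lazily so that latest_window() can be used without these packages
    import joblib
    from tensorflow.keras.models import load_model

    model = load_model(model_path, compile=False)
    bundle = joblib.load(data_path)
    return (model, bundle["scaler"], bundle["feature_columns"],
            bundle["lookback_window"])


def latest_window(scaled_features, lookback):
    """Return the most recent `lookback` rows shaped (1, lookback, n_features)."""
    scaled_features = np.asarray(scaled_features, dtype=float)
    if len(scaled_features) < lookback:
        raise ValueError(f"Need at least {lookback} rows, got {len(scaled_features)}")
    return scaled_features[-lookback:][np.newaxis, ...]


# Shape check on toy data (10 rows, 2 features)
window = latest_window(np.arange(20).reshape(10, 2), lookback=3)
print(window.shape)  # (1, 3, 2)

# Intended use with the saved files (not run here; recent_features is hypothetical):
# model, scaler, feature_columns, lookback = load_artifacts()
# window = latest_window(scaler.transform(recent_features), lookback)
# next_scaled_close = model.predict(window)
```

The features passed to `scaler.transform` must be in the same column order as `feature_columns`, which is why that list is saved alongside the scaler.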

## Summary

This notebook addresses the following issues from the original implementation:

### Fixed Issues:
1. **Variable Scope**: All variables are properly defined in the correct order
2. **Missing Imports**: All required libraries are imported at the beginning
3. **Data Preprocessing**: Enhanced with proper handling of NaN values and feature selection
4. **Model Architecture**: Complete LSTM implementation with proper layers
5. **Evaluation Metrics**: Comprehensive evaluation with multiple metrics
6. **Error Handling**: Proper exception handling and data validation

### Enhancements:
1. **Technical Indicators**: Added RSI, MACD, Bollinger Bands, and volume indicators
2. **Better Visualizations**: Multiple plots for comprehensive analysis
3. **Model Callbacks**: Early stopping, learning rate reduction, and model checkpointing
4. **Residual Analysis**: Detailed error analysis and distribution plots
5. **Model Persistence**: The trained model, scaler, and feature list are saved to disk for later reuse
6. **Comprehensive Metrics**: RMSE, MAE, R², and accuracy calculations

### Usage:
Run all cells in sequence to:
1. Fetch stock data
2. Add technical indicators
3. Prepare data for LSTM
4. Build and train the model
5. Evaluate performance
6. Visualize results
7. Save the model for future use

With these fixes the notebook runs end to end. Whether it outperforms the original implementation should be judged from the test-set RMSE, MAE, and R² reported in the evaluation cell.