# Stock Price Predictor

**Author:** Anik Tahabilder  
**Project:** 9 of 22 - Kaggle ML Portfolio  
**Dataset:** Yahoo Finance Stock Data  
**Difficulty:** 6/10 | **Learning Value:** 8/10

---

## What is Time Series Forecasting?

**Time Series Forecasting** predicts future values from historical observations collected over time. Standard ML usually assumes that samples are independent of each other; time series data has **temporal dependencies**, so what happened yesterday affects today.

### Time Series vs Regular ML:

| Aspect | Regular ML | Time Series ML |
|--------|-----------|----------------|
| **Data Order** | Random order OK | Order matters! |
| **Train/Test Split** | Random split | Chronological split |
| **Independence** | Samples are independent | Samples are dependent |
| **Features** | Static features | Lag features, rolling stats |
| **Validation** | K-Fold CV | Time Series CV |
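
The difference between a random split and a chronological one can be sketched with scikit-learn's `TimeSeriesSplit`. The twelve-row array below is a made-up stand-in for daily observations, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in chronological order (stand-in for daily rows)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Every training index precedes every test index, so no future data leaks in
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each fold trains on an expanding window of the past and validates on the block that follows it, which mirrors how a forecasting model would be used in practice.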

### Why Stock Prediction is Challenging:

| Challenge | Description | Impact |
|-----------|-------------|--------|
| **Non-Stationary** | Mean/variance change over time | Models trained on old patterns may fail |
| **High Noise** | Random fluctuations dominate | Hard to separate signal from noise |
| **External Factors** | News, earnings, politics | Unpredictable events move prices |
| **Efficient Market Hypothesis** | Prices reflect all known info | True patterns are already priced in |
| **Regime Changes** | Bull/bear markets behave differently | One model may not fit all periods |

### Types of Time Series Models:

| Model Type | Examples | Best For | Limitations |
|------------|----------|----------|-------------|
| **Statistical** | ARIMA, SARIMA, ETS | Linear trends, seasonality | Can't capture complex patterns |
| **Machine Learning** | RF, XGBoost, SVR | Non-linear patterns, many features | No built-in notion of temporal order (needs lag features) |
| **Deep Learning** | LSTM, GRU, Transformer | Long-term dependencies | Needs lots of data, overfits easily |
| **Hybrid** | ML + Technical Indicators | Best of both worlds | Requires domain knowledge |

### Our Approach: ML with Technical Indicators

```
Raw OHLCV Data → Technical Indicators → Lag Features → ML Models → Next Day Price
```

**Why this approach?**
1. Technical indicators encode domain knowledge (what traders look at)
2. Lag features capture autocorrelation
3. ML models find non-linear patterns
4. More interpretable than deep learning
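
As a rough sketch of the pipeline above, the snippet below builds one indicator, two lag features, and a next-day target from a short series. The prices and the column names `SMA_3`, `Lag_1`, `Lag_2`, and `Target` are illustrative assumptions, not the features the notebook ends up using.

```python
import pandas as pd

# Hypothetical closing prices; in the notebook these come from the downloaded data
close = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 105.0, 107.0, 106.0],
    index=pd.bdate_range("2024-01-01", periods=7),
    name="Close",
)

features = pd.DataFrame({
    "SMA_3": close.rolling(window=3).mean(),  # technical indicator
    "Lag_1": close.shift(1),                  # yesterday's close
    "Lag_2": close.shift(2),                  # close two days ago
})

# Target: the NEXT day's close, aligned to today's row
target = close.shift(-1).rename("Target")

# Drop rows where a window is still warming up or the target is unknown
dataset = features.join(target).dropna()
print(dataset)
```

Note the direction of the shifts: features look backward (`shift(1)`), the target looks forward (`shift(-1)`). Mixing these up is a common source of look-ahead leakage.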

---

## Table of Contents

1. [Part 1: Setup and Data Loading](#part1)
2. [Part 2: Understanding Stock Data](#part2)
3. [Part 3: Exploratory Data Analysis](#part3)
4. [Part 4: Technical Indicators (Feature Engineering)](#part4)
5. [Part 5: Data Preprocessing](#part5)
6. [Part 6: Model Selection & Training](#part6)
7. [Part 7: Model Evaluation & Comparison](#part7)
8. [Part 8: Hyperparameter Tuning](#part8)
9. [Part 9: Final Predictions & Visualization](#part9)
10. [Part 10: Summary and Conclusions](#part10)

---

<a id='part1'></a>
# Part 1: Setup and Data Loading

---

## 1.1 Required Libraries

| Library | Purpose | Why We Need It |
|---------|---------|----------------|
| **pandas** | Data manipulation | Handle time-indexed DataFrames |
| **numpy** | Numerical operations | Fast array computations |
| **matplotlib/seaborn** | Visualization | Plot time series, distributions |
| **yfinance** | Stock data API | Download OHLCV data from Yahoo |
| **sklearn** | ML algorithms | Regression models, metrics, preprocessing |
| **statsmodels** | Time series | Seasonal decomposition, statistical tests |

### Installation (if needed):
```bash
pip install yfinance pandas numpy scipy matplotlib seaborn scikit-learn statsmodels
```

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Stock data
try:
    import yfinance as yf
    YFINANCE_AVAILABLE = True
except ImportError:
    YFINANCE_AVAILABLE = False
    print("Install yfinance: pip install yfinance")

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Machine Learning - Regression Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Date handling
from datetime import datetime, timedelta

# Time Series Decomposition
try:
    from statsmodels.tsa.seasonal import seasonal_decompose
    STATSMODELS_AVAILABLE = True
except ImportError:
    STATSMODELS_AVAILABLE = False
    print('Install statsmodels: pip install statsmodels')

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', None)

print("="*70)
print("LIBRARIES LOADED SUCCESSFULLY")
print("="*70)
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Scikit-learn version: {__import__('sklearn').__version__}")
print(f"yfinance available: {YFINANCE_AVAILABLE}")

## 1.2 Download Stock Data

### Where Does the Data Come From?

We use **yfinance** - a Python library that pulls **live stock data** from Yahoo Finance.

| Data Source | Library | Cost | API Key | Data Type |
|-------------|---------|------|---------|-----------|
| **Yahoo Finance** | `yfinance` | Free | No | Live OHLCV |
| **Alpha Vantage** | `alpha_vantage` | Free tier | Yes | Live OHLCV |
| **Kaggle** | `pandas` | Free | No | Static CSV |

### Running on Kaggle?

**yfinance works on Kaggle!** Just enable Internet access:
1. Open your Kaggle notebook
2. Click **Settings** (right sidebar)
3. Toggle **Internet** to **ON**

If yfinance doesn't work, set `USE_KAGGLE_DATASET = True` and use a CSV dataset from Kaggle.

---

### What is OHLCV Data?

Stock data comes in **OHLCV** format - the standard representation of price movement:

| Column | Full Name | Description | Data Type |
|--------|-----------|-------------|----------|
| **O** | Open | First trade price of the day | float64 |
| **H** | High | Maximum price during the day | float64 |
| **L** | Low | Minimum price during the day | float64 |
| **C** | Close | Last trade price of the day | float64 |
| **V** | Volume | Number of shares traded | int64 |
| **Adj Close** | Adjusted Close | Close price adjusted for dividends/splits | float64 |

### Why Adjusted Close Matters:

When a company issues dividends or stock splits:
- **Raw Close** shows the actual trading price
- **Adj Close** adjusts historical prices to reflect these events

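For **historical analysis** that spans dividend or split dates, **Adj Close** gives a consistent series, while **Close** is the price actually quoted on each day. Depending on the yfinance version and its `auto_adjust` setting, the download may contain only an already-adjusted `Close` column, so check `df.columns` before relying on `Adj Close`.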
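
To see why the adjustment matters, here is a simplified sketch with a hypothetical 2-for-1 split. The prices, `split_ratio`, and `split_day` are invented for illustration, and dividends are ignored.

```python
import pandas as pd

# Hypothetical raw closes; a 2-for-1 split takes effect before position 2
raw_close = pd.Series([200.0, 204.0, 103.0, 104.0])
split_ratio = 2.0
split_day = 2  # position of the first post-split session

# Back-adjust: divide every pre-split price by the split ratio
adjusted = raw_close.copy()
adjusted.iloc[:split_day] = raw_close.iloc[:split_day] / split_ratio

raw_ret = raw_close.pct_change()
adj_ret = adjusted.pct_change()

print(f"Raw return on split day:      {raw_ret.iloc[split_day]:.2%}")
print(f"Adjusted return on split day: {adj_ret.iloc[split_day]:.2%}")
```

On the raw series the split looks like a crash of roughly half the price, while the adjusted series shows the small move that shareholders actually experienced.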

In [None]:
# Configuration
TICKER = 'AAPL'  # Stock ticker symbol (Apple Inc.)
START_DATE = '2019-01-01'  # Start date for historical data
END_DATE = datetime.now().strftime('%Y-%m-%d')  # Today; yfinance treats `end` as exclusive

print("="*70)
print("DOWNLOADING STOCK DATA")
print("="*70)

# About the stock
print(f"\nStock: {TICKER}")
print(f"Period: {START_DATE} to {END_DATE}")

# ============================================================
# DATA SOURCE OPTIONS (Choose ONE)
# ============================================================
# Option 1: yfinance (LIVE DATA) - Recommended
# Option 2: Kaggle Dataset (STATIC DATA) - Fallback
# ============================================================

USE_KAGGLE_DATASET = False  # Set True to use Kaggle CSV instead of yfinance

if USE_KAGGLE_DATASET:
    # ============================================================
    # OPTION 2: KAGGLE DATASET (Fallback)
    # ============================================================
    # Download from: https://www.kaggle.com/datasets/varpit94/apple-stock-data-updated-till-22jun2021
    # Or search "AAPL stock" on Kaggle and download any recent dataset
    
    KAGGLE_CSV_PATH = '/kaggle/input/apple-stock-data-updated-till-22jun2021/AAPL.csv'  # Update path!
    
    print("\nUsing KAGGLE DATASET (static CSV)")
    print(f"Path: {KAGGLE_CSV_PATH}")
    
    try:
        df = pd.read_csv(KAGGLE_CSV_PATH, parse_dates=['Date'], index_col='Date')
        df = df.sort_index()
        print(f"\nLoaded from CSV!")
    except FileNotFoundError:
        print("ERROR: CSV file not found. Please update KAGGLE_CSV_PATH")
        print("Download a stock dataset from Kaggle and update the path.")

elif YFINANCE_AVAILABLE:
    # ============================================================
    # OPTION 1: YFINANCE (Live Data) - Recommended
    # ============================================================
    print("\nUsing YFINANCE (live data from Yahoo Finance)")
    print("Note: On Kaggle, enable 'Internet' in Settings (right sidebar)")
    
    # Download historical data
    df = yf.download(TICKER, start=START_DATE, end=END_DATE, progress=False)
    
    # Flatten multi-level columns if present
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    
    print(f"\nDownload successful!")
else:
    print("ERROR: yfinance not available and USE_KAGGLE_DATASET is False")
    print("Please either:")
    print("  1. Install yfinance: pip install yfinance")
    print("  2. Set USE_KAGGLE_DATASET = True and provide a CSV path")

# Display dataset info
if 'df' in dir() and len(df) > 0:
    print(f"\nShape: {df.shape[0]} trading days x {df.shape[1]} columns")
    print(f"Date range: {df.index[0].strftime('%Y-%m-%d')} to {df.index[-1].strftime('%Y-%m-%d')}")
    print(f"\nNote: Stock markets are closed on weekends and holidays.")
    print(f"      That's why we have ~252 trading days per year, not 365.")
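
The column-flattening step in the cell above can be tried without a network connection. This sketch builds a small frame with two column levels and assumes, as the cell does, that level 0 holds the field names and level 1 the ticker.

```python
import pandas as pd

# Stand-in for a download result with (field, ticker) column levels
cols = pd.MultiIndex.from_product([["Open", "Close"], ["AAPL"]])
frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=cols)

if isinstance(frame.columns, pd.MultiIndex):
    # Keep only the field names; safe when a single ticker was requested
    frame.columns = frame.columns.get_level_values(0)

print(frame.columns.tolist())
```

With more than one ticker this flattening would produce duplicate column names, so it only suits the single-ticker case used here.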

---

<a id='part2'></a>
# Part 2: Understanding Stock Data

---

## 2.1 Data Structure and Types

Before any analysis, we must understand our data!

In [None]:
# View first rows
print("="*70)
print("FIRST 10 ROWS OF DATA")
print("="*70)
df.head(10)

In [None]:
# Data types and structure
print("="*70)
print("DATA TYPES AND STRUCTURE")
print("="*70)

print("\n1. DataFrame Info:")
df.info()

print("\n" + "="*70)
print("2. Column Data Types Explained:")
print("="*70)

for col in df.columns:
    dtype = df[col].dtype
    sample = df[col].iloc[0]
    print(f"\n  {col}:")
    print(f"    Type: {dtype}")
    print(f"    Sample: {sample}")
    if dtype == 'float64':
        print(f"    Range: {df[col].min():.2f} to {df[col].max():.2f}")
    elif dtype == 'int64':
        print(f"    Range: {df[col].min():,} to {df[col].max():,}")

print("\n" + "="*70)
print("3. Index (DatetimeIndex):")
print("="*70)
print(f"  Type: {type(df.index).__name__}")
print(f"  Timezone: {df.index.tz}")
print(f"  Frequency: {df.index.freq} (None = irregular, due to weekends/holidays)")

In [None]:
# Statistical summary
print("="*70)
print("STATISTICAL SUMMARY")
print("="*70)
df.describe()

In [None]:
# Check for missing values
print("="*70)
print("MISSING VALUES CHECK")
print("="*70)

missing = df.isnull().sum()
print(missing)
print(f"\nTotal missing: {missing.sum()}")

if missing.sum() > 0:
    print("\nHandling missing values...")
    # For stock data, forward fill is common (use last known price)
    df = df.ffill()
    print("Used forward fill (ffill) - last known value propagates forward.")
else:
    print("\nNo missing values - data is clean!")
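
One caveat with forward fill is worth a quick check: it can only copy values forward, so a gap at the very start of the series stays empty. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up series with a leading gap and an interior gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 12.0])
filled = s.ffill()

print(filled.tolist())
print(f"Still missing after ffill: {int(filled.isna().sum())}")
```

If leading gaps remain, dropping those rows is usually safer than back-filling, because back-filling would copy future prices into the past.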

## 2.2 Understanding Price Relationships

### The OHLC Relationship:

For any trading day, these rules ALWAYS hold:
- **Low ≤ Open ≤ High**
- **Low ≤ Close ≤ High**
- **Low = min(all trades)**
- **High = max(all trades)**
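
When a check like this fails, it helps to see which rows break the rules rather than a single True/False. The helper below is a hypothetical sketch (`find_ohlc_violations` and its `tol` parameter are not part of the notebook); the small tolerance is there because adjusted prices can carry floating-point rounding.

```python
import pandas as pd

def find_ohlc_violations(frame: pd.DataFrame, tol: float = 1e-9) -> pd.DataFrame:
    """Return rows where Low <= Open/Close <= High fails by more than `tol`."""
    row_min = frame[["Open", "Close"]].min(axis=1)
    row_max = frame[["Open", "Close"]].max(axis=1)
    bad = (frame["Low"] > row_min + tol) | (frame["High"] < row_max - tol)
    return frame[bad]

sample = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0],
    "High": [10.5, 11.5, 12.2],
    "Low": [9.8, 10.9, 12.1],   # last row: Low is above Open (invalid)
    "Close": [10.2, 11.2, 12.15],
})

violations = find_ohlc_violations(sample)
print(violations)
```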

In [None]:
# Verify OHLC relationships
print("="*70)
print("VERIFYING OHLC RELATIONSHIPS")
print("="*70)

# Check: Low <= Open <= High
low_open_check = (df['Low'] <= df['Open']).all() and (df['Open'] <= df['High']).all()
print(f"\nLow <= Open <= High: {low_open_check}")

# Check: Low <= Close <= High  
low_close_check = (df['Low'] <= df['Close']).all() and (df['Close'] <= df['High']).all()
print(f"Low <= Close <= High: {low_close_check}")

# Check: Low is minimum
low_min_check = (df['Low'] == df[['Open', 'High', 'Low', 'Close']].min(axis=1)).all()
print(f"Low is minimum of OHLC: {low_min_check}")

# Check: High is maximum
high_max_check = (df['High'] == df[['Open', 'High', 'Low', 'Close']].max(axis=1)).all()
print(f"High is maximum of OHLC: {high_max_check}")

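if all([low_open_check, low_close_check, low_min_check, high_max_check]):
    print("\nAll relationships verified - data integrity confirmed!")
else:
    print("\nWARNING: At least one OHLC check failed - inspect the affected rows.")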

---

<a id='part3'></a>
# Part 3: Exploratory Data Analysis

---

## 3.1 Price Trend Visualization

The first step in any time series analysis is to **visualize the data**.

In [None]:
# Stock price history
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# 1. Closing Price
axes[0].plot(df.index, df['Close'], color='steelblue', linewidth=1.5)
axes[0].fill_between(df.index, df['Close'], alpha=0.3, color='steelblue')
axes[0].set_title(f'{TICKER} Closing Price History', fontweight='bold', fontsize=14)
axes[0].set_ylabel('Close Price ($)')
axes[0].set_xlabel('Date')

# Add annotation for price range
axes[0].axhline(y=df['Close'].mean(), color='red', linestyle='--', 
                label=f"Mean: ${df['Close'].mean():.2f}")
axes[0].legend(loc='upper left')

# 2. Trading Volume
colors = ['green' if df['Close'].iloc[i] >= df['Open'].iloc[i] else 'red' 
          for i in range(len(df))]
axes[1].bar(df.index, df['Volume'], color=colors, alpha=0.7, width=1)
axes[1].set_title('Trading Volume', fontweight='bold', fontsize=14)
axes[1].set_ylabel('Volume')
axes[1].set_xlabel('Date')

# 3. Price Range (High - Low)
df['Price_Range'] = df['High'] - df['Low']
axes[2].plot(df.index, df['Price_Range'], color='purple', linewidth=1, alpha=0.8)
axes[2].fill_between(df.index, df['Price_Range'], alpha=0.3, color='purple')
axes[2].set_title('Daily Price Range (High - Low)', fontweight='bold', fontsize=14)
axes[2].set_ylabel('Price Range ($)')
axes[2].set_xlabel('Date')

plt.tight_layout()
plt.show()

# Summary statistics
print("="*70)
print("PRICE SUMMARY")
print("="*70)
print(f"Starting Price: ${df['Close'].iloc[0]:.2f}")
print(f"Ending Price:   ${df['Close'].iloc[-1]:.2f}")
print(f"Highest Price:  ${df['Close'].max():.2f} (on {df['Close'].idxmax().strftime('%Y-%m-%d')})")
print(f"Lowest Price:   ${df['Close'].min():.2f} (on {df['Close'].idxmin().strftime('%Y-%m-%d')})")
print(f"Total Return:   {((df['Close'].iloc[-1] / df['Close'].iloc[0]) - 1) * 100:.2f}%")

## 3.2 Returns Analysis

### Why Analyze Returns Instead of Prices?

| Metric | Formula | Why Use It |
|--------|---------|------------|
| **Price** | P(t) | Absolute value, hard to compare |
| **Return** | (P(t) - P(t-1)) / P(t-1) | Normalized, closer to stationary, comparable |
| **Log Return** | ln(P(t) / P(t-1)) | Additive over time, symmetric |

**Returns are more stationary than prices!** This helps ML models.
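
The "additive over time" property of log returns from the table can be checked on a few made-up prices: log returns sum to the total log return, while simple returns have to be compounded.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 104.0])  # hypothetical closes

simple_ret = prices.pct_change().dropna()
log_ret = np.log(prices / prices.shift(1)).dropna()

total_simple = prices.iloc[-1] / prices.iloc[0] - 1
total_log = np.log(prices.iloc[-1] / prices.iloc[0])

# Log returns add up; simple returns multiply
print(f"Sum of log returns:      {log_ret.sum():.6f} (total log return {total_log:.6f})")
print(f"Compounded simple:       {(1 + simple_ret).prod() - 1:.6f} (total {total_simple:.6f})")
print(f"Naive sum of simple:     {simple_ret.sum():.6f}")
```

The naive sum of simple returns does not match the total return, which is one reason log returns are convenient for multi-day analysis.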

In [None]:
# Calculate returns
df['Daily_Return'] = df['Close'].pct_change() * 100  # Percentage return
df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1)) * 100  # Log return

print("="*70)
print("RETURNS ANALYSIS")
print("="*70)

# Plot returns
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Daily Returns Time Series
axes[0, 0].plot(df.index, df['Daily_Return'], color='steelblue', linewidth=0.8, alpha=0.8)
axes[0, 0].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[0, 0].fill_between(df.index, df['Daily_Return'], 
                        where=(df['Daily_Return'] > 0), color='green', alpha=0.3)
axes[0, 0].fill_between(df.index, df['Daily_Return'], 
                        where=(df['Daily_Return'] < 0), color='red', alpha=0.3)
axes[0, 0].set_title('Daily Returns Over Time', fontweight='bold')
axes[0, 0].set_ylabel('Return (%)')

# 2. Returns Distribution
axes[0, 1].hist(df['Daily_Return'].dropna(), bins=50, color='steelblue', 
                edgecolor='black', alpha=0.7, density=True)
axes[0, 1].axvline(x=0, color='black', linestyle='-', linewidth=1)
axes[0, 1].axvline(x=df['Daily_Return'].mean(), color='red', linestyle='--', 
                   linewidth=2, label=f"Mean: {df['Daily_Return'].mean():.3f}%")
axes[0, 1].set_title('Distribution of Daily Returns', fontweight='bold')
axes[0, 1].set_xlabel('Return (%)')
axes[0, 1].set_ylabel('Density')
axes[0, 1].legend()

# 3. Box Plot
axes[1, 0].boxplot(df['Daily_Return'].dropna(), vert=True, patch_artist=True,
                   boxprops=dict(facecolor='lightblue', color='black'),
                   medianprops=dict(color='red', linewidth=2))
axes[1, 0].set_title('Returns Box Plot (Outliers = Extreme Days)', fontweight='bold')
axes[1, 0].set_ylabel('Return (%)')

# 4. QQ Plot (Check for Normality)
from scipy import stats
stats.probplot(df['Daily_Return'].dropna(), dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Check Normality)', fontweight='bold')
axes[1, 1].get_lines()[0].set_markerfacecolor('steelblue')
axes[1, 1].get_lines()[0].set_alpha(0.5)

plt.tight_layout()
plt.show()

# Statistics
print("\nReturn Statistics:")
print(f"  Mean Daily Return:     {df['Daily_Return'].mean():.4f}%")
print(f"  Std Dev (Volatility):  {df['Daily_Return'].std():.4f}%")
print(f"  Min Return (Worst Day): {df['Daily_Return'].min():.4f}%")
print(f"  Max Return (Best Day):  {df['Daily_Return'].max():.4f}%")
print(f"  Skewness:              {df['Daily_Return'].skew():.4f}")
print(f"  Excess Kurtosis:       {df['Daily_Return'].kurtosis():.4f}")

print("\nInterpretation:")
print("  - Negative skewness = Longer left tail (more extreme negative returns, i.e. crash risk)")
print("  - High excess kurtosis (> 0) = Fat tails (more extreme events than a normal distribution)")
print("  - Q-Q plot deviation from line = Non-normal distribution")

## 3.3 Autocorrelation Analysis

### What is Autocorrelation?

**Autocorrelation** measures how correlated a time series is with its past values.

| Lag | Meaning | If Significant... |
|-----|---------|-------------------|
| Lag 1 | Today vs Yesterday | Recent past predicts near future |
| Lag 5 | Today vs 5 days ago | Weekly patterns |
| Lag 20 | Today vs 20 days ago | Monthly patterns |

**For stocks**: prices are strongly autocorrelated, but returns usually show little autocorrelation, which is consistent with fairly efficient markets.
Weak patterns may still exist!
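
The contrast between prices and returns can be reproduced on synthetic data. The sketch below assumes i.i.d. normal returns (a simplification, not a claim about real markets) and uses pandas' `Series.autocorr` for the lag-1 value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic i.i.d. daily returns in percent, and the price path they imply
returns = pd.Series(rng.normal(0.0, 1.0, 1000))
prices = 100 * np.exp(returns.cumsum() / 100)

price_ac1 = prices.autocorr(lag=1)
return_ac1 = returns.autocorr(lag=1)

print(f"Lag-1 autocorrelation of prices:  {price_ac1:.3f}")
print(f"Lag-1 autocorrelation of returns: {return_ac1:.3f}")
```

Even though the returns here are pure noise, the price path is highly autocorrelated, so high autocorrelation in prices alone says nothing about predictability.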

In [None]:
# Autocorrelation analysis (uses pd.plotting.autocorrelation_plot)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Autocorrelation of Prices (usually high - trending)
pd.plotting.autocorrelation_plot(df['Close'].dropna(), ax=axes[0])
axes[0].set_title('Autocorrelation of PRICES (Expected: High)', fontweight='bold')
axes[0].set_xlim(0, 50)

# 2. Autocorrelation of Returns (usually low - efficient market)
pd.plotting.autocorrelation_plot(df['Daily_Return'].dropna(), ax=axes[1])
axes[1].set_title('Autocorrelation of RETURNS (Expected: Low)', fontweight='bold')
axes[1].set_xlim(0, 50)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Prices show HIGH autocorrelation (trending behavior)")
print("- Returns show LOW autocorrelation (past returns don't predict future)")
print("- This is why we need TECHNICAL INDICATORS - they extract patterns!")

## 3.4 Trend and Seasonality Analysis

### What are Trend, Seasonality, and Residuals?

A time series is commonly modeled as a combination of three components:

| Component | Definition | Example in Stock Data |
|-----------|------------|----------------------|
| **Trend** | Long-term direction (up/down) | Bull market = upward trend |
| **Seasonality** | Repeating patterns at fixed intervals | Monday effect, January effect |
| **Residuals** | Random noise after removing trend/season | Unpredictable fluctuations |

### Types of Seasonality in Stock Markets:

| Pattern | Period | Description |
|---------|--------|-------------|
| **Day-of-Week Effect** | Weekly | Mondays often negative, Fridays positive |
| **Month-of-Year Effect** | Yearly | "Sell in May", January effect |
| **Quarter-End Effect** | Quarterly | Window dressing by fund managers |
| **Holiday Effect** | Variable | Pre-holiday rallies |

### Additive vs Multiplicative Decomposition:

| Type | Formula | Use When |
|------|---------|----------|
| **Additive** | Y = Trend + Seasonal + Residual | Seasonal variation is constant |
| **Multiplicative** | Y = Trend × Seasonal × Residual | Seasonal variation scales with trend |

Stock prices are usually modeled with the **multiplicative** form, because price moves scale with the price level (a 1% move is larger in dollars at $200 than at $50).
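
The two forms are closely related: taking logarithms turns the multiplicative model into an additive one. A small numeric sketch with made-up component values:

```python
import numpy as np

# Hypothetical components of a multiplicative model
trend = np.array([100.0, 110.0, 121.0, 133.1])
seasonal = np.array([1.02, 0.98, 1.02, 0.98])       # factors around 1
residual = np.array([1.001, 0.999, 1.002, 0.998])   # factors around 1

y = trend * seasonal * residual  # Y = Trend x Seasonal x Residual

# log(Y) = log(Trend) + log(Seasonal) + log(Residual)
log_y = np.log(trend) + np.log(seasonal) + np.log(residual)

print(np.allclose(np.log(y), log_y))
```

This is also why seasonal and residual factors in a multiplicative decomposition hover around 1 rather than around 0.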

In [None]:
# ============================================================
# SEASONALITY ANALYSIS
# ============================================================
print("="*70)
print("SEASONALITY ANALYSIS")
print("="*70)

# Create temporary columns for analysis
df['Day_Name'] = df.index.day_name()
df['Month_Name'] = df.index.month_name()
df['Year'] = df.index.year

# ============================================================
# 1. DAY-OF-WEEK EFFECT
# ============================================================
print("\n" + "="*60)
print("1. DAY-OF-WEEK EFFECT")
print("="*60)

print("""
The 'Monday Effect' hypothesis:
- Mondays tend to have lower/negative returns
- Fridays tend to have positive returns
- Reason: Weekend news accumulation, investor psychology
""")

# Calculate average return by day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
daily_returns = df.groupby('Day_Name')['Daily_Return'].agg(['mean', 'std', 'count'])
daily_returns = daily_returns.reindex(day_order)

print("\nAverage Return by Day of Week:")
print(daily_returns.round(4))

# Visualize day-of-week effect
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of average returns
colors = ['green' if x > 0 else 'red' for x in daily_returns['mean']]
axes[0].bar(day_order, daily_returns['mean'], color=colors, edgecolor='black', alpha=0.7)
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[0].set_title('Average Daily Return by Day of Week', fontweight='bold')
axes[0].set_ylabel('Average Return (%)')
axes[0].set_xlabel('Day of Week')
for i, (day, row) in enumerate(daily_returns.iterrows()):
    axes[0].text(i, row['mean'] + 0.01, f"{row['mean']:.3f}%", ha='center', fontweight='bold')

# Box plot
day_data = [df[df['Day_Name'] == day]['Daily_Return'].dropna() for day in day_order]
bp = axes[1].boxplot(day_data, labels=day_order, patch_artist=True)
for patch, color in zip(bp['boxes'], ['#ff9999', '#99ff99', '#9999ff', '#ffff99', '#ff99ff']):
    patch.set_facecolor(color)
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[1].set_title('Return Distribution by Day of Week', fontweight='bold')
axes[1].set_ylabel('Daily Return (%)')

plt.tight_layout()
plt.show()

# ============================================================
# 2. MONTH-OF-YEAR EFFECT
# ============================================================
print("\n" + "="*60)
print("2. MONTH-OF-YEAR EFFECT (Seasonality)")
print("="*60)

print("""
Common Stock Market Seasonality Patterns:
- 'January Effect': Small caps outperform in January
- 'Sell in May': May-October underperforms Nov-April
- 'Santa Rally': December often positive
- 'September Effect': September often negative
""")

# Calculate monthly returns
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_returns = df.groupby('Month_Name')['Daily_Return'].agg(['mean', 'std', 'count'])
monthly_returns = monthly_returns.reindex(month_order)

print("\nAverage Return by Month:")
print(monthly_returns.round(4))

# Visualize monthly seasonality
fig, ax = plt.subplots(figsize=(14, 6))

colors = ['green' if x > 0 else 'red' for x in monthly_returns['mean']]
bars = ax.bar(range(12), monthly_returns['mean'], color=colors, edgecolor='black', alpha=0.7)
ax.axhline(y=0, color='black', linestyle='-', linewidth=1)
ax.axhline(y=monthly_returns['mean'].mean(), color='blue', linestyle='--', 
           label=f"Overall Mean: {monthly_returns['mean'].mean():.3f}%")
ax.set_title('Average Daily Return by Month (Seasonality Pattern)', fontweight='bold', fontsize=14)
ax.set_ylabel('Average Return (%)')
ax.set_xlabel('Month')
ax.set_xticks(range(12))
ax.set_xticklabels([m[:3] for m in month_order], rotation=45)
ax.legend()

# Annotate best and worst months
best_month = monthly_returns['mean'].idxmax()
worst_month = monthly_returns['mean'].idxmin()
ax.annotate(f'Best: {best_month}', xy=(month_order.index(best_month), monthly_returns.loc[best_month, 'mean']),
            xytext=(10, 20), textcoords='offset points', fontweight='bold', color='green',
            arrowprops=dict(arrowstyle='->', color='green'))
ax.annotate(f'Worst: {worst_month}', xy=(month_order.index(worst_month), monthly_returns.loc[worst_month, 'mean']),
            xytext=(10, -30), textcoords='offset points', fontweight='bold', color='red',
            arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

# ============================================================
# 3. TIME SERIES DECOMPOSITION
# ============================================================
print("\n" + "="*60)
print("3. TIME SERIES DECOMPOSITION")
print("="*60)

print("""
Decomposition breaks the series into:
1. TREND: Long-term direction
2. SEASONAL: Repeating patterns (we use yearly = ~252 trading days)
3. RESIDUAL: Random noise
""")

from statsmodels.tsa.seasonal import seasonal_decompose

# Use multiplicative decomposition (better for stock prices)
# Period = 252 (approximate trading days per year)
# Note: seasonal_decompose needs at least two full periods (504 observations)
decomposition = seasonal_decompose(df['Close'].dropna(), model='multiplicative', period=252)

fig, axes = plt.subplots(4, 1, figsize=(14, 12))

# Original
axes[0].plot(df.index, df['Close'], color='black', linewidth=1)
axes[0].set_title('Original Time Series', fontweight='bold')
axes[0].set_ylabel('Price ($)')

# Trend
axes[1].plot(decomposition.trend.index, decomposition.trend, color='blue', linewidth=1.5)
axes[1].set_title('Trend Component (Long-term Direction)', fontweight='bold')
axes[1].set_ylabel('Trend')

# Seasonal
axes[2].plot(decomposition.seasonal.index, decomposition.seasonal, color='green', linewidth=1)
axes[2].set_title('Seasonal Component (Yearly Pattern)', fontweight='bold')
axes[2].set_ylabel('Seasonal Factor')
axes[2].axhline(y=1, color='red', linestyle='--', alpha=0.5)

# Residual
axes[3].plot(decomposition.resid.index, decomposition.resid, color='purple', linewidth=0.8, alpha=0.7)
axes[3].set_title('Residual Component (Random Noise)', fontweight='bold')
axes[3].set_ylabel('Residual')
axes[3].axhline(y=1, color='red', linestyle='--', alpha=0.5)
axes[3].set_xlabel('Date')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - TREND shows the overall market direction (upward = bull market)")
print("  - SEASONAL shows yearly patterns (values > 1 = above average, < 1 = below)")
print("  - RESIDUAL is what's left (should be random if model is good)")

# Summary
print("\n" + "="*60)
print("SEASONALITY SUMMARY")
print("="*60)
print(f"\nBest performing day: {daily_returns['mean'].idxmax()} ({daily_returns['mean'].max():.3f}%)")
print(f"Worst performing day: {daily_returns['mean'].idxmin()} ({daily_returns['mean'].min():.3f}%)")
print(f"Best performing month: {monthly_returns['mean'].idxmax()} ({monthly_returns['mean'].max():.3f}%)")
print(f"Worst performing month: {monthly_returns['mean'].idxmin()} ({monthly_returns['mean'].min():.3f}%)")

# Clean up temporary columns
df.drop(['Day_Name', 'Month_Name', 'Year'], axis=1, inplace=True)

---

<a id='part4'></a>
# Part 4: Technical Indicators (Feature Engineering)

---

## Why Technical Indicators?

Raw prices have **low predictive power**. Technical indicators:
1. **Extract patterns** from noisy data
2. **Encode domain knowledge** (what traders actually look at)
3. **Create stationary features** from non-stationary prices
4. **Capture momentum, trend, volatility**

### Categories of Technical Indicators:

| Category | Purpose | Examples |
|----------|---------|----------|
| **Trend** | Identify direction | SMA, EMA, MACD |
| **Momentum** | Measure speed of change | RSI, Stochastic, ROC |
| **Volatility** | Measure price variability | Bollinger Bands, ATR |
| **Volume** | Confirm trends with volume | OBV, VWAP |

## 4.1 Moving Averages (Trend Indicators)

In [None]:
# Create feature DataFrame
data = df.copy()

print("="*70)
print("FEATURE ENGINEERING - TECHNICAL INDICATORS")
print("="*70)

# ============================================================
# 1. MOVING AVERAGES (Trend Indicators)
# ============================================================
print("\n" + "="*60)
print("1. MOVING AVERAGES")
print("="*60)

print("""
What are Moving Averages?
- Average of past N days' prices
- Smooths out noise to reveal trend
- Common periods: 5 (week), 20 (month), 50, 200 (long-term)

Types:
- SMA (Simple): Equal weight to all days
- EMA (Exponential): More weight to recent days
""")

# Simple Moving Averages
data['SMA_5'] = data['Close'].rolling(window=5).mean()   # 1 week
data['SMA_10'] = data['Close'].rolling(window=10).mean() # 2 weeks
data['SMA_20'] = data['Close'].rolling(window=20).mean() # 1 month
data['SMA_50'] = data['Close'].rolling(window=50).mean() # ~2.5 months

# Exponential Moving Averages
data['EMA_12'] = data['Close'].ewm(span=12, adjust=False).mean()  # Standard for MACD
data['EMA_26'] = data['Close'].ewm(span=26, adjust=False).mean()  # Standard for MACD

# Price relative to Moving Averages (normalized)
data['Price_to_SMA_20'] = data['Close'] / data['SMA_20']
data['Price_to_SMA_50'] = data['Close'] / data['SMA_50']

print("Created:")
print("  - SMA_5, SMA_10, SMA_20, SMA_50")
print("  - EMA_12, EMA_26")
print("  - Price_to_SMA_20, Price_to_SMA_50 (ratios)")
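
To make the SMA/EMA difference concrete, the sketch below compares the two on a short made-up series and reproduces pandas' EMA by hand. With `adjust=False`, the `span` setting implies a smoothing factor `alpha = 2 / (span + 1)`.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 20.0])  # hypothetical; last value is a jump
span = 3
alpha = 2 / (span + 1)  # smoothing factor implied by `span`

sma = close.rolling(window=span).mean()
ema = close.ewm(span=span, adjust=False).mean()

# Manual EMA recursion: ema_t = alpha * price_t + (1 - alpha) * ema_{t-1}
manual = [close.iloc[0]]
for price in close.iloc[1:]:
    manual.append(alpha * price + (1 - alpha) * manual[-1])

print(pd.DataFrame({"Close": close, "SMA": sma, "EMA": ema, "EMA_manual": manual}))
```

After the jump on the last day, the EMA sits closer to the new price than the SMA does, which is the "more weight to recent days" behavior described above.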

In [None]:
# Visualize Moving Averages
fig, ax = plt.subplots(figsize=(14, 7))

# Plot last 200 days
plot_data = data.tail(200)

ax.plot(plot_data.index, plot_data['Close'], label='Close', linewidth=2, color='black')
ax.plot(plot_data.index, plot_data['SMA_20'], label='SMA 20', linewidth=1.5, 
        linestyle='--', color='blue')
ax.plot(plot_data.index, plot_data['SMA_50'], label='SMA 50', linewidth=1.5, 
        linestyle='--', color='red')
ax.plot(plot_data.index, plot_data['EMA_12'], label='EMA 12', linewidth=1.5, 
        linestyle=':', color='green')

ax.set_title(f'{TICKER} with Moving Averages', fontweight='bold', fontsize=14)
ax.set_xlabel('Date')
ax.set_ylabel('Price ($)')
ax.legend(loc='upper left')

plt.tight_layout()
plt.show()

print("\nTrading Signals from Moving Averages:")
print("  - Price > SMA: Bullish (uptrend)")
print("  - Price < SMA: Bearish (downtrend)")
print("  - SMA_20 crosses above SMA_50: Bullish crossover (the classic 'Golden Cross' uses SMA_50 vs SMA_200)")
print("  - SMA_20 crosses below SMA_50: Bearish crossover (the classic 'Death Cross' uses SMA_50 vs SMA_200)")
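
Crossovers between a fast and a slow average can be located programmatically. This is a minimal sketch on made-up values; `fast` and `slow` stand in for any pair of moving averages, such as `SMA_20` and `SMA_50`.

```python
import pandas as pd

# Hypothetical fast and slow moving averages
fast = pd.Series([1.0, 2.0, 3.0, 4.0, 3.0, 2.0])
slow = pd.Series([2.0, 2.5, 2.8, 3.0, 3.2, 3.1])

# 1 when fast is above slow, 0 otherwise; diff() marks the days the state flips
signal = (fast > slow).astype(int).diff()

cross_up = signal[signal == 1].index.tolist()     # fast moves above slow
cross_down = signal[signal == -1].index.tolist()  # fast moves below slow

print(f"Bullish crossovers at positions: {cross_up}")
print(f"Bearish crossovers at positions: {cross_down}")
```

Because `diff()` leaves the first row as NaN, the first observation is never reported as a crossover, which avoids a spurious signal at the start of the series.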

## 4.2 Momentum Indicators

In [None]:
# ============================================================
# 2. MOMENTUM INDICATORS
# ============================================================
print("\n" + "="*60)
print("2. MOMENTUM INDICATORS")
print("="*60)

print("""
What is Momentum?
- Rate of change in price
- Identifies overbought/oversold conditions
- Helps predict potential reversals
""")

# RSI (Relative Strength Index)
# Measures: Ratio of average gains to average losses
# Range: 0-100
# Interpretation: >70 = overbought, <30 = oversold

def calculate_rsi(prices, period=14):
    """Calculate RSI from simple rolling means of gains and losses (Wilder's original uses smoothed averages)"""
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

data['RSI'] = calculate_rsi(data['Close'], period=14)
print("\nRSI (Relative Strength Index):")
print("  - Formula: 100 - 100/(1 + RS), where RS = Avg Gain / Avg Loss")
print("  - Period: 14 days (standard)")
print("  - Range: 0 to 100")
print("  - Overbought: RSI > 70")
print("  - Oversold: RSI < 30")

# MACD (Moving Average Convergence Divergence)
# Measures: Relationship between two EMAs
# Components: MACD line, Signal line, Histogram

data['MACD'] = data['EMA_12'] - data['EMA_26']  # MACD Line
data['MACD_Signal'] = data['MACD'].ewm(span=9, adjust=False).mean()  # Signal Line
data['MACD_Histogram'] = data['MACD'] - data['MACD_Signal']  # Histogram

print("\nMACD (Moving Average Convergence Divergence):")
print("  - MACD Line: EMA_12 - EMA_26")
print("  - Signal Line: 9-day EMA of MACD Line")
print("  - Histogram: MACD - Signal")
print("  - Buy: MACD crosses above Signal")
print("  - Sell: MACD crosses below Signal")

# Rate of Change (ROC)
data['ROC_5'] = ((data['Close'] - data['Close'].shift(5)) / data['Close'].shift(5)) * 100
data['ROC_10'] = ((data['Close'] - data['Close'].shift(10)) / data['Close'].shift(10)) * 100

print("\nROC (Rate of Change):")
print("  - Formula: ((Price_today - Price_N_days_ago) / Price_N_days_ago) * 100")
print("  - Created: ROC_5, ROC_10")
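
The `calculate_rsi` function above averages gains and losses with a simple rolling mean. Many charting tools instead use Wilder's smoothing, which is an exponential average with `alpha = 1/period`. The sketch below shows that variant on synthetic prices; `rsi_wilder` is a hypothetical helper name and is not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def rsi_wilder(prices: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (exponential average, alpha = 1/period)."""
    delta = prices.diff()
    gain = delta.clip(lower=0)       # positive moves, zero elsewhere
    loss = -delta.clip(upper=0)      # magnitude of negative moves, zero elsewhere
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 300).cumsum())   # synthetic random walk
rsi = rsi_wilder(prices)
print(rsi.dropna().describe().round(2))
```

The two variants usually agree on direction but differ in level, so the 70/30 thresholds may trigger on different days depending on which one is used.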

In [None]:
# Visualize RSI and MACD
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

plot_data = data.tail(200)

# 1. Price
axes[0].plot(plot_data.index, plot_data['Close'], color='black', linewidth=1.5)
axes[0].set_title(f'{TICKER} Price', fontweight='bold')
axes[0].set_ylabel('Price ($)')

# 2. RSI
axes[1].plot(plot_data.index, plot_data['RSI'], color='purple', linewidth=1.5)
axes[1].axhline(y=70, color='red', linestyle='--', alpha=0.7, label='Overbought (70)')
axes[1].axhline(y=30, color='green', linestyle='--', alpha=0.7, label='Oversold (30)')
axes[1].axhline(y=50, color='gray', linestyle='-', alpha=0.3)
axes[1].fill_between(plot_data.index, 30, 70, alpha=0.1, color='gray')
axes[1].set_title('RSI (Relative Strength Index)', fontweight='bold')
axes[1].set_ylabel('RSI')
axes[1].set_ylim(0, 100)
axes[1].legend(loc='upper right')

# 3. MACD
axes[2].plot(plot_data.index, plot_data['MACD'], color='blue', linewidth=1.5, label='MACD')
axes[2].plot(plot_data.index, plot_data['MACD_Signal'], color='red', linewidth=1.5, label='Signal')
axes[2].bar(plot_data.index, plot_data['MACD_Histogram'], 
            color=['green' if x >= 0 else 'red' for x in plot_data['MACD_Histogram']], 
            alpha=0.5, label='Histogram')
axes[2].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[2].set_title('MACD (Moving Average Convergence Divergence)', fontweight='bold')
axes[2].set_ylabel('MACD')
axes[2].legend(loc='upper left')

plt.tight_layout()
plt.show()

## 4.3 Volatility Indicators

In [None]:
# ============================================================
# 3. VOLATILITY INDICATORS
# ============================================================
print("\n" + "="*60)
print("3. VOLATILITY INDICATORS")
print("="*60)

print("""
What is Volatility?
- Measure of price variability/risk
- High volatility = Big price swings (opportunity + risk)
- Low volatility = Stable prices
""")

# Bollinger Bands
# Upper Band: SMA + 2*Std
# Lower Band: SMA - 2*Std
# When price touches bands, it may reverse

data['BB_Middle'] = data['Close'].rolling(window=20).mean()
bb_std = data['Close'].rolling(window=20).std()
data['BB_Upper'] = data['BB_Middle'] + (bb_std * 2)
data['BB_Lower'] = data['BB_Middle'] - (bb_std * 2)
data['BB_Width'] = (data['BB_Upper'] - data['BB_Lower']) / data['BB_Middle'] * 100
data['BB_Position'] = (data['Close'] - data['BB_Lower']) / (data['BB_Upper'] - data['BB_Lower'])

print("\nBollinger Bands:")
print("  - Middle: 20-day SMA")
print("  - Upper: Middle + 2 * StdDev")
print("  - Lower: Middle - 2 * StdDev")
print("  - BB_Width: Band width (volatility measure)")
print("  - BB_Position: Where price is within bands (0=lower, 1=upper)")

# Average True Range (ATR)
# Measures average daily price range
# Used for volatility and stop-loss placement

high_low = data['High'] - data['Low']
high_close = abs(data['High'] - data['Close'].shift())
low_close = abs(data['Low'] - data['Close'].shift())
true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
data['ATR'] = true_range.rolling(window=14).mean()
data['ATR_Pct'] = (data['ATR'] / data['Close']) * 100  # As percentage of price

print("\nATR (Average True Range):")
print("  - True Range: max(High-Low, |High-PrevClose|, |Low-PrevClose|)")
print("  - ATR: 14-day moving average of True Range")
print("  - ATR_Pct: ATR as percentage of price")

# Rolling Volatility (Std of Returns)
data['Volatility_5'] = data['Daily_Return'].rolling(window=5).std()
data['Volatility_20'] = data['Daily_Return'].rolling(window=20).std()

print("\nRolling Volatility:")
print("  - Volatility_5: 5-day rolling std of returns")
print("  - Volatility_20: 20-day rolling std of returns")
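
Rolling standard deviations of daily returns are often quoted on an annual basis. The sketch below shows the usual square-root-of-time scaling on synthetic prices. It assumes fractional (not percentage) returns and roughly 252 trading days per year, both of which are conventions rather than something this notebook fixes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, 500).cumsum()))  # synthetic prices

daily_return = close.pct_change()                 # fractional daily returns
vol_20 = daily_return.rolling(window=20).std()    # same idea as Volatility_20
vol_20_annual = vol_20 * np.sqrt(252)             # assumes about 252 trading days per year

print(f"Latest 20-day volatility: {vol_20.iloc[-1]:.4f} daily, "
      f"{vol_20_annual.iloc[-1]:.2%} annualized")
```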

In [None]:
# Visualize Bollinger Bands
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

plot_data = data.tail(200)

# 1. Bollinger Bands
axes[0].plot(plot_data.index, plot_data['Close'], label='Close', color='black', linewidth=1.5)
axes[0].plot(plot_data.index, plot_data['BB_Upper'], label='Upper Band', 
             color='red', linestyle='--', linewidth=1)
axes[0].plot(plot_data.index, plot_data['BB_Middle'], label='Middle (SMA 20)', 
             color='blue', linestyle='--', linewidth=1)
axes[0].plot(plot_data.index, plot_data['BB_Lower'], label='Lower Band', 
             color='green', linestyle='--', linewidth=1)
axes[0].fill_between(plot_data.index, plot_data['BB_Upper'], plot_data['BB_Lower'], 
                     alpha=0.1, color='gray')
axes[0].set_title(f'{TICKER} with Bollinger Bands', fontweight='bold', fontsize=14)
axes[0].set_ylabel('Price ($)')
axes[0].legend(loc='upper left')

# 2. Band Width (Volatility)
axes[1].plot(plot_data.index, plot_data['BB_Width'], color='purple', linewidth=1.5)
axes[1].fill_between(plot_data.index, plot_data['BB_Width'], alpha=0.3, color='purple')
axes[1].axhline(y=plot_data['BB_Width'].mean(), color='red', linestyle='--', 
                label=f"Mean: {plot_data['BB_Width'].mean():.2f}%")
axes[1].set_title('Bollinger Band Width (Volatility Measure)', fontweight='bold', fontsize=14)
axes[1].set_ylabel('BB Width (%)')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nBollinger Band Trading Signals:")
print("  - Price at Upper Band: Potentially overbought")
print("  - Price at Lower Band: Potentially oversold")
print("  - Bands Narrowing: Low volatility (breakout may follow)")
print("  - Bands Widening: High volatility")

## 4.4 Lag Features and Additional Features

In [None]:
# ============================================================
# 4. LAG FEATURES
# ============================================================
print("\n" + "="*60)
print("4. LAG FEATURES")
print("="*60)

print("""
What are Lag Features?
- Past values of the target/features
- Captures autocorrelation patterns
- Essential for time series ML

Why use them?
- Yesterday's price may predict today's price
- Multiple lags capture different patterns
""")

# Lag features for Close price
for lag in [1, 2, 3, 5, 7, 10]:
    data[f'Close_Lag_{lag}'] = data['Close'].shift(lag)
    data[f'Return_Lag_{lag}'] = data['Daily_Return'].shift(lag)

print("Created Close and Return lags: 1, 2, 3, 5, 7, 10 days")

# ============================================================
# 5. ADDITIONAL FEATURES
# ============================================================
print("\n" + "="*60)
print("5. ADDITIONAL FEATURES")
print("="*60)

# Price-based features
data['Open_Close_Pct'] = (data['Close'] - data['Open']) / data['Open'] * 100  # Intraday return
data['High_Low_Pct'] = (data['High'] - data['Low']) / data['Low'] * 100  # Daily range %

# Volume features
data['Volume_SMA_20'] = data['Volume'].rolling(window=20).mean()
data['Volume_Ratio'] = data['Volume'] / data['Volume_SMA_20']  # Volume spike detection

# Time features (day of week, month)
data['Day_of_Week'] = data.index.dayofweek  # 0=Monday, 4=Friday
data['Month'] = data.index.month
data['Quarter'] = data.index.quarter

# Target variable: Next day's close price
data['Target'] = data['Close'].shift(-1)  # What we want to predict!

print("\nAdditional features created:")
print("  - Open_Close_Pct: Intraday return")
print("  - High_Low_Pct: Daily range percentage")
print("  - Volume_SMA_20, Volume_Ratio: Volume patterns")
print("  - Day_of_Week, Month, Quarter: Calendar features")
print("  - Target: Next day's close (what we predict)")
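
The direction of each shift decides whether a column is a legitimate feature or a look-ahead. A small sketch with made-up prices shows how `shift(1)` and `shift(-1)` line up on the same row:

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0],
                  index=pd.date_range("2024-01-01", periods=5, freq="B"),
                  name="Close")

frame = pd.DataFrame({
    "Close": close,
    "Close_Lag_1": close.shift(1),   # yesterday's close: known today, safe as a feature
    "Target": close.shift(-1),       # tomorrow's close: unknown today, only valid as the label
})
print(frame)

usable = frame.dropna()              # first row lacks a lag, last row lacks a target
print(f"Usable rows: {len(usable)} of {len(frame)}")
```

Any column built with a negative shift carries future information and should never appear among the features.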

In [None]:
# Summary of all features
print("="*70)
print("FEATURE ENGINEERING SUMMARY")
print("="*70)

print(f"\nTotal columns: {len(data.columns)}")
print("\nFeatures by Category:")

trend_features = ['SMA_5', 'SMA_10', 'SMA_20', 'SMA_50', 'EMA_12', 'EMA_26', 
                  'Price_to_SMA_20', 'Price_to_SMA_50']
momentum_features = ['RSI', 'MACD', 'MACD_Signal', 'MACD_Histogram', 'ROC_5', 'ROC_10']
volatility_features = ['BB_Middle', 'BB_Upper', 'BB_Lower', 'BB_Width', 'BB_Position', 
                       'ATR', 'ATR_Pct', 'Volatility_5', 'Volatility_20']
lag_features = [col for col in data.columns if 'Lag' in col]

print(f"\n  Trend ({len(trend_features)}):     {trend_features}")
print(f"\n  Momentum ({len(momentum_features)}):  {momentum_features}")
print(f"\n  Volatility ({len(volatility_features)}): {volatility_features}")
print(f"\n  Lag ({len(lag_features)}):       {lag_features}")

print("\n" + "="*70)
print("SAMPLE DATA WITH ALL FEATURES")
print("="*70)
data.tail(5)

---

<a id='part5'></a>
# Part 5: Data Preprocessing

---

## 5.1 Handle Missing Values

Rolling-window indicators are NaN until their window fills (the first 49 rows for `SMA_50`), and the shifted `Target` is NaN on the last row.
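
A toy illustration of where the missing rows come from, using a synthetic 100-row series rather than the notebook's data:

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(1.0, 101.0))        # 100 toy prices

features = pd.DataFrame({
    "SMA_20": close.rolling(window=20).mean(),  # NaN for the first 19 rows
    "SMA_50": close.rolling(window=50).mean(),  # NaN for the first 49 rows
    "Target": close.shift(-1),                  # NaN on the last row
})

print(features.isna().sum())                    # NaN count per column
clean = features.dropna()
print(f"Rows kept: {len(clean)} of {len(features)}")
```

The longest window determines how many leading rows are lost, so adding a 200-day indicator would cost 199 rows instead of 49.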

In [None]:
print("="*70)
print("HANDLING MISSING VALUES")
print("="*70)

print(f"\nBefore cleaning: {data.shape}")
print(f"Total NaN values: {data.isnull().sum().sum()}")

# Show which columns have NaN
nan_counts = data.isnull().sum()
nan_cols = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(f"\nColumns with NaN (top 10):")
print(nan_cols.head(10))

# Drop rows with NaN
data_clean = data.dropna()

print(f"\nAfter cleaning: {data_clean.shape}")
print(f"Rows removed: {len(data) - len(data_clean)}")
print(f"\nWhy? Rolling windows (like SMA_50) need 50 days of history.")
print(f"     Also, Target shift removes last row.")

## 5.2 Feature Selection

In [None]:
print("="*70)
print("FEATURE SELECTION")
print("="*70)

# Columns to exclude from features
exclude_cols = ['Target', 'Adj Close']  # Target is what we predict, Adj Close is redundant

# All other columns are features
feature_cols = [col for col in data_clean.columns if col not in exclude_cols]

X = data_clean[feature_cols]
y = data_clean['Target']

print(f"\nFeatures (X): {X.shape}")
print(f"Target (y): {y.shape}")

print(f"\nFeature columns ({len(feature_cols)}):")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i:2d}. {col}")

## 5.3 Time Series Train-Test Split

### CRITICAL: Why NOT Random Split?

| Split Type | Method | Problem for Time Series |
|------------|--------|------------------------|
| **Random** | Shuffle, then split | Future data leaks into training! |
| **Time-Based** | Split by date | Correct - no future leakage |

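**Data Leakage**: Training on information from the future that would not be available at prediction time. The test score then looks better than real-world performance would be.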

```
WRONG (Random):    Train: [2020, 2023, 2019, 2022] | Test: [2021, 2024]
RIGHT (Time):      Train: [2019, 2020, 2021, 2022] | Test: [2023, 2024]
```
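
The time-based split can be wrapped in a small helper with a built-in leakage check. This is a sketch on synthetic data; `chronological_split` is a hypothetical name, not a library function.

```python
import numpy as np
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_frac: float = 0.8):
    """Split a time-indexed frame into past (train) and future (test)."""
    frame = frame.sort_index()                  # make sure rows are in time order
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

dates = pd.date_range("2019-01-01", periods=1000, freq="B")
toy = pd.DataFrame({"Close": np.linspace(100, 200, len(dates))}, index=dates)

train, test = chronological_split(toy)

# Leakage check: every training date must come before every test date.
assert train.index.max() < test.index.min()
print(f"Train: {train.index.min().date()} to {train.index.max().date()} ({len(train)} rows)")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()} ({len(test)} rows)")
```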

In [None]:
print("="*70)
print("TIME-BASED TRAIN-TEST SPLIT")
print("="*70)

# Split point: 80% train, 20% test
split_idx = int(len(X) * 0.8)

# Split chronologically (NOT randomly!)
X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"\nTraining Set:")
print(f"  Samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Period: {X_train.index[0].strftime('%Y-%m-%d')} to {X_train.index[-1].strftime('%Y-%m-%d')}")

print(f"\nTest Set:")
print(f"  Samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"  Period: {X_test.index[0].strftime('%Y-%m-%d')} to {X_test.index[-1].strftime('%Y-%m-%d')}")

print("\nIMPORTANT: Train on PAST, test on FUTURE!")
print("          This simulates real-world prediction.")

# Visualize split
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(y_train.index, y_train.values, color='blue', label='Training Data', linewidth=1.5)
ax.plot(y_test.index, y_test.values, color='orange', label='Test Data', linewidth=1.5)
ax.axvline(x=X_test.index[0], color='red', linestyle='--', linewidth=2, label='Split Point')
ax.set_title('Time-Based Train-Test Split', fontweight='bold', fontsize=14)
ax.set_xlabel('Date')
ax.set_ylabel('Price ($)')
ax.legend()
plt.tight_layout()
plt.show()

## 5.4 Feature Scaling

### Why Scale Features?

| Algorithm | Needs Scaling? | Why |
|-----------|---------------|-----|
| **Linear Models** | Yes (especially Ridge/Lasso) | Regularization penalties depend on feature scale; gradient-based solvers also converge faster |
| **KNN** | Yes | Distance-based, sensitive to scale |
| **SVM** | Yes | Kernel calculations affected by scale |
| **Decision Trees** | No | Split-based, scale-invariant |
| **Random Forest** | No | Ensemble of trees |

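**We'll scale once and keep both versions** - the scaled data goes to the scale-sensitive models, while the tree-based models are trained on the original features, since scaling does not affect them.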
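
With time series, the scaling statistics should come from the training window only. The sketch below applies the same z-score transform as `StandardScaler` by hand in NumPy, on a synthetic feature with upward drift:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy feature with an upward drift, so the "future" (test) sits at a higher level.
feature = np.linspace(100, 200, 500) + rng.normal(0, 5, 500)
train, test = feature[:400], feature[400:]

mu, sigma = train.mean(), train.std()   # statistics from the training window only
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma       # reuse the training statistics

print(f"Train scaled: mean={train_scaled.mean():.3f}, std={train_scaled.std():.3f}")
print(f"Test scaled:  mean={test_scaled.mean():.3f}, std={test_scaled.std():.3f}")
```

The scaled test set is not centered at zero here. That is expected for trending price levels and is a reminder that the test period can lie outside the range the model saw in training.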

In [None]:
print("="*70)
print("FEATURE SCALING")
print("="*70)

# Use StandardScaler: transforms to mean=0, std=1
scaler = StandardScaler()

# IMPORTANT: Fit on training data ONLY, then transform both
# This prevents data leakage from test set!
X_train_scaled = scaler.fit_transform(X_train)  # Fit + Transform
X_test_scaled = scaler.transform(X_test)         # Transform only (use train stats)

# Convert to DataFrame for convenience
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\nBefore Scaling (Training Set - first 5 features):")
print(X_train.iloc[:, :5].describe().loc[['mean', 'std', 'min', 'max']].round(2))

print("\nAfter Scaling (Training Set - first 5 features):")
print(X_train_scaled.iloc[:, :5].describe().loc[['mean', 'std', 'min', 'max']].round(4))

print("\nScaling ensures:")
print("  - Mean ≈ 0 for all features")
print("  - Std ≈ 1 for all features")
print("  - No feature dominates due to scale differences")

---

<a id='part6'></a>
# Part 6: Model Selection & Training

---

## Why These Models?

| Model | Type | Why Use for Stock Prediction |
|-------|------|-----------------------------|
| **Linear Regression** | Linear | Baseline, interpretable, fast |
| **Ridge** | Regularized Linear | Handles multicollinearity (correlated features) |
| **Lasso** | Regularized Linear | Feature selection (sets some coefficients to 0) |
| **Random Forest** | Ensemble | Captures non-linear patterns, robust |
| **Gradient Boosting** | Ensemble | Often best accuracy, sequential learning |
| **SVR** | Kernel-based | Good for non-linear, few samples (listed for reference; not in the `models` dict below) |
| **KNN** | Instance-based | Similar days have similar outcomes |

## Key Hyperparameters:

| Model | Key Parameters | What They Control |
|-------|---------------|-------------------|
| **Ridge/Lasso** | alpha | Regularization strength (higher = simpler model) |
| **Random Forest** | n_estimators, max_depth | Number of trees, tree complexity |
| **Gradient Boosting** | n_estimators, learning_rate | Number of boosting rounds, step size |
| **KNN** | n_neighbors | Number of similar samples to use |
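
The effect of `alpha` can be seen directly from the closed-form ridge solution. The sketch below uses two nearly identical synthetic features to mimic multicollinearity; `ridge_coefficients` is a hypothetical helper that omits the intercept for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 1.0, 100.0]:
    w = ridge_coefficients(X, y, alpha)
    print(f"alpha={alpha:>6}: coefficients={np.round(w, 3)}, norm={np.linalg.norm(w):.3f}")
```

Larger `alpha` values shrink the coefficient vector, which stabilizes the estimates when features are highly correlated.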

In [None]:
# Define models with explanations
print("="*70)
print("MODEL DEFINITIONS")
print("="*70)

models = {
    'Linear Regression': {
        'model': LinearRegression(),
        'description': 'Simple linear fit, no regularization',
        'needs_scaling': True
    },
    'Ridge Regression': {
        'model': Ridge(alpha=1.0, random_state=42),
        'description': 'L2 regularization, handles multicollinearity',
        'needs_scaling': True
    },
    'Lasso Regression': {
        'model': Lasso(alpha=0.1, random_state=42),
        'description': 'L1 regularization, feature selection',
        'needs_scaling': True
    },
    'Random Forest': {
        'model': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1),
        'description': '100 trees, max depth 10, parallel',
        'needs_scaling': False
    },
    'Gradient Boosting': {
        'model': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42),
        'description': '100 boosting rounds, learning rate 0.1',
        'needs_scaling': False
    },
    'KNN Regressor': {
        'model': KNeighborsRegressor(n_neighbors=10, weights='distance'),
        'description': '10 neighbors, distance-weighted',
        'needs_scaling': True
    }
}

print("\nModels to train:")
for i, (name, info) in enumerate(models.items(), 1):
    print(f"\n{i}. {name}")
    print(f"   {info['description']}")
    print(f"   Needs scaling: {info['needs_scaling']}")

In [None]:
# Train all models
print("="*70)
print("TRAINING ALL MODELS")
print("="*70)

results = {}

for name, info in models.items():
    print(f"\nTraining {name}...")
    
    model = info['model']
    
    # Use scaled or unscaled data based on model needs
    if info['needs_scaling']:
        X_tr, X_te = X_train_scaled, X_test_scaled
    else:
        X_tr, X_te = X_train, X_test
    
    # Train
    model.fit(X_tr, y_train)
    
    # Predict
    y_pred_train = model.predict(X_tr)
    y_pred_test = model.predict(X_te)
    
    # Calculate metrics
    results[name] = {
        'model': model,
        'y_pred_train': y_pred_train,
        'y_pred_test': y_pred_test,
        'train_rmse': np.sqrt(mean_squared_error(y_train, y_pred_train)),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'train_mae': mean_absolute_error(y_train, y_pred_train),
        'test_mae': mean_absolute_error(y_test, y_pred_test),
        'train_r2': r2_score(y_train, y_pred_train),
        'test_r2': r2_score(y_test, y_pred_test),
        'test_mape': mean_absolute_percentage_error(y_test, y_pred_test) * 100
    }
    
    r = results[name]
    print(f"  Train RMSE: ${r['train_rmse']:.2f} | Test RMSE: ${r['test_rmse']:.2f}")
    print(f"  Train R²: {r['train_r2']:.4f} | Test R²: {r['test_r2']:.4f}")

print("\n" + "="*70)
print("All models trained!")

---

<a id='part7'></a>
# Part 7: Model Evaluation & Comparison

---

## Regression Metrics Explained

| Metric | Formula | Interpretation | Good Value |
|--------|---------|----------------|------------|
| **RMSE** | √(Σ(y-ŷ)²/n) | Error in original units ($) | Lower is better |
| **MAE** | Σ\|y-ŷ\|/n | Average absolute error ($) | Lower is better |
| **R²** | 1 - SS_res/SS_tot | Share of variance explained (at most 1; can be negative on test data) | Closer to 1 |
| **MAPE** | Σ\|(y-ŷ)/y\|/n × 100 | Average percentage error | Lower is better |

### Which Metric to Focus On?

- **R²**: Good for comparing models (how much variance explained)
- **RMSE**: Penalizes large errors more (sensitive to outliers)
- **MAE**: More robust to outliers
- **MAPE**: Scale-independent, easy to interpret
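
The four metrics can be computed directly from their definitions. This is a sketch with made-up numbers; `regression_metrics` is a hypothetical helper, and the notebook itself relies on the scikit-learn metric functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R² and MAPE computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1 - ss_res / ss_tot),
        "mape": float(np.mean(np.abs(err / y_true)) * 100),
    }

y_true = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 101.0, 103.0, 104.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# A forecast that is far off everywhere scores worse than predicting the mean.
bad_metrics = regression_metrics(y_true, [90.0] * 4)
print(f"R² of a poor constant forecast: {bad_metrics['r2']:.2f}")
```

The second call shows that R² is not bounded below by zero: a model that does worse than predicting the mean gets a negative score.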

## 7.1 Model Comparison

In [None]:
# Create comparison table
print("="*70)
print("MODEL COMPARISON")
print("="*70)

comparison = pd.DataFrame({
    'Model': list(results.keys()),
    'Train RMSE': [r['train_rmse'] for r in results.values()],
    'Test RMSE': [r['test_rmse'] for r in results.values()],
    'Train R²': [r['train_r2'] for r in results.values()],
    'Test R²': [r['test_r2'] for r in results.values()],
    'Test MAPE %': [r['test_mape'] for r in results.values()]
}).sort_values('Test RMSE')

comparison['Rank'] = range(1, len(comparison) + 1)
comparison = comparison[['Rank', 'Model', 'Train RMSE', 'Test RMSE', 'Train R²', 'Test R²', 'Test MAPE %']]

print(comparison.to_string(index=False))

best_model_name = comparison.iloc[0]['Model']
print(f"\nBest Model (by Test RMSE): {best_model_name}")
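
A useful reference point for the comparison table is a naive persistence forecast, which predicts that tomorrow's close equals today's. The sketch below computes its RMSE on a synthetic random walk, not on the notebook's data. The same calculation with `X_test['Close']` against `y_test` would give the benchmark for this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())   # toy random-walk prices

target = close.shift(-1)     # next-day close, built like the notebook's Target
naive_pred = close           # persistence: predict tomorrow = today

mask = target.notna()        # the last row has no next-day value
naive_rmse = float(np.sqrt(np.mean((target[mask] - naive_pred[mask]) ** 2)))
print(f"Naive persistence RMSE: {naive_rmse:.3f}")
```

A model whose test RMSE is not clearly below this baseline has learned little beyond "the price stays where it is".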

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

sorted_models = comparison['Model'].tolist()
test_rmse = comparison['Test RMSE'].tolist()
test_r2 = comparison['Test R²'].tolist()
test_mape = comparison['Test MAPE %'].tolist()

colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(sorted_models)))

# RMSE
axes[0].barh(sorted_models[::-1], test_rmse[::-1], color=colors[::-1], edgecolor='black')
axes[0].set_xlabel('Test RMSE ($)')
axes[0].set_title('RMSE (Lower is Better)', fontweight='bold')
for i, (m, v) in enumerate(zip(sorted_models[::-1], test_rmse[::-1])):
    axes[0].text(v + 0.2, i, f'${v:.2f}', va='center', fontweight='bold', fontsize=9)

# R²
colors_r2 = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(sorted_models)))
axes[1].barh(sorted_models[::-1], test_r2[::-1], color=colors_r2[::-1], edgecolor='black')
axes[1].set_xlabel('Test R² Score')
axes[1].set_title('R² (Higher is Better)', fontweight='bold')
axes[1].set_xlim(min(0.0, min(test_r2)), 1.05)  # R² can be negative on test data
for i, (m, v) in enumerate(zip(sorted_models[::-1], test_r2[::-1])):
    axes[1].text(v + 0.01, i, f'{v:.4f}', va='center', fontweight='bold', fontsize=9)

# MAPE
axes[2].barh(sorted_models[::-1], test_mape[::-1], color=colors[::-1], edgecolor='black')
axes[2].set_xlabel('Test MAPE (%)')
axes[2].set_title('MAPE (Lower is Better)', fontweight='bold')
for i, (m, v) in enumerate(zip(sorted_models[::-1], test_mape[::-1])):
    axes[2].text(v + 0.05, i, f'{v:.2f}%', va='center', fontweight='bold', fontsize=9)

plt.tight_layout()
plt.show()

## 7.2 Overfitting Analysis

### What is Overfitting?

When a model learns the **training data too well** (including noise) and fails on new data.

| Scenario | Train Error | Test Error | Problem |
|----------|------------|------------|----------|
| **Underfitting** | High | High | Model too simple |
| **Good Fit** | Low | Low (similar) | Optimal complexity |
| **Overfitting** | Very Low | High | Model too complex |
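
The table can be expressed as a small rule on the relative gap between train and test error. The 10% and 50% thresholds below are illustrative assumptions, not standard values, and `overfitting_status` is a hypothetical helper.

```python
def overfitting_status(train_rmse: float, test_rmse: float) -> str:
    """Label the train/test gap using illustrative thresholds (10% and 50%)."""
    if train_rmse <= 0:
        # A perfect training fit (a memorizing model) makes the ratio undefined.
        return "train error is zero: gap undefined, inspect test error directly"
    gap_pct = (test_rmse - train_rmse) / train_rmse * 100
    if gap_pct > 50:
        return f"overfitting (gap {gap_pct:.1f}%)"
    if gap_pct < 10:
        return f"good generalization (gap {gap_pct:.1f}%)"
    return f"slight overfitting (gap {gap_pct:.1f}%)"

for train, test in [(2.0, 2.1), (2.0, 2.5), (0.5, 4.0), (0.0, 3.0)]:
    print(f"train={train}, test={test}: {overfitting_status(train, test)}")
```

The zero-error branch matters for models that can reproduce their training data exactly, where a percentage gap carries no information.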

In [None]:
# Check for overfitting
print("="*70)
print("OVERFITTING ANALYSIS")
print("="*70)

fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(sorted_models))
width = 0.35

train_rmse = [results[m]['train_rmse'] for m in sorted_models]
test_rmse = [results[m]['test_rmse'] for m in sorted_models]

bars1 = ax.bar(x - width/2, train_rmse, width, label='Train RMSE', color='steelblue', edgecolor='black')
bars2 = ax.bar(x + width/2, test_rmse, width, label='Test RMSE', color='darkorange', edgecolor='black')

ax.set_xlabel('Model')
ax.set_ylabel('RMSE ($)')
ax.set_title('Train vs Test RMSE - Overfitting Check', fontweight='bold', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(sorted_models, rotation=15, ha='right')
ax.legend()

plt.tight_layout()
plt.show()

print("\nOverfitting Analysis:")
for name in sorted_models:
    train = results[name]['train_rmse']
    test = results[name]['test_rmse']
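    gap = test - train
    # Guard against a zero training error (distance-weighted KNN can fit its training data exactly)
    gap_pct = (gap / train) * 100 if train > 0 else float('inf')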
    
    if gap_pct > 50:
        status = "OVERFITTING"
    elif gap_pct < 10:
        status = "Good generalization"
    else:
        status = "Slight overfitting"
    
    print(f"  {name}: Train={train:.2f}, Test={test:.2f}, Gap={gap_pct:.1f}% - {status}")

## 7.3 Time Series Cross-Validation

### Why Special CV for Time Series?

Regular K-Fold CV validates on folds that come before part of the training data, so the model is trained on the future - **BAD for time series!**

**TimeSeriesSplit** preserves temporal order:
```
Fold 1: Train=[1,2]     Test=[3]
Fold 2: Train=[1,2,3]   Test=[4]
Fold 3: Train=[1,2,3,4] Test=[5]
```
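
The expanding-window layout above can be reproduced with a few lines of plain Python. This is a hand-rolled sketch intended to mirror the diagram with zero-based indices; `expanding_window_splits` is a hypothetical helper, and in the notebook `TimeSeriesSplit` does this job.

```python
def expanding_window_splits(n_samples: int, n_splits: int):
    """Yield (train_indices, test_indices) with training always before testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield list(range(0, start)), list(range(start, start + test_size))

folds = list(expanding_window_splits(n_samples=5, n_splits=3))
for i, (train_idx, test_idx) in enumerate(folds, 1):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
```

Each fold's training window ends exactly where its test window begins, so no fold ever trains on data that comes after its test data.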

In [None]:
print("="*70)
print("TIME SERIES CROSS-VALIDATION")
print("="*70)

# 5-fold time series split
tscv = TimeSeriesSplit(n_splits=5)
cv_results = {}

for name, info in models.items():
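    # Choose scaled or unscaled data
    # Note: X_train_scaled uses statistics from the whole training set, so the
    # validation folds are not fully isolated from the scaler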
    if info['needs_scaling']:
        X_cv = X_train_scaled
    else:
        X_cv = X_train
    
    # Cross-validation
    scores = cross_val_score(info['model'], X_cv, y_train, 
                            cv=tscv, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    
    cv_results[name] = {
        'mean': rmse_scores.mean(),
        'std': rmse_scores.std(),
        'scores': rmse_scores
    }
    
    print(f"\n{name}:")
    print(f"  Fold RMSE: {rmse_scores.round(2)}")
    print(f"  Mean: ${rmse_scores.mean():.2f} (+/- ${rmse_scores.std()*2:.2f})")
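
When scaling is part of the workflow, one way to keep each validation fold fully separate is to place the scaler inside a scikit-learn pipeline, so it is refit on the training part of every fold. The sketch below uses synthetic data and Ridge as an example model; it is an alternative to the loop above, not a description of what this notebook runs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))                    # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)

# The scaler lives inside the pipeline, so each fold fits it on that fold's training rows.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo,
                         cv=TimeSeriesSplit(n_splits=5),
                         scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-scores)
print("Fold RMSE:", rmse_per_fold.round(3))
```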

In [None]:
# Visualize CV results
cv_sorted = dict(sorted(cv_results.items(), key=lambda x: x[1]['mean']))
names_cv = list(cv_sorted.keys())
means = [cv_sorted[n]['mean'] for n in names_cv]
stds = [cv_sorted[n]['std'] for n in names_cv]

fig, ax = plt.subplots(figsize=(12, 6))

colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(names_cv)))
bars = ax.barh(names_cv[::-1], means[::-1], xerr=stds[::-1],
               color=colors[::-1], edgecolor='black', capsize=5)

ax.set_xlabel('Cross-Validation RMSE ($)')
ax.set_title('5-Fold Time Series Cross-Validation', fontweight='bold', fontsize=14)

for bar, mean, std in zip(bars, means[::-1], stds[::-1]):
    ax.text(bar.get_width() + std + 0.5, bar.get_y() + bar.get_height()/2,
            f'${mean:.2f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

best_cv_model = names_cv[0]
print(f"\nBest Model (CV): {best_cv_model}")
print(f"Mean CV RMSE: ${cv_sorted[best_cv_model]['mean']:.2f}")

---

<a id='part8'></a>
# Part 8: Hyperparameter Tuning

---

## What is Hyperparameter Tuning?

**Hyperparameters** are settings chosen BEFORE training (unlike model parameters learned during training).

| Parameter Type | Example | Set By |
|---------------|---------|--------|
| **Model Parameters** | Weights, coefficients | Learned during training |
| **Hyperparameters** | n_estimators, max_depth | Chosen before training |
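
Grid search tries every combination of the listed hyperparameter values, so the cost grows multiplicatively. The sketch below enumerates the Random Forest grid used in this part with `itertools.product`; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# One dict per candidate configuration, in the order the values are listed.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 3

print(f"Combinations: {len(combinations)}")
print(f"Model fits with {n_folds}-fold CV: {len(combinations) * n_folds}")
print("First candidate:", combinations[0])
```

Adding one more value to any list multiplies the total, which is why the tuning cell uses fewer CV folds than the earlier comparison.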

## Tuning the Best Model

In [None]:
print("="*70)
print("HYPERPARAMETER TUNING")
print("="*70)

# Tune Random Forest as a worked example (it is not necessarily the best model in the comparison above)
print("\nTuning Random Forest Regressor...")
print("\nHyperparameter Grid:")

param_grid = {
    'n_estimators': [50, 100, 200],      # Number of trees
    'max_depth': [5, 10, 15, None],       # Tree depth (None = unlimited)
    'min_samples_split': [2, 5, 10],      # Min samples to split a node
    'min_samples_leaf': [1, 2, 4]         # Min samples in a leaf
}

for param, values in param_grid.items():
    print(f"  {param}: {values}")

print(f"\nTotal combinations: {np.prod([len(v) for v in param_grid.values()])}")
print("Using GridSearchCV with TimeSeriesSplit...")

# Grid Search with Time Series CV
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
tscv = TimeSeriesSplit(n_splits=3)  # Use 3 folds for speed

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("\n" + "="*70)
print("TUNING RESULTS")
print("="*70)
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

best_rmse = np.sqrt(-grid_search.best_score_)
print(f"\nBest CV RMSE: ${best_rmse:.2f}")

In [None]:
# Compare tuned vs default
print("="*70)
print("TUNED MODEL PERFORMANCE")
print("="*70)

# Predict with tuned model
best_rf = grid_search.best_estimator_
y_pred_tuned = best_rf.predict(X_test)

tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred_tuned))
tuned_r2 = r2_score(y_test, y_pred_tuned)
tuned_mape = mean_absolute_percentage_error(y_test, y_pred_tuned) * 100

default_rmse = results['Random Forest']['test_rmse']
default_r2 = results['Random Forest']['test_r2']

print("\nComparison:")
print(f"  {'Metric':<15} {'Default':>15} {'Tuned':>15} {'Change':>15}")
print(f"  {'-'*60}")
print(f"  {'Test RMSE':<15} ${default_rmse:>14.2f} ${tuned_rmse:>14.2f} {(tuned_rmse-default_rmse)/default_rmse*100:>+14.1f}%")
print(f"  {'Test R²':<15} {default_r2:>15.4f} {tuned_r2:>15.4f} {tuned_r2-default_r2:>+15.4f}")

if tuned_rmse < default_rmse:
    print("\nTuning improved the model!")
else:
    print("\nDefault parameters were already good.")

# Add tuned model to results
results['Random Forest (Tuned)'] = {
    'model': best_rf,
    'y_pred_test': y_pred_tuned,
    'test_rmse': tuned_rmse,
    'test_r2': tuned_r2,
    'test_mape': tuned_mape
}

---

<a id='part9'></a>
# Part 9: Final Predictions & Visualization

---

## 9.1 Best Model Predictions

In [None]:
# Select best model based on test RMSE
# (choosing on the test set makes its score optimistic; CV RMSE is the safer criterion)
best_model_name = min(results.items(), key=lambda x: x[1].get('test_rmse', float('inf')))[0]
best_result = results[best_model_name]

print("="*70)
print(f"FINAL PREDICTIONS - {best_model_name}")
print("="*70)

# Create predictions DataFrame
predictions_df = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': best_result['y_pred_test']
}, index=y_test.index)

predictions_df['Error'] = predictions_df['Predicted'] - predictions_df['Actual']
predictions_df['Abs_Error'] = abs(predictions_df['Error'])
predictions_df['Pct_Error'] = (predictions_df['Error'] / predictions_df['Actual']) * 100

print("\nLast 10 Predictions:")
print(predictions_df.tail(10).round(2))

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Actual vs Predicted over time
axes[0, 0].plot(predictions_df.index, predictions_df['Actual'], 
                label='Actual', color='black', linewidth=1.5)
axes[0, 0].plot(predictions_df.index, predictions_df['Predicted'], 
                label='Predicted', color='red', linewidth=1.5, alpha=0.8)
axes[0, 0].set_title(f'{TICKER} Actual vs Predicted - {best_model_name}', fontweight='bold', fontsize=12)
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].legend()

# 2. Scatter plot
axes[0, 1].scatter(predictions_df['Actual'], predictions_df['Predicted'], 
                   alpha=0.5, color='steelblue', edgecolor='black')
min_val = min(predictions_df['Actual'].min(), predictions_df['Predicted'].min())
max_val = max(predictions_df['Actual'].max(), predictions_df['Predicted'].max())
axes[0, 1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
axes[0, 1].set_title('Actual vs Predicted Scatter', fontweight='bold', fontsize=12)
axes[0, 1].set_xlabel('Actual Price ($)')
axes[0, 1].set_ylabel('Predicted Price ($)')
axes[0, 1].legend()

# 3. Error distribution
axes[1, 0].hist(predictions_df['Error'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1, 0].axvline(x=predictions_df['Error'].mean(), color='green', linestyle='--', 
                   linewidth=2, label=f"Mean: ${predictions_df['Error'].mean():.2f}")
axes[1, 0].set_title('Prediction Error Distribution', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Error ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()

# 4. Error over time
axes[1, 1].plot(predictions_df.index, predictions_df['Error'], color='purple', linewidth=1)
axes[1, 1].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[1, 1].fill_between(predictions_df.index, predictions_df['Error'], 
                        where=(predictions_df['Error'] > 0), color='green', alpha=0.3)
axes[1, 1].fill_between(predictions_df.index, predictions_df['Error'], 
                        where=(predictions_df['Error'] < 0), color='red', alpha=0.3)
axes[1, 1].set_title('Prediction Error Over Time', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Error ($)')

plt.tight_layout()
plt.show()

## 9.2 Feature Importance Analysis

In [None]:
# Get feature importance from tree-based model
print("="*70)
print("FEATURE IMPORTANCE ANALYSIS")
print("="*70)

if 'Random Forest (Tuned)' in results:
    model_for_importance = results['Random Forest (Tuned)']['model']
elif 'Random Forest' in results:
    model_for_importance = results['Random Forest']['model']
else:
    raise KeyError("No Random Forest model found in results")

importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': model_for_importance.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 15 Most Important Features:")
print(importance_df.head(15).to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 10))

top_20 = importance_df.head(20)
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_20)))

bars = ax.barh(range(len(top_20)), top_20['Importance'].values, color=colors, edgecolor='black')
ax.set_yticks(range(len(top_20)))
ax.set_yticklabels(top_20['Feature'].values)
ax.invert_yaxis()
ax.set_xlabel('Importance')
ax.set_title('Top 20 Feature Importance', fontweight='bold', fontsize=14)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Lag features often dominate (yesterday's price predicts today's)")
print("- Technical indicators help capture momentum and trends")
print("- Calendar features (day/month) may show seasonal patterns")

---

<a id='part10'></a>
# Part 10: Summary and Conclusions

---

In [None]:
# Final summary dashboard
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.35, wspace=0.3)

# 1. Model Comparison
ax1 = fig.add_subplot(gs[0, :])
model_names = [k for k in results.keys() if 'test_rmse' in results[k]]
rmse_vals = [results[k]['test_rmse'] for k in model_names]
sorted_idx = np.argsort(rmse_vals)
colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(model_names)))
bars = ax1.bar([model_names[i] for i in sorted_idx], [rmse_vals[i] for i in sorted_idx], 
               color=colors, edgecolor='black')
ax1.set_ylabel('Test RMSE ($)')
ax1.set_title('Model Performance Comparison', fontweight='bold', fontsize=14)
ax1.tick_params(axis='x', rotation=15)
for bar in bars:
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.2,
             f'${bar.get_height():.2f}', ha='center', fontweight='bold', fontsize=9)

# 2. Actual vs Predicted
ax2 = fig.add_subplot(gs[1, :2])
ax2.plot(predictions_df.index, predictions_df['Actual'], label='Actual', color='black', linewidth=1.5)
ax2.plot(predictions_df.index, predictions_df['Predicted'], label='Predicted', color='red', linewidth=1.5, alpha=0.8)
ax2.set_title(f'Best Model: {best_model_name}', fontweight='bold', fontsize=12)
ax2.set_ylabel('Price ($)')
ax2.legend()

# 3. Error Distribution
ax3 = fig.add_subplot(gs[1, 2])
ax3.hist(predictions_df['Error'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
ax3.axvline(x=0, color='red', linestyle='--', linewidth=2)
ax3.set_title('Error Distribution', fontweight='bold', fontsize=12)
ax3.set_xlabel('Error ($)')

# 4. Top Features
ax4 = fig.add_subplot(gs[2, 0])
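top_10 = importance_df.head(10)
# Size the colormap from the actual row count so it still works with fewer than 10 features
ax4.barh(top_10['Feature'][::-1], top_10['Importance'][::-1], 
         color=plt.cm.viridis(np.linspace(0.3, 0.9, len(top_10)))[::-1], edgecolor='black')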
ax4.set_xlabel('Importance')
ax4.set_title('Top 10 Features', fontweight='bold', fontsize=12)

# 5. Metrics Summary
ax5 = fig.add_subplot(gs[2, 1])
ax5.axis('off')
metrics_text = f"""
BEST MODEL: {best_model_name}

Test Metrics:
  RMSE:  ${best_result['test_rmse']:.2f}
  R2:    {best_result['test_r2']:.4f}
  MAPE:  {best_result['test_mape']:.2f}%

Dataset:
  Stock: {TICKER}
  Period: {START_DATE} to {END_DATE}
  Train: {len(X_train)} days
  Test: {len(X_test)} days
  Features: {len(feature_cols)}
"""
ax5.text(0.1, 0.9, metrics_text, transform=ax5.transAxes, fontsize=10,
         verticalalignment='top', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.5))

# 6. Full Price History
ax6 = fig.add_subplot(gs[2, 2])
ax6.plot(df.index, df['Close'], color='steelblue', linewidth=1)
ax6.fill_between(df.index, df['Close'], alpha=0.3)
ax6.axvline(x=X_test.index[0], color='red', linestyle='--', label='Split')
ax6.set_title(f'{TICKER} Full History', fontweight='bold', fontsize=12)
ax6.set_ylabel('Price ($)')
ax6.legend()

plt.suptitle('STOCK PRICE PREDICTOR - SUMMARY DASHBOARD', fontweight='bold', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

---

## Key Takeaways

### 1. Time Series is Different

| Aspect | What We Learned |
|--------|----------------|
| **Data Split** | Must be chronological, not random |
| **Cross-Validation** | Use TimeSeriesSplit, not K-Fold |
| **Feature Engineering** | Technical indicators encode domain knowledge |
| **Lag Features** | Past values are crucial predictors |
| **Seasonality** | Day-of-week and monthly patterns may exist; check them in the data rather than assuming them |
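
The split and cross-validation rows in the table above can be illustrated with a minimal sketch. It uses a synthetic, already-sorted array rather than the notebook's data, and names such as `X_demo` and `split_at` are placeholders introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for a chronologically ordered feature matrix.
# Assumption: rows are sorted oldest-to-newest, as in a date-indexed DataFrame.
n_samples = 100
X_demo = np.arange(n_samples).reshape(-1, 1)

# Chronological hold-out: the most recent 20% of rows become the test set.
split_at = int(n_samples * 0.8)
train_idx = np.arange(split_at)
test_idx = np.arange(split_at, n_samples)

# Expanding-window cross-validation, applied to the training portion only.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo[:split_at]))

for fold, (fit_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: fit rows 0-{fit_idx[-1]}, "
          f"validate rows {val_idx[0]}-{val_idx[-1]}")
```

In every fold the validation rows come strictly after the fitting rows, which is the property a shuffled K-Fold would break.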

### 2. Model Selection Insights

| Finding | Explanation |
|---------|-------------|
| **Tree models work well** | Capture non-linear patterns without assuming a functional form |
| **Linear models are baseline** | Good for comparison, but may miss complex patterns |
| **Tuning helps** | Hyperparameter tuning can improve performance; compare tuned and untuned test RMSE to see by how much |

### 3. Feature Importance

| Feature Type | Why Important |
|-------------|---------------|
| **Lag features** | Yesterday's price strongly predicts today's |
| **Technical indicators** | Encode trader psychology and patterns |
| **Volume features** | Can help confirm trends and flag possible reversals |
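
As a rough illustration of the lag-feature row above, the sketch below builds lagged and rolling columns with `shift` so that each row only uses information from earlier days. The data is a synthetic random walk, and the column names (`Close_Lag_1`, `SMA_5_prev`, and so on) are hypothetical and may differ from those used earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices. Assumption: a random walk standing in for real data.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2023-01-02", periods=60)
demo = pd.DataFrame(
    {"Close": 100 + rng.normal(0, 1, len(dates)).cumsum()},
    index=dates,
)

# Lag features: shift(k) moves values down k rows, so each row sees only the past.
for lag in (1, 2, 5):
    demo[f"Close_Lag_{lag}"] = demo["Close"].shift(lag)  # hypothetical names

# Rolling mean on shifted data, so today's close is excluded from its own feature.
demo["SMA_5_prev"] = demo["Close"].shift(1).rolling(5).mean()

# Drop the warm-up rows that lack a complete history.
demo = demo.dropna()
print(demo.head())
```

Shifting before computing the rolling statistic is one way to keep the target day's own price out of its features, which is the leakage the time-based split is meant to avoid.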

### 4. Limitations & Disclaimers

**This is for EDUCATIONAL purposes only!**

- Past performance does NOT guarantee future results
- Models cannot predict black swan events
- Markets are influenced by news, sentiment, and unpredictable factors
- Real trading requires risk management, not just prediction

---

## Checklist

- [x] Downloaded stock data from Yahoo Finance
- [x] Understood OHLCV data structure
- [x] Explored price trends and returns
- [x] Analyzed seasonality (day-of-week, monthly patterns)
- [x] Performed time series decomposition (trend, seasonal, residual)
- [x] Created technical indicators (SMA, EMA, RSI, MACD, Bollinger Bands)
- [x] Engineered lag features
- [x] Used time-based train-test split (no data leakage!)
- [x] Trained multiple regression models
- [x] Explained why each model was chosen
- [x] Compared using RMSE, MAE, R², MAPE
- [x] Performed hyperparameter tuning
- [x] Analyzed feature importance
- [x] Visualized predictions
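
For reference, the four metrics named in the checklist can be computed as in the following minimal sketch. The `y_true` and `y_pred` arrays are made-up toy values, not results from this notebook, and MAPE is written out by hand here as an assumption about how it was defined.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not results from this notebook).
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([101.0, 101.0, 103.0, 104.0])

# RMSE: square root of the mean squared error, in the same units as the price.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# MAE: mean absolute error, less sensitive to large misses than RMSE.
mae = mean_absolute_error(y_true, y_pred)

# R²: share of variance explained relative to predicting the mean.
r2 = r2_score(y_true, y_pred)

# MAPE: mean absolute percentage error (assumes no zero values in y_true).
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}  MAPE: {mape:.2f}%")
```

RMSE and MAE are in dollars, while R² and MAPE are scale-free, which is why the notebook reports them side by side.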

---

**End of Stock Price Predictor Tutorial**

In [None]:
# Final summary
print("="*70)
print("STOCK PRICE PREDICTOR - FINAL SUMMARY")
print("="*70)

print("\n📊 DATASET")
print(f"   Stock: {TICKER}")
print(f"   Period: {START_DATE} to {END_DATE}")
print(f"   Trading Days: {len(df)}")
print(f"   Features Created: {len(feature_cols)}")

print("\n🏆 BEST MODEL")
print(f"   Name: {best_model_name}")
print(f"   Test RMSE: ${best_result['test_rmse']:.2f}")
print(f"   Test R²: {best_result['test_r2']:.4f}")
print(f"   Test MAPE: {best_result['test_mape']:.2f}%")

print("\n📈 ALL MODEL RMSE (Test)")
for name in sorted(results.keys(), key=lambda x: results[x].get('test_rmse', float('inf'))):
    if 'test_rmse' in results[name]:
        print(f"   {name}: ${results[name]['test_rmse']:.2f}")

print("\n🔑 TOP 5 IMPORTANT FEATURES")
for _, row in importance_df.head(5).iterrows():
    print(f"   {row['Feature']}: {row['Importance']*100:.1f}%")

print("\n" + "="*70)
print("PROJECT COMPLETE!")
print("="*70)
print("\n⚠️  DISCLAIMER: This is for EDUCATIONAL purposes only.")
print("    Stock markets are unpredictable.")
print("    Do NOT use this for real trading decisions!")