# Amazon Stock Price Direction Prediction

**DISCLAIMER**: This work is for educational purposes only and does not constitute financial advice. The objective of this project is to explore the application of machine learning models in predicting Amazon stock market behavior and to evaluate their limitations.

**What we're doing:** Predicting whether Amazon's stock will go UP or DOWN on the next trading day

**Why?** If we know the direction, we can:
- Buy at the open if we predict UP
- Not buy (or sell) if we predict DOWN

**The task:** Binary classification
- Target = 1 means tomorrow closes HIGHER than it opens (UP day)
- Target = 0 means tomorrow closes LOWER than it opens (DOWN day)
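
A minimal sketch of how such a target could be built with pandas, assuming a frame sorted by date with `Open` and `Close` columns. The helper name `add_target` and the toy prices are illustrative and do not come from the notebook. Labeling a flat day (close equal to open) as 0 is also an assumption, since the definition above only covers strictly higher and strictly lower closes.

```python
import pandas as pd

def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row with the direction of the NEXT trading day (hypothetical helper)."""
    out = df.copy()
    # shift(-1) pulls the next row's values up onto the current row
    next_open = out["Open"].shift(-1)
    next_close = out["Close"].shift(-1)
    # 1 if the next day closes above its open, else 0 (flat days fall into 0 by assumption)
    out["Target"] = (next_close > next_open).astype(int)
    # The last row has no "tomorrow", so its label is meaningless: drop it
    return out.iloc[:-1]

# Toy prices, invented purely for illustration
toy = pd.DataFrame({
    "Open":  [10.0, 11.0, 12.0, 11.5],
    "Close": [10.5, 11.8, 11.0, 12.0],
})
labeled = add_target(toy)
print(labeled)
```

Because the label on each row describes the following day, any feature on that row uses only information available before the predicted day opens.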

**Success criteria:** Achieve an AUC score above 0.515, a small margin over the 0.5 expected from random guessing
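
To make the 0.5 baseline concrete, here is a small sketch using scikit-learn's `roc_auc_score` on invented labels and scores (none of these numbers come from the project). A score that carries no ranking information yields an AUC of 0.5, while a score that ranks every UP day above every DOWN day yields 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels: 1 = UP day, 0 = DOWN day
y_true = np.array([0, 0, 1, 1, 0, 1])

# A constant score cannot rank UP days above DOWN days, so AUC is 0.5
auc_constant = roc_auc_score(y_true, np.full(len(y_true), 0.5))

# Scores that place every UP day above every DOWN day give an AUC of 1.0
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7])
auc_perfect = roc_auc_score(y_true, perfect_scores)

print(f"Constant scores: {auc_constant:.3f}")
print(f"Perfect ranking: {auc_perfect:.3f}")
```

A target of 0.515 therefore asks for a ranking only slightly better than chance.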

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Evaluation metrics
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report, confusion_matrix

# Ignore warnings to keep output clean
import warnings
warnings.filterwarnings('ignore')

# Make plots look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

Libraries imported successfully!


We have Amazon stock data from 1997 to 2020 split into three files:
- **Training data** (1997-2016): We use this to teach our model
- **Validation data** (2016-2018): We use this to pick the best model
- **Test data** (2018-2020): We use this for final evaluation (the model has never seen this!)
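
Since the three files are meant to cover consecutive periods, one possible sanity check is to confirm that each split ends before the next begins. The sketch below assumes a `Date` column of datetimes. The helper `check_chronological` and the toy frames are hypothetical and stand in for the real files.

```python
import pandas as pd

def check_chronological(train, val, test, date_col="Date"):
    """Return True if each split ends strictly before the next one starts (hypothetical helper)."""
    train_before_val = train[date_col].max() < val[date_col].min()
    val_before_test = val[date_col].max() < test[date_col].min()
    return bool(train_before_val and val_before_test)

# Toy frames with invented dates, used only to exercise the check
toy_train = pd.DataFrame({"Date": pd.to_datetime(["2015-01-02", "2015-01-05"])})
toy_val = pd.DataFrame({"Date": pd.to_datetime(["2016-06-01", "2016-06-02"])})
toy_test = pd.DataFrame({"Date": pd.to_datetime(["2018-06-01", "2018-06-04"])})

in_order = check_chronological(toy_train, toy_val, toy_test)
out_of_order = check_chronological(toy_val, toy_train, toy_test)
print(in_order, out_of_order)
```

Keeping the splits in time order means the model is always evaluated on days that come after everything it was trained on.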

Each file has the same columns:
- Date
- Open (opening price)
- High (highest price that day)
- Low (lowest price that day)
- Close (closing price)
- Adj Close (adjusted closing price)
- Volume (number of shares traded)

In [3]:
# Load the three CSV files
train_df = pd.read_csv('datasets/AMZN_train.csv', parse_dates=['Date'])
val_df = pd.read_csv('datasets/AMZN_val.csv', parse_dates=['Date'])
test_df = pd.read_csv('datasets/AMZN_test.csv', parse_dates=['Date'])

# Check the sizes
print("Dataset Sizes:")
print(f"  Training:   {len(train_df):,} days")
print(f"  Validation: {len(val_df):,} days")
print(f"  Test:       {len(test_df):,} days")

# Look at the first few rows
print("\nFirst 5 rows of training data:")
train_df.head()

Dataset Sizes:
  Training:   4,781 days
  Validation: 503 days
  Test:       504 days

First 5 rows of training data:


| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 1997-05-15 | 2.4375 | 2.5 | 1.927083 | 1.958333 | 1.958333 | 72156000 |
| 1 | 1997-05-16 | 1.96875 | 1.979167 | 1.708333 | 1.729167 | 1.729167 | 14700000 |
| 2 | 1997-05-19 | 1.760417 | 1.770833 | 1.625 | 1.708333 | 1.708333 | 6106800 |
| 3 | 1997-05-20 | 1.729167 | 1.75 | 1.635417 | 1.635417 | 1.635417 | 5467200 |
| 4 | 1997-05-21 | 1.635417 | 1.645833 | 1.375 | 1.427083 | 1.427083 | 18853200 |
