
In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


### 1) Motivation

In this project, I use supervised machine learning to predict whether the price of Bitcoin will increase or decrease in the next hour based on historical market data. This is a binary classification problem where the target variable indicates whether the next hour’s closing price is higher than the current hour’s closing price. The data is Yahoo Finance's hourly Bitcoin price data (ticker BTC-USD) from December 20, 2023, to the present. The dataset comes with the standard price and volume columns, and I construct additional features using only information available up to the current hour, while the target represents a future outcome that is unknown at prediction time.

This is a useful machine learning problem because short-term cryptocurrency price movements are noisy and fast-moving, which makes them hard to predict. They are also influenced by complex interactions between momentum, volatility, and market behavior. While people can analyze charts or technical indicators manually, doing so consistently at an hourly frequency is time-consuming and strenuous, and it becomes difficult to process many signals at once. Supervised learning models can evaluate these patterns systematically and detect subtle relationships across multiple features that are hard to track manually.

Using machine learning can also save time and money compared to manual analysis. Once trained, a model can generate predictions quickly and consistently without requiring constant human intervention. This makes it especially useful for monitoring markets in real time or supporting automated decision-making systems. I think this task fits the basic description of supervised learning: using past data to make informed predictions about future outcomes. Even though cryptocurrency markets are challenging to forecast and are influenced by real-world factors beyond past prices, predicting short-term price direction from past data is a realistic way to apply supervised machine learning in this setting.

In [None]:
!pip install yfinance --quiet

import yfinance as yf

In [None]:
btc = yf.download(
    tickers="BTC-USD",
    interval="60m",
    start="2023-12-20",
    progress=False
)

btc.head()

In [None]:
btc.info()

In [None]:
btc.columns

In [None]:
btc.index

In [None]:
btc.to_csv("/kaggle/working/btc_hourly.csv", index=False)

Problems with the initial data:

The exported CSV did not include a datetime column, so there was no clear indication of when each hourly observation occurred. The timestamps live in the DataFrame index, and saving with `index=False` drops them. Since the entire goal of this project is to predict the next hour’s price movement using past information, having an accurate timestamp is essential.

The exported CSV also contained an extra first row with the value “BTC-USD” in every column. This row most likely comes from the second (ticker) level of the two-level column header that yfinance returns. It caused all columns to be read as strings instead of numeric values, which led to repeated type conversion issues when performing calculations.
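
Both problems can be reproduced without downloading anything. The sketch below is only an illustration: it assumes the frame returned by `yf.download` has two-level columns (a price-field level and a ticker level), and the frame `toy` and its values are made up.

```python
import io
import pandas as pd

# Toy stand-in for the download result: datetime index, two-level columns.
idx = pd.DatetimeIndex(
    ["2023-12-20 00:00", "2023-12-20 01:00", "2023-12-20 02:00"], name="Datetime"
)
cols = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["BTC-USD"]], names=["Price", "Ticker"]
)
toy = pd.DataFrame(
    [[42000.0, 10.0], [42100.0, 12.0], [41950.0, 9.0]], index=idx, columns=cols
)

# index=False drops the timestamps; the ticker level is written as a second header row.
buf = io.StringIO()
toy.to_csv(buf, index=False)

# A plain read treats that second header row as data, so every column turns into text.
reloaded = pd.read_csv(io.StringIO(buf.getvalue()))
print(reloaded)
print(reloaded.dtypes)
```

Flattening the columns and moving the index into a regular column before saving, as the next cell does, avoids both issues.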

In [None]:
#flatten columns: keep the price-field level (Close, High, ...) rather than the ticker level
#(assumes level 0 holds the field names, as in yfinance's default column layout)
if isinstance(btc.columns, pd.MultiIndex):
    btc.columns = btc.columns.get_level_values(0)

#create copy of desired columns (selected by name, so column order does not matter)
btc = btc[['Close', 'High', 'Low', 'Open', 'Volume']].copy()

#turn datetime into an actual column
btc.index.name = "datetime"
btc = btc.reset_index()

#make sure columns are numeric
num_cols = ['Close', 'High', 'Low', 'Open', 'Volume']
btc[num_cols] = btc[num_cols].astype(float)

#save clean version
btc.to_csv("btc_hourly_clean.csv", index=False)

In [None]:
btc.head()

In [None]:
btc.info()

In [None]:
#make sure datetime is in the right format
btc['datetime'] = pd.to_datetime(btc['datetime'])

#sort in chronological order
btc = btc.sort_values('datetime').reset_index(drop=True)

#check to make sure it worked
btc.info()
btc.head()

Here I am making sure the datetime column is actually usable. I convert it to the proper pandas datetime format to avoid issues later. Then I sort by datetime so the rows are guaranteed to be in chronological order, which the lagged and rolling features depend on. Resetting the index just cleans up the row numbers after sorting.

In [None]:
#hourly return based on the closing price
btc['return_1h'] = btc['Close'].pct_change()

#label whether the next hour's price goes up (1) or not (0)
btc['target'] = (btc['return_1h'].shift(-1) > 0).astype(int)

#drop the last row: it has no next hour, so its label is not meaningful
btc = btc.iloc[:-1].copy()

#check alignment and class balance
display(btc[['datetime', 'Close', 'return_1h', 'target']].head())
btc['target'].value_counts(normalize=True)

Here I am creating the target variable.

I first calculate the hourly return using the percentage change in the closing price, which tells me how much the price moved from one hour to the next. Then I use `shift(-1)` to pull the next hour's return onto the current row so that, for each row, the target reflects what happens in the next hour rather than the current one. If the next hour’s return is positive, I label it as 1, and otherwise (including a flat hour) I label it as 0, which gives me a clear up or down classification target. I remove the last row because there is no next hour available for it; its missing next-hour return would otherwise be silently labeled 0. Finally, I do a quick check to make sure the target lines up with the prices and to see how balanced the classes are, and I can see that the classes are basically balanced, which is a good indicator.
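
To make the alignment concrete, here is a small worked example on made-up prices. The frame `toy` and the helper column `next_return` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Five made-up hourly closes: up, down, flat, up.
toy = pd.DataFrame({"Close": [100.0, 101.0, 100.5, 100.5, 102.0]})

toy["return_1h"] = toy["Close"].pct_change()

# shift(-1) moves each return one row earlier, so row t holds the return of hour t+1.
toy["next_return"] = toy["return_1h"].shift(-1)
toy["target"] = (toy["next_return"] > 0).astype(int)

# The last row's next_return is NaN, and NaN > 0 evaluates to False (label 0),
# which is why that row is dropped explicitly.
toy = toy.iloc[:-1]
print(toy)
```

The flat hour (100.5 to 100.5) ends up in class 0, so class 0 means "not up" rather than strictly "down".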

In [None]:
# feature engineering

#lagged returns
btc['lag_return_1h'] = btc['return_1h'].shift(1)
btc['lag_return_2h'] = btc['return_1h'].shift(2)
btc['lag_return_6h'] = btc['return_1h'].shift(6)

#rolling averages
btc['rolling_mean_6h'] = btc['Close'].rolling(window=6).mean()
btc['rolling_mean_24h'] = btc['Close'].rolling(window=24).mean()

#rolling volatility
btc['rolling_std_6h'] = btc['Close'].rolling(window=6).std()
btc['rolling_std_24h'] = btc['Close'].rolling(window=24).std()

#time-of-day and day-of-week
btc['hour'] = btc['datetime'].dt.hour
btc['day_of_week'] = btc['datetime'].dt.dayofweek

#drop rows with missing values
btc = btc.dropna().reset_index(drop=True)

btc.head()

Here I am creating the main features that the models will actually learn from.

I start by adding lagged returns so the model can see how the price has been behaving over the last few hours, which helps capture short term momentum or reversals using only past information.

Then I add rolling averages of the closing price over 6 and 24 hours to give the model a sense of short term trends and whether the current price is relatively high or low compared to recent history. I also include rolling volatility over the same windows, which tells the model how stable or choppy the market has been, since price direction can behave differently in high versus low volatility periods.

I also extract the hour of the day and day of the week from the datetime column to allow the model to pick up on any recurring time based patterns in trading behavior.

Lastly, I drop the rows at the start of the dataset where the lagged and rolling features cannot be computed because they require past data, so that the data going into the models is clean and complete.
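
One way to check the claim that these features use only past information is to change the future and confirm the features so far do not move. The sketch below does this on simulated prices; `build_features` is a hypothetical helper that mirrors a few of the features above, not the notebook's actual code path.

```python
import numpy as np
import pandas as pd

def build_features(close: pd.Series) -> pd.DataFrame:
    """Hypothetical helper mirroring some of the lag and rolling features."""
    ret = close.pct_change()
    return pd.DataFrame({
        "return_1h": ret,
        "lag_return_1h": ret.shift(1),
        "rolling_mean_6h": close.rolling(window=6).mean(),
        "rolling_std_6h": close.rolling(window=6).std(),
    })

# Simulated random-walk prices, for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, size=50).cumsum())

t = 30
tampered = close.copy()
tampered.iloc[t + 1:] += 1000.0  # alter only the rows after t

before = build_features(close).iloc[: t + 1]
after = build_features(tampered).iloc[: t + 1]

# If no feature looks ahead, rows 0..t are identical in both versions.
no_leak = np.allclose(before.to_numpy(), after.to_numpy(), equal_nan=True)
print("features up to row t unaffected by future prices:", no_leak)
```

A feature built with a negative shift or a centered rolling window would fail this check.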

In [None]:
#exponential moving averages
btc['ema_12'] = btc['Close'].ewm(span=12, adjust=False).mean()
btc['ema_26'] = btc['Close'].ewm(span=26, adjust=False).mean()

#MACD = ema_12 - ema_26
btc['macd'] = btc['ema_12'] - btc['ema_26']

#relative strength index
delta = btc['Close'].diff()
gain = delta.where(delta > 0, 0)
loss = -delta.where(delta < 0, 0)

avg_gain = gain.rolling(window=14).mean()
avg_loss = loss.rolling(window=14).mean()

rs = avg_gain / avg_loss
btc['rsi_14'] = 100 - (100 / (1 + rs))

#drop rows with missing values
btc = btc.dropna().reset_index(drop=True)

#look at the new columns
btc[['ema_12','ema_26','macd','rsi_14']].head()


Here I am adding a set of technical indicators that are commonly used in trading to summarize trend and momentum in a more structured way.

Exponential moving averages smooth the price by repeatedly averaging it over time while giving more weight to the most recent closing prices. I did this using ewm() on the closing price, which updates the average each hour while gradually discounting older prices. The 12 hour EMA reacts more quickly to new price changes, while the 26 hour EMA changes more slowly, so using both allows the model to compare short term price movement to the longer term trend.

The MACD is the difference between these two EMAs, and it is widely used as a momentum indicator, where larger positive values suggest upward momentum and negative values suggest downward momentum.

The relative strength index measures whether recent price movements have been driven more by upward moves or downward moves over a fixed window. I did this by separating positive and negative price changes, averaging them over the past 14 hours, and combining them into a single score between 0 and 100. Higher values indicate stronger recent buying pressure, while lower values indicate stronger selling pressure, which helps summarize short term market momentum in a way the model can use.

These indicators are some of the things that people who trade for a living look at, and I wanted to include them in the hope that the model can learn from them the way traders do. I want my model to learn similar patterns directly from the data instead of relying on fixed rules or thresholds.

Like the lagged returns and rolling features, these indicators look back several hours. The 14-hour RSI window in particular leaves missing values in the first few rows, so I drop them to keep the dataset clean before modeling.
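
The RSI arithmetic is easier to follow on a tiny series. The sketch below wraps the same steps in a hypothetical helper, `rsi_sma`, and uses a 3-hour window on made-up prices so the numbers can be checked by hand. It uses simple rolling means like the cell above; Wilder's original RSI smooths the averages exponentially instead, so values from charting tools may differ.

```python
import pandas as pd

def rsi_sma(close: pd.Series, window: int) -> pd.Series:
    """Hypothetical helper: RSI built from simple rolling means of gains and losses."""
    delta = close.diff()
    gain = delta.where(delta > 0, 0.0)    # upward moves, else 0
    loss = -delta.where(delta < 0, 0.0)   # downward moves as positive numbers, else 0
    rs = gain.rolling(window=window).mean() / loss.rolling(window=window).mean()
    return 100 - (100 / (1 + rs))

close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])
rsi = rsi_sma(close, window=3)
print(rsi.round(2).tolist())
```

By hand, the last window has gains (1, 0, 2) and losses (0, 1, 0), so RS = 1 / (1/3) = 3 and RSI = 100 - 100/4 = 75. The window before it gives RS = 2 and an RSI of about 66.67, and a window with no losses at all gives an RSI of 100.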

In [None]:
#final set of features the model will use
feature_cols = [
    'Close', 'Volume', 'return_1h', 'lag_return_1h', 'lag_return_2h', 'lag_return_6h', 'rolling_mean_6h', 'rolling_mean_24h', 'rolling_std_6h',
    'rolling_std_24h', 'hour', 'day_of_week', 'ema_12', 'ema_26', 'macd', 'rsi_14'
]

#feature matrix X and target vector y
X = btc[feature_cols].copy()
y = btc['target'].copy()

#check shapes and class balance
X.shape, y.shape, y.value_counts(normalize=True)

Here I am formally defining the inputs and outputs for the supervised learning models that I will implement later.

I list out all the engineered features I want the models to learn from, and then use this list to build the feature matrix X, which contains only those columns, and set y equal to the target variable I created earlier.

Finally, I do a quick check of the shapes and class proportions to make sure the row counts line up correctly and that the target is still roughly balanced before moving on.

In [None]:
#time based train-test-split

#first 80% = train, last 20% = test
split_index = int(len(X) * 0.8)

X_train = X.iloc[:split_index].copy()
X_test  = X.iloc[split_index:].copy()

y_train = y.iloc[:split_index].copy()
y_test  = y.iloc[split_index:].copy()

#check split sizes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Here I split the data into training and test sets by time.

I take the first 80 percent of the observations as the training set and reserve the last 20 percent as the test set so the model is always trained on past data and evaluated on future data. I specifically did this because randomly splitting time series data would leak future information into training and give unrealistically good results.

I then split both the feature matrix and the target variable using the same cutoff point and do a quick shape check to make sure everything lines up correctly before moving on to model training.
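
The property that matters is that every training timestamp comes before every test timestamp. A minimal sketch of that check on a made-up frame, where `chronological_split` is a hypothetical helper that repeats the 80/20 cut:

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_frac: float = 0.8):
    """Hypothetical helper: first train_frac of the rows for training, the rest for testing."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut].copy(), frame.iloc[cut:].copy()

# Ten made-up hourly rows, already sorted by time.
toy = pd.DataFrame({
    "datetime": pd.Timestamp("2024-01-01") + pd.to_timedelta(list(range(10)), unit="h"),
    "x": list(range(10)),
})

train, test = chronological_split(toy)

# No training row may be later than the earliest test row.
assert train["datetime"].max() < test["datetime"].min()
print(len(train), "training rows, ending at", train["datetime"].max())
print(len(test), "test rows, starting at", test["datetime"].min())
```

This only holds if the frame is sorted by time first, which is why the sort step earlier in the notebook matters.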

In [None]:
# EDA

print("Final btc shape:", btc.shape)
print("X shape:", X.shape, "| y shape:", y.shape)

print("\nData types (features):")
display(X.dtypes.value_counts())

print("\nMissing values (top 10):")
display(btc.isna().sum().sort_values(ascending=False).head(10))

print("\nHourly return sanity check:")
display(btc['return_1h'].describe())

print("\nHourly return extreme quantiles:")
display(btc['return_1h'].quantile([0.001, 0.01, 0.5, 0.99, 0.999]))

In [None]:
# EDA Visuals

import matplotlib.pyplot as plt
import seaborn as sns

#price history over time plot
plt.figure(figsize=(12, 4))
plt.plot(btc['datetime'], btc['Close'])
plt.title("BTC Hourly Closing Price Over Time")
plt.xlabel("Date")
plt.ylabel("Close Price (USD)")
plt.tight_layout()
plt.show()

#distribution of hourly returns plot
plt.figure(figsize=(6, 4))
sns.histplot(btc['return_1h'], bins=50, kde=True)
plt.title("Distribution of Hourly Returns")
plt.xlabel("Hourly Return")
plt.tight_layout()
plt.show()

#correlation heatmap of features plot
plt.figure(figsize=(14, 10))
corr = btc[feature_cols].corr()
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title("Correlation Heatmap of Features")
plt.tight_layout()
plt.show()

### 2) Exploratory Data Analysis

After cleaning and feature engineering, the dataset contains over seventeen thousand hourly observations and sixteen feature columns, along with a binary target variable. All feature columns are numeric, including price, volume, lagged returns, rolling statistics, technical indicators, and time-based variables such as hour of day and day of week. The target variable is also numeric and represents whether the next hour’s closing price increases or decreases.

Because this is a classification problem, I first check whether the target variable is balanced. The distribution shows that the classes are nearly evenly split, with slightly more upward movements than downward ones, which means there is no strong class imbalance that would bias the models toward one outcome. I also check for missing values throughout the dataset. Missing values appear only as a result of lagged features, rolling windows, and technical indicators that require historical data. These rows are dropped so that the final dataset used for modeling is complete and consistent.

To better understand the structure of the data, I examine several visual summaries. The price history plot shows that Bitcoin’s hourly closing price exhibits strong trends, sharp reversals, and periods of high volatility rather than stable behavior. This highlights why short-term prediction must rely on recent market conditions instead of long-term averages.

The distribution of hourly returns is centered close to zero with heavy tails, meaning most price changes are small but occasional large jumps occur in both directions. This confirms that the prediction task is noisy and not so straightforward.

Finally, the correlation heatmap shows that some features derived from similar ideas, such as rolling averages and exponential moving averages, are correlated, while many other features capture different aspects of market behavior. What I took this to mean is that the feature set contains a mix of complementary signals rather than purely redundant information, which makes it a good basis for training my models.

### 3) Evaluation Metric

Because the target variable is a binary classification outcome indicating whether Bitcoin's price will increase in the next hour, I evaluate model performance using the F1 score. This metric is appropriate because it balances precision and recall, treating false positives and false negatives symmetrically.

Although the target distribution is nearly perfectly balanced (approximately 50.9% up vs. 49.1% down), hourly crypto returns are noisy and small directional changes can be difficult to predict, making F1 a better measure of meaningful classification performance than accuracy alone.

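Because F1 penalizes both false positives and false negatives, it rewards a model for identifying upward movements correctly, which suits my goal of detecting real directional shifts. One caveat: as computed here, F1 scores only the positive (up) class, so a classifier that always predicts "up" would still reach an F1 of roughly 0.67 at this class balance. That value is the baseline a useful model has to beat.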
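
As a reference point for reading F1 values on a target this balanced, the sketch below scores a classifier that always predicts "up". The labels are simulated to match the reported class balance of about 50.9% up; they are not the notebook's data. Macro-averaged F1, which averages the scores of both classes, is shown only for contrast.

```python
import numpy as np
from sklearn.metrics import f1_score

# Simulated labels with roughly 50.9% ups, for illustration only.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.509).astype(int)
always_up = np.ones_like(y_true)

# Default F1 scores the positive class only: recall is 1, precision equals the share of ups.
f1_up_class = f1_score(y_true, always_up)

# Macro F1 also scores the "down" class, which this classifier never predicts.
f1_macro = f1_score(y_true, always_up, average="macro", zero_division=0)

print(f"always-up F1 (positive class): {f1_up_class:.3f}")
print(f"always-up macro F1:            {f1_macro:.3f}")
```

With precision p and recall 1, F1 equals 2p / (1 + p), which is about 0.67 for p near 0.51. A test-set F1 near that value can therefore come from a model that predicts "up" almost every hour.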


### 4) Model Fitting

To evaluate different modeling approaches, I fit three supervised learning models.

I chose logistic regression with regularization because it is a simple and interpretable baseline model.

Random forest is included as a flexible, tree-based ensemble method that can capture nonlinear relationships and interactions better than my first model.

Finally, I use a support vector machine with a nonlinear kernel to model more complex decision boundaries in a high-dimensional feature space.

Using these three models allows for a meaningful comparison across different levels of model complexity while following the assignment requirement of drawing from distinct methodological chapters.

All models are trained using the same time-based training set and tuned using cross-validation with TimeSeriesSplit to ensure that validation data always comes after training data in time. Hyperparameters are selected using cross-validation based on the F1 score, and the final tuned models are evaluated on a held-out test set.
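
To see what TimeSeriesSplit does, the sketch below prints the fold indices for twelve made-up observations. The names `X_toy` and `tscv_toy` are only for this illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve made-up observations in time order.
X_toy = np.arange(12).reshape(-1, 1)
tscv_toy = TimeSeriesSplit(n_splits=5)

folds = list(tscv_toy.split(X_toy))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # The training window grows; the validation block always comes right after it.
    print(f"fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Each fold trains on an expanding window of earlier rows and validates on the block that follows, so the early folds train on much less data than the later ones.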

In [None]:
#Logistic Regression with L2 regularization

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.pipeline import Pipeline

tscv = TimeSeriesSplit(n_splits=5)

log_reg_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=500))
])

param_grid = {
    'logreg__C': [0.001, 0.01, 0.1, 1, 10]
}

log_reg_cv = GridSearchCV(
    estimator=log_reg_pipe,
    param_grid=param_grid,
    cv=tscv,
    scoring='f1',
    n_jobs=-1
)

log_reg_cv.fit(X_train, y_train)

best_logreg = log_reg_cv.best_estimator_
best_logreg_C = log_reg_cv.best_params_['logreg__C']

I use logistic regression as a baseline model because it provides a clear and interpretable starting point for a binary classification problem. Since the features operate on very different scales, I standardize them so the model and regularization behave correctly. I include L2 regularization to reduce overfitting in noisy hourly price data by shrinking coefficients rather than allowing the model to fit small fluctuations too closely. The regularization strength is tuned over a range of values to balance bias and variance. TimeSeriesSplit is used during cross-validation to avoid leaking future information, and the F1 score is used as the selection metric, consistent with the evaluation metric chosen above.
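
The effect of `C` on the coefficients can be seen directly. The sketch below fits the same kind of pipeline on synthetic data from `make_classification` (not the Bitcoin features) and records the size of the coefficient vector for a few values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

coef_norms = {}
for C in [0.001, 0.1, 10]:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("logreg", LogisticRegression(C=C, max_iter=500)),
    ])
    pipe.fit(X_toy, y_toy)
    # Smaller C means a stronger L2 penalty, which pulls the coefficients toward zero.
    coef_norms[C] = float(np.linalg.norm(pipe.named_steps["logreg"].coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

In the notebook, the same `named_steps["logreg"].coef_` lookup on `best_logreg` would give the fitted coefficients, which are comparable across features because the inputs are standardized.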

In [None]:
#Random Forest Classifier with time-based CV

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

param_grid_rf = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5]
}

rf_cv = GridSearchCV(
    estimator=rf,
    param_grid=param_grid_rf,
    cv=tscv,
    scoring='f1',
    n_jobs=-1
)

rf_cv.fit(X_train, y_train)

best_rf = rf_cv.best_estimator_
best_rf_params = rf_cv.best_params_

I use a random forest model to allow for nonlinear relationships and interactions between features that a linear model cannot capture. Random forests combine many decision trees to reduce variance and improve stability compared to a single tree. I tune the number of trees to balance predictive stability and computational cost, and I tune tree depth and minimum leaf size to control overfitting, which I expected to matter in noisy high-frequency financial data like mine. Cross-validation is again done using TimeSeriesSplit, and the F1 score is used for tuning so the model is evaluated consistently with the other approaches.
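
A quick way to see why depth needs tuning is to fit a forest on labels that have no relationship to the features at all. The sketch below uses pure noise, not the Bitcoin data, and compares training F1 with F1 on later rows for an unrestricted and a shallow forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Pure noise: the labels are independent of the features.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(600, 5))
y_noise = rng.integers(0, 2, size=600)

# Earlier rows for training, later rows for validation.
X_tr, X_va = X_noise[:450], X_noise[450:]
y_tr, y_va = y_noise[:450], y_noise[450:]

gaps = {}
for depth in [None, 3]:
    rf_toy = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    ).fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, rf_toy.predict(X_tr))
    val_f1 = f1_score(y_va, rf_toy.predict(X_va))
    gaps[depth] = train_f1 - val_f1
    print(f"max_depth={depth}: train F1={train_f1:.2f}, validation F1={val_f1:.2f}")
```

The unrestricted forest memorizes the training rows even though there is nothing to learn, so a large gap between training and validation scores is a sign of fitting noise.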

In [None]:
#SVM

from sklearn.svm import SVC

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

param_grid_svm = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['rbf'],
    'svm__gamma': ['scale', 0.01, 0.001]
}

svm_cv = GridSearchCV(
    estimator=svm_pipe,
    param_grid=param_grid_svm,
    cv=tscv,
    scoring='f1',
    n_jobs=-1
)

svm_cv.fit(X_train, y_train)

best_svm = svm_cv.best_estimator_
best_svm_params = svm_cv.best_params_

I include a support vector machine with a radial basis function kernel because it is well suited for high-dimensional data and can model complex, nonlinear decision boundaries. Since SVMs are sensitive to feature scale, I standardize the inputs using a pipeline to ensure stable optimization. The regularization parameter is tuned to control how strictly the model fits the training data, while the kernel parameter is tuned to adjust how locally the model responds to individual observations. As with the other models, time-based cross-validation is used to prevent leakage, and the F1 score is used to select the final model configuration. This model provides a contrast to both the linear baseline and the tree-based ensemble in terms of flexibility and generalization.

### 5) Model Comparison

In [None]:
from sklearn.metrics import f1_score

#Logistic Regression
y_pred_logreg = best_logreg.predict(X_test)
f1_logreg = f1_score(y_test, y_pred_logreg)

#Random Forest
y_pred_rf = best_rf.predict(X_test)
f1_rf = f1_score(y_test, y_pred_rf)

#SVM
y_pred_svm = best_svm.predict(X_test)
f1_svm = f1_score(y_test, y_pred_svm)

#results
model_scores = {
    "Logistic Regression (L2)": f1_logreg,
    "Random Forest": f1_rf,
    "SVM (RBF Kernel)": f1_svm
}

model_scores

The three models show clear differences in performance and behavior when evaluated on the held-out test set. The support vector machine achieves the highest F1 score, followed by logistic regression, while the random forest performs the worst.

Logistic regression has the highest bias and lowest variance among the three models, which limits its flexibility but makes it stable and easy to interpret. Its performance suggests that linear relationships capture some useful signal in the data, but not enough to fully model the complexity of short-term Bitcoin price movements.

Random forest is much more flexible and has lower bias, but this flexibility comes with higher variance, and in this setting it appears to overfit noise in the training data, leading to weaker generalization on the test set.

The support vector machine strikes the best balance between bias and variance by allowing nonlinear decision boundaries while still enforcing strong regularization, which helps it generalize better than the other models.

In terms of interpretability, logistic regression is the easiest to understand, random forest offers some interpretability through feature importance but is harder to analyze directly, and the SVM is the least interpretable.

Overall, the results suggest that moderate flexibility with strong regularization is most effective for this task, which is consistent with the SVM performing best on out-of-sample data.
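
When comparing the models, it can help to look at F1 next to a few numbers that show whether a model is mostly predicting one direction. The sketch below uses a hypothetical helper, `direction_summary`, and made-up labels and predictions rather than the notebook's results.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def direction_summary(y_true, y_pred):
    """Hypothetical helper: F1 alongside checks for one-sided predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "f1": float(f1_score(y_true, y_pred)),
        # F1 of a classifier that always predicts "up" on the same labels.
        "f1_always_up": float(f1_score(y_true, np.ones_like(y_true))),
        # Mean of the recall on ups and the recall on downs.
        "balanced_accuracy": float(balanced_accuracy_score(y_true, y_pred)),
        "share_predicted_up": float(y_pred.mean()),
    }

# Made-up labels and predictions, for illustration only.
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 1, 1, 1, 1, 0, 1, 1])

summary = direction_summary(y_true_toy, y_pred_toy)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In the notebook, this could be called as `direction_summary(y_test, y_pred_svm)` and likewise for the other two models. A high F1 together with a share of predicted ups close to 1 would point to a model that leans on the majority direction rather than one that separates the classes.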