# **Bitcoin Price Analysis Report**

## **1. [About the Dataset](https://github.com/chu-siang/Bitcoin_Analysis_ML/blob/main/data/raw/bitcoin_raw_data.csv)**

- Dataset link: https://github.com/chu-siang/Bitcoin_Analysis_ML/blob/main/data/raw/bitcoin_raw_data.csv

- The dataset used in this project is a self-collected Bitcoin historical dataset, available as a CSV file on GitHub ([bitcoin_raw_data.csv](https://github.com/chu-siang/Bitcoin_Analysis_ML/blob/main/data/raw/bitcoin_raw_data.csv)).

- The price data was retrieved with a Python script from a public cryptocurrency API:

    - The raw data consists of hourly time-series records of Bitcoin’s price and trading volume.

    - Each record includes a timestamp and basic market information such as open, high, low, close prices, and volume.

    - Technical indicators and target variables were computed from raw data during preprocessing.

    - Features include numeric continuous data (prices, indicators, returns) and binary/categorical data (hour of day, weekend flag).

### **1-1. Dataset Preprocessing Steps:**

  - Converted timestamps to datetime format.

  - Ensured consistent hourly intervals.

  - Data cleaned for missing timestamps and sorted chronologically.

  - Data indexed by date-time.

  - Final dataset includes engineered features and a 24-hour future return target per timestamp.

  - Normalized/scaled features (e.g., scaling for Support Vector Regression).
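
The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the column names `timestamp` and `close`, the sample values, and the choice to forward-fill a single missing hour are all hypothetical.

```python
import pandas as pd

# Hypothetical raw records: unsorted, with one missing hour (02:00).
raw = pd.DataFrame({
    "timestamp": ["2024-01-01 03:00", "2024-01-01 00:00",
                  "2024-01-01 01:00", "2024-01-01 04:00"],
    "close": [42300.0, 42000.0, 42100.0, 42400.0],
})

df = raw.copy()
df["timestamp"] = pd.to_datetime(df["timestamp"])   # strings -> datetime
df = (df.drop_duplicates("timestamp")
        .sort_values("timestamp")                   # chronological order
        .set_index("timestamp"))                    # index by date-time

# Reindex onto a strict hourly grid so gaps become explicit NaN rows.
hourly_index = pd.date_range(df.index.min(), df.index.max(),
                             freq=pd.Timedelta(hours=1))
df = df.reindex(hourly_index)
missing_hours = df.index[df["close"].isna()]

# One possible gap policy (an assumption): forward-fill gaps of one hour.
df["close"] = df["close"].ffill(limit=1)
```

Reindexing before filling keeps the list of missing hours available for inspection, which helps when deciding whether a gap is short enough to fill or should be dropped.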

### **1-2. Dataset Construction and Feature Extraction:**

- **Data Collection:**

  - Bitcoin price and volume data collected hourly via Python script ([Binance API](https://api.binance.com/api/v3/klines)).

  - Data includes OHLCV (Open, High, Low, Close, Volume).

  - Timestamps processed to create additional features:
    
    - hour (0–23) to capture time-of-day patterns.
    
    - is_weekend (1 for Saturday/Sunday, 0 otherwise), accounting for potential weekend trading differences.
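
A minimal sketch of turning an API response into OHLCV rows with the two time features. It assumes each candle arrives as a list whose first six fields are open time (in milliseconds), open, high, low, close, and volume, with prices encoded as strings. The function name `klines_to_frame`, the sample payload, and the use of UTC timestamps are illustrative assumptions, and no network request is made here.

```python
import pandas as pd

# Assumed layout of the first six fields of one kline row.
KLINE_COLUMNS = ["open_time", "open", "high", "low", "close", "volume"]

def klines_to_frame(klines):
    """Convert raw kline rows (lists) to an OHLCV frame with time features."""
    df = pd.DataFrame([row[:6] for row in klines], columns=KLINE_COLUMNS)
    df["timestamp"] = pd.to_datetime(df["open_time"], unit="ms", utc=True)
    df = df.drop(columns="open_time").set_index("timestamp").sort_index()
    df = df.astype(float)                      # string prices -> floats
    df["hour"] = df.index.hour                 # 0-23
    df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)  # Sat=5, Sun=6
    return df

# Hypothetical payload: two hourly candles starting 2024-01-06 00:00 UTC.
sample = [
    [1704499200000, "43900.0", "44000.0", "43800.0", "43950.0", "120.5"],
    [1704502800000, "43950.0", "44100.0", "43900.0", "44050.0", "98.2"],
]
frame = klines_to_frame(sample)
```

Because the weekend flag depends on the time zone, the same candle can fall on a different weekday if local time is used instead of UTC.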

- **Technical Indicators:**

  - Computed from historical price/volume data using rolling-window calculations:

    - Simple Moving Averages (SMA): 

      - 12-hour SMA (sma_12h), average price over past 12 hours, smoothing short-term fluctuations.

    - Exponential Moving Averages (EMA): 

      - 6-hour (ema_6h), 12-hour (ema_12h), 24-hour (ema_24h), emphasizing recent price trends.

    - Relative Strength Index (RSI):

      - 14-period RSI (rsi_14), momentum indicator ranging from 0–100, signaling overbought (>70) or oversold (<30) conditions.

    - Moving Average Convergence Divergence (MACD):
      - Computed from difference between fast EMA (12 periods) and slow EMA (26 periods).

      - Includes macd, macd_signal (9-period EMA of MACD), and optionally macd_hist (MACD histogram), highlighting trend momentum shifts.

    - Bollinger Bands:

      - Calculated from a 20-period moving average (middle band) plus and minus a multiple of the 20-period standard deviation.

      - Features include upper band (bb_upper), lower band (bb_lower), middle band (bb_middle), and bandwidth/standard deviation (bb_std), indicating volatility and extreme price deviations.

    - Volatility Measures:

      - 24-hour volatility (volatility_24h), standard deviation of returns over the past day.
      
      - Volume-related features such as volume_change and 24-hour volume moving average (volume_ma_24h) capturing trading activity trends.

  - Indicators calculated using historical data up to the current timestamp, ensuring validity for predictive modeling.
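
The indicators listed above can be sketched with pandas rolling and exponentially weighted windows. This is an illustration under assumptions, not the project's code: the function name `add_indicators` is hypothetical, the Bollinger multiplier of 2 is the conventional default rather than a value stated here, and the RSI uses simple rolling averages (Wilder's smoothing is a common alternative). The input is a synthetic random walk.

```python
import numpy as np
import pandas as pd

def add_indicators(df):
    """Append backward-looking indicators computed from close and volume."""
    out = df.copy()
    close = out["close"]

    out["sma_12h"] = close.rolling(12).mean()
    for span in (6, 12, 24):
        out[f"ema_{span}h"] = close.ewm(span=span, adjust=False).mean()

    # RSI from rolling mean gain and mean loss over 14 periods.
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # MACD: fast EMA minus slow EMA, plus a 9-period signal line.
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    out["macd"] = ema_fast - ema_slow
    out["macd_signal"] = out["macd"].ewm(span=9, adjust=False).mean()
    out["macd_hist"] = out["macd"] - out["macd_signal"]

    # Bollinger Bands with an assumed multiplier of 2.
    out["bb_middle"] = close.rolling(20).mean()
    out["bb_std"] = close.rolling(20).std()
    out["bb_upper"] = out["bb_middle"] + 2 * out["bb_std"]
    out["bb_lower"] = out["bb_middle"] - 2 * out["bb_std"]

    out["volatility_24h"] = close.pct_change().rolling(24).std()
    out["volume_change"] = out["volume"].pct_change()
    out["volume_ma_24h"] = out["volume"].rolling(24).mean()
    return out

# Synthetic hourly data for demonstration only.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=100, freq=pd.Timedelta(hours=1))
prices = pd.DataFrame({
    "close": 40000 + rng.normal(0, 50, 100).cumsum(),
    "volume": rng.uniform(50, 150, 100),
}, index=idx)
features = add_indicators(prices)
```

Every window here looks backward only, so the value at a given timestamp is unchanged if later rows are removed. Recomputing on a truncated frame and comparing the last row is a simple way to check for look-ahead leakage.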

- **Target Variable (24-hour Return):**

  - Predictive target is the future 24-hour return (return_24h):

    - Calculated as percentage change from current time (t) to 24 hours later (t+24h).

    - Formula:

    $$
    \text{return}_{24h} = \frac{\text{price}_{t+24h} - \text{price}_{t}}{\text{price}_{t}}
    $$

    - Stored as decimal (e.g., 0.01 represents +1%).
  
  - Additional binary target (price_up_24h):

    - 1 if future return positive, 0 otherwise, useful for classification or clustering.

  - future_volatility_24h also computed (realized volatility in the next 24 hours), used only for cluster analysis, not for predictive modeling (to avoid leakage).
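
As a sketch of how these targets could be derived, the snippet below shifts the close price back by 24 rows, which assumes a gap-free hourly index. The function name `add_targets` and the toy price series are hypothetical. The realized-volatility definition used here (standard deviation of the next 24 hourly returns) is one plausible reading of the description above.

```python
import pandas as pd

HORIZON = 24  # rows ahead; equals 24 hours only on a strict hourly grid

def add_targets(df, horizon=HORIZON):
    """Add forward-looking targets and drop rows without a known future."""
    out = df.copy()
    future_close = out["close"].shift(-horizon)          # price at t + 24h
    out["return_24h"] = (future_close - out["close"]) / out["close"]
    out["price_up_24h"] = (out["return_24h"] > 0).astype(int)

    # Std of hourly returns over t+1 .. t+24 (uses future data by design).
    hourly_ret = out["close"].pct_change()
    out["future_volatility_24h"] = (
        hourly_ret.rolling(horizon).std().shift(-horizon)
    )
    # The last `horizon` rows have no future price and are removed.
    return out.dropna(subset=["return_24h"])

# Toy series: price is 100 for one day, then 101 for the next day.
idx = pd.date_range("2024-01-01", periods=48, freq=pd.Timedelta(hours=1))
prices = pd.DataFrame({"close": [100.0] * 24 + [101.0] * 24}, index=idx)
labeled = add_targets(prices)
```

In the toy series every remaining row has a 24-hour return of (101 - 100) / 100 = 0.01, matching the decimal convention described above.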

- **Data Alignment and Scaling:**

  - First 24 hours of data dropped due to indicator calculation requirements.

  - Features standardized/normalized (especially necessary for SVR).

  - Final dataset structured with:

    - Predictors (technical indicators, time features) available at current timestamp (t).

    - Target variable representing 24-hour future returns.
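
The alignment and scaling step might look like the following sketch. The chronological 80/20 split and fitting the scaler on the training rows only are assumptions made for illustration; the data is synthetic and only two of the feature names are used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered dataset.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "sma_12h": 40000 + rng.normal(0, 200, 200),
    "rsi_14": rng.uniform(20, 80, 200),
    "return_24h": rng.normal(0, 0.02, 200),
})
data.iloc[:24, 0] = np.nan   # warm-up rows where an indicator is undefined

clean = data.dropna()        # drops the first 24 rows
X = clean[["sma_12h", "rsi_14"]]
y = clean["return_24h"]

split = int(len(clean) * 0.8)            # chronological split, no shuffling
X_train, X_test = X.iloc[:split], X.iloc[split:]

scaler = StandardScaler().fit(X_train)   # statistics from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training portion alone keeps information about later prices out of the features the model is trained on.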

**Summary:**

- Constructed dataset includes timestamped records with numerous engineered features (price trends, momentum, volatility, time-context) and labeled with next-day returns.

- This feature set is designed for regression modeling and clustering.


## **2. Research Questions**
The project addresses two main research questions:   
- **Predictive Modeling Question:**

  - Can we predict short-term (24-hour ahead) Bitcoin returns using historical technical indicators?

  - This explores if machine learning models can forecast the next day's price change using past data.

- **Clustering/Market Regime Question:**

  - Can we identify meaningful clusters representing different Bitcoin market states?

  - This involves using unsupervised learning to detect natural groupings (e.g., bull vs. bear markets, high vs. low volatility) based on market data.

## **3. Algorithms and Methods (Supervised and Unsupervised)**

- **Supervised Learning Methods:**

  - **Random Forest Regressor (RF):**
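    - Ensemble method that averages many decision trees, each trained on a bootstrap sample of the data with random feature subsets considered at each split.
    - Averaging reduces variance, so it is less prone to overfitting than a single tree; captures nonlinear relationships effectively.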
    - Implemented via `RandomForestRegressor` from scikit-learn.
    - Provides feature importance analysis to evaluate predictor relevance.

  - **Support Vector Regressor (SVR):**
    - Support vector machine-based regression with radial basis function (RBF) kernel.
    - Capable of modeling complex, nonlinear data through kernel transformations.
    - Sensitive to parameter tuning and feature scaling.
    - Implemented via `SVR` from scikit-learn.
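
A minimal sketch of fitting both regressors with scikit-learn on synthetic data. The hyperparameter values are illustrative assumptions, not the settings used in this project. One detail worth noting: `SVR` defaults to `epsilon=0.1`, which is wider than typical 24-hour returns stored as decimals, so a smaller value is assumed here; with the default, nearly all residuals fall inside the insensitive tube and the model tends toward a constant prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and a small-magnitude nonlinear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 0.02 * np.tanh(X[:, 0]) + 0.01 * X[:, 1] ** 2 + rng.normal(0, 0.005, 300)

X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Random forest: no scaling needed; exposes impurity-based importances.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# SVR: scaling lives inside the pipeline so it is fit on training data only.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.001))
svr.fit(X_train, y_train)

rf_r2 = rf.score(X_test, y_test)     # R² on held-out rows
svr_r2 = svr.score(X_test, y_test)
```

Wrapping the scaler and the SVR in one pipeline also makes the pair safe to pass to cross-validation utilities, since the scaler is refit on each training fold.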

- **Unsupervised Learning Methods:**

  - **K-Means Clustering:**
    - Partitioning algorithm grouping data into clusters by minimizing intra-cluster variance.
    - Identifies market regimes (e.g., bull vs. bear, high vs. low volatility) based on financial indicators.
    - Used K=4 clusters, chosen through experimentation (elbow method).
    - Standardized features used to ensure fair weighting of all indicators.
    - Implementation via `KMeans` from scikit-learn.
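
The clustering procedure can be sketched as below: standardize, scan several values of K to inspect the inertia curve (the elbow method), then fit the final model with K=4. The two synthetic features stand in for indicators on very different scales (an RSI-like value and a volatility-like value); the data and the range of K values scanned are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Four synthetic regimes in (rsi-like, volatility-like) space.
rng = np.random.default_rng(0)
centers = np.array([[70, 0.01], [30, 0.01], [70, 0.04], [30, 0.04]])
X = np.vstack([c + rng.normal(0, [3, 0.002], size=(100, 2)) for c in centers])

# Without scaling, the rsi-like column would dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Elbow scan: within-cluster sum of squares for each candidate K.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_

# Final model with the chosen K.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
```

Plotting `inertias` against K and looking for the point where the curve flattens is the usual way to read the elbow; on real market data the bend is typically less sharp than in this synthetic case.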

- **Data Augmentation Techniques:**

  - **Gaussian Noise Augmentation:**
    - Added small Gaussian noise (standard deviations: 0.01, 0.05, 0.1 relative to feature scale) to generate synthetic data.
    - Improves robustness and model generalization by providing additional realistic variations.

  - **Synthetic Sample Mixing:**
    - Created new synthetic samples by averaging randomly selected training samples.
    - Enhances coverage of the feature space; loosely analogous to SMOTE (Synthetic Minority Over-sampling Technique), which interpolates between existing samples, but adapted for regression.
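
Both augmentation techniques can be sketched in a few lines of NumPy. The function names, the choice to scale noise by each feature's standard deviation, and the decision to average targets along with features are assumptions about how "relative to feature scale" and "averaging samples" might be implemented.

```python
import numpy as np

def add_gaussian_noise(X, noise_level, rng):
    """Return a noisy copy of X; noise std is noise_level times each feature's std."""
    scale = noise_level * X.std(axis=0, keepdims=True)
    return X + rng.normal(size=X.shape) * scale

def mix_samples(X, y, n_new, rng):
    """Create n_new samples by averaging random pairs of rows and their targets."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    return (X[i] + X[j]) / 2, (y[i] + y[j]) / 2

# Synthetic training set with features on different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 0.01])
y_train = rng.normal(0, 0.02, 200)

X_noisy = add_gaussian_noise(X_train, 0.05, rng)       # targets unchanged
X_mix, y_mix = mix_samples(X_train, y_train, 100, rng)

X_aug = np.vstack([X_train, X_noisy, X_mix])
y_aug = np.concatenate([y_train, y_train, y_mix])
```

If this is combined with cross-validation, augmentation would normally be applied to the training folds only, so that validation scores are measured on unmodified data.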

- **Dimensionality Reduction:**

  - **Principal Component Analysis (PCA):**
    - Used both for visualizing cluster results in reduced dimensional space and as a preprocessing step in regression modeling.
    - Evaluated reduction to 3, 5, 10, and 15 components to investigate impacts on model accuracy and complexity.
    - PCA implemented through `PCA` from scikit-learn.
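
A short sketch of the PCA experiment: fit PCA for each component count listed above and record the cumulative explained variance, then project to two dimensions for plotting. The synthetic 20-feature matrix (driven by five latent factors) is an assumption used only to make the example self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features: 5 latent factors mapped to 20 columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

# Cumulative explained variance for each candidate component count.
explained = {}
for n in (3, 5, 10, 15):
    pca = PCA(n_components=n).fit(X_scaled)
    explained[n] = pca.explained_variance_ratio_.sum()

# Two-component projection, e.g. for a scatter plot colored by cluster label.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```

Comparing `explained` across component counts gives a quick sense of how much information each reduction discards before any regression model is retrained on the reduced features.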

- **Model Training and Validation:**

  - Employed 5-fold cross-validation to optimize hyperparameters and assess predictive performance rigorously.
  - Training and validation splits performed consistently across experiments to ensure comparability and reliability of results.
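
One way to run 5-fold cross-validated hyperparameter search is scikit-learn's `GridSearchCV`, sketched below. The parameter grid, the scoring choice, and the unshuffled `KFold` splitter are assumptions; the project's actual grid is not stated. For overlapping 24-hour targets, `TimeSeriesSplit` is a commonly used alternative that only validates on rows later than the training rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 0.02 * X[:, 0] + rng.normal(0, 0.01, 200)

cv = KFold(n_splits=5, shuffle=False)   # same folds for every candidate
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    scoring="neg_mean_squared_error",   # scikit-learn maximizes scores
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_          # flip sign back to a plain MSE
```

Reusing one splitter object across experiments is a simple way to keep the folds identical, so that differences in score reflect the model rather than the split.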

- **Performance Metrics:**

  - Evaluated regression models using:
    - **Mean Squared Error (MSE):** Measures average squared difference between actual and predicted values; lower values indicate higher accuracy.
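    - **Coefficient of Determination (R²):** Represents the share of variance explained by the model. Values closer to 1 indicate better predictive ability, 0 matches a constant prediction of the mean, and negative values mean the model does worse than that baseline.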
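
To make the two metrics concrete, the sketch below computes them directly with NumPy on a small hypothetical set of returns. It also shows how R² behaves for a mean-only baseline and for predictions with the wrong sign.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Hypothetical 24-hour returns and two sets of predictions.
y_true = np.array([0.01, -0.02, 0.03, 0.00])
mean_pred = np.full(4, y_true.mean())             # constant baseline
bad_pred = np.array([-0.01, 0.02, -0.03, 0.00])   # sign-flipped predictions

r2_baseline = r2(y_true, mean_pred)   # baseline explains no variance
r2_bad = r2(y_true, bad_pred)         # worse than baseline, so negative
mse_bad = mse(y_true, bad_pred)
```

These functions follow the same definitions as `mean_squared_error` and `r2_score` in scikit-learn, which can be used instead in practice.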

- **Public Libraries Used:**

  - Data handling and preprocessing: `pandas`. 
  - Numerical computations: `NumPy`.
  - Machine learning algorithms (RF, SVR, KMeans, PCA): `scikit-learn`.
  - Data visualization and plotting: `Matplotlib`.

**References:**
- [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) 
- [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [SVR](https://www.geeksforgeeks.org/support-vector-regression-svr-using-linear-and-non-linear-kernels-in-scikit-learn/)
- [NumPy](https://github.com/numpy/numpy)
- [scikit-learn](https://scikit-learn.org/stable/api/index.html)



## Prediction Models

Two models were trained to predict 24-hour Bitcoin returns:

1. **Random Forest Regressor**: Achieved an R² of 0.32, meaning it explains about 32% of the variance in 24-hour returns.
2. **Support Vector Regressor**: Produced a negative R², meaning it performed worse than a constant prediction of the mean return; it likely needs further hyperparameter tuning.

## Market State Clustering

K-means clustering was used to identify distinct market states based on technical indicators. Four clusters were identified, representing different market conditions.

## Experiments

Several experiments were conducted:

1. **Data Augmentation**: Adding noise and synthetic samples to improve model robustness.
2. **Dimensionality Reduction**: Using PCA to reduce feature dimensionality while preserving information.

## Conclusions

The Random Forest model showed moderate predictive power for Bitcoin price movements. Clustering analysis revealed distinct market states with different characteristics. Future work could explore more sophisticated models and feature engineering techniques.