# Preprocessing
The only preprocessing needed to train the BTC-USD prediction models is converting each date into separate year, month, and day columns.
Since this was done in the previous task and the preprocessed dataset is already available as a CSV file, no further preprocessing is needed.
We can proceed to train the models.
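
For reference, the date conversion described above amounts to splitting each date into three integer columns. The following is a minimal sketch using only the standard library, so it runs without the dataset. The `Date` column name, the ISO `YYYY-MM-DD` format, and the sample rows are assumptions for illustration and are not taken from the actual file.

```python
from datetime import date

def split_date(row, date_column="Date"):
    """Return a copy of row with Year, Month and Day added from an ISO date string."""
    parsed = date.fromisoformat(row[date_column])
    expanded = dict(row)
    expanded.update({"Year": parsed.year, "Month": parsed.month, "Day": parsed.day})
    return expanded

# Hypothetical rows with made-up values; the real file comes from the previous task.
sample_rows = [
    {"Date": "2021-01-05", "Close": 100.0},
    {"Date": "2021-12-31", "Close": 200.0},
]
prepared_rows = [split_date(row) for row in sample_rows]
print(prepared_rows[0])
```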

# Linear Regression Model
Linear regression is a fundamental statistical method for modeling the relationship between one or more independent variables and a dependent variable. It assumes that this relationship is linear and finds the best-fitting straight line (a hyperplane when there are several features) by minimizing the sum of squared differences between the predicted and the observed values. The model is simple to train and easy to interpret, since each coefficient describes how the prediction changes with one feature.
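
To make the idea of a best-fitting line concrete, here is a small sketch of ordinary least squares for a single feature, written with plain Python. It is an illustration only: the `fit_line` helper and the sample points are made up, and the model trained below uses scikit-learn's `LinearRegression` with several features instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the sample points lie exactly on a line, the fit recovers that line's slope and intercept; with noisy data it would return the line with the smallest sum of squared errors.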

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import joblib

def train_regression_model(ticker, input_folder="Regression_data"):
    file_path = os.path.join(input_folder, f"prep_{ticker}_data.csv")
    
    if os.path.exists(file_path):
        data = pd.read_csv(file_path)
        
        # Select features and target variable
        features = ['Open', 'High', 'Low', 'Volume', 'Year', 'Month', 'Day']
        target = 'Close'
        
        X = data[features]
        y = data[target]
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Initialize and train a linear regression model
        model = LinearRegression()
        model.fit(X_train, y_train)
        
        # Save the trained model as a joblib file
        model_filename = f"{ticker}_regression_model.joblib"
        joblib.dump(model, model_filename)
        print(f"Trained model saved as '{model_filename}'")
        
        # Predict on the test set
        y_pred = model.predict(X_test)
        
        # Calculate and print the Mean Squared Error
        mse = mean_squared_error(y_test, y_pred)
        print(f"Mean Squared Error: {mse:.2f}")
    else:
        print(f"Preprocessed data file for {ticker} not found.")

# Example usage
ticker_to_train = 'BTC-USD'
train_regression_model(ticker_to_train)


Trained model saved as 'BTC-USD_regression_model.joblib'
Mean Squared Error: 0.79


# Evaluating Linear Regression
Mean Squared Error on Test Dataset: 0.79
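
As a reminder of what this metric measures, the sketch below computes the mean squared error by hand: the average of the squared differences between actual and predicted values. The helper name and the sample numbers are made up for illustration; the notebook itself uses `sklearn.metrics.mean_squared_error`. Note that the MSE is expressed in squared units of the target.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: the squared errors are 1, 0 and 4
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 12.0]
mse = mean_squared_error_manual(actual, predicted)
print(f"Mean Squared Error: {mse:.2f}")
```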

# Random Forest Regressor
Random Forest is an ensemble learning technique used for both regression and classification tasks. It trains many decision trees and, for regression, averages their predictions to produce a more robust and accurate model than any single tree. Each tree is trained on a bootstrap sample of the training data (rows drawn with replacement), and a random subset of features can be considered at each split. This introduces diversity among the trees and reduces the risk of overfitting.
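
The two ingredients described above, training each tree on a resampled dataset and averaging the predictions, can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses a single feature and one-split trees (stumps), the helper names and data are made up, and it omits the per-split feature subsampling that a real Random Forest can apply.

```python
import random

def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs.

    Returns (threshold, left_mean, right_mean).
    """
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between equal x values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: predict the mean everywhere
        mean_y = sum(y for _, y in points) / len(points)
        return (float("inf"), mean_y, mean_y)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(points, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(points) for _ in points]
        forest.append(fit_stump(sample))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)

# Made-up step-shaped data: y is 1.0 for x < 5 and 9.0 otherwise
data = [(float(x), 1.0 if x < 5 else 9.0) for x in range(10)]
forest = fit_forest(data)
low_pred = predict_forest(forest, 1.0)
high_pred = predict_forest(forest, 8.0)
print(low_pred, high_pred)
```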

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import joblib

def train_random_forest_model(ticker, input_folder="Regression_data"):
    file_path = os.path.join(input_folder, f"prep_{ticker}_data.csv")
    
    if os.path.exists(file_path):
        data = pd.read_csv(file_path)
        
        # Select features and target variable
        features = ['Open', 'High', 'Low', 'Volume', 'Year', 'Month', 'Day']
        target = 'Close'
        
        X = data[features]
        y = data[target]
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Initialize and train a Random Forest regression model
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        
        # Save the trained model as a joblib file
        model_filename = f"{ticker}_random_forest_model.joblib"
        joblib.dump(model, model_filename)
        print(f"Random Forest model saved as '{model_filename}'")
        
        # Predict on the test set
        y_pred = model.predict(X_test)
        
        # Calculate and print the Mean Squared Error
        mse = mean_squared_error(y_test, y_pred)
        print(f"Mean Squared Error: {mse:.2f}")
    else:
        print(f"Preprocessed data file for {ticker} not found.")

# Example usage
ticker_to_train = 'BTC-USD'
train_random_forest_model(ticker_to_train)


Random Forest model saved as 'BTC-USD_random_forest_model.joblib'
Mean Squared Error: 2.67


# Evaluating Random Forest Regressor
Mean Squared Error on Test Dataset: 2.67



# Gradient Boosting
Gradient Boosting is an ensemble learning technique that builds accurate and flexible predictive models by adding learners one at a time, with each new learner correcting the weaknesses of those already in the ensemble.

At its core, Gradient Boosting combines the predictions of multiple weak learners, usually shallow decision trees. Unlike Random Forest, whose trees are built independently of one another, Gradient Boosting builds trees sequentially: each new tree is fitted to the errors (residuals) left by the previous ones, and its contribution is scaled by a learning rate before being added to the model.
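
The sequential idea can be sketched in a few lines of plain Python for squared-error loss, where fitting each new learner to the current residuals is the gradient step. This is an illustration only: the helper names and data are made up, the weak learner is a one-split tree on a single feature, and the defaults simply mirror the `n_estimators=100` and `learning_rate=0.1` used in the cell below.

```python
def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Candidate thresholds are the distinct x values except the largest,
    # so both sides of every split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Add a shrunken version of the new stump's output to the model
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, history

# Made-up data following y = x^2
xs = [float(x) for x in range(10)]
ys = [x ** 2 for x in xs]
base, stumps, mse_history = boost(xs, ys)
print(f"Training MSE after round 1: {mse_history[0]:.2f}")
print(f"Training MSE after round {len(mse_history)}: {mse_history[-1]:.2f}")
```

The training error shrinks as rounds are added, which is the behavior the sequential correction is meant to produce. A smaller learning rate slows this down but usually makes the final model less prone to overfitting.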

In [4]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import joblib

def train_gradient_boosting_model(ticker, input_folder="Regression_data"):
    file_path = os.path.join(input_folder, f"prep_{ticker}_data.csv")
    
    if os.path.exists(file_path):
        data = pd.read_csv(file_path)
        
        # Select features and target variable
        features = ['Open', 'High', 'Low', 'Volume', 'Year', 'Month', 'Day']
        target = 'Close'
        
        X = data[features]
        y = data[target]
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Initialize and train a Gradient Boosting regression model
        model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
        model.fit(X_train, y_train)
        
        # Save the trained model as a joblib file
        model_filename = f"{ticker}_gradient_boosting_model.joblib"
        joblib.dump(model, model_filename)
        print(f"Gradient Boosting model saved as '{model_filename}'")
        
        # Predict on the test set
        y_pred = model.predict(X_test)
        
        # Calculate and print the Mean Squared Error
        mse = mean_squared_error(y_test, y_pred)
        print(f"Mean Squared Error: {mse:.2f}")
    else:
        print(f"Preprocessed data file for {ticker} not found.")

# Example usage
ticker_to_train = 'BTC-USD'
train_gradient_boosting_model(ticker_to_train)


Gradient Boosting model saved as 'BTC-USD_gradient_boosting_model.joblib'
Mean Squared Error: 2.35


# Evaluating Gradient Boosting Model
Mean Squared Error on Test Dataset: 2.35

# Final Model Recommendation

1. Linear Regression Model: MSE = 0.79
2. Random Forest Regressor: MSE = 2.67
3. Gradient Boosting Model: MSE = 2.35

From these results, we conclude that the Linear Regression model is the best suited for our purpose, as it has the lowest test MSE of the three models.
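
The comparison can also be made programmatically. The short sketch below ranks the three test-set MSE values reported above and prints the root mean squared error (RMSE) alongside each, which is in the same units as the `Close` target rather than squared units. The dictionary is typed in by hand from the outputs above and is not produced by the training functions.

```python
import math

# Test-set MSE values reported in the sections above
mse_by_model = {
    "Linear Regression": 0.79,
    "Random Forest Regressor": 2.67,
    "Gradient Boosting": 2.35,
}

# The model with the smallest MSE is the recommended one
best_model = min(mse_by_model, key=mse_by_model.get)

for name, mse in sorted(mse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: MSE = {mse:.2f}, RMSE = {math.sqrt(mse):.2f}")
print(f"Lowest MSE: {best_model}")
```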