# Cryptocurrency Price Prediction Model

## Problem Statement
In this project, we aim to build a predictive model that forecasts the future closing price of a cryptocurrency based on its historical price data. The dataset contains various features such as the highest price (`High`), lowest price (`Low`), opening price (`Open`), closing price (`Close`), volume of transactions (`Volume`), and market capitalization (`Marketcap`) for each day. Our objective is to apply machine learning regression techniques to estimate the closing price of the cryptocurrency for the next day, given its historical performance.

## Outcome
The outcome of this project will be a regression model that predicts the continuous value of the cryptocurrency's closing price. The model's performance will be evaluated using metrics such as Mean Squared Error (MSE) and R-squared (R2) to quantify its accuracy and predictive power.
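
To make the two metrics concrete, here is a minimal sketch that computes MSE and R2 by hand and compares them with scikit-learn's functions. The arrays `y_true` and `y_pred` are made-up illustrative values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted closing prices (illustrative values only).
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([101.0, 101.0, 102.0, 104.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MSE is expressed in squared price units, so its magnitude depends on the price level of the asset, while R2 is unitless.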

## Approach
We will follow these steps to develop our predictive model:
1. Load and preprocess the data, ensuring it is clean and suitable for modeling.
2. Perform exploratory data analysis to understand the trends and patterns in the data.
3. Engineer relevant features that could improve the model's predictions.
4. Split the data into training and testing sets to validate the model's performance.
5. Train several regression models and compare their results to select the best performer.
6. Evaluate the chosen model using appropriate metrics and validate it to ensure its generalizability.
7. Use the final model to make predictions on new, unseen data.

By the end of this project, we will have a robust model capable of predicting cryptocurrency prices, which could be a valuable tool for investors and traders looking to make informed decisions in the volatile crypto market.
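
Step 3 (feature engineering) could look like the sketch below, which derives lagged and rolling features per coin. The frame `df` and its values are made up, and the feature names `Close_lag1`, `Return_1d`, and `Close_ma3` are hypothetical; only the column names `Symbol`, `Date`, and `Close` come from the dataset.

```python
import pandas as pd

# Tiny made-up frame using the dataset's column names (values are illustrative).
df = pd.DataFrame({
    "Symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 2
    ),
    "Close": [10.0, 11.0, 12.0, 13.0, 1.0, 2.0, 4.0, 8.0],
})

# Sort so that shifts and rolling windows follow time order within each coin.
df = df.sort_values(["Symbol", "Date"]).reset_index(drop=True)
grouped_close = df.groupby("Symbol")["Close"]

# Previous day's close, computed within each coin so values never cross coins.
df["Close_lag1"] = grouped_close.shift(1)

# One-day simple return relative to the previous close.
df["Return_1d"] = df["Close"] / df["Close_lag1"] - 1.0

# Three-day rolling mean of the close, per coin.
df["Close_ma3"] = grouped_close.transform(lambda s: s.rolling(3).mean())

print(df)
```

Grouping by `Symbol` matters because the combined dataset stacks several coins in one frame; an ungrouped shift would carry one coin's last price into the next coin's first row.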


In [17]:
import pandas as pd 
import os 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR 
from sklearn.metrics import mean_squared_error, r2_score

In [18]:
# Directory containing the CSV files (machine-specific path; adjust for your environment)
data_dir = 'C:\\Users\\Subhash.Singh\\Documents\\DataScienceCourse\\Hackthon\\Data'
all_files = [os.path.join(data_dir, file) for file in os.listdir(data_dir) if file.endswith('.csv')]

# Load and combine the CSV files
df_list = [pd.read_csv(file) for file in all_files]
combined_df = pd.concat(df_list, ignore_index=True)
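
A slightly more portable variant of the loading step is sketched below. It assumes every CSV has a `Date` column, as the dataset overview indicates; the helper name `load_price_files` and the demo files are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_price_files(data_dir):
    """Read every CSV in data_dir and return one DataFrame with parsed dates."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))  # sorted for a stable order
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    frames = [pd.read_csv(path, parse_dates=["Date"]) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration on two throwaway files with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"Symbol": ["AAA"], "Date": ["2021-01-01 23:59:59"], "Close": [1.0]}
    ).to_csv(Path(tmp) / "coin_a.csv", index=False)
    pd.DataFrame(
        {"Symbol": ["BBB"], "Date": ["2021-01-01 23:59:59"], "Close": [2.0]}
    ).to_csv(Path(tmp) / "coin_b.csv", index=False)
    demo_df = load_price_files(tmp)

print(demo_df.dtypes)
```

Parsing `Date` at load time turns it into a datetime column instead of `object`, which makes later sorting and time-based splitting straightforward.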

In [19]:
# Check for duplicates
print(combined_df.duplicated().sum())

# Drop duplicates if necessary
combined_df = combined_df.drop_duplicates()

0
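
`duplicated()` with no arguments only flags rows that match in every column. For daily price data it can also be worth checking the logical key, one row per coin per day. The sketch below uses a made-up frame; treating (`Symbol`, `Date`) as the key and keeping the last occurrence are assumptions, not something the dataset documents.

```python
import pandas as pd

# Made-up rows: the two AAA rows share a date but differ in Close,
# so a full-row duplicate check does not flag them.
df = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "BBB"],
    "Date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "Close": [10.0, 10.5, 1.0],
})

full_row_dupes = df.duplicated().sum()
key_dupes = df.duplicated(subset=["Symbol", "Date"]).sum()
print(full_row_dupes, key_dupes)

# Keep one row per (Symbol, Date); keeping the last one is an arbitrary choice here.
deduped = df.drop_duplicates(subset=["Symbol", "Date"], keep="last")
```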


In [4]:
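# Explore the combined dataset
combined_df.info()      # column dtypes and non-null counts
combined_df.describe()  # summary statistics (not rendered: a cell displays only its last expression)
combined_df.head()      # first five rows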

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37082 entries, 0 to 37081
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   SNo        37082 non-null  int64  
 1   Name       37082 non-null  object 
 2   Symbol     37082 non-null  object 
 3   Date       37082 non-null  object 
 4   High       37082 non-null  float64
 5   Low        37082 non-null  float64
 6   Open       37082 non-null  float64
 7   Close      37082 non-null  float64
 8   Volume     37082 non-null  float64
 9   Marketcap  37082 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 2.8+ MB


|   | SNo | Name | Symbol | Date | High | Low | Open | Close | Volume | Marketcap |
|---|-----|------|--------|------|------|-----|------|-------|--------|-----------|
| 0 | 1 | Aave | AAVE | 2020-10-05 23:59:59 | 55.112358 | 49.7879 | 52.675035 | 53.219243 | 0.0 | 89128130.0 |
| 1 | 2 | Aave | AAVE | 2020-10-06 23:59:59 | 53.40227 | 40.734578 | 53.291969 | 42.401599 | 583091.5 | 71011440.0 |
| 2 | 3 | Aave | AAVE | 2020-10-07 23:59:59 | 42.408314 | 35.97069 | 42.399947 | 40.083976 | 682834.2 | 67130040.0 |
| 3 | 4 | Aave | AAVE | 2020-10-08 23:59:59 | 44.902511 | 36.696057 | 39.885262 | 43.764463 | 1658817.0 | 220265100.0 |
| 4 | 5 | Aave | AAVE | 2020-10-09 23:59:59 | 47.569533 | 43.291776 | 43.764463 | 46.817744 | 815537.7 | 235632200.0 |


In [43]:
# Handle missing values by imputation or removal
combined_df = combined_df.dropna()  # Example: Remove rows with missing values

        SNo  Name Symbol                 Date       High        Low  \
0         1  Aave   AAVE  2020-10-05 23:59:59  55.112358  49.787900   
1         2  Aave   AAVE  2020-10-06 23:59:59  53.402270  40.734578   
2         3  Aave   AAVE  2020-10-07 23:59:59  42.408314  35.970690   
3         4  Aave   AAVE  2020-10-08 23:59:59  44.902511  36.696057   
4         5  Aave   AAVE  2020-10-09 23:59:59  47.569533  43.291776   
...     ...   ...    ...                  ...        ...        ...   
37077  2889   XRP    XRP  2021-07-02 23:59:59   0.667287   0.634726   
37078  2890   XRP    XRP  2021-07-03 23:59:59   0.683677   0.644653   
37079  2891   XRP    XRP  2021-07-04 23:59:59   0.707783   0.665802   
37080  2892   XRP    XRP  2021-07-05 23:59:59   0.695653   0.648492   
37081  2893   XRP    XRP  2021-07-06 23:59:59   0.679923   0.652676   

            Open      Close        Volume     Marketcap  
0      52.675035  53.219243  0.000000e+00  8.912813e+07  
1      53.291969  42.401599  5.
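
The `info()` output earlier reports no null values in any column, so `dropna()` leaves this dataset unchanged. For data that does have gaps, the sketch below shows the two options the comment mentions: counting missing values per column, then either forward-filling within each coin or dropping the rows. The frame and the `Close_filled` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in Close.
df = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Close": [10.0, np.nan, 12.0, np.nan, 2.0],
})

# How many values are missing in each column?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Imputation: carry the last known value forward, separately for each coin.
df["Close_filled"] = df.groupby("Symbol")["Close"].ffill()

# A coin whose first row is missing has nothing to carry forward,
# so that row is still NaN and is removed instead.
remaining = df["Close_filled"].isna().sum()
cleaned = df.dropna(subset=["Close_filled"])
```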

In [49]:
# Select features and target
features = ['High', 'Low', 'Open', 'Volume', 'Marketcap']
target = 'Close'

# Extract features and target
X = combined_df[features]
y = combined_df[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features; fit the scaler on the training set only so that
# test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
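
The problem statement targets the next day's close, whereas the features and target selected in this cell come from the same row, that is, the same day. A minimal sketch of building a next-day target and splitting by date is shown below. The frame is made up, and the column name `Next_Close` and the cutoff date are hypothetical choices.

```python
import pandas as pd

# Made-up frame using the dataset's column names.
dates = ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05"]
df = pd.DataFrame({
    "Symbol": ["AAA"] * 5 + ["BBB"] * 5,
    "Date": pd.to_datetime(dates * 2),
    "Close": [10.0, 11.0, 12.0, 13.0, 14.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})
df = df.sort_values(["Symbol", "Date"]).reset_index(drop=True)

# Target: the following day's close for the same coin.
df["Next_Close"] = df.groupby("Symbol")["Close"].shift(-1)

# The last row of each coin has no next day, so it is dropped.
df = df.dropna(subset=["Next_Close"])

# Chronological split: rows before the cutoff train the model, later rows test it.
cutoff = pd.Timestamp("2021-01-04")  # hypothetical cutoff date
train = df[df["Date"] < cutoff]
test = df[df["Date"] >= cutoff]
print(len(train), len(test))
```

Splitting by date rather than at random keeps every test row later in time than the training rows, which is closer to how a forecast would be used.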

In [33]:
# Initialize models
lr = LinearRegression()
rf = RandomForestRegressor()
svr = SVR()

# Train models
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
svr.fit(X_train, y_train)

# Predict on test set
lr_pred = lr.predict(X_test)
rf_pred = rf.predict(X_test)
svr_pred = svr.predict(X_test)
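
A scaler should learn its mean and variance from training rows only. One way to make that automatic is to wrap the scaler and the model in a scikit-learn `Pipeline`, sketched here on synthetic data. `X_demo` and `y_demo` are made up and stand in for the real features and target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a linear signal plus a little noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 2.0, 0.5, 0.0, -1.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() scales using statistics from X_tr only; predict()/score() reuse those
# statistics on X_te without refitting the scaler.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)  # R2 on the held-out rows
print(score)
```

The same pattern works with `RandomForestRegressor` or `SVR` in place of `LinearRegression`, and it also keeps cross-validation folds free of leakage when the pipeline is passed to `cross_val_score`.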

In [51]:
# Evaluate Linear Regression
print('Linear Regression MSE:', mean_squared_error(y_test, lr_pred))
print('Linear Regression R2:', r2_score(y_test, lr_pred))

# Evaluate Random Forest Regressor
print('Random Forest MSE:', mean_squared_error(y_test, rf_pred))
print('Random Forest R2:', r2_score(y_test, rf_pred))

# Evaluate SVR
print('SVR MSE:', mean_squared_error(y_test, svr_pred))
print('SVR R2:', r2_score(y_test, svr_pred))

Linear Regression MSE: 16033.782014291657
Linear Regression R2: 0.9993675730745086
Random Forest MSE: 12715.684780052983
Random Forest R2: 0.9994984501208887
SVR MSE: 23789237.37666393
SVR R2: 0.06167152325650005
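
The combined dataset pools coins with very different price levels (the sample rows show Aave near 50 and XRP below 1), so a single pooled MSE is dominated by the highest-priced coins. The sketch below breaks the error down per coin. The `results` frame, its symbols, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up predictions for a high-priced and a low-priced coin.
results = pd.DataFrame({
    "Symbol": ["BIG", "BIG", "SMALL", "SMALL"],
    "actual": [30000.0, 31000.0, 0.50, 0.60],
    "predicted": [30100.0, 30900.0, 0.55, 0.54],
})
error = results["actual"] - results["predicted"]
results["sq_err"] = error ** 2
results["abs_pct_err"] = error.abs() / results["actual"].abs()

# Pooled RMSE over all rows, in price units.
pooled_rmse = np.sqrt(results["sq_err"].mean())

# Per-coin RMSE and mean absolute percentage error.
per_coin = results.groupby("Symbol").agg(
    rmse=("sq_err", lambda s: np.sqrt(s.mean())),
    mape=("abs_pct_err", "mean"),
)
print(pooled_rmse)
print(per_coin)
```

In this toy example the low-priced coin has a much smaller RMSE but a much larger percentage error, which a pooled MSE would hide.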


# Model Performance Comparison

## Random Forest Model
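- **Mean Squared Error (MSE):** 12,715.68, the lowest of the three models
- **R-Squared (R2):** 0.99950, the highest of the three models

The Random Forest model is the best performer of the three on this test split, with Linear Regression close behind (MSE 16,033.78, R2 0.99937). Both models receive the same day's `High`, `Low`, and `Open` as inputs, and `High` and `Low` bound `Close` by definition, so these scores measure how well the same-day close can be estimated rather than how well the next day's close can be forecast.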

## Support Vector Regression (SVR) Model
- **Mean Squared Error (MSE):** 23,789,237.38, more than 1,800 times the Random Forest MSE
- **R-Squared (R2):** 0.0617

In contrast, the SVR model performs poorly: an R2 of 0.0617 means it explains only about 6% of the variance in the test-set closing prices. With its default hyperparameters, as used here, SVR is not well-suited to this dataset.
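
One possible contributing factor, not verified on this dataset, is that SVR's default `C` and `epsilon` are sized for targets of roughly unit scale, while closing prices span several orders of magnitude. The sketch below compares a plain `SVR` with one whose target is standardized through `TransformedTargetRegressor`. The data is synthetic, so the outcome illustrates the mechanism only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with a target in the thousands (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (10000.0 + 3000.0 * X_demo[:, 0] + 1000.0 * X_demo[:, 1]
          + rng.normal(scale=50.0, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default SVR on scaled features, target left in its original units.
plain_svr = make_pipeline(StandardScaler(), SVR())

# Same model, but the target is standardized for fitting and
# predictions are mapped back to the original units automatically.
scaled_target_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)

plain_svr.fit(X_tr, y_tr)
scaled_target_svr.fit(X_tr, y_tr)

r2_plain = r2_score(y_te, plain_svr.predict(X_te))
r2_scaled = r2_score(y_te, scaled_target_svr.predict(X_te))
print(r2_plain, r2_scaled)
```

Tuning `C`, `epsilon`, and `gamma` (for example with `GridSearchCV`) is another option; whether either change helps on the real data would need to be checked.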

## Conclusion
Given the metrics above, the **Random Forest model** is the best choice among the three for this dataset, with both the lowest MSE and the highest R2 on the test set.
