# Cryptocurrency Trend Prediction

This notebook uses clustering to group cryptocurrencies by their price-change percentages over periods ranging from 24 hours to 1 year. The cluster scatter plots use the 24-hour and 7-day changes as axes. Key steps include:

1. **Data Loading and Preprocessing**: Load and clean the cryptocurrency data.
2. **Normalization**: Use `StandardScaler` to standardize features for consistency in clustering.
3. **Clustering with K-means**: Determine the optimal number of clusters (k) and segment cryptocurrencies using K-means clustering.
4. **Dimensionality Reduction with PCA**: Reduce data to three principal components, analyze explained variance, and optimize clusters.
5. **Comparison of Clustering Results**: Compare clustering effectiveness before and after PCA to evaluate performance.
6. **Visualization**: Plot clustering results to better understand trends among cryptocurrencies.
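
As a rough orientation, the scaling, PCA, and K-means steps above could be chained in a single scikit-learn pipeline. This is a minimal sketch on synthetic data, not the notebook's actual code: the random matrix `X`, the seeds, and the hard-coded `k=4` are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the market data: 41 rows, 7 numeric features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 7))

# Standardize, project onto 3 principal components, then cluster into 4 groups
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

print(labels.shape)       # one cluster label per row
print(np.unique(labels))  # cluster ids in use
```

The notebook itself runs these steps one at a time so that the intermediate results (scaled data, elbow curves, PCA components) can be inspected.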


In [2]:
# Import required libraries and dependencies
import pandas as pd
import hvplot.pandas  # For interactive plotting
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Import local modules
import ML as ml

In [3]:
# Load the data into a Pandas DataFrame and set 'coin_id' as the index
df_market_data = pd.read_csv("Raw_Data/crypto_market_data.csv", index_col="coin_id")

# Display the first 10 rows to inspect the data
df_market_data.head(10)

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,1.08388,7.60278,6.57509,7.67258,-3.25185,83.5184,37.51761
ethereum,0.22392,10.38134,4.80849,0.13169,-12.8889,186.77418,101.96023
tether,-0.21173,0.04935,0.0064,-0.04237,0.28037,-0.00542,0.01954
ripple,-0.37819,-0.60926,2.24984,0.23455,-17.55245,39.53888,-16.60193
bitcoin-cash,2.90585,17.09717,14.75334,15.74903,-13.71793,21.66042,14.49384
binancecoin,2.10423,12.85511,6.80688,0.05865,36.33486,155.61937,69.69195
chainlink,-0.23935,20.69459,9.30098,-11.21747,-43.69522,403.22917,325.13186
cardano,0.00322,13.99302,5.55476,10.10553,-22.84776,264.51418,156.09756
litecoin,-0.06341,6.60221,7.28931,1.21662,-17.2396,27.49919,-12.66408
bitcoin-cash-sv,0.9253,3.29641,-1.86656,2.88926,-24.87434,7.42562,93.73082


In [4]:
# Generate summary statistics to understand distribution and variance in each column
df_market_data.describe()

,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
count,41.0,41.0,41.0,41.0,41.0,41.0,41.0
mean,-0.269686,4.497147,0.185787,1.545693,-0.094119,236.537432,347.667956
std,2.694793,6.375218,8.376939,26.344218,47.365803,435.225304,1247.842884
min,-13.52786,-6.09456,-18.1589,-34.70548,-44.82248,-0.3921,-17.56753
25%,-0.60897,0.04726,-5.02662,-10.43847,-25.90799,21.66042,0.40617
50%,-0.06341,3.29641,0.10974,-0.04237,-7.54455,83.9052,69.69195
75%,0.61209,7.60278,5.51074,4.57813,0.65726,216.17761,168.37251
max,4.84033,20.69459,24.23919,140.7957,223.06437,2227.92782,7852.0897


In [7]:
# Plot every price-change column across coins for an initial visual inspection
initial_inspection_trendlines = df_market_data.hvplot.line(width=800, height=400, rot=90)

hvplot.save(initial_inspection_trendlines, 'img/initial_inspection_graph.png')

## Preparing the Data

To prepare the data for clustering analysis:
1. Identify all numerical columns.
2. Normalize the data using `StandardScaler` to ensure consistent scale across features.
3. Set the `coin_id` column as the index for easier identification of each cryptocurrency.
   

In [8]:
# Check data types to confirm columns are in the expected format
df_market_data.dtypes

price_change_percentage_24h     float64
price_change_percentage_7d      float64
price_change_percentage_14d     float64
price_change_percentage_30d     float64
price_change_percentage_60d     float64
price_change_percentage_200d    float64
price_change_percentage_1y      float64
dtype: object

In [9]:
# Identify numerical columns for scaling
num_cols = ml.get_num_cols(df_market_data)

# Normalize numerical columns using StandardScaler
# Note: `get_scaled` function also handles dummy encoding if required
crypto_df = ml.get_scaled(df_market_data, num_cols, dummies_cols=[])
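
The `ml` module is local and its source is not shown here. The following is a minimal sketch of what helpers like `get_num_cols` and `get_scaled` might look like, assuming they wrap `select_dtypes` and `StandardScaler`. The names `get_num_cols_sketch` and `get_scaled_sketch` are hypothetical, and the dummy-encoding branch is left out.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def get_num_cols_sketch(df):
    """Hypothetical stand-in for ml.get_num_cols: names of numeric columns."""
    return df.select_dtypes(include="number").columns.tolist()


def get_scaled_sketch(df, num_cols):
    """Hypothetical stand-in for ml.get_scaled (dummy encoding omitted)."""
    scaled = StandardScaler().fit_transform(df[num_cols])  # returns a NumPy array
    return pd.DataFrame(scaled, columns=num_cols)          # default RangeIndex


# Small made-up frame with two numeric columns and one text column
demo_df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0], "name": ["x", "y", "z"]},
    index=pd.Index(["c1", "c2", "c3"], name="coin_id"),
)
num_cols = get_num_cols_sketch(demo_df)
scaled_df = get_scaled_sketch(demo_df, num_cols)
scaled_df.index = demo_df.index  # restore the row labels lost during scaling
print(scaled_df)
```

Because `fit_transform` returns a plain array, the row labels are lost unless they are reassigned, which would explain why the notebook resets the index to `coin_id` after scaling.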

In [10]:
# Set the index to coin_id for easy reference to individual cryptocurrencies
crypto_df.index = df_market_data.index

# Display the first few rows of the normalized data
crypto_df.head()

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317
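
The scaled values can be checked by hand from the summary statistics shown earlier. The sketch below reproduces bitcoin's scaled 24-hour value, assuming the scaling is a plain `StandardScaler` z-score. One detail matters: `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so the reported value is rescaled first.

```python
import math

# Values copied from the describe() output and the raw data table
n = 41                      # number of coins
mean_24h = -0.269686        # mean of price_change_percentage_24h
sample_std_24h = 2.694793   # std as reported by describe() (ddof=1)
bitcoin_24h = 1.08388       # bitcoin's raw 24-hour change

# Convert the sample std to the population std that StandardScaler uses
population_std_24h = sample_std_24h * math.sqrt((n - 1) / n)

z = (bitcoin_24h - mean_24h) / population_std_24h
print(round(z, 4))  # 0.5085, in line with the 0.508529 shown in the scaled table
```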


## Finding the Optimal Value for k

We use the elbow method to identify the optimal number of clusters (k) for K-means clustering.

By plotting inertia values against k, we can determine the point where the inertia decline slows, indicating a suitable k value.
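
The inertia-versus-k computation described here can be sketched as follows. `get_elbow_sketch` is a hypothetical stand-in for the local `ml.get_elbow` helper, whose source is not shown; the range of k values (1 to 10), the `n_init` setting, and the synthetic blob data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def get_elbow_sketch(df, k_values=range(1, 11), random_state=0):
    """Hypothetical stand-in for ml.get_elbow: inertia for each candidate k."""
    ks = list(k_values)
    inertia = []
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(df)
        inertia.append(model.inertia_)  # within-cluster sum of squared distances
    return pd.DataFrame({"k": ks, "inertia": inertia})


# Synthetic demo data: three well-separated blobs in two dimensions
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(size=(20, 2)) for c in centers])
demo_df = pd.DataFrame(points, columns=["x", "y"])

df_elbow_demo = get_elbow_sketch(demo_df)
print(df_elbow_demo)
```

On data like this, inertia should fall steeply until k reaches the true number of blobs and flatten afterwards, which is the bend the elbow plot is meant to expose.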

In [11]:
# Create a DataFrame with inertia values for each k to plot the Elbow curve
df_elbow = ml.get_elbow(crypto_df)

# Plot the Elbow curve to help identify the optimal k value
elbow_plot = df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve (with original features)")

# Save the Elbow plot for reference
hvplot.save(elbow_plot, 'img/elbow_plot_for_kvalue.png')

# Display the Elbow plot
elbow_plot

##### Analysis of the Elbow Plot
The elbow plot helps determine the optimal number of clusters (k) for K-means clustering by showing how inertia decreases as k increases:

1. **Significant Improvement**: Inertia decreases sharply from \( k=1 \) to \( k=4 \), suggesting that these clusters capture substantial variance in the data.
2. **Diminishing Returns**: Beyond \( k=4 \), the inertia reduction slows down, indicating that adding more clusters yields minimal improvement.
3. **Optimal k**: The “elbow” occurs around \( k=4 \), suggesting it as the optimal number of clusters, balancing effective clustering with simplicity.

Thus, \( k=4 \) is likely the best choice for this dataset, providing well-defined clusters without unnecessary complexity.
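
Reading an elbow by eye can be backed up with a simple numeric check: the percentage drop in inertia at each step. The sketch below uses made-up inertia values purely to show the calculation; they are not this notebook's results. With the real data, the `k` and `inertia` columns of `df_elbow` would take their place.

```python
# Illustrative inertia values only (NOT taken from this notebook's output)
k_values = [1, 2, 3, 4, 5, 6, 7]
inertia = [1000.0, 600.0, 350.0, 200.0, 170.0, 150.0, 135.0]

# Percentage drop in inertia when moving from k-1 to k
pct_drop = {
    k: 100 * (prev - curr) / prev
    for k, prev, curr in zip(k_values[1:], inertia[:-1], inertia[1:])
}

for k, drop in pct_drop.items():
    print(f"k={k}: inertia falls by {drop:.1f}%")
```

In this toy series the drops stay above 40% up to k=4 and fall to 15% or less afterwards, which is the kind of pattern that marks an elbow at k=4.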

## Clustering Cryptocurrencies with K-means Using Original Data

In this section, we apply the K-means clustering algorithm on the scaled original data. We set `k=4` (based on the elbow plot) to divide the cryptocurrencies into four clusters. A scatter plot is created to visualize the clusters, using 24-hour and 7-day price change percentages for the x and y axes, respectively. Each data point represents a cryptocurrency and is color-coded by its assigned cluster.

In [12]:
# Fit the K-means model using the scaled data with k=4 clusters
clusters_df = ml.fit_model(crypto_df, 4)

# Create a scatter plot to visualize clusters
# x-axis: 24-hour price change, y-axis: 7-day price change
# Points are colored by their cluster label; hover shows crypto name
cluster_plot = clusters_df.hvplot.scatter(
    x="price_change_percentage_24h",
    y="price_change_percentage_7d",
    by='cluster',
    hover_cols=['coin_id'],
    title='Scatter Plot with Clusters, original features (k=4)'
)

# Save and display the plot
hvplot.save(cluster_plot, 'img/cluster_plot.png')
cluster_plot
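
Since `ml.fit_model` is a local helper, its exact behavior is not visible here. A minimal sketch of a function with the same apparent contract (fit K-means, return a copy of the data with a `cluster` column) might look like this. The name `fit_model_sketch`, the `n_init` and `random_state` settings, and the demo data are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans


def fit_model_sketch(df, k, random_state=0):
    """Hypothetical stand-in for ml.fit_model: label each row with its cluster."""
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labels = model.fit_predict(df)
    result = df.copy()           # leave the caller's DataFrame untouched
    result["cluster"] = labels   # integer cluster id per row
    return result


# Synthetic demo: two obvious groups along a single feature
demo_df = pd.DataFrame(
    {"feature": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]},
    index=pd.Index(list("abcdef"), name="coin_id"),
)
clustered = fit_model_sketch(demo_df, k=2)
print(clustered)
```

Fixing `random_state` keeps the cluster ids stable between runs, which helps when saved plots are compared later.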

## Optimizing Clusters with Principal Component Analysis (PCA)

To reduce the complexity of our dataset, we apply Principal Component Analysis (PCA) to condense the information into three principal components. We assess the total explained variance of these three components to ensure they capture a sufficient amount of information from the original features.

In [13]:
# Initialize PCA with 3 components
pca = PCA(n_components=3)

# Apply PCA to reduce data dimensions to three principal components
crypto_pca = pca.fit_transform(crypto_df)

# Store the explained variance ratio of each principal component
vr = pca.explained_variance_ratio_

# Calculate the total explained variance of the three components
pc = sum(vr) * 100
print(f"{pc:.2f}% of the total variance is explained by the three principal components")

# Create a new DataFrame with the PCA-transformed data
crypto_pca_df = pd.DataFrame(crypto_pca, columns=['PC1', 'PC2', 'PC3'])

# Set index to coin_id for easier reference
crypto_pca_df.index = df_market_data.index

# Display the first few rows of the PCA DataFrame
crypto_pca_df.head()

88.86% of the total variance is explained by the three principal components


coin_id,PC1,PC2,PC3
bitcoin,0.448908,-1.245376,-0.85064
ethereum,0.495367,-0.899823,-1.317559
tether,-0.818846,0.071899,0.695015
ripple,-0.840357,0.080054,0.54436
bitcoin-cash,0.81324,-2.66952,-1.643321
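
For intuition about what the explained variance ratio measures, it can be derived with NumPy alone: each principal component's variance comes from a singular value of the centered data, and the ratio is that variance divided by the total. This is a sketch on a synthetic matrix standing in for the scaled features, so the printed numbers will not match the 88.86% reported for the real data.

```python
import numpy as np

# Synthetic stand-in for the scaled feature matrix (assumption): 41 x 7
rng = np.random.default_rng(0)
X = rng.normal(size=(41, 7))

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# Singular values relate to component variances: var_i = s_i**2 / (n - 1)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = singular_values**2 / (X.shape[0] - 1)

# Each ratio is that component's share of the total variance
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio[:3])        # shares of the first three components
print(explained_variance_ratio[:3].sum())  # share retained when keeping three
```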


## Finding the Optimal k Using PCA Data

After transforming the data with PCA, we re-evaluate the optimal number of clusters (k) using the elbow method. This allows us to determine whether PCA impacts the choice of k compared to the original data.

In [15]:
# Calculate inertia for different k values with PCA data to find the optimal k
df_elbow_pca = ml.get_elbow(crypto_pca_df)

# Plot the elbow curve for PCA-transformed data to identify the optimal k
elbow_pca_plot = df_elbow_pca.hvplot.line(
    x="k",
    y="inertia",
    title="Elbow Curve using PCA"
)

# Save and display the elbow plot for PCA data
hvplot.save(elbow_pca_plot, 'img/elbow_curve_usingPCA_plot.png')
elbow_pca_plot

##### Analysis Questions:
**What is the best value for k when using the PCA data?**

**Answer**: The elbow seems to be at \( k=4 \).

**Does it differ from the best k value found using the original data?**

**Answer**: No, it is the same.

## Clustering Cryptocurrencies with K-means Using PCA Data

We apply K-means clustering on the PCA-transformed data with \( k=4 \). The clusters are visualized using a scatter plot with the first two principal components (PC1 and PC2) as axes. Each data point represents a cryptocurrency, color-coded by its cluster, allowing us to compare results with the original data clusters.

In [16]:
# Fit K-means model with k=4 using the PCA-transformed data
clusters_pca_df = ml.fit_model(crypto_pca_df, 4)

# Create a scatter plot of the PCA data clusters
# x-axis: PC1, y-axis: PC2, color by cluster label, hover shows crypto name
cluster_pca_plot = clusters_pca_df.hvplot.scatter(
    x="PC1",
    y="PC2",
    by='cluster',
    hover_cols=['coin_id'],
    title='Scatter Plot with Clusters using PCA (k=4)'
)

# Save and display the PCA-based cluster plot
hvplot.save(cluster_pca_plot, 'img/cluster_pca_plot.png')
cluster_pca_plot

## Visualize and Compare the Results
This section visualizes the cluster analysis results, contrasting the outcomes obtained with and without PCA.

### Comparing Elbow Curves: Original Data vs. PCA-Transformed Data

To assess the effect of dimensionality reduction on clustering, we create a composite plot comparing the elbow curves for the original data and the PCA-transformed data. This comparison helps us determine if PCA impacts the optimal choice of \( k \) in the clustering process.

In [18]:
# Composite plot to contrast the Elbow curves
comparison_plot1 = (elbow_plot+elbow_pca_plot).cols(1)

# Save the composite elbow-curve plot
hvplot.save(comparison_plot1, 'img/ccomparison_plot.png')

### Comparing Clusters: Original Data vs. PCA-Transformed Data

Next, we visualize and compare the clustering results using both the original features and the PCA-transformed data. This allows us to see if PCA affects the distribution of clusters and to examine potential differences in the clustering structure.

In [19]:
# Composite plot to contrast the clusters
comparison_plot2 = (cluster_plot+cluster_pca_plot).cols(1)

# Save the composite cluster plot
hvplot.save(comparison_plot2, 'img/ccomparison_plot_w_PCA.png')
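
Beyond the visual comparison, the two sets of cluster assignments can be compared directly. Cluster ids are arbitrary, so cluster 0 in one run may correspond to cluster 2 in the other; a contingency table and the adjusted Rand index both account for that. The label vectors below are hypothetical; with the notebook's objects they would presumably be `clusters_df['cluster']` and `clusters_pca_df['cluster']`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for the two 'cluster' columns
labels_original = pd.Series([0, 0, 1, 1, 2, 2, 3], name="original")
labels_pca = pd.Series([2, 2, 0, 0, 1, 1, 3], name="pca")

# Rows: clusters from original features; columns: clusters from PCA features
contingency = pd.crosstab(labels_original, labels_pca)
print(contingency)

# 1.0 means identical groupings, regardless of how the cluster ids are numbered
ari = adjusted_rand_score(labels_original, labels_pca)
print(ari)
```

A table with exactly one non-zero entry per row, as in this toy case, indicates that the two runs grouped the items identically and only numbered the clusters differently.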

### Visualizing PCA Clusters Across Different Component Combinations

To gain further insight into the clustering structure after PCA, we create scatter plots for various combinations of principal components:
1. **PC1 vs. PC2**: This plot shows clustering patterns along the first two principal components.
2. **PC1 vs. PC3**: This plot provides an alternative view by replacing PC2 with PC3.
3. **PC2 vs. PC3**: This plot highlights clustering patterns along the second and third principal components.

These views allow us to explore cluster separations from different perspectives within the reduced dimensional space.
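
The three pairings are simply all two-element combinations of the component names, so they could be generated rather than typed out. The sketch below only builds the axis pairs and titles; wiring them into `hvplot.scatter` calls is left out to keep it dependency-free.

```python
from itertools import combinations

components = ["PC1", "PC2", "PC3"]

# Every unordered pair of components: PC1/PC2, PC1/PC3, PC2/PC3
pairs = list(combinations(components, 2))

# Titles follow the "Clusters with X and Y" pattern
titles = [f"Clusters with {x} and {y}" for x, y in pairs]

for (x, y), title in zip(pairs, titles):
    print(x, y, title)
```

This scales naturally if more components are kept: four components would yield six pairings without further code changes.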

In [20]:
# PC1 and PC2
cluster_pca_plot_1 = clusters_pca_df.hvplot.scatter(
    x="PC1",
    y="PC2",
    by='cluster',
    hover_cols=['coin_id'],
    title='Clusters with PC1 and PC2'
)

# PC1 and PC3
cluster_pca_plot_2 = clusters_pca_df.hvplot.scatter(
    x="PC1",
    y="PC3",
    by='cluster',
    hover_cols=['coin_id'],
    title='Clusters with PC1 and PC3'
)

# PC2 and PC3
cluster_pca_plot_3 = clusters_pca_df.hvplot.scatter(
    x="PC2",
    y="PC3",
    by='cluster',
    hover_cols=['coin_id'],
    title='Clusters with PC2 and PC3'
)

composite = (cluster_pca_plot_1+cluster_pca_plot_2+cluster_pca_plot_3).cols(1)
hvplot.save(composite, 'img/composite.png')
composite