- Background
- Preparing the Data
- Find the Best Value for k Using the Original Scaled DataFrame
- Cluster Cryptocurrencies with K-means Using the Original Scaled Data
- Optimize Clusters with Principal Component Analysis
- Find the Best Value for k Using the PCA Data
- Cluster Cryptocurrencies with K-means Using the PCA Data
Cryptocurrency prices are volatile, and it is useful to know whether a coin is affected more by short-term (24-hour) or longer-term (7-day) price changes. In this assignment, we use unsupervised learning in Python to group cryptocurrencies by their price-change patterns. Working from historical price-change data, we apply K-means clustering and principal component analysis (PCA) to look for structure in the market that may be useful to investors, traders, and enthusiasts.
The goal of this project is to cluster similar cryptocurrencies together.
Jupyter Notebook, machine learning, scikit-learn, KMeans clustering, principal component analysis, PCA, dimensionality reduction, clustering algorithms, pandas, numpy, matplotlib, elbow plot
For this project, we use the crypto_market_data.csv file, which contains price change information for 41 cryptocurrencies over different periods of time. The first step was to import the CSV file and review the data. To view the data, we plotted the price change for each cryptocurrency.
Since the data is not normalized, StandardScaler() from the scikit-learn library was used to normalize it, so that each feature has zero mean and unit variance. The scaled data was then put in a DataFrame and the coin ID was set as the index of the DataFrame. The following is a screenshot of the DataFrame:
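The scaling step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four rows and their values are made up, the coin IDs are placeholders, and the index is set before scaling for brevity. Only the two column names are taken from this write-up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample rows; the real values come from crypto_market_data.csv.
market_df = pd.DataFrame(
    {
        "coin_id": ["coin_a", "coin_b", "coin_c", "coin_d"],
        "price_change_percentage_24h": [1.08, 0.22, -0.21, -0.38],
        "price_change_percentage_7d": [7.60, 10.38, 0.05, -0.49],
    }
).set_index("coin_id")

# Standardize each column to zero mean and unit variance.
scaled_values = StandardScaler().fit_transform(market_df)

# Rebuild a DataFrame, keeping the column names and the coin_id index.
scaled_df = pd.DataFrame(
    scaled_values, columns=market_df.columns, index=market_df.index
)
print(scaled_df.head())
```

With the real file, the made-up DataFrame would presumably be replaced by a pd.read_csv call on crypto_market_data.csv.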
The elbow method was used to find the ideal number of clusters, k, for the KMeans clustering algorithm. The following steps were taken to find the best value of k:
- Create a list with the number of k values from 1 to 11.
- Create an empty list to store the inertia values.
- Create a for loop to compute the inertia with each possible value of k.
- Create a dictionary with the data to plot the elbow curve.
- Plot a line chart with all the inertia values computed with the different values of k to visually identify the optimal value for k.
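The steps above might look roughly like the sketch below. It runs on synthetic data (41 rows, 7 made-up features) rather than the real scaled DataFrame, and the n_init and random_state values are assumptions chosen only to make the run repeatable. The k range is read here as 1 through 11 inclusive.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled DataFrame (assumption: 7 features).
rng = np.random.default_rng(0)
scaled_df = pd.DataFrame(
    rng.normal(size=(41, 7)),
    columns=[f"feature_{i}" for i in range(7)],
)

k_values = list(range(1, 12))  # k = 1 through 11
inertia = []  # one inertia value per k

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(scaled_df)
    # inertia_ is the sum of squared distances to the nearest centroid.
    inertia.append(model.inertia_)

# Dictionary (and DataFrame) holding the data for the elbow curve.
elbow_data = {"k": k_values, "inertia": inertia}
elbow_df = pd.DataFrame(elbow_data)
print(elbow_df)
```

Plotting inertia against k from elbow_df as a line chart gives the elbow curve. On random data like this there is no clear elbow, so the sketch only illustrates the mechanics.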
Based on the elbow curve, the best value of k appears to be 4, meaning that the optimal number of clusters is 4. Now that we have the optimal number of clusters, we can cluster the cryptocurrencies with KMeans using the normalized data.
The following steps were taken to create four clusters for the cryptocurrency data:
- Initialize the K-means model with the best value for k.
- Fit the K-means model using the original scaled DataFrame.
- Predict the clusters to group the cryptocurrencies using the original scaled DataFrame.
- Create a copy of the original data and add a new column with the predicted clusters.
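A minimal sketch of these four steps, again on synthetic data. The "cluster" column name, the placeholder coin IDs, and the n_init and random_state settings are assumptions for illustration, not details confirmed by this write-up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled DataFrame, indexed by placeholder coin IDs.
rng = np.random.default_rng(1)
scaled_df = pd.DataFrame(
    rng.normal(size=(41, 2)),
    columns=["price_change_percentage_24h", "price_change_percentage_7d"],
    index=pd.Index([f"coin_{i}" for i in range(41)], name="coin_id"),
)

# Initialize the model with the best k, then fit and predict.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
model.fit(scaled_df)
clusters = model.predict(scaled_df)

# Copy the data so the scaled DataFrame itself stays unchanged.
clustered_df = scaled_df.copy()
clustered_df["cluster"] = clusters  # hypothetical column name
print(clustered_df.head())
```

Working on a copy keeps the scaled DataFrame free of the label column, so it can be reused later without the labels leaking in as a feature.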
After the DataFrame was created, hvPlot was used to plot the clusters. The x-axis was set to "price_change_percentage_24h" and the y-axis to "price_change_percentage_7d". The points were coloured by the clusters predicted with K-means. We added the "coin_id" column to the hover_cols parameter to identify the cryptocurrency represented by each data point.
Next, we optimized the clusters by performing PCA to reduce the dimensionality of the data, in this case reducing the features to three principal components. Using the fit_transform method, we fit the PCA model to our data, then used the model's explained_variance_ratio_ attribute to determine how much of the variance is attributed to each principal component. PC1 (principal component 1) explains ~37% of the variance in the data, PC2 explains ~35%, and PC3 explains ~18%. The total explained variance is therefore ~90%.
The first five rows of the PCA DataFrame appear as follows:
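As a rough sketch of the PCA step, the code below reduces synthetic data to three components. The feature count (7), the placeholder coin IDs, and the "PC1" to "PC3" column names are assumptions. Because the input is random, the printed ratios will not match the ~37%, ~35%, and ~18% reported for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled DataFrame (assumption: 7 features).
rng = np.random.default_rng(2)
scaled_df = pd.DataFrame(
    rng.normal(size=(41, 7)),
    columns=[f"feature_{i}" for i in range(7)],
    index=pd.Index([f"coin_{i}" for i in range(41)], name="coin_id"),
)

# Fit PCA and project the data onto three principal components.
pca = PCA(n_components=3)
pca_values = pca.fit_transform(scaled_df)

pca_df = pd.DataFrame(
    pca_values, columns=["PC1", "PC2", "PC3"], index=scaled_df.index
)

# Share of the total variance explained by each component, and their sum.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
print(pca_df.head())
```

Note that explained_variance_ratio_ is an attribute populated after fitting, not a method, and its entries are ordered from the largest share to the smallest.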
We applied the same elbow-method steps used for the scaled data to the PCA data to find the best value for k (number of clusters). The best value for k when using the PCA data appears to be 4 as well. Below is the elbow graph for the PCA model.
Inertia is smaller on the PCA data than on the original scaled data for the same value of k. For example, at k=4 the inertia is about 50 for the PCA data, compared with around 79 for the original scaled data. The best value of k, however, is the same in both cases.
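One possible way to see why inertia drops on the PCA data: projecting onto a few principal components can only shrink distances between points, so for any fixed grouping the within-cluster sum of squares cannot grow. The sketch below checks this on synthetic data; within_cluster_ss is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled data (assumption: 7 features).
rng = np.random.default_rng(3)
scaled = rng.normal(size=(41, 7))


def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


# Cluster in the original space, then reuse the same labels after projection.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)
projected = PCA(n_components=3).fit_transform(scaled)

original_ss = within_cluster_ss(scaled, labels)
projected_ss = within_cluster_ss(projected, labels)
print(original_ss, projected_ss)
```

A lower inertia on the PCA data therefore reflects the discarded variance at least in part, and does not by itself show that the clusters are better separated.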
Using the optimal number of clusters found in the previous section, we clustered the PCA data into 4 clusters with KMeans, following the same steps as before. The resulting clusters are seen below.
We use PCA to retain most of the important information from the features while working with fewer dimensions. In this case, three principal components capture most of the variance in our data. PCA is a useful technique for reducing the dimensionality of the data so that it can be clustered more efficiently and the clusters can be visualized. Without dimensionality reduction, it is difficult to visualize the clusters because there are too many features to plot. Below is the 3D plot of the clusters using the three PCs.
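A 3D scatter of this kind could be sketched with matplotlib as shown here. This is an assumption about tooling, since the write-up does not say which library produced the original 3D plot. The data is synthetic and the output file name is a placeholder.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled data (assumption: 7 features).
rng = np.random.default_rng(4)
scaled = rng.normal(size=(41, 7))

# Reduce to three principal components, then cluster in that space.
pcs = PCA(n_components=3).fit_transform(scaled)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pcs)

# One point per coin, positioned by its three PCs and coloured by cluster.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2], c=labels)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
fig.savefig("pca_clusters_3d.png")  # placeholder file name
```

In a Jupyter Notebook, the backend line can be dropped so the figure renders inline.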








