<a href="https://colab.research.google.com/github/cheshtabiala/PyCaret_clustering/blob/main/MLusing_Pycaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Clustering**
---
- PyCaret is a library that bundles many machine learning algorithms.
- Only three lines of code are required to compare 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self-learning resources**
1. Tutorials on PyCaret: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html"> Click Here</a>**

2. Documentation on PyCaret clustering: **<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html"> Click Here </a>**

---


### **(a) Install PyCaret**

In [3]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")

PyCaret installed successfully!!


### **(b) Get the PyCaret version**

In [4]:
from pycaret.utils import version
version()

'3.2.0'

---
# **1. Clustering - Part 1 (K-Means Clustering)**
---
### **1.1 Get the list of datasets available in PyCaret (55)**

In [5]:
from pycaret.datasets import get_data
dataSets = get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


---
### **1.2 Get the "forest" dataset**
---

In [6]:
DataSet = get_data("forest")    # SN is 30
# The dataset is used here in an unsupervised way:
# no target column is passed to setup().

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


---
### **1.3 Download the "forest" dataset to the local system**
---

In [7]:
DataSet.to_csv("DataSet.csv")
from google.colab import files
files.download('DataSet.csv')


---
### **1.4 "Parameter setting" for the clustering model**
##### **Train/Test division, applying data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [8]:
from pycaret.clustering import *
kMeanClusteringParameters = setup(DataSet)


Unnamed: 0,Description,Value
0,Session id,6865
1,Original data shape,"(517, 13)"
2,Transformed data shape,"(517, 30)"
3,Numeric features,11
4,Categorical features,2
5,Preprocess,True
6,Imputation type,simple
7,Numeric imputation,mean
8,Categorical imputation,mode
9,Maximum one-hot encoding,-1
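
The jump from 13 original columns to 30 transformed columns in the table above is consistent with one-hot encoding the two categorical features. A minimal sketch of that arithmetic, assuming `month` has 12 distinct values and `day` has 7 (these counts are assumptions, not read from the output) and that each category becomes its own column:

```python
# Sketch: expected column count after one-hot encoding the categorical features.
# The level counts below are assumptions (12 months, 7 weekdays).
numeric_features = ["X", "Y", "FFMC", "DMC", "DC", "ISI",
                    "temp", "RH", "wind", "rain", "area"]
categorical_levels = {"month": 12, "day": 7}


def transformed_width(numeric, levels):
    """Numeric columns pass through; each categorical level adds one column."""
    return len(numeric) + sum(levels.values())


width = transformed_width(numeric_features, categorical_levels)
print(width)  # 30
```

With 11 numeric features this gives 11 + 12 + 7 = 30, which matches the reported transformed shape.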


---
### **1.5 Building the "K-Means" clustering model**
---

In [9]:
KMeanClusteringModel = create_model('kmeans', num_clusters=4)

Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.6238,1404.4596,0.4654,0,0,0



---
### **1.6 Assign Model - "Assign the labels" to the dataset**
---



In [10]:
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,Cluster
0,7,5,mar,fri,86.199997,26.200001,94.300003,5.1,8.200000,51,6.7,0.0,0.000000,Cluster 1
1,7,4,oct,tue,90.599998,35.400002,669.099976,6.7,18.000000,33,0.9,0.0,0.000000,Cluster 3
2,7,4,oct,sat,90.599998,43.700001,686.900024,6.7,14.600000,33,1.3,0.0,0.000000,Cluster 3
3,8,6,mar,fri,91.699997,33.299999,77.500000,9.0,8.300000,97,4.0,0.2,0.000000,Cluster 1
4,8,6,mar,sun,89.300003,51.299999,102.199997,9.6,11.400000,99,1.8,0.0,0.000000,Cluster 1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,4,3,aug,sun,81.599998,56.700001,665.599976,1.9,27.799999,32,2.7,0.0,6.440000,Cluster 3
513,2,4,aug,sun,81.599998,56.700001,665.599976,1.9,21.900000,71,5.8,0.0,54.290001,Cluster 3
514,7,4,aug,sun,81.599998,56.700001,665.599976,1.9,21.200001,70,6.7,0.0,11.160000,Cluster 3
515,1,4,aug,sat,94.400002,146.000000,614.700012,11.3,25.600000,42,4.0,0.0,0.000000,Cluster 3


---
### **1.7 Clustering in "Three lines of code"**
---

In [11]:
from pycaret.clustering import *

kMeanClusteringParameters = setup(DataSet, verbose=False)
KMeanClusteringModel = create_model('kmeans', num_clusters=4)
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.6238,1404.4596,0.4654,0,0,0



Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,Cluster
0,7,5,mar,fri,86.199997,26.200001,94.300003,5.1,8.200000,51,6.7,0.0,0.000000,Cluster 1
1,7,4,oct,tue,90.599998,35.400002,669.099976,6.7,18.000000,33,0.9,0.0,0.000000,Cluster 0
2,7,4,oct,sat,90.599998,43.700001,686.900024,6.7,14.600000,33,1.3,0.0,0.000000,Cluster 0
3,8,6,mar,fri,91.699997,33.299999,77.500000,9.0,8.300000,97,4.0,0.2,0.000000,Cluster 1
4,8,6,mar,sun,89.300003,51.299999,102.199997,9.6,11.400000,99,1.8,0.0,0.000000,Cluster 1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,4,3,aug,sun,81.599998,56.700001,665.599976,1.9,27.799999,32,2.7,0.0,6.440000,Cluster 0
513,2,4,aug,sun,81.599998,56.700001,665.599976,1.9,21.900000,71,5.8,0.0,54.290001,Cluster 0
514,7,4,aug,sun,81.599998,56.700001,665.599976,1.9,21.200001,70,6.7,0.0,11.160000,Cluster 0
515,1,4,aug,sat,94.400002,146.000000,614.700012,11.3,25.600000,42,4.0,0.0,0.000000,Cluster 0
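
The same rows are labelled `Cluster 3` in section 1.6 and `Cluster 0` here. K-Means label numbers are arbitrary, so two runs can produce the same grouping under different names; passing a fixed `session_id` to `setup()` is the usual way to make runs repeatable. The sketch below checks whether two labelings describe the same partition, using the first five labels from each output. The helper `same_partition` is a hypothetical name introduced for this illustration.

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group the rows identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


run_1 = ["Cluster 1", "Cluster 3", "Cluster 3", "Cluster 1", "Cluster 1"]
run_2 = ["Cluster 1", "Cluster 0", "Cluster 0", "Cluster 1", "Cluster 1"]
print(same_partition(run_1, run_2))  # True
```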


---
### **1.8 "Saving" the result**
---



In [12]:
kMeanPrediction.to_csv("KMeanResult.csv")
print("Result file saved successfully!!")

Result file saved successfully!!


---
### **1.9 Download the "result file" to the user's local system**
---

In [13]:
from google.colab import files
files.download('KMeanResult.csv')
# Open and Explore result file (KMeanResult.csv).


---
# **2. Clustering: Saving and Loading the Model**
---
### **2.1 Save the "trained model"**
---

In [14]:
x = save_model(KMeanClusteringModel, 'kMeanClusteringModelFile')

Transformation Pipeline and Model Successfully Saved


---
### **2.2 Download the "trained model"**
---

In [15]:
from google.colab import files
files.download('kMeanClusteringModelFile.pkl')


---
### **2.3 Load the model**
---
##### Use this when working in **"Anaconda/Jupyter notebook"** on a local machine

In [16]:
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile')

Transformation Pipeline and Model Successfully Loaded


---
### **2.4 Upload and Load the trained model to "Colab Environment"**
---
##### **Upload the trained model**

In [None]:
from google.colab import files
files.upload()                     # Opens a file picker; choose the saved .pkl file

##### **Load the trained model**

In [None]:
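# Pass the uploaded file's name without the ".pkl" extension.
# Colab appends " (1)" when a file with the same name already exists.
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile (1)')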

---
# **3. Clustering: Cluster the new dataset (Unseen Data)**
---
### **3.1 Select some data or upload your own dataset file**

In [None]:
# Select top 10 rows
newData = get_data("forest").iloc[:10]

---
### **3.2 Make prediction on the new dataset (Unseen Data)**
---

In [None]:
newPredictions = predict_model(KMeanClusteringModel, data = newData)
newPredictions
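
For K-Means, labelling unseen rows comes down to finding the nearest centroid once the rows have been preprocessed the same way as the training data. A minimal sketch of that step, with made-up centroids and rows (hypothetical values, not taken from the fitted model):

```python
import math


def nearest_centroid(row, centroids):
    """Return the index of the centroid closest to `row` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(row, centroids[j]))


# Hypothetical 2-D centroids and new rows, for illustration only.
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_rows = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]

assigned = [nearest_centroid(row, centroids) for row in new_rows]
print(assigned)  # [0, 1, 0]
```

No refitting happens at this stage: the centroids stay fixed and only the assignment is computed.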

---
### **3.3 Save the prediction result to csv**
---

In [None]:
newPredictions.to_csv("NewPredictions.csv")
print("Result file saved successfully!!")

---
# **4. Clustering: Plotting the Clusters**
---
```
- Cluster PCA Plot (2d)          'cluster'
- Cluster t-SNE (3d)             'tsne'
- Elbow Plot                     'elbow'
- Silhouette Plot                'silhouette'
- Distance Plot                  'distance'
- Distribution Plot              'distribution'
```

---
### **4.1 Evaluate Cluster Model**
---

In [None]:
evaluate_model(KMeanClusteringModel)

---
### **4.2 2D-plot for Cluster**
---

In [None]:
plot_model(KMeanClusteringModel, plot='cluster')

---
### **4.3 3D-plot for Cluster**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'tsne')

---
### **4.4 Elbow Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'elbow')
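
An elbow plot shows how the within-cluster sum of squared distances (inertia) falls as the number of clusters grows; the "elbow" is the point after which extra clusters bring little improvement. The sketch below reproduces the idea with a small pure-Python K-Means on made-up 1-D data with three obvious groups. The data, the deterministic initialisation and the function name `kmeans_inertia` are assumptions made for this illustration, not what `plot_model` runs internally.

```python
def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm on 1-D data and return the final inertia."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start: k evenly spaced points from the sorted data.
    centroids = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: squared distance from each point to its nearest centroid.
    return sum(min((x - c) ** 2 for c in centroids) for x in pts)


data = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 13.0, 13.5, 14.0]
inertias = [kmeans_inertia(data, k) for k in range(1, 5)]
for k, value in zip(range(1, 5), inertias):
    print(k, round(value, 3))
```

The inertia drops sharply up to k = 3 and barely changes at k = 4, so the elbow for this toy data sits at three clusters.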

---
### **4.5 Silhouette Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'silhouette')

---
### **4.6 Distribution Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'distribution')

---
### **4.7 Distance Plot**
---

In [None]:
# plot_model(KMeanClusteringModel, plot = 'distance') # Rerun the code

---
# **5. Complete Code for Clustering (K-Means)**
---
### **5.1 For Cluster = 3, 4, 5, 6**

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **5.2 Other Clustering Algorithms**
---
```
- K-Means clustering                 'kmeans'
- Affinity Propagation               'ap'
- Mean shift clustering              'meanshift'
- Spectral Clustering                'sc'
- Agglomerative Clustering           'hclust'
- Density-Based Spatial Clustering   'dbscan'
- OPTICS Clustering                  'optics'
- Birch Clustering                   'birch'
- K-Modes clustering                 'kmodes'
```

---
# **6. Clustering: Apply "Data Preprocessing"**
---
### **Read the Dataset**

In [None]:
from pycaret.clustering import *
from pycaret.datasets import get_data

DataSet = get_data('forest')

---
### **6.1 Model Performance using "Normalization"**
---
### **6.1.1 Elbow Plot**


In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
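
Z-score normalization rescales each feature to mean 0 and standard deviation 1, so that features with large ranges (such as `DC`) do not dominate the distance calculations over features with small ranges (such as `rain`). A minimal sketch on the first five `DC` and `rain` values shown in section 1.2, assuming the population standard deviation is used:

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std (ddof = 0)
    return [(v - mu) / sigma for v in values]


dc = [94.3, 669.1, 686.9, 77.5, 102.2]  # first five DC values
rain = [0.0, 0.0, 0.0, 0.2, 0.0]        # first five rain values

dc_z = zscore(dc)
rain_z = zscore(rain)
print([round(z, 2) for z in dc_z])
print([round(z, 2) for z in rain_z])
```

After scaling, both columns are on a comparable scale even though their raw ranges differ by several orders of magnitude. A constant column would have a standard deviation of 0 and needs separate handling.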

---
### **6.1.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)


---
### **6.1.3 3D Plot for Cluster = 5**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans', num_clusters = 5)
plot_model(x, plot = 'tsne')

---
### **6.2 Model Performance using "Transformation"**
---

### **6.2.1 Elbow Plot**


In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
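
The Yeo-Johnson transform is a power transform that works for both positive and negative values and is used to make skewed features look more Gaussian. The sketch below implements the formula for a fixed exponent `lmbda`. In practice the exponent is normally estimated from the data for each feature; the fixed values here are chosen only for illustration.

```python
import math


def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transform of a single value for a fixed exponent."""
    if y >= 0:
        if lmbda == 0:
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-y)
    return -(((1 - y) ** (2 - lmbda) - 1) / (2 - lmbda))


# A few `area` values from the output in section 1.6; lmbda = 0 is an arbitrary choice.
area = [0.0, 6.44, 11.16, 54.29]
transformed = [yeo_johnson(v, 0) for v in area]
print([round(t, 3) for t in transformed])
```

With `lmbda = 1` the transform leaves values unchanged, while smaller exponents compress the long right tail that `area` shows.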

---
### **6.2.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.3 Model Performance using "PCA"**
---
### **6.3.1 Elbow Plot**

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.3.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.4 Model Performance using "Transformation" + "Normalization"**
---
### **6.4.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.4.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.5 Model Performance using "Transformation" + "Normalization" + "PCA"**
---
### **6.5.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.5.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
# **7. Other Clustering Techniques**
---
```
K-Means clustering                 'kmeans'
Affinity Propagation               'ap'
Mean shift clustering              'meanshift'
Spectral Clustering                'sc'
Agglomerative Clustering           'hclust'
Density-Based Spatial Clustering   'dbscan'
OPTICS Clustering                  'optics'
Birch Clustering                   'birch'
K-Modes clustering                 'kmodes'
```

---
### **7.1 Building the Agglomerative (Hierarchical) clustering model**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

x = create_model('hclust')
plot_model(x, plot = 'elbow')

---
### **7.1.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
hierarchicalModel = create_model('hclust', num_clusters=3)
hierarchicalModelPrediction = assign_model(hierarchicalModel)
hierarchicalModelPrediction

---
### **7.1.2 Evaluate Agglomerative (Hierarchical) Clustering**
---

In [None]:
evaluate_model(hierarchicalModel)

---
### **7.2 Density-Based Spatial Clustering**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)
dbscanModel = create_model('dbscan')

---
### **7.2.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
dbscanModelPrediction = assign_model(dbscanModel)
dbscanModelPrediction

# Noisy samples are given the label -1 i.e. 'Cluster -1'

### **Key Points**

- num_clusters is not required for some of the clustering algorithms: Affinity Propagation ('ap'), Mean shift
  clustering ('meanshift'), Density-Based Spatial Clustering ('dbscan') and OPTICS Clustering ('optics').
- For these models, the number of clusters is determined automatically.

- When fit doesn't converge in Affinity Propagation ('ap') model, all datapoints are labelled as -1.

- Noisy samples are given the label -1 when using Density-Based Spatial Clustering ('dbscan') or OPTICS Clustering ('optics').

- OPTICS ('optics') clustering may take longer training times on large datasets.
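
To make the -1 noise label concrete, the sketch below is a small pure-Python DBSCAN for 1-D data. It is a simplified illustration with made-up points and hypothetical parameter values (`eps = 0.3`, `min_samples = 3`), not the implementation that PyCaret calls.

```python
def dbscan_1d(points, eps, min_samples):
    """Label 1-D points with cluster ids 0, 1, ...; noise gets -1."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor.
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


data = [1.0, 1.1, 1.2, 5.0, 9.0, 9.1, 9.2]
labels = dbscan_1d(data, eps=0.3, min_samples=3)
print(labels)  # [0, 0, 0, -1, 1, 1, 1]
```

The isolated point at 5.0 has too few neighbors to join or start a cluster, so it keeps the label -1, and the number of clusters (two here) follows from the data rather than from a num_clusters argument.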


---
# **8. Deploy the model on AWS**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html#pycaret.clustering.deploy_model">Click Here</a>**