<a href="https://colab.research.google.com/github/cheshtabiala/PyCaret_clustering/blob/main/MLusing_Pycaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Clustering**
---
- PyCaret is a bundle of many machine learning algorithms.
- Only three lines of code are required to compare 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self learning resource**
1. Tutorial on PyCaret: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html">Click Here</a>**

2. Documentation on PyCaret clustering: **<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html">Click Here</a>**

---


### **(a) Install Pycaret**

In [1]:
!pip install pycaret &> /dev/null
print("PyCaret installed successfully!!")
!pip install ucimlrepo &> /dev/null
print("ucimlrepo installed successfully")

PyCaret installed successfully!!
ucimlrepo installed successfully


### **(b) Get the PyCaret version**

In [None]:
from pycaret.utils import version
version()

In [None]:
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import fetch_openml

DataSet = fetch_openml(name='iris', version=1,as_frame=True)

In [None]:
# Features
X = DataSet.data

# Target variable
y = DataSet.target



In [None]:
from pycaret.clustering import *
model = setup(X, verbose = False)

In [None]:

models_list = model.models().Name.index
print(models_list)

In [None]:
models_list = models_list[[0,2,4,5]]
print('Clustering Models Taken: ', models_list)

In [None]:
parameters = {
    'No Data Processing': {'transformation': False, 'normalize': False, 'pca': False},
    'Using Normalisation': {'transformation': False, 'normalize': True, 'pca': False},
    'Using Transform': {'transformation': True, 'normalize': False, 'pca': False},
    'Using PCA': {'transformation': False, 'normalize': False, 'pca': True},
    'T+N': {'transformation': True, 'normalize': True, 'pca': False},
    'T+N+PCA': {'transformation': True, 'normalize': True, 'pca': True},
}
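
The hand-written dictionary above covers six of the eight possible on/off combinations of the three flags. As a minimal sketch (not part of the original notebook), the full grid could be generated with `itertools.product`; the names `flags` and `grid` and the generated labels are illustrative assumptions.

```python
from itertools import product

# The three preprocessing switches used in the notebook's `parameters` dict.
flags = ("transformation", "normalize", "pca")

grid = {}
for values in product((False, True), repeat=len(flags)):
    args = dict(zip(flags, values))
    enabled = [flag for flag in flags if args[flag]]
    # Label each combination by the switches that are turned on.
    name = " + ".join(enabled) if enabled else "No Data Processing"
    grid[name] = args

print(len(grid))
```

Each value in `grid` has the same shape as the entries of `parameters`, so it could be unpacked into `setup(X, **args)` in the same way.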

In [None]:
for model_id in models_list:
    # Collect the metrics of every (preprocessing, cluster size) combination for this model
    model_results = pd.DataFrame()
    for cluster_size in range(3, 6):
        for name, args in parameters.items():
            exp = setup(X, verbose=False, **args)
            create_model(model_id, num_clusters=cluster_size, verbose=False)
            temp = exp.pull()  # metrics table of the model just created
            temp['name'] = name
            temp['cluster_size'] = cluster_size
            model_results = pd.concat([model_results, temp], ignore_index=True)
    model_results.set_index(['name', 'cluster_size'], inplace=True)
    model_results_transposed = model_results.sort_index().T
    # Keep the first three metric rows and save them, one CSV per model
    model_results_transposed.iloc[:3, :].to_csv(model_id + '.csv')
    print(model_id)
    display(model_results_transposed.iloc[:3, :])


In [None]:

from google.colab import files

for model_id in models_list:
    # Same experiment as above, but each CSV is also downloaded from Colab
    model_results = pd.DataFrame()
    for cluster_size in range(3, 6):
        for name, args in parameters.items():
            exp = setup(X, verbose=False, **args)
            create_model(model_id, num_clusters=cluster_size, verbose=False)
            temp = exp.pull()  # metrics table of the model just created
            temp['name'] = name
            temp['cluster_size'] = cluster_size
            model_results = pd.concat([model_results, temp], ignore_index=True)
    model_results.set_index(['name', 'cluster_size'], inplace=True)
    model_results_transposed = model_results.sort_index().T
    csv_filename = model_id + '.csv'
    model_results_transposed.iloc[:3, :].to_csv(csv_filename)
    # Download CSV file
    files.download(csv_filename)
    print(f"Downloaded {csv_filename}")
    display(model_results_transposed.iloc[:3, :])


---
# **1. Clustering - Part 1 (K-Means Clustering)**
---
### **1.1 Get the list of datasets available in pycaret (55)**

In [None]:
from pycaret.datasets import get_data
dataSets = get_data('index')

---
### **1.2 Get the "forest" dataset**
---

In [None]:
DataSet = get_data("forest")


---
### **1.3 Download the "forest" dataset to local system**
---

In [None]:
DataSet.to_csv("DataSet.csv")
from google.colab import files
files.download('DataSet.csv')

---
### **1.4 "Parameter setting" for clustering model**
##### **Train/Test division, applying data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [None]:
from pycaret.clustering import *
kMeanClusteringParameters = setup(DataSet)


---
### **1.5 Building "KMean" clustering model**
---

In [None]:
KMeanClusteringModel = create_model('kmeans', num_clusters=4)

---
### **1.6 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

---
### **1.7 Clustering in "Three lines of code"**
---

In [None]:
from pycaret.clustering import *

kMeanClusteringParameters = setup(DataSet, verbose=False)
KMeanClusteringModel = create_model('kmeans', num_clusters=4)
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

---
### **1.8 "Saving" the result**
---



In [None]:
kMeanPrediction.to_csv("KMeanResult.csv")
print("Result file saved successfully!!")

---
### **1.9 Download the "result file" to user local system**
---

In [None]:
from google.colab import files
files.download('KMeanResult.csv')


---
# **2. Clustering: Saving and Loading the Model**
---
### **2.1 Save the "trained model"**
---

In [None]:
x = save_model(KMeanClusteringModel, 'kMeanClusteringModelFile')

---
### **2.2 Download the "trained model"**
---

In [None]:
from google.colab import files
files.download('kMeanClusteringModelFile.pkl')

---
### **2.3 Load the model**
---
##### Use this when working in **"Anaconda/Jupyter notebook"** on a local machine

In [None]:
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile')
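
The save and load cells pass the file name without an extension, while the download cell refers to `kMeanClusteringModelFile.pkl`. The sketch below mimics that naming convention with the standard `pickle` module. It is an illustration under stated assumptions, not PyCaret's implementation: `save_object`, `load_object`, and the dictionary standing in for a trained model are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def save_object(obj, name, folder):
    """Pickle `obj` to '<name>.pkl' inside `folder` and return the path."""
    path = Path(folder) / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path

def load_object(name, folder):
    """Load the object stored in '<name>.pkl' inside `folder`."""
    path = Path(folder) / f"{name}.pkl"
    with path.open("rb") as fh:
        return pickle.load(fh)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a trained model object.
    stand_in = {"algorithm": "kmeans", "num_clusters": 4}
    saved_path = save_object(stand_in, "demoModelFile", tmp)
    restored = load_object("demoModelFile", tmp)

print(saved_path.name, restored == stand_in)
```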

---
### **2.4 Upload and Load the trained model to "Colab Environment"**
---
##### **Upload the trained model**

In [None]:
from google.colab import files
files.upload()                     # Opens a file picker; select 'kMeanClusteringModelFile.pkl'

##### **Load the trained model**

In [None]:
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile')

---
# **3. Clustering: Cluster the new dataset (Unseen Data)**
---
### **3.1 Select some data or upload user dataset file**

In [None]:
# Select top 10 rows
newData = get_data("forest").iloc[:10]

---
### **3.2 Make prediction on the new dataset (Unseen Data)**
---

In [None]:
newPredictions = predict_model(KMeanClusteringModel, data = newData)
newPredictions
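
For a K-Means model, labelling unseen rows amounts to assigning each row to the nearest learned cluster centre. The following is a minimal standard-library sketch of that idea only. It assumes plain Euclidean distance and ignores any preprocessing a real pipeline would apply first; `nearest_centroid` and the sample coordinates are hypothetical.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centre) for centre in centroids]
    return distances.index(min(distances))

centroids = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical cluster centres
new_rows = [(1.0, 2.0), (9.0, 8.5)]      # hypothetical unseen rows
labels = [nearest_centroid(row, centroids) for row in new_rows]
print(labels)
```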

---
### **3.3 Save the prediction result to csv**
---

In [None]:
newPredictions.to_csv("NewPredictions.csv")
print("Result file saved successfully!!")

---
# **4. Clustering: Plotting the Clusters**
---
```
- Cluster PCA Plot (2d)          'cluster'
- Cluster t-SNE (3d)             'tsne'
- Elbow Plot                     'elbow'
- Silhouette Plot                'silhouette'
- Distance Plot                  'distance'
- Distribution Plot              'distribution'
```

---
### **4.1 Evaluate Cluster Model**
---

In [None]:
evaluate_model(KMeanClusteringModel)

---
### **4.2 2D-plot for Cluster**
---

In [None]:
plot_model(KMeanClusteringModel, plot='cluster')

---
### **4.3 3D-plot for Cluster**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'tsne')

---
### **4.4 Elbow Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'elbow')
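
An elbow plot tracks how a distortion measure falls as the number of clusters grows. The sketch below assumes the measure is the within-cluster sum of squared distances, a common choice, and computes it for a fixed labelling using only the standard library. The helper `within_cluster_ss` and the sample points are hypothetical, and this is not the plotting library's own code.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    total = 0.0
    for members in groups.values():
        dims = len(members[0])
        # Cluster centre = coordinate-wise mean of the members.
        centre = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centre[d]) ** 2 for p in members for d in range(dims))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = within_cluster_ss(points, [0, 0, 0, 0])
two_clusters = within_cluster_ss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

Splitting the four points into their two natural groups sharply reduces the value. A drop of that kind, followed by much smaller gains for further clusters, is what the "elbow" refers to.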

---
### **4.5 Silhouette Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'silhouette')
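
The silhouette coefficient of a point compares its mean distance to its own cluster (`a`) with its mean distance to the nearest other cluster (`b`), as `(b - a) / max(a, b)`. The sketch below is a plain-Python illustration of that definition with Euclidean distance. `silhouette_scores` and the sample data are hypothetical, and the code is meant for reading, not for large datasets.

```python
import math

def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate compact, well-separated clusters. The deliberately poor labelling in `bad` yields a negative average.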

---
### **4.6 Distribution Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'distribution')

---
# **5. Complete Code for Clustering (K-Means)**
---
### **5.1 For Cluster = 3, 4, 5, 6**

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **5.2 Other Clustering Algorithms**
---
```
- K-Means clustering                 'kmeans'
- Affinity Propagation               'ap'
- Mean shift clustering              'meanshift'
- Spectral Clustering                'sc'
- Agglomerative Clustering           'hclust'
- Density-Based Spatial Clustering   'dbscan'
- OPTICS Clustering                  'optics'
- Birch Clustering                   'birch'
- K-Modes clustering                 'kmodes'
```

---
# **6. Clustering: Apply "Data Preprocessing"**
---
### **Read the Dataset**

In [None]:
from pycaret.clustering import *
from pycaret.datasets import get_data

DataSet = get_data('forest')

---
### **6.1 Model Performance using "Normalization"**
---
### **6.1.1 Elbow Plot**


In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
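
As a rough illustration of what z-score normalization does to a single numeric column, the sketch below rescales values to zero mean and unit standard deviation with the standard library. It assumes the population standard deviation is used; `zscore` and the sample values are hypothetical, and this is not PyCaret's own code.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = zscore([2.0, 4.0, 6.0, 8.0])
print(scaled)
```

Because K-Means relies on distances, putting every feature on the same scale keeps columns with large numeric ranges from dominating the clustering.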

---
### **6.1.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)


---
### **6.1.3 3D Plot for Cluster = 6**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans', num_clusters = 6)
plot_model(x, plot = 'tsne')

---
### **6.2 Model Performance using "Transformation"**
---

### **6.2.1 Elbow Plot**


In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
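
The Yeo-Johnson transform is a power transform that, unlike Box-Cox, also accepts zero and negative values. The sketch below applies its standard formula to one value for a hand-picked `lam`. In practice the parameter is estimated from the data, so the fixed values here are an assumption for illustration, and `yeo_johnson` is a hypothetical helper.

```python
import math

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transform to one value for a fixed lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                      # lambda == 0
        return ((y + 1.0) ** lam - 1.0) / lam         # lambda != 0
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                        # lambda == 2
    return -(((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam))

# lambda = 1 leaves values unchanged; smaller lambdas compress large positives.
for value in (-3.0, 0.0, 3.0):
    print(value, yeo_johnson(value, 1.0), yeo_johnson(value, 0.0))
```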

---
### **6.2.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.3 Model Performance using "PCA"**
---
### **6.3.1 Elbow Plot**

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.3.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.4 Model Performance using "Transformation" + "Normalization"**
---
### **6.4.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.4.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.5 Model Performance using "Transformation" + "Normalization" + "PCA"**
---
### **6.5.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.5.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
# **7. Other Clustering Techniques**
---
```
K-Means clustering                 'kmeans'
Affinity Propagation               'ap'
Mean shift clustering              'meanshift'
Spectral Clustering                'sc'
Agglomerative Clustering           'hclust'
Density-Based Spatial Clustering   'dbscan'
OPTICS Clustering                  'optics'
Birch Clustering                   'birch'
K-Modes clustering                 'kmodes'
```

---
### **7.1 Building Agglomerative (Hierarchical) clustering model**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

x = create_model('hclust')
plot_model(x, plot = 'elbow')

---
### **7.1.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
hierarchicalModel = create_model('hclust', num_clusters=3)
hierarchicalModelPrediction = assign_model(hierarchicalModel)
hierarchicalModelPrediction

---
### **7.1.2 Evaluate Agglomerative (Hierarchical) Clustering**
---

In [None]:
evaluate_model(hierarchicalModel)

---
### **7.2 Density-Based Spatial Clustering**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)
dbscanModel = create_model('dbscan')

---
### **7.2.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
dbscanModelPrediction = assign_model(dbscanModel)
dbscanModelPrediction

# Noisy samples are given the label -1 i.e. 'Cluster -1'

---
### **7.3 Mean Shift Clustering**
---


In [None]:
from pycaret.clustering import *

meanshiftClusteringParameters = setup(DataSet, verbose=False)
# Mean shift determines the number of clusters itself, so num_clusters is omitted
meanshiftClusteringModel = create_model('meanshift')
meanshiftClusteringPrediction = assign_model(meanshiftClusteringModel)
meanshiftClusteringPrediction

In [None]:
evaluate_model(meanshiftClusteringModel)

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('meanshift', num_clusters = 6)
plot_model(x, plot = 'tsne')

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
from pycaret.clustering import *
model = setup(X, verbose = False)

### **Key Points**

- `num_clusters` is not required for some of the clustering algorithms: Affinity Propagation ('ap'), Mean shift
  clustering ('meanshift'), Density-Based Spatial Clustering ('dbscan') and OPTICS Clustering ('optics').
- For these models, the number of clusters is determined automatically.

- When the fit doesn't converge in the Affinity Propagation ('ap') model, all data points are labelled -1.

- Noisy samples are given the label -1 when using Density-Based Spatial Clustering ('dbscan') or OPTICS Clustering ('optics').

- OPTICS ('optics') clustering may take a long time to train on large datasets.
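
Since noisy samples carry the label -1, it is often worth checking how large that group is before interpreting the clusters. The sketch below counts labels with the standard library. It assumes label strings of the form `'Cluster -1'`, as noted in the DBSCAN cell above; `summarize_clusters` and the sample labels are hypothetical.

```python
from collections import Counter

def summarize_clusters(labels, noise_label="Cluster -1"):
    """Count rows per cluster label and return the share labelled as noise."""
    counts = Counter(labels)
    noise = counts.get(noise_label, 0)
    share = noise / len(labels) if labels else 0.0
    return counts, share

# Hypothetical labels, shaped like a 'Cluster' column of assigned labels.
labels = ["Cluster 0", "Cluster -1", "Cluster 0", "Cluster 1", "Cluster -1"]
counts, noise_share = summarize_clusters(labels)
print(counts["Cluster 0"], noise_share)
```

A large noise share may suggest that the density parameters need tuning or that the data should be normalized before clustering.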


---
# **8. Deploy the model on AWS**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html#pycaret.clustering.deploy_model">Click Here</a>**