<a href="https://colab.research.google.com/github/arshiyaakishore/Pycaret_Clustering_Assignment/blob/main/MLusing_Pycaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# **PyCaret for Clustering**
---
- It is a bundle of many Machine Learning algorithms.
- Only three lines of code are required to compare 20 ML models.
- PyCaret is available for:
    - Classification
    - Regression
    - Clustering

---

### **Self-learning resources**
1. Tutorial on PyCaret: **<a href="https://pycaret.readthedocs.io/en/latest/tutorials.html">Click Here</a>**

2. Documentation on PyCaret Clustering: **<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html">Click Here</a>**

---


### **(a) Install Pycaret**

In [1]:
!pip install pycaret &> /dev/null
print("Pycaret installed")
!pip install ucimlrepo &> /dev/null
print("ucimlrepo installed")

Pycaret installed
ucimlrepo installed

In [2]:
!pip install ucimlrepo &> /dev/null
print("ucimlrepo installed successfully")

ucimlrepo installed successfully

### **(b) Get the PyCaret version**

In [3]:
from pycaret.utils import version
version()

'3.3.0'

In [4]:
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

In [5]:
from sklearn.datasets import fetch_openml

DataSet = fetch_openml(name='iris', version=1,as_frame=True)

In [6]:
# Features
X = DataSet.data

# Target variable
y = DataSet.target



In [7]:
from pycaret.clustering import *
model = setup(X, verbose = False)

In [8]:

models_list = model.models().Name.index
print(models_list)

Index(['kmeans', 'ap', 'meanshift', 'sc', 'hclust', 'dbscan', 'optics',
       'birch'],
      dtype='object', name='ID')


In [9]:
models_list = models_list[[0,2,4,5]]
print('Clustering Models Taken: ', models_list)

Clustering Models Taken:  Index(['kmeans', 'meanshift', 'hclust', 'dbscan'], dtype='object', name='ID')


In [10]:
parameters = {
    'No Data Processing': {'transformation': False, 'normalize': False, 'pca': False},
    'Using Normalisation': {'transformation': False, 'normalize': True, 'pca': False},
    'Using Transform': {'transformation': True, 'normalize': False, 'pca': False},
    'Using PCA': {'transformation': False, 'normalize': False, 'pca': True},
    'T+N': {'transformation': True, 'normalize': True, 'pca': False},
    'T+N+PCA': {'transformation': True, 'normalize': True, 'pca': True},
}
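
The six named configurations above are a hand-picked subset of the eight possible on/off combinations of the three flags. As a minimal sketch (not part of the original notebook), the full grid could be enumerated with `itertools.product`; the function name and the generated label format are assumptions made for illustration.

```python
from itertools import product

FLAGS = ("transformation", "normalize", "pca")

def all_preprocessing_configs():
    """Return every on/off combination of the three setup() flags."""
    configs = {}
    for values in product([False, True], repeat=len(FLAGS)):
        args = dict(zip(FLAGS, values))
        # Hypothetical label: the flags that are switched on, joined by '+'.
        enabled = [flag for flag, on in args.items() if on]
        label = "+".join(enabled) if enabled else "none"
        configs[label] = args
    return configs

grid = all_preprocessing_configs()
for label, args in grid.items():
    print(f"{label:35s} {args}")
```

Each value in `grid` has the same shape as the entries of `parameters`, so it could be unpacked into `setup(X, verbose=False, **args)` in the same way.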

In [11]:
results = []
for model in models_list:
    model_results = pd.DataFrame()
    for cluster_size in range(3, 6):
        for name, args in parameters.items():
            exp = setup(X, verbose=False, **args)
            create_model(model, num_clusters=cluster_size, verbose=False)
            temp = exp.pull()
            temp['name'] = name
            temp['cluster_size'] = cluster_size
            model_results = pd.concat([model_results, temp], ignore_index=True)
    model_results.set_index(['name', 'cluster_size'], inplace=True)
    model_results_transposed = model_results.sort_index().T
    model_results_transposed.iloc[:3, :].to_csv(model + '.csv')
    print(model)
    display(model_results_transposed.iloc[:3, :])


kmeans


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.551,0.497,0.3697,0.4805,0.4202,0.3483,0.4575,0.3876,0.3497,0.4787,0.3993,0.3763,0.551,0.4951,0.4594,0.6537,0.5999,0.5833
Calinski-Harabasz,560.366,529.1257,456.6451,158.3778,210.0275,204.1851,242.996,207.0175,208.0562,156.1431,205.4652,168.563,560.366,527.8354,426.1936,1163.8412,1281.0988,1212.925
Davies-Bouldin,0.6664,0.778,0.9087,0.8116,0.9203,0.9486,0.8385,0.8819,0.9621,0.7868,0.8494,1.0546,0.6664,0.7642,0.9763,0.4872,0.5677,0.5398


meanshift


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.6855,0.6855,0.6855,0.5836,0.5836,0.5836,0.5836,0.5836,0.5836,0.5802,0.5802,0.5802,0.6855,0.6855,0.6855,0.7633,0.7633,0.7633
Calinski-Harabasz,508.8825,508.8825,508.8825,254.447,254.447,254.447,254.4469,254.4469,254.4469,248.9035,248.9035,248.9035,508.8823,508.8823,508.8823,762.9648,762.9648,762.9648
Davies-Bouldin,0.3893,0.3893,0.3893,0.5931,0.5931,0.5931,0.5931,0.5931,0.5931,0.5976,0.5976,0.5976,0.3893,0.3893,0.3893,0.2606,0.2606,0.2606


hclust


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.5541,0.4887,0.4842,0.4462,0.4251,0.3625,0.4462,0.4251,0.3625,0.4455,0.3993,0.355,0.5541,0.4887,0.4842,0.6503,0.5885,0.5807
Calinski-Harabasz,556.841,513.7721,487.0704,223.4591,200.0175,199.3025,223.459,200.0175,199.3024,220.2605,198.7303,194.9616,556.8412,513.7722,487.0705,1138.6493,1222.9414,1353.347
Davies-Bouldin,0.6566,0.7956,0.8207,0.8356,0.8637,0.8962,0.8356,0.8637,0.8962,0.8059,0.9811,0.9465,0.6566,0.7956,0.8207,0.4854,0.5736,0.5969


dbscan


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.4858,0.4858,0.4858,0.3209,0.3209,0.3209,0.3209,0.3209,0.3209,0.3492,0.3492,0.3492,0.4858,0.4858,0.4858,0.7633,0.7633,0.7633
Calinski-Harabasz,219.8703,219.8703,219.8703,87.0028,87.0028,87.0028,87.0028,87.0028,87.0028,76.9759,76.9759,76.9759,219.8702,219.8702,219.8702,762.9648,762.9648,762.9648
Davies-Bouldin,7.2228,7.2228,7.2228,5.71,5.71,5.71,5.71,5.71,5.71,6.1608,6.1608,6.1608,7.2228,7.2228,7.2228,0.2606,0.2606,0.2606
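
When reading these tables, recall that higher Silhouette and Calinski-Harabasz scores and lower Davies-Bouldin scores indicate better-separated clusters. The sketch below ranks the `kmeans` configurations at `cluster_size = 3` using values copied from the kmeans table in this output; the helper name `best_config` is hypothetical.

```python
# Silhouette and Davies-Bouldin for kmeans at cluster_size = 3,
# copied from the kmeans table printed by the loop.
kmeans_k3 = {
    "No Data Processing":  {"Silhouette": 0.5510, "Davies-Bouldin": 0.6664},
    "Using Normalisation": {"Silhouette": 0.4787, "Davies-Bouldin": 0.7868},
    "Using Transform":     {"Silhouette": 0.6537, "Davies-Bouldin": 0.4872},
    "Using PCA":           {"Silhouette": 0.5510, "Davies-Bouldin": 0.6664},
    "T+N":                 {"Silhouette": 0.4805, "Davies-Bouldin": 0.8116},
    "T+N+PCA":             {"Silhouette": 0.4575, "Davies-Bouldin": 0.8385},
}

def best_config(scores, metric, higher_is_better):
    """Return the configuration name with the best value for `metric`."""
    pick = max if higher_is_better else min
    return pick(scores, key=lambda name: scores[name][metric])

best_sil = best_config(kmeans_k3, "Silhouette", higher_is_better=True)
best_db = best_config(kmeans_k3, "Davies-Bouldin", higher_is_better=False)
print("Best by Silhouette:    ", best_sil)
print("Best by Davies-Bouldin:", best_db)
```

Note that each score is computed on the data after preprocessing, so comparisons across configurations are only indicative; a single run on one dataset should not be read as a general ranking.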


In [12]:

from google.colab import files

results = []
for model in models_list:
    model_results = pd.DataFrame()
    for cluster_size in range(3, 6):
        for name, args in parameters.items():
            exp = setup(X, verbose=False, **args)
            create_model(model, num_clusters=cluster_size, verbose=False)
            temp = exp.pull()
            temp['name'] = name
            temp['cluster_size'] = cluster_size
            model_results = pd.concat([model_results, temp], ignore_index=True)
    model_results.set_index(['name', 'cluster_size'], inplace=True)
    model_results_transposed = model_results.sort_index().T
    csv_filename = model + '.csv'
    model_results_transposed.iloc[:3, :].to_csv(csv_filename)
    # Download CSV file
    files.download(csv_filename)
    print(f"Downloaded {csv_filename}")
    display(model_results_transposed.iloc[:3, :])


Downloaded kmeans.csv


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.5526,0.4951,0.3743,0.4578,0.4217,0.3466,0.4829,0.4211,0.4067,0.4621,0.414,0.3919,0.5526,0.4974,0.431,0.6537,0.6141,0.586
Calinski-Harabasz,560.3999,527.8353,459.8423,242.1591,209.8105,204.7351,151.1124,209.198,175.1252,238.9244,203.7522,171.5073,560.4,528.1665,445.2147,1163.8412,1351.2991,1416.9005
Davies-Bouldin,0.6623,0.7642,0.9092,0.8419,0.9149,0.9592,0.71,0.9191,0.9416,0.834,0.9241,1.0614,0.6623,0.7545,0.8998,0.4872,0.5358,0.5987


Downloaded meanshift.csv


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.6855,0.6855,0.6855,0.5836,0.5836,0.5836,0.5836,0.5836,0.5836,0.5802,0.5802,0.5802,0.6855,0.6855,0.6855,0.7633,0.7633,0.7633
Calinski-Harabasz,508.8825,508.8825,508.8825,254.447,254.447,254.447,254.4469,254.4469,254.4469,248.9035,248.9035,248.9035,508.8823,508.8823,508.8823,762.9648,762.9648,762.9648
Davies-Bouldin,0.3893,0.3893,0.3893,0.5931,0.5931,0.5931,0.5931,0.5931,0.5931,0.5976,0.5976,0.5976,0.3893,0.3893,0.3893,0.2606,0.2606,0.2606


Downloaded hclust.csv


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.5541,0.4887,0.4842,0.4462,0.4251,0.3625,0.4462,0.4251,0.3625,0.4455,0.3993,0.355,0.5541,0.4887,0.4842,0.6503,0.5885,0.5807
Calinski-Harabasz,556.841,513.7721,487.0704,223.4591,200.0175,199.3025,223.459,200.0175,199.3024,220.2605,198.7303,194.9616,556.8412,513.7722,487.0705,1138.6493,1222.9414,1353.347
Davies-Bouldin,0.6566,0.7956,0.8207,0.8356,0.8637,0.8962,0.8356,0.8637,0.8962,0.8059,0.9811,0.9465,0.6566,0.7956,0.8207,0.4854,0.5736,0.5969


Downloaded dbscan.csv


name,No Data Processing,No Data Processing,No Data Processing,T+N,T+N,T+N,T+N+PCA,T+N+PCA,T+N+PCA,Using Normalisation,Using Normalisation,Using Normalisation,Using PCA,Using PCA,Using PCA,Using Transform,Using Transform,Using Transform
cluster_size,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5,3,4,5
Silhouette,0.4858,0.4858,0.4858,0.3209,0.3209,0.3209,0.3209,0.3209,0.3209,0.3492,0.3492,0.3492,0.4858,0.4858,0.4858,0.7633,0.7633,0.7633
Calinski-Harabasz,219.8703,219.8703,219.8703,87.0028,87.0028,87.0028,87.0028,87.0028,87.0028,76.9759,76.9759,76.9759,219.8702,219.8702,219.8702,762.9648,762.9648,762.9648
Davies-Bouldin,7.2228,7.2228,7.2228,5.71,5.71,5.71,5.71,5.71,5.71,6.1608,6.1608,6.1608,7.2228,7.2228,7.2228,0.2606,0.2606,0.2606


---
# **1. Clustering - Part 1 (K-Means Clustering)**
---
### **1.1 Get the list of datasets available in pycaret (55)**

In [13]:
from pycaret.datasets import get_data
dataSets = get_data('index')

Unnamed: 0,Dataset,Data Types,Default Task,Target Variable 1,Target Variable 2,# Instances,# Attributes,Missing Values
0,anomaly,Multivariate,Anomaly Detection,,,1000,10,N
1,france,Multivariate,Association Rule Mining,InvoiceNo,Description,8557,8,N
2,germany,Multivariate,Association Rule Mining,InvoiceNo,Description,9495,8,N
3,bank,Multivariate,Classification (Binary),deposit,,45211,17,N
4,blood,Multivariate,Classification (Binary),Class,,748,5,N
5,cancer,Multivariate,Classification (Binary),Class,,683,10,N
6,credit,Multivariate,Classification (Binary),default,,24000,24,N
7,diabetes,Multivariate,Classification (Binary),Class variable,,768,9,N
8,electrical_grid,Multivariate,Classification (Binary),stabf,,10000,14,N
9,employee,Multivariate,Classification (Binary),left,,14999,10,N


---
### **1.2 Get the "perfume" dataset**
---

In [14]:
DataSet = get_data("perfume")


Unnamed: 0,Perfume,Measurement1,Measurement2,Measurement3,Measurement4,Measurement5,Measurement6,Measurement7,Measurement8,Measurement9,...,Measurement19,Measurement20,Measurement21,Measurement22,Measurement23,Measurement24,Measurement25,Measurement26,Measurement27,Measurement28
0,ajayeb,64558,64556,64543,64543,64541,64543,64543,64541,64541,...,64541,64541,64541,64541,64541,64541,64528,64528,64528,64528
1,ajmal,60502,60489,61485,60487,61485,61513,60515,60500,60500,...,60472,60472,60461,61470,60487,60487,61485,60487,60472,60472
2,amreaj,57040,57040,57040,58041,58041,58041,58041,57042,57042,...,58041,58041,58041,58041,58041,58041,58041,58041,58041,58041
3,aood,71083,72087,71091,71095,71099,72103,71099,72099,72099,...,72095,71095,71095,72103,71103,71103,71103,72103,72103,72098
4,asgar_ali,68209,68209,68216,68216,68223,68223,68223,68223,68230,...,68230,67224,67217,67217,68223,68223,68223,68223,68223,68230


---
### **1.3 Download the "perfume" dataset to local system**
---

In [15]:
DataSet.to_csv("DataSet.csv")
from google.colab import files
files.download('DataSet.csv')

---
### **1.4 "Parameter setting" for clustering model**
##### **Train/Test division, applying data pre-processing** {Sampling, Normalization, Transformation, PCA, Handling of Outliers, Feature Selection}
---

In [16]:
from pycaret.clustering import *
kMeanClusteringParameters = setup(DataSet)


Unnamed: 0,Description,Value
0,Session id,1525
1,Original data shape,"(20, 29)"
2,Transformed data shape,"(20, 48)"
3,Numeric features,28
4,Categorical features,1
5,Preprocess,True
6,Imputation type,simple
7,Numeric imputation,mean
8,Categorical imputation,mode
9,Maximum one-hot encoding,-1
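
The jump from 29 to 48 columns in the setup summary is consistent with the single categorical column `Perfume` being one-hot encoded: 28 numeric columns plus one indicator column per perfume name. The sketch below illustrates the idea in plain Python; it is not PyCaret's encoder, and it assumes all 20 perfume names are distinct.

```python
def one_hot(values):
    """One-hot encode a list of category labels into rows of 0/1 indicators."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

# Three perfume names taken from the dataset preview.
names = ["ajayeb", "ajmal", "amreaj"]
categories, rows = one_hot(names)
print(categories)
print(rows)

# Width check for the full dataset: 28 numeric columns plus one indicator
# column per distinct perfume (assumption: all 20 names are distinct).
numeric_features = 28
distinct_perfumes = 20
print(numeric_features + distinct_perfumes)
```

Because `Perfume` acts as an identifier rather than a measurement, it may be preferable to exclude it from clustering; `setup()` accepts an `ignore_features` argument that could be used for this.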


---
### **1.5 Building the "K-Means" clustering model**
---

In [17]:
KMeanClusteringModel = create_model('kmeans', num_clusters=4)

Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5132,60.1923,0.399,0,0,0


---
### **1.6 Assign Model - "Assign the labels" to the dataset**
---



In [18]:
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

Unnamed: 0,Perfume,Measurement1,Measurement2,Measurement3,Measurement4,Measurement5,Measurement6,Measurement7,Measurement8,Measurement9,...,Measurement20,Measurement21,Measurement22,Measurement23,Measurement24,Measurement25,Measurement26,Measurement27,Measurement28,Cluster
0,ajayeb,64558,64556,64543,64543,64541,64543,64543,64541,64541,...,64541,64541,64541,64541,64541,64528,64528,64528,64528,Cluster 2
1,ajmal,60502,60489,61485,60487,61485,61513,60515,60500,60500,...,60472,60461,61470,60487,60487,61485,60487,60472,60472,Cluster 2
2,amreaj,57040,57040,57040,58041,58041,58041,58041,57042,57042,...,58041,58041,58041,58041,58041,58041,58041,58041,58041,Cluster 2
3,aood,71083,72087,71091,71095,71099,72103,71099,72099,72099,...,71095,71095,72103,71103,71103,71103,72103,72103,72098,Cluster 0
4,asgar_ali,68209,68209,68216,68216,68223,68223,68223,68223,68230,...,67224,67217,67217,68223,68223,68223,68223,68223,68230,Cluster 0
5,bukhoor,71046,71046,71046,71046,71046,71046,71046,71046,71046,...,70049,70049,70048,70049,70048,70046,70046,70048,71048,Cluster 0
6,burberrry,61096,61096,60093,60092,60093,60093,61096,61096,61096,...,60089,60089,60092,60089,60089,60089,61088,61088,60089,Cluster 2
7,dehenalaod,68132,69137,69137,68137,68137,69142,69142,68137,68137,...,69142,69142,69142,69142,68137,69137,69137,69137,69136,Cluster 0
8,junaid,71590,71575,71574,71560,71560,71559,72573,71559,71559,...,72556,72556,72557,72556,72542,72542,72556,72556,72556,Cluster 0
9,kausar,74631,74649,74650,74650,74650,74632,74632,74632,73633,...,73585,73584,73600,73601,73601,73585,73585,73585,73585,Cluster 0


---
### **1.7 Clustering in "Three lines of code"**
---

In [19]:
from pycaret.clustering import *

kMeanClusteringParameters = setup(DataSet, verbose=False)
KMeanClusteringModel = create_model('kmeans', num_clusters=4)
kMeanPrediction = assign_model(KMeanClusteringModel)
kMeanPrediction

Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.5087,61.1216,0.4182,0,0,0


Unnamed: 0,Perfume,Measurement1,Measurement2,Measurement3,Measurement4,Measurement5,Measurement6,Measurement7,Measurement8,Measurement9,...,Measurement20,Measurement21,Measurement22,Measurement23,Measurement24,Measurement25,Measurement26,Measurement27,Measurement28,Cluster
0,ajayeb,64558,64556,64543,64543,64541,64543,64543,64541,64541,...,64541,64541,64541,64541,64541,64528,64528,64528,64528,Cluster 1
1,ajmal,60502,60489,61485,60487,61485,61513,60515,60500,60500,...,60472,60461,61470,60487,60487,61485,60487,60472,60472,Cluster 1
2,amreaj,57040,57040,57040,58041,58041,58041,58041,57042,57042,...,58041,58041,58041,58041,58041,58041,58041,58041,58041,Cluster 1
3,aood,71083,72087,71091,71095,71099,72103,71099,72099,72099,...,71095,71095,72103,71103,71103,71103,72103,72103,72098,Cluster 0
4,asgar_ali,68209,68209,68216,68216,68223,68223,68223,68223,68230,...,67224,67217,67217,68223,68223,68223,68223,68223,68230,Cluster 0
5,bukhoor,71046,71046,71046,71046,71046,71046,71046,71046,71046,...,70049,70049,70048,70049,70048,70046,70046,70048,71048,Cluster 0
6,burberrry,61096,61096,60093,60092,60093,60093,61096,61096,61096,...,60089,60089,60092,60089,60089,60089,61088,61088,60089,Cluster 1
7,dehenalaod,68132,69137,69137,68137,68137,69142,69142,68137,68137,...,69142,69142,69142,69142,68137,69137,69137,69137,69136,Cluster 0
8,junaid,71590,71575,71574,71560,71560,71559,72573,71559,71559,...,72556,72556,72557,72556,72542,72542,72556,72556,72556,Cluster 0
9,kausar,74631,74649,74650,74650,74650,74632,74632,74632,73633,...,73585,73584,73600,73601,73601,73585,73585,73585,73585,Cluster 0


---
### **1.8 "Saving" the result**
---



In [20]:
kMeanPrediction.to_csv("KMeanResult.csv")
print("Result file saved successfully!!")

Result file saved successfully!!


---
### **1.9 Download the "result file" to user local system**
---

In [21]:
from google.colab import files
files.download('KMeanResult.csv')


---
# **2. Clustering: Saving and Loading the Model**
---
### **2.1 Save the "trained model"**
---

In [22]:
x = save_model(KMeanClusteringModel, 'kMeanClusteringModelFile')

Transformation Pipeline and Model Successfully Saved


---
### **2.2 Download the "trained model"**
---

In [23]:
from google.colab import files
files.download('kMeanClusteringModelFile.pkl')

---
### **2.3 Load the model**
---
##### Use this when working in **"Anaconda/Jupyter notebook"** on a local machine

In [24]:
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile')

Transformation Pipeline and Model Successfully Loaded


---
### **2.4 Upload and Load the trained model to "Colab Environment"**
---
##### **Upload the trained model**

In [25]:
from google.colab import files
files.upload()                     # Opens a file picker; choose the saved model file

Saving DataSet.csv to DataSet (1).csv


##### **Load the trained model**

In [26]:
KMeanClusteringModel1 = load_model('kMeanClusteringModelFile')

Transformation Pipeline and Model Successfully Loaded


---
# **3. Clustering: Cluster the new dataset (Unseen Data)**
---
### **3.1 Select some data or upload user dataset file**

In [27]:
# Select top 10 rows
newData = get_data("perfume").iloc[:10]

Unnamed: 0,Perfume,Measurement1,Measurement2,Measurement3,Measurement4,Measurement5,Measurement6,Measurement7,Measurement8,Measurement9,...,Measurement19,Measurement20,Measurement21,Measurement22,Measurement23,Measurement24,Measurement25,Measurement26,Measurement27,Measurement28
0,ajayeb,64558,64556,64543,64543,64541,64543,64543,64541,64541,...,64541,64541,64541,64541,64541,64541,64528,64528,64528,64528
1,ajmal,60502,60489,61485,60487,61485,61513,60515,60500,60500,...,60472,60472,60461,61470,60487,60487,61485,60487,60472,60472
2,amreaj,57040,57040,57040,58041,58041,58041,58041,57042,57042,...,58041,58041,58041,58041,58041,58041,58041,58041,58041,58041
3,aood,71083,72087,71091,71095,71099,72103,71099,72099,72099,...,72095,71095,71095,72103,71103,71103,71103,72103,72103,72098
4,asgar_ali,68209,68209,68216,68216,68223,68223,68223,68223,68230,...,68230,67224,67217,67217,68223,68223,68223,68223,68223,68230


---
### **3.2 Make prediction on the new dataset (Unseen Data)**
---

In [28]:
newPredictions = predict_model(KMeanClusteringModel, data = newData)
newPredictions

Unnamed: 0,Perfume_ajayeb,Perfume_ajmal,Perfume_amreaj,Perfume_aood,Perfume_asgar_ali,Perfume_bukhoor,Perfume_burberrry,Perfume_dehenalaod,Perfume_junaid,Perfume_kausar,...,Measurement20,Measurement21,Measurement22,Measurement23,Measurement24,Measurement25,Measurement26,Measurement27,Measurement28,Cluster
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64541.0,64541.0,64541.0,64541.0,64541.0,64528.0,64528.0,64528.0,64528.0,Cluster 1
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,60472.0,60461.0,61470.0,60487.0,60487.0,61485.0,60487.0,60472.0,60472.0,Cluster 1
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,58041.0,58041.0,58041.0,58041.0,58041.0,58041.0,58041.0,58041.0,58041.0,Cluster 1
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,71095.0,71095.0,72103.0,71103.0,71103.0,71103.0,72103.0,72103.0,72098.0,Cluster 0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,67224.0,67217.0,67217.0,68223.0,68223.0,68223.0,68223.0,68223.0,68230.0,Cluster 0
5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,70049.0,70049.0,70048.0,70049.0,70048.0,70046.0,70046.0,70048.0,71048.0,Cluster 0
6,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,60089.0,60089.0,60092.0,60089.0,60089.0,60089.0,61088.0,61088.0,60089.0,Cluster 1
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,69142.0,69142.0,69142.0,69142.0,68137.0,69137.0,69137.0,69137.0,69136.0,Cluster 0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,72556.0,72556.0,72557.0,72556.0,72542.0,72542.0,72556.0,72556.0,72556.0,Cluster 0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,73585.0,73584.0,73600.0,73601.0,73601.0,73585.0,73585.0,73585.0,73585.0,Cluster 0
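
For K-Means, labelling an unseen row comes down to finding the nearest centroid once the row has passed through the same preprocessing as the training data. The sketch below shows only the nearest-centroid step; the two 2-D centroids are hypothetical values chosen for illustration, not the centroids of the fitted model.

```python
import math

def nearest_centroid(row, centroids):
    """Return the index of the centroid closest to `row` (Euclidean distance)."""
    distances = [math.dist(row, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical 2-D centroids, for illustration only.
centroids = [(70000.0, 70000.0), (60000.0, 60000.0)]

# First two measurements of 'ajayeb' and 'aood' from the dataset preview.
new_rows = [(64558.0, 64556.0), (71083.0, 72087.0)]
assigned = [f"Cluster {nearest_centroid(r, centroids)}" for r in new_rows]
print(assigned)
```

This also explains why `predict_model` is only meaningful for algorithms that can label new points; purely transductive methods have no equivalent step.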


---
### **3.3 Save the prediction result to csv**
---

In [29]:
newPredictions.to_csv("NewPredictions.csv")
print("Result file saved successfully!!")

Result file saved successfully!!


---
# **4. Clustering: Plotting the Clusters**
---
```
- Cluster PCA Plot (2d)          'cluster'
- Cluster t-SNE (3d)             'tsne'
- Elbow Plot                     'elbow'
- Silhouette Plot                'silhouette'
- Distance Plot                  'distance'
- Distribution Plot              'distribution'
```

---
### **4.1 Evaluate Cluster Model**
---

In [30]:
evaluate_model(KMeanClusteringModel)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

---
### **4.2 2D-plot for Cluster**
---

In [31]:
plot_model(KMeanClusteringModel, plot='cluster')

---
### **4.3 3D-plot for Cluster**
---

In [33]:
plot_model(KMeanClusteringModel, plot = 'tsne')

ValueError: perplexity must be less than n_samples
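
This error most likely arises because the perfume dataset has only 20 rows, while scikit-learn's t-SNE uses a default perplexity of 30 and requires the perplexity to be smaller than the number of samples. The sketch below shows one way to pick a valid value; the helper name `safe_perplexity` is hypothetical, and whether `plot_model` exposes a perplexity setting is not confirmed here.

```python
def safe_perplexity(n_samples, preferred=30.0):
    """Return a perplexity strictly below n_samples, capped at `preferred`."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return float(min(preferred, n_samples - 1))

print(safe_perplexity(20))    # the perfume dataset has 20 rows
print(safe_perplexity(1000))  # large datasets keep the preferred value
```

When running t-SNE directly, the result could be passed as `sklearn.manifold.TSNE(perplexity=...)`. On a larger dataset such as `forest`, the default value should not trigger this error.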

---
### **4.4 Elbow Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'elbow')
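
An elbow plot typically shows the within-cluster sum of squared distances (inertia) against the number of clusters; the "elbow" is the point after which adding clusters yields little improvement. The following is a minimal sketch using a tiny 1-D K-Means with deterministic initialisation on made-up data. It is not the routine PyCaret uses to draw the plot.

```python
def kmeans_1d(points, k, iterations=100):
    """Tiny Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    pts = sorted(points)
    # Deterministic initialisation: k evenly spaced points from the sorted data.
    centroids = [pts[i * len(pts) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, inertia

# Made-up data with three obvious groups.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia drops sharply up to k = 3 and only marginally afterwards, which is the pattern to look for in the plot.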

---
### **4.5 Silhouette Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'silhouette')

---
### **4.6 Distribution Plot**
---

In [None]:
plot_model(KMeanClusteringModel, plot = 'distribution')

---
# **5. Complete Code for Clustering (KMean)**
---
### **5.1 For Cluster = 3, 4, 5, 6**

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **5.2 Other Clustering Algorithms**
---
```
- K-Means clustering                 'kmeans'
- Affinity Propagation               'ap'
- Mean shift clustering              'meanshift'
- Spectral Clustering                'sc'
- Agglomerative Clustering           'hclust'
- Density-Based Spatial Clustering   'dbscan'
- OPTICS Clustering                  'optics'
- Birch Clustering                   'birch'
- K-Modes clustering                 'kmodes'
```

---
# **6. Clustering: Apply "Data Preprocessing"**
---
### **Read the Dataset**

In [None]:
from pycaret.clustering import *
from pycaret.datasets import get_data

DataSet = get_data('forest')

---
### **6.1 Model Performance using "Normalization"**
---
### **6.1.1 Elbow Plot**


In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
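
The `normalize_method = 'zscore'` option rescales each feature to zero mean and unit variance, which keeps features with large numeric ranges from dominating the distance calculations in K-Means. The sketch below shows the calculation for a single made-up feature; it assumes the population standard deviation is used, as in scikit-learn's `StandardScaler`.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up feature values, for illustration only.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = zscore(feature)
print(scaled)
```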

---
### **6.1.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)


---
### **6.1.3 3D Plot for Cluster = 6**
---

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('kmeans', num_clusters = 6)
plot_model(x, plot = 'tsne')

---
### **6.2 Model Performance using "Transformation"**
---

### **6.2.1 Elbow Plot**


In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')
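
The `'yeo-johnson'` method is a power transform that makes skewed features more Gaussian-like and, unlike Box-Cox, also accepts zero and negative values. In practice the exponent lambda is estimated from the data; the sketch below applies the transform for a fixed, hand-picked lambda to made-up values, purely to show the formula.

```python
import math

def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if y >= 0:
        if lmbda == 0:
            return math.log1p(y)
        return ((y + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)

# Made-up right-skewed values; lambda = 0 compresses the long tail.
skewed = [0.0, 1.0, 3.0, 7.0, 15.0]
print([round(yeo_johnson(v, 0.0), 3) for v in skewed])
```

With lambda = 1 the transform leaves values unchanged, which is a convenient sanity check.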

---
### **6.2.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.3 Model Performance using "PCA"**
---
### **6.3.1 Elbow Plot**

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.3.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.4 Model Performance using "Transformation" + "Normalization"**
---
### **6.4.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.4.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
### **6.5 Model Performance using "Transformation" + "Normalization" + "PCA"**
---
### **6.5.1 Elbow Plot**

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)
x = create_model('kmeans')
plot_model(x, plot = 'elbow')

---
### **6.5.2 Evaluation parameters for Cluster = 3, 4, 5, 6**
---

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('kmeans', num_clusters = 3)

print("For Cluster = 4")
x = create_model('kmeans', num_clusters = 4)

print("For Cluster = 5")
x = create_model('kmeans', num_clusters = 5)

print("For Cluster = 6")
x = create_model('kmeans', num_clusters = 6)

---
# **7. Other Clustering Techniques**
---
```
K-Means clustering                 'kmeans'
Affinity Propagation               'ap'
Mean shift clustering              'meanshift'
Spectral Clustering                'sc'
Agglomerative Clustering           'hclust'
Density-Based Spatial Clustering   'dbscan'
OPTICS Clustering                  'optics'
Birch Clustering                   'birch'
K-Modes clustering                 'kmodes'
```

---
### **7.1 Building Agglomerative (Hierarchical) clustering model**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)

x = create_model('hclust')
plot_model(x, plot = 'elbow')

---
### **7.1.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
hierarchicalModel = create_model('hclust', num_clusters=3)
hierarchicalModelPrediction = assign_model(hierarchicalModel)
hierarchicalModelPrediction

---
### **7.1.2 Evaluate Agglomerative (Hierarchical) Clustering**
---

In [None]:
evaluate_model(hierarchicalModel)

---
### **7.2 Density-Based Spatial Clustering**
---

In [None]:
from pycaret.datasets import get_data
from pycaret.clustering import *

DataSet = get_data('forest', verbose=False)
setup(data = DataSet, verbose=False)
dbscanModel = create_model('dbscan')

---
### **7.2.1 Assign Model - "Assign the labels" to the dataset**
---



In [None]:
dbscanModelPrediction = assign_model(dbscanModel)
dbscanModelPrediction

# Noisy samples are given the label -1 i.e. 'Cluster -1'

---
### **7.3 Mean Shift Clustering**
---


In [None]:
from pycaret.clustering import *

meanshiftClusteringParameters = setup(DataSet, verbose=False)
meanshiftClusteringModel = create_model('meanshift', num_clusters=4)
meanshiftClusteringPrediction = assign_model(meanshiftClusteringModel)
meanshiftClusteringPrediction

In [None]:
evaluate_model(meanshiftClusteringModel)

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, normalize = True, normalize_method = 'zscore', verbose=False)
x = create_model('meanshift', num_clusters = 6)
plot_model(x, plot = 'tsne')

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, pca = True, pca_method = 'linear', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, normalize = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore', transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)
x = create_model('meanshift')
plot_model(x, plot = 'elbow')

In [None]:
setup(data = DataSet, transformation = True, normalize = True, pca = True,
      normalize_method = 'zscore',
      transformation_method = 'yeo-johnson',
      pca_method = 'linear', verbose=False)

print("For Cluster = 3")
x = create_model('meanshift', num_clusters = 3)

print("For Cluster = 4")
x = create_model('meanshift', num_clusters = 4)

print("For Cluster = 5")
x = create_model('meanshift', num_clusters = 5)

print("For Cluster = 6")
x = create_model('meanshift', num_clusters = 6)

In [None]:
from pycaret.clustering import *
model = setup(X, verbose = False)

### **Key Points**

- num_clusters is not required for some of the clustering algorithms: Affinity Propagation ('ap'), Mean shift
  clustering ('meanshift'), Density-Based Spatial Clustering ('dbscan') and OPTICS Clustering ('optics').
- For these models, the number of clusters is determined automatically.

- When the fit doesn't converge in the Affinity Propagation ('ap') model, all datapoints are labelled as -1.

- Noisy samples are given the label -1 when using Density-Based Spatial Clustering ('dbscan') or OPTICS Clustering ('optics').

- OPTICS ('optics') clustering may take longer training times on large datasets.
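
Because noise points share the label 'Cluster -1', it is worth checking how many rows were flagged before interpreting the remaining clusters. The following is a minimal sketch on made-up labels; the helper name `summarize_clusters` is hypothetical, and in practice the labels would come from the `Cluster` column returned by `assign_model`.

```python
from collections import Counter

def summarize_clusters(labels, noise_label="Cluster -1"):
    """Count points per cluster and report the share flagged as noise."""
    counts = Counter(labels)
    noise = counts.pop(noise_label, 0)
    noise_share = noise / len(labels) if labels else 0.0
    return dict(counts), noise, noise_share

# Hypothetical 'Cluster' column values, for illustration only.
labels = ["Cluster 0", "Cluster 0", "Cluster -1", "Cluster 1", "Cluster -1", "Cluster 0"]
clusters, noise, noise_share = summarize_clusters(labels)
print(clusters, noise, round(noise_share, 2))
```

A large noise share with 'dbscan' or 'optics' usually suggests that the density parameters do not suit the scale of the data.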


---
# **8. Deploy the model on AWS**
---
**<a href="https://pycaret.readthedocs.io/en/latest/api/clustering.html#pycaret.clustering.deploy_model">Click Here</a>**