# Clustering: Extracting Patterns From Data & Concept + Feature Scaling + PCA [Pt.1]
## Customer segmentation based on credit card usage behavior

(Photo)

This is the first article in a series outlining an end-to-end clustering project. We will begin with some initial concepts, move on to exploratory analysis of the dataset, go through data preprocessing, and group customers into clusters. At the end, we will analyze the clusters and suggest marketing strategies.

There is a lot to do, so let's start!

### Table of Contents

[Part One](#part_one)
* [Clustering Definition](#clustering_definition)
* [Supervised Learning x Unsupervised Learning](#supervised_unsupervised)
* [K-Means Clustering](#concept)
* [Dataset Dictionary](#data_dictionary)
* [Project Aim](#project_aim)
* [EDA (Exploratory Data Analysis)](#eda)
* [Feature Scaling](#feature_scaling)
* [PCA (Principal Component Analysis)](#pca)
* [Cluster Visualization with Plotly](#visualization)
* [Conclusion](#conclu)

[Part Two](#part_two)
* [Coming soon...](#)

<a id="part_one"></a>
<h2><b> Part One </b></h2>

---

#### Code from the article: [Linear Regression: Advanced Modeling Techniques & Pipeline [Pt.1]](https://medium.com/@viniciusnala/linear-regression-advanced-modeling-techniques-pipeline-pt-1-3c0433230b88)

<a id="clustering_definition"></a>
## Clustering Definition

Clustering, or data grouping analysis, is a set of data mining techniques that automatically group data according to their degree of similarity. The similarity criterion depends on the problem and on the algorithm. The result of this process is the division of a dataset into a certain number of groups (clusters).

This definition is very important, and everything we do from here on relates to it in some way. Equally important is the difference between the two concepts below.

<a id="supervised_unsupervised"></a>
## Supervised Learning x Unsupervised Learning

Supervised Learning is an approach where an algorithm is trained on input data that has been labeled for a particular output. The model fits the data capturing the underlying pattern and relationship between the input data and the output labels, enabling it to predict outcomes accurately.

In my two previous data science projects:

- [Machine Learning Model to Predict Survival in Titanic [Pt. 1]](https://medium.com/@viniciusnala/machine-learning-model-to-predict-survival-in-titanic-pt-1-b3681d1794fb)
- [Linear Regression: Advanced Modeling Techniques & Pipeline [Pt.1]](https://medium.com/@viniciusnala/linear-regression-advanced-modeling-techniques-pipeline-pt-1-3c0433230b88)

In both, I built a supervised machine learning model, which I used to predict a "target variable" (label).

For example, in the first project I built a model to predict whether a person would survive the Titanic. To do that, I provided data containing general characteristics of each passenger together with the information on whether that person survived (the target variable), and the model captured the underlying patterns and relationships between the two. Then I presented it with never-before-seen data containing the characteristics of other passengers, but without the survival information, and the model was responsible for guessing who would survive.

Unlike supervised learning, unsupervised learning uses unlabeled data, i.e. we don't provide a target variable (label) when training the model. These algorithms aren't used to predict something but to discover hidden patterns in data without the need for human intervention (hence, they are "unsupervised").

If the difference is still not entirely clear, it will become clearer as the project progresses. Let's continue!
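To make the contrast concrete, here is a minimal sketch on made-up toy data. The arrays and the choice of `LogisticRegression` as the supervised model are assumptions for illustration only, not part of this project. The point is simply that the supervised model is fitted with `X` and `y`, while the unsupervised one is fitted with `X` alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated groups of points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, only available in the supervised case

# Supervised: the model sees both the features and the labels.
clf = LogisticRegression().fit(X, y)
predicted = clf.predict(np.array([[0.1, 0.1], [5.0, 5.0]]))

# Unsupervised: the model sees only the features and finds the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = km.labels_

print("Predicted labels:", predicted)
print("Discovered groups:", groups)
```

Note that the group numbers found by KMeans are arbitrary: the algorithm knows which points belong together, but not what each group means.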

<a id="concept"></a>
## K-Means Clustering

The goal of clustering is to identify meaningful groups of data. These groups can be analyzed in depth, or passed as a feature or as an outcome to a classification or regression model. K-Means is one of the earliest clustering methods, has a very simple algorithm, and is still widely used.

(Photo)

K-Means divides the data into K clusters, each with its own centroid. The main aim of the algorithm is to minimize the sum of the squared distances of each point to the centroid of its assigned cluster. K-Means does not ensure that the clusters will have the same size, but it looks for the clusters that are best separated.

Let's see a simple example. Imagine that we have N records and two variables, X and Y. Suppose we want to split the data into K = 3 clusters, which means assigning each record (Xi, Yi) to one of the clusters k.

(Photo)

The algorithm first selects random coordinates for the centroids and then, using the Euclidean distance formula, measures the distance between each data point and each cluster centroid. Each point is then assigned to its closest centroid.

(Photo)
(Photo)

After that, the coordinates of the centroids are recomputed by taking the mean of all data points assigned to each cluster. Given an assignment of Nk records to cluster k, the center of the cluster (Xk, Yk) is calculated through the equation:

(Photo)

In simple terms, we are just summing the X and Y values of the points in the cluster and dividing each sum by the number of points in that cluster.
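Since the equation above is shown as an image, here is the same centroid update written out. This is a reconstruction from the description in the text, where $N_k$ is the number of records assigned to cluster $k$ and $(\bar{X}_k, \bar{Y}_k)$ is the center of that cluster:

```latex
\bar{X}_k = \frac{1}{N_k} \sum_{i \,\in\, \text{cluster } k} X_i ,
\qquad
\bar{Y}_k = \frac{1}{N_k} \sum_{i \,\in\, \text{cluster } k} Y_i
```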

After this process, the algorithm computes the sum of squares within each cluster, which is given by the formula:

(Photo)
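Written out, and again as a reconstruction from the text rather than a copy of the image, the sum of squares within cluster $k$ adds up the squared distances of its points to the cluster center $(\bar{X}_k, \bar{Y}_k)$, and the quantity K-Means tries to minimize is the total over all $K$ clusters:

```latex
SS_k = \sum_{i \,\in\, \text{cluster } k}
       \left[ (X_i - \bar{X}_k)^2 + (Y_i - \bar{Y}_k)^2 \right],
\qquad
SS_{\text{total}} = \sum_{k=1}^{K} SS_k
```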

K-Means keeps repeating the process of measuring the distance between the points and the centroids, assigning each data point to the closest centroid, and recomputing the coordinates of the centroids until the assignments stop changing. At that point the sum of squares across all three clusters has reached a minimum, although it may be a local rather than a global one.

(photo)
(photo)
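The loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the scikit-learn implementation used later in the article: the function names `assign` and `kmeans_sketch` are hypothetical, the toy `blobs` data is made up, and the initial centroids are picked among the data points rather than as arbitrary random coordinates.

```python
import numpy as np

def assign(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    # Euclidean distance from every point to every centroid: shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain K-Means on an (n, d) array; for illustration only."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # nothing moved: converged
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    # Sum of squared distances of each point to its assigned centroid.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Toy data: three made-up groups of 50 points each.
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])
labels, centroids, sse = kmeans_sketch(blobs, k=3)
print(centroids)
print("Sum of squares:", sse)
```

Because the starting centroids are random, different seeds can end in different local minima, which is one reason scikit-learn's KMeans has the `n_init` parameter to rerun the algorithm several times.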

Repeating the same process many times can be computationally expensive, especially when the amount of data is large.

If you want to see for yourself how it works, I highly recommend this simulator: [K-Means Clustering Simulator](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/).

<a id="data_dictionary"></a>
## Dataset Dictionary

The dataset that will be used consists of the credit card usage behavior of 8950 customers over 6 months, with 18 behavioral features. The dataset is available on Kaggle.

[Credit Card Dataset for Clustering](https://www.kaggle.com/datasets/arjunbhasin2013/ccdata)

- CUST_ID: Identification of Credit Card holder (Categorical)
- BALANCE: Balance amount left in their account to make purchases
- BALANCE_FREQUENCY: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from account
- ONEOFF_PURCHASES: Maximum purchase amount done in one-go
- INSTALLMENTS_PURCHASES: Amount of purchase done in installment
- CASH_ADVANCE: Cash in advance given by the user
- PURCHASES_FREQUENCY: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are made in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are made (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY: How frequently the cash in advance is being paid
- CASH_ADVANCE_TRX: Number of transactions made with "Cash in Advance"
- PURCHASES_TRX: Number of purchase transactions made
- CREDIT_LIMIT: Limit of Credit Card for user
- PAYMENTS: Amount of Payment done by user
- MINIMUM_PAYMENTS: Minimum amount of payments made by user
- PRC_FULL_PAYMENT: Percent of full payment paid by user
- TENURE: Tenure of credit card service for user

<a id="project_aim"></a>
## Project Aim

Initially, our focus will be to segment customers according to their similarities. After that, we will analyze this segmentation and define an effective credit card marketing strategy.

<a id="eda"></a>
## EDA (Exploratory Data Analysis)

Let's begin by importing the libraries and the dataset:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/kaggle/input/ccdata/CC GENERAL.csv')

Visualizing some characteristics of the dataset:

In [None]:
df.head(10)

In [None]:
df.shape

Through Pandas Profiling we can see a very detailed report about the general characteristics of the data set.

In [None]:
from pandas_profiling import ProfileReport

ProfileReport(df).to_notebook_iframe()

In [None]:
# Download the report
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("report.html")

After analyzing the report, I concluded that the features CUST_ID and TENURE are not going to contribute to the segmentation, so we can exclude them from the dataset.

In [None]:
# Delete the irrelevant features
df.drop(columns=['CUST_ID', 'TENURE'], inplace=True)

Now our dataset has 16 columns.

Let's see how many missing values there are in the dataset.

In [26]:
# See Null Values
df.isnull().sum().sort_values(ascending=False)

BALANCE                             0
BALANCE_FREQUENCY                   0
PURCHASES                           0
ONEOFF_PURCHASES                    0
INSTALLMENTS_PURCHASES              0
CASH_ADVANCE                        0
PURCHASES_FREQUENCY                 0
ONEOFF_PURCHASES_FREQUENCY          0
PURCHASES_INSTALLMENTS_FREQUENCY    0
CASH_ADVANCE_FREQUENCY              0
CASH_ADVANCE_TRX                    0
PURCHASES_TRX                       0
CREDIT_LIMIT                        0
PAYMENTS                            0
MINIMUM_PAYMENTS                    0
PRC_FULL_PAYMENT                    0
dtype: int64

Only a few values are missing (the output above shows zeros, most likely because it was captured after the imputation below had already been run). To deal with them I will use KNNImputer: each missing value is imputed using the mean value from the n_neighbors nearest neighbors, in this case the 5 nearest neighbors.

In [None]:
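from sklearn.impute import KNNImputer

# Columns with missing values
null_columns = df.columns[df.isnull().any()].tolist()

# Impute each missing value with the mean of its 5 nearest neighbors.
# Note: neighbors are searched using only the columns that contain missing values.
imputer = KNNImputer(n_neighbors=5)
df_imp = pd.DataFrame(
    imputer.fit_transform(df[null_columns]),
    columns=null_columns,
    index=df.index,  # keep the original index so fillna aligns row by row
)
df = df.fillna(df_imp)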

This way, instead of filling every missing value with the mean or the median of the column, which is the simplest approach, we reduce the risk of biasing the clustering results.
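To see the difference between the two approaches, here is a small sketch on a made-up toy table (the values and the choice of `n_neighbors=2` are assumptions for illustration). The missing value in column `b` belongs to a row whose `a` value is close to the first and third rows, so KNNImputer fills it with the mean of their `b` values, while the column mean is dragged upward by the outlier in the last row.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy table (made up): the last row is an outlier in both columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 100.0],
    "b": [10.0, np.nan, 30.0, 1000.0],
})

# KNN imputation: use the 2 rows closest to the incomplete row.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

# Simple imputation: use the mean of the whole column.
mean_filled = toy.fillna(toy.mean())

print("KNN fill: ", knn_filled[1, 1])         # mean of 10 and 30
print("Mean fill:", mean_filled.loc[1, "b"])  # mean of 10, 30 and 1000
```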

In [None]:
# Distribution Visualization
plt.figure(figsize=(20,35))

for i, col in enumerate(df.columns):
    ax = plt.subplot(9, 2, i+1)
    sns.kdeplot(df[col], ax=ax)
    plt.xlabel(col)
        
plt.show()

Almost every variable is heavily right-skewed or left-skewed, which suggests there may be outliers. That is not very surprising, since we are working with credit card data: there will certainly be a small portion of people with a very large amount of money and a very high credit limit, while the majority have more or less the same amount of money and credit limit.
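Besides reading the plots, skewness can also be measured numerically. The sketch below uses a made-up toy table to show how pandas' `skew()` behaves: values near 0 indicate a symmetric distribution, positive values a right skew, and negative values a left skew. On the real dataset, the same idea would be `df.skew()`.

```python
import pandas as pd

# Toy columns (made up for illustration).
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 100],  # one very large value pulls the tail right
})

skewness = toy.skew()
print(skewness)
```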

<a id="feature_scaling"></a>
## Feature Scaling

(photo)

If you look at the frequency columns, such as BALANCE_FREQUENCY and PURCHASES_FREQUENCY, they vary between 0 and 1, where 0 means 0% frequency and 1 means 100% frequency. The columns BALANCE and PURCHASES, by comparison, have no upper limit; we only know that their minimum is 0.

If we feed these columns to the clustering algorithm as they are, it will not yield high-quality clusters, because the algorithm will treat a difference of 1.00 dollar in BALANCE as just as significant as a difference of 1.00 in BALANCE_FREQUENCY, which is that column's entire range.

For that reason, we need to put all the columns on the same scale. Otherwise, it would be like clustering people by their weight in kilograms and their height in meters: is a 1 kg difference in weight as significant as a 1 m difference in height?

This is why scaling the dataset is a vital step when using clustering techniques. We will use Normalizer to scale the dataset, but there are many ways to scale a dataset, and each one suits a specific situation.
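The weight and height analogy can be checked with a few made-up numbers. In the sketch below, the people, the helper `dist`, and the use of `MinMaxScaler` are all assumptions chosen for illustration (this is not the scaler used in this project). Before scaling, a 1 kg difference counts as a larger distance than a 0.5 m difference; after scaling, the order is reversed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical people: [weight in kg, height in m]
people = np.array([
    [60.0, 1.50],
    [61.0, 1.50],   # 1 kg heavier than the first person
    [60.0, 2.00],   # 0.5 m taller than the first person
    [100.0, 1.80],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the weight difference dominates.
raw_weight_gap = dist(people[0], people[1])
raw_height_gap = dist(people[0], people[2])

# After rescaling every column to the [0, 1] range.
scaled = MinMaxScaler().fit_transform(people)
scaled_weight_gap = dist(scaled[0], scaled[1])
scaled_height_gap = dist(scaled[0], scaled[2])

print("Raw:   ", raw_weight_gap, raw_height_gap)
print("Scaled:", scaled_weight_gap, scaled_height_gap)
```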

In [None]:
# Feature Scaling
from sklearn.preprocessing import Normalizer

scaler = Normalizer(norm='l2')

Normalizer, unlike the other methods, works on the rows, not the columns. This may seem unintuitive, but it means that each value is scaled according to the other values in its row, not according to the values in its column.

By default, L2 normalization is applied to each observation so that the values in a row have unit norm. Unit norm with L2 means that if each element in the row were squared and summed, the total would equal 1. As a consequence, Normalizer transforms all the features into values between -1 and 1.
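A tiny sketch on made-up rows shows what this row-wise scaling does. Each row is divided by its own length, so every row ends up with a norm of 1. One side effect worth noticing is that the first two rows, which differ only in magnitude, become identical after normalization.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up rows: the second is the first multiplied by 10.
rows = np.array([
    [3.0, 4.0],
    [30.0, 40.0],
    [1.0, 0.0],
])

normalized = Normalizer(norm="l2").fit_transform(rows)

# Square each element, sum per row, take the square root: always 1.
row_norms = np.linalg.norm(normalized, axis=1)

print(normalized)
print(row_norms)
```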

<a id="pca"></a>
## PCA (Principal Component Analysis)

In [None]:
# Reduce dimensions
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=1234)

PCA is a method used in unsupervised machine learning (alongside techniques such as clustering) that reduces high-dimensional data to fewer dimensions while preserving as much information as possible. Using PCA before applying a clustering algorithm reduces the number of dimensions, reduces noise in the data, and lowers the computational cost.

In this article, the features will be reduced to 2 dimensions so that the clustering results can be visualized.
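How much information survives the reduction can be checked through the `explained_variance_ratio_` attribute of a fitted PCA. The sketch below uses synthetic data, made up so that 5 features are really driven by only 2 underlying variables; the variable names are illustrative. On the real dataset the retained share will be whatever the data allows, so it is worth checking rather than assuming.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 2 underlying variables plus 3 noisy linear mixes of them.
base = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 3))
X = np.hstack([base, base @ mix + 0.01 * rng.normal(size=(200, 3))])

pca = PCA(n_components=2, random_state=1234)
X_2d = pca.fit_transform(X)

# Share of the total variance kept by the 2 components.
retained = pca.explained_variance_ratio_.sum()

print(X.shape, "->", X_2d.shape)
print("Variance retained:", retained)
```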


To better organize these two preprocessing steps, we will combine them into a single step with Pipeline:

In [None]:
# Preprocessing
from sklearn.pipeline import Pipeline

# Pipeline
pipe_normalizedscaled_pca = Pipeline([('scaling', scaler), ('pca', pca)])

After the preprocessing, the dataset looks like this:

In [None]:
# Transformed dataset
pd.DataFrame(
    pipe_normalizedscaled_pca.fit_transform(df),
    columns=['x', 'y']
)

<a id="visualization"></a>
## Cluster Visualization with Plotly

With the preprocessing done, the only things that remain are the modeling and the visualization. To do this I will create a function, so we can reuse it whenever we want.

In [None]:
# Libs
from sklearn.cluster import KMeans
import plotly.express as px
import seaborn as sns

# Function
def Visualize_Cluster(df, pipeline, n_clusters):
    '''
    Display a scatter plot cluster after transforming the data and using it to fit KMeans Cluster 
    
        Parameters:
                df (pandas.core.frame.DataFrame): Dataframe that will be used in the Pipeline and train the KMeans Cluster
                pipeline (sklearn.pipeline.Pipeline): Transform the Dataframe
                n_clusters (int): Number of clusters that the KMeans Cluster will have
        
        Returns:
                None    
    '''
    
    data = pd.DataFrame(pipeline.fit_transform(df), columns=['x', 'y'])
    
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, max_iter=300, verbose=False, random_state=1234)
    clusters = pd.DataFrame(kmeans.fit_predict(data), columns=['Cluster']) 
    
    clusters_data = pd.concat([data, clusters], axis=1)
    
    fig = px.scatter(clusters_data, x='x', y='y', color='Cluster')
    fig.show()

First, we import the KMeans class for the clustering and the Plotly library for the cluster visualization. Then we write the function, which takes as parameters the dataset for training the KMeans model, the pipeline for preprocessing, and the number of clusters for KMeans.

Internally, the function transforms the dataset according to the steps in the Pipeline, assuming that this transformation returns exactly two columns, which are named "x" and "y". It then creates the KMeans model, passing it the number of clusters specified in the function's parameters, so we can choose how many clusters to visualize. Finally, it plots a scatter graph of the clustering, in which each color is a cluster.

Now, let's visualize our clustering with 5 clusters.

In [None]:
Visualize_Cluster(df, pipe_normalizedscaled_pca, n_clusters=5)

Remember: almost all parameters of the KMeans model are fixed inside the function, mostly at their default values, and the only one we can change through the function is the number of clusters. We could therefore try to improve the clustering further by changing the other parameters of the KMeans model.

Let's see how our clustering would look with 10 clusters.

In [None]:
Visualize_Cluster(df, pipe_normalizedscaled_pca, n_clusters=10)

<a id="conclu"></a>
## Conclusion

In this first part, we started the project by learning some concepts and understanding how clustering works. In part two of the series, we will continue the project by learning the most common metrics used to validate a clustering and how to use these metrics to find the ideal number of clusters for this dataset.

Since that content is about metrics, we will have to get into mathematical equations, so I will explain many things in mathematical terms (which people usually dislike, but the subject requires it), and I will assume that you have at least basic knowledge of algebra.

<a id="part_two"></a>
<h2><b> Part Two </b></h2>

---

#### Coming soon...

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

pipe_normalizer = Pipeline([('normalization', Normalizer())])

In [None]:
data = pipe_normalizer.fit_transform(df)
data = pd.DataFrame(data, columns=df.columns)

def cluster_algorithm(n_clusters, dataset):
    kmean = KMeans(n_clusters=n_clusters, random_state=1234)
    labels = kmean.fit_predict(dataset)
    
    si = silhouette_score(dataset, labels, metric='euclidean', random_state=1234)
    db = davies_bouldin_score(dataset, labels)
    ch = calinski_harabasz_score(dataset, labels)
    
    return si, db, ch


n_clusters, si_list, db_list, ch_list = list(), list(), list(), list()

for i in range(2, 101):
    si, db, ch = cluster_algorithm(i, data)
    n_clusters.append(int(i))
    si_list.append(si)
    db_list.append(db)
    ch_list.append(ch)

metrics = pd.DataFrame(np.column_stack((n_clusters, si_list, db_list, ch_list)), columns=['N° Clusters', 'silhouette', 'davies_bouldin', 'calinski_harabasz'])

In [None]:
import plotly.express as px
# Silhouette
fig = px.line(metrics, x='N° Clusters', y='silhouette', title='Silhouette')
fig.show()

In [None]:
fig = px.line(metrics, x='N° Clusters', y='davies_bouldin', title='Davies Bouldin')
fig.show()

In [None]:
fig = px.line(metrics, x='N° Clusters', y='calinski_harabasz', title='Calinski Harabasz')
fig.show()

In [None]:
metrics[metrics['N° Clusters'] <= 8][['N° Clusters', 'silhouette']].sort_values(by='silhouette', ascending=False)

metrics[metrics['N° Clusters'] <= 8][['N° Clusters', 'davies_bouldin']].sort_values(by='davies_bouldin', ascending=True)

metrics[metrics['N° Clusters'] <= 8][['N° Clusters', 'calinski_harabasz']].sort_values(by='calinski_harabasz', ascending=False)

# n_cluster = 5
metrics[metrics['N° Clusters'] == 5].reset_index(drop=True)

In [None]:
dataset = np.random.rand(8950, 16)

si, db, ch = cluster_algorithm(5, dataset)

pd.DataFrame(data=[[5, si, db, ch]], columns=['N° Clusters', 'silhouette', 'davies_bouldin', 'calinski_harabasz'])

In [None]:
print('Metrics of the cluster with a normal dataset:')
display(metrics[metrics['N° Clusters'] == 5].reset_index(drop=True))
print('Metrics of the cluster with a random dataset:')
display(pd.DataFrame(data=[[5, si, db, ch]], columns=['N° Clusters', 'silhouette', 'davies_bouldin', 'calinski_harabasz']))

In [None]:
splited_df = np.array_split(df, 5)

section, si_list, db_list, ch_list = list(), list(), list(), list()

for i in range(5):
    si, sb, ch = cluster_algorithm(5, splited_df[i])
    section.append(i+1)
    si_list.append(si)
    db_list.append(sb)
    ch_list.append(ch)


pd.DataFrame(np.column_stack([section, si_list, db_list, ch_list]), columns=['Section N°', 'silhouette', 'davies_bouldin', 'calinski_harabasz'])

In [None]:
# data = pd.DataFrame(pipe_normalizer.fit_transform(df), columns=df.columns)
    
kmeans = KMeans(n_clusters=5, n_init=10, max_iter=300, verbose=False, random_state=1234)
labels = pd.DataFrame(kmeans.fit_predict(data), columns=['CLUSTER']) 
    
clusters_data = pd.concat([df, labels], axis=1)
clusters_data

In [None]:
clusters_data['CLUSTER'].value_counts()

In [None]:
fig = px.scatter(clusters_data, x='PURCHASES', y='PAYMENTS', color='CLUSTER')
fig.show()

In [None]:
fig = px.scatter(clusters_data, x='PURCHASES', y='CASH_ADVANCE', color='CLUSTER')
fig.show()

In [None]:
sns.pairplot(clusters_data, hue='CLUSTER', palette='bright')

In [None]:
clusters_data.groupby(by='CLUSTER').describe()

In [None]:
centroids = kmeans.cluster_centers_
centroids

In [None]:
for i in range(centroids.shape[1]):
    print(f'{clusters_data.columns[i]}:', '{:.4f}'.format(centroids[:, i].var()))

In [None]:
centroids_list = list()

for i in range(centroids.shape[1]):
    centroids_list.append(centroids[:, i].var())


centroids_var = pd.DataFrame(centroids_list, index=clusters_data.columns[:-1], columns=['Variance of the Centroids'])
centroids_var

In [None]:
significant_centroids_var = centroids_var[centroids_var['Variance of the Centroids'] > 0.009]
significant_centroids_var

In [None]:
from scipy import stats

stats.probplot(clusters_data['BALANCE'], plot=plt)
plt.show()

The dataset describes account usage over the last 6 months. Let's say that, at the start of these 6 months, the amount available in the account is VALUE, where VALUE = PURCHASES + BALANCE. PURCHASES is then the amount of money the account spent, and BALANCE is the money left over from VALUE.

In [None]:
balance = significant_centroids_var.index[0]

clusters_data[[balance, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: 

CLUSTER 1: a good amount left to spend

CLUSTER 2: the largest amount left to spend

CLUSTER 3: the smallest amount left to spend

CLUSTER 4: a medium amount left to spend

In [None]:
purchases = significant_centroids_var.index[1]

clusters_data[[purchases, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: their focus is on purchasing; they have a lot of money

CLUSTER 1: they purchase a good amount and still have a good amount left to spend

CLUSTER 2: they purchase very little; they have a lot of money

CLUSTER 3: they spend a considerable amount on purchases; they don't have much money

CLUSTER 4: they purchase extremely little; they have a considerable amount of money

In [None]:
cash_advance = significant_centroids_var.index[2]

clusters_data[[cash_advance, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: they almost never pay in advance

CLUSTER 1: they pay little in advance

CLUSTER 2: they pay a good amount in advance

CLUSTER 3: they almost never pay in advance

CLUSTER 4: the largest amount paid in advance

In [None]:
credit_limit = significant_centroids_var.index[3]

clusters_data[[credit_limit, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: high limit

CLUSTER 1: lowest limit

CLUSTER 2: high limit

CLUSTER 3: highest limit

CLUSTER 4: medium limit

In [None]:
payments = significant_centroids_var.index[4]

clusters_data[[payments, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: they pay a good amount of money; the group most likely to pay the bill in full

CLUSTER 1: they pay a considerable amount, but pay back almost none of what they spend

CLUSTER 2: they pay little and pay back almost none of what they spend

CLUSTER 3: they pay the least, but do more or less pay back what they spend

CLUSTER 4: they pay the most and tend to pay the whole bill

CLUSTER 0: their focus is on spending; they almost never pay in advance, but are the most likely to pay in full; high limit: the best customers

CLUSTER 1: a good amount left to spend; they pay a considerable amount; they pay little in advance; lowest limit

CLUSTER 2: they have money but don't spend it; they pay a good amount in advance but almost never pay the full bill; high limit: the group with the most defaulters

CLUSTER 3: they don't have much money but spend a good part of it; they almost never pay in advance; highest limit; they pay the least, but do more or less pay back what they spend: the worst customers

CLUSTER 4: they have a considerable amount of money but spend extremely little; the largest amount paid in advance and the best payers; medium limit: customers with high potential

In [None]:
clusters_data[['PRC_FULL_PAYMENT', 'CLUSTER']].groupby(by='CLUSTER').describe()

In [None]:
minimum_payments = significant_centroids_var.index[5]

clusters_data[[minimum_payments, 'CLUSTER']].groupby(by='CLUSTER').describe()

CLUSTER 0: low minimum payment

CLUSTER 1: extremely high minimum payment

CLUSTER 2: good minimum payment

CLUSTER 3: the lowest minimum payment

CLUSTER 4: medium minimum payment

In [None]:
clusters_data['CLUSTER'].value_counts()
