<a href="https://colab.research.google.com/github/cristiandarioortegayubro/BA/blob/main/cl_km_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![logo](https://github.com/cristiandarioortegayubro/BA/blob/main/dba.png?raw=true)

![](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## **Loading the required libraries**

### **For data handling**

In [1]:
import pandas as pd
import numpy as np

### **For plotting**

In [2]:
import plotly.express as px
import plotly.graph_objects as go

### **For data preprocessing and modeling**

In [3]:
import sklearn
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import sklearn.metrics as metrics
from sklearn.metrics import silhouette_score

## **Data extraction: creating the DataFrame**

In [4]:
datos = "https://raw.githubusercontent.com/cristiandarioortegayubro/BA/main/Datasets/Clientes.csv"

In [5]:
clientes = pd.read_csv(datos)
clientes

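| | ID | Trabajo | Edad | Salario | Compra |
|---|---|---|---|---|---|
| 0 | 15624510 | 1 | 19 | 19000 | No |
| 1 | 15810944 | 1 | 35 | 20000 | No |
| 2 | 15668575 | 0 | 26 | 43000 | No |
| 3 | 15603246 | 0 | 27 | 57000 | No |
| 4 | 15804002 | 1 | 19 | 76000 | No |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | 0 | 46 | 41000 | Si |
| 396 | 15706071 | 1 | 51 | 23000 | Si |
| 397 | 15654296 | 0 | 50 | 20000 | Si |
| 398 | 15755018 | 1 | 36 | 33000 | No |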


The DataFrame contains 5 variables and 400 observations.

The variables are:
- **ID:** the customer's identifier
- **Trabajo:** 1 if the customer is an employee, 0 if self-employed
- **Edad:** the customer's age
- **Salario:** the customer's estimated salary
- **Compra:** `Si` if the customer has made a purchase, `No` otherwise

To develop the algorithm, however, the problem is simplified to two dimensions by keeping only the age (`Edad`) and salary (`Salario`) variables.

## **Dropping variables**

In [6]:
clientes.info()  # show the data types of the DataFrame columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       400 non-null    int64 
 1   Trabajo  400 non-null    int64 
 2   Edad     400 non-null    int64 
 3   Salario  400 non-null    int64 
 4   Compra   400 non-null    object
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [7]:
clientes = clientes.drop(columns=["ID", "Trabajo", "Compra"])  # drop the variables that are not relevant
clientes  # show the DataFrame

| | Edad | Salario |
|---|---|---|
| 0 | 19 | 19000 |
| 1 | 35 | 20000 |
| 2 | 26 | 43000 |
| 3 | 27 | 57000 |
| 4 | 19 | 76000 |
| ... | ... | ... |
| 395 | 46 | 41000 |
| 396 | 51 | 23000 |
| 397 | 50 | 20000 |
| 398 | 36 | 33000 |
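
K-means relies on Euclidean distance, so a variable measured in tens of thousands (`Salario`) can outweigh one measured in tens (`Edad`). The notebook imports `scale` but clusters the raw values. The following is only a sketch of what standardization would change, using the first five rows of the preview and a NumPy z-score (population standard deviation, the same convention `sklearn.preprocessing.scale` uses). It is not the procedure that produced the results in this notebook.

```python
import numpy as np

# First five rows of the preview above: columns are (Edad, Salario).
X = np.array([
    [19, 19000],
    [35, 20000],
    [26, 43000],
    [27, 57000],
    [19, 76000],
], dtype=float)

# Per-feature contribution to the squared distance between customers 0 and 1.
diff_sq = (X[0] - X[1]) ** 2
print("raw contribution (Edad, Salario):", diff_sq)

# Z-score standardization: subtract the column mean and divide by the
# column standard deviation (ddof=0, i.e. population standard deviation).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

diff_sq_std = (X_std[0] - X_std[1]) ** 2
print("standardized contribution (Edad, Salario):", diff_sq_std)
```

On the raw values the salary term dominates the distance even though these two customers differ by only 1000 in salary and by 16 years in age. After standardization the age difference carries more weight for this pair.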


## **Number of clusters**

In [8]:
clusters = pd.DataFrame()
inertia = []

In [9]:
clusters["cluster_range"] = range(1, 10)

In [10]:
for k in clusters["cluster_range"]:
    kmeans = cluster.KMeans(n_clusters=k, random_state=8).fit(clientes)
    inertia.append(kmeans.inertia_)

In [11]:
clusters["inertia"] = inertia

In [12]:
clusters.inertia = round(clusters.inertia, 4)

In [13]:
clusters.head(10)

| | cluster_range | inertia |
|---|---|---|
| 0 | 1 | 463878500000.0 |
| 1 | 2 | 165197400000.0 |
| 2 | 3 | 59521270000.0 |
| 3 | 4 | 33492340000.0 |
| 4 | 5 | 19983400000.0 |
| 5 | 6 | 14684270000.0 |
| 6 | 7 | 10122510000.0 |
| 7 | 8 | 8096856000.0 |
| 8 | 9 | 6340387000.0 |


### **Plotting the optimal number of clusters**

In [14]:
fig = px.line(clusters,
              x = "cluster_range",
              y = "inertia",
              markers = True,
              title = "Elbow method",
              template = "gridon",
              labels = {"cluster_range":"clusters"})
fig.show()
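
Reading the elbow off a plot is a visual judgment. As a complementary sketch, the relative drop in inertia at each step can be computed from the values reported in the table above. The list name `drops` is introduced here for illustration only, and no threshold is implied. The choice of k remains a judgment call.

```python
# Inertia values copied from the table above, for k = 1..9.
inertia = [
    463878500000.0, 165197400000.0, 59521270000.0, 33492340000.0,
    19983400000.0, 14684270000.0, 10122510000.0, 8096856000.0, 6340387000.0,
]

# Relative reduction in inertia obtained by moving from k-1 to k clusters.
drops = []
for k in range(2, len(inertia) + 1):
    previous, current = inertia[k - 2], inertia[k - 1]
    drop = (previous - current) / previous
    drops.append(drop)
    print(f"k={k}: inertia falls by {drop:.1%}")
```

Large drops indicate that an extra cluster still explains a substantial share of the remaining within-cluster variance. Once the drops level off, additional clusters add little.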

# **Evaluating the algorithm**

## **K-means algorithm**

In [15]:
clientes.head()

| | Edad | Salario |
|---|---|---|
| 0 | 19 | 19000 |
| 1 | 35 | 20000 |
| 2 | 26 | 43000 |
| 3 | 27 | 57000 |
| 4 | 19 | 76000 |


In [16]:
km = cluster.KMeans(n_clusters = 4, n_init = 20, random_state = 123)
km

KMeans(n_clusters=4, n_init=20, random_state=123)

In [17]:
km.fit(clientes)

KMeans(n_clusters=4, n_init=20, random_state=123)

In [18]:
centroids = km.cluster_centers_
labels = km.labels_

In [19]:
centroids

array([[3.68888889e+01, 5.37407407e+04],
       [4.24788732e+01, 1.26338028e+05],
       [3.59701493e+01, 8.04850746e+04],
       [3.72643678e+01, 2.68735632e+04]])

In [20]:
centroids = pd.DataFrame(centroids, columns=['Edad', 'Salario'])
centroids

| | Edad | Salario |
|---|---|---|
| 0 | 36.888889 | 53740.740741 |
| 1 | 42.478873 | 126338.028169 |
| 2 | 35.970149 | 80485.074627 |
| 3 | 37.264368 | 26873.563218 |
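
Each label in `km.labels_` is the index of the centroid closest to that observation. The sketch below reproduces that assignment step by hand for the first five customers, using the centroid coordinates printed above. The array names `points`, `distances` and `assigned` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Centroids copied from the output above: columns are (Edad, Salario).
centroids = np.array([
    [36.888889, 53740.740741],
    [42.478873, 126338.028169],
    [35.970149, 80485.074627],
    [37.264368, 26873.563218],
])

# First five customers from the DataFrame preview.
points = np.array([
    [19, 19000],
    [35, 20000],
    [26, 43000],
    [27, 57000],
    [19, 76000],
], dtype=float)

# Pairwise Euclidean distances, shape (n_points, n_centroids).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the index of its nearest centroid.
assigned = distances.argmin(axis=1)
print(assigned)  # [3 3 0 0 2]
```

The result matches the first five entries of the `labels` array returned by the fitted model.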


In [21]:
labels

array([3, 3, 0, 0, 2, 0, 2, 1, 3, 0, 2, 0, 2, 3, 2, 2, 3, 3, 3, 3, 3, 0,
       0, 3, 3, 3, 3, 3, 0, 3, 2, 1, 3, 0, 2, 3, 3, 0, 2, 3, 3, 0, 1, 3,
       2, 3, 2, 0, 1, 2, 3, 0, 2, 3, 0, 0, 0, 2, 3, 1, 3, 2, 0, 1, 2, 0,
       3, 2, 0, 2, 2, 3, 3, 1, 3, 1, 0, 3, 2, 3, 2, 0, 0, 2, 0, 1, 0, 2,
       2, 0, 2, 1, 3, 3, 2, 0, 3, 1, 2, 3, 2, 0, 2, 1, 3, 2, 3, 2, 2, 2,
       2, 2, 0, 0, 2, 0, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 3, 3, 2, 0, 3,
       2, 2, 0, 0, 2, 1, 0, 3, 2, 2, 0, 2, 3, 2, 2, 3, 0, 2, 3, 0, 2, 0,
       0, 3, 0, 2, 3, 1, 2, 2, 3, 3, 2, 2, 0, 2, 1, 0, 2, 1, 1, 0, 2, 3,
       0, 3, 3, 3, 3, 2, 1, 0, 0, 0, 2, 0, 2, 3, 2, 3, 0, 2, 2, 0, 2, 3,
       2, 3, 3, 2, 1, 2, 2, 0, 1, 1, 1, 3, 2, 1, 0, 0, 0, 1, 0, 2, 2, 1,
       2, 2, 1, 2, 0, 0, 1, 1, 2, 2, 1, 0, 1, 2, 1, 2, 0, 2, 2, 1, 1, 0,
       2, 1, 2, 1, 0, 1, 0, 2, 3, 0, 1, 1, 0, 2, 2, 0, 2, 1, 2, 1, 1, 2,
       2, 1, 2, 2, 1, 0, 1, 2, 0, 1, 3, 2, 2, 2, 3, 3, 2, 0, 2, 3, 1, 2,
       0, 1, 2, 2, 1, 2, 3, 2, 0, 0, 2, 1, 2, 1, 3,

## **Metrics**

### **Calinski-Harabasz**

In [22]:
metrics.calinski_harabasz_score(clientes, labels)

1698.0186456742101

### **Silhouette**

In [23]:
metrics.silhouette_score(clientes, labels)

0.6065989841357814

### **Davies-Bouldin**

In [24]:
metrics.davies_bouldin_score(clientes, labels)

0.4605605033920007
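
The three scores are easier to interpret when compared across several values of k: Calinski-Harabasz and silhouette are better when higher (silhouette lies in [-1, 1]), while Davies-Bouldin is better when lower. The sketch below runs that comparison on synthetic data generated with `make_blobs`. The synthetic data is an assumption made so the example is self-contained and is not the `Clientes` dataset. The dictionary `results` is likewise illustrative.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): four blobs in two dimensions.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = {
        # Higher is better.
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels_k),
        # Ranges from -1 to 1; higher is better.
        "silhouette": metrics.silhouette_score(X, labels_k),
        # Non-negative; lower is better.
        "davies_bouldin": metrics.davies_bouldin_score(X, labels_k),
    }

for k, scores in results.items():
    print(k, {name: round(value, 3) for name, value in scores.items()})
```

Applied to the notebook's data, the same loop would take `clientes` (before the `cluster` column is added) in place of `X`.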

In [25]:
clientes['cluster'] = labels

In [26]:
clientes

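| | Edad | Salario | cluster |
|---|---|---|---|
| 0 | 19 | 19000 | 3 |
| 1 | 35 | 20000 | 3 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 2 |
| ... | ... | ... | ... |
| 395 | 46 | 41000 | 0 |
| 396 | 51 | 23000 | 3 |
| 397 | 50 | 20000 | 3 |
| 398 | 36 | 33000 | 3 |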
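
With the labels attached, a per-cluster summary helps interpret the groups. At convergence each K-means centroid is the mean of its members, so a `groupby` on the full data should closely reproduce `km.cluster_centers_`. The sketch below uses only the nine rows visible in the preview above, so its means are illustrative and will not match the full-data centroids. The output column names `n`, `mean_age` and `mean_salary` are hypothetical.

```python
import pandas as pd

# The nine rows visible in the preview above (an excerpt, not the full data).
sample = pd.DataFrame({
    "Edad":    [19, 35, 26, 27, 19, 46, 51, 50, 36],
    "Salario": [19000, 20000, 43000, 57000, 76000, 41000, 23000, 20000, 33000],
    "cluster": [3, 3, 0, 0, 2, 0, 3, 3, 3],
})

# Size and mean of each variable per cluster.
summary = sample.groupby("cluster").agg(
    n=("Edad", "size"),
    mean_age=("Edad", "mean"),
    mean_salary=("Salario", "mean"),
)
print(summary)
```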


## **Plot**

### **Plotly**

In [27]:
fig = go.Figure([go.Scatter(x = clientes.Edad, 
                            y = clientes.Salario,
                            mode = "markers",
                            name = "Clusters",
                            marker = dict(color = clientes.cluster,  # color each point by its assigned cluster
                                          colorscale = 'bluered',
                                          showscale = False)),

                 go.Scatter(x = centroids.Edad,
                            y = centroids.Salario,
                            mode = "markers",
                            name = "Centroid",
                            marker_color = "orange",
                            marker = dict(size = 12)),
                 ])

fig.update_layout(template =    "gridon",
                  title =       "Age and Salary",
                  yaxis_title = "Salary",
                  xaxis_title = "Age")

fig.show()