In [1]:
#' # K-means Clustering Analysis
#'
#' This notebook performs K-means clustering on specific columns of a dataset and visualizes the results.
#' The analysis includes data preprocessing, determining the optimal number of clusters, and visualizing
#' the clustering results.
#'
#' ## Libraries
#'
#' The following libraries are required:
#' - `ggplot2`: For data visualization.
#' - `dplyr`: For data manipulation.
#'
#' ## Data Loading
#'
#' The dataset is loaded from a CSV file named `datos_limpios.csv`.
#'
#' ## Functions
#'
#' ### `categorizar_kmeans`
#'
#' This function applies K-means clustering to a specified column of a data frame and categorizes the data into clusters.
#'
#' #### Parameters
#' - `data`: The data frame containing the data.
#' - `columna`: The name of the column to which K-means clustering will be applied.
#' - `k`: The number of clusters to create (default is 3).
#'
#' #### Returns
#' - The data frame with an additional column indicating the cluster category for the specified column.
#'
#' ### `sum_two_numbers`
#'
#' This function calculates the sum of two numbers.
#'
#' #### Parameters
#' - `a`: A numeric value representing the first number.
#' - `b`: A numeric value representing the second number.
#'
#' #### Returns
#' - The sum of the two numbers.
#'
#' #### Examples
#' ```r
#' sum_two_numbers(3, 5)
#' sum_two_numbers(-2, 7)
#' ```
#'
#' #### Export
#' - This function is exported for use in other scripts or packages.
#'
#' ```r
#' sum_two_numbers <- function(a, b) {
#'   return(a + b)
#' }
#' ```
#'
#' ## Main Script
#'
#' The script applies the `categorizar_kmeans` function to a list of specified columns and prints the resulting data frame. Optionally, it visualizes the distribution of categories for each column using histograms.
#'
#' ### Columns to be Clustered
#'
#' The columns to which K-means clustering is applied are:
#' - `Asociacion`
#' - `duracion_segundos`
#' - `impresiones`
#' - `vistas`
#' - `suscriptores`
#'
#' ### Visualization
#'
#' Histograms are generated to visualize the distribution of categories for each column.

In [2]:
# Load required libraries
library(ggplot2)
library(dplyr)
library(scales)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [1]:
install.packages("cluster")
library(cluster)


Installing package into ‘/home/saratrasv/R/x86_64-pc-linux-gnu-library/4.3’
(as ‘lib’ is unspecified)



In [2]:
# Load the cleaned dataset from a CSV file
df <- read.csv("../../datasets/datos_limpios.csv")

# Min-max normalize a numeric vector to the [0, 1] range
normalizar <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Cluster one column with K-means and print the value range of each cluster
obtener_rangos <- function(data, columna, k = 3) {
  set.seed(123) # Reproducibility

  # Normalize the column
  columna_norm <- normalizar(data[[columna]])

  # Apply K-means
  kmeans_result <- kmeans(columna_norm, centers = k)

  # Compute the silhouette score (silhouette() comes from the cluster package)
  sil_score <- silhouette(kmeans_result$cluster, dist(columna_norm))
  promedio_silhouette <- mean(sil_score[, 3])

  # Order the clusters by the mean of the original (non-normalized) values
  orden_clusters <- order(tapply(data[[columna]], kmeans_result$cluster, mean))

  # Ordered labels: Bajo (low), Medio (medium), Alto (high)
  etiquetas <- c("Bajo", "Medio", "Alto")
  # The labels are fixed, so k must match their number
  stopifnot(k == length(etiquetas))

  # Print the actual value range of each ordered category
  for (i in 1:k) {
    cluster_index <- orden_clusters[i]
    valores_cluster <- data[[columna]][kmeans_result$cluster == cluster_index]
    rango_min <- min(valores_cluster)
    rango_max <- max(valores_cluster)
    cat("Rango", etiquetas[i], ":", rango_min, "-", rango_max, "\n")
  }

  # Print the average silhouette score
  cat("Silhouette Score Promedio:", promedio_silhouette, "\n")
}

# Apply the function and print the ranges for each selected column
columnas <- c("duracion_segundos", "impresiones", "vistas", "suscriptores")

for (columna in columnas) {
  cat("\nRangos para", columna, ":\n")
  obtener_rangos(df, columna)
}


Rangos para duracion_segundos :
Rango Bajo : 24 - 57 
Rango Medio : 195 - 323 
Rango Alto : 399 - 443 
Silhouette Score Promedio: 0.8006478 

Rangos para impresiones :
Rango Bajo : 235 - 2707 
Rango Medio : 3450 - 6190 
Rango Alto : 10523 - 11458 
Silhouette Score Promedio: 0.7605541 

Rangos para vistas :
Rango Bajo : 38 - 163 
Rango Medio : 212 - 278 
Rango Alto : 394 - 597 
Silhouette Score Promedio: 0.6596161 

Rangos para suscriptores :
Rango Bajo : 0 - 1 
Rango Medio : 2 - 3 
Rango Alto : 4 - 5 
Silhouette Score Promedio: 0.6776471 
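
The average silhouette scores printed above are all computed for `k = 3`. As a minimal sketch of how the same score could be compared across several candidate values of `k`, which is one common way to pick the number of clusters, consider the following. The helper `silueta_por_k` and the toy vector `x` are hypothetical and not part of the original notebook, and `nstart = 25` is an added assumption to make `kmeans` less sensitive to its random start.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: average silhouette width for each candidate k
silueta_por_k <- function(x, ks = 2:5) {
  set.seed(123)  # Reproducibility
  # Same min-max normalization as normalizar() in the notebook
  x_norm <- (x - min(x)) / (max(x) - min(x))
  d <- dist(x_norm)
  anchos <- sapply(ks, function(k) {
    km <- kmeans(x_norm, centers = k, nstart = 25)
    mean(silhouette(km$cluster, d)[, "sil_width"])
  })
  names(anchos) <- ks
  anchos
}

# Toy data with three well-separated groups (illustrative only)
x <- c(1, 2, 3, 50, 51, 52, 100, 101, 102)
anchos <- silueta_por_k(x, ks = 2:4)
print(anchos)

# Candidate k with the highest average silhouette width
mejor_k <- as.integer(names(anchos)[which.max(anchos)])
print(mejor_k)
```

A higher average silhouette width indicates tighter, better-separated clusters, so the candidate with the largest value is a reasonable starting point rather than a definitive answer.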


<div style="background-color: #FFFFE0; font-family: 'Times New Roman'; padding: 10px;">    

## discretize_data
</div>

<div style="background-color: #FFFFE0; font-family: 'Times New Roman'; padding: 10px;">    
    <p>
        The code above performs the following tasks:
    </p>
    <ul>
        <li>Loads the cleaned dataset from a CSV file named <code>datos_limpios.csv</code>.</li>
        <li>Defines a function <code>normalizar</code> that min-max normalizes a column of data to the range [0, 1].</li>
        <li>Defines a function <code>obtener_rangos</code> that normalizes a specified column, applies K-means clustering to it (<code>k = 3</code> by default), computes the average silhouette score, and maps the clusters to the ordered labels Bajo, Medio, Alto (low, medium, high) according to each cluster's mean in the original data.</li>
        <li>Applies the <code>obtener_rangos</code> function to a list of specified columns (<code>duracion_segundos</code>, <code>impresiones</code>, <code>vistas</code>, <code>suscriptores</code>) and prints the actual minimum and maximum of each ordered category, together with the average silhouette score.</li>
    </ul>
</div>
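
The steps listed above print ranges but do not attach a label to each row. A minimal sketch of how the ordered labels could be returned as a column-ready factor is shown here. The helper name `etiquetar_clusters`, the toy vector, and the use of `nstart = 25` are assumptions for illustration and do not appear in the original notebook.

```r
# Hypothetical helper: returns one ordered label per value instead of printing ranges
etiquetar_clusters <- function(x, etiquetas = c("Bajo", "Medio", "Alto")) {
  set.seed(123)  # Reproducibility
  k <- length(etiquetas)

  # Min-max normalization, as in normalizar()
  x_norm <- (x - min(x)) / (max(x) - min(x))
  km <- kmeans(x_norm, centers = k, nstart = 25)

  # Cluster ids sorted by the mean of the original values (lowest first)
  orden <- order(tapply(x, km$cluster, mean))

  # Position of each value's cluster in that ordering: 1 = lowest mean
  posicion <- match(km$cluster, orden)

  factor(etiquetas[posicion], levels = etiquetas, ordered = TRUE)
}

# Toy data with three well-separated groups (illustrative only)
x <- c(1, 2, 3, 50, 51, 52, 100, 101, 102)
categorias <- etiquetar_clusters(x)
print(data.frame(x = x, categoria = categorias))
```

Returning an ordered factor keeps the low-to-high ordering available to later steps, for example when sorting or plotting the categories.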

In [None]:
library(arules)
# Read the values to discretize into a data frame
my_values <- read.csv("../../datasets/for_ranges.csv")

# Discretize individual columns (calls are currently commented out)
#discretize(my_values$duracion_promedio_vistas, method = "interval")
#discretize(my_values$vistas)
#discretize(my_values$tiempo_reproduccion_horas, method = "interval")
#discretize(my_values$suscriptores, method = "interval")
#discretize(my_values$impresiones, method = "interval")
#discretize(my_values$ctr, method = "interval")
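
The commented-out calls above request interval-based discretization. As a rough illustration of what equal-width binning does, the sketch below uses base R `cut()` as a stand-in rather than `arules::discretize`, so it runs without extra packages. The toy vector and the reuse of the Bajo/Medio/Alto labels are assumptions for illustration only.

```r
# Toy values (illustrative only, not taken from for_ranges.csv)
valores <- c(0, 1, 2, 5, 9, 10)

# Split the observed range into 3 intervals of equal width and label them
categorias_ancho <- cut(
  valores,
  breaks = 3,
  labels = c("Bajo", "Medio", "Alto"),
  ordered_result = TRUE
)

print(data.frame(valor = valores, categoria = categorias_ancho))
print(table(categorias_ancho))
```

Unlike K-means, equal-width bins depend only on the minimum and maximum of the column, so a few extreme values can leave most observations in a single category.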