### Selecting Clustering Methods

Clustering is an unsupervised learning technique that segments a dataset into groups, or clusters, based on similarities between observations. In this project, the objective of clustering is to identify groups of customers with similar behaviors, providing key insights for decision-making and the design of personalized marketing strategies. This segmentation will help reveal purchasing patterns, platform interaction, and potential customer categories.

The cleaned dataset presents the following relevant characteristics:

**Dataset Size:** It contains exactly 55,002 records, representing a sufficiently large sample to derive representative clusters.

**Variable Distributions:**

- **Symmetrical:** `dias_primera_compra` and `info_perfil`. These columns behave consistently across customers, which makes them well suited to distance-based comparisons such as those behind hierarchical groupings.

- **Asymmetrical:** `n_clicks`, `n_visitas`, `monto_compras`, and `monto_descuentos`. These columns are skewed, with long tails of extreme values that point to highly active customers or those with high purchasing power.

**Outliers:** Present in `n_clicks`, `n_visitas`, `monto_compras`, and `monto_descuentos`, where they may represent exceptional customers such as frequent shoppers or high spenders. There are no outliers in `dias_primera_compra` or `info_perfil`, consistent with their symmetrical distributions.
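The distribution and outlier checks described above could be reproduced along these lines. This is a minimal sketch on synthetic data: the column names come from the dataset, but the generated values, the sample size, and the `iqr_outlier_count` helper are assumptions made for illustration, and the 1.5 × IQR rule is only one common outlier convention.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # synthetic stand-in, not the project's 55,002 records

# Hypothetical values: two roughly symmetric columns and two right-skewed ones
df = pd.DataFrame({
    "dias_primera_compra": rng.normal(180, 30, n),
    "info_perfil": rng.normal(200, 25, n),
    "n_clicks": rng.lognormal(3.0, 1.0, n),
    "monto_compras": rng.lognormal(6.0, 1.2, n),
})

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (illustrative helper)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# One row per column: sample skewness and number of IQR outliers
summary = pd.DataFrame({
    "skewness": df.skew(),
    "iqr_outliers": df.apply(iqr_outlier_count),
})
print(summary.round(2))
```

Skewness near zero suggests a symmetrical column, while a clearly positive value together with many flagged points suggests a long right tail.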

Based on these characteristics, the following sections give the theoretical justification for selecting four clustering methods suited to exploring meaningful patterns in this data.

### Exploration of Clustering Methods

The following describes the methods considered, along with their advantages, disadvantages, and justification for inclusion or exclusion in this project.

##### K-Means

**Description:** Divides the data into k clusters by minimizing the within-cluster variance, that is, the sum of squared distances from each point to its cluster centroid. It relies on Euclidean distance, making it particularly suitable for compact and spherical clusters.

**Advantages:**

- Efficient for large datasets due to its computational simplicity.

- Easy to interpret and visualize.

**Disadvantages:**

- Sensitive to outliers, which can skew centroids toward extreme values.

- Assumes clusters are roughly spherical and similar in size, which may not hold for skewed columns like `monto_compras` and `monto_descuentos`.

**Justification:** Included as a baseline method due to its simplicity and efficiency. Although sensitive to outliers, it provides a reference point for evaluating other methods.
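As an illustration of how such a baseline might be set up, the sketch below runs K-Means with scikit-learn on synthetic data. The data, the `log1p` transform, the standardization step, and the range of k values are assumptions chosen for the example and are not taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for two skewed columns (e.g. n_clicks, monto_compras)
X_raw = np.vstack([
    rng.lognormal(mean=[2.0, 4.0], sigma=0.3, size=(200, 2)),
    rng.lognormal(mean=[4.0, 6.0], sigma=0.3, size=(200, 2)),
    rng.lognormal(mean=[3.0, 8.0], sigma=0.3, size=(200, 2)),
])

# log1p compresses the long right tail; scaling gives both columns equal
# weight in the Euclidean distance that K-Means relies on
X = StandardScaler().fit_transform(np.log1p(X_raw))

# Within-cluster sum of squares (inertia) for several candidate k values
inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

# Fit the chosen model; k = 3 is assumed here because the toy data has 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_
print(inertias)
print(np.bincount(labels))
```

Plotting `inertias` against k gives the usual elbow curve. Transforming the skewed columns before fitting is one way to limit the pull of extreme values on the centroids.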

##### DBSCAN

**Description:** Detects clusters based on density and classifies scattered points as noise. It is particularly useful for identifying arbitrarily shaped clusters.

**Advantages:**

- Naturally handles outliers by classifying them as noise.

- Does not require predefining the number of clusters, allowing for greater exploratory flexibility.

**Disadvantages:**

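- Sensitive to the epsilon (neighborhood radius) and minimum-points parameters, which must be carefully tuned: a radius that is too large merges distinct clusters, while one that is too small fragments them or labels most points as noise.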

- Limited scalability for very large datasets, though manageable for this project.

**Justification:** Included for its ability to handle non-spherical distributions and robustness against outliers, making it especially relevant for `n_clicks`, `n_visitas`, and other skewed columns.
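The parameter sensitivity noted above is often handled with a k-distance heuristic. The following sketch uses scikit-learn on synthetic data. The blobs, the `min_samples` value, and the choice of the 90th percentile as `eps` are illustrative assumptions; in practice `eps` is usually read off the "knee" of the sorted k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two dense groups plus a few scattered points
dense_a = rng.normal([0, 0], 0.3, size=(150, 2))
dense_b = rng.normal([5, 5], 0.3, size=(150, 2))
scattered = rng.uniform(-3, 8, size=(20, 2))
X = StandardScaler().fit_transform(np.vstack([dense_a, dense_b, scattered]))

min_samples = 5  # assumed value; DBSCAN counts the point itself

# k-distance: distance from each point to its min_samples-th neighbor
# (the query point is included as its own first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Crude stand-in for reading the knee of the k-distance curve
eps = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Points labeled `-1` are the ones DBSCAN treats as noise, which is how the method sets outliers aside instead of forcing them into a cluster.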

##### Hierarchical Clustering (Agglomerative)

**Description:** Builds a hierarchy of clusters by iteratively merging observations into larger groups. This allows relationships between data points to be visualized through a dendrogram.

**Advantages:**

- Provides a visual representation (dendrogram) that facilitates hierarchical interpretation.

- Does not require predefining the number of clusters, allowing different levels of granularity to be explored.

**Disadvantages:**

- Computationally expensive for large datasets, since time and memory grow at least quadratically with the number of observations, though feasible with a representative sample.

**Justification:** Included for its interpretability and ability to analyze hierarchical structures, particularly useful for symmetrical columns like `dias_primera_compra` and `info_perfil`, which may reflect temporal or profile patterns.
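A sample-based workflow of the kind mentioned above might look like the following sketch, which uses SciPy's hierarchical clustering routines. The synthetic data, the sample size of 500, and the Ward linkage are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the full dataset: three groups of 2,000 points
centers = ([0, 0], [8, 0], [4, 7])
X_full = np.vstack([rng.normal(c, 1.0, size=(2000, 2)) for c in centers])

# Work on a random sample to keep the pairwise computation manageable
sample_idx = rng.choice(len(X_full), size=500, replace=False)
X_sample = StandardScaler().fit_transform(X_full[sample_idx])

# Build the merge tree once (Ward linkage is an assumed choice)
Z = linkage(X_sample, method="ward")

# Cut the same tree at two levels of granularity
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_6 = fcluster(Z, t=6, criterion="maxclust")
print(np.unique(labels_3, return_counts=True))
print(np.unique(labels_6, return_counts=True))
```

Because both cuts come from the same tree, each of the finer clusters sits entirely inside one of the coarser ones. Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding dendrogram when matplotlib is available.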

##### Gaussian Mixture Models (GMM)

**Description:** Models the data as a mixture of Gaussian distributions, assigning each point a probability of belonging to each cluster.

**Advantages:**

- Flexible in handling elliptical or overlapping clusters.

- Provides probabilities, allowing for more detailed interpretations of cluster membership.

**Disadvantages:**

- Requires specifying the number of clusters beforehand.

- Sensitive to outliers, which can distort the fitted Gaussian components.

**Justification:** Included for its flexibility in modeling complex distributions, especially useful for columns like `monto_compras` and `monto_descuentos`, which exhibit diverse behavioral patterns.
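Since GMM needs the number of components up front, one common approach is to fit several candidates and compare them with an information criterion. The sketch below does this with scikit-learn on synthetic data. The data, the candidate range, and the use of BIC are assumptions for illustration, not the project's procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)

# Synthetic data: two elongated, partly overlapping Gaussian groups
a = rng.multivariate_normal([0, 0], [[3.0, 2.5], [2.5, 3.0]], size=300)
b = rng.multivariate_normal([4, 0], [[1.0, -0.6], [-0.6, 1.0]], size=300)
X = np.vstack([a, b])

# Fit one model per candidate number of components
candidates = list(range(1, 6))
models = [
    GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    for k in candidates
]

# Lower BIC is better; it trades goodness of fit against model complexity
bics = [m.bic(X) for m in models]
best = models[int(np.argmin(bics))]

# Soft assignments: one membership probability per point and component
probs = best.predict_proba(X)
hard_labels = probs.argmax(axis=1)
print(dict(zip(candidates, np.round(bics, 1))))
print("selected components:", best.n_components)
```

The rows of `probs` sum to 1, so a point lying between two groups shows up with split probabilities instead of a single hard label.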

##### OPTICS

**Description:** Similar to DBSCAN but capable of identifying hierarchical structures in clusters with variable density.

**Advantages:**

- Detects clusters with variable densities, offering greater precision in some cases.

- Handles outliers.

**Disadvantages:**

- Less interpretable than DBSCAN.

- Computationally expensive for large datasets.

**Justification:** Excluded due to its complexity and lower interpretability compared to DBSCAN.

##### BIRCH

**Description:** Designed for large datasets, it uses a tree structure to summarize data before clustering.

**Advantages:**

- Scalable for large datasets.

- Handles outliers and reduces the volume of data by summarizing points into compact subclusters.

**Disadvantages:**

- Less precise for complex clusters.

**Justification:** Excluded due to lower interpretability and precision compared to other methods.

##### Bisecting K-Means

**Description:** A variant of K-Means that starts from a single cluster and repeatedly splits one cluster in two until k clusters are reached.

**Advantages:**

- Can be cheaper than traditional K-Means when the number of clusters is large, since each split works on only a subset of the data.

- Scalable for large datasets.

**Disadvantages:**

- Less interpretable than K-Means.

**Justification:** Excluded due to the simplicity and effectiveness of traditional K-Means in this context.

### Selected Methods

The methods selected for this project are:

**K-Means:** For its efficiency, ease of interpretation, and applicability as a baseline method.

**DBSCAN:** For its robustness against outliers and ability to handle non-spherical distributions.

**Gaussian Mixture Models (GMM):** For its flexibility in modeling non-spherical and complex distributions.

**Hierarchical Clustering (Agglomerative):** For its interpretability and ability to explore hierarchical structures, particularly in symmetrical variables.

These methods represent a balance between efficiency, robustness, and interpretability, aligning with the dataset's characteristics and the project's objectives.