Check whether the notebook is running in a Google Colab environment and, if so, install IntraSOM with pip.

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  # Install Intrasom
  !pip install intrasom


If running in Colab, download the sample input files from GitHub and create the folders used in this example.

In [None]:
import os

if IN_COLAB:
  os.makedirs('data', exist_ok=True)
  !wget 'https://github.com/InTRA-USP/IntraSOM/raw/main/examples/data/Animais_missing.xlsx' -P ./data
  !wget 'https://github.com/InTRA-USP/IntraSOM/raw/main/examples/data/Animais_proj.xlsx' -P ./data
  path = 'Results'
  os.makedirs(path, exist_ok=True)
  os.makedirs('Plots', exist_ok=True)
  os.makedirs('Plots/U_matrix', exist_ok=True)

in_file_path = 'data/Animais_missing.xlsx'

<div style="text-align: center; padding: 0px;">
  <img src="https://github.com/InTRA-USP/IntraSOM/blob/main/examples/images/Logo_fundo_branco.svg?raw=1" style="max-width: 700px;">
</div>

<center><h1>Library Application Examples</h1></center>
<div style="text-align: justify">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IntraSOM is a fully Python-based library for Self-Organizing Maps (SOM) developed by the RDI Center ‘Integrated Technology for Rock and Fluid Analysis’ (InTRA) (https://www.usp.br/intra/). IntraSOM was developed using Object-Oriented Programming and includes a hexagonal lattice, toroidal topology, and a wide range of visualization tools to improve the analysis, exploration, and classification of complex n-dimensional datasets. The library includes efficient clustering algorithms and can process datasets with missing data during training. The main goal of this implementation is to popularize SOM and other unsupervised algorithms and make them accessible to researchers and professionals in several fields of science and industry. The Python-based implementation presented here is an expandable framework that connects easily with other machine learning libraries and algorithms.
</div>


# Imports
***

In [None]:
# IntraSOM Library
import intrasom

# Results Clustering and Plotting Modules
from intrasom.visualization import PlotFactory
from intrasom.clustering import ClusterFactory

# Other imports
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt

# Load DataBase
***
To illustrate the library’s functionalities, we employ a binary database composed of the presence and absence of descriptive features of animals. This database is a modified version of the one originally presented in Kohonen’s book “Self-Organizing Maps” (2012), including new animal samples. The database is shown below and has 18 animal samples with 13 binary descriptive features related to the animals’ size, covering, and physical functions. This way, we start with a 13-dimensional space (related to the features) and 18 samples (related to the animals) as the basis for the self-organization presented here.

In [None]:
# Load dataframe
data = pd.read_excel(in_file_path, index_col=0)

This database intentionally replaces three variables of the sample "Dove" with -9999 to demonstrate the utilization of the mask parameter. This parameter is associated with handling missing values during training.

In [None]:
data
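
The effect of a mask value can be illustrated outside the library. The sketch below is only an illustration under stated assumptions: the toy table and the name `MASK_VALUE` are hypothetical, and IntraSOM handles the mask internally through its `mask` parameter rather than through this code. It shows how a sentinel such as -9999 can be turned into explicit missing values with pandas.

```python
import pandas as pd

# Hypothetical toy table; the real data comes from Animais_missing.xlsx.
toy = pd.DataFrame(
    {"Small": [1, 0, -9999], "Feathers": [-9999, 0, 1]},
    index=["Dove", "Horse", "Hen"],
)

MASK_VALUE = -9999  # sentinel that marks a missing entry in this example

# Cells equal to the sentinel become NaN; all other cells are kept.
masked = toy.mask(toy == MASK_VALUE)

# Count the missing cells per sample to see which rows are affected.
missing_per_sample = masked.isna().sum(axis=1)
print(missing_per_sample[missing_per_sample > 0])
```

Converting the sentinel to NaN makes the missing cells visible to any NaN-aware routine, which is the role the mask plays during training.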

# Training Object Creation
***

<p style="text-align: justify; text-indent: 1.5cm;">
The SOM object is created using the <strong>SOMFactory</strong> class, specifically the <em>build</em> class method within the intrasom main module. The main parameter required for this method is the training data. The training data can be provided either as a pandas dataframe with features as columns and samples as rows, or as a numpy ndarray. If desired, individual names for features and samples can be provided using the <em>component_names</em> and <em>sample_names</em> parameters. If these names are not provided, default names are automatically assigned in the format "sample_#" and "variable_#".
</p>

The input parameters are:
- **data**: input data, represented as an n x m matrix, where n is the number of samples or instances, and m is the number of features. The data can be provided in either pandas dataframe or ndarray format (if ndarray is chosen, the `component_names` and `sample_names` parameters can be used to supply the labels).

- **mapsize**: tuple/list defining the size of the SOM map in the format (columns,rows). If an integer is provided, it is treated as the total number of neurons; for example, a map of 144 neurons automatically creates a SOM map of size (12,12). Because of how periodicity is built in hexagonal grids, SOM maps cannot have an odd number of rows. Therefore, when a map size with an odd number of rows is chosen, the algorithm automatically uses the closest even integer below the selected value. If no map size is given, the size returned by the heuristic function `expected_mapsize()` is used.

- **mask**: Mask for null values. Example: -9999.

- **mapshape**: shape of the SOM map training topology. Example: "toroid".

- **lattice**: map lattice type. Example: "hexa"

- **normalization**: type of data normalization. Example: "var" or None

- **initialization**: method used for SOM initialization. Options: "pca" (only for complete datasets without NaN values; does not work with missing data) or "random".

- **neighborhood**: type of neighborhood calculation. Example: "gaussian"

- **training**: training mode. Example: "batch"

- **name**: name used to identify the SOM object or project. The chosen name will be used to name the saved files at the end of training and in other functions of the library.

- **component_names**: list of labels for the variables used in training. If not provided, a list will be automatically created in the format: [Variable 1, Variable 2, ...].

- **unit_names**: list of labels associated with the units of the training variables. If not provided, a unit list will be automatically created in the style: [Unit <variable1>, Unit <variable2>, ...].
    
- **sample_names**: list with the names of the samples. If not provided, a list will be automatically created in the format: [Sample 1, Sample 2, ...].
    
- **missing**: boolean value that should be set to True if the database has missing values (NaN). During rough training, the search for the BMUs (Best Matching Units) uses a distance function that handles missing data, and the codebook update is done by filling the missing data with 0. In the fine-tuning step, this process is repeated if the parameter "previous_epoch" is set to False; if "previous_epoch" is set to True, the empty values are instead replaced with the values calculated for those cells in the previous training epoch. To allow vectors with missing data to move freely across the Kohonen map, a random regularization factor is generated for this filling. This factor decays during training following the decay of the search radius for BMU updates, and it is shown along with the quantization error and the search radius in the training progress bar.
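
The BMU search with missing data mentioned for the **missing** parameter can be sketched as follows. This is a minimal illustration, assuming a Euclidean distance computed only over the observed components; the function name `find_bmu_with_missing` is hypothetical and the actual distance function used by IntraSOM may differ.

```python
import numpy as np

def find_bmu_with_missing(sample, codebook):
    """Return the index of the closest codebook row, ignoring NaN components.

    Assumption: plain Euclidean distance restricted to the observed features.
    """
    observed = ~np.isnan(sample)                      # features that are present
    diff = codebook[:, observed] - sample[observed]   # compare only those features
    distances = np.sqrt((diff ** 2).sum(axis=1))
    return int(np.argmin(distances)), distances

# Toy codebook with three neurons and three features.
codebook = np.array([[0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [1.0, 0.0, 5.0]])
sample = np.array([1.0, 0.1, np.nan])  # third feature is missing

bmu, distances = find_bmu_with_missing(sample, codebook)
print(bmu, distances)
```

Because the third feature is ignored, the third neuron wins even though its third weight is far from the other neurons; the missing component does not penalize any candidate.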


In [None]:
mapsize = (24,14)
som_test = intrasom.SOMFactory.build(data,
                                     mask=-9999,
                                     mapsize=mapsize,
                                     mapshape='toroid',
                                     lattice='hexa',
                                     normalization='var',
                                     initialization='random',
                                     neighborhood='gaussian',
                                     training='batch',
                                     name='Example',
                                     component_names=None,
                                     unit_names = None,
                                     sample_names=None,
                                     missing=True,
                                     save_nan_hist = True,
                                     pred_size=0)

<div style="text-align: justify">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;It is worth mentioning that the same object could have been created with only the parameters needed for training, assuming default values for all others. The placeholder value that marks missing data must be indicated with the argument *mask=-9999*, and training with missing data must be enabled with the argument *missing=True*.
</div>

# Training
***

<p style="text-align: justify; text-indent: 1.5cm;">
Once the SOM map is initialized, we proceed to the training phase. At the end of training, each original sample is represented by a neuron, which is then referred to as its Best Matching Unit (BMU).
</p>

<p style="text-align: justify; text-indent: 1.5cm;">
The parameters that can be used are:
</p>

- **n_job**: number of jobs to use for parallelizing the training. This number cannot exceed the number of CPU processing cores.
- **shared_memory**: flag to activate shared memory, which helps avoid overloading RAM.
- **train_rough_len**: number of epochs during rough training.
- **train_rough_radiusin**: initial radius for BMU search during rough training.
- **train_rough_radiusfin**: final radius for BMU search during rough training.
- **train_finetune_len**: number of epochs during fine-tuning train.
- **train_finetune_radiusin**: initial radius for BMU search during fine-tuning.
- **train_finetune_radiusfin**: final radius for BMU search during fine-tuning.
- **train_len_factor**: factor to multiply the number of epochs for both rough and fine-tuning training together.
- **maxtrainlen**: maximum desired number of epochs. Default: np.Inf (infinite).
- **previous_epoch**: filling empty values with the ones found in the last epoch of training during fine-tuning.
</p>

<p style="text-align: justify; text-indent: 1.5cm;">
The initial search radius should be compatible with the size of the training map. A final radius of 1 is suggested, which restricts the update to the neighborhood within a Manhattan distance of 1 from the activated BMU.
</p>

**Obs**: During training, a progress bar and descriptors are displayed, including the current training epoch, the training radius associated with each epoch, the quantization error for each epoch, and the regularization factor applied to the filling of empty data.
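
To make the role of the initial and final radius more concrete, the sketch below shows one possible radius schedule together with a Gaussian neighborhood weight. It is an illustration only: the linear interpolation between the two radii and the helper names `radius_schedule` and `gaussian_neighborhood` are assumptions, not the schedule implemented by IntraSOM.

```python
import numpy as np

def radius_schedule(radius_in, radius_fin, n_epochs):
    """Neighborhood radius per epoch, assuming a linear decay."""
    return np.linspace(radius_in, radius_fin, n_epochs)

def gaussian_neighborhood(grid_distance, radius):
    """Gaussian weight of a neuron located `grid_distance` away from the BMU."""
    return np.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))

# Example: eight epochs shrinking the radius from 8 down to 1.
radii = radius_schedule(radius_in=8.0, radius_fin=1.0, n_epochs=8)

# Influence of the BMU on neurons at grid distances 0, 1, 2 and 4
# during the last epoch (radius 1).
weights_last_epoch = gaussian_neighborhood(np.array([0.0, 1.0, 2.0, 4.0]), radii[-1])
print(radii)
print(weights_last_epoch)
```

With a final radius of 1, only the BMU and its immediate neighbors receive a substantial update, which is why a small final radius suits the fine-tuning stage.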

In [None]:
som_test.train(train_len_factor=2,
               previous_epoch = True)

## Load Trained Map

In [None]:
import json
import pandas as pd
import numpy as np

# IntraSOM Library
import intrasom

# Results Plotting Module
from intrasom.visualization import PlotFactory

<p style="text-align: justify; text-indent: 1.5cm;">
To load the training results, access to three files is required:
<ol>
<li>Input data for training.</li>
<li>The trained neuron map saved after training in the Results folder, with the name "project_name_neurons.(xlsx/csv/parquet)".</li>
<li>The parameters used for training, saved in JSON format in the Results folder with the name "params_project_name.json".</li>
</ol>
</p>

In [None]:
data = pd.read_excel(in_file_path, index_col=0)
bmus = pd.read_parquet("Results/Example_neurons.parquet")
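# Open the parameters file in a context manager so it is closed after reading
with open("Results/params_Example.json", encoding='utf-8') as params_file:
    params = json.load(params_file)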

<p style="text-align: justify; text-indent: 1.5cm;">
To load the training, each of the instantiated files is passed as a parameter to the <span style="font-style: italic;">SOMFactory.load_som</span> class function, including the parameters: data, trained_neurons, and params, respectively.
</p>

In [None]:
som_test = intrasom.SOMFactory.load_som(data = data,
                                       trained_neurons = bmus,
                                       params = params)

## Training Report

<p style="text-align: justify; text-indent: 1.5cm;">
At the end of the training process, a 'Training Report' is automatically generated, containing descriptive numerical information about the input data, initialization, training, and training quality, following the format below:<br>
</p>
<center><img src="https://github.com/InTRA-USP/IntraSOM/blob/main/examples/data/train_report.jpg?raw=1"/></center>

# Main Module
***

<p style="text-align: justify; text-indent: 1.5cm;">
The training results are automatically saved in a directory called 'Results', created in the same location as the Jupyter Notebook used for training. Additionally, these results can be accessed through specific methods, as shown below.
</p>

## Results Dataframe

<p style="text-align: justify; text-indent: 1.5cm;">
A dataframe named <span style="font-style: italic;">project_name_results</span> is created in the 'Results' directory. This dataframe can be accessed through the <span style="font-style: italic;">.results_dataframe</span> attribute of the training object. The dataframe consists of the following columns:
</p>

* **BMU**: shows the associated BMU for each input vector.
* **q-error**: quantization error for each sample with respect to its BMU.
* **Ret_x, Ret_y, Cub_x, Cub_y, Cub_z**: two columns for rectangular or Cartesian coordinates (Ret_x, Ret_y) and three columns for cubic coordinates (Cub_x, Cub_y, Cub_z) representing the location of these BMUs in the training map.
* **B_variable**: columns containing the trained weights of each input variable with respect to the BMU, one column per variable.

In [None]:
som_test.results_dataframe

## Neurons Dataframe

<p style="text-align: justify; text-indent: 1.5cm;">
Another dataframe contains the trained weights of every neuron in the trained map, in a format similar to the results dataframe. It can be accessed through the <span style="font-style: italic;">.neurons_dataframe</span> attribute.
</p>

In [None]:
som_test.neurons_dataframe

## Representative Samples

<p style="text-align: justify; text-indent: 1.5cm;">
    For the evaluation of sample representativeness, it is possible to observe the distribution of samples within their respective BMU considering the distance relationships using the function <span style="font-style: italic;">.rep_sample</span>. As the BMU represents the best-matched vector for one or more samples, sample representativeness is measured by their respective distance to the BMU.
</p>
<p style="text-align: justify; text-indent: 1.5cm;">
    The return of this function is a dictionary where the keys are the BMUs, and the values are the samples associated with each of them, following the order of distance: the first sample, most representative of the set, is the one closest to the BMU, while the least representative sample is the last one, represented by the farthest sample.
</p>
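
The structure of the returned dictionary can be mimicked with pandas. This is a minimal sketch under stated assumptions: the `results` table and its values are hypothetical, and the quantization error is used as the distance to the BMU. It is not the implementation behind <span style="font-style: italic;">.rep_sample</span>.

```python
import pandas as pd

# Hypothetical results table: one row per sample with its BMU and q-error.
results = pd.DataFrame(
    {"BMU": [12, 7, 12, 7, 3], "q-error": [0.30, 0.10, 0.05, 0.40, 0.20]},
    index=["Dove", "Fox", "Hen", "Wolf", "Cow"],
)

# Sort by distance first; groupby keeps that order inside each group,
# so every list runs from the closest sample to the farthest one.
ordered = results.sort_values("q-error")
rep_samples = {int(bmu): list(group.index) for bmu, group in ordered.groupby("BMU")}
print(rep_samples)
```

The first name in each list is the sample closest to that BMU, which matches the ordering described above.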

In [None]:
rep_dic = som_test.rep_sample(save=True)
# Ordering BMUs in ascending order
sorted_dict = dict(sorted(rep_dic.items()))
sorted_dict

## Missing Data Imputation

<p style="text-align: justify; text-indent: 1.5cm;">
This functionality deals with replacing the missing value of a given sample with the vector projection of its corresponding BMU on the axis of the missing variable.
</p>
<p style="text-align: justify; text-indent: 1.5cm;">
Therefore, if the goal of the training is to impute missing data, this functionality can be achieved using the <span style="font-style: italic;">.imput_missing()</span> method. The default number of decimal places used for imputation is 6.
</p>
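
The idea behind the imputation can be sketched with NumPy. The helper `impute_from_bmu` below is hypothetical and simplified: it copies the BMU weight of each missing variable into the sample and rounds to 6 decimal places, and it ignores any normalization that the library applies to the data.

```python
import numpy as np

def impute_from_bmu(samples, codebook, bmu_index, decimals=6):
    """Fill the NaN cells of each sample with the matching weight of its BMU."""
    filled = samples.copy()
    missing = np.isnan(filled)            # True where a value must be imputed
    bmu_weights = codebook[bmu_index]     # one codebook row per sample
    filled[missing] = np.round(bmu_weights[missing], decimals)
    return filled

samples = np.array([[1.0, np.nan],
                    [np.nan, 0.0]])
codebook = np.array([[0.9, 0.1234567],
                     [0.25, 0.05]])
bmu_index = np.array([0, 1])              # BMU of each sample (hypothetical)

filled = impute_from_bmu(samples, codebook, bmu_index)
print(filled)
```

Observed values are left untouched; only the cells that were NaN receive the corresponding BMU weight.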

In [None]:
som_test.imput_missing()

# Visualization Module
***

<p style="text-align: justify; text-indent: 1.5cm;">
    The visualization of training results, such as the U-Matrix and component maps, is done through the <span style="font-style: italic;">visualization</span> module by creating a <span style="font-style: italic;">PlotFactory()</span> object that takes the SOM object as an argument.
</p>

In [None]:
plot = PlotFactory(som_test)

## U-Matrix

<p style="text-align: justify; text-indent: 1.5cm;">
    With the created plotting object, the Unified Distance Matrix (U-Matrix) can be generated using the <span style="font-style: italic;">.plot_umatrix()</span> method. The arguments for styling the plot are:
</p>

* **figsize**: size of the U-Matrix plot window. Default: (10,10)
* **hits**: boolean value to indicate whether to plot the hits of input vectors on the BMUs (proportional to the number of vectors per BMU)
* **title**: title of the created figure. Default: "U-Matrix"
* **title_size**: size of the plotted title. Default: 40
* **title_pad**:  spacing between the title and the top of the matrix. Default: 25
* **legend_title**: title of the legend color bar. Default: "Distance"
* **legend_pad**:  spacing between the legend title and the U-Matrix. Default: 0
* **legend_title_size**: size of the legend title. Default: 25.
* **legend_ticks_size**: size of the numbers on the legend color bar. Default: 20.
* **save**: boolean value to define whether to save the generated image. The image will be saved in the directory (Plots/U-Matrix). Default: True.
* **label_plot**: boolean value to add labels to the hexagons of the U-Matrix. Default: False
* **label_plot_dic**: dictionary that maps specific labels for each hexagon. Default: None
* **label_title_xy**: coordinates (x, y) to position the title of the labels. Default: (-0.02, 1.1)
* **watermark_neurons**:  boolean value to add a watermark with the BMUs on the image. Default: False
* **file_name**: the name to be given to the saved file. If no name is provided, the project name will be used.
* **file_path**:  the system path where the image will be saved if the default path is not used.
* **log**: plot the U-Matrix in logarithmic scale for better contrast in visualizing dissimilarity boundaries in the presence of outliers.
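
For readers unfamiliar with the quantity being plotted, the sketch below computes a simplified U-Matrix. It is an illustration under several assumptions: a rectangular grid with four wrap-around neighbors is used instead of the hexagonal lattice, the function name `umatrix_toroidal` is hypothetical, and `log1p` is only one possible way of applying a logarithmic scale. It is not the computation performed by <span style="font-style: italic;">.plot_umatrix()</span>.

```python
import numpy as np

def umatrix_toroidal(codebook_grid, log=False):
    """Mean distance from each neuron to its 4 wrap-around neighbors.

    codebook_grid has shape (rows, cols, n_features).
    """
    total = np.zeros(codebook_grid.shape[:2])
    for axis in (0, 1):              # vertical and horizontal neighbors
        for shift in (-1, 1):
            # np.roll wraps around the edges, which mimics a toroidal map.
            neighbor = np.roll(codebook_grid, shift, axis=axis)
            total += np.linalg.norm(codebook_grid - neighbor, axis=2)
    u = total / 4.0
    return np.log1p(u) if log else u

# Toy map: all neurons identical except one, which forms a "border".
grid = np.zeros((4, 4, 2))
grid[1, 1] = [3.0, 4.0]

u = umatrix_toroidal(grid)
print(u)
```

High values mark neurons that are far from their neighbors in feature space, which is what appears as a dissimilarity boundary in the plot.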

### Traditional U-Matrix
Plot of the traditional Unified Distance Matrix.

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = False,
                  title = "U-Matrix",
                  title_size = 20,
                  title_pad = 20,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = True,
                  file_name = "umatrix",
                  file_path = '',
                  watermark_neurons=False)

### U-Matrix with BMU
<p style="text-align: justify; text-indent: 1.5cm;">
Plot of the U-Matrix with the BMU hits (white hexagons), whose size is proportional to the number of input samples grouped in that BMU. To enable this, set the hits parameter to True.
</p>
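
The hit count behind the hexagon sizes can be sketched with NumPy. The arrays below are hypothetical, and the scaling relative to the most-hit neuron is an assumption made for illustration, not necessarily the scaling used by the plotting function.

```python
import numpy as np

# Hypothetical BMU index (0-based) of six input samples on a map of 6 neurons.
bmu_of_sample = np.array([3, 0, 3, 5, 3, 0])
n_neurons = 6

# Number of samples grouped in each neuron.
hits = np.bincount(bmu_of_sample, minlength=n_neurons)

# Marker size relative to the neuron with the most hits.
relative_size = hits / hits.max()
print(hits, relative_size)
```

Neurons with zero hits get no marker, while the neuron holding most samples gets the largest hexagon.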

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = True,
                  title = "U-Matrix - With Hits",
                  title_size = 20,
                  title_pad = 20,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = True,
                  file_name = "umatrix_hits",
                  file_path = False,
                  watermark_neurons=False)

### U-Matrix with Labeled Representative Samples
<p style="text-align: justify; text-indent: 1.5cm;">
It is possible to plot the labels of the samples in the U-Matrix using the parameter <span style="font-style: italic;">samples_label</span> set to True. In the parameter <span style="font-style: italic;">samples_label_index</span>, you can specify the indices of the samples you want to plot as a list or array. Additionally, the parameter <span style="font-style: italic;">samples_label_fontsize</span> allows you to specify the font size for the plot.
</p>

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = True,
                  title = "U-Matrix - Labeled Representative Samples",
                  title_size = 20,
                  title_pad = 20,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = False,
                  file_name = "umatrix_sample_labels",
                  file_path = False,
                  watermark_neurons=False,
                  samples_label = True,
                  samples_label_index = range(18),
                  samples_label_fontsize = 8,
                  save_labels_rep = True)

### U-Matrix with BMU Activated by Boolean Variable/Feature
<p style="text-align: justify; text-indent: 1.5cm;">
If one of the training variables is of boolean type (True or False) or represented by a binary value, it is possible to identify that variable in the U-Matrix. To do so, set the parameter label_plot to True and pass the name of the desired variable in the parameter label_plot_name. BMUs with a label of True (1) are displayed in white, while BMUs with a label of False (0) are displayed in black. If more than one sample is grouped in a single BMU, the color is based on the mode of that group. As an example, we show the binary variable related to the size of animals, specifically "Small". In the figure, small animals appear in white hexagons, while animals that are not small appear in black hexagons. The size of the hexagons remains proportional to the number of samples.
</p>
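
The mode-based coloring can be sketched with pandas. The table below is hypothetical and the code only illustrates the rule described above; it is not the code used by the plotting function.

```python
import pandas as pd

# Hypothetical assignment of samples to BMUs with the binary feature "Small".
table = pd.DataFrame(
    {"BMU": [5, 5, 5, 9, 9], "Small": [1, 1, 0, 0, 0]},
    index=["Dove", "Hen", "Goose", "Horse", "Cow"],
)

# Mode of the feature inside each BMU. Series.mode() returns sorted values,
# so a tie resolves to the smallest value in this sketch.
label_per_bmu = table.groupby("BMU")["Small"].agg(lambda s: s.mode().iloc[0])

# White for True (1), black for False (0).
colors = label_per_bmu.map({1: "white", 0: "black"})
print(colors)
```

BMU 5 groups two small animals and one that is not small, so it is drawn in white; BMU 9 groups only animals that are not small and is drawn in black.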

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = True,
                  title = "U-Matrix with boolean feature activated BMU - Categorical Feature: Small",
                  title_size = 16,
                  title_pad = 20,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = False,
                  file_name = "umatrix_var_labels",
                  file_path = False,
                  watermark_neurons=False,
                  label_plot = True,
                  label_plot_name = "Small")

### Neuron Template
<p style="text-align: justify; text-indent: 1.5cm;">
It is possible to generate a template that identifies the position of each neuron and, consequently, the original samples associated with the neurons that are BMUs (Best Matching Units). The neuron template is activated by setting the parameter <span style="font-style: italic;">watermark_neurons</span> to True. The font size of the neuron numbers can be customized with the parameter <span style="font-style: italic;">neurons_fontsize</span>. To visualize only the neuron template, set <span style="font-style: italic;">watermark_neurons_alfa</span> to 1 (no transparency).
</p>

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = False,
                  title = "Neuron Template",
                  title_size = 20,
                  title_pad = 25,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = True,
                  file_name = "umatrix_watermark_template",
                  file_path = False,
                  watermark_neurons=True,
                  watermark_neurons_alfa = 1,
                  neurons_fontsize = 7)

### U-Matrix with Neuron Template
<p style="text-align: justify; text-indent: 1.5cm;">
If the parameter <span style="font-style: italic;">watermark_neurons_alfa</span> is set to a value between 0 and 1, the neuron template will appear as a watermark overlay on the U-Matrix.
</p>

In [None]:
plot.plot_umatrix(figsize = (13,2.5),
                  hits = True,
                  title = "U-Matrix with Neuron Template",
                  title_size = 20,
                  title_pad = 25,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = True,
                  file_name = "umatrix_watermark_overlay",
                  file_path = False,
                  watermark_neurons=True,
                  neurons_fontsize = 7)

### Projected U-Matrix on the Torus
<p style="text-align: justify; text-indent: 1.5cm;">
For visualization purposes, the <span style="font-style: italic;">.plot_torus()</span> plotting function creates a three-dimensional projection of the trained map on the torus.<br>
Parameters: <br>
</p>

* **inner_out_prop** (float): proportion of the inner radius to the outer radius of the torus (default: 0.4).
* **red_factor** (int): reduction factor for the image matrix (default: 4). A smaller value provides better resolution for the projection but requires more processing power.
* **hits** (bool): indicates whether the hits (BMUs with associated samples) should be highlighted in the image matrix (default: False).
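
The geometry of such a projection can be sketched with the standard torus parametrization. This is only an illustration under stated assumptions: the helper `torus_points` is hypothetical, and `inner_out_prop` is interpreted here as the ratio between the tube radius and the radius of the main ring, which may differ from the convention used by <span style="font-style: italic;">.plot_torus()</span>.

```python
import numpy as np

def torus_points(n_rows, n_cols, inner_out_prop=0.4, outer_radius=1.0):
    """3D coordinates of an (n_rows x n_cols) grid wrapped around a torus."""
    tube_radius = inner_out_prop * outer_radius   # assumed meaning of the ratio
    u = np.linspace(0, 2 * np.pi, n_cols, endpoint=False)  # around the main ring
    v = np.linspace(0, 2 * np.pi, n_rows, endpoint=False)  # around the tube
    u, v = np.meshgrid(u, v)
    x = (outer_radius + tube_radius * np.cos(v)) * np.cos(u)
    y = (outer_radius + tube_radius * np.cos(v)) * np.sin(u)
    z = tube_radius * np.sin(v)
    return x, y, z

x, y, z = torus_points(n_rows=4, n_cols=6, inner_out_prop=0.25)
print(x.shape)
```

Because both angles exclude the endpoint, the last row and column sit next to the first ones on the surface, which reflects the periodic borders of a toroidal map.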

In [None]:
plot.plot_torus(hits=True, inner_out_prop = 0.25, red_factor = 2)

# Projection onto Trained Map Module
***

<p style="text-align: justify; text-indent: 1.5cm;">
    To project new data onto a trained map, simply load the respective data and use the function <span style="font-style: italic;">project_nan_data()</span>.
</p>


## Load Projection Database
<p style="text-align: justify; text-indent: 1.5cm;">
    A database with three new animals (Baboon, Hummingbird, and Jaguar) was created to demonstrate data projection onto a trained map.
</p>

In [None]:
data_proj = pd.read_excel("data/Animais_proj.xlsx", index_col=0)

In [None]:
data_proj

## Projected Data

In [None]:
proj_data_result = som_test.project_nan_data(data_proj=data_proj)
proj_data_result

## Representative Samples
<p style="text-align: justify; text-indent: 1.5cm;">
    The dictionary of representative samples can be updated with the projected samples by using the <span style="font-style: italic;">project</span> parameter. This parameter should receive the result of the projected data.
</p>

In [None]:
rep_dic = som_test.rep_sample(save=True, project=proj_data_result)
# Ordering BMUs in ascending order
sorted_dict = dict(sorted(rep_dic.items()))
sorted_dict

## Comparison with Input Data
<p style="text-align: justify; text-indent: 1.5cm;">
    The project_samples_label parameter in the <span style="font-style: italic;">.plot_umatrix()</span> function can be used with the result of the projected data to observe the distribution of the projected samples on the U-Matrix.
</p>

In [None]:
plot.plot_umatrix(figsize = (12,2.5),
                  hits = True,
                  title = "U-Matrix - Labeled Representative Samples",
                  title_size = 20,
                  title_pad = 20,
                  legend_title = "Distance",
                  legend_title_size = 12,
                  legend_ticks_size = 7,
                  label_title_xy = (0,0.5),
                  save = True,
                  file_name = "umatrix_projected_data",
                  file_path = False,
                  watermark_neurons=False,
                  project_samples_label = proj_data_result,
                  samples_label = True,
                  samples_label_index = range(19),
                  samples_label_fontsize = 8,
                  save_labels_rep = True)

## Component Plots
<p style="text-align: justify; text-indent: 1.5cm;">
Component maps can be generated using the methods <span style="font-style: italic;">.component_plot()</span> for individual maps, <span style="font-style: italic;">.multiple_component_plots()</span> for maps of all training variables, and <span style="font-style: italic;">.component_plot_collage()</span> for plot grids.
</p>

### Individual Plots
<p style="text-align: justify; text-indent: 1.5cm;">
The optional arguments are the same as those for plotting the U-Matrix. The input argument for choosing the variable to plot is <span style="font-style: italic;">component_name</span>, which can be either an integer corresponding to the position index of the variable of interest in the list of input variables or the actual name of the variable.
</p>

In [None]:
plot.component_plot(figsize = (12,2.5),
                    component_name = 2,
                    title_size = 20,
                    legend_title = "Presence",
                    legend_pad = 5,
                    legend_title_size = 12,
                    legend_ticks_size = 10,
                    label_title_xy = (0,0.7))

### Individual Plot (Every Variable/Feature)
<p style="text-align: justify; text-indent: 1.5cm;">
It is possible to directly generate individual plots for all variables or selected variables in the automatically generated "Plots/Component_plots" folder within the working directory. To select the variables for plotting, simply pass a list or array with the indices of the variables in the  <span style="font-style: italic;">"which"</span> parameter of this function (the default value is 'all' for plotting all variables).
</p>

In [None]:
plot.multiple_component_plots(figsize = (12,2.5),
                              title_size = 20,
                              legend_title = "Presence",
                              legend_pad = 5,
                              legend_title_size = 12,
                              legend_ticks_size = 10,
                              label_title_xy = (0,0.7),
                              save=True)

### Component Plot Collage
<p style="text-align: justify; text-indent: 1.5cm;">
It is possible to generate collage plots of component maps directly within the automatically generated "Plots/Component_plots" folder in the working directory. To select the variables for plotting, simply pass a list or array with the indices of the variables in the <span style="font-style: italic;">"which"</span> parameter of this function (the default value is 'all' for plotting all variables). To select the number of rows and columns per page, use the <span style="font-style: italic;">"grid"</span> parameter, and to define the page size in pixels, use the <span style="font-style: italic;">"page_size"</span> parameter (the default is A4 - 2480x3508 pixels).
</p>

In [None]:
plot = PlotFactory(som_test)
plot.component_plot_collage(figsize = (12,2.5),
                            grid = (5,2),
                            title_size = 20,
                            legend_title = "Presence",
                            legend_pad = 5,
                            legend_title_size = 12,
                            legend_ticks_size = 10,
                            label_title_xy = (0,0.7))

<div style="text-align: center; padding: 20px;">
  <img src="https://github.com/InTRA-USP/IntraSOM/blob/main/examples/Plotagens/Component_plots/Collage/pages/Component_plots_collage_page1.jpg?raw=1" style="max-width: 900px;">
</div>

# Clustering Module
***

<p style="text-align: justify; text-indent: 1.5cm;">
    The IntraSOM package features a module for clustering trained neurons, as used in various SOM applications. In this release version, the module utilizes the K-means algorithm for neuron clustering and offers various visualization and exploratory analysis functions for the results.
</p>

<p style="text-align: justify; text-indent: 1.5cm;">
    To cluster a trained map, simply instantiate a clustering object using the <strong>ClusterFactory</strong> class and use the SOM object as a parameter. This clustering object can then be used to call the <span style="font-style: italic;">.kmeans()</span> function, where the parameter <span style="font-style: italic;">k</span> indicates the number of clusters to group the trained neurons. The result of this process is a two-dimensional matrix with the cluster value for each neuron in the map.
</p>
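
The kind of result returned by the clustering step can be illustrated with a small NumPy version of k-means. This is a sketch under stated assumptions: it is a plain Lloyd's algorithm with random initialization, the function name `kmeans_codebook` is hypothetical, and the row-by-row reshape to the map grid is an assumed neuron ordering. It is not the implementation behind <span style="font-style: italic;">.kmeans()</span>.

```python
import numpy as np

def kmeans_codebook(codebook, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on neuron weights; returns one label per neuron."""
    rng = np.random.default_rng(seed)
    # Start from k distinct neurons chosen at random.
    centers = codebook[rng.choice(len(codebook), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every neuron to every center, then nearest center.
        dists = np.linalg.norm(codebook[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = codebook[labels == j]
            if len(members) > 0:      # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                     # assignments are stable
        centers = new_centers
    return labels

# Toy codebook of a 2 x 3 map: first row near the origin, second row far away.
rows, cols = 2, 3
codebook = np.array([[0, 0], [0, 1], [1, 0],
                     [10, 10], [10, 11], [11, 10]], dtype=float)

labels = kmeans_codebook(codebook, k=2)
cluster_grid = labels.reshape(rows, cols)  # one cluster value per neuron position
print(cluster_grid)
```

The two rows of the toy map end up in different clusters, and the reshaped array has the same two-dimensional layout as the map.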

In [None]:
clustering = ClusterFactory(som_test)
clusters = clustering.kmeans(k=10)
clusters

## Clustering Results
<p style="text-align: justify; text-indent: 1.5cm;">
    The clustering results for the trained samples can be accessed with the <span style="font-style: italic;">.results_cluster()</span> method by passing the two-dimensional array created with the <span style="font-style: italic;">.kmeans()</span> function as a parameter. The resulting dataframe is similar to the results dataframe, with an additional column containing the cluster number assigned to each sample.
</p>

In [None]:
clustering.results_cluster(clusters)

## Clustering Visualization

### U-Matrix with Cluster Overlay

<p style="text-align: justify; text-indent: 1.5cm;">
The clusters can be visualized with an individual color per cluster on each neuron of the map by setting the <span style="font-style: italic;">cluster_outline</span> parameter to False.
</p>

<p style="text-align: justify; text-indent: 1.5cm;">
    The color palette used to identify the clusters can be customized using the <span style="font-style: italic;">colormap</span> parameter. By default, it uses "gist_rainbow". However, any other colormap available in the Matplotlib library can be used <a href="https://matplotlib.org/stable/tutorials/colors/colormaps.html">[List of Colormaps]</a>.
</p>

<p style="text-align: justify; text-indent: 1.5cm;">
    The transparency of this cluster overlay can be controlled with the <span style="font-style: italic;">alfa_clust</span> parameter. Other options, such as the neuron template (<span style="font-style: italic;">watermark_neurons</span>) and the display of BMU hits (<span style="font-style: italic;">hits</span>), can also be enabled.
</p>

#### Plot with "gist_rainbow" colormap, 50% transparency and without neuron template

In [None]:
clustering.plot_kmeans(figsize = (12,5),
                       clusters = clusters,
                       title_size = 18,
                       title_pad = 20,
                       umatrix=True,
                       colormap = "gist_rainbow",
                       alfa_clust=0.5,
                       hits=True,
                       legend_text_size =7,
                       cluster_outline=False,
                       save=True,
                       file_name="cluster_gist_50")

#### Plot with "plasma" colormap, 80% transparency, with neuron template and without U-Matrix

In [None]:
clustering.plot_kmeans(figsize = (12,5),
                       clusters = clusters,
                       title_size = 18,
                       title_pad = 20,
                       umatrix=False,
                       colormap = "plasma",
                       alfa_clust=0.8,
                       watermark_neurons = True,
                       neurons_fontsize = 6,
                       hits=True,
                       legend_text_size =7,
                       cluster_outline=False,
                       save=True,
                       file_name="cluster_plasma_80_noumat")

### Neuron Cluster Merge
<p style="text-align: justify; text-indent: 1.5cm;">
Neurons belonging to the same cluster can be merged by setting the <span style="font-style: italic;">cluster_outline</span> parameter to True. To plot the labels of the neurons, set the <span style="font-style: italic;">plot_labels</span> parameter to True.
</p>

In [None]:
clustering.plot_kmeans(figsize = (12,5),
                       clusters = clusters,
                       title_size = 18,
                       title_pad = 20,
                       umatrix=True,
                       colormap = "gist_rainbow",
                       alfa_clust=0.5,
                       hits=True,
                       legend_text_size =7,
                       cluster_outline=True,
                       plot_labels= True,
                       clusterout_maxtext_size=12,
                       save=True,
                       file_name="cluster_merge")

### Selective Visualization
<p style="text-align: justify; text-indent: 1.5cm;">
One or more clusters of interest can be visualized using the <span style="font-style: italic;">clusters_highlight</span> parameter, which accepts a list or array with the indices of the clusters of interest.
</p>

In [None]:
clustering.plot_kmeans(figsize = (12,5),
                       clusters = clusters,
                       title_size = 18,
                       title_pad = 20,
                       umatrix=True,
                       colormap = "gist_rainbow",
                       alfa_clust=0.5,
                       hits=True,
                       legend_text_size =7,
                       cluster_outline=True,
                       plot_labels= True,
                       clusterout_maxtext_size=12,
                       clusters_highlight = [3,5,6],
                       save=True,
                       file_name="cluster_merge_selec")
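
<p style="text-align: justify; text-indent: 1.5cm;">
One way to choose which clusters to highlight is to pick those that cover the most neurons. The sketch below is a hypothetical helper, not part of IntraSOM, and it uses a small nested list in place of the real cluster matrix. It also assumes that the label values stored in the matrix are the same indices that <span style="font-style: italic;">clusters_highlight</span> expects.
</p>

```python
from collections import Counter

def largest_clusters(cluster_matrix, n):
    """Return the labels of the n clusters that contain the most neurons."""
    counts = Counter(int(label) for row in cluster_matrix for label in row)
    # Sort by descending size, then by label, so ties resolve predictably
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return ranked[:n]

# Stand-in for a cluster matrix on a 3 x 4 map
example_clusters = [
    [3, 3, 5, 5],
    [3, 6, 6, 5],
    [6, 6, 6, 1],
]

highlight = largest_clusters(example_clusters, n=2)
print(highlight)  # candidate value for `clusters_highlight`
```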

### Customize Labels

<p style="text-align: justify; text-indent: 1.5cm;">
The labels associated with the clusters can be customized using the <span style="font-style: italic;">custom_labels</span> parameter, which accepts a list with the labels to be associated with each cluster.
</p>

In [None]:
clustering.plot_kmeans(figsize = (12,5),
                       clusters = clusters,
                       title_size = 18,
                       title_pad = 20,
                       umatrix=True,
                       colormap = "gist_rainbow",
                       alfa_clust=0.5,
                       hits=True,
                       legend_text_size =7,
                       cluster_outline=True,
                       plot_labels= True,
                       clusterout_maxtext_size=12,
                       custom_labels = [f"N {i}" for i in range(10)],
                       save=True,
                       file_name="cluster_merge_custom")
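
<p style="text-align: justify; text-indent: 1.5cm;">
When only some clusters have meaningful names, the label list can be built from a partial mapping. The sketch below assumes, as in the example above, that the list holds one label per cluster ordered by cluster index starting at zero. The helper and the names are hypothetical illustrations.
</p>

```python
def build_custom_labels(n_clusters, names, default="Cluster {}"):
    """Build one label per cluster index, falling back to a default pattern."""
    return [names.get(i, default.format(i)) for i in range(n_clusters)]

# Hypothetical names for a few clusters; the rest keep the default pattern
names = {0: "Birds", 3: "Predators"}
custom_labels = build_custom_labels(5, names)
print(custom_labels)
```

<p style="text-align: justify; text-indent: 1.5cm;">
The resulting list could be passed as the <span style="font-style: italic;">custom_labels</span> argument, with <span style="font-style: italic;">n_clusters</span> matching the value of <span style="font-style: italic;">k</span> used for clustering.
</p>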

<center><img src="https://github.com/InTRA-USP/IntraSOM/blob/main/examples/images/Foot_logo_fundo_branco.svg?raw=1"/></center>