Tutorial Clustering

Francisco García edited this page Mar 14, 2016 · 53 revisions



1. Select your data
2. Select type of clustering
3. Select method
4. Choose distance
5. Fill in job information
6. Press the Launch job button




Input data
  • Input data should be a matrix uploaded with the data type Data matrix expression. See data types here.

  • The file is a plain-text, tab-separated file such as the following:

# some comments
# more comments
#NAMES Cond1 Cond2 Cond3 Cond4 Cond5 Cond6
gen1    -3.06   -2.25   -1.15   -6.64   0.40    1.08
gen2    -1.36   -0.67   -0.17   -0.97   -2.32   -5.06
gen3    -0.17   0.48    1.23    1.52    1.11    
gen4        1.61    -0.27   0.71    -0.62   0.14
gen5    2.09    2.12    2.62    1.95    1.04    2.18
gen6    0.20    -3.06   -0.03   0.64    0.84    
gen7    -2.00   -0.64   -0.29   0.08    -1.00   
gen8    0.93    1.29    -0.23   -0.74   -2.00   -1.25
gen9    0.88    0.31    -0.22   3.25        
gen10   0.71    1.03    -0.25       1.03    
  • Matrix rows correspond to genes and matrix columns correspond to conditions (arrays).
  • All data items must be separated by tabs.
  • There is no special character for missing values; simply leave those cells empty.
  • All lines beginning with "#" are treated as comments. The line starting with #NAMES, which lists the condition names, is mandatory.
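The format rules above can be checked locally before uploading a file. The following is a minimal sketch, not Babelomics code: the function name `read_expression_matrix` is hypothetical, and it assumes that the #NAMES line appears before the first data row and that trailing empty cells may have been trimmed from a row.

```python
import io

def read_expression_matrix(handle):
    """Parse the tab-separated matrix format described above.

    Returns (conditions, genes, rows); empty cells become None.
    """
    conditions = None
    genes, rows = [], []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            if line.startswith("#NAMES"):
                # Condition names follow the #NAMES tag.
                conditions = line.split("\t")[1:]
            continue  # every other "#" line is a comment
        if conditions is None:
            # Assumption: the #NAMES line precedes the data rows.
            raise ValueError("the #NAMES line must come before the data rows")
        fields = line.split("\t")
        name, cells = fields[0], fields[1:]
        # Assumption: trailing empty cells may be missing; pad to full width.
        cells += [""] * (len(conditions) - len(cells))
        if len(cells) != len(conditions):
            raise ValueError(f"{name}: expected {len(conditions)} values")
        genes.append(name)
        rows.append([float(c) if c.strip() else None for c in cells])
    if conditions is None:
        raise ValueError("missing mandatory #NAMES line")
    return conditions, genes, rows

sample = (
    "# some comments\n"
    "#NAMES\tCond1\tCond2\tCond3\n"
    "gen1\t-3.06\t-2.25\t-1.15\n"
    "gen2\t\t-0.67\t\n"
)
conditions, genes, rows = read_expression_matrix(io.StringIO(sample))
```

Here `rows[1]` holds `None` in the first and last positions, which is how this sketch represents the empty cells of gen2.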
Online example

Here you can load a small dataset from our server and use it to run this example and see how the tool works. Click on the link to load the data: fibroblasts k-means clustering.


Select your data

The first step is to select the data you want to analyze.

Select type of clustering

Select how you want to cluster the input data. You can cluster samples, genes, or both.

Select method

Select the method you want to use for the analysis:

  • SOTA
  • K-means

See Clustering section for detailed information about methods.
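To give an intuition for K-means, here is a minimal sketch of the textbook Lloyd's algorithm. It is not the Babelomics implementation, which this page does not show. The function names are hypothetical, the distance is fixed to squared Euclidean, and missing values are assumed to have been removed beforehand.

```python
import random

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Textbook Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated groups of made-up points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
```

The number of clusters k must be chosen in advance, which is why the exercises below ask you to compare several k-values.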

Choose distance

Select the distance you want to use for the analysis.

  • Euclidean (normal)
  • Euclidean (square)
  • Correlation coefficient (Spearman)
  • Pearson correlation coefficient

See Clustering section for details on the algorithms.
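The four options can be illustrated with a short sketch. The exact definitions used by Babelomics are not given on this page, so the formulas below are assumptions: in particular, turning a correlation r into a distance as 1 - r is a common convention, not a confirmed detail of the tool. All function names are hypothetical.

```python
import math

def euclidean_square(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(euclidean_square(a, b))

def pearson(a, b):
    """Pearson correlation coefficient (undefined for constant vectors)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def correlation_distance(r):
    # Assumption: distance = 1 - correlation.
    return 1.0 - r

# gen5 and gen8 from the example matrix above (no missing values).
gen5 = [2.09, 2.12, 2.62, 1.95, 1.04, 2.18]
gen8 = [0.93, 1.29, -0.23, -0.74, -2.00, -1.25]
d_euclid = euclidean(gen5, gen8)
d_pearson = correlation_distance(pearson(gen5, gen8))
d_spearman = correlation_distance(spearman(gen5, gen8))
```

Euclidean distances depend on the magnitude of the expression values, whereas correlation-based distances compare only the shape of the profiles.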

Fill in job information
  • Select the output folder
  • Choose a job name
  • Specify a description for the job if desired.
Press the Launch job button

Press the Launch job button and wait until the job finishes. A typical job takes a few minutes, but the time varies with the size of the data. To check the state of your job, click the jobs button at the top right of the panel menu. A box listing all your jobs will appear at the right of the browser window. When the analysis is finished, the job is labeled "Ready". Click on it to go to the results page.


Input parameters

In this section you will find a reminder of the parameters or settings you have used to run the analysis.

Output files

Clusters in newick format

Here you will find the result files containing gene and sample clusters in Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html).
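As a reminder of what a Newick string looks like, the sketch below serializes a small nested tree. The tree and the helper `to_newick` are made up for illustration. The files produced by Babelomics may also carry branch lengths (written as name:length), which this sketch omits.

```python
def to_newick(node):
    """Serialize a nested-tuple tree to Newick (without the final ';')."""
    if isinstance(node, str):
        return node  # a leaf is just its name
    # An internal node is a parenthesized, comma-separated list of children.
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# Hypothetical tree: gen1 and gen2 cluster together, as do gen4 and gen5.
tree = (("gen1", "gen2"), ("gen3", ("gen4", "gen5")))
newick = to_newick(tree) + ";"
print(newick)  # ((gen1,gen2),(gen3,(gen4,gen5)));
```

Each pair of parentheses groups the members of one cluster, and the nesting depth reflects the hierarchy shown in the dendrogram.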

Cluster images

For each clustering method used, Babelomics provides a graphical representation in PNG format:

  • Heatmap with gene and sample dendrograms representing the clusters. Shown only if the input data has at most 1000 genes.

Worked examples and exercises

A. Online example

  • Go to the Babelomics page and select the Clustering option from the Expression menu.
  • Press the online example and the parameters and form fields will be filled in. This example is set up to cluster both genes (rows) and conditions (columns) using the K-means algorithm with 5 sample clusters and 15 gene clusters. The selected distance is Euclidean (square).
  • Press Launch job and wait for your job to finish.
  • When the process finishes, a new blue job is shown at the right side of the web page. Press it to check your results.


These are some questions that you should be able to answer about the previous example:

  • Do you think that the clustering was able to differentiate any group of coexpressed genes?
  • How many sample clusters are there? How many gene clusters?
  • Launch this online example using different clustering methods and compare the results. What are the differences between the results of the different methods?

B. Exercises

Exercise 1. Random dataset

  • Download this random dataset and explore it.
  • Perform a clustering analysis.
  • What would we obtain for an analysis of data with no structure?
  • Do you obtain a result?
  • Could you download this image?
  • What can you say about this result?
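If you want to repeat this exercise with other sizes, a structureless matrix in the input format can be generated locally. This is a sketch under assumptions: it does not reproduce the dataset linked above, the function name `random_matrix_text` is hypothetical, and drawing values from a standard normal distribution is an arbitrary choice.

```python
import random

def random_matrix_text(n_genes, n_conditions, seed=0):
    """Build a tab-separated expression matrix of independent random values."""
    rng = random.Random(seed)
    header = ["#NAMES"] + [f"Cond{j + 1}" for j in range(n_conditions)]
    lines = ["# random data with no structure", "\t".join(header)]
    for i in range(n_genes):
        # Assumption: independent standard normal values, two decimals.
        values = [f"{rng.gauss(0.0, 1.0):.2f}" for _ in range(n_conditions)]
        lines.append("\t".join([f"gen{i + 1}"] + values))
    return "\n".join(lines) + "\n"

text = random_matrix_text(n_genes=5, n_conditions=3)
# To save it for upload: open("random.txt", "w").write(text)
```

Because every value is drawn independently, any clusters reported for such a file reflect the algorithm rather than the data.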

Exercise 2. Zebrafish embryogenesis data

  • Download this file and perform a hierarchical clustering analysis of its genes. This example file contains the first 999 of the 3,657 genes that showed significant differential expression in the study by Mathavan et al. (2005).
  • Do you see any patterns of gene expression between different developmental stages?
  • Could you download this image?
  • Could you download files with newick format? Do you know this format?
  • Are gene clusters of different developmental stages functionally enriched?

Exercise 3. RNA-Seq data from Breast Invasive Carcinoma (BRCA)

1. Open this file and explore its content: tcga_rnaseq.txt. Description:

  • RNA-Seq data of BRCA samples taken from The Cancer Genome Atlas (TCGA) data portal (http://cancergenome.nih.gov/).
  • Contains 10 normal samples and 20 tumor samples of 2 subtypes (Basal-like and Her2-enriched).

2. Upload your file to Babelomics 5.0.

3. Go to the Expression > Clustering section and try several clustering strategies:

  • UPGMA + Euclidean (square)
  • UPGMA + Correlation coeff. (Spearman)
  • Which distance parameter is better for proper clustering?

4. Repeat the analysis using the same distance parameters and the SOTA method.

  • SOTA + Euclidean (square)
  • SOTA + Correlation coeff. (Spearman)
  • Do the results change based on the method or the distance parameter?

5. Try to cluster your samples with K-means.

  • Set the k-value to 6 and use Correlation coeff. (Spearman).
  • Repeat the same analysis with a k-value of 3.
  • Check the results of K-means.
  • Are the results acceptable?
  • Does the dendrogram represent any hierarchy among the samples?

6. Cluster your samples with K-means again.

  • Set the k-value to 2 and use Correlation coeff. (Spearman).
  • Can we say that K-means is good at distinguishing tumor from normal samples?
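One way to answer the last question quantitatively is to cross-tabulate the cluster assignments against the known sample types. The sketch below is not part of Babelomics: the helper names are hypothetical, the labels are made up rather than taken from tcga_rnaseq.txt, and purity is only one of several possible agreement measures.

```python
from collections import Counter

def contingency(cluster_labels, true_labels):
    """Count samples for each (cluster, known type) pair."""
    return Counter(zip(cluster_labels, true_labels))

def purity(cluster_labels, true_labels):
    """Fraction of samples that belong to the majority type of their cluster."""
    table = contingency(cluster_labels, true_labels)
    best = {}
    for (cluster, _), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)

# Made-up example: six samples, one tumor sample falls in the "normal" cluster.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["normal", "normal", "tumor", "tumor", "tumor", "tumor"]
table = contingency(clusters, truth)
score = purity(clusters, truth)  # 5 of 6 samples match their cluster's majority
```

A purity close to 1 means that each cluster is dominated by a single sample type; values near the proportion of the largest class indicate little separation.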
