Tutorial Clustering
INPUT
STEPS
1. Select your data
2. Select type of clustering
3. Select method
4. Choose distance
5. Fill information job
6. Press Launch job button
OUTPUT
INPUT
Input data
- Input data should be a matrix uploaded with the data type Data matrix expression. See the data types [here](Data Types).
- The file is a plain-text, tab-separated file in the following format:
# some comments
# more comments
#NAMES Cond1 Cond2 Cond3 Cond4 Cond5 Cond6
gen1 -3.06 -2.25 -1.15 -6.64 0.40 1.08
gen2 -1.36 -0.67 -0.17 -0.97 -2.32 -5.06
gen3 -0.17 0.48 1.23 1.52 1.11
gen4 1.61 -0.27 0.71 -0.62 0.14
gen5 2.09 2.12 2.62 1.95 1.04 2.18
gen6 0.20 -3.06 -0.03 0.64 0.84
gen7 -2.00 -0.64 -0.29 0.08 -1.00
gen8 0.93 1.29 -0.23 -0.74 -2.00 -1.25
gen9 0.88 0.31 -0.22 3.25
gen10 0.71 1.03 -0.25 1.03
- Matrix rows correspond to genes and matrix columns correspond to conditions (arrays).
- All data items must be separated by tabs.
- There is no special character for missing values; simply leave those fields empty.
- All lines beginning with "#" are treated as comments. The #NAMES line, which holds the column names, is mandatory.
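The format rules above can be checked locally before uploading. The following is a minimal sketch in Python, not part of Babelomics: the function name `parse_expression_matrix` is hypothetical, missing values are represented as `None`, and padding rows whose trailing empty fields were dropped is an assumption made for convenience.

```python
import io

def parse_expression_matrix(text):
    """Parse a tab-separated expression matrix into (conditions, genes, rows).

    Hypothetical helper for illustration; empty fields become None.
    """
    conditions = None
    genes, rows = [], []
    for line in io.StringIO(text):
        line = line.rstrip("\r\n")  # keep trailing tabs: they mark empty fields
        if not line.strip():
            continue
        if line.startswith("#"):
            # Comment line; only the #NAMES line carries information.
            if line.startswith("#NAMES"):
                conditions = line.split("\t")[1:]
            continue
        if conditions is None:
            raise ValueError("the #NAMES line must appear before the data rows")
        fields = line.split("\t")
        gene, values = fields[0], fields[1:]
        # Assumption: a short row lost trailing empty fields, so pad it.
        values += [""] * (len(conditions) - len(values))
        if len(values) != len(conditions):
            raise ValueError(f"row {gene!r} has too many values")
        rows.append([float(v) if v.strip() else None for v in values])
        genes.append(gene)
    if conditions is None:
        raise ValueError("missing mandatory #NAMES line")
    return conditions, genes, rows

example = (
    "# some comments\n"
    "#NAMES\tCond1\tCond2\tCond3\n"
    "gen1\t-3.06\t-2.25\t-1.15\n"
    "gen2\t-1.36\t\t-0.17\n"  # the empty field between two tabs is a missing value
)
conditions, genes, rows = parse_expression_matrix(example)
print(conditions, genes, rows)
```

Note that a missing value in the middle of a row is only recoverable if both surrounding tabs are present in the file.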
Online example
Here you can load a small dataset from our server and use it to run this example and see how the tool works. Click on the link to load the data: fibroblasts k-means clustering.
STEPS
Select your data
The first step is to select the data you want to analyze.
Select type of clustering
Select how you want to cluster the input data. You can cluster samples, genes, or both.
Select method
Select the method you want to use for the analysis:
- UPGMA
- SOTA
- K-means
See the Clustering section for detailed information about the methods.
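To make the K-means option more concrete, here is a minimal sketch of the standard K-means iteration in plain Python. It is an illustration only and not the implementation used by Babelomics: the function name `kmeans`, the squared Euclidean distance, and the simplistic initialization (the first k points become the initial centroids) are all assumptions of this example.

```python
def kmeans(points, k, iterations=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    # Assumption: deterministic initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Unlike UPGMA, K-means produces a flat partition into k groups rather than a hierarchy, which is why the number of clusters has to be chosen in advance.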
Choose distance
Select the distance you want to use for the analysis.
- Euclidean (normal)
- Euclidean (square)
- Correlation coefficient (Spearman)
- Pearson correlation coefficient
See the Clustering section for details on the algorithms.
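The four distance options listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Babelomics code: the function names are hypothetical, and converting a correlation `r` into a distance as `1 - r` is a common convention that this tutorial does not confirm for Babelomics. The sketch also assumes complete profiles without missing values.

```python
import math

def euclidean_square(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Ordinary Euclidean distance."""
    return math.sqrt(euclidean_square(x, y))

def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def correlation_distance(r):
    """Assumed convention: distance = 1 - correlation."""
    return 1.0 - r

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(euclidean(x, y), euclidean_square(x, y))
print(correlation_distance(pearson(x, y)), correlation_distance(spearman(x, y)))
```

In this example `y` grows monotonically but not linearly with `x`, so the Spearman correlation is perfect while the Pearson correlation is not. Correlation-based distances compare the shape of two profiles, whereas Euclidean distances are also sensitive to their magnitude.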
Fill information job
- Select the output folder
- Choose a job name
- Specify a description for the job if desired.
Press Launch job button
Press the Launch job button and wait until the job has finished. A typical job takes a few minutes, but the time may vary depending on the size of the data. Check the state of your job by clicking the jobs button at the top right of the panel menu. A box listing all your jobs will appear at the right of the browser window. When the analysis is finished, the job is labeled "Ready". Click on it to be redirected to the results page.

OUTPUT
Input parameters
In this section you will find a reminder of the parameters or settings you have used to run the analysis.
Output files
Clusters in newick format
Here you will find the result files containing gene and sample clusters in [Newick format](http://evolution.genetics.washington.edu/phylip/newicktree.html).
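To give an idea of what a Newick file encodes, the following minimal sketch runs a textbook UPGMA (average linkage) clustering on a small distance matrix and writes the resulting tree as a Newick string. It is an illustration only: the function name `upgma_newick` and the toy distance matrix are assumptions of this example, and the files produced by Babelomics may differ in details such as branch length formatting.

```python
from itertools import combinations

def upgma_newick(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix; return Newick text."""
    # Each cluster stores (Newick fragment, number of leaves, height in the tree).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: dist[frozenset(pair)],
        )
        d = dist[frozenset((a, b))]
        (la, na, ha), (lb, nb, hb) = clusters.pop(a), clusters.pop(b)
        height = d / 2
        label = f"({la}:{height - ha:g},{lb}:{height - hb:g})"
        # Distance to the merged cluster: size-weighted average (UPGMA rule).
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[next_id] = (label, na + nb, height)
        next_id += 1
    ((label, _, _),) = clusters.values()
    return label + ";"

names = ["A", "B", "C"]
matrix = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]
tree = upgma_newick(names, matrix)
print(tree)  # (C:3,(A:1,B:1):2);
```

Nested parentheses describe the hierarchy, the numbers after each colon are branch lengths, and the semicolon ends the tree. Here A and B are joined first, and C is attached to that pair last.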
Cluster images
For each clustering method used, Babelomics provides a graphical representation in PNG format:
- Heatmap with gene and sample dendrograms representing the clusters. Only shown if the input data has up to 1000 genes.
Worked examples and exercises
A. Online example
- Go to the Babelomics page and select the Clustering option from the Expression menu.
- Press the online example and you will see that the parameters and form fields are filled in automatically. This example is set up to perform a clustering analysis on genes (rows) and conditions (columns) using the K-means algorithm with 5 sample clusters and 15 gene clusters. The selected distance is Euclidean (square).
- Press Launch job, and wait for your job to be finished.
- When the process finishes, a new blue job is shown at the right side of the web page. Press it to check your results.
Questions.
These are some questions that you should be able to answer about the previous example:
- Do you think that the clustering was able to differentiate any group of coexpressed genes?
- How many sample clusters are there? And how many gene clusters?
- Launch this online example using different clustering methods and compare the results. What are the differences between the results of the different methods?
B. Exercises
Exercise 1. Random dataset
- Download this random dataset and explore it.
- Perform a clustering analysis.
- What would we obtain for an analysis of data with no structure?
- Do you obtain a result?
- Could you download this image?
- What can you say about this result?
Exercise 2. Zebrafish embryogenesis data
- Download this file and perform a hierarchical clustering analysis of its genes. This example file contains the first 999 of the 3,657 genes that showed significant levels of differential expression in the study by Mathavan et al. (2005).
- Do you see any patterns of gene expression between different developmental stages?
- Could you download this image?
- Could you download the files in Newick format? Are you familiar with this format?
- Are gene clusters of different developmental stages functionally enriched?
Exercise 3. RNA-Seq data from Breast Invasive Carcinoma (BRCA)
1. Open this file and explore its content: tcga_rnaseq.txt. Description:
- RNA-Seq data of BRCA samples taken from The Cancer Genome Atlas (TCGA) data portal (http://cancergenome.nih.gov/).
- Contains 10 normal samples and 20 tumor samples from 2 subtypes (Basal-like and Her2-enriched).
2. Upload your file to Babelomics 5.0.
3. Go to the section Expression > Clustering and try several clustering strategies:
- UPGMA + Euclidean (square)
- UPGMA + Correlation coeff. (Spearman)
- Which distance parameter is better for proper clustering?
4. Repeat the analysis using the same distance parameters and the SOTA method.
- SOTA + Euclidean (square)
- SOTA + Correlation coeff. (Spearman)
- Do the results change based on the method or the distance parameter?
5. Try to cluster your samples with K-means.
- Set the k-value to 6 and use Correlation coeff. (Spearman).
- Repeat the same analysis with a k-value of 3.
- Check the results of K-means.
- Are the results acceptable?
- Is the dendrogram representing any hierarchy between the samples?
6. Try to cluster your samples with K-means again.
- Set the k-value to 2 and use Correlation coeff. (Spearman).
- Can we say that K-means is good to distinguish tumor from normal?
Find the Babelomics suite at http://babelomics.org