
Scripts for discovery and genotyping polymorphic Alu element insertions in human genomes






This GitHub repository stores the scripts required to discover and genotype polymorphic Alu element insertions. There are four workflows: REF-plus discovery, REF-minus discovery, merging and filtering, and genotyping. The REF-plus and REF-minus workflows each generate a text file containing 32-mer pairs that can subsequently be used for genotyping with the FastGT package. The scripts are written in Perl and bash.

Downloading to local server:

git clone  

REF-plus discovery scripts

The key steps in the REF-plus discovery pipeline are:

  • Search for all potential full-length Alu elements. The script searches the reference genome for 10 bp Alu element signatures (allowing 1 mismatch) and for Target Site Duplication sequences within 270-350 bp.
  • BLAST search that checks whether detected candidate elements are homologous to known Alu elements.
  • Search against chimpanzee genome using 25-mer lists. This step removes older elements that are likely to be fixed in both species.
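
The first step above can be illustrated with a minimal Python sketch. It is not the repository's Perl implementation. The function names, the 5 bp TSD length, and the choice to measure the 270-350 bp window from the start of the signature are all assumptions made for illustration; the signature string is passed in by the caller.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_signature_hits(genome, signature, max_mismatches=1):
    """Yield start positions where `signature` occurs with at most
    `max_mismatches` substitutions (no indels are considered)."""
    k = len(signature)
    for i in range(len(genome) - k + 1):
        if hamming(genome[i:i + k], signature) <= max_mismatches:
            yield i


def find_tsd(genome, hit, tsd_len=5, min_dist=270, max_dist=350):
    """Look for a second copy of the bases directly upstream of `hit`.

    Assumption: the candidate TSD is the `tsd_len` bases immediately
    before the signature, and its copy must start between `min_dist`
    and `max_dist` bases downstream of the signature start.
    Returns the start position of the copy, or None if none is found.
    """
    if hit < tsd_len:
        return None  # not enough upstream sequence to define a TSD
    tsd = genome[hit - tsd_len:hit]
    pos = genome.find(tsd, hit + min_dist, hit + max_dist + tsd_len)
    return pos if pos != -1 else None
```

A candidate would be kept only when `find_signature_hits` reports a position and `find_tsd` returns a position for that hit. A linear scan like this is far too slow for a whole genome and is meant only to show the logic.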

To run the scripts yourself, open in a text editor and define the paths to the FASTA files of the reference genome and the chimpanzee genome.

cd ~/AluMine/discovery_REF-plus

REF-minus discovery scripts

The key elements in REF-minus discovery pipeline are:, gtester, and

  • Scripts or search for 10bp Alu signature sequences from BAM or FASTQ files and write out all potential signatures together with 25bp flanking sequence.
  • Then 'gtester' is used to localize the 25bp flanking sequences in the reference genome.
  • removes fixed Alu elements (present in the human reference genome) and prints out pair of 32-mers for those that are not present in reference genome.
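
As a rough illustration of the first step in this list, the Python sketch below reads FASTQ records and reports the 25 bp flank immediately upstream of each signature occurrence. It is a simplification under stated assumptions: matching is exact, only the upstream side is reported, BAM input is not handled, and all function names are hypothetical rather than taken from the repository.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def flanks_before_signature(seq, signature, flank_len=25):
    """Yield the `flank_len` bases directly upstream of every exact
    occurrence of `signature` in `seq`.

    Occurrences too close to the read start to have a full flank
    are skipped.
    """
    start = seq.find(signature)
    while start != -1:
        if start >= flank_len:
            yield seq[start - flank_len:start]
        start = seq.find(signature, start + 1)
```

The flanks collected this way are what would then be localized in the reference genome in the second step.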

Before running the REF-minus discovery pipeline, open file in text editor and define paths to sample files and path to human chromosome files.

cd ~/AluMine/discovery_REF-minus
bash supports input in BAM, single FASTQ, multiple FASTQ and gzipped FASTQ formats.

Merging and filtering the k-mer databases

The following steps are required for additional filtering using genotype data from real individuals:

  • Alu-element candidates with identical k-mers were removed.
  • Alu-element candidates that are located within 25bp of each other were removed.
  • REF-minus and REF-plus k-mers need to be merged with 30M SNV k-mers. SNVs help to build a more accurate model for genotype calling.
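
The first two filters in this list can be sketched in Python as follows. The candidate record layout (a dict with `chrom`, `pos` and `kmers`) is hypothetical. The sketch also assumes that every candidate involved in a collision is dropped, rather than one representative being kept; the text above does not say which applies.

```python
from collections import Counter


def drop_shared_kmers(cands):
    """Remove every candidate that shares a k-mer with another candidate."""
    counts = Counter(k for c in cands for k in set(c["kmers"]))
    return [c for c in cands if all(counts[k] == 1 for k in c["kmers"])]


def drop_close_neighbours(cands, max_gap=25):
    """Remove candidates lying within `max_gap` bp of another candidate
    on the same chromosome. Returns the survivors sorted by position."""
    ordered = sorted(cands, key=lambda c: (c["chrom"], c["pos"]))
    too_close = set()
    # After sorting, only adjacent candidates need to be compared.
    for i in range(len(ordered) - 1):
        a, b = ordered[i], ordered[i + 1]
        if a["chrom"] == b["chrom"] and b["pos"] - a["pos"] <= max_gap:
            too_close.update((i, i + 1))
    return [c for i, c in enumerate(ordered) if i not in too_close]
```

Applying `drop_shared_kmers` first and `drop_close_neighbours` second mirrors the order of the bullets, although the order used in the actual workflow is not stated here.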

Remaining Alu-elements were genotyped on 2200 individuals and additional filtering steps were used:

  • Alu elements with >10% calls with unexpected ploidy were removed.
  • Alu elements with Hardy-Weinberg Equilibrium P-value<1.5E-6 were removed.
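
The two genotype-based filters can be sketched in Python as below. This is an illustration only: it uses a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium, which may differ from the test used in the actual workflow, and it assumes the unexpected-ploidy fraction is computed over all calls for an element. The function names and argument layout are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) P-value for deviation from Hardy-Weinberg
    proportions, given diploid genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_filters(n_aa, n_ab, n_bb, n_unexpected,
                   max_unexpected=0.10, min_hwe_p=1.5e-6):
    """Apply the ploidy and HWE thresholds quoted in the text."""
    total = n_aa + n_ab + n_bb + n_unexpected
    if total == 0:
        return False
    if n_unexpected / total > max_unexpected:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p
```

An element with no heterozygotes at all among many diploid calls, for example, yields a very small P-value and would be removed by the second threshold.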

These steps are performed separately; their description can be found in file

cd ~/AluMine/discovery_merging_and_filtering


It is possible to skip the discovery phase and use our database of known Alu insertion polymorphisms (32,786 candidate polymorphisms).
The k-mer database for genotyping (ALU_v1.kmer.db) is available on the FastGT webpage.

cd ~/AluMine/genotyping


Please cite: Puurand T, Kukuškina V, Pajuste F-D, Remm M. (2019). AluMine: alignment-free method for the discovery of polymorphic Alu element insertions. Mobile DNA 10:31.

Additional data

Additional material can be downloaded from our webpage at Pre-compiled human 25-mer index for REF-minus discovery (57GB) and pre-compiled chimp 32-mer list for REF-plus discovery (27GB) can be downloaded from

