Skip to content

BLAST SEG masking

Benjamin J. Buchfink edited this page Jun 28, 2026 · 8 revisions

It is a little known fact that NCBI BLAST in blastp mode applies SEG low complexity masking to the target sequences by default. I asked GPT 5.5, Opus 4.8, Gemini and Groq if that was the case and they all told me no!

Let us look at it based on one example:

>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEAALAGALLALAVLATVGGNLLVIVAIAWTPRLQTMTNVFVTSLA
AADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLCVTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVL
VWVVSAAVSFAPIMSQWWRVGADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFPPEESPPAPSRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGTFTLCWLPFFLANVLRALGGP
SLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLLCRCGRRLPPEPCAAARPALFPSGVPAARSSPAQPRLCQRL
DGASWGVS

Now let us run the SEG masker with these (more conservative than default) settings:

$ segmasker -in NP_000016.1.faa -window 10 -locut 1.8 -hicut 2.1 -outfmt fasta
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEaalagallalavlaTVGGNLLVIV
AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC
VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV
GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFppeesppapsRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT
FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFrrll
crcgrrlPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLDGASWGVS

We can see the masking indicated by lower case letters. We use this sequence as a query:

>KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]
MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIVAIARTPRLQT
ITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLCVTASIETLCALAVDRYLAVT
NPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRIGADAEAQRCHSNPRCCAFASNMPYALLSSL
VSFYLPLLVMLFVYARVFIVATRQLRLLRQELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSA
RLLPLREHRALRTLGLIMGTFTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSP
DFRSAFRRLLCRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLDR

We build a database for the first sequence, then search against it with defaults (using BLAST v2.17.0):

$ makeblastdb -in NP_000016.1.faa -out NP_000016.1 -parse_seqids -dbtype prot
$ blastp -query KAM6220823.1.faa -db NP_000016.1

The resulting alignment:

Query= KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]

Length=402
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]                 591     0.0


>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
Length=408

 Score = 591 bits (1523),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 339/401 (85%), Positives = 354/401 (88%), Gaps = 0/401 (0%)

Query  1    MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIV  60
            MA WPH NSSL+ WPD+PTL PN  +TSGLPG+PW AALAGALLALAVLATVGGNLLVIV
Sbjct  1    MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEAALAGALLALAVLATVGGNLLVIV  60

Query  61   AIARTPRLQTITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLC  120
            AIA TPRLQT+TNVFVTSLAAADLV+GLLVVPP ATLALTGHWPLGATGCELWTSVDVLC
Sbjct  61   AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC  120

Query  121  VTASIETLCALAVDRYLAVTNPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRI  180
            VTASIETLCALAVDRYLAVTNPLRYGALVTKR AR AVVLVWVVS AVSFAPIMSQWWR+
Sbjct  121  VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV  180

Query  181  GADAEAQRCHSNPRCCAFASNMPYALLSSLVSFYLPLLVMLFVYARVFIVATRQLRLLRQ  240
            GADAEAQRCHSNPRCCAFASNMPY LLSS VSFYLPLLVMLFVYARVF+VATRQLRLLR
Sbjct  181  GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG  240

Query  241  ELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSARLLPLREHRALRTLGLIMGT  300
            ELGRFP E+SPPA SR+   APV   A P GVPACGRR ARLLPLREHRAL TLGLIMGT
Sbjct  241  ELGRFPPEESPPAPSRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT  300

Query  301  FTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL  360
            FTLCWLPFF+ANV+RA+G PSLVP PAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL
Sbjct  301  FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL  360

Query  361  CRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLD  401
            CRCG   P EP AA      PSG P   +SPA+ RLC  LD
Sbjct  361  CRCGRRLPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLD  401

No masking, correct? Wrong. We cross check by hard masking the residues in question in the input FASTA file:

>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEXXXXXXXXXXXXXXTVGGNLLVIV
AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC
VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV
GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFXXXXXXXXXXRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT
FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFXXXX
XXXXXXXPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLDGASWGVS

We build a new database for this and search again, resulting in this:

Query= KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]

Length=402
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]                 591     0.0


>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
Length=408

 Score = 591 bits (1523),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 310/401 (77%), Positives = 324/401 (81%), Gaps = 0/401 (0%)

Query  1    MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIV  60
            MA WPH NSSL+ WPD+PTL PN  +TSGLPG+PW               TVGGNLLVIV
Sbjct  1    MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEXXXXXXXXXXXXXXTVGGNLLVIV  60

Query  61   AIARTPRLQTITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLC  120
            AIA TPRLQT+TNVFVTSLAAADLV+GLLVVPP ATLALTGHWPLGATGCELWTSVDVLC
Sbjct  61   AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC  120

Query  121  VTASIETLCALAVDRYLAVTNPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRI  180
            VTASIETLCALAVDRYLAVTNPLRYGALVTKR AR AVVLVWVVS AVSFAPIMSQWWR+
Sbjct  121  VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV  180

Query  181  GADAEAQRCHSNPRCCAFASNMPYALLSSLVSFYLPLLVMLFVYARVFIVATRQLRLLRQ  240
            GADAEAQRCHSNPRCCAFASNMPY LLSS VSFYLPLLVMLFVYARVF+VATRQLRLLR
Sbjct  181  GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG  240

Query  241  ELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSARLLPLREHRALRTLGLIMGT  300
            ELGRF          R+   APV   A P GVPACGRR ARLLPLREHRAL TLGLIMGT
Sbjct  241  ELGRFXXXXXXXXXXRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT  300

Query  301  FTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL  360
            FTLCWLPFF+ANV+RA+G PSLVP PAFLALNWLGYANSAFNPLIYCRSPDFRSAF
Sbjct  301  FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFXXXX  360

Query  361  CRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLD  401
                   P EP AA      PSG P   +SPA+ RLC  LD
Sbjct  361  XXXXXXXPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLD  401

Same 591 bits, 1523 raw score. The residues are masked by default and do not contribute to the alignment score.

Is this good? Most likely so. Protein alignment has a huge false positive probem, and masking approaches are important to combat it.

Lessons learned: Things can happen inside bioinformatics tools that are not written in any paper or any documentation or anywhere. Sometimes the developer is the only person in the world who knows about them. Respect the people that understand the tools that billions of dollars' worth of science are being built on top of.

Clone this wiki locally