-
-
Notifications
You must be signed in to change notification settings - Fork 201
BLAST SEG masking
It is a little known fact that NCBI BLAST in blastp mode applies SEG low complexity masking
(Wootton and Federhen, 1996) to the target sequences by default. I asked GPT 5.5, Opus 4.8, Gemini
and Groq if that was the case and they all told me no!
Let us look at it based on one example:
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEAALAGALLALAVLATVGGNLLVIVAIAWTPRLQTMTNVFVTSLA
AADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLCVTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVL
VWVVSAAVSFAPIMSQWWRVGADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFPPEESPPAPSRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGTFTLCWLPFFLANVLRALGGP
SLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLLCRCGRRLPPEPCAAARPALFPSGVPAARSSPAQPRLCQRL
DGASWGVS
Now let us run the SEG masker with these (more conservative than default) settings:
$ segmasker -in NP_000016.1.faa -window 10 -locut 1.8 -hicut 2.1 -outfmt fasta
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEaalagallalavlaTVGGNLLVIV
AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC
VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV
GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFppeesppapsRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT
FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFrrll
crcgrrlPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLDGASWGVS
We can see the masking indicated by lower case letters. We use this sequence as a query:
>KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]
MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIVAIARTPRLQT
ITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLCVTASIETLCALAVDRYLAVT
NPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRIGADAEAQRCHSNPRCCAFASNMPYALLSSL
VSFYLPLLVMLFVYARVFIVATRQLRLLRQELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSA
RLLPLREHRALRTLGLIMGTFTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSP
DFRSAFRRLLCRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLDR
We build a database for the first sequence, then search against it with defaults (using BLAST v2.17.0):
$ makeblastdb -in NP_000016.1.faa -out NP_000016.1 -parse_seqids -dbtype prot
$ blastp -query KAM6220823.1.faa -db NP_000016.1
The resulting alignment:
Query= KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]
Length=402
Score E
Sequences producing significant alignments: (Bits) Value
NP_000016.1 beta-3 adrenergic receptor [Homo sapiens] 591 0.0
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
Length=408
Score = 591 bits (1523), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 339/401 (85%), Positives = 354/401 (88%), Gaps = 0/401 (0%)
Query 1 MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIV 60
MA WPH NSSL+ WPD+PTL PN +TSGLPG+PW AALAGALLALAVLATVGGNLLVIV
Sbjct 1 MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEAALAGALLALAVLATVGGNLLVIV 60
Query 61 AIARTPRLQTITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLC 120
AIA TPRLQT+TNVFVTSLAAADLV+GLLVVPP ATLALTGHWPLGATGCELWTSVDVLC
Sbjct 61 AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC 120
Query 121 VTASIETLCALAVDRYLAVTNPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRI 180
VTASIETLCALAVDRYLAVTNPLRYGALVTKR AR AVVLVWVVS AVSFAPIMSQWWR+
Sbjct 121 VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV 180
Query 181 GADAEAQRCHSNPRCCAFASNMPYALLSSLVSFYLPLLVMLFVYARVFIVATRQLRLLRQ 240
GADAEAQRCHSNPRCCAFASNMPY LLSS VSFYLPLLVMLFVYARVF+VATRQLRLLR
Sbjct 181 GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG 240
Query 241 ELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSARLLPLREHRALRTLGLIMGT 300
ELGRFP E+SPPA SR+ APV A P GVPACGRR ARLLPLREHRAL TLGLIMGT
Sbjct 241 ELGRFPPEESPPAPSRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT 300
Query 301 FTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL 360
FTLCWLPFF+ANV+RA+G PSLVP PAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL
Sbjct 301 FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL 360
Query 361 CRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLD 401
CRCG P EP AA PSG P +SPA+ RLC LD
Sbjct 361 CRCGRRLPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLD 401
No masking, correct? Wrong. We cross check by hard masking the residues in question in the input FASTA file:
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEXXXXXXXXXXXXXXTVGGNLLVIV
AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC
VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV
GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG
ELGRFXXXXXXXXXXRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT
FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFXXXX
XXXXXXXPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLDGASWGVS
We build a new database for this and search again, resulting in this:
Query= KAM6220823.1 beta-3 adrenergic receptor [Rhynchocyon petersi]
Length=402
Score E
Sequences producing significant alignments: (Bits) Value
NP_000016.1 beta-3 adrenergic receptor [Homo sapiens] 591 0.0
>NP_000016.1 beta-3 adrenergic receptor [Homo sapiens]
Length=408
Score = 591 bits (1523), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 310/401 (77%), Positives = 324/401 (81%), Gaps = 0/401 (0%)
Query 1 MALWPHGNSSLSSWPDVPTLEPNADDTSGLPGLPWAAALAGALLALAVLATVGGNLLVIV 60
MA WPH NSSL+ WPD+PTL PN +TSGLPG+PW TVGGNLLVIV
Sbjct 1 MAPWPHENSSLAPWPDLPTLAPNTANTSGLPGVPWEXXXXXXXXXXXXXXTVGGNLLVIV 60
Query 61 AIARTPRLQTITNVFVTSLAAADLVVGLLVVPPGATLALTGHWPLGATGCELWTSVDVLC 120
AIA TPRLQT+TNVFVTSLAAADLV+GLLVVPP ATLALTGHWPLGATGCELWTSVDVLC
Sbjct 61 AIAWTPRLQTMTNVFVTSLAAADLVMGLLVVPPAATLALTGHWPLGATGCELWTSVDVLC 120
Query 121 VTASIETLCALAVDRYLAVTNPLRYGALVTKRRARAAVVLVWVVSTAVSFAPIMSQWWRI 180
VTASIETLCALAVDRYLAVTNPLRYGALVTKR AR AVVLVWVVS AVSFAPIMSQWWR+
Sbjct 121 VTASIETLCALAVDRYLAVTNPLRYGALVTKRCARTAVVLVWVVSAAVSFAPIMSQWWRV 180
Query 181 GADAEAQRCHSNPRCCAFASNMPYALLSSLVSFYLPLLVMLFVYARVFIVATRQLRLLRQ 240
GADAEAQRCHSNPRCCAFASNMPY LLSS VSFYLPLLVMLFVYARVF+VATRQLRLLR
Sbjct 181 GADAEAQRCHSNPRCCAFASNMPYVLLSSSVSFYLPLLVMLFVYARVFVVATRQLRLLRG 240
Query 241 ELGRFPAEDSPPAASRTRPAAPVARSASPPGVPACGRRSARLLPLREHRALRTLGLIMGT 300
ELGRF R+ APV A P GVPACGRR ARLLPLREHRAL TLGLIMGT
Sbjct 241 ELGRFXXXXXXXXXXRSLAPAPVGTCAPPEGVPACGRRPARLLPLREHRALCTLGLIMGT 300
Query 301 FTLCWLPFFVANVVRAVGSPSLVPSPAFLALNWLGYANSAFNPLIYCRSPDFRSAFRRLL 360
FTLCWLPFF+ANV+RA+G PSLVP PAFLALNWLGYANSAFNPLIYCRSPDFRSAF
Sbjct 301 FTLCWLPFFLANVLRALGGPSLVPGPAFLALNWLGYANSAFNPLIYCRSPDFRSAFXXXX 360
Query 361 CRCGSGRPAEPSAAVPQTRLPSGAPEVLTSPAESRLCPPLD 401
P EP AA PSG P +SPA+ RLC LD
Sbjct 361 XXXXXXXPPEPCAAARPALFPSGVPAARSSPAQPRLCQRLD 401
Same 591 bits, 1523 raw score. The residues are masked by default and do not contribute to the alignment score.
This cannot be disabled on the command line, only by setting -comp_based_stats 0 which will
also disable compositional matrix adjust and is usually not a good idea.
Is this good? Most likely so. Protein alignment has a huge false positive probem, and masking approaches are important to combat it.
Lessons learned: Things can happen inside bioinformatics tools that are not written in any paper or any documentation or anywhere. Sometimes the developer is the only person in the world who knows about them. Respect the people that understand the tools that billions of dollars' worth of science are being built on top of.
Wootton, J. C., and S. Federhen. 1996. “Analysis of Compositionally Biased Regions in Sequence Databases.” Methods in Enzymology 266: 554–71.