-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Longer epitope sequences #31
Comments
Hi @p-levy, Glad you are able to run the tool, correct me if I am wrong, you are hoping to get longer sequence flanking the splicing junction (25mer or 40mer) compared to what is currently reported 9mer or 10mer. This is not coded in the current program, as when I started this project, we were particularly interested in MHC-I presented peptides which typically ranges from 8-11mer, and 9mer and 10mer covers the majority. But I think I might have the codes (Another project I am working on) that might be able to help you: Given each SNAF reported Junction chromosome coordinates ( Step 1: Download hg38 fasta and build a dictionary as belowfrom Bio.SeqIO.FastaIO import SimpleFastaParser
dict_fa = {} # {chrom:seq}
with open(path_to_hg38.fa,'r') as in_handle:
for title,seq in SimpleFastaParser(in_handle):
dict_fa[title] = seq Step 2: Given SNAF coordinate, retrieve the flanking sequence as long as you wantfrom Bio.Seq import Seq
MAX_PEP_LEN=11
MIN_PEP_LEN=8
overhang = 50
# phase definition refer to https://www.biostars.org/p/63864/
item = 'chrx:xxxxx-xxxxx(+)'
chrom,coord = item.split(':')
coord,strand = coord.split('(')
strand = strand.rstrip(')')
start,end = coord.split('-')
start,end = int(start),int(end)
if strand == '+':
first_section = dict_fa[chrom][start-overhang:start]
second_section = dict_fa[chrom][end-1:end-1+overhang]
elif strand == '-':
old_first_section = dict_fa[chrom][start-overhang:start]
old_second_section = dict_fa[chrom][end-1:end-1+overhang]
# reverse complement process
first_section = Seq(old_second_section).reverse_complement()
second_section = Seq(old_first_section).reverse_complement()
# consider phase = 0
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1))]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))
# consider phase = 1
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)+2):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1)+1)]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))
# consider phase = 2
defacto_first = first_section[-(3*(MAX_PEP_LEN-1)+1):]
defacto_second = second_section[:(3*(MAX_PEP_LEN-1)+2)]
defacto_junction = defacto_first + defacto_second
defacto_peptide = str(Seq(defacto_junction).translate(to_stop=False))
if '*' not in defecto_peptide:
print(defecto_peptide) Step 3: run NetMHCpan4.1def run_netMHCpan(software_path,peptides,hlas,length,cmd_num=1,tmp_dir=None,tmp_name=None):
# set the default
if tmp_dir is None:
if not os.path.exists(os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch')):
os.mkdir(os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch'))
tmp_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)),'scratch')
if tmp_name is None:
tmp_name = 'input_{}.pep'.format(os.getpid())
# reformat/create to the strings that we need
peptides_path = os.path.join(tmp_dir,tmp_name)
with open(peptides_path,'w') as f:
for pep in peptides:
f.write('{}\n'.format(pep))
# netMHCpan has a hard limit for 1024 characters for -a, so to be safe, 90 HLA as max for one goal, each input format 11 character 'HLA-B57:01,'
if len(hlas) <= 90:
hla_strings = ','.join(hlas)
if cmd_num == 1:
reconstruct = ' || '.join(['$3 == ' + '"{}"'.format(pep) for pep in peptides])
cmd = '{} -p {} -a {} -BA -l {} | awk \'BEGIN {{OFS = "\\t"}} {{if ({}) {{print $3, length($3),$2,$13,$16,$18}}}}\''.format(software_path,peptides_path,hla_strings,length,reconstruct)
'''
../external/netMHCpan-4.1/netMHCpan -p ./test.pep -a HLA-A01:01,HLA-A02:01 -BA -l 9 | awk 'BEGIN {OFS = "\t"} {if ($3 == "AAAWYLWEV" || $3 == "AAGLQDCTM" || $3 == "AARNIVRRA") {print $3, length($3),$2,$13, $15}}'
'''
try:
df = pd.read_csv(StringIO(subprocess.run(cmd,shell=True,capture_output=True).stdout.decode('utf-8')),sep='\t',header=None)
df.columns = ['peptide','mer','hla','score','binding_nM','identity']
except: # no stdout, just no candidates
df = pd.DataFrame(columns=['peptide','mer','hla','score','identity'])
def run_netMHCpan_wrapper(peptide,hlas):
# hlas is a list, each hla allele follow that format[HLA-A02:01]
netMHCpan_path = '/path/to/netMHCpan-4.1/netMHCpan'
df = run_netMHCpan(software_path=netMHCpan_path,
peptides=[peptide],
hlas=hlas,
length=len(peptide),
cmd_num=1,
tmp_dir=None,
tmp_name=None)
return df Step 4: Immunogenicity PredictionCurrently DeepImmuno only supports 9 and 10mer, but there are other tools I am aware of that have more flexible length requirement such as DeepHLApan. I hope this could help a bit! Best, |
Hi Frank, Thanks for your prompt and detailed answer! I played a bit with the code you shared and that's already helpful. However, I realized that what I would really like is to focus on the junctions where at least the first section is within an annotated CDS. Then I'd like to get the AA-seq of custom length (e.g. 25-mer) around the junction but only using the phase of the annotated CDS. Is there an "easy" way to do that? That would be great! All the best, |
Hi @p-levy, I see, for my current project, I actually decided to take all three phases due to now increasing report on non-canonical ORFs (cryptic ORFs) and the well-characterized mechanisms illustrating why it is common in cancer. Ref1: https://www.nature.com/articles/s41587-021-01021-3 But I understand your need, I actually implemented this in the SNAF, but it is kind of obsolete and I haven't really rigorous tested this function, but I can share. In your download database folder, you will find two required input: [1] You followed this code block (https://github.com/frankligy/SNAF/blob/main/snaf/snaf.py#L62-L96) to get two variable I have a function to Hoping this could help, |
Thanks! I'll test this asap. |
Hi Frank, It took me a while but I think that thanks to your help I'm almost getting what I want. I still have two questions:
Thanks again! |
Hi @p-levy, [1] yes, the from snaf import surface
surface.initialize(db_dir=db_dir)
uid = 'ENSG00000198034:E8.4-E9.1'
junction_coord = surface.main.uid_to_coord(uid) [2] I agree with you, there will are cases where exon is shorter than your desired length, so that flanking intron sequence will be included. That's the caveats for this simple makeshift solution. There are definitely ways to achieve what you are looking for here, something came to my mind is first take a normal hg38 gtf file in which you can probably download from Gencode or UCSC genome browser. Then index it using tabix so that you can retrieve certain region of the chromsome.
Now you can use the coordinate, and write some custom script to do that: import pysam
tbx_file = pysam.TabixFile('hg38.gtf.gz')
feature_rows = tbx_file.fetch(chromsome,start-1,end)
for feature in feature_rows:
# here, you will have the coordinate of each exon, cds, trancript, UTR and other features that are overlapping with your junction coordinate, You can save these coordinate into a nested list or dict. So you will know when extending the flanking sequence will exceed the exon boundary so you can jump to the next exon. I hope this helps a bit, |
Hi Frank,
First of all congrats and thanks to share this great tool!
I've managed to run it on a small subset of our own patients. We usually test the immunogenicity of neoantigens as longer epitopes (in general 25-mers). Is it possible to get a longer peptide sequence than the outputted predicted minimal epitopes? Like 25- or even 40-mer around the junctions. Maybe that's already an option but I couldn't find it.
Thanks again!
Pierre
The text was updated successfully, but these errors were encountered: