# Conservation Analysis and Epitope Prediction


#### Author: C. Mazzaferro, K. Fisch
#### Email: cmazzafe@ucsd.edu
#### Date: October 2016
 
## Outline of Notebook
<a id = "toc"></a>
1. <a href = "#background">Background</a>
2. <a href = "#Cons">High Affinity Binding Prediction </a>
    * <a href = "#Agg">Data Aggregation</a>
    * <a href = "#Sim">Similarity Score Calculation</a>
    * <a href = "#Plot">Visualize</a>

### Files of interest
Starting from a single FASTA file containing the proteins of interest, this workflow creates:

1. Binding affinity prediction xls files for each (allele, n-mer) combination. By default, all allele supertypes and n-mers of length 8, 9, 10, and 11 are used.
2. A CSV file containing all high binding affinity peptides. The threshold can be customized, either by setting it to Medium or by specifying an affinity score. See [CBS's prediction services](http://www.cbs.dtu.dk/services/), in particular [netMHC](http://www.cbs.dtu.dk/services/NetMHC/), for more specific information.
3. FASTA files containing swap data: for each high affinity binding peptide, every peptide that results from swapping a single amino acid (AA) in it. These FASTA files are then fed back into netMHC to determine the binding affinity of each swapped peptide.
4. A CSV file for each FASTA file, containing the binding predictions for the swapped peptides.
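
The single-AA swap in step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the 20 standard amino acids as the substitution alphabet; the function name `single_swaps` is hypothetical, and the `nepitope` implementation may differ.

```python
# The 20 standard amino acids, one-letter codes (assumed substitution alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_swaps(peptide):
    """Return every peptide that differs from `peptide` at exactly one position."""
    variants = []
    for pos, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:
                # Replace the residue at `pos` and keep the rest unchanged.
                variants.append(peptide[:pos] + aa + peptide[pos + 1:])
    return variants


variants = single_swaps("LESEFVYGDY")
print(len(variants))  # 10 positions x 19 alternatives = 190
```

Under this assumption a 10-mer yields 190 swapped peptides, which gives a rough sense of how many extra netMHC predictions each high affinity peptide generates.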

Caveat: a local netMHC install must be on the system's `PATH`.
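
A quick way to check this before running the workflow is `shutil.which` from the standard library. The helper name `find_executable` is illustrative, and the executable name `netMHC` is assumed from the install path used in this notebook.

```python
import shutil


def find_executable(name="netMHC"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
print(path if path else "netMHC not found on PATH")
```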

In [34]:
import importlib
from nepitope import net_MHC_interface
from nepitope import mhc_utils
from nepitope import pep_utils
import pandas

project_location = '/Users/carlomazzaferro/Desktop/New_General/'
fasta_file = '/Users/carlomazzaferro/Desktop/New_General/fasta_base_new_prots.fasta'
net_mhc_path = '/Users/carlomazzaferro/Desktop/BINF_Tools/netMHC-4.0/netMHC'

"""
Default Parameters:

all_alleles = ['HLA-A0101','HLA-A0201','HLA-A0301','HLA-A2402','HLA-A2601',
               'HLA-B0702','HLA-B0801','HLA-B2705','HLA-B3901','HLA-B4001','HLA-B5801','HLA-B1501']

nmer_list = [8,9,10,11]

"""


net_mhc = net_MHC_interface.netMHCComand(net_mhc_path, fasta_file)
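
With the default parameters listed above, one prediction run is needed per (allele, n-mer) pair. The sketch below only enumerates those pairs with the standard library; how `nepitope` turns them into netMHC commands is not shown, and the variable `combos` is illustrative.

```python
from itertools import product

all_alleles = ['HLA-A0101', 'HLA-A0201', 'HLA-A0301', 'HLA-A2402', 'HLA-A2601',
               'HLA-B0702', 'HLA-B0801', 'HLA-B2705', 'HLA-B3901', 'HLA-B4001',
               'HLA-B5801', 'HLA-B1501']
nmer_list = [8, 9, 10, 11]

# Every allele is paired with every peptide length.
combos = list(product(all_alleles, nmer_list))
print(len(combos))  # 12 alleles x 4 lengths = 48
print(combos[0])    # ('HLA-A0101', 8)
```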

### <a id = "Cons"></a>
# MHC Binding Affinity Prediction 
### Run on netMHCcons for n-mers of 8, 9, 10, and 11 amino acids and for the 12 allele supertype families

Run for each protein sequence retrieved from the reference paper.

In [36]:
net_mhc.create_text_command()

'netMHC commands written to run_netMHC.txt located at /Users/carlomazzaferro/Desktop/New_General/mhc_preds_fasta_base_new_prots'

In [37]:
net_mhc.run_netMHC()

Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/mhc_preds_fasta_base_new_prots


In [38]:
### Load predictions 
import importlib
importlib.reload(mhc_utils)
pred_loc = '/Users/carlomazzaferro/Desktop/New_General/mhc_preds_fasta_base_new_prots/'
agg = mhc_utils.FileConsolidation(pred_loc, fasta_file)

In [39]:
csv_dir = '/Users/carlomazzaferro/Desktop/New_General/high_affinity_peps.csv'
high_aa_df = agg.get_all_high_affinity_from_batch(csv_out=True, csv_dir=csv_dir)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
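
The warning above is pandas' `SettingWithCopyWarning`, raised when a column is assigned on a DataFrame that was itself obtained by slicing another one. A common way to avoid it is to take an explicit `.copy()` of the slice before modifying it. The sketch below uses a tiny made-up table (the second peptide is not from the predictions) and is not the `nepitope` code.

```python
import pandas as pd

# Toy data; "XSTDILGNLY" is a made-up peptide containing a placeholder 'X'.
df = pd.DataFrame({"Peptide": ["LESEFVYGDY", "XSTDILGNLY"], "nM": [33.5, 3.7]})

# Take an explicit copy of the filtered rows so later assignments are unambiguous.
strong = df[df["nM"] < 50].copy()
strong["Peptide"] = strong["Peptide"].str.replace("X", "-", regex=False)

print(strong["Peptide"].tolist())
print(df["Peptide"].tolist())  # the original frame is left untouched
```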


In [40]:
high_aa_df.head()

Unnamed: 0,Pos,Peptide,nM,Rank,ID,Allele,Affinity Level,n-mer
1003,1003,LESEFVYGDY,33.5,0.06,S__pyogenes_Cas9,HLA-A0101,High,10
2388,1029,YSTDILGNLY,3.7,0.01,Staphylococcus_aureus_Cas9,HLA-A0101,High,10
2389,1030,STDILGNLYE,47.1,0.08,Staphylococcus_aureus_Cas9,HLA-A0101,High,10
2608,205,QITDEFINRY,44.0,0.08,S_CRISPR_1_thermophilus_Cas9,HLA-A0101,High,10
7793,1102,VRTDVLGNKY,38.1,0.07,Streptococcus_pasteurianus_Cas9,HLA-A0101,High,10
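
The `Affinity Level` column labels each peptide by predicted binding strength. The rows above are consistent with the widely used IC50 cut-offs of 50 nM (high), 500 nM (intermediate), and 5000 nM (low), but the exact thresholds and label names used by `nepitope` are an assumption here. A minimal sketch of such a binning:

```python
def affinity_level(nm, high=50.0, intermediate=500.0, low=5000.0):
    """Bin a predicted IC50 (in nM) into a coarse affinity label.

    The default cut-offs are assumed, not taken from nepitope.
    """
    if nm < high:
        return "High"
    if nm < intermediate:
        return "Intermediate"
    if nm < low:
        return "Low"
    return "No"


for nm in (3.7, 33.5, 47.1, 250.0, 20000.0):
    print(nm, affinity_level(nm))
```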


# Swap Files

In [41]:
high_aa_df = pandas.read_csv(csv_dir) 
high_aa_df.head()

Unnamed: 0.1,Unnamed: 0,Pos,Peptide,nM,Rank,ID,Allele,Affinity Level,n-mer
0,1003,1003,LESEFVYGDY,33.5,0.06,S__pyogenes_Cas9,HLA-A0101,High,10
1,2388,1029,YSTDILGNLY,3.7,0.01,Staphylococcus_aureus_Cas9,HLA-A0101,High,10
2,2389,1030,STDILGNLYE,47.1,0.08,Staphylococcus_aureus_Cas9,HLA-A0101,High,10
3,2608,205,QITDEFINRY,44.0,0.08,S_CRISPR_1_thermophilus_Cas9,HLA-A0101,High,10
4,7793,1102,VRTDVLGNKY,38.1,0.07,Streptococcus_pasteurianus_Cas9,HLA-A0101,High,10


In [58]:
from nepitope import peptide_utilities
importlib.reload(peptide_utilities)
fasta_files_dir = '/Users/carlomazzaferro/Desktop/New_General/swaps/'

swap = peptide_utilities.Swaps(high_aa_df, fasta_files_dir, net_mhc_path, proteins=['S__pyogenes_Cas9'] )

In [59]:
cmd = swap.find_swaps_write_to_fasta()

Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_LESEFVYGDY_Pos_1003_ID_S--pyogenes-Cas9_Allele_HLA-A0101_nmer_10
Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_NGLFGNLIAL_Pos_234_ID_S--pyogenes-Cas9_Allele_HLA-A0201_nmer_10
Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_AILLSDILRV_Pos_298_ID_S--pyogenes-Cas9_Allele_HLA-A0201_nmer_10
Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_ILTFRIPYYV_Pos_442_ID_S--pyogenes-Cas9_Allele_HLA-A0201_nmer_10
Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_HSLLYEYFTV_Pos_510_ID_S--pyogenes-Cas9_Allele_HLA-A0201_nmer_10
Predictions being saved to /Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/mhc_preds__swap_YLQNGRDMYV_Pos_814_ID_S--
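
Each output directory name encodes the original peptide, its position, the protein ID, the allele, and the n-mer length. If those fields are needed later, they can be recovered with a regular expression. The pattern below is inferred from the names printed above and is only a sketch; `parse_swap_dir` is a hypothetical helper, not part of `nepitope`.

```python
import re

# Pattern inferred from the directory names in the output above.
SWAP_DIR = re.compile(
    r"swap_(?P<peptide>[A-Z]+)_Pos_(?P<pos>\d+)_ID_(?P<protein>.+?)"
    r"_Allele_(?P<allele>.+?)_nmer_(?P<nmer>\d+)$"
)


def parse_swap_dir(name):
    """Split a swap directory name into its fields, or return None if it does not match."""
    match = SWAP_DIR.search(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    fields["nmer"] = int(fields["nmer"])
    return fields


name = "mhc_preds__swap_LESEFVYGDY_Pos_1003_ID_S--pyogenes-Cas9_Allele_HLA-A0101_nmer_10"
print(parse_swap_dir(name))
```

Note that the protein ID appears with hyphens in the directory name (`S--pyogenes-Cas9`) where the DataFrame uses underscores (`S__pyogenes_Cas9`), so a lookup against the `ID` column would need that conversion.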

## Load file

In [104]:
from nepitope import mhc_utils
import pandas
import glob
import importlib
importlib.reload(mhc_utils)

# netMHC writes its predictions to multiple files, one per query (n-mer, allele)
filepath = '/Users/carlomazzaferro/Desktop/New_General/swaps/S__pyogenes_Cas9/preds_per_swap/'
fasta_file = '/Users/carlomazzaferro/Desktop/New_General/fasta_base_new_prots.fasta'

aggregate_all = mhc_utils.FileConsolidation(filepath, fasta_file)

In [105]:
#returns a dataframe per file used (just one in this case)
df_list = aggregate_all.return_df_list()

In [106]:
#Would run this. But too slow. The 3 cells below are speed-optimized. 
lsss_1 = aggregate_all.optimized_list_df_by_prot(df_list)

In [646]:
sss = pandas.concat(lsss_1)
# `'High' or 'Intermediate'` evaluates to just 'High'; isin() matches both levels
high_int_aa = sss.loc[sss['Affinity Level'].isin(['High', 'Intermediate'])]
high_int_aa.to_csv('/Users/carlomazzaferro/Desktop/Test_Workflow_From_Scratch/all_high_aff_aas.csv')
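
When filtering on more than one affinity level, note that the Python expression `('High' or 'Intermediate')` evaluates to `'High'` alone, so comparing a column against it with `==` silently drops the intermediate rows. `Series.isin` tests membership in a list instead. The toy frame below is made up for illustration.

```python
import pandas as pd

levels = pd.DataFrame({"Affinity Level": ["High", "Intermediate", "Low", "High"]})

print('High' or 'Intermediate')  # prints: High

# Comparing against the `or` expression keeps only the 'High' rows.
only_high = levels.loc[levels["Affinity Level"] == ('High' or 'Intermediate')]

# isin() keeps rows matching any of the listed levels.
high_or_int = levels.loc[levels["Affinity Level"].isin(["High", "Intermediate"])]

print(len(only_high), len(high_or_int))
```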

## Load original proteins

In [514]:
from nepitope import pairwise_comp
importlib.reload(pairwise_comp)

<module 'nepitope.pairwise_comp' from '/Users/carlomazzaferro/Documents/Code/neoantigen/antigen_discovery/nepitope/pairwise_comp.py'>

In [210]:
pwcomp = pairwise_comp.PairwiseComp(lsss_1, 5, fasta_file) 
df_pairwise_comps  = pwcomp.pipe_run() 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['Peptide'] = df['Peptide'].str.replace('X', '-')


In [140]:
swaps_df_dir = '/Users/carlomazzaferro/Desktop/Test_IEDB/OtherProtsPrashant/swap_results/S__pyogenes_Cas9/summary_results_per_prediction.csv'
swaps_df = pandas.read_csv(swaps_df_dir)

In [314]:
from importlib import reload
from nepitope import merge_workflows
reload(merge_workflows)  # the module must be imported before it can be reloaded
merged = merge_workflows.MergeSwapsAndComp(df_pairwise_comps, swaps_df, 40)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df = pandas.DataFrame(maximal_list)


In [577]:
merged.top_loc_matches[0:20]
for i in merged.top_loc_matches[0:20]:
    print('%i matches occurring at position %i' %(i[1], i[0]))

130 matches occurring at position 657
129 matches occurring at position 447
126 matches occurring at position 656
124 matches occurring at position 446
122 matches occurring at position 658
119 matches occurring at position 448
117 matches occurring at position 813
115 matches occurring at position 814
110 matches occurring at position 659
109 matches occurring at position 445
109 matches occurring at position 812
106 matches occurring at position 514
106 matches occurring at position 655
103 matches occurring at position 513
100 matches occurring at position 449
98 matches occurring at position 512
98 matches occurring at position 515
98 matches occurring at position 985
98 matches occurring at position 987
96 matches occurring at position 986
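
One plausible way to obtain such a ranking is to count how often each sequence position appears across all matching windows. How `MergeSwapsAndComp` actually computes `top_loc_matches` is not shown in this notebook, so the sketch below is an assumption, using `collections.Counter` on made-up windows.

```python
from collections import Counter


def top_positions(match_windows, n=3):
    """Count how often each position occurs across windows; return the n most frequent."""
    counts = Counter(pos for window in match_windows for pos in window)
    return counts.most_common(n)


# Made-up overlapping windows of matched positions.
windows = [[7, 8, 9, 10, 11], [8, 9, 10, 11, 12], [9, 10, 11, 12, 13]]
for pos, count in top_positions(windows):
    print('%i matches occurring at position %i' % (count, pos))
```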


In [315]:
df_top = merged.top_swap_df
df_ordered_filtered = merged.final_sort_and_filtering(df_top)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [642]:
df_ordered_filtered.columns

Index(['Unnamed: 0', 'nmer', 'allele', 'num high affinity peps',
       'num med affinity peps', 'num low affinity peps',
       'num no affinity peps', 'original peptide', 'original pos',
       'top scoring peptides', 'Range'],
      dtype='object')

In [610]:
from nepitope import modify_data
reload(modify_data)  # the module must be imported before it can be reloaded
mod_fasta = '/Users/carlomazzaferro/Desktop/PROJECT_LOC_TEST/mod_fastas/mod_all_fastas/'
orig_fasta = '/Users/carlomazzaferro/Desktop/PROJECT_LOC_TEST/prots_in_fasta.fasta'

mod_data = modify_data.ModifyData(swaps_df, fasta_file, mod_fasta, lsss_1)

In [611]:
######## TO INCLUDE IN TEST SUITE ########

from nepitope import pep_utils
j,k = pep_utils.create_separate_lists(fasta_file)
for i in sgl:
    print(i[1] in k[0])

for i in mod_data.swaps_dic.keys():
    print(i in mod_data.fasta_dic[0][1])
    print(i)

False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
RRYTRRKNR
True
KAILSARLSK
True
KLNREDLLRK
True
WRQLLNAKLI
True
SRFAWMTRK
True
LIHDDSLTF
True
IYLALAHMI
True
YHHAHDAYL
True
SLLYEYFTVY
True
ILLSDILRV
True
MLASAGELQK
True
KQSGKTILDF
True
RRLENLIAQL
True
KTYAHLFDDK
True
AHMIKFRGHF
True
LPKYSLFEL
True
SLHEHIANL
True
SPAIKKGIL
True
RLIYLALAHM
True
HYLDEIIEQI
True
RRKNRICYL
True
REQAENIIHL
True
KVLPKHSLLY
True
MIKFRGHFL
True
LLYEYFTVY
True
ALSLGLTPNF
True
NGL

In [612]:
mod_data.get_modified_fastas_single_swap()

In [615]:
# one [original peptide, swap] pair per entry in the swaps dictionary
exchange_pairs = [[i, j] for i, j in mod_data.swaps_dic.items()]
    

In [603]:
import random
import re

def get_all_exchange_pairs(swaps_df):
    """Pair each original peptide with one randomly chosen top-scoring swap."""
    cols = ['original peptide', 'top scoring peptides']
    # .ix has been removed from pandas; .loc is the label-based equivalent
    sliced = swaps_df.loc[:, cols].values.tolist()
    orig = [i[0] for i in sliced]
    # split the stringified peptide list on any run of non-alphanumeric characters
    swaps = [list(filter(None, re.sub('[^A-Za-z0-9]+', ',', i[1]).split(','))) for i in sliced]
    hashmap = [[orig[i], swaps[i]] for i in range(len(orig))]
    swaps_single = []
    for hsh in hashmap:
        swaps_single.append([hsh[0], random.choice(hsh[1])])
    return swaps_single

sgl = get_all_exchange_pairs(swaps_df)
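
The `top scoring peptides` column is treated above as a string holding a list of peptides, which is split on any run of non-alphanumeric characters. The standalone sketch below shows an equivalent split with `re.split` and a seeded random choice for reproducibility. The input string and the helper name are made up for illustration.

```python
import random
import re


def parse_peptide_list(cell):
    """Extract peptide tokens from a stringified list such as "['AAA', 'CCC']"."""
    return [tok for tok in re.split(r'[^A-Za-z0-9]+', cell) if tok]


cell = "['LESEFVYGDA', 'LESEFVYGDC', 'LESEFVYGDD']"  # made-up example value
peptides = parse_peptide_list(cell)
print(peptides)

# Seeding a dedicated generator makes the random pick repeatable between runs.
rng = random.Random(0)
picked = rng.choice(peptides)
print(picked)
```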



In [616]:
import os
fastas = glob.glob(mod_fasta + '*')
csv_path = '/Users/carlomazzaferro/Desktop/PROJECT_LOC_TEST/mod_csvs/mod_csvs_all_peps/'

list_dfs = []
for pair in exchange_pairs:
    for fasta in fastas:
        if os.path.splitext(os.path.basename(fasta))[0].split('-')[0] == pair[0]:
            my_fasta = fasta
            ls_mod = mod_data.get_singly_modified_df_list(pair[0], pair[1])
            pwcomp_pair_swap = pairwise_comp.PairwiseComp(ls_mod, 5, my_fasta)
            df_swap = pwcomp_pair_swap.pipe_run()
            df_swap = df_swap.drop('Matches Loc', axis=1)  # axis must be passed by keyword in recent pandas
            df_swap.to_csv(csv_path + pair[0] + '-' + pair[1] + '.csv')
            list_dfs.append(df_swap)
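
The nested loop above matches each exchange pair to a FASTA file by comparing the peptide with the part of the file name before the first `-`. When there are many files, building a lookup table once avoids rescanning the file list for every pair. The sketch below uses hypothetical paths and a hypothetical helper name; if two files share a prefix, the later one wins, which differs from the loop above.

```python
import os


def index_by_prefix(paths):
    """Map the file-name prefix (text before the first '-') to its path."""
    index = {}
    for path in paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        index[stem.split('-')[0]] = path
    return index


# Hypothetical file names following the '<original>-<swap>' convention.
paths = ['/tmp/mod_fastas/ILLSDILRV-ILLSDILRA.fasta',
         '/tmp/mod_fastas/SLHEHIANL-SLHEHIANV.fasta']
lookup = index_by_prefix(paths)
print(lookup.get('ILLSDILRV'))
```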
            
            
    


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['Peptide'] = df['Peptide'].str.replace('X', '-')


In [629]:
list_dfs[97].loc['S__pyogenes_Cas9']

Unnamed: 0,10,11,5,6,7,8,9,Num High AA,Total
C__jejuni_Cas9,0,0.0,0,0,0,0,0,0,0
Campylobacter_lari_Cas9,0,0.0,0,0,0,0,0,0,0
Corynebacter_diphtheria_Cas9,0,0.0,0,0,0,0,0,0,0
F__novicida_Cas9,0,0.0,0,0,0,0,0,0,0
Lactobacillus_buchneri_Cas9,0,0.0,15,5,2,1,0,2,23
Listeria_innocua_Cas9,0,0.0,67,41,23,13,5,7,149
N__meningitidis_Cas9,0,0.0,2,0,0,0,0,0,2
Neisseria_cinerea_Cas9,0,0.0,2,0,0,0,0,0,2
P__multocida_Cas9,0,0.0,2,1,0,0,0,0,3
Parvibaculum_lavamentivorans_Cas9,0,0.0,0,0,0,0,0,0,0


In [630]:
list_dfs[74].loc['S__pyogenes_Cas9']

Unnamed: 0,10,11,5,6,7,8,9,Num High AA,Total
C__jejuni_Cas9,0,0.0,0,0,0,0,0,0,0
Campylobacter_lari_Cas9,0,0.0,0,0,0,0,0,0,0
Corynebacter_diphtheria_Cas9,0,0.0,0,0,0,0,0,0,0
F__novicida_Cas9,0,0.0,0,0,0,0,0,0,0
Lactobacillus_buchneri_Cas9,0,0.0,15,5,2,1,0,4,23
Listeria_innocua_Cas9,0,0.0,67,41,23,13,5,8,149
N__meningitidis_Cas9,0,0.0,2,0,0,0,0,0,2
Neisseria_cinerea_Cas9,0,0.0,2,0,0,0,0,0,2
P__multocida_Cas9,0,0.0,2,1,0,0,0,0,3
Parvibaculum_lavamentivorans_Cas9,0,0.0,0,0,0,0,0,0,0


In [165]:
pwcomp_1_swap = pairwise_comp.PairwiseComp(ls_mod, 5, mod_fasta)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['Peptide'] = df['Peptide'].str.replace('X', '-')


In [166]:
df_1_swap = pwcomp_1_swap.pipe_run()

In [156]:
df_1_swap.to_csv('/Users/carlomazzaferro/Desktop/Test_IEDB/OtherProtsPrashant/single_swap_pairwise_comp.csv')

In [168]:
df_1_swap.loc['S__pyogenes_Cas9']

Unnamed: 0,10,11,5,6,7,8,9,Matches Loc,Num High AA,Total
C__jejuni_Cas9,0,0,0,0,0,0,0,[[]],0,0
Campylobacter_lari_Cas9,0,0,0,0,0,0,0,[[]],0,0
Corynebacter_diphtheria_Cas9,0,0,0,0,0,0,0,[[]],0,0
F__novicida_Cas9,0,0,0,0,0,0,0,[[]],0,0
Lactobacillus_buchneri_Cas9,0,0,13,5,2,1,0,"[[[1266, 1267, 1268, 1269, 1270], [1266, 1267,...",5,21
Listeria_innocua_Cas9,0,0,66,40,23,13,5,"[[[8, 9, 10, 11, 12, 13, 14, 15], [7, 8, 9, 10...",118,147
N__meningitidis_Cas9,0,0,2,0,0,0,0,"[[[7, 8, 9, 10, 11], [7, 8, 9, 10, 11]]]",2,2
Neisseria_cinerea_Cas9,0,0,2,0,0,0,0,"[[[7, 8, 9, 10, 11], [7, 8, 9, 10, 11]]]",2,2
P__multocida_Cas9,0,0,2,1,0,0,0,[[]],0,3
Parvibaculum_lavamentivorans_Cas9,0,0,0,0,0,0,0,[[]],0,0


In [161]:
df_pairwise_comps.loc['S__pyogenes_Cas9']

Unnamed: 0,10,11,5,6,7,8,9,Matches Loc,Num High AA,Total
C__jejuni_Cas9,0,0,0,0,0,0,0,[[]],0,0
Campylobacter_lari_Cas9,0,0,0,0,0,0,0,[[]],0,0
Corynebacter_diphtheria_Cas9,0,0,0,0,0,0,0,[[]],0,0
F__novicida_Cas9,0,0,0,0,0,0,0,[[]],0,0
Lactobacillus_buchneri_Cas9,0,0,15,5,2,1,0,"[[[1266, 1267, 1268, 1269, 1270], [1266, 1267,...",7,23
Listeria_innocua_Cas9,0,0,67,41,23,13,5,"[[[8, 9, 10, 11, 12, 13, 14, 15], [7, 8, 9, 10...",120,149
N__meningitidis_Cas9,0,0,2,0,0,0,0,"[[[7, 8, 9, 10, 11], [7, 8, 9, 10, 11]]]",2,2
Neisseria_cinerea_Cas9,0,0,2,0,0,0,0,"[[[7, 8, 9, 10, 11], [7, 8, 9, 10, 11]]]",2,2
P__multocida_Cas9,0,0,2,1,0,0,0,[[]],0,3
Parvibaculum_lavamentivorans_Cas9,0,0,0,0,0,0,0,[[]],0,0
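
The numeric columns (5 through 11) appear to count matches of that window length between `S__pyogenes_Cas9` and each other protein, with 5 being the minimum window passed to `PairwiseComp`. That reading is an assumption, as is the sketch below, which counts exact shared k-mers between two made-up sequences. It is not the `nepitope` algorithm.

```python
def shared_kmers(seq_a, seq_b, k):
    """Return (start, k-mer) for every length-k window of seq_a that also occurs in seq_b."""
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return [(i, seq_a[i:i + k])
            for i in range(len(seq_a) - k + 1)
            if seq_a[i:i + k] in kmers_b]


# Made-up sequences sharing the stretch "KRNYILG".
seq_a = "MKRNYILGLDIGITS"
seq_b = "AAKRNYILGAA"
counts = {k: len(shared_kmers(seq_a, seq_b, k)) for k in range(5, 9)}
print(counts)
```

As in the tables above, the counts shrink as the window grows, because every shared long window contains several shared shorter ones.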
