# Conservation Analysis and Epitope Prediction


#### Author: C. Mazzaferro, K. Fisch
#### Email: cmazzafe@ucsd.edu
#### Date: August 2016
 
## Outline of Notebook
<a id = "toc"></a>
1. <a href = "#background">Background</a>
2. <a href = "#ConsA">Conservation Analysis</a>
    * <a href = "#BLAST">BLAST-P</a>
    * <a href = "#MSA">Multiple Sequence Alignment</a>
    * <a href = "#Cons">Conservation Score Prediction</a>
3. <a href = "#ep_pred">Windowing and New Epitope Prediction</a>
    * <a href = "#clustering">Epitope Clustering Analysis</a>

This workflow searches for candidate substitute peptides by swapping one amino acid (AA) at a time in each peptide from a prioritized list. The prioritized list was produced by the Base Workflow.

## Finding similar peptides by doing simple AA swaps
The swapped peptides are written to FASTA files, which are then submitted to netMHCcons. The results are analyzed in the sections that follow.
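
The swap step can be illustrated with a minimal sketch. This is not the implementation of `pep_utils.find_swaps_write_to_fasta`; the function names `single_aa_swaps` and `to_fasta`, the FASTA header format, and the choice to skip the original residue at each position are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_aa_swaps(peptide):
    """Return every peptide that differs from `peptide` at exactly one position."""
    variants = []
    for i, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:  # assumption: the unchanged peptide is not emitted
                variants.append(peptide[:i] + aa + peptide[i + 1:])
    return variants


def to_fasta(peptides, prefix="swap"):
    """Format peptides as FASTA text with hypothetical headers such as '>swap_0'."""
    return "".join(
        ">{}_{}\n{}\n".format(prefix, n, pep) for n, pep in enumerate(peptides)
    )


variants = single_aa_swaps("HHQDLTLL")
fasta_text = to_fasta(variants)
```

Under these assumptions an 8-mer made of standard residues yields 8 × 19 = 152 variants, one FASTA record each.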


In [4]:
import pandas

csv_path = '/Users/carlomazzaferro/Desktop/top_peptides_of_interest.csv'
high_aa_low_cons_df = pandas.read_csv(csv_path)

In [5]:
from nepitope import pep_utils
fasta_files_dir = '/Users/carlomazzaferro/Desktop/BINF_rand_files/CAS9_stuff/swapping_fasta_files/'
pep_utils.find_swaps_write_to_fasta(high_aa_low_cons_df, fasta_files_dir)

### Load Results From netMHCcons
The swaps are sent to netMHCcons, and the results are aggregated by the class `FileConsolidation` in the module `scoring_utils`.

In [6]:
from nepitope import scoring_utils, summary_data
import glob
import os

nmers = [8, 9, 10, 11]

# Files saved from netMHCcons as split_*i*_prediction_*n*_mer for each split/nmer
filepath = '/Users/carlomazzaferro/Desktop/BINF_rand_files/CAS9_stuff/swapping_fasta_files/swapping_preds/'

file_pattern = '_swap_*'

aggregate_all = scoring_utils.FileConsolidation.load_batch(filepath, file_pattern)

# Recover the original peptide and its position from each prediction file name
files = glob.glob(filepath + '_s*')
original_peps_and_pos = []

for file_path in files:
    name_fields = os.path.basename(file_path).split('_')
    original_peps_and_pos.append([name_fields[2], name_fields[4]])
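
To make the indexing in the loop above concrete, here is a small sketch of how an underscore-separated file name splits into fields. The file name used here is hypothetical; the actual names produced for the netMHCcons results may differ.

```python
import os

# Hypothetical file name, for illustration only.
example_path = "/tmp/swapping_preds/_swap_HHQDLTLL_pos_1688_preds.xls"

fields = os.path.basename(example_path).split("_")
# The leading underscore produces an empty first field, so the
# peptide lands at index 2 and the position at index 4.
original_peptide, position = fields[2], fields[4]
```

Note that the position is kept as a string, which is what the later `"_".join(...)` call that builds output file names expects.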

### Return them in a list of dataframes
summary_data is the module that performs all the data aggregation on newly found peptides.

In [7]:
list_results = aggregate_all.return_df_list()

In [8]:
summ_data = summary_data.SummaryData()
summ_data_analysis = summ_data.summarize_data_for_each_mhc_pred(list_results, original_peps_and_pos, high_aa_low_cons_df)

### Return summary data in a dataframe
The summary is available as the `summary_df` attribute of the returned object, which also provides other methods and attributes.

In [9]:
summ_data_analysis.summary_df.sort_values(by='original pos').head(8)

| | nmer | allele | num high affinity peps | num med affinity peps | num low affinity peps | num no affinity peps | original peptide | original pos | top scoring peptides |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 9 | HLA-B0702 | 81 | 63 | 11 | 25 | -PSKKFKVL | 40 | [-DSKKFKVL, -CSKKFKVL, -WSKKFKVL, -ESKKFKVL, -... |
| 4 | 9 | HLA-B0801 | 98 | 58 | 22 | 2 | -PSKKFKVL | 40 | [-PSKKFKVK, -PSKKFKVR, -PSKKFKVN, -PSKKFKVD, -... |
| 18 | 9 | HLA-A2601 | 111 | 37 | 22 | 10 | NIVDEVA-Y | 196 | [NIVDEVA-G, NIVDEVA-D, NIVDEVA-N, NIVDEVA-P, N... |
| 13 | 8 | HLA-A0101 | 0 | 6 | 97 | 57 | IVDEVA-Y | 197 | [IVDEVA-N, IVDEVA-Q, IVDEVA-G, IVDEVA-K, IVDEV... |
| 1 | 9 | HLA-A0201 | 142 | 13 | 19 | 6 | -LFGNLIAL | 1196 | [-HFGNLIAL, -PFGNLIAL, -DFGNLIAL, -RFGNLIAL, -... |
| 9 | 11 | HLA-B3901 | 0 | 46 | 146 | 28 | HHQDLTLLKAL | 1688 | [HHQDLTLLKAK, HHQDLTLLKAS, HHQDLTLLKAG, HHQDLT... |
| 8 | 8 | HLA-B3901 | 0 | 18 | 86 | 56 | HHQDLTLL | 1688 | [HHQDLTLK, HHQDLTLN, HHQDLTLD, HHQDLTLR, HHQDL... |
| 10 | 10 | HLA-B3901 | 0 | 18 | 114 | 68 | HQDLTLLKAL | 1689 | [HQDLTLLKAK, HQDLTLLKAS, HQDLTLLKAG, HQDLTLLKA... |
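
The four affinity counts in the summary can be thought of as a binning of predicted binding affinities. The sketch below shows one way such counts could be produced. The IC50 cutoffs of 50, 500, and 5000 nM are an assumption (they are commonly used conventions, not values taken from this notebook or from `summary_data`), and the predictions are made-up numbers.

```python
from bisect import bisect_left
from collections import Counter

# Assumed IC50 cutoffs in nM; not taken from the notebook.
THRESHOLDS = [50, 500, 5000]
LEVELS = ["high", "med", "low", "no"]


def affinity_level(ic50_nm):
    """Map a predicted IC50 (nM) to a level; a value equal to a cutoff
    falls into the stronger of the two neighboring bins."""
    return LEVELS[bisect_left(THRESHOLDS, ic50_nm)]


# Hypothetical predictions for four swapped peptides.
predicted_ic50 = {
    "HHQDLTLK": 12.0,
    "HHQDLTLN": 340.0,
    "HHQDLTLD": 2100.0,
    "HHQDLTLR": 26000.0,
}
counts = Counter(affinity_level(v) for v in predicted_ic50.values())
```

With these made-up values each level receives exactly one peptide; on real predictions the counts would correspond to the `num ... affinity peps` columns only if the same cutoffs were used.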


### Write files out

In [25]:
fasta_dir = '/Users/carlomazzaferro/Desktop/fasta_swaps_top_20_percent.fasta'
summ_data_analysis.write_peptides_to_fasta(fasta_dir)

In [32]:
csv_path = '/Users/carlomazzaferro/Desktop/CCBB_Results_Immune_Stealth_Analysis/summary_results_per_prediction.csv'
summ_data_analysis.summary_df.sort_values(by='original pos').to_csv(path_or_buf=csv_path, sep=',')

In [36]:
def write_to_csv(df, path_name):
    # Write one prediction DataFrame to a comma-separated file
    df.to_csv(path_name, sep=',')

In [43]:
basepath = '/Users/carlomazzaferro/Desktop/CCBB_Results_Immune_Stealth_Analysis/csv_for_each_prediction/'
# Assumes list_results and original_peps_and_pos are in the same order
for idx, val in enumerate(list_results):
    original_pep = original_peps_and_pos[idx][0]
    pos = original_peps_and_pos[idx][1]
    allele = val.Allele.unique()[0]
    nmer = val['n-mer'].unique()[0]
    # File name: <original peptide>_<position>_<allele>_<nmer>.csv
    name = "_".join([original_pep, pos, allele, str(nmer)])
    val['Peptide'] = val['Peptide'].str.replace('X', '-')
    write_to_csv(val, basepath + name + '.csv')