<a href="https://colab.research.google.com/github/alibekk93/IDP_analysis/blob/RAPID/notebooks/overal_abundance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting UniProt proteomes for Tempura species

## Setup

In [None]:
!pip install BIO



In [None]:
!git clone -b RAPID https://github.com/alibekk93/IDP_analysis
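# `!cd` runs in a subshell and would not change the notebook's working directory,
# so it is omitted; the imports below expect to run from /content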

fatal: destination path 'IDP_analysis' already exists and is not an empty directory.


In [None]:
from IDP_analysis.packages_import import *
from IDP_analysis.idp_in_bacteria_functions import *



In [None]:
sns.set_theme(context='paper', style='white',  palette='colorblind')

Loading Tempura dataset

In [None]:
# tempura = pd.read_csv('/content/IDP_analysis/datafiles/tempura/200617_TEMPURA.csv', index_col=0)
# tempura = pd.read_csv('/content/IDP_analysis/datafiles/tempura/tempura_bacteria_uniprot.csv', index_col=0)
tempura = pd.read_csv('/content/IDP_analysis/datafiles/tempura/tempura_filtered.csv', index_col=0)

Only keeping bacteria with available assembly or accession numbers

In [None]:
# tempura = tempura[tempura['superkingdom']=='Bacteria']
# tempura.dropna(subset='assembly_or_accession', inplace=True)
# tempura.reset_index(drop=True, inplace=True)

Classifying bacteria into groups by optimal growth temperature (OGT, °C):
1. Psychrophile: OGT <= 20
2. Mesophile: 20 < OGT <= 40
3. Thermophile: 40 < OGT

The OGT <= 20 cutoff for psychrophiles may be quite liberal, but some "psychrophiles" with OGT = 20 have *antarctica* in their species name, so it should be fair enough.

In [None]:
tempura['group'] = pd.cut(tempura['Topt_ave'], bins=[-float('inf'), 20, 40, float('inf')],
                          labels=['psychrophilic', 'mesophilic', 'thermophilic'])
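
To check which group the boundary temperatures fall into, here is a small sketch on made-up OGT values (the `toy` frame is hypothetical and not part of the dataset). `pd.cut` uses right-closed bins by default, so an OGT of exactly 20 lands in the psychrophilic group and 40 in the mesophilic group, matching the list above.

```python
import pandas as pd

# Hypothetical OGT values chosen to sit on and around the bin edges
toy = pd.DataFrame({'Topt_ave': [4.0, 20.0, 20.5, 40.0, 40.5, 85.0]})

# Same bins and labels as in the notebook; bins are right-closed by default
toy['group'] = pd.cut(toy['Topt_ave'], bins=[-float('inf'), 20, 40, float('inf')],
                      labels=['psychrophilic', 'mesophilic', 'thermophilic'])

groups = toy['group'].astype(str).tolist()
print(groups)
```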

Loading all_proteins

In [None]:
# all_proteins = pd.read_csv('/content/all_proteins.csv', index_col=0)
# all_proteins_filtered = pd.read_csv('/content/all_proteins_filtered.csv', index_col=0)
all_proteins_rapid = pd.read_csv('/content/all_proteins_rapid.csv', index_col=0)

Merging all_proteins with Tempura

In [None]:
all_protein_tempura = all_proteins_rapid.merge(tempura, left_on='Species', right_on='genus_and_species')
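
The merge above uses the pandas default (`how='inner'`), so proteins whose `Species` has no match in `genus_and_species` are dropped without warning. A minimal sketch of how one might list the unmatched species, using two hypothetical stand-in frames with made-up values:

```python
import pandas as pd

# Hypothetical stand-ins for all_proteins_rapid and tempura (values are made up)
proteins = pd.DataFrame({'ID': ['P1', 'P2', 'P3'],
                         'Species': ['Bacillus subtilis', 'Bacillus subtilis',
                                     'Unknown species']})
species = pd.DataFrame({'genus_and_species': ['Bacillus subtilis', 'Thermus aquaticus'],
                        'Topt_ave': [30.0, 70.0]})

# A left merge with indicator=True adds a '_merge' column that marks
# proteins for which no species row was found
checked = proteins.merge(species, left_on='Species', right_on='genus_and_species',
                         how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Species'].unique().tolist()
print(unmatched)
```

If `unmatched` is non-empty, the species names probably differ in spelling or formatting between the two tables.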

Loading RAPID_disorder values

In [None]:
# rapid_disorder_values = pd.read_csv('/content/IDP_analysis/datafiles/RAPID/RAPID_disorder_values.csv', index_col=0)

## Combining partial RAPID results and loading into all_proteins

In [None]:
rapid_disorder_values = pd.DataFrame(columns = ['Prot. ID', 'Disorder Content %'])

In [None]:
# number of csv files with partial RAPID calculation results (01.csv ... 17.csv)
n = 17
# iterate through each file and concatenate it to rapid_disorder_values
for i in tqdm(range(1, n + 1)):
  # zero-pad the file number to two digits, e.g. 01.csv
  filename = f'{i:02d}.csv'
  # read csv with RAPID result and keep only the ID and disorder columns
  rapid_result = pd.read_csv(filename)
  rapid_result = rapid_result[['Prot. ID', 'Disorder Content %']]
  # append RAPID disorder prediction to overall dataframe
  rapid_disorder_values = pd.concat([rapid_disorder_values, rapid_result], axis=0)
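
An alternative to growing a dataframe inside the loop is to read all parts into a list and concatenate once, which avoids repeated copying and the empty starting frame. The sketch below is self-contained: it writes two hypothetical partial result files into a temporary directory first, so the file contents are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

cols = ['Prot. ID', 'Disorder Content %']

with tempfile.TemporaryDirectory() as tmp:
    # Write two hypothetical partial result files, 01.csv and 02.csv
    fake_parts = [[('A1', 10.0)], [('B1', 55.5), ('B2', 0.0)]]
    for i, rows in enumerate(fake_parts, start=1):
        pd.DataFrame(rows, columns=cols).to_csv(Path(tmp) / f'{i:02d}.csv', index=False)

    # Read every part first, then concatenate a single time
    parts = [pd.read_csv(Path(tmp) / f'{i:02d}.csv', usecols=cols)
             for i in range(1, len(fake_parts) + 1)]
    combined = pd.concat(parts, ignore_index=True)

print(combined)
```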

In [None]:
# rapid_disorder_values.to_csv('RAPID_disorder_values.csv')

Appending RAPID disorder values to all_proteins_filtered

In [None]:
# all_proteins_rapid = all_proteins_filtered.join(rapid_disorder_values.set_index('Prot. ID'),
#                                                    on='ID', how='inner')

In [None]:
all_proteins_rapid['RAPID_disorder'] = all_proteins_rapid['Disorder Content %'] / 100
all_proteins_rapid.drop('Disorder Content %', axis=1, inplace=True)

In [None]:
# all_proteins_rapid.to_csv('all_proteins_rapid.csv')

## FCR / NCPR filtering

In [None]:
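def compute_fcr(row):
    """Return f_plus, f_minus, NCPR and FCR for the protein in row['Sequence']."""
    aa_seq = row['Sequence']
    n = len(aa_seq)
    # fraction of positively charged residues (open question: should histidine count?)
    f_plus = sum(aa_seq.count(char) for char in ['R', 'K', 'H']) / n
    # fraction of negatively charged residues
    f_minus = sum(aa_seq.count(char) for char in ['D', 'E']) / n
    # net charge per residue, taken as an absolute value
    ncpr = abs(f_plus - f_minus)
    # fraction of charged residues
    fcr = (f_plus + f_minus)
    return pd.Series([f_plus, f_minus, ncpr, fcr])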
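
As a worked example of these quantities, the sketch below computes them for a single made-up peptide, with and without histidine in the positive set. `charge_fractions` is a hypothetical standalone helper that mirrors the arithmetic of `compute_fcr`; it is not part of the repository.

```python
def charge_fractions(seq, positive='RKH', negative='DE'):
    """Return (f_plus, f_minus, ncpr, fcr) for one amino acid sequence."""
    n = len(seq)
    f_plus = sum(seq.count(aa) for aa in positive) / n
    f_minus = sum(seq.count(aa) for aa in negative) / n
    return f_plus, f_minus, abs(f_plus - f_minus), f_plus + f_minus

# Hypothetical 10-residue peptide: 2 K, 1 R, 1 H, 1 D, 5 G
seq = 'KKRHDGGGGG'

with_his = charge_fractions(seq)                     # H counted as positive
without_his = charge_fractions(seq, positive='RK')   # H ignored

print([round(x, 2) for x in with_his])
print([round(x, 2) for x in without_his])
```

Counting histidine moves this peptide from FCR 0.4 to FCR 0.5, so the choice can shift proteins across the type thresholds used below.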

In [None]:
all_proteins_types = all_proteins_rapid.copy()

In [None]:
all_proteins_types[['f_plus', 'f_minus', 'ncpr', 'fcr']] = all_proteins_types.apply(compute_fcr, axis=1)

In [None]:
all_proteins_types['idp_type'] = None

In [None]:
all_proteins_types.loc[(all_proteins_types['fcr'] < 0.25) & (all_proteins_types['ncpr'] < 0.25), 'idp_type'] = 1
all_proteins_types.loc[(all_proteins_types['fcr'] >= 0.25) & (all_proteins_types['fcr'] <= 0.35) &\
                      (all_proteins_types['ncpr'] <= 0.35), 'idp_type'] = 2
all_proteins_types.loc[(all_proteins_types['fcr'] > 0.35) & (all_proteins_types['ncpr'] <= 0.35), 'idp_type'] = 3
all_proteins_types.loc[(all_proteins_types['fcr'] > 0.35) & (all_proteins_types['ncpr'] > 0.35) &\
                      (all_proteins_types['f_minus'] > 0.35), 'idp_type'] = 4
all_proteins_types.loc[(all_proteins_types['fcr'] > 0.35) & (all_proteins_types['ncpr'] > 0.35) &\
                      (all_proteins_types['f_plus'] > 0.35), 'idp_type'] = 5
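
The five assignments above can also be read as a single decision rule. The sketch below restates them as a plain function and checks one made-up point per type; `classify_idp_type` is a hypothetical helper, not a function from the repository.

```python
def classify_idp_type(f_plus, f_minus):
    """Return the idp_type (1 to 5) for given charged-residue fractions, or None."""
    fcr = f_plus + f_minus
    ncpr = abs(f_plus - f_minus)
    if fcr < 0.25 and ncpr < 0.25:
        return 1
    if 0.25 <= fcr <= 0.35 and ncpr <= 0.35:
        return 2
    if fcr > 0.35 and ncpr <= 0.35:
        return 3
    if fcr > 0.35 and ncpr > 0.35 and f_minus > 0.35:
        return 4
    if fcr > 0.35 and ncpr > 0.35 and f_plus > 0.35:
        return 5
    return None

# One hypothetical (f_plus, f_minus) pair per type
examples = [(0.05, 0.10), (0.15, 0.15), (0.25, 0.25), (0.02, 0.50), (0.50, 0.02)]
types = [classify_idp_type(p, m) for p, m in examples]
print(types)
```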

Flag "candidate IDP" proteins. A protein is a candidate if it meets at least one of these criteria:
1. IDP type 3, 4 or 5
2. RAPID_disorder >= 0.5
3. Have at least 100 disordered residues as predicted by RAPID

The last criterion is there to include longer proteins that contain only a section of IDR. RAPID does not provide disorder at the residue level, so a threshold on the estimated number of disordered residues (disorder fraction times length) is as close as we can get to a filter for 30 consecutive disordered residues.

In [None]:
candidate_filter = (all_proteins_types['idp_type'] > 2) | \
                    (all_proteins_types['RAPID_disorder'] >= 0.5) | \
                    (all_proteins_types['RAPID_disorder'] * all_proteins_types['Length'] >= 100)
all_proteins_types['candidate_idp'] = candidate_filter
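
To illustrate how the three criteria combine, here is a sketch on four made-up proteins, each chosen to trigger a different criterion or none at all. The `toy` frame and its `disordered_residues` column are hypothetical.

```python
import pandas as pd

# Hypothetical proteins; values are made up to trigger each criterion once
toy = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'idp_type': [3, 1, 1, 1],
    'RAPID_disorder': [0.10, 0.60, 0.20, 0.20],
    'Length': [100, 100, 600, 100],
})

# Estimated number of disordered residues = disorder fraction * length
toy['disordered_residues'] = toy['RAPID_disorder'] * toy['Length']

# Same OR-combination as in the notebook
toy['candidate_idp'] = ((toy['idp_type'] > 2) |
                        (toy['RAPID_disorder'] >= 0.5) |
                        (toy['disordered_residues'] >= 100))

flags = toy['candidate_idp'].tolist()
print(flags)
```

Protein `c` is only 20% disordered but is long enough to carry roughly 120 predicted disordered residues, which is the case the third criterion is meant to catch.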

In [None]:
# fig, ax = plt.subplots(figsize=(8,8))

# sns.scatterplot(data=all_proteins_types, x='f_plus', y='f_minus',
#                 hue='idp_type')

# ax.set_xlim(0, 0.75)
# ax.set_ylim(0, 0.75)

# fig.show()

In [None]:
# fig.savefig('IDP_types.svg')

In [None]:
all_proteins_types.to_csv('all_proteins_types.csv')