# Disorder model on larger datasets

Now that we have validated the effectiveness of our language-model-based system, we want to train it on larger datasets to further improve its accuracy. The candidate datasets are:

- Latest [DisProt](https://disprot.org/download) (2290 entries), 2022_06, all datasets, Disorder function aspect, consensus without ambiguous and obsolete
- Manually curated entries from [MobiDB](https://mobidb.org/help/apidoc)
- All entries in [MobiDB](https://mobidb.org/help/apidoc)

## Latest DisProt Dataset
First, we downloaded the FASTA file with the 2290 entries (`data/disprot/2022/DisProt release_2022_06 consensus regions.fasta`). Next, we have to add the actual amino acid sequence to each entry and replace the annotation labels with 0/1.
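
As a minimal sketch of the label conversion described above, the snippet below applies the same two substitutions used in `extract_disprot_dataset` to a toy annotation string. The helper name `to_binary_labels` and the toy string are made up for illustration and are not part of the notebook or of DisProt.

```python
import re


def to_binary_labels(annotation: str) -> str:
    """Map the annotation characters D, T, F and S to 1 and '-' to 0."""
    return re.sub(r"-", "0", re.sub(r"[DTFS]", "1", annotation))


# Toy annotation string (hypothetical, not a real DisProt entry).
toy = "--DDDD--TT-"
print(to_binary_labels(toy))  # 00111100110
```

The result has the same length as the input, which is what allows the later check of sequence length against label length.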

In [1]:
import re

from aiohttp import ClientSession
from tqdm.auto import tqdm

from bin.utils.extend_aa_scores import load_sequence_from_uniprot_session

In [2]:
def extract_disprot_dataset(file: str):
    items = []
    item = {"label": ""}
    with open(file) as handle:
        for line in handle:
            if line.startswith('>'):
                if "acc" in item:
                    items.append(item)
                    item = {"label": ""}
                item["acc"] = line.strip()
            elif len(line.strip()) > 0:
                item["label"] += re.sub(r"-", "0", re.sub(r"[DTFS]", "1", line.strip()))

    if "acc" in item:
        items.append(item)
    return items


# For some entries the UniProt sequence differs from the one annotated in DisProt. Use the DisProt sequence instead.
overrides = {
    "Q9NX55": "MRRRGEIDMATEGDVELELETETSGPERPPEKPRKHDSGAADLERVTDYAEEKEIQSSNLETAMSVIGDRRSREQKAKQEREKELAKVTIKKEDLELIMTEMEISRAAAERSLREHMGNVVEALIALTN",
    "Q03518": "MAELLASAGSACSWDFPRAPPSFPPPAASRGGLGGTRSFRPHRGAESPRPGRDRDGVRVPMASSRCPAPRGCRCLPGASLAWLGTVLLLLADWVLLRTALPRIFSLLVPTALPLLRVWAVGLSRWAVLWLGACGVLRATVGSKSENAGAQGWLAALKPLAAALGLALPGLALFRELISWGAPGSADSTRLLHWGSHPTAFVVSYAAALPAAALWHKLGSLWVPGGQGGSGNPVRRLLGCLGSETRRLSLFLVLVVLSSLGEMAIPFFTGRLTDWILQDGSADTFTRNLTLMSILTIASAVLEFVGDGIYNNTMGHVHSHLQGEVFGAVLRQETEFFQQNQTGNIMSRVTEDTSTLSDSLSENLSLFLWYLVRGLCLLGIMLWGSVSLTMVTLITLPLLFLLPKKVGKWYQLLEVQVRESLAKSSQVAIEALSAMPTVRSFANEEGEAQKFREKLQEIKTLNQKEAVAYAVNSWTTSISGMLLKVGILYIGGQLVTSGAVSSGNLVTFVLYQMQFTQAVEVLLSIYPRVQKAVGSSEKIFEYLDRTPRCPPSGLLTPLHLEGLVQFQDVSFAYPNRPDVLVLQGLTFTLRPGEVTALVGPNGSGKSTVAALLQNLYQPTGGQLLLDGKPLPQYEHRYLHRQVAAVGQEPQVFGRSLQENIAYGLTQKPTMEEITAAAVKSGAHSFISGLPQGYDTEVDEAGSQLSGGQRQAVALARALIRKPCVLILDDATSALDANSQLQVEQLLYESPERYSRSVLLITQHLSLVEQADHILFLEGGAIREGGTHQQLMEKKGCYWAMVQAPADAPE",
    "Q9UJX3": "MDPGDAAILESSLRILYRLFESVLPPLPAALQSRMNVIDHVRDMAAAGLHSNVRLLSSLLLTMSNNNPELFSPPQKYQLLVYHADSLFHDKEYRNAVSKYTMALQQKKALSKTSKVRPSTGNSASTPQSQCLPSEIEVKYKMAECYTMLKQDKDAIAILDGIPSRQRTPKINMMLANLYKKAGQERPSVTSYKEVLRQCPLALDAILGLLSLSVKGAEVASMTMNVIQTVPNLDWLSVWIKAYAFVHTGDNSRAISTICSLEKKSLLRDNVDLLGSLADLYFRAGDNKNSVLKFEQAQMLDPYLIKGMDVYGYLLAREGRLEDVENLGCRLFNISDQHAEPWVVSGCHSFYSKRYSRALYLGAKAIQLNSNSVQALLLKGAALRNMGRVQEAIIHFREAIRLAPCRLDCYEGLIECYLASNSIREAMVMANNVYKTLGANAQTLTLLATVCLEDPVTQEKAKTLLDKALTQRPDYIKAVVKKAELLSREQKYEDGIALLRNALANQSDCVLHRILGDFLVAVNEYQEAMDQYSIALSLDPNDQKSLEGMQKMEKEESPTDATQEEDVDDMEGSGEEGDLEGSDSEAAQWADQEQWFGMQ",
    "K7J0R2": "MWSPAILLLLIGATFANQQNGWTNGKQYTYAINSRTIATFNQQSKYLSGIVIEAYLTVQPNGEDTLRAKIWQPRYSPIHTQLENGWDSEIPQNLINLQTFPLSGKPFEIKTKNGVVRDLIVDKDVPTWEVNVLKGIVSQLQIDTSGENVKKSKRNQLPEENQPFAFFKAMEDSVGGKCEVLYDISPLPEQVLQNKPELAPMPELREDGDMISLVKTKNYSNCEQRAGYHFNINGRNAWEPGSNENRKYLSRSSVSRVIISGNLRKYTIQSSVTTNKVVHHADNQEENQQGMVASRMNLTLHKVEDMSEPMESPVNPQSTGNLVYNYNSPIDSISARRPNKYNQKGRSDEKNKNSDESDSESDSDGSVFDNNDDSYLQPKPKLTDAPLSPLLPFFIGNNGNSILKNKKVDAVKSATSIAQEIGNEMQNPDIMFAEQTLEKFTILSKLIRTMNSEQIASVQRSLYERAQSLNQLKQNNPEQLSRRNAWVAFRDAVAQAGTGPALVNIKQWVQNKQIEGTEATHVIDTLAKSVRIPTPEYMDTYFELIKMEEVKRELIVRDAAVLSFADLIRHAVVNKKSAHNHYPVHAFGRLLPKNFRQLHEKYIPYLEEELLKAVDAGDSRRIHTYTIALGKTAHPRVLAVFEPYLEGKKPISPYQRLVMVLSLNKLASIFPKVGRSVLYKIYSNTADYHEIRTAAVYLLMQSNPSASMLQRMAEFTNYDTSKYVNSAVKSTIESLAQLHDNHEYQGLLDSARAAQPLLTSESYGPQYSKQMFFNLRNPLTQSDYFIQASTIGSEDSIIPKGVYVITIPTYNGMKMPKIEIGGEVSSLKNLWNFVQQRISNSQRSDSNEKPENQKYSPENLAKLLGIYGEETEQIEGFAFINDKFANHFLTFDNHTLEKIPGMLRQLAEDMKQGRSFDATKLKNFEVTISFPTETGFPFRFTVKNPTITSVSGVSHLKTTSGSGSRSEWPKASLSGNVRIVYGLQTQKRLGFVTPFEHQEYMVGIDKDMQVYLPVRSEIEYDVNKGETRLRIQPNENLDEFKIIQYRTQPFTSKHDILNLEPITKDSNTATVHKNRATSSQIELNDNNNKQRLQFNWERQMRHLEEEIGNSYNKRQNAMEAMCKLTQSISSMFYLNSVDSEYQKYSVKVSPGSDMSAEMRISHDSMITENSENTDNSESWSPNAKTVHLERSLSEQERKQTLLKEASKNINSAEANVVDISLQLNGDMQSSVALTAAFADSNVDRKSRALLYASVETKGGQDYHVSAGFEGKNPNIESLDFEEILKANDRREYDLNVHYGIGTNENDENKQNRIKVRGEIKQTEERKKQIRQSHDARVCMKQQSLHGDKMTSACKRINKRASLADAGDFTVTFPNKSPMREIVMSAWDAAERMTQSVSHSWKNRMIKEEDNKVKVTFEMSPNDEKVDVTVKTPEGQIQLNNIKVALISNKNNGNVKDNRNEDDEELNKLNDNVCQLDKTQARTFDNHRYPLQLGSCWHIAMTPYPKHDPDTPSKKLEIPENMQVSILTRENENGQKELKITLGESLIELSASGPRQTHAKVNGNKVHYSKHKSYKEKKHGKVLFELFELSDESLKLVSKKYDIEIVYDGYRAQIETGERYRDSVRGLCGNNDGESMNDQQTPKGCLLQKPEEFSATYALTNDDQCQGPAIRNADEAKKSQCSYQTIRPGNVISEKEAGRETELSQDSDGAKHCMTHRTKIIRSKNEICFSLRPIPTCLSKCSPSSIKSKAIPFHCVAKNSASQKVAERVEKGANPDLTQKSVSKTLTEQLPINCKA",
    "M0Y2D5": "MIMSDPAMLPPGFRFHPTDEELILHYLRNRAAQSPCPVSIIADVDIYKFDPWALPSKASYGDREWYFFTPRDRKYPNGVRPNRAAGSGYWKATGTDKPIRCSATGESVGVKKALVFYKGRPPKGIKTNWIMHEYRLAAADAHAANTYRPMKFRNASMRLDDWVLCRIYKKTSQVSPMAVPPLSDHELDEPSGAGAYPMSSAGMTMQGGAGGYTLQAAVPGTQRMPKIPSISELLNDYSLAQLFDDSGHALMARHDQHAALFGHPIMSQFHVNSSGNNMSQLGQMDSPASTSVARDGAAGKRKRLSEEDGEHNGSTSQPAAAVTNKKPNSSCFGATTFQVGNNTLQGSLGQPLLHF",
    "P01019": "MRKRAPQSEMAPAGVSLRATILCLLAWAGLAAGDRVYIHPFHLVIHNESTCEQLAKANAGKPKDPTFIPAPIQAKTSPVDEKALQDQLVLVAAKLDTEDKLRAAMVGMLANFLGFRIYGMHSELWGVVHGATVLSPTAVFGTLASLYLGALDHTADRLQAILGVPWKDKNCTSRLDAHKVLSALQAVQGLLVAQGRADSQAQLLLSTVVGVFTAPGLHLKQPFVQGLALYTPVVLPRSLDFTELDVAAEKIDRFMQAVTGWKTGCSLMGASVDSTLAFNTYVHFQGKMKGFSLLAEPQEFWVDNSTSVSVPMLSGMGTFQHWSDIQDNFSVTQVPFTESACLLLIQPHYASDLDKVEGLTFQQNSLNWMKKLSPRTIHLTMPQLVLQGSYDLQDLLAQAELPAILHTELNLQKLSNDRIRVGEVLNSIFFELEADEREPTESTQQLNKPEVLEVTLNRPFLFAVYDQSATALHFLGRVANPLSTA",
    "Q12983": "MGDAAADPPGPALPCEFLRPGCGAPLSPGAQLGRGAPTSAFPPPAAEAHPAARRGLRSPQLPSGAMSQNGAPGMQEESLQGSWVELHFSNNGNGGSVPASVSIYNGDMEKILLDAQHESGRSSSKSSHCDSPPRSQTPQDTNRASETDTHSIGEKNSSQSEEDDIERRKEVESILKKNSDWIWDWSSRPENIPPKEFLFKHPKRTATLSMRNTSVMKKGGIFSAEFLKVFLPSLLLSHLLAIGLGIYIGRRLTTSTSTF",
}


async def add_uniprot_sequence(session: ClientSession, item: dict):
    uniprot_id = re.search(r"full acc=([A-Z\d-]+)", item['acc']).group(1)
    if uniprot_id in overrides:
        item['seq'] = overrides[uniprot_id]
    else:
        item['seq'] = await load_sequence_from_uniprot_session(session, uniprot_id)
    return item

In [3]:
disorder_items = extract_disprot_dataset("../data/disprot/2022/DisProt release_2022_06 consensus regions.fasta")

In [4]:
async with ClientSession() as session:
    disorder_items_with_seqs = await tqdm.gather(*[add_uniprot_sequence(session, i) for i in disorder_items],
                                                 desc='Loading sequences')


In [5]:
# Check that seq and label have the same length
diffs = list(filter(lambda i: len(i['seq']) != len(i['label']), disorder_items_with_seqs))
print(f"{len(diffs)} entries have a difference in lengths of sequence and label:\n")
for d in diffs:
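    # Same pattern as in add_uniprot_sequence, so IDs containing a hyphen (e.g. isoforms) are not truncated
    uniprot_id = re.search(r"full acc=([A-Z\d-]+)", d['acc']).group(1)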
    print(f"\t{uniprot_id} has seq_len {len(d['seq'])} and label_len {len(d['label'])}")

0 entries have a difference in lengths of sequence and label:



In [6]:
# Write correct sequences to file
filtered = list(filter(lambda i: len(i['seq']) == len(i['label']), disorder_items_with_seqs))
with open('../data/disprot/2022/disprot-disorder-2022-unclustered.txt', 'w') as handle:
    for i in filtered:
        handle.write(f"{i['acc']}\n{i['seq']}\n{i['label']}\n")

with open('../data/disprot/2022/disprot-disorder-2022-seqs.fasta', 'w') as handle:
    for i in filtered:
        handle.write(f"{i['acc']}\n{i['seq']}\n")

One thing I noticed immediately was that the labels for the old DisProt dataset changed slightly. The question now is whether the model was "smarter" than the previous labels. To investigate this, we evaluate the model trained on the old dataset on the new dataset (Raven: `disprot_2022_trained_on_2018.out`).

The scores on the full 2022 data should be taken with a grain of salt, since this data also includes samples the 2018 model was trained on.

| Tested on all 2022 data | BAC   | F1    | MCC   |
|-------------------------|-------|-------|-------|
| 2018 model              | 0.747 | 0.656 | 0.489 |
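
For reference, the three reported metrics can be written in terms of the confusion matrix. The sketch below uses the standard definitions of balanced accuracy (BAC), F1 and Matthews correlation coefficient (MCC) on a toy example. The notebook does not show how the scores were actually computed (for example, pooled over all residues or averaged per protein), so the function name `binary_metrics` and the toy labels are assumptions for illustration only.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute BAC, F1 and MCC for binary labels (1 = disorder, 0 = order)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    # Note: degenerate inputs (e.g. no positives at all) would divide by zero here.
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)

    bac = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"BAC": bac, "F1": f1, "MCC": mcc}


# Toy per-residue labels and predictions (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
# {'BAC': 0.733, 'F1': 0.667, 'MCC': 0.467}
```

BAC and MCC both account for the true negatives, which matters here because ordered residues clearly outnumber disordered ones.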

### Excursion: AlphaFold models for new DisProt

We also want to investigate how the Spearman correlation between pLDDT and the true labels behaves across all sequences for which we have true labels. This can then be compared with the pLDDT-IUPred correlation.
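
To make the quantity concrete, here is a minimal, dependency-free sketch of a Spearman correlation between per-residue pLDDT values and binary disorder labels. Because the labels are binary they contain many ties, so the sketch uses average ranks. The pLDDT values and labels are invented toy data, and the helper names are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): confident residues are ordered, the rest disordered.
plddt = [95.0, 90.0, 85.0, 40.0, 35.0, 30.0]
labels = [0, 0, 0, 1, 1, 1]
rho = spearman(plddt, labels)
print(f"{rho:.3f}")  # negative: low pLDDT coincides with disorder
```

If pLDDT is informative about disorder, the correlation with the 0/1 labels should come out negative, as in the toy example.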

The experiment itself can be found in the `new_disprot_alphafold_correlation` file.

In [10]:
import os

# Create one fasta file for each sequence of interest to run through AlphaFold
fasta_folder = '../data/disprot/2022/sequences'
os.makedirs(fasta_folder, exist_ok=True)
for i in filtered:
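    # Same pattern as in add_uniprot_sequence, so IDs containing a hyphen (e.g. isoforms) get their own file
    uniprot_id = re.search(r"full acc=([A-Z\d-]+)", i['acc']).group(1)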
    with open(f'{fasta_folder}/{uniprot_id}.fasta', 'w') as handle:
        handle.write(f"{i['acc']}\n{i['seq']}\n")

### Building training, validation, and test sets

To ensure sound training and evaluation, we need to remove redundancy in the sets. To do this, we cluster with [CD-HIT](http://weizhong-lab.ucsd.edu/cdhit-web-server/cgi-bin/index.cgi?cmd=cd-hit) using a 50% sequence identity cutoff and 70% alignment coverage of the longer sequence (`-aL`). This leaves 1,997 sequences in total, which we split into 70%/10%/20% train/validation/test sets. This means more than 3x the training data compared to flDPnn.

In [7]:
from bin.utils.extend_aa_scores import read_fasta_seqs

clustered_seqs = read_fasta_seqs('../data/disprot/2022/disprot-disorder-2022-clustered-seqs.fasta')
clustered_accs = set(seq['acc'] for seq in clustered_seqs)

In [8]:
nonredundant_disorder_items = list(filter(lambda item: item['acc'] in clustered_accs, filtered))

In [9]:
from sklearn.model_selection import train_test_split

def write_disorder_items_to_file(items, filename):
    with open(filename, 'w') as handle:
        for i in items:
            handle.write(f"{i['acc']}\n{i['seq']}\n{i['label']}\n")

# set aside 20% of all items as test data for evaluation
temp_items, test_items = train_test_split(nonredundant_disorder_items, test_size=0.2, random_state=9)
# set aside 10% of all items as validation data (12.5% x 80% = 10%)
train_items, val_items = train_test_split(temp_items, test_size=0.125, random_state=9)

write_disorder_items_to_file(test_items, '../data/disprot/2022/disprot-disorder-2022-test.txt')
write_disorder_items_to_file(val_items, '../data/disprot/2022/disprot-disorder-2022-val.txt')
write_disorder_items_to_file(train_items, '../data/disprot/2022/disprot-disorder-2022-train.txt')
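
The two-step split can be checked with a little arithmetic. The sketch below assumes that all 1,997 clustered sequences end up in `nonredundant_disorder_items` and that `train_test_split` rounds a fractional test size up to the next integer, which is how scikit-learn behaves to my knowledge. The helper `split_sizes` is hypothetical and only mirrors that rounding.

```python
import math


def split_sizes(n_total, test_fraction=0.2, val_fraction_of_rest=0.125):
    """Return (train, val, test) sizes for the two-step split used above."""
    n_test = math.ceil(test_fraction * n_total)      # first split: 20% test
    n_rest = n_total - n_test
    n_val = math.ceil(val_fraction_of_rest * n_rest)  # second split: 12.5% of the rest
    n_train = n_rest - n_val
    return n_train, n_val, n_test


print(split_sizes(1997))  # (1397, 200, 400)
```

Under these assumptions the split is roughly 70.0% / 10.0% / 20.0% of the 1,997 sequences.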

The training dataset has the following composition of labels: 125901 / 493916 (disorder=1 / order or unknown=0).
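
These counts imply a clear class imbalance, which the short calculation below quantifies. Whether the training setup compensates for it (for example with a positive-class weight in the loss) is not stated in this notebook, so the `pos_weight` value is only an illustrative assumption.

```python
n_disorder = 125901  # residues labelled 1
n_order = 493916     # residues labelled 0 (order or unknown)

disorder_fraction = n_disorder / (n_disorder + n_order)
pos_weight = n_order / n_disorder  # hypothetical weight for the positive class

print(f"{disorder_fraction:.3f}")  # 0.203
print(f"{pos_weight:.2f}")         # 3.92
```

Roughly one residue in five is labelled as disordered, which is one reason to report BAC and MCC rather than plain accuracy.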

We then trained the existing DisProt model on the new data (Raven: `train_disprot_2022.out`).

| Validation results                      | BAC  | F1   | MCC  |
|-----------------------------------------|------|------|------|
| Model trained on 2022 validated on 2022 | 0.72 | 0.64 | 0.44 |
| Model trained on 2018 validated on 2018 | 0.75 | 0.64 | 0.52 |

### Hyperparameters on latest DisProt

With more data, the question arises whether the learning rate and the model are still the best fit. Therefore, we will also run hyperparameter tuning again (`hyperparam_tune_on_disprot_2022.out`).

```
"model_name": tune.choice(['facebook/esm-1b', 'Rostlab/prot_bert_bfd', 'Rostlab/prot_t5_xl_half_uniref50-enc']),
"learning_rate": tune.loguniform(1e-5, 1e-2),
"encoder_learning_rate": tune.loguniform(5e-6, 1e-2),
```
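
To illustrate what a log-uniform search range means for the learning rates above, the following sketch draws samples uniformly in log space using only the standard library. It is not Ray Tune's implementation, and the helper `sample_loguniform` is a hypothetical stand-in for illustration.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-5, 1e-2, rng) for _ in range(10000)]

# Each of the three decades (1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2) should
# receive about a third of the samples.
share_lowest_decade = sum(s < 1e-4 for s in samples) / len(samples)
print(f"{share_lowest_decade:.2f}")
```

Sampling this way spends as many trials on small learning rates as on large ones, whereas a plain uniform draw on the same range would place about 90% of the trials between 1e-3 and 1e-2.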

## CheZOD dataset from ODiNPred

We downloaded both sets from https://github.com/protein-nmr/CheZOD.

