## <center>Prot2Vec</center>

## Importing Libraries

**Gensim:** This is the primary library used for the **Word2Vec algorithm**, which will create our protein embeddings.

**Biopython:** Essential for reading and parsing biological file formats like FASTA, which contains the amino acid sequences.

**tqdm:** A utility to show progress bars, helping us track how long the heavy data processing will take.

We import libraries that handle different parts of the pipeline:

- gzip and SeqIO are used to open and read the large, compressed protein databases (UniProt).
- Word2Vec is the core AI model that will learn the relationships between protein patterns.
- re (regular expressions) lets us keep only sequences made up entirely of the 20 standard amino acid letters; any sequence containing an ambiguous or non-standard code is skipped as "noise".
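To show what this filtering means in practice, here is a small sketch that applies the same regular expression used later in `prot2vec_corpus` to a few made-up sequences. The helper name `is_standard` and the example strings are illustrative assumptions, not part of the pipeline.

```python
import re

# Same character class as the notebook's filter: the 20 standard amino acids.
VALID_AA = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]+$")

def is_standard(seq):
    """Return True if seq (case-insensitive) uses only the 20 standard letters."""
    return bool(VALID_AA.match(seq.upper()))

# Hypothetical examples: X (unknown) and U (selenocysteine) are not in the set.
for seq in ["MKTAYIAK", "MKXAY", "mktay", "MKUAY", ""]:
    print(repr(seq), is_standard(seq))
```

Note that the check is all-or-nothing: a single non-standard letter causes the whole sequence to be rejected rather than trimmed.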

In [12]:
!pip install gensim biopython tqdm



In [13]:
import os
import gzip
import re
from Bio import SeqIO
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import time
from tqdm import tqdm

### Dataset 
We downloaded the UniProt (Universal Protein Resource) dataset, specifically the uniprot_sprot.fasta.gz file. We used Swiss-Prot, which is the section of UniProt that contains "manually curated" data. This means every protein in this file has been manually reviewed and verified by human experts, ensuring that the model learns from high-quality, accurate information. Swiss-Prot provides *over 500,000* expertly curated sequences for our model to "study".

#### Defining the K-mers
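In protein science, a group of three amino acids is called a "tri-peptide"; in sequence analysis, a run of three consecutive letters is usually called a 3-mer. By breaking long sequences into overlapping 3-mers (sliding one residue at a time, so a sequence of length L yields L - 2 of them), we help the model recognize functional biological "words" rather than just looking at individual letters.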
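To make the overlapping split concrete, here is a minimal sketch on a made-up five-residue sequence. The helper name `kmers` and the toy sequence are illustrative assumptions.

```python
def kmers(seq, k=3):
    """Return the overlapping k-mers of seq, sliding one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

toy = "MKTAY"  # hypothetical sequence, not taken from UniProt
print(kmers(toy))  # ['MKT', 'KTA', 'TAY']
```

A sequence shorter than `k` produces an empty list, which is why the pipeline skips such sequences.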

In [14]:
FASTA_FILE = "uniprot_sprot.fasta.gz"
KMER_SIZE = 3

**Configuring the AI Architecture**

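We set specific values that control how the model learns:

**VECTOR_SIZE = 200:** Each protein "word" is turned into a list of 200 numbers. This size is chosen to be large enough to capture complex chemical details but small enough to remain computationally efficient.

**WINDOW = 5:** The model uses up to 5 neighboring "words" on each side of the current one as its context.

**MIN_COUNT = 3:** "Words" that appear fewer than 3 times in the whole dataset are dropped from the vocabulary.

**NEGATIVE = 10:** For each training pair, 10 "noise" words are drawn for negative sampling.

**TOTAL_EPOCHS = 10:** Training reads through the entire dataset 10 times to refine the embeddings.

**CHECKPOINT_DIR:** We create a folder and save the model there after every epoch, so we don't lose hours of training if the computer crashes.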
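Checkpoints are only useful if we can find the most recent one after a crash. The sketch below scans a checkpoint folder for files following the `prot2vec_epoch_<n>.model` naming pattern used by the training loop later in this notebook and returns the newest. The helper name `latest_checkpoint` is hypothetical, and resuming by loading that file with `Word2Vec.load(path)` is a suggestion rather than something this notebook does.

```python
import os
import re

def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    # Anchored pattern: skips any side files saved next to the model
    # (for example names ending in ".npy").
    pattern = re.compile(r"^prot2vec_epoch_(\d+)\.model$")
    if not os.path.isdir(checkpoint_dir):
        return None
    found = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(checkpoint_dir, name)))
    # Tuples compare by epoch number first, so max() picks the latest epoch.
    return max(found) if found else None

print(latest_checkpoint("prot2vec_checkpoints"))
```

Comparing the epoch as an integer matters here: sorting the file names as plain strings would place epoch 10 before epoch 2.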

In [15]:
VECTOR_SIZE = 200
WINDOW = 5
MIN_COUNT = 3
NEGATIVE = 10
WORKERS = 4        # increase if CPU allows
TOTAL_EPOCHS = 10

CHECKPOINT_DIR = "prot2vec_checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

In [16]:
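def prot2vec_corpus(fasta_file, k=3, show_progress=False):
    """Stream a gzipped FASTA file, yielding each protein as a list of overlapping k-mers."""
    # Accept only sequences made up entirely of the 20 standard amino acids.
    valid_aa = re.compile(r"^[ACDEFGHIKLMNPQRSTVWY]+$")
    with gzip.open(fasta_file, "rt") as handle:
        records = SeqIO.parse(handle, "fasta")
        if show_progress:
            records = tqdm(records, desc="Streaming proteins", unit="seq")
        for record in records:
            seq = str(record.seq).upper()
            # Skip sequences shorter than k or containing non-standard codes.
            if len(seq) >= k and valid_aa.match(seq):
                # Slide a window of width k along the sequence, one residue at a time.
                yield [seq[i:i+k] for i in range(len(seq)-k+1)]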

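Protein databases like UniProt are massive. Instead of loading the whole file into RAM, this function "streams" the data one sequence at a time, so memory use stays small regardless of file size. It breaks each sequence into overlapping 3-mers, which are the fundamental inputs for our Prot2Vec model. Because the function is a generator, each call produces a single pass over the file; that is why we call it again for the vocabulary scan and for every training epoch.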
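A generator can be consumed only once, which is why the notebook calls `prot2vec_corpus(...)` anew for every pass. The sketch below illustrates that behaviour on toy data and shows one way to wrap a generator function so it can be iterated repeatedly. The class name `RestartableCorpus` and the `toy_corpus` function are hypothetical stand-ins, not part of the notebook.

```python
class RestartableCorpus:
    """Wrap a generator function so that every iteration starts a fresh pass."""

    def __init__(self, make_generator, *args, **kwargs):
        self.make_generator = make_generator
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Calling the generator function again re-opens the data from the start.
        return self.make_generator(*self.args, **self.kwargs)

def toy_corpus(sequences, k=3):
    """Stand-in for prot2vec_corpus that reads from a list instead of a file."""
    for seq in sequences:
        yield [seq[i:i + k] for i in range(len(seq) - k + 1)]

gen = toy_corpus(["MKTAY"])
first = list(gen)    # consumes the generator
second = list(gen)   # empty: the generator is already exhausted

corpus = RestartableCorpus(toy_corpus, ["MKTAY"])
pass1 = list(corpus)
pass2 = list(corpus)  # same content again, because __iter__ restarts the stream

print(first, second)
print(pass1 == pass2)
```

Under the same assumption, `RestartableCorpus(prot2vec_corpus, FASTA_FILE, KMER_SIZE)` could be passed anywhere a multi-pass iterable is needed, such as a single `train()` call with several epochs.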

In [17]:
prot2vec = Word2Vec(
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=1,                 # Skip-gram
    negative=NEGATIVE,
    workers=WORKERS
)

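We initialize the Word2Vec model with sg=1, which selects the Skip-gram architecture (sg=0 would select CBOW, the Continuous Bag of Words variant). Skip-gram trains the model to predict the surrounding "words" from the current one. While it is slower than CBOW, it generally learns better representations for rare or infrequent amino acid patterns.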
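To give a feel for what Skip-gram trains on, the sketch below lists the (center, context) pairs that a fixed window would produce for a short list of 3-mers. It is a simplified illustration: the function name `skipgram_pairs` and the toy tokens are assumptions, and gensim's actual implementation adds details such as negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["MKT", "KTA", "TAY", "AYI"]  # overlapping 3-mers of a toy sequence
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

With `window=1` the four tokens give six pairs; widening the window to the notebook's value of 5 would pair each 3-mer with up to ten neighbors.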

##### Building the Vocabulary

In [18]:
print("Building vocabulary...")
prot2vec.build_vocab(
    prot2vec_corpus(FASTA_FILE, KMER_SIZE, show_progress=True)
)
print("Vocabulary built.")


Building vocabulary...


Streaming proteins: 573661seq [02:43, 3510.44seq/s] 


Vocabulary built.


##### Training the Model with Checkpoints

In [19]:
start_time = time.time()

for epoch in range(1, TOTAL_EPOCHS + 1):
    print(f"\n===== Epoch {epoch}/{TOTAL_EPOCHS} =====")
    epoch_start = time.time()

    prot2vec.train(
        prot2vec_corpus(FASTA_FILE, KMER_SIZE, show_progress=True),
        total_examples=prot2vec.corpus_count,
        epochs=1
    )

    epoch_time = time.time() - epoch_start
    total_time = time.time() - start_time

    checkpoint_path = f"{CHECKPOINT_DIR}/prot2vec_epoch_{epoch}.model"
    prot2vec.save(checkpoint_path)

    print(f"Epoch {epoch} finished in {epoch_time/60:.2f} minutes")
    print(f"Total elapsed time: {total_time/60:.2f} minutes")
    print(f"Checkpoint saved: {checkpoint_path}")


===== Epoch 1/10 =====


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'

Epoch 1 finished in 28.53 minutes
Total elapsed time: 28.53 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_1.model

===== Epoch 2/10 =====



Epoch 2 finished in 34.23 minutes
Total elapsed time: 62.76 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_2.model

===== Epoch 3/10 =====



Epoch 3 finished in 41.96 minutes
Total elapsed time: 104.72 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_3.model

===== Epoch 4/10 =====



Epoch 4 finished in 30.55 minutes
Total elapsed time: 135.27 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_4.model

===== Epoch 5/10 =====



Epoch 5 finished in 33.14 minutes
Total elapsed time: 168.41 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_5.model

===== Epoch 6/10 =====



Epoch 6 finished in 32.73 minutes
Total elapsed time: 201.14 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_6.model

===== Epoch 7/10 =====



Epoch 7 finished in 34.44 minutes
Total elapsed time: 235.58 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_7.model

===== Epoch 8/10 =====



Epoch 8 finished in 42.36 minutes
Total elapsed time: 277.94 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_8.model

===== Epoch 9/10 =====



Epoch 9 finished in 25.19 minutes
Total elapsed time: 303.14 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_9.model

===== Epoch 10/10 =====



Epoch 10 finished in 152.26 minutes
Total elapsed time: 455.40 minutes
Checkpoint saved: prot2vec_checkpoints/prot2vec_epoch_10.model
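One caveat about the loop above: as far as we understand gensim's behaviour, each separate `train()` call made without explicit learning-rate arguments decays the rate from the model's `alpha` down to its `min_alpha`. Ten separate calls would therefore restart the decay ten times instead of decaying once over the whole run. A possible workaround is to pass `start_alpha` and `end_alpha` to each call. The sketch below only computes such a per-epoch schedule; the helper name `alpha_schedule` is hypothetical, and the values 0.025 and 0.0001 are assumed to match gensim's defaults for `alpha` and `min_alpha`.

```python
def alpha_schedule(total_epochs, alpha=0.025, min_alpha=0.0001):
    """Split one linear decay from alpha to min_alpha into per-epoch (start, end) pairs."""
    step = (alpha - min_alpha) / total_epochs
    return [(alpha - step * e, alpha - step * (e + 1)) for e in range(total_epochs)]

schedule = alpha_schedule(10)
for epoch, (start, end) in enumerate(schedule, start=1):
    print(f"epoch {epoch}: {start:.5f} -> {end:.5f}")
    # Inside the notebook's loop this might become (untested assumption):
    # prot2vec.train(prot2vec_corpus(FASTA_FILE, KMER_SIZE),
    #                total_examples=prot2vec.corpus_count,
    #                epochs=1, start_alpha=start, end_alpha=end)
```

Each epoch then picks up the learning rate where the previous one stopped, while the per-epoch checkpointing stays unchanged.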


##### Final Save

In [23]:
prot2vec.save("prot2vec.model")

##### Validating the model

In [24]:
prot2vec.wv.most_similar("ACD")

[('CCD', 0.6118519306182861),
 ('CDD', 0.5956080555915833),
 ('ACE', 0.5842300057411194),
 ('ACC', 0.5760337710380554),
 ('ACA', 0.5745081901550293),
 ('QCD', 0.5732702612876892),
 ('HCD', 0.5728904008865356),
 ('AAC', 0.5576863288879395),
 ('ACR', 0.5510157942771912),
 ('ACP', 0.5495207905769348)]
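The scores returned by `most_similar` are cosine similarities between embedding vectors: values near 1 mean two vectors point in almost the same direction, and values near 0 mean they are unrelated. As a rough illustration, here is a minimal sketch with made-up 3-dimensional vectors (the real model uses 200 dimensions); the function and the vectors are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, different length
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(round(cosine_similarity(a, b), 6))  # 1.0
print(round(cosine_similarity(a, c), 6))  # 0.0
```

Read this way, the neighbors of "ACD" listed above, with scores around 0.55 to 0.61, are moderately similar to it rather than near-duplicates.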