# MassSpecGym De Novo Task - Abril Risso

## 1. Introduction

#### 1.1 Problem Overview

In retrieval, the model only needed to rank a list of pre-defined candidates. Now, in the De Novo setting, the model must construct the molecule from scratch based on the input mass spectrum.

This is a Seq2Seq problem:
- **Input:** A mass spectrum (a set of peaks $(m/z, I)$).
- **Output:** A SMILES string (text sequence representing the molecular structure).


#### 1.2 Molecule Generation - De Novo

In the retrieval task, the answer was in the database, but a generative model must learn the "grammar" of chemistry. So it needs to:
- Understand the spectrum (map spectral peaks to chemical substructures)
- Generate valid SMILES to ensure that the output syntax is correct
- Respect chemical rules (the molecule generated must match the precursor mass and must construct a chemically valid graph)

#### 1.3 Proposed Solution

This notebook implements a Spectral Encoder-Decoder Transformer. The Encoder processes the spectrum (similar to the retrieval model) to create a rich representation of the peaks, and the Decoder generates the SMILES string token by token, attending to the spectral features to decide which atom comes next.

## 2. Spectral Seq2Seq Transformer

Unlike standard regression or classification, our goal for the De Novo task is like translation. We are effectively translating the biological language of mass spectrometry (fragmentation patterns) into the chemical language of SMILES strings.

A molecule is not a single fixed-size label. Instead, it is represented as a sequence of tokens with variable length and strict syntax constraints (parentheses, ring closures, etc.). Predicting the whole molecule at once would require fixing a maximum length, predicting tokens for all positions in parallel, predicting where the sequence ends, and enforcing global consistency. In practice, these constraints are hard to enforce with a single-shot predictor, and small local mistakes can make the entire SMILES invalid.

To address this, an Autoregressive Decoder is implemented. This architecture generates the molecule one token at a time, conditioning the prediction of the next token on the spectral input and on the previously generated tokens, which encourages (but does not guarantee) syntactic validity and chemical consistency.
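
The token-by-token factorization behind this choice can be written compactly. The symbols are introduced here for illustration only: $S$ denotes the input spectrum and $y_1, \dots, y_T$ the SMILES tokens.

$$p(y_1, \dots, y_T \mid S) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, S)$$

Each factor is one decoder step, so the sequence length $T$ does not need to be fixed in advance: generation stops when the end-of-sequence token is produced.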

#### 2.1 The Encoder (Spectrum Processing)

The role of the encoder is to map the sparse list of spectral peaks into a rich representation that the decoder can understand. The base of this encoder is the same as the Spectral Transformer used in the Retrieval task. It treats the spectrum as a set of peaks and has the same improvements detailed in the previous notebook:
- **Fourier Features:** To capture high-precision mass differences (isotopes).
- **Log-Intensity & Precursor Injection:** To handle dynamic range and provide global context about the parent mass.
- **Self-Attention:** To learn non-local relationships between peaks.
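
As a rough illustration of the first item, the sketch below maps a scalar $m/z$ value to sine and cosine features. It is a simplified stand-in, not the encoder itself: the frequency list `freqs` is hypothetical (the model samples its frequencies randomly), and the example masses are chosen only to show a spacing of about 1 Da.

```python
import math

def fourier_features(mz, freqs, scale=1000.0):
    """Map a scalar m/z value to [sin(2*pi*b*x) ...] + [cos(2*pi*b*x) ...]."""
    x = min(max(mz / scale, 0.0), 2.0)  # same normalize-and-clamp idea as the encoder
    angles = [2 * math.pi * b * x for b in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

freqs = [1.0, 10.0, 100.0]             # hypothetical fixed frequencies
a = fourier_features(180.063, freqs)
b = fourier_features(181.066, freqs)   # about 1 Da higher

# The highest frequency separates the two nearby masses far more than the lowest one.
low_freq_gap = abs(a[0] - b[0])
high_freq_gap = abs(a[2] - b[2])
print(low_freq_gap, high_freq_gap)
```

Low frequencies change slowly with mass and encode the coarse position, while high frequencies make small mass differences visible to the model.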

However, there is one key addition for the De Novo task: the Rank Embeddings.

In the Retrieval task, the encoder output was pooled into a single global vector (using Attention Pooling) to match a fingerprint. In the De Novo task, the output is not pooled. The decoder needs to attend to individual peaks to decide which atom to generate next. To help the model understand that not all peaks are equally important, a learnable embedding based on the **intensity rank** of the peak is added.

This embedding is not a fixed value. It is a parameter that the model updates during training. The embeddings are randomly initialized, so at first the rank carries no meaning. As training progresses, the model adjusts them to learn how much weight each rank should receive.

In Natural Language Processing (NLP), positional embeddings follow the linear order of words because position defines meaning (syntax). Applying the same logic here by sorting peaks by mass is suboptimal: the precise $m/z$ value is already encoded by the Fourier Features, so a mass-based ordering would add redundant information.

Instead, we use **Intensity Ranking**. In mass spectrometry, intensity tells us which fragments are the most abundant and relevant. It provides the decoder with clear guidance, encouraging the model to attend to the strongest evidence first (to find the main molecular skeleton) before considering weaker signals or noise to fill in the details. 
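
A minimal sketch of how an intensity rank could be computed for each peak is shown below. The helper `intensity_ranks` and the example peaks are hypothetical. Note that the encoder in Section 3.1 uses the index along the peak axis as the rank, so it matches this definition only if the peaks arrive sorted by descending intensity.

```python
def intensity_ranks(intensities):
    """Return the rank of each peak, where 0 is the most intense peak."""
    order = sorted(range(len(intensities)), key=lambda i: intensities[i], reverse=True)
    ranks = [0] * len(intensities)
    for rank, peak_index in enumerate(order):
        ranks[peak_index] = rank
    return ranks

# Hypothetical (m/z, intensity) pairs, listed in m/z order.
peaks = [(91.05, 0.20), (119.05, 1.00), (147.04, 0.55)]
ranks = intensity_ranks([intensity for _, intensity in peaks])
print(ranks)  # [2, 0, 1]
```

Each rank would then index a row of the learnable embedding table, which is added to the peak's feature vector.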

#### 2.2 The Decoder (Molecule Generation)

The Decoder is responsible for generating the molecular structure (SMILES sequence) token by token based on the encoded spectral representation. So, a standard Transformer Decoder is implemented with 3 key components:

##### 2.2.1 Autoregressive Generation and Masked Self-Attention

The Decoder operates autoregressively, meaning it predicts the next token $x_t$ based on the previously generated tokens $x_{<t}$. During training, the whole target sequence is fed in at once, so the model must be prevented from seeing the future. To achieve this, **Masked Self-Attention** is applied: the attention scores of future positions are set to $-\infty$ before the softmax, so their attention weights become zero. This forces the model to rely only on the **previous** tokens to predict the next step, so that it learns the sequential grammar of chemistry.
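
The masking step can be sketched in plain Python. This is an illustrative toy with hypothetical attention scores, not the PyTorch implementation: it builds an additive mask and shows that the softmax assigns zero weight to future positions.

```python
import math

def causal_mask(size):
    """Additive mask: 0 where attention is allowed, -inf for future positions."""
    return [[0.0 if j <= i else -math.inf for j in range(size)] for i in range(size)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
raw_scores = [0.5, 1.2, -0.3]  # hypothetical scores of one query against 3 positions

# Weights for the token at position 1: it may attend to positions 0 and 1 only.
weights = softmax([s + bias for s, bias in zip(raw_scores, mask[1])])
print(weights)
```

Position 2 lies in the future of position 1, so its weight is exactly zero and the remaining weights still sum to one.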

##### 2.2.2 Cross-Attention

This mechanism acts as the bridge connecting the spectral data to the generation process. As explained in the previous section, the Encoder output is not pooled. It remains a matrix of feature vectors representing individual peaks. 

Through Cross-Attention, the Decoder uses the current partial SMILES sequence as the **query** to inspect the spectral peaks. At each generation step, the mechanism calculates attention weights over the encoded spectral peaks (the output of the Encoder). This allows the Decoder to selectively attend to the most relevant spectral peaks (prioritized by the Rank Embeddings) to determine which atom to generate next.
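
The following sketch shows the core computation for a single decoder position, assuming scaled dot-product attention. It is deliberately simplified: the learned query, key, and value projections and the multiple heads of a real Transformer layer are omitted, and the feature values are hypothetical.

```python
import math

def cross_attention(query, keys, values):
    """Scaled dot-product attention of one decoder query over all encoded peaks."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the peak features: the spectral context for this token.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, context

# Hypothetical 2-dimensional features for three encoded peaks.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # decoder state for the current SMILES position
weights, context = cross_attention(query, memory, memory)
print(weights, context)
```

Peaks whose features align with the query receive larger weights, and the resulting context vector is what the decoder uses to choose the next token.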

##### 2.2.3 Positional Encoding

Finally, unlike the Encoder where Intensity Ranking was used, the Decoder processes a text sequence (SMILES) where linear order determines syntax and meaning. Therefore, positional embeddings are added to give the model information about the position of each token in the sequence. The implementation in Section 3.2 uses learned positional embeddings (`LearnedPositionalEncoding`) rather than fixed sinusoidal encodings.

#### 2.3 Training Strategy: Teacher Forcing

Training a Seq2Seq model can produce **error propagation**. If the model sees its own generated tokens from the previous step as the input for the next step, any mistake in the sequence can confuse the model for all subsequent steps. This makes convergence slow and unstable, as the model spends most of the time trying to recover from its own errors rather than learning the correct structure.

To solve this, the **Teacher Forcing** strategy is implemented. Regardless of what the model predicts, the ground truth is always given as the input for the next step to the Decoder.

This ensures that at every step, the model is trying to predict the next atom given a perfect history, which stabilizes gradients and speeds up the learning process.
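
In practice, Teacher Forcing amounts to shifting the ground-truth sequence by one position. The sketch below uses placeholder special tokens and character-level tokens for ethanol (`CCO`). These are illustrative choices, not the tokenizer's actual vocabulary.

```python
SOS, EOS = "<s>", "</s>"  # placeholder special tokens

def teacher_forcing_pairs(tokens):
    """Split a target sequence into decoder inputs and next-token labels."""
    sequence = [SOS] + tokens + [EOS]
    decoder_input = sequence[:-1]  # ground-truth history fed to the decoder
    labels = sequence[1:]          # token the decoder must predict at each position
    return decoder_input, labels

decoder_input, labels = teacher_forcing_pairs(list("CCO"))
for seen, target in zip(decoder_input, labels):
    print(f"{seen!r:>6} -> {target!r}")
```

This mirrors the `tgt_in = tgt_ids[:, :-1]` and `tgt_out = tgt_ids[:, 1:]` split in the `forward` method of Section 3.3.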

#### 2.4 Inference Strategy: Beam Search

During training, Teacher Forcing masks the model's mistakes. However, during inference, the ground truth is unknown, so the model must rely on its own predictions to generate the next token.

Therefore, a simple Greedy Search (selecting only the single highest-probability token at each step) is risky. If the model makes a single mistake early in the sequence, the error propagates through all later steps and cannot be undone, resulting in invalid or incorrect structures.

To mitigate this, **Beam Search** is implemented. So, instead of relying on a single output path, the algorithm explores multiple potential sequences in parallel. It maintains a set of $k$ candidate sequences with the highest cumulative probabilities (where $k$ is the beam width) at each decoding step.

The process is iterative as the model extends all the currently active candidates by predicting the next possible atoms for each one. Then, it ranks the resulting sequences by their cumulative probability and selects the top $k$ candidates to continue with the generation.
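
The implementation in Section 3.3 does not rank finished hypotheses by the raw cumulative log-probability alone: it divides that sum by a length penalty. Writing $L$ for the number of tokens in the stored sequence (`len(seq)` in the code), $S$ for the spectrum, and $\alpha$ for `length_penalty_alpha` (symbols introduced here for illustration), the score it computes is:

$$\text{score}(y) = \frac{\sum_{t} \log p(y_t \mid y_{<t}, S)}{\left(\dfrac{5 + L}{6}\right)^{\alpha}}$$

Every log-probability is negative, so longer sequences accumulate lower sums. Dividing by a penalty that grows with $L$ partly offsets this bias toward short outputs. With $\alpha = 0$ the penalty is 1 and the score reduces to the plain cumulative log-probability.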

It is a safer technique than Greedy Search, as it reduces the risk of getting trapped in a local optimum. Beam search increases the chance of finding a higher-probability sequence, even if some intermediate steps were not the highest-probability choices in isolation. It does not guarantee the globally optimal structure, and the benefit depends on the beam width $k$.
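
The difference between greedy and beam decoding can be seen on a toy example. The probability table `TOY_MODEL` below is entirely hypothetical and stands in for the decoder. The sketch omits batching and the length penalty used in the real implementation.

```python
import math

EOS = "$"  # placeholder end-of-sequence token

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    "": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}

def next_log_probs(prefix):
    """Log-probabilities of the next token; unknown prefixes can only end."""
    probs = TOY_MODEL.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(k, max_len=5):
    beams = [(0.0, "")]  # (cumulative log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for tok, lp in next_log_probs(prefix).items():
                if tok == EOS:
                    finished.append((score + lp, prefix))
                else:
                    candidates.append((score + lp, prefix + tok))
        if not candidates:
            break
        beams = sorted(candidates, reverse=True)[:k]  # keep the k best prefixes
    return max(finished)[1] if finished else max(beams)[1]

greedy = beam_search(k=1)  # a beam width of 1 is greedy search
wide = beam_search(k=2)
print(greedy, wide)
```

Greedy search commits to `A` (probability 0.6) and ends with a sequence of probability 0.33, while a beam of width 2 keeps `B` alive and finds `Bz` with probability 0.36.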

#### 2.5 Optimization and Regularization Strategy

To ensure stability and prevent overfitting, three key strategies were implemented: label smoothing, peak dropout, and a learning-rate schedule with warmup.

The model is trained to minimize the **Cross-Entropy Loss** between the predicted probability distribution and the ground truth SMILES tokens. However, standard Cross-Entropy can lead to overconfidence, where the model assigns extremely high probabilities to its predictions and becomes sensitive to noisy labels. To mitigate this, Label Smoothing ($\epsilon=0.1$) is applied. Instead of targeting a hard probability of 1.0 for the correct token, the model targets approximately $1.0 - \epsilon$, with the remaining probability mass distributed across the vocabulary. This encourages the network to learn more robust representations and tends to group the representations of each class more tightly.
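
The smoothed target can be written out explicitly. The sketch below assumes the convention of mixing the one-hot target with a uniform distribution over all classes, which is my understanding of how the `label_smoothing` argument of PyTorch's `CrossEntropyLoss` behaves. The helper name and the five-token vocabulary are hypothetical.

```python
def smoothed_targets(correct_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    target = [uniform] * num_classes
    target[correct_index] += 1.0 - epsilon
    return target

target = smoothed_targets(correct_index=2, num_classes=5, epsilon=0.1)
print(target)
```

Under this convention the correct token receives $1 - \epsilon + \epsilon / K$ (here 0.92 for $K = 5$) and every other token receives $\epsilon / K$ (here 0.02). For a realistic vocabulary size the correct-token target is very close to $1 - \epsilon$.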

To prevent the model from memorizing specific noisy peaks in the training data, a small percentage of peaks is randomly masked (removed) from the input spectrum during each training step. This forces the network to learn the global fragmentation pattern of the molecule rather than relying on isolated signals.
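
A plain-Python sketch of this augmentation is given below, under the assumption (consistent with `_augment_spectrum` in Section 3.3) that a dropped peak is zeroed out and that all-zero rows are treated as padding. The example spectrum is hypothetical.

```python
import random

def peak_dropout(peaks, p, rng):
    """Zero out each real peak with probability p; padding rows (0, 0) stay untouched."""
    augmented = []
    for mz, intensity in peaks:
        is_real = (mz != 0.0 or intensity != 0.0)
        if is_real and rng.random() < p:
            augmented.append((0.0, 0.0))  # a dropped peak now looks like padding
        else:
            augmented.append((mz, intensity))
    return augmented

# Hypothetical spectrum; the last row is padding.
spectrum = [(91.05, 0.2), (119.05, 1.0), (147.04, 0.55), (0.0, 0.0)]
augmented = peak_dropout(spectrum, p=0.1, rng=random.Random(0))
print(augmented)
```

Because the padding mask is computed after augmentation, dropped peaks are ignored by attention in the same way as real padding.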

Transformers are sensitive to the learning rate during the initial phases of training. Therefore, the AdamW optimizer is combined with a linear warmup followed by cosine decay. This schedule lets training begin smoothly and reduces the risk of divergence in the early stages.
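
The shape of this schedule can be checked with a standalone version of the multiplier that is passed to `LambdaLR` in Section 3.3. The function name `lr_multiplier` and the choice of 1000 total steps are illustrative.

```python
import math

def lr_multiplier(step, total_steps, warmup_ratio=0.1):
    """Linear warmup from 0 to 1, then cosine decay from 1 to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

schedule = [lr_multiplier(s, total_steps=1000) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```

The multiplier rises linearly over the first 10% of the steps, reaches 1.0 at the end of warmup, passes through about 0.5 halfway through the decay, and falls to 0.0 at the last step. The actual learning rate is this multiplier times the base `lr`.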

## 3. Implementation

#### 3.1 Encoder

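import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.lr_scheduler import LambdaLR

# MassSpecGym names used below (DeNovoMassSpecGymModel, SpecialTokensBaseTokenizer, Stage,
# PAD_TOKEN, SOS_TOKEN, EOS_TOKEN) are assumed to be imported earlier in the notebook.
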
class FourierFeatures(nn.Module):
    """
    Implements Fourier Feature mapping for high-frequency coordinate transformation.
    """
    def __init__(self, output_dim, sigma=1.0):
        super().__init__()
        self.num_freqs = output_dim // 2
        self.register_buffer('B', torch.randn(self.num_freqs) * sigma)
    
    def forward(self, x):
        # Normalization to prevent extreme values during projection
        projected = 2 * math.pi * torch.clamp(x / 1000.0, 0, 2) * self.B
        return torch.cat([torch.sin(projected), torch.cos(projected)], dim=-1)


class PeakEncoder(nn.Module):
    """
    Transformer-based Encoder for Mass Spectrometry peaks.
    
    Key Components:
    - Fourier Features for m/z representation.
    - Linear projection for log-intensity.
    - Learnable Rank Embeddings to prioritize high-intensity peaks.
    - Precursor m/z injection for conditioning.
    """
    def __init__(self, d_model=256, nhead=4, num_layers=2, dropout=0.1, max_peaks=1000, use_rank_emb=True):
        super().__init__()
        # Features for m/z and linear projection for log-intensity
        self.mz_enc = FourierFeatures(d_model // 2, sigma=10.0)
        self.int_enc = nn.Linear(1, d_model - (d_model // 2))
        
        # Positional/Rank Embedding
        self.use_rank_emb = use_rank_emb
        self.max_peaks = max_peaks
        if self.use_rank_emb:
            self.rank_emb = nn.Embedding(max_peaks, d_model)
        
        # Projection for the precursor m/z
        self.precursor_proj = nn.Linear(1, d_model)
        
        self.input_norm = nn.LayerNorm(d_model)
        
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, nhead, 
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True,
            norm_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
    
    def forward(self, spec, precursor_mz, src_key_padding_mask=None):
        mz = spec[:, :, 0:1]
        intensity = spec[:, :, 1:2]
        B_size, N_peaks, _ = spec.shape
        
        mz_emb = self.mz_enc(mz)
        int_emb = self.int_enc(torch.log1p(torch.clamp(intensity, 0, 1e6))) # Use log1p and clamp to avoid log(0) or extremely large values
        
        x = torch.cat([mz_emb, int_emb], dim=-1)
        
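        # Add Rank Embeddings.
        # NOTE: the index along the peak axis is used as the rank, so this equals the
        # intensity rank only if peaks arrive sorted by descending intensity.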
        if self.use_rank_emb:
            positions = torch.arange(N_peaks, device=spec.device).unsqueeze(0).expand(B_size, -1)
            positions = positions.clamp(max=self.max_peaks - 1)
            x = x + self.rank_emb(positions)
        
        # Precursor Injection
        if precursor_mz is not None:
            if precursor_mz.dim() == 1:
                precursor_mz = precursor_mz.unsqueeze(-1)
            prec_norm = torch.clamp(precursor_mz.float() / 1000.0, 0, 2)
            prec_emb = self.precursor_proj(prec_norm).unsqueeze(1)
            x = x + prec_emb
        
        x = self.input_norm(x)
        
        memory = self.transformer(x, src_key_padding_mask=src_key_padding_mask)
        return memory

#### 3.2 Decoder

class LearnedPositionalEncoding(nn.Module):
    """
    Learned positional embeddings for the SMILES sequence.
    Allows the model to understand the order of atoms in the string.
    """
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.encoding = nn.Embedding(max_len, d_model)
        self.register_buffer("positions", torch.arange(max_len))
    
    def forward(self, x):
        seq_len = x.size(1)
        clamped_positions = self.positions[:seq_len].clamp(max=self.encoding.num_embeddings - 1)
        return self.encoding(clamped_positions.unsqueeze(0))


class AutoregressiveDecoder(nn.Module):
    """
    Standard Transformer Decoder.
    Predicts the next SMILES token based on the Encoder memory and previous tokens.
    """
    def __init__(self, vocab_size, d_model, nhead, num_layers, dropout=0.1, max_len=200):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = LearnedPositionalEncoding(d_model, max_len)
        
        decoder_layer = nn.TransformerDecoderLayer(
            d_model, nhead,
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True,
            norm_first=True
        )
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, tgt, memory, tgt_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None):
        x = self.embedding(tgt) * math.sqrt(self.d_model)
        x = x + self.pos_encoder(tgt)
        x = self.dropout(x)
        
        output = self.transformer_decoder(
            tgt=x,
            memory=memory,
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=memory_key_padding_mask
        )
        return self.output_proj(output)

#### 3.3 Lightning Module

class SimpleDeNovoTransformer(DeNovoMassSpecGymModel):
    """
    Main PyTorch Lightning Module for De Novo Molecular Generation.
    
    Integrates:
    - PeakEncoder (Spectral processing)
    - AutoregressiveDecoder (SMILES generation)
    - Training logic (Teacher Forcing, Loss calculation)
    - Inference logic (Beam Search)
    """
    def __init__(
        self,
        d_model: int,
        nhead: int,
        num_encoder_layers: int,
        num_decoder_layers: int,
        smiles_tokenizer: SpecialTokensBaseTokenizer,
        dropout: float = 0.1,
        max_smiles_len: int = 150,
        lr: float = 5e-4,
        peak_dropout_p: float = 0.1,
        label_smoothing: float = 0.1,
        warmup_ratio: float = 0.1,
        beam_size: int = 5,
        length_penalty_alpha: float = 0.8,
        max_peaks: int = 1000,
        top_ks: list = None,
        *args,
        **kwargs
    ):
        if top_ks is None:
            top_ks = [1, 5]
        super().__init__(top_ks=top_ks, *args, **kwargs)
        self.save_hyperparameters(ignore=['smiles_tokenizer'])
        
        self.smiles_tokenizer = smiles_tokenizer
        self.vocab_size = smiles_tokenizer.get_vocab_size()
        self.pad_id = smiles_tokenizer.token_to_id(PAD_TOKEN)
        self.sos_id = smiles_tokenizer.token_to_id(SOS_TOKEN)
        self.eos_id = smiles_tokenizer.token_to_id(EOS_TOKEN)
        self.max_len = max_smiles_len
        self.lr = lr
        
        self.peak_dropout_p = peak_dropout_p
        self.warmup_ratio = warmup_ratio
        self.beam_size = beam_size
        self.length_penalty_alpha = length_penalty_alpha
        
        self.max_k_eval = max(top_ks) if top_ks else 1
        
        self.encoder = PeakEncoder(
            d_model, nhead, num_encoder_layers, dropout,
            max_peaks=max_peaks, use_rank_emb=True
        )
        self.decoder = AutoregressiveDecoder(
            self.vocab_size, d_model, nhead, num_decoder_layers,
            dropout, max_smiles_len
        )
        
        self.criterion = nn.CrossEntropyLoss(
            ignore_index=self.pad_id,
            label_smoothing=label_smoothing
        )
    
    def generate_src_padding_mask(self, spec):
        return spec.abs().sum(dim=-1) == 0
    
    def generate_causal_mask(self, sz, device):
        return torch.triu(torch.full((sz, sz), float('-inf'), device=device), diagonal=1)
    
    def _augment_spectrum(self, spec):
        """Applies Peak Dropout only during training to improve robustness."""
        if self.training and self.peak_dropout_p > 0:
            is_content = (spec.abs().sum(dim=-1) > 0).float()
            dropout_mask = torch.bernoulli(torch.full_like(is_content, self.peak_dropout_p))
            final_mask = dropout_mask * is_content
            spec = spec * (1 - final_mask.unsqueeze(-1))
        return spec
    
    def forward(self, batch):
        spec = batch["spec"]
        precursor_mz = batch.get("precursor_mz", None)
        mols = batch["mol"]
        
        spec = self._augment_spectrum(spec)
        
        encoded_mols = self.smiles_tokenizer.encode_batch(mols)
        tgt_ids = torch.tensor([e.ids for e in encoded_mols], device=self.device)
        
        tgt_in = tgt_ids[:, :-1]
        tgt_out = tgt_ids[:, 1:]
        
        src_mask = self.generate_src_padding_mask(spec)
        tgt_pad_mask = (tgt_in == self.pad_id)
        tgt_causal_mask = self.generate_causal_mask(tgt_in.size(1), self.device)
        
        memory = self.encoder(spec, precursor_mz, src_key_padding_mask=src_mask)
        logits = self.decoder(
            tgt_in, memory,
            tgt_mask=tgt_causal_mask,
            tgt_key_padding_mask=tgt_pad_mask,
            memory_key_padding_mask=src_mask
        )
        
        return logits, tgt_out
    
    def step(self, batch, stage: Stage = Stage.NONE):
        logits, tgt_out = self.forward(batch)
        
        loss = self.criterion(logits.reshape(-1, self.vocab_size), tgt_out.reshape(-1))
        
        if torch.isnan(loss):
            print(f"NaN detected in loss at stage {stage}")
            loss = torch.tensor(0.0, device=self.device, requires_grad=True)
        
        mols_pred = None
        if stage not in self.log_only_loss_at_stages:
            mols_pred = self.decode_smiles(batch)
        
        return dict(loss=loss, mols_pred=mols_pred)
    
    def decode_smiles(self, batch):
        """
        Executes Beam Search decoding with Early Stopping.
        
        Strategy:
        1. Maintains top-k sequences (beams) at each step.
        2. Prunes low-probability paths.
        3. Stops when enough hypotheses (k) are finished or max length is reached.
        """
        spec = batch["spec"]
        precursor_mz = batch.get("precursor_mz", None)
        batch_size = spec.size(0)
        beam_size = self.beam_size
        device = self.device
        required_k = self.max_k_eval
        
        src_mask = self.generate_src_padding_mask(spec)
        
        with torch.inference_mode():
            # Encoder forward pass
            memory = self.encoder(spec, precursor_mz, src_key_padding_mask=src_mask)
            memory = memory.repeat_interleave(beam_size, dim=0)
            src_mask = src_mask.repeat_interleave(beam_size, dim=0)
            
            # Initialize Beams
            ys = torch.full((batch_size * beam_size, 1), self.sos_id, dtype=torch.long, device=device)
            beam_scores = torch.zeros((batch_size, beam_size), device=device)
            beam_scores[:, 1:] = float('-1e9') # Only the first beam starts with 0 score
            beam_scores = beam_scores.view(-1)
            
            finished_beams = [[] for _ in range(batch_size)]
            
            for step in range(self.max_len):
                # EARLY STOPPING CHECK
                finished_counts = torch.tensor([len(fb) for fb in finished_beams], device=device)
                has_enough_hyps = (finished_counts >= required_k)
                
                active_scores_view = beam_scores.view(batch_size, beam_size)
                has_active_beams = (active_scores_view > -1e8).any(dim=1)
                
                needs_work = (~has_enough_hyps) & has_active_beams
                if not needs_work.any():
                    break
                
                # Decoder Step
                tgt_mask = self.generate_causal_mask(ys.size(1), device)
                out = self.decoder(ys, memory, tgt_mask=tgt_mask, memory_key_padding_mask=src_mask)
                logits = out[:, -1, :]
                log_probs = F.log_softmax(logits, dim=-1)
                
                next_scores = log_probs + beam_scores.unsqueeze(-1)
                next_scores = next_scores.view(batch_size, -1)
                
                topk_scores, topk_indices = next_scores.topk(beam_size, dim=1)
                beam_indices = torch.div(topk_indices, self.vocab_size, rounding_mode='floor')
                token_indices = topk_indices % self.vocab_size
                
                batch_offset = torch.arange(batch_size, device=device).unsqueeze(1) * beam_size
                global_beam_indices = batch_offset + beam_indices
                
                prev_ys = ys[global_beam_indices.view(-1)]
                new_tokens = token_indices.view(-1, 1)
                ys = torch.cat([prev_ys, new_tokens], dim=1)
                beam_scores = topk_scores.view(-1)
                
                # EOS Check
                current_tokens = token_indices.view(-1)
                is_eos = (current_tokens == self.eos_id)
                
                if is_eos.any():
                    eos_indices = torch.nonzero(is_eos, as_tuple=True)[0]
                    for idx in eos_indices:
                        batch_idx = idx.item() // beam_size
                        score = beam_scores[idx].item()
                        seq = ys[idx].tolist()
                        # Length Penalty application
                        lp = ((5 + len(seq)) / 6) ** self.length_penalty_alpha
                        final_score = score / lp
                        finished_beams[batch_idx].append((final_score, seq))
                        beam_scores[idx] = float('-1e9') # Invalidate this beam
            
            decoded_smiles_batch = []
            for i in range(batch_size):
                # Collect beams that didn't finish
                for j in range(beam_size):
                    idx = i * beam_size + j
                    if beam_scores[idx] > float('-1e8'):
                        seq = ys[idx].tolist()
                        lp = ((5 + len(seq)) / 6) ** self.length_penalty_alpha
                        score = beam_scores[idx].item() / lp
                        finished_beams[i].append((score, seq))
                
                finished_beams[i].sort(key=lambda x: x[0], reverse=True)
                max_k = self.max_k_eval
                top_hyps = finished_beams[i][:max_k]
                
                batch_preds = []
                for _, seq in top_hyps:
                    try:
                        s = self.smiles_tokenizer.decode(seq, skip_special_tokens=True)
                    except Exception:  # avoid a bare except, which would also catch KeyboardInterrupt
                        s = ""
                    batch_preds.append(s)
                
                # Fill with empty string if no hypotheses found or fewer than max_k
                if len(batch_preds) == 0:
                    batch_preds = [""] * max_k
                elif len(batch_preds) < max_k:
                    # Fill with the last valid prediction or empty string
                    fill_value = batch_preds[-1] if batch_preds else ""
                    batch_preds += [fill_value] * (max_k - len(batch_preds))
                
                decoded_smiles_batch.append(batch_preds)
            
            return decoded_smiles_batch
    
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr, weight_decay=0.01)
        
        if hasattr(self, 'trainer') and self.trainer is not None:
            total_steps = self.trainer.estimated_stepping_batches
        else:
            total_steps = 1000
        
        warmup_steps = int(total_steps * self.warmup_ratio)
        
        def lr_lambda(current_step):
            if current_step < warmup_steps:
                return float(current_step) / float(max(1, warmup_steps))
            progress = float(current_step - warmup_steps) / float(max(1, total_steps - warmup_steps))
            return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
        
        scheduler = {
            'scheduler': LambdaLR(optimizer, lr_lambda),
            'interval': 'step',
            'frequency': 1
        }
        return [optimizer], [scheduler]

## 4. Results

The following table summarizes the performance of the De Novo model on the test set.

| Metric | Value | Description |
| :--- | :---: | :--- |
| **Test Loss** | 1.653 | Cross-Entropy loss between predicted and actual tokens. |
| **Top-1 Accuracy** | 0.00% | Percentage of exact structure matches (perfect predictions). |
| **Top-1 Tanimoto** | 0.108 | Structural similarity of the best prediction to the ground truth (0-1). |
| **Top-1 MCES Dist** | 37.93 | Max Common Edge Subgraph Distance (lower is better). |
| **Top-5 Accuracy** | 0.00% | Percentage where the correct structure appears in the top 5 beams. |
| **Top-5 Tanimoto** | 0.130 | Structural similarity of the best candidate among the top 5. |
| **Top-5 MCES Dist** | 30.87 | MCES Distance for the best candidate among the top 5. |

The results highlight the significant challenge of the De Novo generation task compared to spectral retrieval. Unlike retrieval, where the answer is selected from a finite dataset, the generative model must construct the molecule atom-by-atom, exploring an infinite chemical space.

While a **Top-1 Accuracy of 0.00%** is low, it is consistent with baseline performance for this specific task and dataset: other Transformer-based architectures presented in the MassSpecGym repository (such as the SMILES Transformer and SELFIES Transformer) also report 0.0% Top-1 Accuracy. This indicates that exact structure reconstruction is a very difficult objective.

Despite the lack of exact matches, the model demonstrates learning capability. It achieves a Top-1 Tanimoto Similarity of 0.108, which is higher than the values reported in the benchmark for the baseline SMILES Transformer (0.03) and the SELFIES Transformer (0.08). This suggests that while the model struggles to pinpoint the exact molecule, it identifies structurally relevant molecular fragments and partial structures compatible with the input spectrum, performing better than random generation or basic transformer baselines.
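
For readers unfamiliar with the metric, Tanimoto similarity compares two molecular fingerprints as sets of active bits: the size of their intersection divided by the size of their union. The sketch below uses made-up bit sets rather than real fingerprints, which in practice would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)

truth = {1, 4, 7, 9, 12}     # hypothetical fingerprint bits, not a real molecule
prediction = {1, 4, 8, 15}
similarity = tanimoto(truth, prediction)
print(round(similarity, 3))  # 0.286
```

Two shared bits out of seven distinct bits give 2/7. A score of 1.0 means identical fingerprints, and a score near 0.1 means that only a small fraction of substructure bits are shared.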

Furthermore, the utility of the inference strategy is evident when comparing Top-1 and Top-5 metrics: the Tanimoto Similarity improves from 0.108 to 0.130 and the MCES Distance improves (decreases) from 37.93 to 30.87. This suggests that the single highest-scoring hypothesis is often not the one closest to the ground truth, which is consistent with the concern that committing to one path can end in a local optimum. By maintaining multiple beams, the model is able to propose alternative structures that are structurally closer to the ground truth than its top-ranked prediction.

## 5. Conclusion

This project explored the challenging transition from spectral retrieval to De Novo Molecular Generation, implementing a Transformer-based Encoder-Decoder architecture designed to translate Mass Spectrometry data directly into chemical structures (SMILES).

The contrast between the Retrieval task (previous notebook) and this Generative task is significant. While retrieval relies on matching fingerprints, de novo generation requires constructing a molecule atom-by-atom from an infinite chemical space. The 0.00% Top-1 Accuracy reflects the extreme difficulty of this problem when relying only on supervised learning.

However, although the model did not achieve exact structure reconstruction, it reached a Top-1 Tanimoto similarity score (0.108) above that of the standard baselines. This result is consistent with the proposed Rank Embeddings and Beam Search strategies being effective, although no ablation isolating their contribution is reported here. It shows that the model identifies relevant molecular structures even without large-scale pre-training.