# neuralmt: default program

In [1]:
from default import *
import os, sys

In [None]:
cd ../

## Run the default solution on dev

In [3]:
input_path = 'data/input/dev.txt'
num = 20
model = Seq2Seq(build=False)
model.load(os.path.join('data', 'seq2seq_E049.pt'))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# loading test dataset
test_dl = loadTestData(input_path, model.params['srcLex'],
                       device=hp.device, linesToLoad= num)
results = translate(model, test_dl)
print("\n".join(results))

20it [00:04,  4.01it/s]

i was in my first time . 
i was in and i . 
she was a woman . 
so she was to , and she was , and she was , and she was , and she was , and she was in the . 
i was i . 
a 's a first . 
i got to talk about the woman . 
i i i . 
i did n't . 
so , the the , the the , , the the , , the the , , the the , , 
i was , , " i , " i said , " 
later , later later later later later later later later later later later . 
i was in the . 
i asked to to , i was . 
i . 
she 's , , " she , , " she 's , , " she . 
but , you know , but you . 
the , the is is that it 's is , " 
what 's what is . 
that was the moment that was that the moment . 





## Evaluate the default output

In [4]:
from bleu_check import bleu
ref_t = []
with open(os.path.join('data','reference','dev.out')) as r:
    ref_t = r.read().strip().splitlines()
print(bleu(ref_t, results))

BLEU = 4.51 54.9/16.7/5.6/1.6 (BP = 0.471 ratio = 0.571 hyp_len = 182 ref_len = 319)
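
The brevity penalty in the output above can be reproduced by hand. This is a minimal sketch that assumes bleu_check follows the standard BLEU definition, BP = exp(1 - ref_len / hyp_len) when the hypothesis is shorter than the reference; the helper name `brevity_penalty` is illustrative and not part of bleu_check.

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty (assumed to match bleu_check)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

hyp_len, ref_len = 182, 319   # lengths reported in the output above
bp = brevity_penalty(hyp_len, ref_len)
ratio = hyp_len / ref_len
print(f"BP = {bp:.3f} ratio = {ratio:.3f}")
```

A penalty this far below 1 means the hypotheses are much shorter than the references, so the low score is driven by output length as well as by n-gram precision.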


## Documentation

### The Attention Layer:

The attention layer lets the encoder-decoder model learn which parts of the source sentence are most relevant at each decoding step, so the decoder can focus on those positions when predicting the next target word. The attention mechanism in our model, implemented in the AttentionModule class, has two key components:

The Weighting Mechanism: The attention mechanism calculates an attention weight (alpha) for each position in the input sequence. These weights are computed with the learned W_enc and W_dec linear transformations and the V_att linear layer, so they adapt to the relationship between the encoder's outputs and the decoder's current hidden state.

The Context Calculation: Once the attention weights are calculated, they are used to compute a context vector, which summarizes the most relevant information from the input sequence. In the provided implementation, the context is computed for the whole batch at once with a broadcast multiplication followed by a sum over the source positions, which avoids an explicit Python loop over the sequence.

As was mentioned in the documentation, attention is defined as follows:
\[\mathrm{score}_i = W_{enc}( h^{enc}_i ) + W_{dec}( h^{dec} )\]
Define the $\alpha$ vector as follows:

\[\alpha = \mathrm{softmax}(V_{att} \mathrm{tanh} (\mathrm{score}))\]
Then we define the context vector using the $\alpha$ weights for each source side index $i$:

\[c = \sum_i \alpha_i \times h^{enc}_i\]
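
The three equations above can be traced on a single toy sentence. The following is a dependency-free sketch, not the model code: the helper names (`matvec`, `softmax`, `additive_attention`) and all numeric values are assumptions chosen only for illustration, and there is no batch dimension.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(enc_states, dec_state, W_enc, W_dec, v_att):
    """Return (alpha, context) for one decoder step of a single sentence."""
    dec_proj = matvec(W_dec, dec_state)
    energies = []
    for h in enc_states:
        # score_i = W_enc(h_i^enc) + W_dec(h^dec)
        score = [e + d for e, d in zip(matvec(W_enc, h), dec_proj)]
        # V_att applied to tanh(score_i) gives one scalar per source position
        energies.append(sum(v * math.tanh(s) for v, s in zip(v_att, score)))
    alpha = softmax(energies)
    dim = len(enc_states[0])
    # c = sum_i alpha_i * h_i^enc
    context = [sum(a * h[k] for a, h in zip(alpha, enc_states)) for k in range(dim)]
    return alpha, context

# Toy values: identity projections and three 2-dimensional encoder states.
identity = [[1.0, 0.0], [0.0, 1.0]]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [0.5, -0.5]
alpha, context = additive_attention(enc_states, dec_state, identity, identity, [1.0, 1.0])
print(alpha, context)
```

The weights in `alpha` sum to 1, and the context is a weighted average of the encoder states, so it always lies between the smallest and largest value of each dimension.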

### Effect of Attention Mechanism:

Adding the attention mechanism to the encoder-decoder model has a large impact on its performance. The BLEU score increases from 1.8637 to 14.2469, indicating a substantial improvement in the model's ability to capture the source sentence and translate it.

The attention mechanism allows the model to focus on the most relevant elements of the input during decoding, resulting in improved translation quality. This is especially beneficial for tasks where certain elements in the input sequence are more important than others at a given step, such as machine translation or text summarization.

#### Here is the implementation:


In [5]:
class AttentionModule(nn.Module):
    def __init__(self, attention_dim):
        """
        You shouldn't delete/change any of the following defs, they are
        essential for successfully loading the saved model.
        """
        super(AttentionModule, self).__init__()
        self.W_enc = nn.Linear(attention_dim, attention_dim, bias=False)
        self.W_dec = nn.Linear(attention_dim, attention_dim, bias=False)
        self.V_att = nn.Linear(attention_dim, 1, bias=False)
        self.softmax = nn.Softmax(dim = 0)

    # 'calcAlpha' and 'forward' implement the attention equations given in the documentation
    def calcAlpha(self, decoder_hidden, encoder_out):
        """
        param encoder_out: (seq, batch, dim),
        param decoder_hidden: (seq, batch, dim)
        """
        enc = self.W_enc( encoder_out )
        dec = self.W_dec( decoder_hidden )
        scores = enc + dec
        # torch.tanh replaces the deprecated torch.nn.functional.tanh
        beta = self.V_att( torch.tanh( scores ) )
        alpha = self.softmax( beta )
        return alpha

    def forward(self, decoder_hidden, encoder_out):
        """
        encoder_out: (seq, batch, dim),
        decoder_hidden: (seq, batch, dim)
        """
        alpha = self.calcAlpha(decoder_hidden, encoder_out) # seq, batch, dim=1
        context = torch.sum(alpha * encoder_out, dim=0).unsqueeze(0)
        return context, alpha.permute(2, 1, 0)
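
As a quick shape check, the context computation in forward can be compared with an equivalent batched matrix multiplication. This sketch assumes PyTorch is installed and uses random tensors in place of real encoder outputs and attention weights; the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
seq_len, batch, dim = 5, 2, 4   # arbitrary toy sizes

encoder_out = torch.randn(seq_len, batch, dim)
# Stand-in attention weights: softmax over the source positions (dim 0).
alpha = torch.softmax(torch.randn(seq_len, batch, 1), dim=0)

# Same computation as in forward(): broadcast multiply, then sum over seq.
context_sum = torch.sum(alpha * encoder_out, dim=0).unsqueeze(0)   # (1, batch, dim)

# Equivalent batched matrix multiplication: (batch, 1, seq) @ (batch, seq, dim).
context_bmm = torch.bmm(alpha.permute(1, 2, 0), encoder_out.permute(1, 0, 2))
context_bmm = context_bmm.permute(1, 0, 2)                          # (1, batch, dim)

print(context_sum.shape, torch.allclose(context_sum, context_bmm, atol=1e-6))
```

Both forms give the same context vector; the softmax over dim 0 is what makes the weights for each sentence in the batch sum to 1 across source positions.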

## Beam search implementation


The provided beam search implementation decodes the output sequence from the sequence-to-sequence model by keeping several candidate translations alive at each step instead of committing to the single most probable token. The function takes the decoder model, encoder output, encoder hidden state, maximum sequence length, beam width, and an optional maximum iteration limit as input parameters.

### Initialization:

- Initialize variables to store output sequences, target vocabulary size, decoder hidden state, and start token.
- Create a priority queue nodes_cach to maintain beam candidates.
- Initialize an empty list end_nodes to store finalized sequences.
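
Python's heapq module is a min-heap, so one common way to pop the highest-scoring candidate first is to store the negated score. The sketch below assumes a simplified node; the `BeamNode` fields and the `push`/`pop` helpers are illustrative and not necessarily those used in the submitted code.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

@dataclass
class BeamNode:
    token: int
    log_prob: float                      # cumulative log-probability
    length: int = 1
    prev: Optional["BeamNode"] = None

    def get_seq(self):
        """Walk back through prev pointers and return the token sequence."""
        seq, node = [], self
        while node is not None:
            seq.append(node.token)
            node = node.prev
        return seq[::-1]

counter = itertools.count()   # tie-breaker so nodes themselves are never compared
queue = []                    # plays the role of the priority queue
end_nodes = []                # finalized sequences would be collected here

def push(node):
    # Negate the score because heapq pops the smallest item first.
    heapq.heappush(queue, (-node.log_prob, next(counter), node))

def pop():
    return heapq.heappop(queue)[2]

root = BeamNode(token=0, log_prob=0.0)   # token 0 stands in for the start token
push(BeamNode(token=7, log_prob=-1.2, length=2, prev=root))
push(BeamNode(token=3, log_prob=-0.4, length=2, prev=root))
best = pop()
print(best.get_seq(), best.log_prob)
```

The counter in the tuple avoids a TypeError when two nodes have the same score, since the comparison never reaches the node object.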

### Decoding Steps:

- Start with the start token and initial decoder hidden state.
- Calculate the log-probability and token with the highest probability using softmax.
- Create an initial beam node with the token, score, and decoder hidden state.
- Push the initial node onto the priority queue.

### Beam Search Loop:

- While the number of finalized sequences is less than the beam width:
  - Check if the maximum iteration limit is exceeded. If so, finalize all remaining beam candidates.
  - Pop the beam node with the highest score from the priority queue.
  - Update the current token and decoder hidden state based on the popped node.
  - Calculate the top-K most probable tokens and their corresponding log-probabilities.
  - Iterate through the top-K tokens:
    - Create a new beam node for each token with updated score, previous node, length, logits, and decoder hidden state. As suggested in (Bahdanau et al., 2014), we normalize the score by the sequence length to break the beam search curse: without normalization, a higher beam width was observed to give a lower BLEU score.
    - If the new node's length reaches the maximum sequence length, add it to the finalized sequences list.
    - Otherwise, push the new node onto the priority queue.
  - Increment the iteration counter.
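
The effect of length normalization can be seen on two hand-made hypotheses. The per-token log-probabilities below are invented for illustration and do not come from the model; the function names are likewise hypothetical.

```python
def raw_score(log_probs):
    """Sum of token log-probabilities (no normalization)."""
    return sum(log_probs)

def normalized_score(log_probs):
    """Average log-probability per token (length-normalized)."""
    return sum(log_probs) / len(log_probs)

short_hyp = [-0.9, -1.0]                     # 2 tokens, each fairly unlikely
long_hyp = [-0.5, -0.4, -0.6, -0.5, -0.5]    # 5 tokens, each more likely

for name, hyp in [("short", short_hyp), ("long", long_hyp)]:
    print(name, round(raw_score(hyp), 3), round(normalized_score(hyp), 3))
```

With the raw sum the short hypothesis wins simply because it has fewer negative terms, while the per-token average prefers the longer hypothesis whose individual tokens are more probable.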
    
### Finalization and Output:

- Sort the finalized sequences based on their scores.
- Select the best sequence with the highest score.
- Reconstruct the output sequence and target sequences using the best node's get_seq() method.
- Return the output sequence, attention weights (None in this case), and target sequences.
    
This implementation effectively utilizes beam search to explore multiple promising paths during decoding, leading to improved sequence generation compared to greedy search. The maximum iteration limit ensures that the decoding process doesn't get stuck in an endless loop.
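
The steps described in this section can be sketched end to end on a toy problem. This is a simplified illustration under stated assumptions, not the submitted function: `toy_step` and `PROBS` are hypothetical stand-ins for the decoder, there is no hidden state or attention, and an end-of-sentence token also finalizes a candidate.

```python
import heapq
import itertools
import math

SOS, EOS, A, B = 0, 1, 2, 3

# Hypothetical next-token probabilities, indexed by the last token only.
PROBS = {
    SOS: {A: 0.55, B: 0.45},
    A: {EOS: 0.50, A: 0.30, B: 0.20},
    B: {EOS: 0.95, A: 0.03, B: 0.02},
}

def toy_step(tokens):
    """Stand-in for the decoder: log-probabilities of the next token."""
    return {tok: math.log(p) for tok, p in PROBS[tokens[-1]].items()}

def beam_search(step_fn, beam_width=2, max_len=5, max_iter=100):
    counter = itertools.count()                    # tie-breaker for the heap
    queue = [(0.0, next(counter), [SOS], 0.0)]     # (-score, tie, tokens, log_prob)
    finished = []                                  # (score, tokens) pairs
    iterations = 0
    while queue and len(finished) < beam_width and iterations < max_iter:
        _, _, tokens, log_prob = heapq.heappop(queue)
        next_log_probs = step_fn(tokens)
        top_k = sorted(next_log_probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        for token, lp in top_k:
            new_tokens = tokens + [token]
            new_log_prob = log_prob + lp
            generated = len(new_tokens) - 1        # do not count the start token
            score = new_log_prob / generated       # length-normalized score
            if token == EOS or generated >= max_len:
                finished.append((score, new_tokens))
            else:
                heapq.heappush(queue, (-score, next(counter), new_tokens, new_log_prob))
        iterations += 1
    if iterations >= max_iter:
        # Iteration limit reached: finalize whatever is still in the queue.
        finished.extend((-neg, toks) for neg, _, toks, _ in queue)
    return max(finished, key=lambda item: item[0])

greedy = beam_search(toy_step, beam_width=1)
beam = beam_search(toy_step, beam_width=2)
print("width 1:", greedy)
print("width 2:", beam)
```

In this toy setting a beam width of 1 behaves like greedy search and commits to the locally best first token, while a width of 2 keeps the second candidate alive and ends with a higher-scoring sequence.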

## Analysis

Do some analysis of the results. What ideas did you try? What worked and what did not?

The addition of an attention layer to the model improved the BLEU score significantly, indicating that the model's ability to focus on relevant parts of the input sequence during decoding enhanced its translation quality. We also tried ensembling and beam search:

### Ensembling
We load all the checkpoint files in the directory that the --model command line option points to and create a list of Seq2Seq models. We then run the translation on all the models and, for the current word, obtain each model's scores for every token in the target vocabulary. First, we summed those scores across all the Seq2Seq instances and chose the token with the maximum total. However, this worsened the results. Next, we took the maximum-score (best) token from each model and selected the token chosen by most of the models. The results were better in this case but still worse than the baseline, so ensembling did not help.
Note that the --model command line option can now point to either a checkpoint file or a directory. In the first case, only the selected checkpoint file is loaded and no ensembling happens. In the second case, the code loads all the checkpoint files available in the directory and runs ensembling inference.
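
The two combination strategies described above can be sketched for a single decoding step. The scores below are invented log-probabilities over a three-token vocabulary, and the function names are hypothetical rather than taken from the submitted code.

```python
from collections import Counter

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def ensemble_by_sum(per_model_scores):
    """Sum each token's score across models, then pick the best token."""
    summed = [sum(column) for column in zip(*per_model_scores)]
    return argmax(summed)

def ensemble_by_vote(per_model_scores):
    """Each model votes for its own best token; the most common token wins."""
    votes = Counter(argmax(scores) for scores in per_model_scores)
    return votes.most_common(1)[0][0]

per_model_scores = [
    [-0.2, -2.0, -3.0],   # model 1 prefers token 0
    [-0.3, -1.8, -2.5],   # model 2 prefers token 0
    [-6.0, -0.1, -4.0],   # model 3 strongly prefers token 1
]

print("sum:", ensemble_by_sum(per_model_scores))
print("vote:", ensemble_by_vote(per_model_scores))
```

The example is constructed so that the two strategies disagree: one very confident model can dominate the summed scores, whereas majority voting ignores how confident each model is.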

### Beam Searching
The beam search algorithm did not have a noticeable impact on the results: even with a higher beam width, the results remained the same. This could be due to several factors:


#### Model Complexity:
The increased complexity of the model with the attention layer might have made it more difficult for the beam search algorithm to effectively prune less promising paths. The beam search might not have been able to differentiate between equally probable translations, leading to similar outputs.

#### Data Quality:
The quality of the training data might have played a role in the beam search's performance. Beam search itself is not trained, so if the training data did not contain enough examples of diverse and complex translations, the model might not have learned to assign clearly different scores to competing translation candidates, leaving the beam search with little to choose between.

#### Hyperparameter Tuning:
The hyperparameters of the beam search algorithm, such as the beam width and maximum iteration limit, might not have been optimally tuned for the specific model and dataset. Fine-tuning these hyperparameters could potentially improve the performance of the beam search.

#### Attention Mechanism:
The specific attention mechanism used in the model might not have been well-suited for the task or the beam search algorithm. Different attention mechanisms can have varying effects on the model's ability to focus on relevant parts of the input sequence and guide the beam search process.

In conclusion, while the attention layer significantly improved the model's translation quality, the beam search algorithm did not have a noticeable impact in this particular case. Further investigation into the factors mentioned above could help identify ways to optimize the beam search algorithm and enhance its effectiveness in conjunction with the attention layer.