In [3]:
import numpy as np
import pickle
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset

### Ding et al.'s Paper: Shape through Model

In Ding et al.'s paper, the decoder is parameterized by another fairly straightforward neural net, mostly a series of linear-tanh layers. The only distinctive part is the last layer, a softmax.

Note that line 49 of Ding et al.'s `train.py` flattens the three-dimensional `seq_msa_binary` numpy array from shape `(# seqs)` x `(alignment len in # a.a.'s)` x `(20 a.a. + 1 for gap)` into a two-dimensional array with one vector per sequence, simply by concatenating that sequence's one-hot vectors:
```python
seq_msa_binary = seq_msa_binary.reshape((num_seq, -1))
```
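
To make the effect of that reshape concrete, here is a minimal NumPy sketch on a toy array. The sizes, the random alignment `msa`, and the names `aln_len`, `num_symbols`, `flat`, and `restored` are assumptions chosen for illustration; only the `reshape((num_seq, -1))` call itself comes from `train.py`.

```python
import numpy as np

# Toy sizes for illustration only (not the values used in the paper).
num_seq, aln_len, num_symbols = 2, 3, 21  # 21 = 20 a.a. + 1 gap

# Hypothetical integer-encoded alignment, one-hot encoded via an identity matrix.
rng = np.random.default_rng(0)
msa = rng.integers(0, num_symbols, size=(num_seq, aln_len))
seq_msa_binary = np.eye(num_symbols, dtype=np.int64)[msa]  # shape (2, 3, 21)

# The same call as in train.py: one flat vector per sequence.
flat = seq_msa_binary.reshape((num_seq, -1))  # shape (2, 63)

# Each consecutive block of 21 entries is the one-hot vector of one position,
# so the original array can be recovered by reshaping back.
restored = flat.reshape((num_seq, aln_len, num_symbols))
```

Because the reshape only reinterprets the memory layout, no information is lost: `restored` is identical to `seq_msa_binary`.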

Then in the decoder (in `VAE_model.py`), the activations of the second-to-last layer are reshaped from a flat vector into a two-dimensional array with one vector per amino acid position, each representing the probability distribution over the possible a.a. types at that position.

In [17]:
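# Toy sizes: 2 sequences of 3 positions each, with 6 possible types.
alignment_len = 3
num_seqs = 2
num_types = 6

# One type index per position across all sequences: 0, 1, ..., 5
A = np.arange(num_seqs * alignment_len) % num_types
T_A = F.one_hot(torch.from_numpy(A), num_classes=num_types)
# Flatten to one concatenated one-hot vector per sequence
T_A = T_A.reshape((num_seqs, -1))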

# a tuple (# of sequences, )
fixed_shape = tuple(T_A.shape[0:-1])

print(T_A.shape, " shaped ")
T_A

torch.Size([2, 18])  shaped 


tensor([[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])

In [18]:
print("reshape to ", fixed_shape + (-1, num_types))
T_A = T_A.view(fixed_shape + (-1, num_types))
T_A

reshape to  (2, -1, 6)


tensor([[[1, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0]],

        [[0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 1]]])

In conclusion, Ding et al. do indeed
1. _first_ __completely flatten each protein's sequence__ before feeding it into the (fully connected) neural network, simply _concatenating all the one-hot vectors_ of its amino acid positions. Then,
2. the __decoder__ reshapes the activations of the second-to-last layer into a two-dimensional array with _one vector per amino acid position_, each representing a probability distribution over the 21 possible symbols (20 a.a. types plus the gap).
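
The two steps above can be sketched end to end in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ToyDecoder`, its single hidden layer, and all sizes are hypothetical, and a plain `F.softmax` stands in for whatever normalization `VAE_model.py` actually applies. Only the reshape pattern, `fixed_shape + (-1, num_types)`, mirrors the cells above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDecoder(nn.Module):
    """Hypothetical decoder: one linear-tanh layer, then a per-position softmax."""

    def __init__(self, dim_latent, num_hidden, aln_len, num_symbols):
        super().__init__()
        self.num_symbols = num_symbols
        self.hidden = nn.Linear(dim_latent, num_hidden)
        # One output unit per (position, symbol) pair, laid out as a flat vector.
        self.out = nn.Linear(num_hidden, aln_len * num_symbols)

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        logits = self.out(h)  # shape (batch, aln_len * num_symbols)
        # Keep the leading (batch) dimensions and split the last one
        # into (position, symbol), as in the notebook cell above.
        fixed_shape = tuple(logits.shape[:-1])
        logits = logits.view(fixed_shape + (-1, self.num_symbols))
        # Normalize over symbols independently at each position.
        return F.softmax(logits, dim=-1)


torch.manual_seed(0)
decoder = ToyDecoder(dim_latent=2, num_hidden=8, aln_len=3, num_symbols=21)
z = torch.randn(4, 2)  # 4 hypothetical latent vectors
probs = decoder(z)     # shape (4, 3, 21)
```

Each `probs[i, j]` is a vector of 21 non-negative entries summing to 1, so it can be compared position by position against the one-hot input once that input is reshaped the same way.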