What does n_frame_samples represent? #23

Closed
williamFalcon opened this issue Sep 21, 2018 · 13 comments

@williamFalcon commented Sep 21, 2018

The terminology is a bit confusing. What does n_frame_samples mean? Is it the number of samples per frame, or the number of frames?

Is the RNN taking in a sequence of frames (ie: a frame per timestep), where the dimension of each frame is "n_frame_samples"?

@williamFalcon (Author) commented Sep 21, 2018

ie:

data = [1,2,3,4,5,6,7,8,9,10]
frames = [[1,2], [3,4], [5,6], [7,8], [9,10]]
# so in this case we have 5 frames, each frame with 2 samples? therefore n_frame_samples = 2?

@koz4k (Member) commented Sep 21, 2018

That's correct. In our terminology, frame_size is the size of a frame in terms of lower-tier frames, and n_frame_samples is the size of a frame in terms of samples. The relationship between the two can be seen at https://github.com/deepsound-project/samplernn-pytorch/blob/master/model.py#L20.
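
To make that concrete, here's a minimal sketch of the relationship (the frame_sizes values are just the example ones from this thread, nothing canonical):

    import numpy as np

    # frame_sizes[i] = size of a tier's frame, counted in lower-tier frames
    frame_sizes = [16, 4]

    # n_frame_samples[i] = size of that tier's frame, counted in raw samples,
    # i.e. the running product of the frame sizes
    ns_frame_samples = list(map(int, np.cumprod(frame_sizes)))
    print(ns_frame_samples)  # [16, 64]

    # lowest tier here: frames of 16 samples
    # tier above it: 4 lower-tier frames = 4 * 16 = 64 samples per frame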

@williamFalcon (Author)

Thanks!

@williamFalcon (Author) commented Sep 21, 2018

To be 100% clear:

ns_frame_samples = list(map(int, np.cumprod(frame_sizes)))
frames = []
for (frame_size, n_frame_samples) in zip(frame_sizes, ns_frame_samples):
    print(f'{frame_size} frames each with {n_frame_samples} samples (ie: batch_size=128, seq_len={frame_size}, dim={n_frame_samples})')
    f = FrameLevelRNN(frame_size, n_frame_samples, n_rnn, dim, learn_h0, weight_norm)
    frames.append(f)

Do these print statements make sense?

16 frames each with 16 samples (ie: batch_size=128, seq_len=16, dim=16)   # this is the Tier 3 frame in the paper
4 frames each with 64 samples (ie: batch_size=128, seq_len=4, dim=64)     # this is the Tier 2 frame in the paper

@koz4k

@koz4k (Member) commented Sep 21, 2018

Is there a question?

@williamFalcon (Author)

@koz4k yeah, do the print statements make sense?

@koz4k (Member) commented Sep 21, 2018

Not really. I assume that by seq_len you mean the number of frames in the entire sequence, and by dim you mean the number of samples in a single frame. If so, then seq_len = n_seq_samples // n_frame_samples, where n_seq_samples is the number of samples in the sequence. dim is OK.
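
As a quick numeric check with the numbers from this thread (example values, not anything fixed by the repo):

    n_seq_samples = 1024     # samples in the whole sequence
    n_frame_samples = 64     # samples per frame at this tier
    seq_len = n_seq_samples // n_frame_samples
    print(seq_len)           # 16 frames per sequence at this tier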

@williamFalcon (Author) commented Sep 21, 2018

OK, that's what I was thinking. So what's the interpretation for the way you guys have it set up?
The first layer (tier 3) takes input like:

# does 16 here represent the original sequence divided into 16 chunks?
x = torch.randn(128, 16, 64)   # shape (128, 16, 64)
out = tier_3_frame(x)
# out.shape == (128, 64, 1024)
# does 64 here represent 64 non-overlapping parts of the original sequence?

I guess my confusion is: what do 16 and 4 represent in relation to the paper?

@koz4k (Member) commented Sep 21, 2018

Yeah, that seems right. 16 and 4 probably don't mean anything in the paper; there, "frame size" refers to the number of samples.

@williamFalcon (Author) commented Sep 21, 2018

@koz4k Thanks btw!

I think I've got it now. Looks like in your code, 64 is your "frame size", ie: for a sequence that is 1024 samples long with 64 samples per frame, you get 16 frames. In their code, they used 8 instead of 64 here.

Then every tier after divides by 4... so tier 2 would be 16, and tier 1 gets 1?

    # init example data
    import torch
    from torch.autograd import Variable

    bs = 128
    seq_len = 1024            # samples per training sequence
    frame_size = 64           # samples per frame at this tier
    steps = seq_len // frame_size   # 16 frames
    X = Variable(torch.FloatTensor(bs, steps, frame_size).random_(0, 256))

    # get conditioning context for tier 3
    tier_3 = FrameLevel(frame_size=frame_size, upsample_ratio=4)
    conditioning_c = tier_3(X, None)

Using linear layers, it would then look like this (without the init, weight norm, and hidden reset stuff, just high level):

import torch
import torch.nn as nn
from torch.autograd import Variable


class FrameLevel(nn.Module):
    """
    Generates conditioning context from the raw signal.
    """

    def __init__(self, frame_size, upsample_ratio):
        super(FrameLevel, self).__init__()

        self.frame_size = frame_size
        self.nb_layers = 2
        self.batch_size = 128
        self.hidden_size = 1024
        self.context_c = None
        self.hidden = None
        self.upsample_ratio = upsample_ratio
        self.on_gpu = False  # set this externally when the model is moved to GPU

        # input expand projection
        # (TimeDistributedLinear: applies a Linear at every timestep;
        # nn.Linear on the last dim does the same thing)
        self.x_proj_fc = TimeDistributedLinear(self.frame_size, self.hidden_size)

        # rnn
        self.rnn = nn.GRU(
            input_size=self.hidden_size,
            hidden_size=self.hidden_size,
            num_layers=self.nb_layers,
            batch_first=True
        )

        # context projection
        # this will be passed to the next module as conditioning
        context_out_size = self.upsample_ratio * self.hidden_size
        self.c_proj_fc = TimeDistributedLinear(self.hidden_size, context_out_size)

    def init_hidden(self):
        hidden = Variable(torch.zeros(self.nb_layers, self.batch_size, self.hidden_size))
        if self.on_gpu:
            hidden = hidden.cuda()
        return hidden

    def forward(self, x_seq, above_context_c):

        # project input to the rnn hidden size; add context from the tier above,
        # if there is one: inp = W*f + c
        rnn_input = self.x_proj_fc(x_seq)
        if above_context_c is not None:
            rnn_input = rnn_input + above_context_c  # out-of-place add is safer for autograd

        # run through RNN (self.hidden is None on the first call, which the GRU treats as zeros)
        output, self.hidden = self.rnn(rnn_input, self.hidden)

        # project h_t and reshape so it can be used at the next (finer) tier
        output = self.c_proj_fc(output)
        output = output.view(output.size(0), -1, self.hidden_size)

        # output will be used as conditioning context for the next tier
        # hidden is not passed back because it is saved in the frame object;
        # whenever TBPTT starts a new sequence, hidden must be reset externally
        self.context_c = output
        return self.context_c
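
For what it's worth, here's my reading of the shape flow through forward with the numbers above (just a sanity check, not code from the repo):

    import torch

    bs, steps, frame_size = 128, 16, 64
    hidden_size, upsample_ratio = 1024, 4

    # after x_proj_fc and the GRU: one hidden state per frame
    h = torch.randn(bs, steps, hidden_size)                   # (128, 16, 1024)

    # after c_proj_fc: upsample_ratio * hidden_size features per frame
    c = torch.randn(bs, steps, upsample_ratio * hidden_size)  # (128, 16, 4096)

    # the view splits each frame's state into upsample_ratio conditioning
    # vectors, one per lower-tier frame
    c = c.view(c.size(0), -1, hidden_size)
    print(c.shape)                                            # (128, 64, 1024)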

@koz4k (Member) commented Sep 21, 2018

Yes, that's right.

@williamFalcon (Author)

@koz4k sweet... thanks!

@koz4k (Member) commented Sep 21, 2018

Np :)
