Currently, if the context's sequence length exceeds max_length (512 tokens), it is truncated before relevancy scoring. Instead, such contexts should be split into chunks of fewer than 512 tokens, each chunk scored separately, and the chunk scores averaged.
Changes for this could be made here.
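The proposed chunk-then-average behavior could be sketched roughly as below. This is a minimal illustration, not the project's actual code: `tokenize` and `score_fn` are placeholders for whatever tokenizer and relevancy scorer the project really uses.

```python
# Sketch of chunk-then-average relevancy scoring for long contexts.
# `tokenize` and `score_fn` are hypothetical stand-ins for the real
# tokenizer and scoring model used by the project.

MAX_LENGTH = 512

def chunk_tokens(tokens, max_length=MAX_LENGTH):
    """Split a token list into consecutive chunks of at most max_length tokens."""
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]

def score_chunked(context, query, score_fn, tokenize=str.split,
                  max_length=MAX_LENGTH):
    """Score each chunk of `context` against `query`, then average the scores
    instead of truncating the context to the first max_length tokens."""
    tokens = tokenize(context)
    chunks = chunk_tokens(tokens, max_length)
    scores = [score_fn(" ".join(chunk), query) for chunk in chunks]
    return sum(scores) / len(scores)
```

With a real model, `score_fn` would run the relevancy model on each (chunk, query) pair; here the averaging step is the point, so any per-chunk scorer plugs in unchanged.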