I ran into a problem using a reinforcement-learning dueling double deep Q-network on SageMaker. The endpoint predictions almost always return class 1, while evaluating the same dataset against the same model checkpoint gives predictions spread evenly across the classes. For the output node I used main_level/agent/main/online/network_0/dueling_q_values_head_0/output, which is the last operation in the dueling Q-network.
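For context, here is a minimal sketch of how I understand the prediction path, pulling Q-values from that output node in a frozen TF1 graph and taking the argmax as the predicted class. The graph path, input placeholder name, and observation shape below are illustrative placeholders; only the output node name is taken from my actual setup.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

GRAPH_PB = "frozen_graph.pb"  # hypothetical path to the exported graph
OUTPUT_OP = "main_level/agent/main/online/network_0/dueling_q_values_head_0/output"
INPUT_OP = "main_level/agent/main/online/network_0/observation/observation"  # assumed placeholder name
OBS_SHAPE = (1, 4)  # hypothetical observation shape; replace with the real one

# Load the exported graph definition
with tf.gfile.GFile(GRAPH_PB, "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    obs_tensor = graph.get_tensor_by_name(INPUT_OP + ":0")
    q_values = graph.get_tensor_by_name(OUTPUT_OP + ":0")

with tf.Session(graph=graph) as sess:
    # One dummy observation; in practice this is a row from my dataset
    obs = np.zeros(OBS_SHAPE, dtype=np.float32)
    qs = sess.run(q_values, feed_dict={obs_tensor: obs})
    action = int(np.argmax(qs, axis=1)[0])
    print(qs, action)
```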
Is this a bug in the checkpoint deployment, or is my output operation wrong? Any help would be appreciated.