This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Request cannot span more than 24 hours Error #61

Closed
mr337 opened this issue Jun 25, 2020 · 15 comments · Fixed by #64
Labels
bug Something isn't working

Comments

@mr337 commented Jun 25, 2020

I am starting to see odd errors in our logs from the CloudWatch Log node when running the application live.

[ WARN] [1593087736.376730661, 30.901000000]: [AWSErrorMarshaller] Encountered AWSError 'InvalidParameterException': The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
[ERROR] [1593087736.376854123, 30.901000000]: [AWSClient] HTTP response code: 400
Exception name: InvalidParameterException
Error message: The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
5 response headers:
connection : close
content-length : 139
content-type : application/x-amz-json-1.1
date : Thu, 25 Jun 2020 12:22:15 GMT
x-amzn-requestid : 6e2c66d4-baf8-48b8-af82-bb56bad020f0
[ WARN] [1593087736.376925890, 30.901000000]: [AWSClient] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[ERROR] [1593087736.377026834, 30.901000000]: [SendLogsRequest] Send log request failed due to: The batch of log events in a single PutLogEvents request cannot span more than 24 hours., with error code: 131
[ERROR] [1593087736.377057525, 30.901000000]: [SendLogsToCloudWatch] Failed to send to CloudWatch in Log Group: ab200sim Log Stream: 4a54bac8-efb7-47a4-9ee5-493d928ae446 with error code: 1

What is really odd is that after this error we still get application logs shipped to CloudWatch, and we have been using them for weeks.

Things I have checked:

  • The timestamp is consistent with the epoch time, no skewed system time
  • Latest updates for the package

Using Version: 2.3.1-1xenial-20200330-153719+0000

@issue-label-bot

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.83. Please mark this comment with 👍 or 👎 to give our bot feedback!


issue-label-bot added the bug label Jun 25, 2020
@mr337 commented Jun 25, 2020

Looking at the rosout_agg node, the header stamp does look to be an uptime timestamp, unlike the metrics.

header: 
  seq: 745
  stamp: 
    secs: 690
    nsecs: 501000000
  frame_id: ''
level: 2
name: "/cloudwatch_logger"
msg: "[onPublisherStateChange] Publisher state has changed to: available"
file: "/tmp/binarydeb/ros-kinetic-aws-ros1-common-2.0.1/src/sdk_utils/logging/aws_ros_logger.cpp"
function: "Utils::Logging::AWSROSLogger::LogInfo"
line: 32
topics: [/rosout]

@mr337 commented Jun 25, 2020

Also, a related error in another RoboMaker ROS node seems to be time-related, if it helps: aws-robotics/cloudwatchmetrics-ros1#46

@dabonnie

Hey @mr337

Thanks for filing this issue as well. Can you please verify that there is no stale data in the ~/.ros/cwlogs/ folder, similar to aws-robotics/cloudwatchmetrics-ros1#46?

@mr337 commented Jun 26, 2020

Following up from aws-robotics/cloudwatchmetrics-ros1#46 (comment): logs are what we consider more important and would like to always be uploaded. It looks like the logs are being pushed up in a payload containing log events whose timestamps span more than 24 hours, so CWL rejects the payload.

As a workaround, would reducing file_upload_batch_size from 50 to something small like 1 reduce the likelihood of encountering such an error?
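For reference, the constraint is easy to state concretely. A minimal sketch in Python, assuming PutLogEvents-style events with epoch-millisecond timestamps (a hypothetical helper, not code from this node):

DAY_MS = 24 * 60 * 60 * 1000  # PutLogEvents timestamps are epoch milliseconds

def batch_spans_too_long(events):
    # events: list of {"timestamp": int_ms, "message": str} dicts
    stamps = [e["timestamp"] for e in events]
    return max(stamps) - min(stamps) > DAY_MS

# Events pulled from offline files written days apart (e.g. a 2020-06-23
# file and a 2020-06-26 file) trip the check when combined into one batch:
batch = [{"timestamp": 1592903458000, "message": "old"},
         {"timestamp": 1593195305000, "message": "new"}]
print(batch_spans_too_long(batch))  # True -> InvalidParameterException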

@dabonnie

Given the PutLogEvents documentation, I would expect log data to be accepted if it's up to 2 weeks old. If the system was unable to upload logs for a period greater than two weeks, then currently we are unable to upload them :-(

Since the data in ~/.ros/cwlogs/ is specific to the cloudwatch-logs upload functionality, I would suggest purging the old files and attempting the upload again. If that is not an option, using a smaller upload batch size may avoid attempting to upload stale data, but it comes at the cost of decreased bandwidth.

How does this sound?
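If purging manually, a small Python sketch that removes only files older than the 14-day window PutLogEvents will accept (the ~/.ros/cwlogs path and cwlog*.log naming are from this thread; run it with the logger node stopped):

import glob
import os
import time

CUTOFF = time.time() - 14 * 24 * 3600  # PutLogEvents rejects events older than ~14 days
for path in glob.glob(os.path.expanduser("~/.ros/cwlogs/cwlog*.log")):
    if os.path.getmtime(path) < CUTOFF:
        os.remove(path)
        print("purged", path)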

@mr337 commented Jun 26, 2020

TIL that CWL doesn't accept logs older than 2 weeks as well. I think for our application we are OK with that. The concern I have is that the error on the log node seems to occur when uploading logs that are within the 2-week period.

For example, these files are all from within the last 4-ish days:

$ ls ~/.ros/cwlogs 
cwlog2020-06-23_09-10-58-0.log  cwlog2020-06-25_08-42-32-0.log  token_store.info
cwlog2020-06-23_09-16-23-0.log  cwlog2020-06-26_18-15-05-0.log

And I still receive the same "The batch of log events in a single PutLogEvents request cannot span more than 24 hours." error, with error code 131.

[ INFO] [1593213347.383233723, 30.229000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1593213347.537750453, 30.311000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1593213347.688859787, 30.428000000]: [SendLogs] Sending logs to CW
[ INFO] [1593213347.689853891, 30.428000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1593213347.973261900, 30.645000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1593213347.973564004, 30.645000000]: [readBatch] Reading Logbatch
[ INFO] [1593213347.978477776, 30.647000000]: [readBatch] Reading Logbatch
[ INFO] [1593213347.979023272, 30.647000000]: [SendLogs] Sending logs to CW
[ INFO] [1593213347.979830340, 30.647000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1593213347.981544168, 30.647000000]: [readBatch] Reading Logbatch
[ INFO] [1593213347.992540090, 30.648000000]: [read] Data is no longer available to read.
[ INFO] [1593213348.185021522, 30.768000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1593213348.185087880, 30.768000000]: [fileUploadCompleteStatus] Total items uploaded: 50
[ INFO] [1593213348.185470798, 30.768000000]: [SendLogs] Sending logs to CW
[ INFO] [1593213348.187257322, 30.768000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ WARN] [1593213348.405147533, 30.961000000]: [AWSErrorMarshaller] Encountered AWSError 'InvalidParameterException': The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
[ERROR] [1593213348.405273183, 30.961000000]: [AWSClient] HTTP response code: 400
Exception name: InvalidParameterException
Error message: The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
5 response headers:
connection : close
content-length : 139
content-type : application/x-amz-json-1.1
date : Fri, 26 Jun 2020 23:15:47 GMT
x-amzn-requestid : fb38f16b-1445-4ef1-a9c7-f6b21ab9de2d
[ WARN] [1593213348.405358547, 30.961000000]: [AWSClient] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[ERROR] [1593213348.406029900, 30.962000000]: [SendLogsRequest] Send log request failed due to: The batch of log events in a single PutLogEvents request cannot span more than 24 hours., with error code: 131
[ERROR] [1593213348.406071218, 30.962000000]: [SendLogsToCloudWatch] Failed to send to CloudWatch in Log Group: ab200sim Log Stream: 4a54bac8-efb7-47a4-9ee5-493d928ae446 with error code: 1
[ WARN] [1593213348.406098423, 30.962000000]: [SendLogs] Unable to send logs to CloudWatch, retrying ...
[ INFO] [1593213348.406315606, 30.962000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ WARN] [1593213348.843280890, 31.312000000]: [SendLogs] Unable to send logs to CloudWatch
[ WARN] [1593213348.843455163, 31.312000000]: [publishData] error finishing SendLogs 1
[ INFO] [1593213348.843488525, 31.312000000]: [onPublisherStateChange] Publisher state has changed to: unavailable

All those logs are within the 2-week window and should be uploadable.

@dabonnie

You're absolutely right. Please let me know if you can't work around this issue by setting a smaller upload batch size and trying the options proposed below.

I think we can try a few other things with the config params (see the sketch after this list):

  • run the node without subscribing to any log output: this will allow you to upload from disk without listening to any new data
  • adjust the file_max_queue_size to 1 as well to only send data from a single file
  • adjust the file_upload_batch_size to 1 to ensure the uploaded batch doesn't span 24h
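As a rough illustration of the second and third options, the values can be pushed to the parameter server before restarting the node. The /cloudwatch_logger private namespace and the node re-reading its parameters at startup are assumptions here, so match this to your launch configuration:

import rospy

# Hypothetical wiring: set the suggested values, then restart
# cloudwatch_logger so it picks them up at startup.
rospy.init_node("cw_param_setter", anonymous=True)
rospy.set_param("/cloudwatch_logger/file_max_queue_size", 1)
rospy.set_param("/cloudwatch_logger/file_upload_batch_size", 1)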

@mr337 commented Jun 26, 2020

I will try that. I wasn't sure whether I was interpreting those config params correctly. Will report back in a few days.

@mr337 commented Jul 7, 2020

I have been doing some testing to try to find a good set of parameters and so far no luck.

Here is what I have tried:

  • file_max_queue_size: 1, which makes sense, as each file in ~/.ros/cwlogs should cover a 24-hour period?
  • file_upload_batch_size: 1-10 (5 in the log below), which at first glance seems to work but then hits a throttle error

The sending of cached logs does not seem to honor the publish frequency and so runs into the CloudWatch Logs API throttle.

[ INFO] [1594123577.100182436, 34.639000000]: [fileUploadCompleteStatus] Total items uploaded: 215
[ INFO] [1594123577.101845707, 34.639000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123577.102641180, 34.643000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1594123577.115053568, 34.659000000]: [readBatch] Reading Logbatch
[ INFO] [1594123577.279790079, 34.768000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1594123577.280345720, 34.768000000]: [fileUploadCompleteStatus] Total items uploaded: 220
[ INFO] [1594123577.280768058, 34.768000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123577.281258849, 34.768000000]: [readBatch] Reading Logbatch
[ INFO] [1594123577.282001411, 34.768000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1594123577.449216726, 34.842000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1594123577.449806603, 34.842000000]: [fileUploadCompleteStatus] Total items uploaded: 225
[ INFO] [1594123577.450003619, 34.842000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123577.450735318, 34.842000000]: [readBatch] Reading Logbatch
[ INFO] [1594123577.451164743, 34.842000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1594123577.616988298, 34.905000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1594123577.617050819, 34.905000000]: [fileUploadCompleteStatus] Total items uploaded: 230
[ INFO] [1594123577.617249252, 34.905000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123577.617523011, 34.905000000]: [readBatch] Reading Logbatch
[ INFO] [1594123577.618274838, 34.905000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1594123577.784455858, 34.991000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1594123577.784533900, 34.991000000]: [fileUploadCompleteStatus] Total items uploaded: 235
[ INFO] [1594123577.784772493, 34.991000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123577.785212344, 34.991000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ INFO] [1594123577.785425092, 34.991000000]: [readBatch] Reading Logbatch
[ INFO] [1594123577.950393045, 35.102000000]: [onPublisherStateChange] Publisher state has changed to: available
[ INFO] [1594123577.951176582, 35.102000000]: [fileUploadCompleteStatus] Total items uploaded: 240
.
.
.
[ WARN] [1594123581.721042767, 37.140000000]: [AWSErrorMarshaller] Encountered AWSError 'ThrottlingException': Rate exceeded
[ERROR] [1594123581.721934338, 37.140000000]: [AWSClient] HTTP response code: 400
Exception name: ThrottlingException
Error message: Rate exceeded
5 response headers:
connection : close
content-length : 58
content-type : application/x-amz-json-1.1
date : Tue, 07 Jul 2020 12:06:21 GMT
x-amzn-requestid : 70ebd053-3a73-4114-890d-1f3e81dc0eda

Setting anything larger than 10, I hit the 24-hour span issue as before, which is odd. Here is a test with a log file a week old.

 $ ls ~/.ros/cwlogs                  
cwlog2020-07-01_09-37-37-0.log  cwlog2020-07-07_07-05-07-0.log

The ROS log of it failing on the 24-hour span when file_upload_batch_size: 50 (the default):

[ INFO] [1594123961.172877999, 31.332000000]: [fileUploadCompleteStatus] Total items uploaded: 350
[ INFO] [1594123961.203803237, 31.336000000]: [SendLogs] Sending logs to CW
[ INFO] [1594123961.206706029, 31.336000000]: [CurlHandleContainer] Connection has been released. Continuing.
[ WARN] [1594123961.410863358, 31.363000000]: [AWSErrorMarshaller] Encountered AWSError 'InvalidParameterException': The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
[ERROR] [1594123961.411046135, 31.363000000]: [AWSClient] HTTP response code: 400
Exception name: InvalidParameterException
Error message: The batch of log events in a single PutLogEvents request cannot span more than 24 hours.
5 response headers:
connection : close
content-length : 139
content-type : application/x-amz-json-1.1
date : Tue, 07 Jul 2020 12:12:41 GMT
x-amzn-requestid : f1bf30fd-3ebd-46d9-8d1e-17a3e32145a9

My guess is that file_max_queue_size is allowing more than a single file to be read, thus hitting the 24-hour span error. Thoughts?

@dabonnie commented Jul 8, 2020

@mr337 that is unfortunate: I think I have an idea why, and we will investigate further. Thanks for attempting the workaround and providing your results in such great detail!
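Conceptually, the uploader has to group buffered events so that no single request breaks the service limits. A hedged Python sketch of that grouping (illustrative only, not the actual cloudwatch-common implementation):

DAY_MS = 24 * 60 * 60 * 1000

def split_batches(events, max_batch_size=50):
    # Split events (epoch-millisecond timestamps) so that no batch exceeds
    # the configured size or spans more than 24 hours.
    batches, current = [], []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if current and (len(current) >= max_batch_size
                        or event["timestamp"] - current[0]["timestamp"] > DAY_MS):
            batches.append(current)
            current = []
        current.append(event)
    if current:
        batches.append(current)
    return batches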

@jikawa-az self-assigned this Aug 6, 2020
@dabonnie commented Aug 6, 2020

@mr337 Sorry for the delay: we are actively looking at this issue.

@mr337 commented Aug 6, 2020

No problem, thanks for the update.

@dabonnie

Hi @mr337,

We've recently pushed fixes for aws-robotics/cloudwatch-common#53 and aws-robotics/cloudwatch-common#54 to a dependency of this project.

If you're installing from source then you should see the changes when #64 is merged. Otherwise we can keep you updated on when we bloom-release the fixes to the packages.

@mr337 commented Sep 10, 2020

Awesome and thank you!
