This repository has been archived by the owner. It is now read-only.

Getting fewer responses than expected? #42

Closed
agustintorres opened this Issue Sep 23, 2016 · 11 comments


agustintorres commented Sep 23, 2016

I am executing the following:

ParallelTask task = pc.prepareHttpGet("/networth/$" + NMUID_VARIABLE).async()
                .setReplaceVarMapToSingleTargetSingleVar(NMUID_VARIABLE, clientIds, NETWORTH_HOST)
                .setResponseContext(responseContext)
                .setConcurrency(concurrency)
                .execute(responseHandler);

The size of the "clientIds" collection is 19749. However, after the task is done, I get the following results:

Result Brief Summary
 {
  "FAIL_GET_RESPONSE: java.util.concurrent.TimeoutException: No response received after 14000": 274,
  "200 OK": 10657
}

That is only 274 + 10657 = 10931 of the 19749 requests accounted for. How is this possible?

jeffpeiyt commented Sep 23, 2016

@agustintorres, do you mean the count does not match?

Normally this happens with duplicated requests, since we deduplicate and use a hashmap to store the distinct requests.

Could you try executing just 100 of the clientIds and see if any are missing? Also try putting the clientIds into a HashSet first to see if there are any duplicates.
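The duplicate check suggested above can be sketched with standard collections (the class and variable names here are illustrative, not part of the Parallec API):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateCheck {

    // Returns how many entries in clientIds collapse away when deduplicated.
    public static int countDuplicates(Collection<String> clientIds) {
        Set<String> unique = new HashSet<>(clientIds);
        return clientIds.size() - unique.size();
    }

    public static void main(String[] args) {
        List<String> clientIds = List.of("id-1", "id-2", "id-2", "id-3");
        System.out.println("duplicates: " + countDuplicates(clientIds));
    }
}
```

If this prints a nonzero count, the hashmap-based deduplication described above would explain the shortfall.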

agustintorres commented Sep 23, 2016

@jeffpeiyt Yes, that's what I mean.

I have also confirmed that there are no duplicates: they are 19749 unique requests.

If I make smaller requests (100 or even 8000), I do not miss any. It is only the full set of 19749 for which I am missing results.

jeffpeiyt commented Sep 23, 2016

Interesting.

Could you please try the following:

  • Pass your hashmap to responseContext and save the responses/APIs into it during onCompleted; then, after the task is done, examine which APIs you miss. That tells you whether any request failed to go through onCompleted.
  • Debug and examine parallelTaskResult to see the entry count of this hashmap.
  • Check the logs. They give a pretty good display of the progress, such as "5000/6000 completed". Is the total number correct while the task is running?

Some internal teams have similar single-target usage at this scale and did not report this. I have not encountered it before either.
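The first suggestion amounts to collecting responses keyed by request id, then diffing against the full id set once the task finishes. A minimal sketch with standard collections (names are illustrative and not part of the Parallec API):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MissingRequests {

    // Given the full set of requested ids and the responses collected in
    // onCompleted (keyed by id), returns the ids that never got a response.
    public static Set<String> findMissing(Collection<String> requested,
                                          Map<String, String> responsesById) {
        Set<String> missing = new HashSet<>(requested);
        missing.removeAll(responsesById.keySet());
        return missing;
    }

    public static void main(String[] args) {
        Set<String> requested = Set.of("id-1", "id-2", "id-3");
        Map<String, String> responses = Map.of("id-1", "200 OK", "id-3", "200 OK");
        System.out.println("missing: " + findMissing(requested, responses));
    }
}
```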

agustintorres commented Sep 23, 2016

@jeffpeiyt Here are some findings:

The job progresses steadily, and at some point it jumps from about 54% progress to 100%. It seems to get terminated midway. Here are some of the things I see in the logs:

9/23/2016 17:31:58:816 .c.a.OperationWorker [atcher-6] INFO  - asyncWorker has not been initilized (null). Will not tell it cancel

09/23/2016 17:31:58:826 c.a.ExecutionManager [atcher-3] INFO  - ExecutionManager sending cancelPendingRequest at time: 2016-09-23 17:31:58.826-0400

09/23/2016 17:31:58:831 c.a.ExecutionManager [atcher-3] INFO  - task.totalJobNumActual : 19749 InitCount: 19749

09/23/2016 17:31:58:833 c.a.ExecutionManager [atcher-3] INFO  - task.response received Num 10716 

09/23/2016 17:31:58:836 c.a.ExecutionManager [atcher-3] INFO  - COMPLETED_WITH_ERROR.  19749 at time: 2016.09.23.17.31.58.836-0400

09/23/2016 17:31:58:839 .ParallelTaskManager [Thread-5] INFO  - !!COMPLETED sendTaskToExecutionManager : PT_19749_20160923172157463_df655ac0-66f at 2016-09-23 17:31:58.839-0400          GenericResponseMap in future size: 10716
09/23/2016 17:31:58:839 c.a.ExecutionManager [atcher-3] INFO  - 
Time taken to get all responses back : 601.318 secs

09/23/2016 17:31:58:842 .ParallelTaskManager [Thread-5] INFO  - Removed task PT_19749_20160923172157463_df655ac0-66f from the running inprogress map... . This task should be garbage collected if there are no other pointers.

I am putting everything into the responseContext in the onCompleted method, and the size of that map at the end is 10718.

Further, task.getParallelTaskResult().keySet().size() is 19749.

Keep in mind that the job runs for about 10 minutes before it gets cancelled. Why would it terminate? Maybe there is some sort of timeout?

jeffpeiyt commented Sep 23, 2016

Sorry, my bad. We got reports of the same issue from internal users. It is very easy to fix: a global timeout is killing the whole job.

jeffpeiyt commented Sep 23, 2016

#38 tracks this issue. Please set the timeout to a larger value; the default is 600 seconds.

defaults:

    /**
     * The command manager internal timeout and cancel itself time in seconds
     * Note this may need to be adjusted for long polling jobs.
     */
    public static long timeoutInManagerSec = 600;

    /** The timeout the director send to the manager to cancel it from outside. */
    public static long timeoutAskManagerSec = timeoutInManagerSec + 10;
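A minimal sketch of the resulting fix, reusing the builder chain from the original question and the ParallelTaskConfig setters shown elsewhere in this thread (the 3600-second value is only an illustration; pick anything comfortably above your expected run time):

```java
// Sketch: raise the job-level timeout above the expected total run time.
// The run above (19749 requests) took ~601 s, just over the 600 s default.
ParallelTaskConfig config = new ParallelTaskConfig();
config.setTimeoutInManagerSec(3600);        // manager cancels itself after this
config.setTimeoutAskManagerSec(3600 + 10);  // keep the +10 s relation of the defaults

ParallelTask task = pc.prepareHttpGet("/networth/$" + NMUID_VARIABLE).async()
        .setConfig(config)
        .setReplaceVarMapToSingleTargetSingleVar(NMUID_VARIABLE, clientIds, NETWORTH_HOST)
        .setResponseContext(responseContext)
        .setConcurrency(concurrency)
        .execute(responseHandler);
```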

@jeffpeiyt jeffpeiyt self-assigned this Sep 23, 2016

agustintorres commented Sep 23, 2016

@jeffpeiyt Thanks! That fixed my problem and it works great now. I'm actually planning to do something similar for around 300,000 clientIds, even if it takes 5+ hours. Do you foresee any problems with it running for this long?

jeffpeiyt commented Sep 23, 2016

@agustintorres Great! I am updating the documentation to be clearer on this.

I do not see any problems. We run jobs on 100,000+ hosts and they run fine. Please let us know about any issues you encounter.

@jeffpeiyt jeffpeiyt closed this Sep 23, 2016

harjitdotsingh commented Jun 23, 2018

I'm still seeing this....

2018-06-23 00:46:03.401 INFO 20095 --- [ParallecActorSystem-akka.actor.default-dispatcher-10] io.parallec.core.actor.ExecutionManager :
[4]__RESP_RECV_IN_MGR 4 (+0) / 4 (100.00%) AFT 14.133 S @ API_2 @ 2018.06.23.00.46.03.401-0400 , TaskID : 0f39b86b-82a , CODE: NA, RESP_BRIEF: EMPTY , ERR: java.util.concurrent.TimeoutException: No response received after 14000

I have the following config set, as per the docs:

private ParallelTaskConfig genParallelTaskConfig() {
    ParallelTaskConfig config = new ParallelTaskConfig();
    config.setActorMaxOperationTimeoutSec(120);
    config.setTimeoutInManagerSec(120);
    config.setTimeoutAskManagerSec(710);
    return config;
}

This is my task call

.setHttpHeaders(new ParallecHeader()
        .addPair("x-user", env.getProperty("ifi.user"))
        .addPair("x-password", env.getProperty("ifi.password")))
.setProtocol(RequestProtocol.HTTPS)
.setHttpPort(443)
.setConfig(genParallelTaskConfig())
.setTcpConnectTimeoutMillis(100 * 120)
.async()
.setReplaceVarMapToSingleTargetSingleVar("QUERY", queryList, "cdws21.ificlaims.com")
.setResponseContext(returnMap)
.execute((res, responseContext) ->
