Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible deadlock with dataloader #555

Closed
servonlewis opened this issue Jun 29, 2021 · 28 comments
Closed

Possible deadlock with dataloader #555

servonlewis opened this issue Jun 29, 2021 · 28 comments
Labels
bug Something isn't working

Comments

@servonlewis
Copy link

Hey guys, my dataloader seems to be locking up and not accepting any new request after using the request several times.

This seems to be the only dataloader with this issue, but it is also the only dataloader that i use reqwest rather than my tiberius sql db.

It happens after receiving several interupted request from the client, but never happens if i send a single request using the playground.

That dataloader will not process any further request, and the rest of the query can succeed if i take that part out of the query

The expected behavior is that it does not hang ever.
Please let me know if you require the implementation of any of the functions shown.

#[derive(
    Default,
    Clone,
    PartialEq,
    serde_derive::Serialize,
    serde_derive::Deserialize,
    SimpleObject,
    Debug,
)]
#[graphql(complex)]
#[serde(rename_all = "camelCase")]
pub struct ServiceNowRelCi {
    #[serde(rename = "connection_strength")]
    pub connection_strength: ConnectionStrength,
    pub parent: Parent,
    #[serde(rename = "sys_mod_count")]
    pub sys_mod_count: SysModCount,
    #[serde(rename = "sys_updated_on")]
    pub sys_updated_on: SysUpdatedOn,
    #[serde(rename = "sys_tags")]
    pub sys_tags: SysTags,
    #[serde(rename = "type")]
    pub type_field: Type,
    #[serde(rename = "sys_id")]
    pub sys_id: SysId,
    #[serde(rename = "sys_updated_by")]
    pub sys_updated_by: SysUpdatedBy,
    pub port: Port,
    #[serde(rename = "sys_created_on")]
    pub sys_created_on: SysCreatedOn,
    #[serde(rename = "percent_outage")]
    pub percent_outage: PercentOutage,
    #[serde(rename = "sys_created_by")]
    pub sys_created_by: SysCreatedBy,
    pub child: Child,
}

#[ComplexObject]
impl ServiceNowRelCi {
    pub async fn children(&self, ctx: &Context<'_>) -> FieldResult<Option<Vec<ServiceNowRelCi>>> {
        let data: Option<Vec<ServiceNowRelCi>> = ctx
            .data_unchecked::<DataLoader<ServiceNowRelCiChildrenLoader>>()
            .load_one(self.child.value.clone())
            .await?;

        Ok(data)
    }
//rest
}
pub struct ServiceNowRelCiChildrenLoader(Client);

impl ServiceNowRelCiChildrenLoader {
    pub fn new(pool: Client) -> Self {
        Self(pool)
    }
}

fn query_rel_ci_from_sn_url(keys: &[String]) -> String {
    format!("https://nb.service-now.com/api/now/table/cmdb_rel_ci?sysparm_query=parentIN{}&sysparm_display_value=all",keys.join(","))
}

///DataLoader with an array as child.  
///DataLoader using a rest endpoint
#[async_trait]
impl Loader<String> for ServiceNowRelCiChildrenLoader {
    type Value = Vec<ServiceNowRelCi>;
    type Error = FieldError;

    async fn load(&self, keys: &[String]) -> Result<HashMap<String, Self::Value>, Self::Error> {
        let query = query_rel_ci_from_sn_url(keys);

        let stream: Vec<ServiceNowRelCi> = CmdbCiService::query_service_now_get_builder::<RootServiceNowRelCi>(
            self.0.clone(),
            query.as_str(),
        )
        .await?
        .result;

        let stream = stream
            .clone()
            .into_iter()
            .map(|data: ServiceNowRelCi| {
                (
                    data.parent.value.clone(),
                    stream
                        .clone()
                        .into_iter()
                        .filter(|filtered| &data.parent.value == &filtered.parent.value)
                        .collect(),
                )
            })
            .collect::<HashMap<String, Vec<ServiceNowRelCi>>>();

        Ok(stream)
    }
}
    ///Get request from service now, just requires url
    async fn query_service_now_get_builder<T>(
        client: reqwest::Client,
        url: &str,
    ) -> anyhow::Result<T>
    where
        T: DeserializeOwned,
    {
        dotenv().ok();
        let username = env::var("SN_API_KEY_USER").expect("SN_API_KEY_USER must be set");
        let password = env::var("SN_API_KEY_PASS").expect("SN_API_KEY_PASS must be set");
        client
            .get(url)
            .basic_auth(username, Some(password))
            .send()
            .await?
            .json()
            .await
            .map_err(Into::into)
    }
@servonlewis servonlewis added the bug Something isn't working label Jun 29, 2021
@servonlewis
Copy link
Author

This was my first time doing a rest api call rather than a database call inside a dataloader, and this is the only dataloader which has this issue.

Wonder if there is any correlation. Maybe interrupted request never finish and get locked? Not sure

@sunli829
Copy link
Collaborator

sunli829 commented Jul 1, 2021

Can you provide a minimal use case that can reproduce this problem? 😁

@servonlewis
Copy link
Author

Can you provide a minimal use case that can reproduce this problem? 😁

Sorry I am not sure how to reduce it, as all other data loaders I've ever created do not have this issue. Only difference was not using Tiberius for data but instead reqwest

@oeed
Copy link
Contributor

oeed commented Jul 12, 2021

I am intermittently encountering what appears to be a similar issue with Elasticsearch. After the .await for the request the server just deadlocks, but only occasionally.

@servonlewis are you happening to use the work around mentioned at the end of #489 to get actix-web v4 working with Tokio 1.0? Wonder if that's to blame.

@servonlewis
Copy link
Author

I am intermittently encountering what appears to be a similar issue with Elasticsearch. After the .await for the request the server just deadlocks, but only occasionally.

@servonlewis are you happening to use the work around mentioned at the end of #489 to get actix-web v4 working with Tokio 1.0? Wonder if that's to blame.

Hey I am actually using warp with tokio 1

@tombowditch
Copy link

I've just ran into this exact issue - in developing my backend app I've had no issues querying via the playground but taking a proper client (apollo client in this case) into it and developing a frontend I've ran into this.

I'm unsure what exactly is causing this (I'm using foundationdb in my dataloader), but I'll attempt to produce a small POC for further debugging.

One of the many times I ran into this, I did manage to produce a panic out of it:

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Canceled', /Users/tom/.cargo/registry/src/github.com-1ecc6299db9ec823/async-graphql-2.9.9/src/dataloader/mod.rs:334:18

I am unsure if this is related though; it is possible it is entirely unrelated.

@tombowditch
Copy link

I think I have narrowed it down:

  • the deadlock seems to happen when a pending request is cancelled (see screenshot below) (I am using apollo client in react for example)
  • I can call the GQL service with 10+ requests/sec using curl with no deadlocks
  • the action inside the dataloader is Delay, whereas other requests it is StartFetch

cancelled screenshot: all localhost requests are POSTs to the graphql endpoint, you can see the first is cancelled which makes the third pending forever
Screenshot 2021-11-15 at 19 59 59

In this code here in the dataloader: https://github.com/async-graphql/async-graphql/blob/master/src/dataloader/mod.rs#L359-L366

When a request is successful, the action is StartFetch
when I get a deadlock, the action is Delay

I added a println here https://github.com/tombowditch/async-graphql/blob/dataloader-race/src/dataloader/mod.rs#L379

Working request:
Screenshot 2021-11-15 at 20 03 15

Deadlocked request:
Screenshot 2021-11-15 at 20 03 45

Which leads me to believe the await here https://github.com/async-graphql/async-graphql/blob/master/src/dataloader/mod.rs#L368 is awaiting forever?

@sunli829 your input would be appreciated, this is where my lack of knowledge of the codebase fails me :) I hope that is helpful at least :)

@sunli829
Copy link
Collaborator

Thank you, these can definitely help me. @tombowditch

@servonlewis
Copy link
Author

Thank you, these can definitely help me. @tombowditch

Hey there,

I appreciate you guys taking this up, as I am starting to see the error again.

As the individual said above, it is only occurring after canceling a request. I am also using Apollo client, so I wonder if there is a proper cancel mechanism on that library.

Has there been any update to this issue?

@sunli829
Copy link
Collaborator

sunli829 commented Dec 4, 2021

I think I have found the cause of the problem and I am fixing it.

sunli829 added a commit that referenced this issue Dec 4, 2021
@sunli829
Copy link
Collaborator

sunli829 commented Dec 4, 2021

I fixed this in the master branch.

Now the dataLoader::new function adds a parameter to specify the function that spawning task, because I don't want async-graphql to depend on a specific runtime.

DataLoader::new(MyLoader, tokio::spawn)
DataLoader::new(MyLoader, async_std::task::spawn)

@tombowditch
Copy link

@sunli829 thank you - I just pulled down the latest commit in Cargo and can confirm I'm not seeing the locks anymore! This is with tokio

@servonlewis
Copy link
Author

I fixed this in the master branch.

Now the dataLoader::new function adds a parameter to specify the function that spawning task, because I don't want async-graphql to depend on a specific runtime.

DataLoader::new(MyLoader, tokio::spawn)

DataLoader::new(MyLoader, async_std::task::spawn)

Thank you! How can I pull it down to test? Can you please show me the toml snippet?

@tombowditch
Copy link

@servonlewis like this:

async-graphql = { git = "https://github.com/async-graphql/async-graphql.git", features = ["log", "apollo_tracing", "tracing", "dataloader"], rev = "d2a71377a933eb32f8f3f0e23854638ac52ca927" }
async-graphql-warp = { git = "https://github.com/async-graphql/async-graphql.git", rev = "d2a71377a933eb32f8f3f0e23854638ac52ca927" }

(make sure to change if you're not using warp & the features)

@servonlewis
Copy link
Author

@servonlewis like this:


async-graphql = { git = "https://github.com/async-graphql/async-graphql.git", features = ["log", "apollo_tracing", "tracing", "dataloader"], rev = "d2a71377a933eb32f8f3f0e23854638ac52ca927" }

async-graphql-warp = { git = "https://github.com/async-graphql/async-graphql.git", rev = "d2a71377a933eb32f8f3f0e23854638ac52ca927" }

(make sure to change if you're not using warp & the features)

Thank you! Will give it a test now. How long does a typical change like this take before it is updated in the main branch and crates.io?

I really appreciate the help

@tombowditch
Copy link

How long does a typical change like this take before it is updated in the main branch and crates.io?

All depends on when @sunli829 publishes a new version - but once it's published it's super quick :)

@sunli829
Copy link
Collaborator

sunli829 commented Dec 5, 2021

Release in v3.0.12 🙂

@servonlewis
Copy link
Author

Release in v3.0.12 slightly_smiling_face

Thanks!

I've updated to that rev, but now i get these errors:

| #[Object]
| ^^^^^^^^^ the trait OutputType is not implemented for VsphereMetrics

and same for InputType.

I am not sure if that has documentation for it yet, as i am still looking around.

This is one my InputObject structs that it is complaining about.

#[derive(Serialize, Deserialize, Debug, InputObject, Clone)]
#[serde(rename_all = "camelCase")]
pub struct VsphereMetricsInput {
    pub interval_type: Option<IntervalType>,
    pub roll_up_type: Option<RollupType>,
    pub resource_id: String,
    pub begin: Option<i64>,
    pub end: Option<i64>,
}

@tombowditch
Copy link

@servonlewis did you go from 2.x to 3.x or from 3.x to 3.0.12?

@servonlewis
Copy link
Author

@servonlewis did you go from 2.x to 3.x or from 3.x to 3.0.12?

2.0

@sunli829
Copy link
Collaborator

sunli829 commented Dec 7, 2021

Please upgrade to 3.x 😁

@servonlewis
Copy link
Author

Please upgrade to 3.x grin

It looks like the book is still on 2.x, do you have any migration docs?

@servonlewis
Copy link
Author

Please upgrade to 3.x grin

looks like it's having trouble with an InputObject that has an enum as a field.

#[derive(Serialize, Deserialize, Debug, InputObject, Clone)]
#[serde(rename_all = "camelCase")]
pub struct VsphereMetricsInput {
    pub interval_type: Option<IntervalType>,
    pub roll_up_type: Option<RollupType>,
    pub resource_id: String,
    pub begin: Option<i64>,
    pub end: Option<i64>,
}

#[derive(Enum, Copy, Clone, Eq, PartialEq, Serialize, Deserialize, Debug)]
pub enum RollupType {
    Sum,
    Avg,
    Min,
    Max,
    None,
    Latest,
    Count,
}

#[derive(Enum, Copy, Clone, Eq, PartialEq, Serialize, Deserialize, Debug)]
pub enum IntervalType {
    Hours,
    Minutes,
    Days,
    Weeks,
    Months,
    Years,
}
error[E0277]: the trait bound `RollupType: InputType` is not satisfied
   --> query_tiberius\src\v_sphere.rs:145:41
    |
    |                                         ^^^^^^^^^^^ the trait `InputType` is not implemented for `RollupType`
    |
    = note: required because of the requirements on the impl of `InputType` for `std::option::Option<RollupType>`

@servonlewis
Copy link
Author

Please disregard, I had forgot to update the version in the other libs in my workspace :(

my apologies!

@servonlewis
Copy link
Author

Looking to test, but the new version doesn't allow for multiple objects with the same name...been going through them all for a while now. Will let you know how it works soon!

@servonlewis
Copy link
Author

It works!!!! Thank you!

@sunli829
Copy link
Collaborator

sunli829 commented Dec 9, 2021

Please upgrade to 3.x grin

It looks like the book is still on 2.x, do you have any migration docs?

I forgot to update the version in the book, it has been changed now. 🙂

@servonlewis
Copy link
Author

Please upgrade to 3.x grin

It looks like the book is still on 2.x, do you have any migration docs?

I forgot to update the version in the book, it has been changed now. 🙂

I really appreciate your help with this. The library is too awesome man.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants