
Problem with Bulk Inserts And Unique Constraint Violations #3847

Closed
msullivan opened this issue May 10, 2022 Discussed in #3839 · 0 comments · Fixed by #3848
@msullivan (Member) commented:
Discussed in #3839

Originally posted by ngriffiths13 May 7, 2022
I am having some issues trying to do a bulk insert of my data using the built-in for loop as shown here. In my schema I have an exclusive constraint on one of the properties of my types. If I try to bulk insert any data into my DB, I get a constraint violation error even though I am very clearly not violating the constraint. Am I doing something wrong with my insert?

Here is my schema:

module default {
    type Tweet {
        annotation title := "Tweet object from tweepy.";
        annotation description := "All scraped data from a tweet from Twitter API 2.0.";

        required property tweet_id -> int64 {
            constraint exclusive;
        }
        required property tweet_text -> str {
            annotation description := "The body of text from a tweet.";
        }
        required property created_at -> datetime;
        required link account -> Account {
            on target delete delete source;
        }
        link traffic_report := .<source_tweet[is TrafficReport];
        link metrics -> TweetMetrics;
    }
    type TweetMetrics {
        annotation description := 'Publicly available tweet metrics.';
        property retweet_count -> int64 {
            constraint min_value(0);
        }
        property reply_count -> int64 {
            constraint min_value(0);
        }
        property like_count -> int64 {
            constraint min_value(0);
        }
        property quote_count -> int64 {
            constraint min_value(0);
        }
    }
    type Account {
        required property account_id -> int64 {
            constraint exclusive;
        }
        required property name -> str {
            constraint exclusive;
        }
        required property username -> str {
            constraint exclusive;
        }
        multi link tweets := .<account[is Tweet];
    }
}

And here is my insert query:

WITH
    raw_data := <json>$data
FOR data IN json_array_unpack(raw_data) UNION (
    INSERT Tweet {
        account := (SELECT Account FILTER .username = <str>data['username']),
        metrics := (INSERT TweetMetrics {
            like_count := <int64>data['like_count'],
            quote_count := <int64>data['quote_count'],
            reply_count := <int64>data['reply_count'],
            retweet_count := <int64>data['retweet_count'],
        }),
        tweet_id := <int64>data['tweet_id'],
        tweet_text := <str>data['text'],
        created_at := <datetime>data['created_at']
    }
    UNLESS CONFLICT ON .tweet_id
);

From an empty database, I insert one account, elonmusk. Then, trying to insert tweet data, I get the error. Here is what my data looks like (just 2 entries, but enough to reproduce the error):

'[{"username": "elonmusk", "tweet_id": 1523046265097158656, "text": "@SawyerMerritt Yes! Stop the outrageous false ESG assessments, where Tesla gets a bad grade, but an oil company can get a good grade. Total gaming of the system!", "created_at": "2022-05-07 21:05:01+00:00", "like_count": 14203, "quote_count": 83, "reply_count": 758, "retweet_count": 1499}, {"username": "elonmusk", "tweet_id": 1523045544536723456, "text": "@WholeMarsBlog At risk of stating the obvious, unless something changes to cause the birth rate to exceed the death rate, Japan will eventually cease to exist. This would be a great loss for the world.", "created_at": "2022-05-07 21:02:09+00:00", "like_count": 38529, "quote_count": 768, "reply_count": 3315, "retweet_count": 4140}]'
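As a sanity check that the batch itself contains no duplicates, the payload above can be verified with a quick stdlib-only sketch (the list below is abbreviated to the fields that matter here; the full payload is the JSON string above):

```python
import json

# Abbreviated copy of the two-tweet payload from above.
tweets = json.dumps([
    {"username": "elonmusk", "tweet_id": 1523046265097158656},
    {"username": "elonmusk", "tweet_id": 1523045544536723456},
])

# Confirm every tweet_id in the batch is unique, so the batch alone
# cannot legitimately trip the exclusive constraint.
ids = [t["tweet_id"] for t in json.loads(tweets)]
assert len(ids) == len(set(ids))
```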

I insert them using the Python client library like this (after connecting), where query is the query above and tweets is the JSON string:
conn.query(query, data=tweets)

I have found that if I run this query with only 1 entry it works, but 2 or more fails. If I run it with 2 entries but 1 is already in the DB, it succeeds. It fails with this error:

~/personal/traffic-project-v2/traffic_project/etl/data_utils.py in import_user_tweet(client, tweet)
     45         );
     46         """
---> 47     client.query(query, data=json.dumps(tweet))
     48 
     49 

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/abstract.py in query(self, query, *args, **kwargs)
     95 
     96     def query(self, query: str, *args, **kwargs) -> datatypes.Set:
---> 97         return self._query(QueryContext(
     98             query=QueryWithArgs(query, args, kwargs),
     99             cache=self._get_query_cache(),

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/blocking_client.py in _query(self, query_context)
    314 
    315     def _query(self, query_context: abstract.QueryContext):
--> 316         return self._iter_coroutine(super()._query(query_context))
    317 
    318     def execute(self, query: str) -> None:

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/blocking_client.py in _iter_coroutine(self, coro)
    303     def _iter_coroutine(self, coro):
    304         try:
--> 305             coro.send(None)
    306         except StopIteration as ex:
    307             if ex.args:

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/base_client.py in _query(self, query_context)
    681         con = await self._impl.acquire()
    682         try:
--> 683             result, _ = await con.raw_query(query_context)
    684             return result
    685         finally:

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/base_client.py in raw_query(self, query_context)
    212                     raise
    213                 if not e.has_tag(errors.SHOULD_RETRY):
--> 214                     raise e
    215                 if capabilities is None:
    216                     cache_item = query_context.cache.query_cache.get(

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/base_client.py in raw_query(self, query_context)
    197                 if reconnect:
    198                     await self.connect(single_attempt=True)
--> 199                 return await self._protocol.execute_anonymous(
    200                     query=query_context.query.query,
    201                     args=query_context.query.args,

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/protocol/protocol.pyx in execute_anonymous()

~/personal/traffic-project-v2/.venv/lib/python3.9/site-packages/edgedb/protocol/protocol.pyx in _execute()

ConstraintViolationError: tweet_id violates exclusivity constraint

Sorry for all of the code snippets, but any help in solving this would be much appreciated. I'm not sure if this is a bug or just user error on my part.
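Until a server-side fix lands, one possible client-side workaround (a sketch; conn and query here are the reporter's own objects from the call shown above, not a confirmed API beyond query(query, data=...)) is to split the bulk payload into one-element payloads, so the FOR loop only ever iterates over a single item per call:

```python
import json

def single_row_payloads(batch_json: str) -> list[str]:
    """Split one bulk JSON payload into one-element JSON payloads,
    so each tweet can be inserted by its own query call."""
    return [json.dumps([tweet]) for tweet in json.loads(batch_json)]

# Hypothetical usage, mirroring the reporter's call above:
# for payload in single_row_payloads(tweets):
#     conn.query(query, data=payload)
```

This trades one round trip for N, but each statement's UNLESS CONFLICT then behaves as it does in the working single-entry case the reporter describes.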

@msullivan msullivan self-assigned this May 10, 2022
msullivan added a commit that referenced this issue May 10, 2022
Since the relevant rel wasn't in the hierarchy, the search for the
enclosing volatility ref wasn't succeeding, and when nonconflict CTEs
were used in a for loop, sometimes things could get
duplicated. (Depending on the vagaries of how pg orders the loops, I
think).

Fixes #3847.