Null pointer in equalTupleDescs crashes the server #3825

Closed
mtuncer opened this issue May 7, 2020 · 18 comments
@mtuncer
Member

mtuncer commented May 7, 2020

This trace was taken from a recent SELECT query crash dump. The Citus version is 9.2-2 on PG 12.2.

The query is a SELECT * with 2 filters on 2 non-distribution columns.

#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
417	tupdesc.c: No such file or directory.
#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
#1  0x000000000085b51f in record_type_typmod_compare (a=<optimized out>, b=<optimized out>, size=<optimized out>) at typcache.c:1761
#2  0x0000000000869c73 in hash_search_with_hash_value (hashp=0x1c10530, keyPtr=keyPtr@entry=0x7ffcfd3117b8, hashvalue=3194332168, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffcfd3117c0) at dynahash.c:987
#3  0x000000000086a3fd in hash_search (hashp=<optimized out>, keyPtr=keyPtr@entry=0x7ffcfd3117b8, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffcfd3117c0) at dynahash.c:911
#4  0x000000000085d0e1 in assign_record_type_typmod (tupDesc=<optimized out>, tupDesc@entry=0x1b9f3f0) at typcache.c:1801
#5  0x000000000061832b in BlessTupleDesc (tupdesc=0x1b9f3f0) at execTuples.c:2056
#6  TupleDescGetAttInMetadata (tupdesc=0x1b9f3f0) at execTuples.c:2081
#7  0x00007f2701878dee in CreateDistributedExecution (modLevel=ROW_MODIFY_READONLY, taskList=0x1c82398, hasReturning=<optimized out>, paramListInfo=0x1c3e5a0, tupleDescriptor=0x1b9f3f0, tupleStore=<optimized out>, targetPoolSize=16, xactProperties=0x7ffcfd311960, jobIdList=0x0) at executor/adaptive_executor.c:951
#8  0x00007f270187ba09 in AdaptiveExecutor (scanState=0x1b9eff0) at executor/adaptive_executor.c:676
#9  0x00007f270187c582 in CitusExecScan (node=0x1b9eff0) at executor/citus_custom_scan.c:182
#10 0x000000000060c9e2 in ExecProcNode (node=0x1b9eff0) at ../../../src/include/executor/executor.h:239
#11 ExecutePlan (execute_once=<optimized out>, dest=0x1abfc90, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1b9eff0, estate=0x1b9ed50) at execMain.c:1646
#12 standard_ExecutorRun (queryDesc=0x1c3e660, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#13 0x00007f27018819bd in CitusExecutorRun (queryDesc=0x1c3e660, direction=ForwardScanDirection, count=0, execute_once=true) at executor/multi_executor.c:177
#14 0x00007f27000adfee in pgss_ExecutorRun (queryDesc=0x1c3e660, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:891
#15 0x000000000074f97d in PortalRunSelect (portal=portal@entry=0x1b8ed00, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x1abfc90) at pquery.c:929
#16 0x0000000000750df0 in PortalRun (portal=portal@entry=0x1b8ed00, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x1abfc90, altdest=altdest@entry=0x1abfc90, completionTag=0x7ffcfd312090 "") at pquery.c:770
#17 0x000000000074e745 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x1abf880 "") at postgres.c:2090
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1b4a0e8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4308
#19 0x00000000006de9d8 in BackendRun (port=0x1b37230, port=0x1b37230) at postmaster.c:4437
#20 BackendStartup (port=0x1b37230) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x00000000006df955 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1aba280) at postmaster.c:1377
#23 0x0000000000487a4e in main (argc=3, argv=0x1aba280) at main.c:228
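
For context, frame #0 shows equalTupleDescs being called with tupdesc1=0x0, i.e. a NULL tuple descriptor stored in PostgreSQL's record-typmod cache. A simplified paraphrase of the code path in frames #0-#4 (adapted from src/backend/utils/cache/typcache.c; not verbatim source, shown only to illustrate where the NULL hurts):

#include "postgres.h"
#include "access/tupdesc.h"     /* TupleDesc, equalTupleDescs() */

/* typcache.c keys its record-typmod hash entries by a TupleDesc pointer. */
typedef struct RecordCacheEntry
{
    TupleDesc   tupdesc;        /* a NULL value here is what crashes below */
} RecordCacheEntry;

/* Comparison callback used by dynahash (frames #2/#1) while
 * assign_record_type_typmod() looks up or inserts a descriptor (frame #4). */
static int
record_type_typmod_compare(const void *a, const void *b, Size size)
{
    RecordCacheEntry *left = (RecordCacheEntry *) a;
    RecordCacheEntry *right = (RecordCacheEntry *) b;

    /* If a stale, partially-initialized entry with tupdesc == NULL is still
     * in the hash table, equalTupleDescs() dereferences it and segfaults,
     * matching frame #0 (tupdesc1=0x0). */
    return equalTupleDescs(left->tupdesc, right->tupdesc) ? 0 : 1;
}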
@mtuncer
Member Author

mtuncer commented May 7, 2020

There is another crash on the same coordinator with the exact same call stack, from another SELECT * query.

@onderkalaci
Member

onderkalaci commented May 7, 2020

We've seen this issue before and had an internal conversation about it (e-mail titled Citus 8.3.1 / Postgres 11.4 core dump: Null pointer in equalTupleDescs):

From: Marco Slot <Marco.Slot>
Sent: Wednesday, August 7, 2019 7:38 AM
To: Onder Kalaci ; Daniel Farina ; Citus Dev Team
Subject: Re: Citus 8.3.1 / Postgres 11.4 core dump: Null pointer in equalTupleDescs

Thanks for debugging this. It seems that an internal hash table in PostgreSQL ended up having a NULL entry, but Citus code explicitly checks whether tupleDesc is NULL before calling TupleDescGetAttInMetadata, meaning the NULL entry is not coming from Citus. This is likely a bug in another extension or Postgres itself. An action item could be to send a patch to Postgres to at least check for NULL in assign_record_type_typmod to kick out the offenders.

So we think this might be a PostgreSQL issue, though we should investigate more and probably submit a patch.
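
One literal reading of that action item, as a purely hypothetical sketch (not an actual Postgres patch): reject a poisoned entry instead of dereferencing it, shown here in the hash comparison callback rather than in assign_record_type_typmod itself, reusing the RecordCacheEntry layout from the sketch above:

/* Hypothetical defensive variant of typcache.c's comparison callback;
 * an illustration only, not the patch that was eventually submitted. */
static int
record_type_typmod_compare_defensive(const void *a, const void *b, Size size)
{
    RecordCacheEntry *left = (RecordCacheEntry *) a;
    RecordCacheEntry *right = (RecordCacheEntry *) b;

    /* Treat a partially-initialized entry (tupdesc == NULL) as "not equal"
     * instead of letting equalTupleDescs() dereference a NULL pointer. */
    if (left->tupdesc == NULL || right->tupdesc == NULL)
        return 1;

    return equalTupleDescs(left->tupdesc, right->tupdesc) ? 0 : 1;
}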

onderkalaci changed the title from "crash in executor while running a select query" to "Null pointer in equalTupleDescs crashes the server" on May 7, 2020
@mtuncer
Member Author

mtuncer commented May 20, 2020

We have 5 new crash dumps with the same call stack on a coordinator running PG 11 and Citus 9.0, from May 12-13.

@ghost

ghost commented May 23, 2020

Another crash with the same stack, on the coordinator node of a Citus 9.2 / PG 12 formation.

@onderkalaci
Member

Is it happening on the same cluster or on different clusters, @kileri?

@ghost

ghost commented May 26, 2020

The formation I looked into was called <Customer 1> Production Cluster. I am not sure whether @mtuncer came across the previous 5 occurrences on the same formation, though.

@onlined

onlined commented Jun 11, 2020

Another crash has happened on <Customer 1> Production Cluster two days ago.

@onderkalaci
Member

@onlined It'd be useful to check the schema of <Customer 1> and see if there is anything unusual, like a custom data type, queries returning unusual tuples, or anything else out of the ordinary.
That might give us something to chase after.

onderkalaci added and then removed the bug label on Jun 17, 2020
@onlined

onlined commented Jun 17, 2020

Another crash has happened on <Customer 1> Production Cluster two days ago.

I investigated this further and found that there is a single task in the task list passed to CreateDistributedExecution. The task's query and the related table definition are below (PII redacted):

SELECT a, b, c, d, e, f, g, h, i, j, k
    FROM public.tbl_123456 tbl
    WHERE ((b OPERATOR(pg_catalog.=) 123123) AND 
           ((g)::text OPERATOR(pg_catalog.=) 'text1'::text) AND
           ((d)::text OPERATOR(pg_catalog.=) 'text2'::text) AND
           ((e)::text OPERATOR(pg_catalog.=) 'text3'::text))
                                            Table "public.tbl"
  Column  |            Type             | Collation | Nullable |            Default
----------+-----------------------------+-----------+----------+--------------------------------
 a        | integer                     |           | not null | nextval('tbl_a_seq'::regclass)
 b        | integer                     |           | not null |
 c        | character varying           |           |          |
 d        | character varying           |           |          |
 e        | character varying           |           |          |
 f        | character varying           |           |          |
 g        | character varying           |           |          |
 h        | text                        |           |          |
 i        | timestamp without time zone |           |          |
 j        | timestamp without time zone |           |          |
 k        | character varying           |           |          |

@mtuncer
Member Author

mtuncer commented Jul 1, 2020

I hit another core dump at the same formation, most likely on the same table. Our dump_backup folder now has 8 core dumps.

The query was a single-table router SELECT query, where the coordinator has almost nothing to do for execution.
I ran the crashing query again, but it worked just fine. It will be difficult to reproduce this.

@pinodeca
Contributor

pinodeca commented Nov 4, 2020

We had 3 occurrences on the same server: Oct. 16, 22, and Nov. 4

@onlined

onlined commented Jan 5, 2021

It appeared on two coordinators at Citus Cloud on Dec. 18 and 28.

@mtuncer
Member Author

mtuncer commented Jan 14, 2021

Logged another 4 hits from Dec 30 to Jan 9

@mtuncer
Member Author

mtuncer commented Jan 27, 2021

2 more on Jan 26

@SaitTalhaNisanci
Contributor

SaitTalhaNisanci commented Mar 24, 2021

My current understanding of why this might happen is the following:

I manually added an error to find_or_make_matching_shared_tupledesc, and with two subsequent calls (basic SELECT queries) I was able to get this crash:

Program received signal SIGSEGV, Segmentation fault.
0x000055e31989ad2e in equalTupleDescs ()
(gdb) where
#0  0x000055e31989ad2e in equalTupleDescs ()
#1  0x000055e319cb469f in record_type_typmod_compare ()
#2  0x000055e319cc44c2 in hash_search_with_hash_value ()
#3  0x000055e319cb61d4 in assign_record_type_typmod ()
#4  0x000055e319cc2b53 in internal_get_result_type ()
#5  0x000055e3199d37dd in pg_available_extensions ()
#6  0x00007fcc841ec101 in AvailableExtensionVersion () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#7  0x00007fcc841ec9b5 in CheckAvailableVersion () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#8  0x00007fcc841ed5ee in CheckCitusVersion () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#9  0x00007fcc841ef1c6 in LookupCitusTableCacheEntry () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#10 0x00007fcc841ef5d9 in IsCitusTable () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#11 0x00007fcc8420300a in ListContainsDistributedTableRTE ()
   from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#12 0x00007fcc842049b8 in distributed_planner () from /home/talha/.pgenv/pgsql-13.0/lib/citus.so
#13 0x00007fcc7de9fe75 in pgss_planner (parse=0x55e31b5b91c0,
    query_string=0x55e31b5b8348 "SELECT b from k;", cursorOptions=256, boundParams=0x0)
    at pg_stat_statements.c:988
#14 0x000055e319b9c40d in pg_plan_query ()
#15 0x000055e319b9c503 in pg_plan_queries ()
#16 0x000055e319b9c824 in exec_simple_query ()
#17 0x000055e319b9e44d in PostgresMain ()
#18 0x000055e319b21dad in ServerLoop ()
#19 0x000055e319b22d10 in PostmasterMain ()

I believe the above has the same root cause as this issue.

One thing that came to my mind was to remove this line, which sets the new hash entry's tupdesc to NULL: https://github.com/postgres/postgres/blob/1509c6fc29c07d13c9a590fbd6f37c7576f58ba6/src/backend/utils/cache/typcache.c#L1984
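
To spell out the sequence this describes, here is a simplified paraphrase of assign_record_type_typmod() around the linked typcache.c line (not verbatim source; RecordCacheEntry as in the sketch near the top of this issue):

#include "postgres.h"
#include "access/tupdesc.h"     /* TupleDesc */
#include "utils/hsearch.h"      /* HTAB, hash_search(), HASH_ENTER */

static HTAB *RecordCacheHash;   /* typcache.c's record-typmod hash (sketch) */

static void
assign_record_type_typmod_sketch(TupleDesc tupDesc)
{
    RecordCacheEntry *recentry;
    bool        found;

    /* 1. The hash entry is created up front, keyed by the TupleDesc pointer. */
    recentry = (RecordCacheEntry *) hash_search(RecordCacheHash,
                                                (void *) &tupDesc,
                                                HASH_ENTER, &found);
    if (found && recentry->tupdesc != NULL)
        return;                 /* typmod already assigned */

    /* 2. The new entry's tupdesc is reset to NULL (the line linked above),
     *    so that a failure later in the function can be recognized. */
    recentry->tupdesc = NULL;

    /* 3. If anything after this point raises an error -- e.g. the error
     *    injected into find_or_make_matching_shared_tupledesc() in the
     *    experiment above -- the NULL entry stays behind in the hash table.
     *    The next lookup then runs record_type_typmod_compare() against it,
     *    which hands the NULL tupdesc to equalTupleDescs() and segfaults,
     *    exactly as in the traces in this issue. */

    /* ... find_or_make_matching_shared_tupledesc(), CreateTupleDescCopy(),
     *     assign a typmod, and only then recentry->tupdesc = entDesc ... */
}

If this reading is right, any error between step 2 and the final assignment poisons the cache for the rest of the backend's lifetime, which would also explain why the crashes are rare and hard to reproduce.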

@SaitTalhaNisanci
Contributor

I have sent a patch to Postgres for this issue, and the thread is still active: https://www.postgresql.org/message-id/flat/3229167.1617210650%40sss.pgh.pa.us#9ac0555c613861cc8b4d2934185018d9

@SaitTalhaNisanci
Contributor

This has been merged into Postgres and backported down to PG 11.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dd0e37cc1598050ec38fa289908487d4f5c96dca
