
[SYSTEMDS-3482] Parallel Hadoop IO Startup #1757

Closed · wants to merge 1 commit

Conversation

@Baunsgaard (Contributor)

I observed that compile time increases to ~0.6 sec when a script includes IO operations, compared to ~0.2 sec without them. This is because the Hadoop IO we are using takes up to 70% of the compile time for simple scripts with only a read and a single operation. It is a constant overhead on the first IO operation that does not affect subsequent IO operations. To remove it, I moved this initialization to a parallel operation when we construct the JobConfiguration. This improves the compile time of SystemDS in general from ~0.6 sec when using IO to ~0.2 sec.

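The overlap described above can be sketched as follows. This is a minimal, hypothetical analog (class and method names are mine, not SystemDS code): an expensive one-time setup, standing in for the Hadoop JobConfiguration/FileSystem initialization, is launched on a background thread as early as possible, so compilation proceeds in parallel and the first IO instruction only blocks for whatever setup time remains.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the parallel-startup idea: begin the costly
// one-time IO initialization in the background while compilation runs.
public class ParallelIOStartup {
    // kicked off as soon as the class is first referenced
    private static final CompletableFuture<String> IO_SETUP =
        CompletableFuture.supplyAsync(ParallelIOStartup::expensiveInit);

    private static String expensiveInit() {
        // stand-in for the constant Hadoop startup cost (~0.4 sec in the PR)
        try { TimeUnit.MILLISECONDS.sleep(100); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "configured";
    }

    // the first IO instruction waits only for the remaining setup time
    public static String getConfig() {
        return IO_SETUP.join();
    }

    public static void main(String[] args) {
        // compilation work overlaps with expensiveInit() on the other thread
        try { TimeUnit.MILLISECONDS.sleep(100); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        System.out.println(getConfig()); // prints "configured"
    }
}
```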
@Baunsgaard (Contributor, Author)

I chose to also start the threadpool here, since this improves the startup time of the first instruction that uses the pool. If I used a plain thread it would be slightly faster, but any subsequent parallel operation would still have to pay the pool startup cost, and the extra thread would either be kept around unused or have to be shut down again.
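The trade-off above can be sketched roughly as follows (a hypothetical illustration, not the SystemDS pool implementation): a shared pool is created eagerly at startup, so the first parallel instruction does not pay the pool-construction cost, and later operations reuse the same threads instead of a one-off thread that would sit idle or need a second teardown.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: eagerly created shared pool, reused by all
// subsequent parallel operations.
public class EagerPool {
    // constructed at class load time, not lazily on first use
    private static final ExecutorService SHARED =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // submit a task and wait for its result
    public static int run(Callable<Integer> task) {
        try {
            return SHARED.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void shutdown() {
        SHARED.shutdown();
    }

    public static void main(String[] args) {
        System.out.println(run(() -> 21 * 2)); // prints 42
        shutdown();
    }
}
```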

@Baunsgaard (Contributor, Author)

Before:

SystemDS Statistics:
Total elapsed time:		0.732 sec.
Total compilation time:		0.635 sec.
Total execution time:		0.097 sec.
Cache hits (Mem/Li/WB/FS/HDFS):	1/0/0/0/1.
Cache writes (Li/WB/FS/HDFS):	0/0/0/0.
Cache times (ACQr/m, RLS, EXP):	0.060/0.000/0.000/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.641 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions:
 #  Instruction  Time(s)  Count
 1  rightIndex     0.067      1
 2  createvar      0.015      2
 3  toString       0.014      1
 4  print          0.000      1
 5  rmvar          0.000      2

After:

SystemDS Statistics:
Total elapsed time:		0.279 sec.
Total compilation time:		0.222 sec.
Total execution time:		0.056 sec.
Cache hits (Mem/Li/WB/FS/HDFS):	1/0/0/0/1.
Cache writes (Li/WB/FS/HDFS):	0/0/0/0.
Cache times (ACQr/m, RLS, EXP):	0.036/0.000/0.000/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.213 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions:
 #  Instruction  Time(s)  Count
 1  rightIndex     0.040      1
 2  createvar      0.009      2
 3  toString       0.007      1
 4  print          0.000      1
 5  rmvar          0.000      2

@mboehm7 (Contributor)

mboehm7 commented Jan 3, 2023

Well, good. Just two comments: (1) if the thread pool is not the JVM-internal pool, the shutdown would essentially wait for successful completion of the task, and (2) please double check that the method you are calling is internally properly synchronized.

@Baunsgaard (Contributor, Author)

1: Verified. The JVM correctly allocates and uses the intended pool; there is no difference to normal execution. However, it seems we should analyze whether other locations are allocating threads that are not needed, since the pool does not allocate threads 1 and 2, possibly indicating that something unintended is happening somewhere.

[Screenshot from 2023-01-03 17-26-30]

2: The filesystem getter is documented and coded such that the instance is cached when created by a request. Adding a synchronization block seems unnecessary: if two threads ask concurrently, they will both create a FileSystem object and simply overwrite the cached instance.

[Screenshot from 2023-01-03 18-27-10]
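The benign race in point 2 can be illustrated with the following minimal analog (not Hadoop's actual code; names are mine): two racing threads may each construct a filesystem object, and the later write simply replaces the cached one. Both instances are valid, so the cache stays correct without a synchronization block.

```java
// Illustrative analog of a benign cache race: racy overwrite of the
// cached instance is acceptable because every created instance is valid.
public class BenignRaceCache {
    // stand-in for the cached FileSystem instance
    private static volatile Object cached;

    public static Object get() {
        Object fs = cached;
        if (fs == null) {
            fs = new Object(); // stand-in for new FileSystem(uri, conf)
            cached = fs;       // racy overwrite: last writer wins, harmlessly
        }
        return fs;
    }

    public static void main(String[] args) {
        // the second call returns the instance cached by the first
        System.out.println(get() == get()); // prints true
    }
}
```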

@Baunsgaard Baunsgaard closed this in 6a759ce Jan 4, 2023
@Baunsgaard Baunsgaard deleted the StartUpParallel branch January 4, 2023 22:37