Why opened task less than available executors in case of insert into/load data #4160

01lin · 2021-06-23T03:16:29Z

During insert into or load data, the number of tasks in the stage is almost equal to the number of hosts, which is generally much smaller than the number of available executors. This low parallelism slows down execution. Why must parallelism be limited to the number of distinct hosts? Can more tasks be started to increase parallelism and improve resource utilization? Thanks.

org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala: loadDataFrame

  /**
   * Execute load process to load from input dataframe
   */
  private def loadDataFrame(
      sqlContext: SQLContext,
      dataFrame: Option[DataFrame],
      carbonLoadModel: CarbonLoadModel
  ): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
    try {
      val rdd = dataFrame.get.rdd
      val nodeNumOfData = rdd.partitions.flatMap[String, Array[String]] { p =>
        DataLoadPartitionCoalescer.getPreferredLocs(rdd, p).map(_.host)
      }.distinct.length
      val nodes = DistributionUtil.ensureExecutorsByNumberAndGetNodeList(
        nodeNumOfData,
        sqlContext.sparkContext)
      val newRdd = new DataLoadCoalescedRDD[Row](
        sqlContext.sparkSession,
        rdd,
        nodes.toArray.distinct)

      new NewDataFrameLoaderRDD(
        sqlContext.sparkSession,
        new DataLoadResultImpl(),
        carbonLoadModel,
        newRdd
      ).collect()
    } catch {
      case ex: Exception =>
        LOGGER.error("load data frame failed", ex)
        throw ex
    }
  }

QiangCai · 2021-06-28T01:46:23Z

This code path applies only to local_sort loading.
It helps avoid shuffling data between executors.
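
To make the effect described in the question concrete, here is a minimal sketch in plain Scala with no Spark or CarbonData dependency. It only models the idea of counting distinct preferred hosts and grouping input partitions per host; it is not the actual logic of DataLoadPartitionCoalescer or DataLoadCoalescedRDD. The names `Partition`, `HostCoalesceSketch`, `distinctHostCount`, `groupByHost`, and the host names are all hypothetical, and assigning each partition to its first preferred host is an assumption made for illustration.

```scala
// Simplified model only: plain collections stand in for Spark partitions.
// Partition, HostCoalesceSketch and the host names are hypothetical, not CarbonData APIs.
final case class Partition(index: Int, preferredHosts: Seq[String])

object HostCoalesceSketch {
  // Count the distinct hosts that hold any input partition, similar in spirit
  // to the `.flatMap(...).distinct.length` step in loadDataFrame.
  def distinctHostCount(partitions: Seq[Partition]): Int =
    partitions.flatMap(_.preferredHosts).distinct.length

  // Assumption for illustration: each partition goes to its first preferred host,
  // so every host ends up with exactly one group of partition indexes.
  def groupByHost(partitions: Seq[Partition]): Map[String, Seq[Int]] =
    partitions
      .filter(_.preferredHosts.nonEmpty)
      .groupBy(_.preferredHosts.head)
      .map { case (host, ps) => host -> ps.map(_.index) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Partition(0, Seq("host-a")),
      Partition(1, Seq("host-a")),
      Partition(2, Seq("host-b")),
      Partition(3, Seq("host-b")),
      Partition(4, Seq("host-c")),
      Partition(5, Seq("host-c"))
    )
    println(s"input partitions: ${partitions.length}")
    println(s"distinct hosts:   ${distinctHostCount(partitions)}")
    groupByHost(partitions).toSeq.sortBy(_._1).foreach { case (host, idx) =>
      println(s"$host -> partitions ${idx.mkString(", ")}")
    }
  }
}
```

In this toy input, six partitions spread over three hosts collapse into three groups. If each group is handled by one task, the task count follows the host count rather than the executor count, which matches the behaviour reported in the question. Keeping each group on the host that already holds its data is what lets the load avoid moving rows between executors.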
