[Bug] [LocalFile] Bug LocalFile Model uses spark local mode to double the data #6868

Closed
@AdkinsHan

Description

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When I used Spark local mode to read a local CSV file into a Hive table, the data was written 3 times over; this did not happen when I used Spark yarn cluster mode. I used SeaTunnel 1.5 before, where the migration process ran in local mode, but when I tested version 2.3.5 the data was duplicated.
Summary:
--master local --deploy-mode client → data written 3 times
--master yarn --deploy-mode client → data written 3 times
--master yarn --deploy-mode cluster → correct
My CSV file has 2076 rows, but select count(1) from xx shows 3*2076.
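The check I ran can be sketched as follows. This is a minimal, self-contained demo: a throwaway file under /tmp stands in for the real CSV (the real path and table name are in the config below), and the header subtraction mirrors skip_header_row_number=1.

```shell
# Demo: count data rows in a CSV, excluding the header, to get the
# expected sink row count. /tmp/demo_sku.csv is a stand-in for the
# real file /data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv
CSV=/tmp/demo_sku.csv
printf 'sku,sku_group,pb,series,pn,mater_n\n' > "$CSV"   # header row
printf 'A,1,x,s,p,m\nB,2,y,t,q,n\n' >> "$CSV"            # 2 data rows
csv_rows=$(( $(wc -l < "$CSV") - 1 ))  # subtract skip_header_row_number=1
echo "csv_rows=$csv_rows"
# Compare against the sink:
#   hive -e 'select count(1) from ghydata.ods_file_pjp_old_new_sku_yy'
# With --master local the Hive count came back at 3x this number.
```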

SeaTunnel Version

2.3.5

SeaTunnel Config

env {
  # seatunnel defined streaming batch duration in seconds
  execution.parallelism = 4
  job.mode = "BATCH"
  spark.executor.instances = 4
  spark.executor.cores = 4
  spark.executor.memory = "4g"
  spark.sql.catalogImplementation = "hive"
  spark.hadoop.hive.exec.dynamic.partition = "true"
  spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
}

source {
    LocalFile {
    schema {
            fields {
                  sku = string
                  sku_group = string
                  pb = string
                  series = string
                  pn = string
                  mater_n = string
                }
    }
      path = "/data/ghyworkbase/uploadfile/h019-ods_file_pjp_old_new_sku_yy.csv"
      file_format_type = "csv"
      skip_header_row_number=1
      result_table_name="ods_file_pjp_old_new_sku_yy_source"
    }
}

transform {
  Sql {
    source_table_name="ods_file_pjp_old_new_sku_yy_source"
    query = "select sku,sku_group,pb,series,pn,mater_n,TO_CHAR(CURRENT_DATE(),'yyyy') as dt_year from ods_file_pjp_old_new_sku_yy_source "
    result_table_name="ods_file_pjp_old_new_sku_yy"

  }
}

sink {

#   Console {
#      source_table_name = "ods_file_pjp_old_new_sku_yy"
#    }

   Hive {
     source_table_name="ods_file_pjp_old_new_sku_yy"
     table_name = "ghydata.ods_file_pjp_old_new_sku_yy"
     metastore_uri = "thrift://"
   }

}

Running Command

sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master local \
  --deploy-mode client \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
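For comparison, per the summary above, the same job produced the correct count when only the master/deploy-mode flags were changed to yarn cluster mode:

```shell
sh /data/seatunnel/seatunnel-2.3.4/bin/start-seatunnel-spark-3-connector-v2.sh \
  --master yarn \
  --deploy-mode cluster \
  --queue ghydl \
  --executor-instances 4 \
  --executor-cores 4 \
  --executor-memory 4g \
  --name "h019-ods_file_pjp_old_new_sku_yy" \
  --config /2.3.5/h019-ods_file_pjp_old_new_sku_yy.conf
```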

Error Exception

No exception was thrown; the data was simply written 3 times (3 × 2076 rows).

Zeta or Flink or Spark Version

No response

Java or Scala Version

/usr/local/jdk/jdk1.8.0_341

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
