
The Query and Generate mismatch #185

Open
william-wang opened this issue Oct 12, 2019 · 0 comments

I split the data generation and the queries into two jar files. I first generate the data and then query it in parallel. Most tasks take about 20 ms, but some take 700 ms. The reason is that these tasks access nonexistent keys (files). Why do the query tasks access files that were never generated?

The following is part of my program.

  1. Data generation code
    val tables = new TPCDSTables(sqlContext,
      dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
      scaleFactor = 70,
      useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
      useStringForDate = false) // true to replace DateType with StringType

    tables.genData(
      location = rootDir,
      format = format,
      overwrite = true, // overwrite the data that is already there
      partitionTables = true, // create the partitioned fact tables
      clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files
      filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
      tableFilter = "", // "" means generate all tables
      numPartitions = 100) // how many dsdgen partitions to run

  2. Query code
    val tables = new TPCDSTables(sqlContext,
      dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
      scaleFactor = 70,
      useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
      useStringForDate = false) // true to replace DateType with StringType

    // Create the specified database
    sql(s"create database $databaseName")

    // Create metastore tables in a specified database for your data.
    // Once tables are created, the current database will be switched to the specified database.
    tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = false)
    val tpcds = new TPCDS(sqlContext = sqlContext)
    // Set:
    val resultLocation = args(3) // place to write results
    val iterations = 1 // how many iterations of queries to run.
    //val queries = tpcds.tpcds2_4Queries // queries to run.
    val timeout = 24*60*60 // timeout, in seconds (24 hours).

    def queries = {
      if (args(4) == "all") {
        tpcds.tpcds2_4Queries
      } else {
        val qa = args(4).split(",", 0)
        val qs = qa.toSeq
        tpcds.tpcds2_4Queries.filter(q => {
          qs.contains(q.name)
        })
      }
    }

    println(queries.size)
    sql(s"use $databaseName")
    val experiment = tpcds.runExperiment(
      queries,
      iterations = iterations,
      resultLocation = resultLocation,
      forkThread = true)
    experiment.waitForFinish(timeout)
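
To rule out the query selection itself as a source of surprises, the filtering logic in the `queries` method can be checked in isolation, without Spark. The following is only a minimal sketch: `Query` is a hypothetical stand-in for the benchmark's query type (only its `name` field is assumed), the names `q1`, `q2`, `q3` are made up, and trimming whitespace around each requested name is an addition that the code above does not do.

```scala
// Hypothetical stand-in for the benchmark's query type; only `name` is used.
final case class Query(name: String)

object QuerySelection {
  // Returns every query when the selector is "all"; otherwise keeps only the
  // queries whose name appears in the comma-separated selector.
  def select(all: Seq[Query], selector: String): Seq[Query] =
    if (selector == "all") {
      all
    } else {
      // Assumption: tolerate spaces and empty entries such as "q1, q3,".
      val wanted = selector.split(",").map(_.trim).filter(_.nonEmpty).toSet
      all.filter(q => wanted.contains(q.name))
    }

  def main(args: Array[String]): Unit = {
    // Made-up query names, for illustration only.
    val all = Seq(Query("q1"), Query("q2"), Query("q3"))
    println(select(all, "all").map(_.name).mkString(","))
    println(select(all, "q1, q3").map(_.name).mkString(","))
  }
}
```

A name that matches nothing is silently dropped, so printing the selected names rather than only `queries.size` makes it easier to confirm that the intended queries are the ones being run.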
