19 changes: 9 additions & 10 deletions iceberg/iceberg-handler/src/test/results/positive/col_stats.q.out
@@ -244,14 +244,14 @@ Stage-0
Stage-1
Reducer 2 vectorized
File Output Operator [FS_8]
Select Operator [SEL_7] (rows=9 width=95)
Select Operator [SEL_7] (rows=9 width=192)
Output:["_col0","_col1","_col2"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_6]
Select Operator [SEL_5] (rows=9 width=95)
Select Operator [SEL_5] (rows=9 width=192)
Output:["_col0","_col1","_col2"]
TableScan [TS_0] (rows=9 width=95)
default@tbl_ice_puffin,tbl_ice_puffin,Tbl:COMPLETE,Col:COMPLETE,Output:["a","b","c"]
TableScan [TS_0] (rows=9 width=192)
default@tbl_ice_puffin,tbl_ice_puffin,Tbl:COMPLETE,Col:NONE,Output:["a","b","c"]

PREHOOK: query: drop table if exists tbl_ice_puffin
PREHOOK: type: DROPTABLE
@@ -339,17 +339,16 @@ POSTHOOK: type: DESCTABLE
POSTHOOK: Input: default@tbl_ice_puffin
col_name a
data_type int
min 1
Member Author

@dengzhhu653 dengzhhu653 Apr 27, 2023

Hi @simhadri-g, it seems like the desc formatted tbl_ice_puffin a gets the stats from the metastore even though hive.iceberg.stats.source=iceberg is set, could you please check?

Member

@simhadri-g simhadri-g Apr 28, 2023

@dengzhhu653 ,
Yes, we recently merged a PR to store Hive column stats in puffin files for Iceberg tables: a8a0ae7
Hive will use the stats from this file whenever hive.iceberg.stats.source=iceberg.

There are a few more follow-up tasks that I am currently working on as part of the epic.

max 333
num_nulls 0
distinct_count 7
min
max
num_nulls
distinct_count
Comment on lines -343 to +345
Member

This removed the Iceberg column stats? If so, we should revert and make sure it doesn't affect Iceberg.

Member Author

@dengzhhu653 dengzhhu653 May 12, 2023

The desc formatted tbl_ice_puffin a doesn't fetch the stats from the puffin files even with hive.iceberg.stats.source=iceberg; instead it goes to the metastore for the stats.

The tbl_ice_puffin is an external table and is recreated (and inserted into) multiple times before the desc. So when the table is created this time, the legacy data files left behind make HMS consider the column stats stale (e.g., it cannot assume the row count is 0 or infer the min/max of column a); as a result, the stats from the subsequent insertion ("values (1, 'one', 50), (2, 'two', 51),(2, 'two', 51),(2, 'two', 51), (3, 'three', 52), (4, 'four', 53)") cannot be merged in HMS.

There is an explain select * from tbl_ice_puffin order by a, b, c; before the desc; as we can see there, the stats stored in the puffin files are not removed.

Member

@simhadri-g simhadri-g May 12, 2023

This part of the output corresponds to the following code snippet.

set hive.iceberg.stats.source=iceberg;
drop table if exists tbl_ice_puffin;
create external table tbl_ice_puffin(a int, b string, c int) stored by iceberg tblproperties ('format-version'='2');
insert into tbl_ice_puffin values (1, 'one', 50), (2, 'two', 51),(2, 'two', 51),(2, 'two', 51), (3, 'three', 52), (4, 'four', 53), (5, 'five', 54), (111, 'one', 55), (333, 'two', 56);
explain select * from tbl_ice_puffin order by a, b, c;
select * from tbl_ice_puffin order by a, b, c;
select count(*) from tbl_ice_puffin ;
desc formatted tbl_ice_puffin a;

In this case, the output of desc formatted tbl_ice_puffin a; is accurate and not stale (min = 1, max = 333).

I think we should either:

  1. Source the stats for desc table from the puffin files for Iceberg tables, or
  2. Add additional logic in HMS to address this.

Member Author

This is perhaps true for an Iceberg table, as it tracks its own metadata.
Consider a native external table: if there are some files under the table directory at creation time, then the assumption (e.g., min = 1, max = 333) is probably wrong after insertion.

If, every time we create a new Iceberg table in HMS, the legacy files under the table directory won't be read (e.g., the row count is 0 regardless of the legacy files), then we can put "COLUMN_STATS":{"a":"true","b":"true","c":"true"} into the table parameters.

Member

When the Iceberg table is recreated, a new snapshot is generated, so any leftover files would be ignored.

Member Author

Thanks, will create another PR to address this

avg_col_len
max_col_len
num_trues
num_falses
bit_vector HL
bit_vector
comment
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"a\":\"true\",\"b\":\"true\",\"c\":\"true\"}}
Member

Basic stats are removed as well?

Member Author

Same as with the min/max stats above: HMS considers itself unable to provide accurate stats because of the data files left behind before the tbl_ice_puffin table was created.

PREHOOK: query: drop table if exists tbl_ice
PREHOOK: type: DROPTABLE
POSTHOOK: query: drop table if exists tbl_ice
@@ -225,7 +225,7 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"value\":\"true\"}}
Member Author

COLUMN_STATS should be marked stale after the first truncation; otherwise we may get wrong statistics. For example, max(id) would be 10 here, which is not expected, as max(id) should be 5.

COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
bucketing_version 2
current-schema {\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":false,\"type\":\"int\"},{\"id\":2,\"name\":\"value\",\"required\":false,\"type\":\"string\"}]}
@@ -20,10 +20,13 @@


import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.Path;
@@ -34,6 +37,7 @@
import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.ObjectDictionary;
import org.apache.hadoop.hive.metastore.api.Order;
import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;
import org.apache.hadoop.hive.metastore.api.SQLCheckConstraint;
@@ -921,14 +925,23 @@ public Table toTable(HiveConf conf) throws HiveException {
// When replicating the statistics for a table will be obtained from the source. Do not
// reset it on replica.
if (replicationSpec == null || !replicationSpec.isInReplicationScope()) {
if (!this.isCTAS && (tbl.getPath() == null || (!isExternal() && tbl.isEmpty()))) {
if (!tbl.isPartitioned() && conf.getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
StatsSetupConst.setStatsStateForCreateTable(tbl.getTTable().getParameters(),
MetaStoreUtils.getColumnNames(tbl.getCols()), StatsSetupConst.TRUE);
}
} else {
StatsSetupConst.setStatsStateForCreateTable(tbl.getTTable().getParameters(), null,
StatsSetupConst.FALSE);
// Remove COLUMN_STATS_ACCURATE=true from table's parameter, let the HMS determine if
// there is need to add column stats dependent on the table's location.
StatsSetupConst.setStatsStateForCreateTable(tbl.getTTable().getParameters(), null,
StatsSetupConst.FALSE);
if (!this.isCTAS && !tbl.isPartitioned() && !tbl.isTemporary() &&
conf.getBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER)) {
// Put the flag into the dictionary so as not to pollute the table parameters;
// ObjectDictionary is meant to convey repetitive messages.
ObjectDictionary dictionary = tbl.getTTable().isSetDictionary() ?
tbl.getTTable().getDictionary() : new ObjectDictionary();
List<ByteBuffer> buffers = new ArrayList<>();
String statsSetup = StatsSetupConst.ColumnStatsSetup.getStatsSetupAsString(true,
storageHandler != null && storageHandler.isMetadataTableSupported() ? "metadata" : null, // Skip metadata directory for Iceberg table
MetaStoreUtils.getColumnNames(tbl.getCols()));
buffers.add(ByteBuffer.wrap(statsSetup.getBytes(StandardCharsets.UTF_8)));
dictionary.putToValues(StatsSetupConst.STATS_FOR_CREATE_TABLE, buffers);
tbl.getTTable().setDictionary(dictionary);
}
}
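For reference, the dictionary payload above is just a UTF-8 JSON string wrapped in a ByteBuffer. A minimal stand-alone sketch of that round trip, using a plain Map as a stand-in for the Thrift ObjectDictionary (an assumption for illustration only):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryRoundTrip {

    // Plain map standing in for ObjectDictionary's key -> List<ByteBuffer> values.
    static final Map<String, List<ByteBuffer>> dictionary = new HashMap<>();

    // Mirrors dictionary.putToValues(key, buffers): one UTF-8 encoded payload per key.
    static void put(String key, String payload) {
        List<ByteBuffer> buffers = new ArrayList<>();
        buffers.add(ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8)));
        dictionary.put(key, buffers);
    }

    // Decodes the first buffer back into a string, as a reader of the dictionary would.
    static String get(String key) {
        ByteBuffer buf = dictionary.get(key).get(0);
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        String statsSetup = "{\"enabled\":true,\"columnNames\":[\"a\",\"b\",\"c\"]}";
        put("setStatsStateForCreateTable", statsSetup);
        System.out.println(get("setStatsStateForCreateTable")); // round-trips unchanged
    }
}
```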

9 changes: 9 additions & 0 deletions ql/src/test/queries/clientpositive/stats_external_location.q
@@ -0,0 +1,9 @@
set hive.stats.column.autogather=true;
set hive.stats.autogather=true;
dfs ${system:test.dfs.mkdir} ${system:test.tmp.dir}/test1;

create external table test_custom(age int, name string) stored as orc location '/tmp/test1';
insert into test_custom select 1, 'test';
desc formatted test_custom age;

drop table test_custom;
14 changes: 14 additions & 0 deletions ql/src/test/results/clientpositive/llap/default_file_format.q.out
@@ -170,8 +170,13 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"c\":\"true\"}}
EXTERNAL TRUE
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
#### A masked pattern was here ####

# Storage Information
@@ -234,9 +239,12 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"c\":\"true\"}}
EXTERNAL TRUE
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
#### A masked pattern was here ####

@@ -470,9 +478,12 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"c\":\"true\"}}
EXTERNAL TRUE
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
#### A masked pattern was here ####

@@ -536,9 +547,12 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"c\":\"true\"}}
EXTERNAL TRUE
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
#### A masked pattern was here ####

4 changes: 2 additions & 2 deletions ql/src/test/results/clientpositive/llap/mm_exim.q.out
@@ -312,8 +312,8 @@ Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version 2
numFiles 3
numRows 0
rawDataSize 0
numRows 6
rawDataSize 37
totalSize 43
transactional true
transactional_properties insert_only
@@ -0,0 +1,47 @@
#### A masked pattern was here ####
PREHOOK: type: CREATETABLE
#### A masked pattern was here ####
PREHOOK: Output: database:default
PREHOOK: Output: default@test_custom
#### A masked pattern was here ####
POSTHOOK: type: CREATETABLE
#### A masked pattern was here ####
POSTHOOK: Output: database:default
POSTHOOK: Output: default@test_custom
PREHOOK: query: insert into test_custom select 1, 'test'
PREHOOK: type: QUERY
PREHOOK: Input: _dummy_database@_dummy_table
PREHOOK: Output: default@test_custom
POSTHOOK: query: insert into test_custom select 1, 'test'
POSTHOOK: type: QUERY
POSTHOOK: Input: _dummy_database@_dummy_table
POSTHOOK: Output: default@test_custom
POSTHOOK: Lineage: test_custom.age SIMPLE []
POSTHOOK: Lineage: test_custom.name SIMPLE []
PREHOOK: query: desc formatted test_custom age
PREHOOK: type: DESCTABLE
PREHOOK: Input: default@test_custom
POSTHOOK: query: desc formatted test_custom age
POSTHOOK: type: DESCTABLE
POSTHOOK: Input: default@test_custom
col_name age
data_type int
min 1
max 1
num_nulls 0
distinct_count 1
avg_col_len
max_col_len
num_trues
num_falses
bit_vector HL
comment from deserializer
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"age\":\"true\",\"name\":\"true\"}}
PREHOOK: query: drop table test_custom
PREHOOK: type: DROPTABLE
PREHOOK: Input: default@test_custom
PREHOOK: Output: default@test_custom
POSTHOOK: query: drop table test_custom
POSTHOOK: type: DROPTABLE
POSTHOOK: Input: default@test_custom
POSTHOOK: Output: default@test_custom
@@ -98,14 +98,11 @@ Retention: 0
#### A masked pattern was here ####
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"a\":\"true\"}}
EXTERNAL TRUE
TRANSLATED_TO_EXTERNAL TRUE
bucketing_version 2
external.table.purge TRUE
numFiles 2
numRows 1
rawDataSize 1
totalSize 4
#### A masked pattern was here ####

@@ -167,6 +167,8 @@ public String getAggregator(Configuration conf) {

public static final String CASCADE = "CASCADE";

public static final String STATS_FOR_CREATE_TABLE = "setStatsStateForCreateTable";

public static final String TRUE = "true";

public static final String FALSE = "false";
@@ -219,6 +221,55 @@ public Boolean deserialize(JsonParser jsonParser,

}

/**
* Class for marking the column statistics when creating tables.
*/
public static class ColumnStatsSetup {
private static ObjectReader objectReader;
private static ObjectWriter objectWriter;
static {
ObjectMapper objectMapper = new ObjectMapper();
objectReader = objectMapper.readerFor(ColumnStatsSetup.class);
objectWriter = objectMapper.writerFor(ColumnStatsSetup.class);
}

@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public boolean enabled;
@JsonInclude(JsonInclude.Include.NON_DEFAULT)
public String fileToEscape;
@JsonInclude(JsonInclude.Include.NON_EMPTY)
public List<String> columnNames = new ArrayList<>();

public static ColumnStatsSetup parseStatsSetup(String statsSetup) {
if (statsSetup == null) {
return new ColumnStatsSetup();
}
try {
return objectReader.readValue(statsSetup);
} catch (Exception e) {
return new ColumnStatsSetup();
}
}

/**
* Get json representation of the ColumnStatsSetup
*/
public static String getStatsSetupAsString(boolean enabled,
String fileToEscape,
List<String> columns) {
try {
ColumnStatsSetup statsSetup = new ColumnStatsSetup();
statsSetup.enabled = enabled;
statsSetup.columnNames = new ArrayList<>(columns);
statsSetup.fileToEscape = fileToEscape;
return objectWriter.writeValueAsString(statsSetup);
} catch (Exception e) {
// this should not happen
throw new RuntimeException(e);
}
}
}
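The JSON payload this class serializes can be sketched by hand without Jackson. The field names below mirror ColumnStatsSetup (enabled, fileToEscape, columnNames); the key ordering shown is an assumption of this sketch, not a guarantee about the Jackson output:

```java
import java.util.List;
import java.util.StringJoiner;

public class StatsSetupSketch {

    // Hand-rolled stand-in for the Jackson-based getStatsSetupAsString above.
    static String toJson(boolean enabled, String fileToEscape, List<String> columns) {
        StringJoiner cols = new StringJoiner(",", "[", "]");
        for (String c : columns) {
            cols.add("\"" + c + "\"");
        }
        StringBuilder sb = new StringBuilder("{\"enabled\":").append(enabled);
        if (fileToEscape != null) {
            // Matches the NON_DEFAULT inclusion: omitted entirely when null.
            sb.append(",\"fileToEscape\":\"").append(fileToEscape).append("\"");
        }
        return sb.append(",\"columnNames\":").append(cols).append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(true, "metadata", List.of("a", "b", "c")));
        // {"enabled":true,"fileToEscape":"metadata","columnNames":["a","b","c"]}
    }
}
```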

public static boolean areBasicStatsUptoDate(Map<String, String> params) {
if (params == null) {
return false;
@@ -32,6 +32,7 @@
import java.util.regex.Pattern;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.hive.common.TableName;
import org.apache.hadoop.hive.metastore.api.Catalog;
import org.apache.hadoop.hive.metastore.api.DatabaseType;
@@ -500,16 +501,23 @@ public void recycleDirToCmPath(Path f, boolean ifPurge) throws MetaException {
}

public boolean isEmptyDir(Path path) throws IOException, MetaException {
return isEmptyDir(path, null);
}

public boolean isEmptyDir(Path path, PathFilter pathFilter)
throws IOException, MetaException {
try {
int listCount = getFs(path).listStatus(path).length;
if (listCount == 0) {
return true;
final int listCount;
if (pathFilter == null) {
listCount = getFs(path).listStatus(path).length;
} else {
listCount = getFs(path).listStatus(path, pathFilter).length;
}
return listCount == 0;
} catch (FileNotFoundException fnfe) {
// File named by path doesn't exist; nothing to validate.
return false;
}
return false;
}
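The filtered emptiness check can be sketched with plain java.io in place of the Hadoop FileSystem API. The "metadata" directory name mirrors the Iceberg use case; this is illustrative, not Hive's actual code:

```java
import java.io.File;
import java.io.FileFilter;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyDirCheck {

    // Emptiness check with an optional filter, mirroring the two-argument
    // isEmptyDir above but using java.io.File instead of Hadoop's FileSystem.
    static boolean isEmptyDir(File dir, FileFilter filter) {
        File[] entries = (filter == null) ? dir.listFiles() : dir.listFiles(filter);
        if (entries == null) {
            // Path doesn't exist or isn't a directory; mirrors the FileNotFoundException branch.
            return false;
        }
        return entries.length == 0;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("empty-dir-check");
        Files.createDirectory(dir.resolve("metadata")); // the only entry is a metadata dir
        FileFilter skipMetadata = f -> !f.getName().equals("metadata");
        System.out.println(isEmptyDir(dir.toFile(), null));         // false
        System.out.println(isEmptyDir(dir.toFile(), skipMetadata)); // true
    }
}
```

With the filter in place, a directory containing only Iceberg metadata counts as empty, which is what lets table creation treat it as having no legacy data files.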

public boolean isWritable(Path path) throws IOException {
@@ -2361,10 +2361,8 @@ private void create_table_core(final RawStore ms, final CreateTableRequest req)
madeDir = true;
}
}
if (MetastoreConf.getBoolVar(conf, ConfVars.STATS_AUTO_GATHER) &&
!MetaStoreUtils.isView(tbl)) {
MetaStoreServerUtils.updateTableStatsSlow(db, tbl, wh, madeDir, false, envContext);
}

MetaStoreServerUtils.updateTableStatsForCreateTable(wh, db, tbl, envContext, conf, tblPath, madeDir);

// set create time
long time = System.currentTimeMillis() / 1000;