
PHOENIX-2288 Phoenix-Spark: PDecimal precision and scale aren't carried through to Spark DataFrame #124

Closed
wants to merge 3 commits

Conversation

@navis commented Oct 22, 2015

From the JIRA description:

When loading a Spark dataframe from a Phoenix table with a 'DECIMAL' type, the underlying precision and scale aren't carried forward to Spark.

The Spark catalyst schema converter should load these from the underlying column. These appear to be exposed in the ResultSetMetaData, but if there was a way to expose these somehow through ColumnInfo, it would be cleaner.

I'm not sure if Pig has the same issues or not, but I suspect it may.

This seemed sufficient for the current usage in the Spark integration. In the long term, though, PDataType should carry meta information such as maxLength, precision, etc.
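For reference, the declared precision and scale are already visible through plain JDBC, which is what the description refers to. A minimal sketch, assuming a local Phoenix JDBC URL and a hypothetical MY_TABLE with a DECIMAL(10,2) PRICE column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DecimalMetadataCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT PRICE FROM MY_TABLE LIMIT 1")) {
            ResultSetMetaData md = rs.getMetaData();
            // For a DECIMAL(10,2) column, JDBC reports the declared precision and scale,
            // which is the information that currently gets lost on the Spark side.
            System.out.println("precision = " + md.getPrecision(1));
            System.out.println("scale     = " + md.getScale(1));
        }
    }
}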

@jmahonin
Contributor

This looks great, @navis.

The Spark portion looks fine. I'll leave the updates to ColumnInfo for @ravimagham @JamesRTaylor et al. to review.

if (pColumn.getMaxLength() == null) {
    return new ColumnInfo(pColumn.toString(), sqlType);
}
if (sqlType == Types.CHAR || sqlType == Types.VARCHAR) {
Contributor

Rather than check for particular types, it'd be more general to check for null like this:

Integer maxLength = pColumn.getMaxLength();
Integer scale = pColumn.getScale();
return new ColumnInfo(pColumn.toString(), sqlType, maxLength, scale);

Then make sure that ColumnInfo handles a null maxLength and scale.
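A minimal sketch of the null handling being asked for here; this is an illustration only, not the actual Phoenix ColumnInfo class, and the field and method names are assumptions:

// Sketch only: tolerate a null maxLength/scale as suggested above.
public class ColumnInfoSketch {
    private final String columnName;
    private final int sqlType;
    private final Integer maxLength; // null when no length/precision was declared
    private final Integer scale;     // null when no scale was declared

    public ColumnInfoSketch(String columnName, int sqlType, Integer maxLength, Integer scale) {
        this.columnName = columnName;
        this.sqlType = sqlType;
        this.maxLength = maxLength;
        this.scale = scale;
    }

    @Override
    public String toString() {
        // Append "(maxLength)" or "(maxLength,scale)" only when the values are present,
        // so columns without a declared length still serialize cleanly.
        StringBuilder sb = new StringBuilder(columnName).append(' ').append(sqlType);
        if (maxLength != null) {
            sb.append('(').append(maxLength);
            if (scale != null) {
                sb.append(',').append(scale);
            }
            sb.append(')');
        }
        return sb.toString();
    }
}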

Author

@JamesRTaylor How about moving the logic above into PColumn? Then we could access the full information there.

Contributor

ColumnInfo is a kind of lightweight transport class solely for passing the column metadata needed for the MR and Spark integration to run. It's passed in through the config, so it has some simple to/from string methods - this saves us from having to look up the metadata from Phoenix using the regular JDBC metadata APIs (which would be another option). Having this ColumnInfo class was deemed slightly easier.

PColumn has more information than we need, and it'd be best to keep it as an internal/private class as much as possible. It's the object representation of our column metadata.
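As an illustration of that transport pattern, the metadata only needs a simple string round trip so it can ride along in the job configuration. The encoding and class name below are made up for illustration, not Phoenix's actual format:

// Hypothetical helpers illustrating the to/from-string round trip described above.
public class ColumnInfoCodecSketch {
    static String encode(String name, int sqlType, Integer maxLength, Integer scale) {
        return name + ":" + sqlType + ":" + maxLength + ":" + scale; // "null" marks absent values
    }

    static Object[] decode(String encoded) {
        String[] parts = encoded.split(":");
        return new Object[] { parts[0], Integer.valueOf(parts[1]),
                parseOrNull(parts[2]), parseOrNull(parts[3]) };
    }

    private static Integer parseOrNull(String s) {
        return "null".equals(s) ? null : Integer.valueOf(s);
    }
}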

@JamesRTaylor
Contributor

Thanks for the pull request, @navis. A couple of minor comments, but overall it looks great. FYI, our PDataType class is stateless (it was an enum originally), so we currently access maxLength/precision and scale through the PDatum interface (from which PColumn and Expression are derived). Now that PDataType is no longer an enum, it might be nice to allow instantiation with maxLength and scale provided at construction time. Please file a JIRA.
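A rough sketch of what construction-time maxLength/scale could look like; this is purely hypothetical, with invented names, and does not reflect the real PDataType hierarchy:

// Purely hypothetical: a type instance that carries its own maxLength/precision
// and scale instead of relying on PDatum lookups.
public class SizedDataTypeSketch {
    private final String sqlTypeName;
    private final Integer maxLength; // precision for DECIMAL, length for CHAR/VARCHAR
    private final Integer scale;

    public SizedDataTypeSketch(String sqlTypeName, Integer maxLength, Integer scale) {
        this.sqlTypeName = sqlTypeName;
        this.maxLength = maxLength;
        this.scale = scale;
    }

    public Integer getMaxLength() { return maxLength; }
    public Integer getScale() { return scale; }
}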

@jmahonin
Contributor

How's this look, @JamesRTaylor / @ravimagham?

@stoty
Contributor

stoty commented Aug 1, 2023

Already merged.

@stoty closed this Aug 1, 2023