
Allow writing InputRowParser extensions that use hadoop/any libraries #1695

Merged

Conversation

himanshug (Contributor)

This PR is the result of the discussion in #1682. This patch

  1. allows users to write a custom InputRowParser for specific cases during Hadoop ingestion, which can depend on Hadoop or any other libraries
  2. reverts the special handling introduced for BytesWritable in "Support parsing of BytesWritable strings in HadoopDruidIndexerMapper" (#1682), handling it instead via a HadoopyStringInputRowParser (a sketch of such a parser follows below)
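
For context, a minimal sketch of what such a Hadoop-aware parser might look like, assuming Druid's InputRow, InputRowParser, and StringInputRowParser types; the signatures are illustrative, not necessarily the exact code merged here. It wraps a plain string parser and unwraps Hadoop Writable values before delegating:

    import java.nio.ByteBuffer;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;

    // Sketch only: unwraps Hadoop Writables and delegates to a plain
    // StringInputRowParser, so non-hadoop code paths stay untouched.
    public class HadoopyStringInputRowParser implements InputRowParser<Object>
    {
      private final StringInputRowParser parser;

      public HadoopyStringInputRowParser(StringInputRowParser parser)
      {
        this.parser = parser;
      }

      @Override
      public InputRow parse(Object input)
      {
        if (input instanceof Text) {
          return parser.parse(((Text) input).toString());
        } else if (input instanceof BytesWritable) {
          BytesWritable bytes = (BytesWritable) input;
          // getBytes() may return a padded backing array, so pass the valid length
          return parser.parse(ByteBuffer.wrap(bytes.getBytes(), 0, bytes.getLength()));
        } else {
          throw new IllegalArgumentException("can't parse type [" + input.getClass().getName() + "]");
        }
      }
    }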

@himanshug (Contributor, Author)

Most of the changes are updates resulting from constructor argument changes in DataSchema. The changes to note are in DataSchema.java, HadoopDruidIndexerConfig.java, IndexingHadoopModule.java, and HadoopyStringInputRowParser.java.

@himanshug (Contributor, Author)

@gianm pls see this when you can; it is the continuation of the discussion in #1682.

I see there are some merge conflicts now; I will resolve them once the review is done or has made progress.

[Review thread on this diff hunk:]

    null,
    ImmutableList.of("timestamp", "host", "host2", "visited_num")
    )
    MAPPER.convertValue(
Contributor:

Does this work recursively? Or would you get a Map where the values are regular objects rather than other Maps?

Contributor (Author):

It does convert recursively; everything becomes a Map. Also, it wouldn't matter for this test as long as, for an InputRowParser parser,

mapper.convertValue(mapper.convertValue(parser, Map.class), InputRowParser.class)

produces the right thing. The specific changes for DataSchema serde are tested in DataSchemaTest.
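
To make the recursion point concrete, here is a small self-contained Jackson example (the beans are hypothetical, purely for illustration): convertValue turns nested beans into nested Maps, and converting back reproduces an equivalent object.

    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ConvertValueRoundTrip
    {
      public static void main(String[] args)
      {
        ObjectMapper mapper = new ObjectMapper();
        Outer original = new Outer("2015-09-08", new Inner("host1", 3));

        // convertValue recurses: the nested bean becomes a nested Map
        Map<?, ?> asMap = mapper.convertValue(original, Map.class);
        System.out.println(asMap.get("inner") instanceof Map); // true

        // ...and the round trip yields an equivalent bean
        Outer back = mapper.convertValue(asMap, Outer.class);
        System.out.println(back.inner.host); // host1
      }

      // hypothetical beans, stand-ins for a real InputRowParser
      public static class Outer
      {
        public String timestamp;
        public Inner inner;

        public Outer() {}

        public Outer(String timestamp, Inner inner)
        {
          this.timestamp = timestamp;
          this.inner = inner;
        }
      }

      public static class Inner
      {
        public String host;
        public int visitedNum;

        public Inner() {}

        public Inner(String host, int visitedNum)
        {
          this.host = host;
          this.visitedNum = visitedNum;
        }
      }
    }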

@gianm (Contributor) commented Sep 8, 2015

@himanshug had a couple of minor comments, but overall this looks good to me. I don't see a better way of having hadoopy parsers while still respecting the classpath situation…

@gianm (Contributor) commented Sep 8, 2015

Discussed this on the dev sync; @cheddar thought that in the future it would make sense for the hadoop input stuff to be loadable from extensions, similar to how firehoses can be loaded from extensions.

👍 on this PR for now though, since that's a longer-term approach.

@himanshug himanshug force-pushed the allow_hadoop_based_input_row_parser branch from 70d4478 to 7b305c2 Compare September 9, 2015 16:05
@himanshug (Contributor, Author)

@gianm addressed review comments and rebased against latest master.

[Review thread on this diff hunk:]

    private static final ObjectMapper jsonMapper;
    static {
      jsonMapper = new DefaultObjectMapper();
      jsonMapper.setInjectableValues(new InjectableValues.Std().addValue(ObjectMapper.class, jsonMapper));
Contributor:

Injected ObjectMappers should be tagged with @Json or @Smile. Is there somewhere that is not doing this?

Contributor (Author):

not needed
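
For readers following the thread, this is the Jackson-level mechanism in play: registering the mapper under ObjectMapper.class via InjectableValues.Std lets any class deserialized by that mapper pull it in with @JacksonInject, independently of Guice @Json/@Smile bindings. A sketch with a hypothetical class:

    import com.fasterxml.jackson.annotation.JacksonInject;
    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonProperty;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Hypothetical class for illustration: Jackson supplies the value
    // registered for ObjectMapper.class at deserialization time.
    public class NeedsMapper
    {
      private final ObjectMapper mapper;
      private final String name;

      @JsonCreator
      public NeedsMapper(
          @JacksonInject ObjectMapper mapper,
          @JsonProperty("name") String name
      )
      {
        this.mapper = mapper;
        this.name = name;
      }
    }

A jsonMapper.convertValue(map, NeedsMapper.class) call against the mapper configured above would then hand NeedsMapper that same mapper instance.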

@drcrallen (Contributor)

@gianm / @himanshug the problem of lazily converting Map<String, Object> --> ClassOfInterest is not unique to this PR. It is also present in LoadSpec as well as QTL (QTL is not fixed yet; routers require it in order to serde the query).

[Review thread on this diff hunk:]

    public HadoopIngestionSpecUpdateDatasourcePathSpecSegmentsTest()
    {
      jsonMapper = new DefaultObjectMapper();
      jsonMapper.setInjectableValues(
Contributor:

same @Json comment

Contributor (Author):

not needed

@xvrl (Member) commented Sep 9, 2015

@himanshug does this change introduce any performance overhead?

@himanshug (Contributor, Author)

@xvrl none observable. That said, in theory the InputRowParser is created each time getParser() is called (see #1695 (comment)); however, that never happens in a tight loop, so it does not matter.
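
A minimal sketch of the behavior described, with illustrative names (the merged code may differ in detail): the schema holds the parser spec as a raw Map and only converts it to an InputRowParser inside getParser(), so Hadoop-dependent parser classes stay off the classpath until a parser is actually needed.

    import java.util.Map;

    import com.fasterxml.jackson.databind.ObjectMapper;

    // Sketch: Druid's InputRowParser type is assumed. Deserializing the
    // schema leaves the parser as a plain Map, so e.g. the overlord can
    // accept the spec without the parser's classes being loadable.
    public class DataSchemaSketch
    {
      private final ObjectMapper jsonMapper;
      private final Map<String, Object> parserMap;

      public DataSchemaSketch(ObjectMapper jsonMapper, Map<String, Object> parserMap)
      {
        this.jsonMapper = jsonMapper;
        this.parserMap = parserMap;
      }

      public InputRowParser getParser()
      {
        // converted on every call, as noted above; getParser() is never
        // invoked in a tight loop, so the cost is not observable
        return jsonMapper.convertValue(parserMap, InputRowParser.class);
      }
    }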

@himanshug (Contributor, Author)

@drcrallen this PR is pending response on
#1695 (comment)
#1695 (comment)
#1695 (comment)

pls see whenever you have time. thanks.

@drcrallen (Contributor)

Thanks @himanshug, responded.

[Commit message:]
so that user hadoop-related InputRowParsers are created only when needed;
this allows the overlord to accept a HadoopIndexTask with a hadoopy InputRowParser
and not fail because the hadoopy InputRowParser might need hadoop libraries
@himanshug himanshug force-pushed the allow_hadoop_based_input_row_parser branch from 7b305c2 to e8b9ee8 Compare September 16, 2015 15:58
@himanshug (Contributor, Author)

@drcrallen all review comments have been addressed.

@drcrallen (Contributor)

👍

drcrallen added a commit that referenced this pull request Sep 16, 2015:

    …parser
    Allow writing InputRowParser extensions that use hadoop/any libraries
@drcrallen drcrallen merged commit 42bd4f6 into apache:master Sep 16, 2015
@himanshug himanshug deleted the allow_hadoop_based_input_row_parser branch September 22, 2015 13:56