Hadoop InputRowParser for Orc file #3019

sirpkt · 2016-05-26T01:54:01Z

Related to #3017

The user should supply schema information as the string representation of an Orc struct type.
For example, a string column col1 and an array-of-string column col2 are represented by struct<col1:string,col2:array<string>>.
Only Java primitive type columns and arrays of Java primitive types are supported.
As shown in the hadoop_orc_job.json example, inputFormat should be set to org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.
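
To illustrate the type-string format described above, here is a minimal sketch in plain Java that renders an ordered set of column names and types as a struct type string. The class `OrcTypeStringSketch` and the method `structOf` are hypothetical names used only for illustration; they are not part of this patch or of the Hive or Orc libraries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of the patch.
final class OrcTypeStringSketch {

  // Renders columns (in insertion order) as "struct<name:type,...>".
  static String structOf(Map<String, String> columns) {
    StringJoiner joiner = new StringJoiner(",", "struct<", ">");
    for (Map.Entry<String, String> column : columns.entrySet()) {
      joiner.add(column.getKey() + ":" + column.getValue());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the column order stable.
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("col1", "string");
    columns.put("col2", "array<string>");
    // prints struct<col1:string,col2:array<string>>
    System.out.println(structOf(columns));
  }
}
```

The sketch only formats strings. It does not check that a type name is one the parser supports.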

fjy · 2016-05-26T02:01:15Z

@sirpkt can we also get some docs on usage, and can you update the extensions.md file?

sirpkt · 2016-05-26T02:10:06Z

@fjy Sure, I'll update the docs.

sirpkt · 2016-05-26T08:06:11Z

This patch uses the Hive Orc reader, so it has complex dependencies, including hive-exec.
Apache Orc (https://github.com/apache/orc) is working to separate the Orc code from Hive, but so far this has been done only for org.apache.hadoop.mapred (the old API).
Since a pull request for org.apache.hadoop.mapreduce already exists, I think it will support the new MapReduce API soon.
I'll watch the progress and update the patch to use that library.

sirpkt · 2016-05-30T00:13:30Z

While testing on a real server cluster, I ran into a library dependency problem.
So I am closing this PR until I resolve that problem.

sirpkt · 2016-05-30T04:09:29Z

maven dependency updated

fjy · 2016-06-09T01:02:05Z

docs/content/development/extensions-contrib/orc.md

+      "inputSpec": {
+        "type": "static",
+        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
+        "paths": "no_metrics"


what should be filled out in the paths field?

fjy · 2016-06-09T01:07:10Z

Overall this PR looks reasonable to me; however, I have little experience with ORC files. Can someone who has more experience with ORC files take a look and comment on the API?

fjy · 2016-06-13T23:18:28Z

@sirpkt there are some merge errors

sirpkt · 2016-06-16T02:46:27Z

rebase and update based on comments

nishantmonu51 · 2016-06-23T17:04:40Z

...ns-contrib/orc-extensions/src/main/java/io/druid/data/input/orc/OrcHadoopInputRowParser.java

+    builder.append(parseSpec.getTimestampSpec()).append(":string");
+    if (parseSpec.getDimensionsSpec().getDimensionNames().size() > 0) {
+      builder.append(",");
+      builder.append(StringUtils.join(parseSpec.getDimensionsSpec().getDimensionNames(), ":string,")).append(":string");


do we need to append ":string" twice here?
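
For context on the question above, a minimal sketch of why a separator-based join would need the trailing suffix. It assumes `StringUtils.join` in the reviewed snippet behaves like a plain separator join, as `String.join` does, so the separator appears only between elements. The class `JoinSuffixSketch` and both of its methods are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the join pattern under review; not the patch's code.
final class JoinSuffixSketch {

  // Joining with ":string," puts the suffix between names only,
  // so the last name still needs ":string" appended afterwards.
  static String dimensionTypes(List<String> names) {
    if (names.isEmpty()) {
      return "";
    }
    return String.join(":string,", names) + ":string";
  }

  // Same join without the trailing append: the last name is left untyped.
  static String withoutTrailingSuffix(List<String> names) {
    return String.join(":string,", names);
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("a", "b", "c");
    System.out.println(dimensionTypes(names));        // prints a:string,b:string,c:string
    System.out.println(withoutTrailingSuffix(names)); // prints a:string,b:string,c
  }
}
```

Under that assumption, the second `append(":string")` is what types the final dimension. Appending `":string"` to each name first and then joining with `","` would produce the same result.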

fjy · 2016-07-01T19:36:52Z

@sirpkt any chance we can finish this up?

fjy · 2016-07-06T21:14:50Z

@sirpkt

sirpkt · 2016-07-08T05:18:36Z

@fjy sorry for the late response.
I just moved house, so I did not have enough time to follow the review.
I'll update based on the review comments.

sirpkt · 2016-07-11T02:56:06Z

updated based on the review comments

nishantmonu51 · 2016-07-12T08:28:58Z

👍, all my comments are addressed.

fjy · 2016-07-18T16:44:39Z

extensions-contrib/orc-extensions/pom.xml

@@ -0,0 +1,134 @@
+<?xml version="1.0" encoding="UTF-8"?>


this is missing a license header

fjy · 2016-07-18T16:45:12Z

👍 after the missing license header is added

fjy · 2016-07-21T00:09:25Z

@sirpkt

sirpkt · 2016-07-26T05:50:18Z

@fjy sorry for the late response.
I rebased the code and added the license header.

b-slim · 2017-02-17T03:56:17Z

@sirpkt can you help with this issue https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-user/wiXVMMmXOgU/fHXn6hR0CQAJ

fjy added the Feature label May 26, 2016

fjy added this to the 0.9.2 milestone May 26, 2016

sirpkt closed this May 30, 2016

sirpkt reopened this May 30, 2016

sirpkt force-pushed the hadoop-orc branch 4 times, most recently from 333720a to 7740b66 Compare June 3, 2016 06:31

fjy reviewed Jun 9, 2016

sirpkt force-pushed the hadoop-orc branch from 7740b66 to 318cce4 Compare June 16, 2016 02:45

sirpkt force-pushed the hadoop-orc branch from 318cce4 to d33c615 Compare June 22, 2016 02:05

nishantmonu51 reviewed Jun 23, 2016

sirpkt force-pushed the hadoop-orc branch from d33c615 to 6db967c Compare July 11, 2016 02:54

fjy reviewed Jul 18, 2016

sirpkt added 13 commits July 26, 2016 14:48

InputRowParser to decode OrcStruct from OrcNewInputFormat

f60b47b

add unit test for orc hadoop indexing

93f759d

update docs and fix test code bug

3dc0bf7

doc updated

8eb4242

resove maven dependency conflict

5e55183

remove unused imports

223dea9

fix returning array type from Object[] to correct primitive array type

4596ed4

fix to support getDimension() of MapBasedRow : changing return type of orc list from array to list

1f4ded0

rebase and updated based on comments

ce2257e

updated based on comments

70ce88e

on reflecting review comments

5d6d3c0

fix bug in typeStringFromParseSpec() and add unit test

345c056

add license header

67eedff

sirpkt force-pushed the hadoop-orc branch from 6db967c to 67eedff Compare July 26, 2016 05:49

fjy merged commit 95a5809 into apache:master Jul 26, 2016

sirpkt deleted the hadoop-orc branch July 27, 2016 07:27

nishantmonu51 mentioned this pull request Jul 28, 2016

[Feature] Hadoop Ingestion with Orc Support #3017

Closed

gianm mentioned this pull request Sep 23, 2016

Druid 0.9.2 release notes #3503

Closed

seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020

apache#3019 Add granularity dimension to QueryMetrics

45b6547

seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Feb 25, 2022

apache#3019 Add granularity dimension to QueryMetrics

1505c1a
