Hadoop InputRowParser for Orc file #3019

Merged
merged 13 commits into from
Jul 26, 2016

sirpkt
Contributor

@sirpkt sirpkt commented May 26, 2016

Related to #3017

  • The user should supply the schema as the string representation of an ORC struct type.
    For example, a string column col1 and an array-of-string column col2 are represented by struct<col1:string,col2:array<string>>.
  • Only Java primitive type columns and arrays of Java primitive types are supported.
  • As shown in the hadoop_orc_job.json example, inputFormat should be set to org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.
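
To make the struct type string format concrete, here is a minimal sketch in plain Java that renders an ordered set of column names and types into that form. The class and method names (`OrcTypeStringSketch`, `structTypeString`) are hypothetical and are not part of this patch; the sketch only illustrates the string layout described above and does not touch any Hive or ORC API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class OrcTypeStringSketch {
    // Hypothetical helper: renders ordered column -> type pairs as
    // "struct<name1:type1,name2:type2,...>".
    static String structTypeString(Map<String, String> columns) {
        StringJoiner joiner = new StringJoiner(",", "struct<", ">");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            joiner.add(column.getKey() + ":" + column.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so columns appear as declared.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("col1", "string");
        columns.put("col2", "array<string>");
        // prints struct<col1:string,col2:array<string>>
        System.out.println(structTypeString(columns));
    }
}
```

An insertion-ordered map is used here on the assumption that the column order in the type string should match the order in which the columns are declared.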

@fjy fjy added the Feature label May 26, 2016
@fjy fjy added this to the 0.9.2 milestone May 26, 2016
@fjy
Contributor

fjy commented May 26, 2016

@sirpkt can we also get some usage docs and an update to the extensions.md file?

@sirpkt
Contributor Author

sirpkt commented May 26, 2016

@fjy Sure, I'll update the docs.

@sirpkt
Contributor Author

sirpkt commented May 26, 2016

This patch uses the Hive ORC reader, so it has complex dependencies, including hive-exec.
Apache ORC (https://github.com/apache/orc) is working to separate the ORC code from Hive, but so far this is done only for org.apache.hadoop.mapred (the old API).
Since a pull request for org.apache.hadoop.mapreduce already exists, I think it will soon support the new MapReduce API.
I'll watch the progress and update the patch to use that library.

@sirpkt
Contributor Author

sirpkt commented May 30, 2016

While testing on a real server cluster, I ran into a library dependency problem.
So I am closing this PR until I resolve it.

@sirpkt sirpkt closed this May 30, 2016
@sirpkt
Contributor Author

sirpkt commented May 30, 2016

Maven dependency updated.

@sirpkt sirpkt reopened this May 30, 2016
@sirpkt sirpkt force-pushed the hadoop-orc branch 4 times, most recently from 333720a to 7740b66 Compare June 3, 2016 06:31
"inputSpec": {
"type": "static",
"inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
"paths": "no_metrics"
Contributor

what should be filled out in the paths field?

@fjy
Contributor

fjy commented Jun 9, 2016

Overall this PR looks reasonable to me; however, I have little experience with ORC files. Can someone who has more experience with ORC files take a look and comment on the API?

@fjy
Contributor

fjy commented Jun 13, 2016

@sirpkt there are some merge errors

@sirpkt
Contributor Author

sirpkt commented Jun 16, 2016

Rebased and updated based on comments.

builder.append(parseSpec.getTimestampSpec()).append(":string");
if (parseSpec.getDimensionsSpec().getDimensionNames().size() > 0) {
builder.append(",");
builder.append(StringUtils.join(parseSpec.getDimensionsSpec().getDimensionNames(), ":string,")).append(":string");
Member

do we need to append ":string" twice here?
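
For context on this question, here is a minimal, self-contained sketch of how the join in the reviewed snippet behaves. It is an illustration under stated assumptions, not the patch's code: `String.join` from the JDK stands in for `StringUtils.join`, the timestamp column is passed in as a plain string, and the class and method names (`JoinSuffixSketch`, `columnsPart`) are hypothetical. A join places its separator only between elements, so using `":string,"` as the separator leaves the last dimension without a type, and the trailing `append(":string")` supplies it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class JoinSuffixSketch {
    // Mirrors the shape of the reviewed snippet. String.join is used here
    // as a stand-in for StringUtils.join (an assumption for illustration).
    static String columnsPart(String timestampColumn, List<String> dimensions) {
        StringBuilder builder = new StringBuilder();
        builder.append(timestampColumn).append(":string");
        if (!dimensions.isEmpty()) {
            builder.append(",");
            // The separator ":string," lands between elements only,
            // so the final element still needs its own ":string".
            builder.append(String.join(":string,", dimensions)).append(":string");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // prints timestamp:string,col1:string,col2:string
        System.out.println(columnsPart("timestamp", Arrays.asList("col1", "col2")));
        // prints timestamp:string
        System.out.println(columnsPart("timestamp", Collections.<String>emptyList()));
    }
}
```

Mapping each name to `name + ":string"` first and then joining with `","` would give the same result with a single suffix, which may read more clearly.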

@fjy
Contributor

fjy commented Jul 1, 2016

@sirpkt any chance we can finish this up?

@fjy
Contributor

fjy commented Jul 6, 2016

@sirpkt

@sirpkt
Contributor Author

sirpkt commented Jul 8, 2016

@fjy sorry for the late response.
I just moved house, so I did not have enough time to follow the review.
I'll update based on the review comments.

@sirpkt
Contributor Author

sirpkt commented Jul 11, 2016

updated based on the review comments

@nishantmonu51
Member

👍 , all my comments are addressed.

@@ -0,0 +1,134 @@
<?xml version="1.0" encoding="UTF-8"?>
Contributor

this is missing a license header

@fjy
Contributor

fjy commented Jul 18, 2016

👍 after the missing license header is added

@fjy
Contributor

fjy commented Jul 21, 2016

@sirpkt

@sirpkt
Contributor Author

sirpkt commented Jul 26, 2016

@fjy sorry for the late response.
I rebased the code and added the license header.

@fjy fjy merged commit 95a5809 into apache:master Jul 26, 2016
@sirpkt sirpkt deleted the hadoop-orc branch July 27, 2016 07:27
@gianm gianm mentioned this pull request Sep 23, 2016
@b-slim
Contributor

b-slim commented Feb 17, 2017

seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020
seoeun25 added a commit to seoeun25/incubator-druid that referenced this pull request Feb 25, 2022

4 participants