
[CARBONDATA-2608] SDK Support JSON data loading directly (without AVRO conversion) #2384

Closed
wants to merge 2 commits

Conversation

@ajantha-bhat (Member) commented Jun 19, 2018

What changes were proposed in this pull request?
Currently the SDK supports JSON data loading only via AVRO.
So loading JSON is a two-step process: convert the JSON to an Avro record, then convert the Avro record to a Carbon object. Hence there is a need for a new CarbonWriter that works with JSON directly, without AVRO.
This PR implements that.

Highlights:
- Works with just the JSON data and a Carbon schema.
- Implements Hadoop's FileInputFormat to create a JsonInputFormat.
- Supports reading multiple JSON files in a folder.
- Supports reading multiline JSON data with a record identifier.
- Supports single-row JSON read and write.
- Handles bad records when loading JSON data.
- Supports n-level nesting of complex types in JSON data.
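The folder-scan highlight above can be sketched with plain Java NIO. This is a minimal stand-alone illustration; `JsonFileLister` and `listJsonFiles` are hypothetical names and not the SDK's actual API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class JsonFileLister {

    // Recursively collect all regular files under `dir` ending with `extension`.
    static List<Path> listJsonFiles(Path dir, String extension) throws IOException {
        List<Path> result = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(extension))
                .forEach(result::add);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("json-demo");
        Files.writeString(dir.resolve("a.json"), "{\"name\":\"ab\"}");
        Files.writeString(dir.resolve("b.json"), "{\"name\":\"cd\"}");
        Files.writeString(dir.resolve("notes.txt"), "ignored");
        System.out.println(listJsonFiles(dir, ".json").size()); // prints 2
    }
}
```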

How was this patch tested?

Manual testing was done, and UTs are added in this PR.

Be sure to complete all items of the following checklist to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed? NA

  • Any backward compatibility impacted? NA

  • Document update required? Yes, will be handled in separate PR

  • Testing done
    Yes, updated the UT.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6393/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5227/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5338/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5339/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6395/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5341/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5229/

@ajantha-bhat ajantha-bhat force-pushed the issue_fix branch 2 times, most recently from 04d498e to aa7caec Compare June 21, 2018 17:08
@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6456/

@CarbonDataQA

Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5287/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5377/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5378/

@ajantha-bhat
Member Author

retest this please

* @param filePath
* @return
*/
public static String readFromFile(String filePath) {
Contributor:

If this is only used for test cases, move it into the test code.

Member Author:
ok. done

* @param jsonFileList
* @param fileExtension
*/
public static void getFileList(CarbonFile carbonFile, List<String> jsonFileList,
Contributor:

This method is unused, please remove it.

Member Author:
ok. done

@Override public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException, InterruptedException {
RecordReader<LongWritable, Text> rdr;
if (context.getConfiguration().getBoolean(ONE_RECORD_PER_LINE, false)) {
Contributor:

Use the getOneRecordPerLine method here.

Member Author:
ok. done
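The two reading modes discussed in this thread (one record per line vs. multiline records) can be sketched in plain Java. Simple brace counting stands in here for the real record-identifier logic (it would break on braces inside string values), and all names are illustrative, not the actual JsonInputFormat code:

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitDemo {

    // One-record-per-line mode: each non-empty input line is a complete record.
    static List<String> splitPerLine(String input) {
        List<String> records = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (!line.trim().isEmpty()) {
                records.add(line.trim());
            }
        }
        return records;
    }

    // Multiline mode: accumulate characters until the braces of one JSON
    // object balance, then emit that object as a record.
    static List<String> splitMultiline(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : input.toCharArray()) {
            if (c == '{') depth++;
            if (depth > 0) current.append(c);
            if (c == '}') {
                depth--;
                if (depth == 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String perLine = "{\"a\":1}\n{\"a\":2}\n";
        String multi = "{\n \"a\": 1\n}\n{\n \"a\": 2\n}\n";
        System.out.println(splitPerLine(perLine).size());  // prints 2
        System.out.println(splitMultiline(multi).size());  // prints 2
    }
}
```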

@Override public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {

this.identifier = context.getConfiguration().get(RECORD_IDENTIFIER);
Contributor:

Use getRecordIdentifier here.

Member Author:
ok. done

* @return The JsonNode or null if a JsonParseException,
* JsonMappingException, or IOException error occurs
*/
public static synchronized JsonNode decodeLineToJsonNode(String line) {
Contributor:

Is static required here?

Member Author:

Yes, it is required: JsonRecordReader is a static class, and this method is referenced from it.
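What decodeLineToJsonNode does can be illustrated with a stdlib-only sketch. The real method uses a JSON library and returns a JsonNode; this hypothetical `FlatJsonDecoder` handles only flat string-valued objects, but mirrors the null-on-error contract described in the javadoc above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlatJsonDecoder {

    // Decode a flat JSON object like {"name":"bob","age":"24"} into a map.
    // Returns null on malformed input; real code would use a JSON library.
    static Map<String, String> decodeLine(String line) {
        if (line == null) return null;
        String s = line.trim();
        if (!s.startsWith("{") || !s.endsWith("}")) return null;
        Map<String, String> result = new LinkedHashMap<>();
        String body = s.substring(1, s.length() - 1).trim();
        if (body.isEmpty()) return result;
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length != 2) return null;
            result.put(strip(kv[0]), strip(kv[1]));
        }
        return result;
    }

    // Remove surrounding quotes and whitespace from a key or value token.
    private static String strip(String token) {
        String t = token.trim();
        if (t.startsWith("\"") && t.endsWith("\"") && t.length() >= 2) {
            t = t.substring(1, t.length() - 1);
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(decodeLine("{\"name\":\"bob\",\"age\":\"24\"}")); // {name=bob, age=24}
        System.out.println(decodeLine("not json"));                          // null
    }
}
```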

e.printStackTrace();
return null;
} catch (JsonMappingException e) {
e.printStackTrace();
Contributor:

Remove the stack trace, add a logger, and throw the exception.

Member Author:
ok. done

return (float) rdr.getBytesRead() / toRead;
}

@Override public LongWritable getCurrentKey() throws IOException, InterruptedException {
Contributor:

Remove the unnecessary thrown exceptions from all methods.

Member Author:
ok. done

* @param object (json row as a string)
* @throws IOException
*/
@Override public void write(Object object) throws IOException {
Contributor:

This class should directly use the JsonInputFormat readers; it should not depend on the Jackson parser directly.

Member Author:
ok. done

throw new IOException("Failed to parse Json row string ");
}
// convert json object to carbon object.
Object[] writeObjects = JsonCarbonUtil.jsonToCarbonRecord(jsonNodeMap, carbonSchema);
Contributor:

It should work like CSV, so it should not convert to data types here; it should include the DataConverterStep.

Member Author:
ok. done

* @param recordIdentifier
* @throws IOException
*/
public void writeFromJsonFile(String inputFilePath, String recordIdentifier) throws IOException {
Contributor:

It should not be here.

Member Author:
removed it.

*/
@InterfaceAudience.User
@InterfaceStability.Evolving
public class JsonReader<T> {
Contributor:

Why one more reader? Why don't you use the JsonInputFormat reader?

Member Author:
This is just a wrapper; it uses the same record reader from JsonInputFormat.

JsonReader is a wrapper holding a list of JsonRecordReader instances from JsonInputFormat (one per split), similar to how CarbonReader wraps a list of CarbonRecordReader instances from CsvInputFormat, plus the other wrapper iterators.
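The wrapper pattern described in this reply can be sketched as follows. The names (`ChainedReader`, `readNextRow`) are illustrative, not the SDK's actual CarbonReader/JsonReader API; the point is chaining one reader per split into a single sequential stream:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Chains several per-split readers into one sequential stream of records.
public class ChainedReader<T> {

    private final List<Iterator<T>> readers;  // one reader per input split
    private int current = 0;

    public ChainedReader(List<Iterator<T>> readers) {
        this.readers = readers;
    }

    public boolean hasNext() {
        while (current < readers.size()) {
            if (readers.get(current).hasNext()) return true;
            current++;  // this split is exhausted, move to the next one
        }
        return false;
    }

    public T readNextRow() {
        if (!hasNext()) throw new NoSuchElementException("no more rows");
        return readers.get(current).next();
    }

    public static void main(String[] args) {
        ChainedReader<String> reader = new ChainedReader<>(List.of(
            List.of("{\"a\":1}", "{\"a\":2}").iterator(),
            List.of("{\"a\":3}").iterator()));
        int count = 0;
        while (reader.hasNext()) {
            reader.readNextRow();
            count++;
        }
        System.out.println(count); // prints 3
    }
}
```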

public class JsonCarbonUtil {


public static Object[] jsonToCarbonRecord(Map<String, Object> jsonNodeMap, Schema carbonSchema)
Contributor:

It should directly use the converter step of the load.

Member Author:
ok. done

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6465/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5296/

@ajantha-bhat
Member Author

@ravipesala: please review, the comments are addressed.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6500/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5332/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5421/

@ravipesala
Contributor

SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5422/

@CarbonDataQA

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6502/

@ajantha-bhat
Member Author

retest this please

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5334/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5424/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6505/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5337/

/**
* It reads data from record reader and sends data to next step.
*/
public class InputProcessorStepWithJsonConverterImpl extends AbstractDataLoadProcessorStep {
Contributor:

Why is it needed? Why is JSON parsing done again here?

Member Author:

This is to create a CarbonRow from the JSON string.
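The JSON-string-to-row conversion mentioned in this reply can be sketched like so. This is illustrative only (the SDK's actual conversion goes through the data-load converter step): the parsed JSON fields are laid out into an Object[] in the schema's field order, so every row has the same column layout:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class JsonRowDemo {

    // Order the parsed JSON fields into an Object[] row following the
    // schema's field order; fields missing from the JSON stay null.
    static Object[] toRow(Map<String, Object> json, List<String> schemaFields) {
        Object[] row = new Object[schemaFields.size()];
        for (int i = 0; i < schemaFields.size(); i++) {
            row[i] = json.get(schemaFields.get(i));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("name", "age", "city");
        Map<String, Object> json = Map.of("age", 24, "name", "bob");
        System.out.println(Arrays.toString(toRow(json, schema))); // [bob, 24, null]
    }
}
```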


@InterfaceAudience.User
@InterfaceStability.Evolving
public class JsonReaderBuilder {
Contributor:

It should not be a new class; there should be a single builder.

Member Author:

OK, removed this file and integrated it with the existing builder.

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6528/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5360/

@CarbonDataQA

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6538/

@CarbonDataQA

Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5367/

@ravipesala
Contributor

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5442/

@ajantha-bhat
Member Author

@ravipesala: the PR is ready and the build is successful. Please check.

@ravipesala
Contributor

LGTM

@asfgit asfgit closed this in 5804d75 Jun 26, 2018
asfgit pushed a commit that referenced this pull request Jul 30, 2018
This closes #2384

3 participants