New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CARBONDATA-2608] SDK Support JSON data loading directly (without AVRO conversion) #2384
Conversation
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6393/ |
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5227/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5338/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5339/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6395/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5341/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5229/ |
04d498e
to
aa7caec
Compare
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6456/ |
Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5287/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5377/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5378/ |
retest this please |
* @param filePath | ||
* @return | ||
*/ | ||
public static String readFromFile(String filePath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is only used for test case purpose, move these to test cases only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
* @param jsonFileList | ||
* @param fileExtension | ||
*/ | ||
public static void getFileList(CarbonFile carbonFile, List<String> jsonFileList, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unused method, please remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
@Override public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, | ||
TaskAttemptContext context) throws IOException, InterruptedException { | ||
RecordReader<LongWritable, Text> rdr; | ||
if (context.getConfiguration().getBoolean(ONE_RECORD_PER_LINE, false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use getOneRecordPerLine
method here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
@Override public void initialize(InputSplit split, TaskAttemptContext context) | ||
throws IOException, InterruptedException { | ||
|
||
this.identifier = context.getConfiguration().get(RECORD_IDENTIFIER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getRecordIdentifier
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
* @return The JsonNode or null if a JsonParseException, | ||
* JsonMappingException, or IOException error occurs | ||
*/ | ||
public static synchronized JsonNode decodeLineToJsonNode(String line) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static is required here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes required, JsonRecordReader is a static class. so this method is referenced in that.
e.printStackTrace(); | ||
return null; | ||
} catch (JsonMappingException e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove stack trace and add logger and throw the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
return (float) rdr.getBytesRead() / toRead; | ||
} | ||
|
||
@Override public LongWritable getCurrentKey() throws IOException, InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove unnecessary throw exceptions from all methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
* @param object (json row as a string) | ||
* @throws IOException | ||
*/ | ||
@Override public void write(Object object) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should directly use JsonInputFormat readers, it should not directly depends on jaxon parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
throw new IOException("Failed to parse Json row string "); | ||
} | ||
// convert json object to carbon object. | ||
Object[] writeObjects = JsonCarbonUtil.jsonToCarbonRecord(jsonNodeMap, carbonSchema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be like csv, so it should not convert to datatypes here. It should include DataConverterStep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
* @param recordIdentifier | ||
* @throws IOException | ||
*/ | ||
public void writeFromJsonFile(String inputFilePath, String recordIdentifier) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not b here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed it.
*/ | ||
@InterfaceAudience.User | ||
@InterfaceStability.Evolving | ||
public class JsonReader<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why again one more reader, why don't you use JsonInputFormat reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a wrapper, It uses the same record reader from JsonInputFormat.
JsonReader -> is a wrapper with a list of JsonRecordReader from JsonInputFormat, similar to CarbonReader -> list of CarbonRecordReader from CsvInputFormat (one for for each split) and other warapper iterators.
public class JsonCarbonUtil { | ||
|
||
|
||
public static Object[] jsonToCarbonRecord(Map<String, Object> jsonNodeMap, Schema carbonSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should directly use coverterstep of load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. done
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6465/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5296/ |
@ravipesala : please review. comments are addressed. |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6500/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5332/ |
SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5421/ |
SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5422/ |
Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6502/ |
retest this please |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5334/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5424/ |
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6505/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5337/ |
/** | ||
* It reads data from record reader and sends data to next step. | ||
*/ | ||
public class InputProcessorStepWithJsonConverterImpl extends AbstractDataLoadProcessorStep { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it needed? why is again json parsing done here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to create a carbonRow from JsonString.
|
||
@InterfaceAudience.User | ||
@InterfaceStability.Evolving | ||
public class JsonReaderBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be new class, should have single builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Removed this file and Integrated with existing builder itself.
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6528/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5360/ |
…O conversion) What changes were proposed in this pull request? Currently SDK Support JSON data loading only with AVRO. So, converting json to avro record and avro to carbon object is a two step process. Hence there is a need for a new carbonWriter that works with Json without AVRO. This PR implents that. Highlights: # Works with just the json data and carbon schema. # Implements hadoop's FileInputFormat to create JsonInputFormat. # supports reading multiple json files in a folder. # supports reading json data in multiline with record identifier. # supports single row json read and write. # Handles bad records when loading json data How was this patch tested? Manual testing, and UTs are added in this PR.
Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/6538/ |
Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/5367/ |
SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/5442/ |
@ravipesala : PR is ready and build is success. Please check |
LGTM |
…O conversion) What changes were proposed in this pull request? Currently SDK Support JSON data loading only with AVRO. So, converting json to avro record and avro to carbon object is a two step process. Hence there is a need for a new carbonWriter that works with Json without AVRO. This PR implents that. Highlights: #Works with just the json data and carbon schema. #Implements hadoop's FileInputFormat to create JsonInputFormat. #supports reading multiple json files in a folder. #supports reading json data in multiline with record identifier. #supports single row json read and write. #Handles bad records when loading json data #Supports n-level nesting of complex types as josn data This closes #2384
What changes were proposed in this pull request?
Currently SDK Support JSON data loading only with AVRO.
So, converting json to avro record and avro to carbon object is a two step process. Hence there is a need for a new carbonWriter that works with Json without AVRO.
This PR implents that.
Highlights:
#Works with just the json data and carbon schema.
#Implements hadoop's FileInputFormat to create JsonInputFormat.
#supports reading multiple json files in a folder.
#supports reading json data in multiline with record identifier.
#supports single row json read and write.
#Handles bad records when loading json data
#Supports n-level nesting of complex types as josn data
How was this patch tested?
Manual testing, and UTs are added in this PR.
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
Any interfaces changed? NA
Any backward compatibility impacted? NA
Document update required? Yes, will be handled in separate PR
Testing done
Yes, updated the UT.
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. NA