New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GIO] New Implementation of IOGEN #1672
Conversation
Update CodeGen part Add some tests for Matrix
Add tests for Frame flat data
# Conflicts: # pom.xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been through about half of the files (it is a very big PR)
Some comments, (mainly on syntax) i have not gone into technical depth of what the parts do.
If i find the time i will look through the rest.
@@ -535,6 +535,16 @@ public String toString() { | |||
} | |||
} | |||
|
|||
public enum OpOpGenerateReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this enum if it only has one enum value?
if(!_ioGenRead) | ||
l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat()); | ||
else | ||
l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe Transient Reads, should not be affected by DataIOGen.
Transient Reads, only read a matrix from a previous block of code, it should not be connected to IO.
(Correct me if i am wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are identifying the format and generating corresponding readers to that and then reusing it multiple times, I think it can be Transient.
l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType()); | ||
} | ||
else | ||
l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good if you set the output parameters(dimensions) if known. (like in the not IOGen above).
The previous (before your additions) is bad design setting the variables after the call to the method, but it could be you do it inside your constructor? if you do not i suggest to use the method on line 328.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our design for IOGEN is like this:
read(sample=x, sample_raw=$3, format=$4, data_type="matrix")
this read doesn't have any output like the current SystemsDS regular reading. We are saving the JAVA source code in "format=$4" and due the next readers call that format, it is compiled and used.
* | ||
* @return operation type | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting
|
||
private void runIdentification() { | ||
|
||
/* Index properties: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is documentation, move it to the method definition, or( even better in my opinion) move it to the class definition.
src/main/java/org/apache/sysds/runtime/iogen/FormatIdentifying.java
Outdated
Show resolved
Hide resolved
return matrixReader; | ||
} | ||
|
||
@Override public String getReaderString() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
override
should be on line above.
return frameReader; | ||
} | ||
|
||
@Override public String getReaderString() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format
|
||
for(int i = index; i < index + value.length(); i++) | ||
if(reservedPositions.get(i)) { | ||
flag = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only one place you set flag. does this not mean you can put your return logic in here. and then remove the flag variable.
This PR is a new implementation of the GIO (generating readers for custom datasets). In the new implementation, removed all hard-coded implementations for flat datasets and replaced them with code gen. One of the primary goals of GIO is to support single and multi-row representations of tuples in source datasets. This PR is for supporting both of them.
An outline of the PR is:
It supports Matrix and Frame
It supports nested and flat datasets
It supports non-standard datasets, i.e., an incomplete number of cols in a CSV file