[GIO] New Implementation of IOGEN #1672

Closed
wants to merge 86 commits into from
Changes from 84 commits
86 commits
cf6ab0a
Start reimplement IOGEN
fathollahzadeh Jan 6, 2022
6e8b649
Start change to new CodeGen
fathollahzadeh Jan 7, 2022
bdc9ce8
Update CodeGen
fathollahzadeh Jan 16, 2022
7a1cf72
Update CodeGen
fathollahzadeh Jan 16, 2022
360f570
Fix MappingTrie Bug
fathollahzadeh Jan 16, 2022
8dedf85
Add CodeGen for Frame
fathollahzadeh Jan 17, 2022
6ab2921
Add row scattered column support
fathollahzadeh Jan 17, 2022
c7daac6
Add Multi Line Identification
fathollahzadeh Jan 22, 2022
0ab6ddc
Add Row Prefix Identification and Key Pattern Build for Row Index
fathollahzadeh Jan 22, 2022
239e781
Update CodeGen
fathollahzadeh Jan 23, 2022
2f40135
Update JAVA CodeGen
fathollahzadeh Jan 24, 2022
00f6d65
Fixed some bugs in Identification section
fathollahzadeh Jan 25, 2022
b92a88c
removed an old code
fathollahzadeh Jan 25, 2022
d866f10
Minor
fathollahzadeh Jan 25, 2022
89b2ae0
Fix a bug in ReaderMapping
fathollahzadeh Jan 25, 2022
0110eb9
Add tests for Frame nested data
fathollahzadeh Jan 25, 2022
a828d14
Fix Code Style
fathollahzadeh Jan 25, 2022
030f15e
Minor rollback
fathollahzadeh Jan 25, 2022
87c4e8c
Minor
fathollahzadeh Jan 25, 2022
ee8c972
Init commit for multi row CodeGen
fathollahzadeh Jan 26, 2022
cc0394e
Update multi row CodeGen
fathollahzadeh Jan 26, 2022
f9d8f89
support duplicate values
fathollahzadeh Jan 27, 2022
3db25e3
Experiments VLDB2022
fathollahzadeh Jan 27, 2022
c4b8ec7
minor merge
fathollahzadeh Jan 27, 2022
f3a2b62
minor merge
fathollahzadeh Jan 27, 2022
409c06d
UP a Test
fathollahzadeh Jan 27, 2022
c1f1799
Fix duplicate string mapping bug
fathollahzadeh Jan 27, 2022
bd4cbef
up
fathollahzadeh Jan 27, 2022
28ae8b9
up
fathollahzadeh Jan 27, 2022
72b7507
Fix a bug in code gen
fathollahzadeh Jan 28, 2022
078266c
optimization
fathollahzadeh Jan 29, 2022
951f3c3
optimization
fathollahzadeh Jan 31, 2022
897d3e8
up
fathollahzadeh Feb 1, 2022
bb4394c
Add Baselines
fathollahzadeh Feb 6, 2022
5c7190f
Add Baselines
fathollahzadeh Feb 11, 2022
2aae0ee
Update Exp branch
fathollahzadeh Feb 14, 2022
3baaca0
Add CodeGen for Spars Datasets
fathollahzadeh Feb 17, 2022
350282a
Update GIO Exp
fathollahzadeh Feb 17, 2022
d84c3e9
Update GIO Exp
fathollahzadeh Feb 18, 2022
1d3895a
Update GIO Exp
fathollahzadeh Feb 26, 2022
d26aafd
Update GIO Exp
fathollahzadeh Feb 26, 2022
4b98c6d
Update GIO, Move from 2D Array to MatrixBlock
fathollahzadeh Feb 27, 2022
2b9e79e
Revert "Update GIO, Move from 2D Array to MatrixBlock"
fathollahzadeh Feb 27, 2022
a34cd64
Update GIO EXp, SystemDS Reader
fathollahzadeh Feb 27, 2022
c09aac0
Update GIO EXp
fathollahzadeh Feb 27, 2022
6c1e602
Update GIO Exp
fathollahzadeh Feb 27, 2022
5fe869a
Update GIO Exp
fathollahzadeh Feb 27, 2022
28b9f0e
Update for new custom properties
fathollahzadeh Apr 6, 2022
29d83fa
Extend ReaderMapping t support symmetric, skew-symmetric and pattern …
fathollahzadeh May 30, 2022
d377fd0
Fix Symmetric and Skew-Symmetric properties bugs
fathollahzadeh Jun 7, 2022
93c6e9e
New synchronization with the paper background section
fathollahzadeh Jun 9, 2022
9cdba95
Fix row index identity detection bug
fathollahzadeh Jun 9, 2022
ff3a5e9
Update codegen section based on the new implementation
fathollahzadeh Jun 10, 2022
893c43d
Update codegen section with row-index identity and col-index exist
fathollahzadeh Jun 10, 2022
3d4b36c
Initial commit of multi-line format identification
fathollahzadeh Jun 12, 2022
e1b2ff5
Add record delimiter in seq-scattered record formats
fathollahzadeh Jun 12, 2022
caf01d8
First commit of the seq-scattered reader code gen
fathollahzadeh Jun 14, 2022
0c27ee2
Init commit of parallel code gen
fathollahzadeh Jun 15, 2022
cefabd0
Init commit of parallel code gen for Frame
fathollahzadeh Jun 16, 2022
bbf0f70
Parallel implementation of Frame reader
fathollahzadeh Jun 16, 2022
3ffecd5
minor update, code style
fathollahzadeh Jun 16, 2022
283e83a
Add Gson, Jackson parallel readers, update experiment source code
fathollahzadeh Jun 16, 2022
1091058
Minor update, clean-up
fathollahzadeh Jun 17, 2022
3284b7b
Merge remote-tracking branch 'origin/sf-GIOEXPVLDB2022V2' into sf-GIO
fathollahzadeh Jun 17, 2022
bd84592
Fix nrow calc in parallel frame reader
fathollahzadeh Jun 17, 2022
014a987
Fix multi-line reader bug
fathollahzadeh Jun 21, 2022
9afaf2a
Add AMiner Dataset Reader to SystemDS
fathollahzadeh Jun 21, 2022
31fe364
Add AMiner Dataset Parallel Reader to SystemDS
fathollahzadeh Jun 21, 2022
0443417
Update experiment code
fathollahzadeh Jun 22, 2022
95d8bce
Fix Aminer parallel reader bug
fathollahzadeh Jun 22, 2022
f39ef6f
Fix Frame code gen bug
fathollahzadeh Jun 23, 2022
c1253a9
Fix frame single thread reader
fathollahzadeh Jun 23, 2022
36b1c98
Fix some bugs
fathollahzadeh Jun 24, 2022
2489647
Minor
fathollahzadeh Jun 24, 2022
9d98d58
Improve performance and fix multi-line detection mapping
fathollahzadeh Jun 25, 2022
d541fd3
Improve performance and fix multi-line detection mapping
fathollahzadeh Jun 25, 2022
cd9b54a
Cleanup, fixed bugs in codegen, update tests, remove unnecessary tests
fathollahzadeh Jul 27, 2022
5ee9d77
minor
fathollahzadeh Jul 27, 2022
d630726
Merge remote-tracking branch 'origin/sf-GIO' into GIO
fathollahzadeh Jul 27, 2022
46f3c1b
Remove benchmark codes from GIO PR
fathollahzadeh Jul 27, 2022
b4d4fd3
Minor Cleanup
fathollahzadeh Jul 27, 2022
2f66e41
Initial Integration of GIO with SystemDS (DML)
fathollahzadeh Aug 4, 2022
d612708
Fix Code Style
fathollahzadeh Aug 5, 2022
21bddc0
Minor Rename Input File Name
fathollahzadeh Aug 5, 2022
f096a22
Formatting, resolve some comments,and rename class name
fathollahzadeh Aug 24, 2022
ac02ff1
Minor
fathollahzadeh Aug 24, 2022
13 changes: 12 additions & 1 deletion src/main/java/org/apache/sysds/common/Types.java
@@ -535,6 +535,16 @@ public String toString() {
}
}

public enum OpOpGenerateReader {
Contributor

Do you need this enum if it only has one enum value?

GENERATEREADER;
public boolean isGenerateReader(){return this == GENERATEREADER;}

@Override
public String toString() {
return "GRead";
}
}

public enum FileFormat {
TEXT, // text cell IJV representation (mm w/o header)
MM, // text matrix market IJV representation
@@ -544,7 +554,8 @@ public enum FileFormat {
BINARY, // binary block representation (dense/sparse/ultra-sparse)
FEDERATED, // A federated matrix
PROTO, // protocol buffer representation
HDF5; // Hierarchical Data Format (HDF)
HDF5, // Hierarchical Data Format (HDF)
IOGEN; // Generated Reader

public boolean isIJV() {
return this == TEXT || this == MM;
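
As context for the reviewer's question about the single-value enum, here is a minimal standalone sketch that mirrors the `OpOpGenerateReader` declaration from the hunk above. The wrapper class `OpOpGenerateReaderSketch` and the `describe` helper are hypothetical and not part of the PR. The sketch only illustrates standard Java enum behavior: `toString()` is overridden to return the short string `GRead`, while `name()` still returns the constant's identifier.

```java
// Standalone sketch; only the enum body is taken from the diff above.
final class OpOpGenerateReaderSketch {

    enum OpOpGenerateReader {
        GENERATEREADER;

        public boolean isGenerateReader() {
            return this == GENERATEREADER;
        }

        @Override
        public String toString() {
            return "GRead";
        }
    }

    // Hypothetical helper: shows the identifier next to the overridden string form.
    static String describe(OpOpGenerateReader op) {
        return op.name() + " -> " + op;
    }

    public static void main(String[] args) {
        // prints "GENERATEREADER -> GRead"
        System.out.println(describe(OpOpGenerateReader.GENERATEREADER));
        // prints "1": the enum currently has a single constant
        System.out.println(OpOpGenerateReader.values().length);
    }
}
```

With a single constant, `isGenerateReader()` is true for every possible value, which is the redundancy the review comment points at.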
43 changes: 38 additions & 5 deletions src/main/java/org/apache/sysds/hops/DataOp.java
@@ -32,6 +32,7 @@
import org.apache.sysds.conf.ConfigurationManager;
import org.apache.sysds.hops.rewrite.HopRewriteUtils;
import org.apache.sysds.lops.Data;
import org.apache.sysds.lops.DataIOGen;
import org.apache.sysds.lops.Federated;
import org.apache.sysds.lops.Lop;
import org.apache.sysds.common.Types.ExecType;
@@ -54,11 +55,15 @@ public class DataOp extends Hop {

//read dataop properties
private FileFormat _inFormat = FileFormat.TEXT;
private String _inIOGenFormat;
private long _inBlocksize = -1;
private boolean _hasOnlyRDD = false;

private boolean _recompileRead = true;

private boolean _ioGenRead = false;
private GenerateReaderOp _generateReaderOp;

/**
* List of "named" input parameters. They are maintained as a hashmap:
* parameter names (String) are mapped as indices (Integer) into getInput()
@@ -247,6 +252,26 @@ public void setFileName(String fn) {
_fileName = fn;
}

public void setIOGenRead(boolean isIOGenRead) {
_ioGenRead = isIOGenRead;
}

public boolean isIOGenRead(){
return _ioGenRead;
}

public String getIOGenFormat() {
return _inIOGenFormat;
}

public void setIOGenFormat(String ioGenFormat) {
this._inIOGenFormat = ioGenFormat;
}

public void setGenerateReaderOp(GenerateReaderOp op){
_generateReaderOp = op;
}

public String getFileName() {
return _fileName;
}
@@ -283,20 +308,28 @@ public Lop constructLops()
for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
}
if(_ioGenRead)
inputLops.put("iogenformat", _generateReaderOp.constructLops());

// Create the lop
switch(_op)
{
case TRANSIENTREAD:
l = new Data(_op, null, inputLops, getName(), null,
getDataType(), getValueType(), getFileFormat());
if(!_ioGenRead)
l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
else
l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());
Contributor

I believe transient reads should not be affected by DataIOGen.
A transient read only reads a matrix produced by a previous block of code, so it should not be connected to IO.
(Correct me if I am wrong.)

Contributor Author

We identify the format, generate a corresponding reader for it, and then reuse that reader multiple times, so I think the read can be transient.
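
The reuse described in the reply above (identify the format once, generate a reader, then reuse it for later reads) can be sketched as a cache keyed by a format identifier. This is only an illustration under assumptions: `ReaderCacheSketch`, `RowReader`, `getOrGenerate`, and `generationCount` are hypothetical names, and the "generation" step is simulated by a lambda that splits a line on a delimiter. It is not a description of how the PR's `ReaderGen` or `DataIOGen` work internally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: generate a reader once per format, then reuse it.
final class ReaderCacheSketch {

    // Stand-in for a generated reader; the real generated code is not shown in this PR excerpt.
    interface RowReader {
        String[] readRow(String line);
    }

    private final Map<String, RowReader> cache = new ConcurrentHashMap<>();
    private final AtomicInteger generations = new AtomicInteger();

    // Returns the cached reader for formatId, generating it only on the first request.
    // The delimiter is only used on that first request; later calls reuse the cached reader.
    RowReader getOrGenerate(String formatId, String delimiterRegex) {
        return cache.computeIfAbsent(formatId, id -> {
            generations.incrementAndGet(); // counts how often "generation" actually ran
            return line -> line.split(delimiterRegex);
        });
    }

    int generationCount() {
        return generations.get();
    }

    public static void main(String[] args) {
        ReaderCacheSketch sketch = new ReaderCacheSketch();
        RowReader first = sketch.getOrGenerate("fmt1", ",");
        RowReader second = sketch.getOrGenerate("fmt1", ",");
        System.out.println(first == second);            // same reader instance is reused
        System.out.println(sketch.generationCount());   // generation ran once
        System.out.println(String.join("|", first.readRow("1,2,3")));
    }
}
```

Whether such a reused reader belongs on the transient-read path or only on the persistent-read path is exactly the open question in this thread; the sketch does not settle it.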

setOutputDimensions(l);
break;

case PERSISTENTREAD:
l = new Data(_op, null, inputLops, getName(), null,
getDataType(), getValueType(), getFileFormat());
l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
if(!_ioGenRead){
l = new Data(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getFileFormat());
l.getOutputParameters().setDimensions(getDim1(), getDim2(), _inBlocksize, getNnz(), getUpdateType());
}
else
l = new DataIOGen(_op, null, inputLops, getName(), null, getDataType(), getValueType(), getIOGenFormat());
Contributor

I think it would be good to set the output parameters (dimensions) when they are known, as in the non-IOGen branch above.

The previous code (before your additions) is bad design because it sets the variables after the call to the method, but perhaps you do this inside your constructor? If you do not, I suggest using the method on line 328.

Contributor Author

Our design for IOGEN is as follows:
read(sample=x, sample_raw=$3, format=$4, data_type="matrix")
Unlike a regular SystemDS read, this read does not produce any output. We save the generated Java source code under "format=$4", and when subsequent reads refer to that format, the source is compiled and used.
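
The two-step flow described in the reply above (first save generated Java source, later compile and use it) can be sketched as follows. Everything here is an assumption made for illustration: the class `SavedReaderSourceSketch`, the generated class name `GeneratedCsvRowReader`, its `readRow` method, and the use of the standard `javax.tools` compiler API. The PR excerpt does not show which compilation mechanism IOGEN actually uses, and the generated reader here is a trivial comma splitter.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical two-step flow; class, method, and file names are illustrative only.
final class SavedReaderSourceSketch {

    static final String CLASS_NAME = "GeneratedCsvRowReader";

    // Step 1: write the reader's Java source to disk; no data is read or returned here.
    static Path saveReaderSource(Path dir) {
        String source = String.join("\n",
            "public class " + CLASS_NAME + " {",
            "  public static double[] readRow(String line) {",
            "    String[] parts = line.split(\",\");",
            "    double[] row = new double[parts.length];",
            "    for (int i = 0; i < parts.length; i++)",
            "      row[i] = Double.parseDouble(parts[i].trim());",
            "    return row;",
            "  }",
            "}");
        try {
            return Files.writeString(dir.resolve(CLASS_NAME + ".java"), source);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 2: a later read compiles the saved source, loads the class, and calls it.
    static double[] readRowWithSavedSource(Path sourceFile, String line) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null)
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        Path dir = sourceFile.getParent();
        int status = compiler.run(null, null, null, "-d", dir.toString(), sourceFile.toString());
        if (status != 0)
            throw new IllegalStateException("Compilation of " + sourceFile + " failed");
        try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
            Method readRow = loader.loadClass(CLASS_NAME).getMethod("readRow", String.class);
            return (double[]) readRow.invoke(null, line);
        } catch (ReflectiveOperationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Helper that hides the checked exception of Files.createTempDirectory.
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("iogen-sketch");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Path sourceFile = saveReaderSource(newTempDir());          // step 1: no output besides the file
        double[] row = readRowWithSavedSource(sourceFile, "1.5, 2, 3"); // step 2: compile and use
        System.out.println(Arrays.toString(row));
    }
}
```

In this sketch the first step has no result other than the saved file, which matches the statement that the generating read has no output; a real implementation would presumably also cache the compiled class instead of recompiling on every read.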


break;

case PERSISTENTWRITE:
170 changes: 170 additions & 0 deletions src/main/java/org/apache/sysds/hops/GenerateReaderOp.java
@@ -0,0 +1,170 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.sysds.hops;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.sysds.common.Types;
import org.apache.sysds.common.Types.DataType;
import org.apache.sysds.lops.Lop;
import org.apache.sysds.lops.ReaderGen;
import org.apache.sysds.runtime.meta.DataCharacteristics;

import java.util.HashMap;
import java.util.Map.Entry;


public class GenerateReaderOp extends Hop {
private static final Log LOG = LogFactory.getLog(GenerateReaderOp.class.getName());
private Types.OpOpGenerateReader _op;

/**
* List of "named" input parameters. They are maintained as a hashmap:
* parameter names (String) are mapped as indices (Integer) into getInput()
* arraylist.
* <p>
* i.e., getInput().get(_paramIndexMap.get(parameterName)) refers to the Hop
* that is associated with parameterName.
*/
private HashMap<String, Integer> _paramIndexMap = new HashMap<>();

private GenerateReaderOp() {
//default constructor for clone
}

@Override
public void checkArity() {

}

@Override
public boolean allowsAllExecTypes() {
return false;
}

@Override
protected DataCharacteristics inferOutputCharacteristics(MemoTable memo) {
return null;
}

@Override
public Lop constructLops() {
//return already created lops
if( getLops() != null )
return getLops();

Types.ExecType et = Types.ExecType.CP;


// construct lops for all input parameters
HashMap<String, Lop> inputLops = new HashMap<>();
for (Entry<String, Integer> cur : _paramIndexMap.entrySet()) {
inputLops.put(cur.getKey(), getInput().get(cur.getValue()).constructLops());
}

Lop l = new ReaderGen(getInput().get(0).constructLops(),_dataType, _valueType, et, inputLops);

setLineNumbers(l);
setPrivacy(l);
setLops(l);

//add reblock/checkpoint lops if necessary
constructAndSetLopsDataFlowProperties();

return getLops();
}

@Override
protected Types.ExecType optFindExecType(boolean transitive) {
return null;
}

@Override
public String getOpString() {
String s = new String("");
s += _op.toString();
s += " "+getName();
return s;
}

@Override
public boolean isGPUEnabled() {
return false;
}

@Override
protected double computeOutputMemEstimate(long dim1, long dim2, long nnz) {
return 0;
}

@Override
protected double computeIntermediateMemEstimate(long dim1, long dim2, long nnz) {
return 0;
}

@Override
public void refreshSizeInformation() {

}

@Override
public Object clone() throws CloneNotSupportedException {
return null;
}

@Override
public boolean compare(Hop that) {
return false;
}

/**
* Generate Reader operation for Matrix
* This constructor supports expression in parameters
* @param l ?
* @param dt data type
* @param dop data operator type
* @param in high-level operator
* @param inputParameters input parameters
*/
public GenerateReaderOp(String l, DataType dt, Types.OpOpGenerateReader dop, Hop in, HashMap<String, Hop> inputParameters) {
_dataType = dt;
_op = dop;
_name = l;
getInput().add(0, in);
in.getParent().add(this);

if(inputParameters != null) {
int index = 1;
for(Entry<String, Hop> e : inputParameters.entrySet()) {
String s = e.getKey();
Hop input = e.getValue();
getInput().add(input);
input.getParent().add(this);

_paramIndexMap.put(s, index);
index++;
}
}
}

public Types.OpOpGenerateReader getOp() {
return _op;
}
}
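
The named-parameter bookkeeping in `GenerateReaderOp` (the main input at position 0 of `getInput()`, named parameters appended afterwards, and `_paramIndexMap` mapping each name to its position) can be illustrated with a small standalone sketch. It uses plain strings in place of `Hop` objects, and `NamedInputsSketch`, `mainInput`, and `namedInput` are hypothetical names introduced only for this example. The parameter names `sample_raw` and `format` are borrowed from the `read(...)` call quoted in the review thread; the values are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the name-to-index mapping; strings stand in for Hop inputs.
final class NamedInputsSketch {

    private final List<String> inputs = new ArrayList<>();
    private final Map<String, Integer> paramIndexMap = new HashMap<>();

    NamedInputsSketch(String mainInput, Map<String, String> namedParameters) {
        inputs.add(0, mainInput); // the main input always occupies position 0
        if (namedParameters != null) {
            int index = 1; // named parameters start right after the main input
            for (Map.Entry<String, String> e : namedParameters.entrySet()) {
                inputs.add(e.getValue());
                paramIndexMap.put(e.getKey(), index);
                index++;
            }
        }
    }

    String mainInput() {
        return inputs.get(0);
    }

    // Looks a parameter up by name via its stored position in the input list.
    String namedInput(String parameterName) {
        Integer index = paramIndexMap.get(parameterName);
        if (index == null)
            throw new IllegalArgumentException("Unknown parameter: " + parameterName);
        return inputs.get(index);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("sample_raw", "sampleRawInput");
        params.put("format", "formatInput");
        NamedInputsSketch op = new NamedInputsSketch("sampleInput", params);
        System.out.println(op.mainInput());
        System.out.println(op.namedInput("format"));
    }
}
```

Because lookups go through the stored index, the iteration order of the parameter map passed to the constructor does not matter for correctness; it only determines which position each parameter ends up in.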