Skip to content
This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-151] Support Matrix conversion from DoK to CSR/CSC matrix #121

Closed
wants to merge 13 commits into from

Conversation

myui
Copy link
Member

@myui myui commented Oct 5, 2017

What changes were proposed in this pull request?

  • Support Matrix conversion from DoK to CSR/CSC matrix
  • Introduce FloatMatrix (dumped supported Java version from Java 7 to Java 8!)
  • Revise SLIM implementation to use CSRMatrix converted from DoKMatrix for this issue

What type of PR is it?

Improvement, Refactoring

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-151

How was this patch tested?

unit tests, manual tests

Checklist

(Please remove this section if not needed; check x for YES, blank for NO)

  • Did you apply source code formatter, i.e., mvn formatter:format, for your commit?
  • Did you run system tests on Hive (or Spark)?

@myui
Copy link
Member Author

myui commented Oct 5, 2017

Review comments are welcome @takuti

Still work in progress

  • Add more unit tests
  • Revise SLIM implementation for this issue
  • manual tests using SLIM

@myui
Copy link
Member Author

myui commented Oct 6, 2017

Caused by: java.lang.IllegalArgumentException: numCols SHOULD be greater than zero. numCols = rowEnd - rowStart = 0 - 0 = 0
at hivemall.math.matrix.MatrixUtils.sortIndicies(MatrixUtils.java:284)
at hivemall.math.matrix.MatrixUtils.coo2csr(MatrixUtils.java:121)
at hivemall.math.matrix.sparse.floats.DoKFloatMatrix.toRowMajorMatrix(DoKFloatMatrix.java:367)
at hivemall.math.matrix.sparse.floats.DoKFloatMatrix.toRowMajorMatrix(DoKFloatMatrix.java:41)
at hivemall.recommend.SlimUDTF.train(SlimUDTF.java:402)
at hivemall.recommend.SlimUDTF.replayTrain(SlimUDTF.java:647)
at hivemall.recommend.SlimUDTF.runIterativeTraining(SlimUDTF.java:592)

@myui
Copy link
Member Author

myui commented Oct 11, 2017

@maropu could you review my PFOR implementation c0f465a if possible?
https://paperhub.s3.amazonaws.com/7558905a56f370848a04fa349dd8bb9d.pdf

@myui myui changed the title [WIP][HIVEMALL-151] Support Matrix conversion from DoK to CSR/CSC matrix [HIVEMALL-151] Support Matrix conversion from DoK to CSR/CSC matrix Oct 11, 2017
@myui
Copy link
Member Author

myui commented Oct 11, 2017

Decided to avoid using DoK => CSR conversion for dataMatrix because rows are also sparse and not suited for CSR.

Copy link
Member

@takuti takuti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments I've noticed when I first looked into the code.

coo2csr(rows, cols, data, rowPointers, colIndicies, values, numRows, numCols, nnz);

if (sortColumns) {
sortIndicies(rowPointers, colIndicies, values);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found lots of typos indicies in many different classes...it's indices

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found other typos. add_field_indicies should be add_field_indcies.

+ " - " + rowStart + " = " + numCols + " at i=" + i);
}

final IntDoublePair[] pairs = new IntDoublePair[numCols];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use existing hivemall.utils.struct.Pair instead of newly introduced IntDoublePair (and IntFloatPair) instance?

We frequently like to use pair/tuple-ish data structure, so I feel using the same interface struct.Pair as much as we can is a better idea.

final List<Pair<Integer, Double>> pairs = new ArrayList<Pair<Integer, Double>>();
for (int jj = rowStart; jj < rowEnd; jj++) {
    pairs.add(Pair.of(colIndicies[jj], values[jj]));
}

Collections.sort(pairs, new Comparator<Pair<Integer, Double>>() {
    @Override
    public int compare(Pair<Integer, Double> x, Pair<Integer, Double> y) {
        return Integer.compare(x.getKey(), y.getKey());
    }
});

for (int jj = rowStart, n = 0; jj < rowEnd; jj++, n++) {
    Pair<Integer, Double> tmp = pairs.get(n);
    colIndicies[jj] = tmp.getKey();
    values[jj] = tmp.getValue();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ code is inefficient and better to use primitive objects.
Avoid Autoboxing and primitive <=> object conversions.

}
}

private static void sortIndicies(@Nonnull final int[] rowPointers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method is used both for CSR and CSC, more generic argument names make the code more understandable e.g., majorAxisPointers and minorAxisIndices

@asfgit asfgit closed this in fdf7021 Oct 16, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants