Skip to content
This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-220] Implement Cofactor #167

Open
wants to merge 264 commits into
base: master
Choose a base branch
from

Conversation

jaxony
Copy link

@jaxony jaxony commented Oct 18, 2018

What changes were proposed in this pull request?

Implemented new matrix factorization algorithm for the recommendation problem.

What type of PR is it?

Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-220

How was this patch tested?

Unit tests and manual testing on ML20M in a Hive dev environment

How to use this feature?

TODO

Checklist

(Please remove this section if not needed; check x for YES, blank for NO)

  • Did you apply source code formatter, i.e., ./bin/format_code.sh, for your commit?
  • Did you run system tests on Hive (or Spark)?

core/src/main/java/hivemall/mf/FactorizedModel.java Outdated Show resolved Hide resolved
// storing trainable latent factors and weights
private Map<String, RealVector> theta;
private Map<String, RealVector> beta;
private Map<String, Double> betaBias;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Object2DoubleMap<String> betaBias instead to reduce memory consumption.

core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved
protected static RealMatrix calculateWTWSubset(List<Feature> subset, Map<String, RealVector> weights, int numFactors, float constant) {
// equivalent to `B_u.T.dot((c1 - c0) * B_u)` in cofacto.py
RealMatrix delta = new Array2DRowRealMatrix(numFactors, numFactors);
int i = 0, j = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int i = 0, j = 0; never used.

This could be found in IDE warning. Please resolve all IDE warnings.

if (weights.containsKey(key)) {
return;
}
RealVector v = new ArrayRealVector(factor);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final double[] v = new double[factor];
...
weight.put(key, new ArrayRealVector(v));

is better.

core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved

private static double dotFactorsAlongDims(Map<String, RealVector> weights, int dim1, int dim2) {
double result = 0.d;
for (String key : weights.keySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply

        for(RealVector vec : weights.values()) {
           ...      
        }

What's expected behavior when vec.getEntry(dim1/2) returns null?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RealVector implementation has now changed to double[], so by default it should be 0. In any case, all entries in weights have been initialized to uniform random or gaussian so the array entries should be valid.

*/
private static void addInPlace(RealVector u, RealVector v, double scalar) {
assert u.getDimension() == v.getDimension();
for (int i = 0; i < u.getDimension(); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer avoiding method calls by

for (int i = 0, size=u.getDimension(); i < size; i++) {


protected void add(TrainingSample sample) {
if (size() == this.maxSize) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample is thrown away?

You can return boolean value to check MiniBatch is full or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an explicit mini batch size is not required anymore. The upper limit is the size of the inputBuf. Is this correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. let me see..

core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved
core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved
core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved
core/src/main/java/hivemall/mf/CofactorModel.java Outdated Show resolved Hide resolved
@myui myui force-pushed the feature/cofactor-feature-array branch from 9864926 to 08bc526 Compare November 14, 2018 07:00
@jaxony
Copy link
Author

jaxony commented Mar 26, 2019

Hi @myui, I'd like to continue working on cofactor and hopefully get it merged in. Any recommendations on how I can test implementation in distributed environment?

@myui
Copy link
Member

myui commented Mar 26, 2019

Hi @jaxony welcome back!

For pseudo evaluation, docker is easy to evaluate.
https://hivemall.incubator.apache.org/userguide/docker/getting_started.html
^ single worker

This docker-compose spin-up multiple hadoop servers using @Lewuathe 's
https://github.com/Lewuathe/docker-hadoop-cluster

Then, you can just login to master node and setup Hive (and hivemall) on the master node. Note Hive command is just a standalone application submitting jobs to Hadoop.
https://github.com/apache/incubator-hivemall/blob/master/resources/docker/Dockerfile#L58

You can modify Hive setup process on master node following
https://github.com/apache/incubator-hivemall/blob/master/resources/docker/Dockerfile#L58

Also, I'll consider to provide you EMR testbed.

@myui
Copy link
Member

myui commented Apr 2, 2019

@jaxony For security reasons, we would like to add your IP address to whitelist for accessing our EMR environment. Do you have specific IP range (e.g., university's one)?

@jaxony
Copy link
Author

jaxony commented Apr 5, 2019

@myui I think the University of Melbourne's IP range is 128.250.0.0 to 128.250.0.255

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants