
TF-IDF NPE #36

Closed
automaticgiant opened this issue Apr 27, 2016 · 6 comments

Comments

@automaticgiant

I started with a purely numeric test project, but when I adapted it to a Spark workflow we were trying to accelerate using TF-IDF, it blew up with an NPE. I bumped n a little higher after looking at #33. It looks kind of like #33, but doesn't seem to be a complete match.

package asdf;

import java.util.Random;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import jsat.classifiers.DataPoint;
import jsat.classifiers.DataPointPair;
import jsat.classifiers.trees.RandomForest;
import jsat.regression.RegressionDataSet;
import jsat.text.HashedTextVectorCreator;
import jsat.text.tokenizer.NaiveTokenizer;
import jsat.text.wordweighting.TfIdf;

public class App {
    public static void main(String[] args) {
        Random rng = new Random();
        rng.setSeed(0);
        int n = 1000;
        HashedTextVectorCreator htvc = new HashedTextVectorCreator(1000, new NaiveTokenizer(), new TfIdf());
        RegressionDataSet regressionDataSet = new RegressionDataSet(Stream
                .generate(UUID::randomUUID)
                .limit(n)
                .map(String::valueOf)
                .map(htvc::newText) // NPE thrown here: the TfIdf was never fit to a corpus
                .map(v -> new DataPointPair<>(new DataPoint(v), rng.nextDouble()))
                .collect(Collectors.toList()));
        RandomForest randomForest = new RandomForest();
        randomForest.train(regressionDataSet);
        double regress = randomForest.regress(new DataPoint(htvc.newText("asdf")));
        System.out.println(regress);
    }
}
/usr/lib/jvm/java-8-openjdk/bin/java -Didea.launcher.port=7533 -Didea.launcher.bin.path=/opt/idea-IU-145.597.3/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-openjdk/jre/lib/charsets.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jce.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jsse.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/resources.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/rt.jar:/home/automaticgiant/git/valet2k/jsat-test/target/classes:/home/automaticgiant/.m2/repository/com/edwardraff/JSAT/0.0.4/JSAT-0.0.4.jar:/opt/idea-IU-145.597.3/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain asdf.App
Exception in thread "main" java.lang.NullPointerException
    at jsat.text.wordweighting.TfIdf.indexFunc(TfIdf.java:95)
    at jsat.linear.SparseVector.applyIndexFunction(SparseVector.java:882)
    at jsat.text.wordweighting.TfIdf.applyTo(TfIdf.java:105)
    at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:52)
    at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:41)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.SliceOps$1$1.accept(SliceOps.java:204)
    at java.util.stream.StreamSpliterators$InfiniteSupplyingSpliterator$OfRef.tryAdvance(StreamSpliterators.java:1356)
    at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at asdf.App.main(App.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Process finished with exit code 1
@automaticgiant

Definitely still happens on 35055f8 (the newest commit at this time).

@automaticgiant

automaticgiant commented Apr 28, 2016

It seems that with certain weighting components a HashedTextVectorCreator can be used directly, but TfIdf and OkapiBM25 need it to be internal to a HashedTextDataLoader, so that the weighting is first initialized on a corpus. My options seem to be either using a data loader or calling setWeight myself, and I'm leaning towards the former. I'll give it a shot when I'm free today.
It could be more of a documentation issue.
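(To illustrate why the weighting has to see a corpus first, here is a minimal plain-Java sketch of TF-IDF; this is not JSAT's implementation, and all names in it are made up. Weighting a single document needs per-term document frequencies, which only exist after a "fit" pass over the whole corpus; skipping that pass is the moral equivalent of the NPE above.)

```java
import java.util.*;

public class TfIdfSketch {
    // Corpus-level statistics that must exist before any single document can be weighted.
    static Map<String, Integer> docFreq = new HashMap<>();
    static int numDocs = 0;

    // "Fit" step: count, for each term, how many documents contain it.
    static void fit(List<List<String>> corpus) {
        numDocs = corpus.size();
        for (List<String> doc : corpus)
            for (String term : new HashSet<>(doc)) // distinct terms per document
                docFreq.merge(term, 1, Integer::sum);
    }

    // Weighting step: accumulate smoothed idf per occurrence of each term.
    // Calling this before fit() would use empty statistics -- the analogue of the NPE.
    static Map<String, Double> weigh(List<String> doc) {
        Map<String, Double> w = new HashMap<>();
        for (String term : doc) {
            int df = docFreq.getOrDefault(term, 0);
            double idf = Math.log((numDocs + 1.0) / (df + 1)); // add-one smoothing
            w.merge(term, idf, Double::sum);
        }
        return w;
    }

    public static void main(String[] args) {
        fit(Arrays.asList(
                Arrays.asList("spark", "workflow"),
                Arrays.asList("spark", "tfidf")));
        // "tfidf" appears in 1 of 2 docs, so its idf = log(3/2) > 0
        System.out.println(weigh(Arrays.asList("tfidf")).get("tfidf") > 0);
    }
}
```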

@EdwardRaff
Copy link
Owner

Does this happen with the normal TextDataLoader? I'll hopefully get to testing this later tonight.

@EdwardRaff
Copy link
Owner

Ok, now that I've read this, it's a documentation issue. The HashedTextVectorCreator expects the word weighting to already be configured. I'm going to try to write some improved Javadoc right now and improve the error message.

As I look back at this code, I think it could definitely be improved. I'm going to add it to my refactoring list in #1.
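(A fail-fast guard along these lines would turn the silent NPE into a descriptive error; this is only a hypothetical sketch, and the `isFitted`/`requireFitted` names are invented for illustration, not JSAT API.)

```java
// Hypothetical sketch of the improved error message: validate that the
// weighting has been initialized before use, instead of letting an NPE escape.
public class WeightingGuard {
    // Stand-in for a word-weighting scheme that knows whether it has seen a corpus.
    interface WordWeighting {
        boolean isFitted();
    }

    static void requireFitted(WordWeighting w) {
        if (!w.isFitted())
            throw new IllegalStateException(
                "Word weighting has not been initialized on a corpus; "
                + "use a data loader or set the weights before creating vectors.");
    }

    public static void main(String[] args) {
        try {
            requireFitted(() -> false); // an unfitted weighting
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```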

@EdwardRaff
Copy link
Owner

Just tried adding some better documentation to the class descriptions. Please take a look and let me know if it clears things up.

@automaticgiant
Copy link
Author

f0e3a5f is super helpful. We are redoing the dataflow now to accommodate it.
