Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bayes.go tokenizer breaks the sentiment restoration #13

Closed
Corrob opened this issue Sep 18, 2016 · 9 comments
Closed

Bayes.go tokenizer breaks the sentiment restoration #13

Corrob opened this issue Sep 18, 2016 · 9 comments

Comments

@Corrob
Copy link

Corrob commented Sep 18, 2016

#12 breaks restorations that don't call NewNaiveBayes()

Sentiment is broken because it doesn't set the tokenizer. Should the tokenizer somehow be set when it is un-marshaled?

@cdipaolo
Copy link
Owner

Ah 😞 ... do you have a minimum working example of what isn't working now that was before?

@Corrob
Copy link
Author

Corrob commented Sep 18, 2016

Running "go test" in the sentiment repo produces:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4f78c3]

goroutine 8 [running]:
panic(0x593260, 0xc82000a120)
    /usr/lib/go/src/runtime/panic.go:481 +0x3e6
testing.tRunner.func1(0xc82003a090)
    /usr/lib/go/src/testing/testing.go:467 +0x192
panic(0x593260, 0xc82000a120)
    /usr/lib/go/src/runtime/panic.go:443 +0x4e9
github.com/cdipaolo/goml/text.(*NaiveBayes).Predict(0xc82003a000, 0x5ca878, 0x5, 0x2)
    /home/cory/go/src/github.com/cdipaolo/goml/text/bayes.go:265 +0xe3
github.com/cdipaolo/sentiment.Models.SentimentAnalysis(0xc820010630, 0x5ca878, 0x5, 0x5ca418, 0x2, 0x0)
    /home/cory/go/src/github.com/cdipaolo/sentiment/sentiment.go:37 +0x4ec
github.com/cdipaolo/sentiment.TestPositiveWordSentimentShouldPass1(0xc82003a090)
    /home/cory/go/src/github.com/cdipaolo/sentiment/sentiment_test.go:23 +0x12b
testing.tRunner(0xc82003a090, 0x760280)
    /usr/lib/go/src/testing/testing.go:473 +0x98
created by testing.RunTests
    /usr/lib/go/src/testing/testing.go:582 +0x892
exit status 2
FAIL    github.com/cdipaolo/sentiment   0.528s

@piazzamp
Copy link
Collaborator

To restore models with non-default tokenizers, we could make a serializable version of the Tokenizer funcs and then translate between the funcs and their serializable counterparts in custom MarshalJSON and UnmarshalJSON methods. This could be done via some package-scope map of the serializable things (ints or strings?) to the Tokenizer funcs.
That won't help models that use hand-rolled tokenizers though.

for example, there would be a map[string]Tokenizer that the JSON funcs would use to store a handle to the Tokenizer.

@piazzamp
Copy link
Collaborator

ahh – adding an UnmarshalJSON method to NaiveBayesisn't quite the quick and easy thing I thought it would be. It becomes infinitely recursive if defined like:

func (nb *NaiveBayes) UnmarshalJSON(data []byte) error {
    err := json.Unmarshal(data, nb)
    if err != nil {
        return err
    }
    nb.tokenize = SpaceTokenizer
    return nil
}

since json.Unmarshal calls the type's UnmarshalJSON method 😅
I feel dirty putting something like if nb.tokenize != nil { return nil }* in as a base case to prevent that recursion since it could have unexpected side effects – namely that consumers wouldn't be able to unmarshal a NaiveBayes struct after having set its tokenizer.

*even with this, TestPersistNaiveBayesShouldPass1 fails

@cdipaolo
Copy link
Owner

Yeah that sounds nontrivial. Could you have something more along the lines of if nb.tokenize == nil { nb.tokenize = SpaceTokenizer }? That should only run once and then if the user sets it beforehand it wouldn't overwrite it if I'm thinking this through correctly.

(@ev0l any comments?)

@piazzamp
Copy link
Collaborator

it doesn't look like that will prevent the recursion since it won't break out of the function before the call to json.Unmarshal. assuming it's written:

func (nb *NaiveBayes) UnmarshalJSON(data []byte) error {
    if nb.tokenize == nil {
        nb.tokenize = SpaceTokenizer
    }
    err := json.Unmarshal(data, nb)
    if err != nil {
        return err
    }
    return nil
}

whereas checking nb.tokenize != nil and then retuning allows us to skip the call to json.Unmarshal when we think it has already been called.

It might be reasonable to add that check for nil in UnmarshalJSON until we can come up with a more clever solution though – at least then no one will get panics. But there are places where NaiveBayes models with non-nil tokenizers call RestoreFromFile NaiveBayes persistence tests thus mucking up our workaround.

Poking into this more reveals that we have the same issue for the sanitizers.
What if we had interfaces instead of funcs for these fields in the NaiveBayes struct, that way we could define functions on the nil pointer for that type to implement default behavior. I'll play around with this later if it makes sense next time I think about it.

@piazzamp
Copy link
Collaborator

piazzamp commented Oct 5, 2016

I'm thinking something along the lines of:

// Tokenizer accepts a sentence as input and breaks
// it down into a slice of tokens
type Tokenizer interface {
    Tokenize(string) []string
}

// SimpleTokenizer splits sentences into
// tokens by brekaing them on its
// SplitOn string – space, for example
type SimpleTokenizer struct {
    SplitOn string
}

func (t *SimpleTokenizer) Tokenize(sentence string) []string {
    return strings.Split(strings.ToLower(sentence), t.SplitOn)
}

I'll have to play with the json unmarshaling though, I don't know how the typing will work :/

@Corrob
Copy link
Author

Corrob commented Oct 22, 2016

By the way, this was fixed with cdipaolo/sentiment#3. We didn't realize that until now.

@piazzamp
Copy link
Collaborator

e2d8de updates the Restore func to set default tokenizers and sanitizers on restored models. Unfortunately, the sentiment package calls json.UnMarshal instead of the NaiveBayes Restore (or new RestoreWithFuncs) functions ¯_(ツ)_/¯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants