ngram formation different between training and classification #13

Open
bdc34 opened this issue Oct 18, 2019 · 2 comments

bdc34 commented Oct 18, 2019

It looks like ngram.py from arxiv-classifier is building a phraser from the whole training corpus and using that during training but not using it during classification.

It doesn't seem correct to use two differently configured phrasers for training and classification. It would reduce classification accuracy if "Jacobi elliptic functions" was ngrammed during training to jacobi_elliptic_functions but not during classification.

In naivebayes.py, the schedule and min_count values used by _form_ngrams are too large to allow any n-grams to be formed from the classy01/ test corpus.

If you set min_count to 1 and the schedule to (4,3,2,1), n-grams will be formed with the class01 documents. But judging by the test stdout, the phraser used in training has a different number of phrases than the one used for classification.
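To illustrate the min_count effect, here is a minimal sketch in plain Python. It is not the project's code and not the actual phraser: `learn_bigrams` and `tiny_corpus` are hypothetical names, and the sketch only counts raw pair frequency, ignoring the threshold score and the schedule.

```python
from collections import Counter


def learn_bigrams(sentences, min_count):
    """Return the adjacent word pairs seen at least min_count times.

    Simplified stand-in for a phrase detector: frequency only, no scoring.
    """
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


# A corpus far smaller than a realistic training set (hypothetical data).
tiny_corpus = [
    "jacobi elliptic functions are periodic".split(),
    "we study jacobi elliptic functions".split(),
]

# With a large min_count, no pair is frequent enough to become a phrase.
phrases_default = learn_bigrams(tiny_corpus, min_count=20)

# With min_count=1, every adjacent pair qualifies.
phrases_relaxed = learn_bigrams(tiny_corpus, min_count=1)
```

On a corpus this small, only the relaxed setting yields any phrases, which matches the behaviour described above.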

bdc34 added the bug label Oct 18, 2019
@DavidLFielding

Is it the case that the phraser used to build the model is more expensive and generates a better model, whereas the classification phraser primarily needs to be fast?

This may have nothing to do with what you describe above. After reading a little about Naive Bayes and some of its shortcomings, my initial reaction was to ask whether it is possible to build the model with a more expensive algorithm and classify with a faster one. Your discovery that the classifier uses two different phrasers started me wondering, though you seem to indicate there is an obvious bug.


bdc34 commented Oct 18, 2019

In

def _form_ngrams(infile, outfile, min_count=20, threshold=10.0):

a Phraser is built from the phrases it finds in the training text. But that Phraser is never saved.

During the classification step, a new Phraser is built based only on the text in the article we'd like to know the classification of.

So if the training Phraser sees the phrase "tokenized text string" many times, it will consider it a phrase and n-gram it to tokenized_text_string.

But if an article arrives that contains "tokenized text string" only twice, the Phraser built from that article alone will not consider it a phrase.

So during training you end up with the feature tokenized_text_string, but during classification you end up with the separate tokens tokenized, text, and string, and the features don't match up.
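The mismatch, and one possible remedy, can be sketched in plain Python. This is a simplified stand-in, not the project's code: `learn_phrases` and `apply_phrases` are hypothetical helpers that count fixed-length n-grams by frequency only, and the pickle round trip stands in for saving the trained phraser next to the model and loading it at classification time.

```python
import pickle
from collections import Counter


def learn_phrases(sentences, n, min_count):
    """Return the n-token windows seen at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        counts.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return {gram for gram, c in counts.items() if c >= min_count}


def apply_phrases(tokens, phrases, n):
    """Greedily join any n-token window that is a known phrase."""
    out, i = [], 0
    while i < len(tokens):
        window = tuple(tokens[i:i + n])
        if window in phrases:
            out.append("_".join(window))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical training corpus: the phrase occurs 25 times in varying context.
training = [
    [f"w{i}", "tokenized", "text", "string", f"v{i}"] for i in range(25)
]
# Hypothetical article to classify: the phrase occurs only twice.
article = "each tokenized text string and another tokenized text string".split()

trained = learn_phrases(training, n=3, min_count=20)
per_article = learn_phrases([article], n=3, min_count=20)

# Phrases learned from the article alone: nothing is joined.
mismatched = apply_phrases(article, per_article, n=3)

# Persist the training-time phrases and reuse them for classification.
restored = pickle.loads(pickle.dumps(trained))
consistent = apply_phrases(article, restored, n=3)
```

Here `mismatched` keeps the three separate tokens, while `consistent` contains tokenized_text_string, the same feature the model saw during training.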
