ngram formation different between training and classification #13

bdc34 · 2019-10-18T15:16:41Z

It looks like ngrams.py in arxiv-classifier builds a phraser from the whole training corpus and uses it during training, but does not use it during classification.

It doesn't seem correct to use two differently configured phrasers for training and classification. It would reduce classification accuracy if "Jacobi elliptic functions" was ngrammed during training to jacobi_elliptic_functions but not during classification.

The naivebayes.py _form_ngrams schedule and min_count have values that are too large to allow any n-grams to be formed with the classy01/ test corpus.

If you set the min_count to 1 and the schedule to (4,3,2,1), n-grams will be formed with the class01 documents. But looking at the test stdout, the phraser used in training has a different number of phrases than the one used for classification.

DavidLFielding · 2019-10-18T17:52:19Z

Is it the case that the phraser used to build the model is more expensive and generates a better model, while the classification phraser primarily needs to be fast?

This may have nothing to do with what you describe above. After reading a little about Naive Bayes and some of its shortcomings, my initial reaction was to wonder whether it is possible to build the model with a more expensive algorithm and classify with a faster one. Your discovery that the classifier uses two different algorithms started me wondering, though you seem to indicate there is an obvious bug.

bdc34 · 2019-10-18T18:22:52Z

In arxiv-classifier/classifier/services/classifier/ngrams.py (line 32 in cb51432):

    def _form_ngrams(infile, outfile, min_count=20, threshold=10.0):

a Phraser is built from the phrases it finds in the training text. But that Phraser is never saved.

During the classification step, a new Phraser is built based only on the text of the article being classified.
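One way to keep the two steps in sync would be to persist whatever the training phraser learned and load it again at classification time. The snippet below is only a rough sketch of that idea, not the project's code: it assumes the learned phrases can be represented as a plain set of word tuples, and `save_phrases`, `load_phrases`, and the `phrases.json` file name are hypothetical names made up for illustration.

```python
import json
import os
import tempfile


def save_phrases(phrases, path):
    """Write a set of phrase tuples to disk as JSON (sorted for stable output)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(list(p) for p in phrases), fh)


def load_phrases(path):
    """Read the phrase set back; JSON lists are converted back into tuples."""
    with open(path, encoding="utf-8") as fh:
        return {tuple(p) for p in json.load(fh)}


# Training step: pretend these phrases were learned from the training corpus.
trained = {("tokenized", "text"), ("jacobi", "elliptic")}
path = os.path.join(tempfile.mkdtemp(), "phrases.json")  # hypothetical location
save_phrases(trained, path)

# Classification step: load the saved phrases instead of learning new ones.
restored = load_phrases(path)
```

A real fix would more likely serialize the Phraser object itself; the point here is only that classification reads back what training wrote.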

So if the training Phraser sees the phrase "tokenized text string" many times, it will consider it a phrase and n-gram it to tokenized_text_string.

But when an article arrives that contains "tokenized text string" only twice, its phraser will not consider it a phrase.

So during training you end up with a feature tokenized_text_string, but during classification you end up with tokenized text string, and the features don't match up.
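The mismatch can be reproduced with a toy model. The sketch below is a deliberately simplified stand-in, not the Phraser used in the project: it handles only bigrams, decides purely on a raw count, and ignores the threshold score. The helper names `learn_phrases` and `apply_phrases` and the sample data are hypothetical; the value 20 mirrors the min_count default in the signature quoted above.

```python
from collections import Counter


def learn_phrases(docs, min_count):
    """Return the set of adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def apply_phrases(tokens, phrases):
    """Join each known pair with an underscore, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


MIN_COUNT = 20

# 25 training documents that all contain "tokenized text" once.
training_docs = [["doc%d" % i, "tokenized", "text", "end%d" % i] for i in range(25)]
# One article to classify; it contains "tokenized text" only twice.
article = ["the", "tokenized", "text", "and", "more", "tokenized", "text"]

train_phrases = learn_phrases(training_docs, MIN_COUNT)

# Current behaviour: a fresh phraser learned from the article alone.
article_phrases = learn_phrases([article], MIN_COUNT)
per_article = apply_phrases(article, article_phrases)

# Consistent behaviour: reuse the phrases learned during training.
shared = apply_phrases(article, train_phrases)

print(per_article)
print(shared)
```

With the per-article phraser the pair never reaches the count cutoff, so the article keeps separate tokens; with the shared phrase set it produces the same joined feature the model was trained on.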

bdc34 added the bug label Oct 18, 2019
