Textcat classifying 90% text into the same class #2663

Closed
gondg92 opened this issue Aug 13, 2018 · 2 comments
Labels
feat / textcat (Feature: Text Classifier)
more-info-needed (This issue needs more information)
perf / accuracy (Performance: accuracy)
training (Training and updating models)

Comments


gondg92 commented Aug 13, 2018

Hi, I am trying to build a document-type classifier with spaCy's textcat. There are 5 classes, two of them are somewhat similar, but the problem I am finding is that after training the network, the first 4 classes get scores around 0.00001 and the last one gets around 0.999 on most test documents. The 1st, 2nd and 3rd classes are completely different from that last class.

The training data size is between 300 and 700 documents per class, and the classes are well defined, adding up to around 2500 documents to train the textcat.

Do you know why this is happening?

  • Code

import glob
from tqdm import tqdm

# cleanDf() and the nlp object are defined earlier in the script;
# cleanDf(f) reads one file and returns its cleaned text.

fact = []
for f in glob.glob(r"D:\usb\fact\*"):
    sentence = cleanDf(f)
    fact.append((sentence, {"cats": {"FAC": 1, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 0}}))

nom = []
for f in glob.glob(r"D:\usb\nom\*"):
    sentence = cleanDf(f)
    nom.append((sentence, {"cats": {"FAC": 0, "NOM": 1, "TEX": 0, "TIC": 0, "ECO": 0}}))

tex = []
for f in glob.glob(r"D:\usb\tex\*"):
    sentence = cleanDf(f)
    tex.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 1, "TIC": 0, "ECO": 0}}))

tic = []
for f in glob.glob(r"D:\usb\tic\*"):
    sentence = cleanDf(f)
    tic.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 1, "ECO": 0}}))

eco = []
for f in glob.glob(r"D:\usb\eco\*"):
    sentence = cleanDf(f)
    eco.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 1}}))

train_data = fact + nom + tex + tic + eco

# add the text classifier to the pipeline and register the five labels
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('FAC')
textcat.add_label('NOM')
textcat.add_label('TEX')
textcat.add_label('TIC')
textcat.add_label('ECO')

# train for 25 epochs, one (text, annotations) pair per update
optimizer = nlp.begin_training()
for itn in range(25):
    for doc, gold in tqdm(train_data):
        nlp.update([doc], [gold], sgd=optimizer)

# The example text is a very obvious "fac" doc.

{'FAC': 0.00011911078036064282, 'NOM': 0.0006520846509374678, 'TEX': 4.539787187241018e-05, 'TIC': 4.539787187241018e-05, 'ECO': 0.9999545812606812}
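
(For reference, scores like the above are read from a processed document's `cats` attribute; a minimal sketch, assuming the trained `nlp` object from the code above and a hypothetical `test_text` variable holding the example document:)

doc = nlp(test_text)   # test_text: the obvious "fac" document (hypothetical variable)
print(doc.cats)        # per-label scores, e.g. {'FAC': ..., 'NOM': ..., ...}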

  • Environment

Operating System: Windows 10
Python Version Used: 3.6
spaCy Version Used: 2.0.11

@ines added the training and feat / textcat labels Aug 14, 2018
@honnibal added the perf / accuracy label Sep 12, 2018
@honnibal (Member) commented

Could you try again with the new spacy-nightly? We've resolved a few issues with the text classifier that might be behind the problem you're seeing.

Another thing you should definitely do is minibatch your data. I see you're passing in one doc and one gold object per update. You should get better accuracy if you use minibatching, even with small batches of 2-8 documents. You might also try a small dropout of, say, 0.1.
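
For reference, here is a minimal sketch of that training loop with minibatching and dropout, assuming spaCy 2.x's `spacy.util.minibatch` helper and the `train_data` / `nlp` objects from the code above (the batch size and dropout value are only illustrative):

import random
from spacy.util import minibatch

optimizer = nlp.begin_training()
for itn in range(25):
    random.shuffle(train_data)                   # reshuffle between epochs
    for batch in minibatch(train_data, size=8):  # small batches of 2-8 docs
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.1)  # small dropout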

@honnibal added the more-info-needed label Feb 21, 2019

lock bot commented Mar 27, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Mar 27, 2019