Textcat classifying 90% text into the same class #2663

Closed
gondg92 opened this issue Aug 13, 2018 · 2 comments
Labels
feat / textcat (Feature: Text Classifier)
more-info-needed (This issue needs more information)
perf / accuracy (Performance: accuracy)
training (Training and updating models)

Comments


gondg92 commented Aug 13, 2018

Hi, I am trying to build a document-type classifier with spaCy's textcat. There are 5 classes, two of them are somewhat similar, but the problem I am finding is that after training the network, the first 4 classes get scores around 0.00001 and the last one gets around 0.999 on most test documents. The 1st, 2nd and 3rd classes are completely different from that last class.

The training data size is between 300 and 700 documents per class, and the classes are well defined, adding up to around 2500 documents to train the textcat.

Do you know why this is happening?

  • Code

import glob
from tqdm import tqdm

# cleanDf() and the nlp object are defined earlier in the script;
# cleanDf(f) reads one file and returns its cleaned text.

fact = []
for f in glob.glob(r"D:\usb\fact\*"):
    sentence = cleanDf(f)
    fact.append((sentence, {"cats": {"FAC": 1, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 0}}))

nom = []
for f in glob.glob(r"D:\usb\nom\*"):
    sentence = cleanDf(f)
    nom.append((sentence, {"cats": {"FAC": 0, "NOM": 1, "TEX": 0, "TIC": 0, "ECO": 0}}))

tex = []
for f in glob.glob(r"D:\usb\tex\*"):
    sentence = cleanDf(f)
    tex.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 1, "TIC": 0, "ECO": 0}}))

tic = []
for f in glob.glob(r"D:\usb\tic\*"):
    sentence = cleanDf(f)
    tic.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 1, "ECO": 0}}))

eco = []
for f in glob.glob(r"D:\usb\eco\*"):
    sentence = cleanDf(f)
    eco.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 1}}))

train_data = fact + nom + tex + tic + eco

# add the text classifier to the pipeline and register the five labels
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('FAC')
textcat.add_label('NOM')
textcat.add_label('TEX')
textcat.add_label('TIC')
textcat.add_label('ECO')

# train for 25 epochs, one (text, annotations) pair per update
optimizer = nlp.begin_training()
for itn in range(25):
    for doc, gold in tqdm(train_data):
        nlp.update([doc], [gold], sgd=optimizer)

# The example text is a very obvious "fac" doc.

{'FAC': 0.00011911078036064282, 'NOM': 0.0006520846509374678, 'TEX': 4.539787187241018e-05, 'TIC': 4.539787187241018e-05, 'ECO': 0.9999545812606812}
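
(For reference, scores like the above are read from a processed document's `cats` attribute; a minimal sketch, assuming the trained `nlp` object from the code above and a hypothetical `test_text` variable holding the example document:)

doc = nlp(test_text)   # test_text: the obvious "fac" document (hypothetical variable)
print(doc.cats)        # per-label scores, e.g. {'FAC': ..., 'NOM': ..., ...}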

  • Environment

Operating System: Windows 10
Python Version Used: 3.6
spaCy Version Used: 2.0.11

@ines added the training and feat / textcat labels Aug 14, 2018
@honnibal added the perf / accuracy label Sep 12, 2018
@honnibal (Member) commented

Could you try again with the new spacy-nightly? We've resolved a few issues with the text classifier that might be behind the problem you're seeing.

Another thing you should definitely do is minibatch your data. I see you're passing in one doc and one gold object per update. You should get better accuracy if you use minibatching, even with small batches of 2-8 documents. You might also try a small dropout of, say, 0.1.
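
For reference, here is a minimal sketch of that training loop with minibatching and dropout, assuming spaCy 2.x's `spacy.util.minibatch` helper and the `train_data` / `nlp` objects from the code above (the batch size and dropout value are only illustrative):

import random
from spacy.util import minibatch

optimizer = nlp.begin_training()
for itn in range(25):
    random.shuffle(train_data)                   # reshuffle between epochs
    for batch in minibatch(train_data, size=8):  # small batches of 2-8 docs
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.1)  # small dropout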

@honnibal added the more-info-needed label Feb 21, 2019

lock bot commented Mar 27, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Mar 27, 2019