Textcat classifying 90% text into the same class #2663
Labels: feat / textcat, more-info-needed, perf / accuracy, training
Hi, I am trying to build a document type classifier with spaCy's textcat. There are 5 classes, two of which are somewhat similar. The problem is that, after training the network, most test documents get a score of about 0.00001 for each of the first 4 classes and 0.999 for the last one. The 1st, 2nd and 3rd classes are completely different from that last class.
There are between 300 and 700 training documents per class, and the classes are well defined. In total, around 2500 documents are used to train the textcat.
Do you know why this is happening?
import glob
from tqdm import tqdm

# Assumed to exist already: `nlp` (a loaded spaCy 2.x pipeline) and
# `cleanDf` (my own helper, not shown, that turns a file path into cleaned text).

# Raw strings keep the backslashes in the Windows paths from being parsed as escapes.
fact = []
for f in glob.glob(r"D:\usb\fact\*"):
    sentence = cleanDf(f)
    fact.append((sentence, {"cats": {"FAC": 1, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 0}}))
nom = []
for f in glob.glob(r"D:\usb\nom\*"):
    sentence = cleanDf(f)
    nom.append((sentence, {"cats": {"FAC": 0, "NOM": 1, "TEX": 0, "TIC": 0, "ECO": 0}}))
tex = []
for f in glob.glob(r"D:\usb\tex\*"):
    sentence = cleanDf(f)
    tex.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 1, "TIC": 0, "ECO": 0}}))
tic = []
for f in glob.glob(r"D:\usb\tic\*"):
    sentence = cleanDf(f)
    tic.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 1, "ECO": 0}}))
eco = []
for f in glob.glob(r"D:\usb\eco\*"):
    sentence = cleanDf(f)
    eco.append((sentence, {"cats": {"FAC": 0, "NOM": 0, "TEX": 0, "TIC": 0, "ECO": 1}}))

# The classes are concatenated in order, so ECO always comes last.
train_data = fact + nom + tex + tic + eco

# Add the text classifier to the pipeline and register the 5 labels.
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('FAC')
textcat.add_label('NOM')
textcat.add_label('TEX')
textcat.add_label('TIC')
textcat.add_label('ECO')

# Train for 25 iterations, one document per update, in the order above.
optimizer = nlp.begin_training()
for itn in range(25):
    for doc, gold in tqdm(train_data):
        nlp.update([doc], [gold], sgd=optimizer)
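One thing that may be worth checking in the loop above: the documents are fed one at a time and always in class order, so every iteration ends with a long run of ECO-only updates. Whether that is the actual cause here is an assumption, not a confirmed diagnosis. A minimal sketch of shuffling and minibatching the data each iteration could look like this. The helper name `iter_shuffled_batches`, the batch size and the toy data are hypothetical, and the spaCy call is left as a comment so the snippet runs without a model.

```python
import random


def iter_shuffled_batches(train_data, batch_size, rng):
    """Yield (texts, annotations) minibatches from a shuffled copy of train_data."""
    examples = list(train_data)  # copy so the caller's list is left untouched
    rng.shuffle(examples)        # mix the classes instead of feeding them in blocks
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        texts = [text for text, _ in batch]
        annotations = [annotation for _, annotation in batch]
        yield texts, annotations


# Toy stand-in for train_data: two classes concatenated in class order.
toy_data = (
    [("fac doc %d" % i, {"cats": {"FAC": 1, "ECO": 0}}) for i in range(4)]
    + [("eco doc %d" % i, {"cats": {"FAC": 0, "ECO": 1}}) for i in range(4)]
)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
for epoch in range(2):
    for texts, annotations in iter_shuffled_batches(toy_data, batch_size=4, rng=rng):
        # With spaCy 2.x, the update would go here, for example:
        # nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print(epoch, texts)
```

Passing the same `rng` into every iteration gives a different order each time, which is the point of the exercise. A fixed seed is used only to make the toy run repeatable.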
# The example text is a very obvious "FAC" document, but this is what the model predicts:
{'FAC': 0.00011911078036064282, 'NOM': 0.0006520846509374678, 'TEX': 4.539787187241018e-05, 'TIC': 4.539787187241018e-05, 'ECO': 0.9999545812606812}
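To put a number on how often this happens across the whole test set, the top-scoring label of each document can be tallied. The sketch below works on plain dictionaries shaped like the scores above. The helper names `top_label` and `predicted_label_counts` are hypothetical, and in practice the dictionaries would come from `nlp(text).cats` for each test text.

```python
from collections import Counter


def top_label(cats):
    """Return the label with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


def predicted_label_counts(all_cats):
    """Count how often each label is the top prediction."""
    return Counter(top_label(cats) for cats in all_cats)


# Scores copied from the example output above.
example = {
    "FAC": 0.00011911078036064282,
    "NOM": 0.0006520846509374678,
    "TEX": 4.539787187241018e-05,
    "TIC": 4.539787187241018e-05,
    "ECO": 0.9999545812606812,
}

print(top_label(example))                 # ECO
print(predicted_label_counts([example]))
```

Comparing these counts with the true class sizes of the test set shows at a glance whether one label is absorbing nearly all predictions.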
Operating System: Windows 10
Python Version Used: 3.6
spaCy Version Used: 2.0.11