# FastText: Multi-Label Text Classification

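Multi-label classification assigns any number of labels to each example. Typical applications:

- Toxic comment classification
- Movie genre prediction
- Audio categorization
- Image categorization
- Biology: genes in the Yeast dataset
- Technology
- Age, sex, and image prediction, etc.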

![Multi-class vs. multi-label classification](https://gombru.github.io/assets/cross_entropy_loss/multiclass_multilabel.png)
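
The figure contrasts multi-class classification (exactly one label per example) with multi-label classification (any number of labels per example). A minimal sketch of how that difference usually shows up in the output layer, assuming the common convention of a softmax for multi-class and independent sigmoids for multi-label. The logit values are made up for illustration, and the helper names are hypothetical:

```python
import math

def softmax(logits):
    """Multi-class: the scores compete and the probabilities sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label: every label gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical logits for three labels (illustrative values only).
logits = [2.0, 1.5, -1.0]

multiclass_probs = softmax(logits)
multilabel_probs = sigmoid(logits)

print(sum(multiclass_probs))  # sums to 1 up to floating-point error
print(multilabel_probs)       # each value lies in (0, 1); no sum constraint
```

With sigmoids, several labels can be likely at the same time, which is what a comment that is both `toxic` and `obscene` requires.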

FastText: https://arxiv.org/pdf/1607.01759.pdf


Toxic Comment Dataset: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge


GitHub Dataset Link: https://github.com/laxmimerit/Toxic-Comment

In [4]:
!pip install ktrain



In [2]:
!git clone https://github.com/laxmimerit/Toxic-Comment.git

Cloning into 'Toxic-Comment'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 1), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.


In [5]:
import pandas as pd
import ktrain
from ktrain import text

In [12]:
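PATH = "/content/Toxic-Comment/train.csv"
NUM_WORDS = 50000  # maximum vocabulary size
MAXLEN = 150       # each comment is padded or truncated to this many tokens
# One binary column per label; a comment can have several set at once.
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train, val, preproc = text.texts_from_csv(
    PATH, 'comment_text', label_columns=LABELS,
    ngram_range=1, max_features=NUM_WORDS, maxlen=MAXLEN)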

detected encoding: utf-8 (if wrong, set manually)
language: en
Word Counts: 197439
Nrows: 143613
143613 train sequences
train sequence lengths:
	mean : 67
	95percentile : 229
	99percentile : 569
x_train shape: (143613,150)
y_train shape: (143613, 6)
Is Multi-Label? True
15958 test sequences
test sequence lengths:
	mean : 65
	95percentile : 215
	99percentile : 550
x_test shape: (15958,150)
y_test shape: (15958, 6)
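
The log above reports `x_train shape: (143613,150)` even though the 95th-percentile training sequence length is 229, so every sequence has been forced to exactly 150 entries. A minimal sketch of that kind of fixed-length step, assuming padding and truncation both happen at the start of the sequence (the Keras `pad_sequences` default). Whether ktrain follows exactly this convention is an assumption here, and `pad_or_truncate` and `pad_id` are hypothetical names:

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return exactly `maxlen` IDs.

    Long inputs keep their last `maxlen` tokens; short inputs are
    left-padded with `pad_id` (assumed convention, see the text above).
    """
    if len(token_ids) >= maxlen:
        return token_ids[len(token_ids) - maxlen:]
    return [pad_id] * (maxlen - len(token_ids)) + token_ids

short = pad_or_truncate([5, 8, 2], maxlen=5)
long = pad_or_truncate(list(range(1, 9)), maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [4, 5, 6, 7, 8]
```

Because the mean length is 67 and the 99th percentile is 569, a `maxlen` of 150 pads most comments and cuts the longest ones.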


In [13]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]


In [14]:
model = text.text_classifier('fasttext', train, preproc)

Is Multi-Label? True
compiling word ID features...
maxlen is 150
done.


In [15]:
learner = ktrain.get_learner(model, train, val)

In [16]:
learner.autofit(0.001, 2)  # maximum learning rate 0.001, 2 epochs



begin training using triangular learning rate policy with max lr of 0.001...
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f4c0903add8>
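
The training log mentions a triangular learning rate policy with a maximum of 0.001. A generic sketch of such a schedule, in which the rate rises linearly to the maximum and then falls back. This is not ktrain's implementation: the function name `triangular_lr`, the `base_lr` of 0, and the choice of a single triangle over the whole run are assumptions made for illustration:

```python
def triangular_lr(step, total_steps, max_lr, base_lr=0.0):
    """Linear ramp from base_lr up to max_lr at the midpoint, then back down."""
    half = total_steps / 2
    distance = abs(step - half) / half  # 1 at both ends, 0 at the midpoint
    return base_lr + (max_lr - base_lr) * (1.0 - distance)

# Hypothetical run of 10 steps with the max rate from the log above.
schedule = [triangular_lr(s, total_steps=10, max_lr=0.001) for s in range(11)]

for step, lr in enumerate(schedule):
    print(step, lr)
```

The rate peaks at the midpoint of the run and is lowest at the first and last steps.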

In [17]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [19]:
predictor.predict(['I kill you'])

[[('toxic', 0.8786945),
  ('severe_toxic', 0.2070176),
  ('obscene', 0.60317975),
  ('threat', 0.12133912),
  ('insult', 0.5495991),
  ('identity_hate', 0.18305233)]]
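
The predictor returns one probability per label rather than a single class, so a decision rule is still needed to turn the scores into labels. A minimal sketch using the probabilities printed above, assuming a cutoff of 0.5; the threshold and the helper name `labels_above` are illustrative choices, not part of ktrain:

```python
# Probabilities copied from the prediction output above.
prediction = [
    ("toxic", 0.8786945),
    ("severe_toxic", 0.2070176),
    ("obscene", 0.60317975),
    ("threat", 0.12133912),
    ("insult", 0.5495991),
    ("identity_hate", 0.18305233),
]

def labels_above(prediction, threshold=0.5):
    """Keep every label whose probability meets the threshold."""
    return [label for label, prob in prediction if prob >= threshold]

active = labels_above(prediction)
print(active)  # ['toxic', 'obscene', 'insult']
```

A lower threshold flags more comments at the cost of more false positives; in practice the cutoff can be tuned per label on the validation set.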

In [20]:
predictor.save('/content/drive/My Drive/toxic_fasttext')