# Text Classification
This notebook demonstrates how to train a transformer model from scratch to predict the language of a given tweet. The training set contains tweets in 77 languages. The model is trained at the character level, so each input token is a single character.

# Environment setup
Install the Fairseq library from source.

In [None]:
!git clone https://github.com/pytorch/fairseq
%cd fairseq
!pip install --editable ./

# Preprocess the data

In [None]:
%cd ../data/twitter/
!fairseq-preprocess \
  --trainpref train --validpref valid --testpref test \
  --source-lang input --target-lang label \
  --destdir ../../tmp/data-bin/ --dataset-impl raw --tokenizer space
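
Given the prefixes and language names passed above, the preprocessing command expects parallel files named `train.input`, `train.label`, and likewise for `valid` and `test`, with one example per line. Because it uses `--tokenizer space` and the model works at the character level, each tweet has to be written as space-separated characters. The following is a minimal sketch of how such files could be produced. The helper names `to_char_tokens` and `write_split` and the `▁` placeholder for literal spaces are assumptions for illustration, not part of this notebook's data pipeline.

```python
from pathlib import Path

# Hypothetical placeholder standing in for a literal space (an assumption;
# the original data may encode spaces differently).
SPACE_MARK = "\u2581"


def to_char_tokens(text: str) -> str:
    """Return the text as space-separated characters."""
    # Collapse whitespace runs (including newlines) so each example stays on one line.
    normalized = " ".join(text.split())
    return " ".join(SPACE_MARK if ch == " " else ch for ch in normalized)


def write_split(prefix: str, examples, out_dir: str = ".") -> None:
    """Write parallel <prefix>.input and <prefix>.label files, one example per line."""
    out = Path(out_dir)
    with open(out / f"{prefix}.input", "w", encoding="utf-8") as f_in, \
         open(out / f"{prefix}.label", "w", encoding="utf-8") as f_label:
        for text, label in examples:
            f_in.write(to_char_tokens(text) + "\n")
            f_label.write(label + "\n")
```

For example, `write_split("train", [("Hola mundo", "es")])` would write the line `H o l a ▁ m u n d o` to `train.input` and `es` to `train.label`.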

# Register a new Fairseq custom task

In [None]:
%cd ../..
!cp src/simple_classification.py fairseq/fairseq/tasks/
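
Fairseq imports the modules in its `tasks` package on start-up, which is presumably why copying the file there is enough to make `--task simple_classification` available. The contents of `src/simple_classification.py` are not shown here, so the following is only a toy, pure-Python sketch of the decorator-based registry pattern that such a task file typically relies on. `TASK_REGISTRY`, `ToyClassificationTask`, and `setup_task` below are illustrative stand-ins, not Fairseq's implementation.

```python
# Toy registry mapping task names to task classes (illustrative only).
TASK_REGISTRY = {}


def register_task(name):
    """Decorator that records a task class under the given name."""
    def decorator(cls):
        if name in TASK_REGISTRY:
            raise ValueError(f"Cannot register duplicate task ({name})")
        TASK_REGISTRY[name] = cls
        return cls
    return decorator


# Importing a module that contains this class is what triggers registration.
@register_task("simple_classification")
class ToyClassificationTask:
    def __init__(self, data_dir):
        self.data_dir = data_dir


def setup_task(task_name, data_dir):
    """Look up the class registered under task_name and instantiate it."""
    return TASK_REGISTRY[task_name](data_dir)


task = setup_task("simple_classification", "tmp/data-bin/")
```

The design point is that the command line only carries a string; the lookup from that string to a class happens through the registry, so a task that was never imported cannot be selected.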

# Train a transformer model from scratch

In [None]:
!fairseq-train tmp/data-bin/ --task simple_classification \
                        --arch transformer \
                        --save-interval 1 \
                        --max-epoch 30 \
                        --save-dir save-dir \
                        --distributed-world-size 1 \
                        --share-decoder-input-output-embed \
                        --optimizer adam --adam-betas '(0.9,0.98)' \
                        --clip-norm 0.0 --lr 5e-6 --lr-scheduler inverse_sqrt \
                        --warmup-updates 4000 --dropout 0.1 \
                        --decoder-embed-dim 512 \
                        --encoder-embed-dim 512 \
                        --weight-decay 0.0001 --criterion label_smoothed_cross_entropy \
                        --label-smoothing 0.5 --batch-size 32 \
                        --activation-dropout 0.3 --encoder-attention-heads 8 \
                        --encoder-layers 6 \
                        --decoder-layers 6 --decoder-attention-heads 8 \
                        --encoder-ffn-embed-dim 1536 --decoder-ffn-embed-dim 1536
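
The combination of `--lr 5e-6`, `--lr-scheduler inverse_sqrt`, and `--warmup-updates 4000` means the learning rate is not constant. It rises linearly during warmup and then decays with the inverse square root of the update number. The sketch below reproduces that shape so the values can be inspected. It assumes the warmup starts from a learning rate of 0, which is my understanding of the scheduler's default when warmup is enabled, and the function name `inverse_sqrt_lr` is made up for illustration.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-6, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update under linear warmup + inverse-sqrt decay.

    warmup_init_lr=0.0 is an assumption about the scheduler's default.
    """
    if step < warmup_updates:
        # Linear ramp from warmup_init_lr to peak_lr over warmup_updates steps.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_updates / step)


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate reaches about 2.5e-6 halfway through warmup, peaks at 5e-6 at update 4000, and falls back to about 2.5e-6 by update 16000.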

# If you encounter the error message below
ValueError: Please build (or rebuild) Cython components with `python setup.py build_ext --inplace`.
## Uncomment the following cell and run it.

In [None]:
#! pip uninstall -y numpy
#! pip install numpy

# Evaluation on the test set

In [None]:
!python3 src/eval_classifier.py tmp/data-bin/ --path save-dir/checkpoint_best.pt