Download data from: http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset


In [5]:
from sklearn.datasets import fetch_20newsgroups
import spacy
import string

# To be used for pre-processing of data
## from terminal run "python -m spacy download en_core_web_sm"
tokenizer = spacy.load('en_core_web_sm')
# tokenizer = spacy.load("en")
punctuations = string.punctuation

# Load data
First, let's load the dataset from sklearn. 

In [6]:
newsgroup_train = fetch_20newsgroups(subset='train')
newsgroup_test = fetch_20newsgroups(subset='test') # we will use it later

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [7]:
train_split = 10000
train_data = newsgroup_train.data[:train_split]
train_targets = newsgroup_train.target[:train_split]

val_data = newsgroup_train.data[train_split:]
val_targets = newsgroup_train.target[train_split:]

test_data = newsgroup_test.data
test_targets = newsgroup_test.target

print ("Train dataset size is {}".format(len(train_data)))
print ("Val dataset size is {}".format(len(val_data)))
print ("Test dataset size is {}".format(len(test_data)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532


The fastText library takes a file as input and learns a classification model from it.
Each line of the input file should be in this format: "__label__[class] [Text]"
We will prepare the train, validation, and test files in this format.

In [8]:
def create_newsgroup_file(data, targets, outfile_name):
    with open(outfile_name, 'w') as fout:
        for i, sent in enumerate(data):
            line = "__label__" + str(targets[i]) + " " + sent.replace('\n', ' ') + "\n"
            fout.write(line)
            

create_newsgroup_file(train_data, train_targets, 'newsgroups.train') 
create_newsgroup_file(val_data, val_targets, 'newsgroups.val') 
create_newsgroup_file(test_data, test_targets, 'newsgroups.test') 

### Let's check what the file we created looks like

In [9]:
!head -2 newsgroups.train

__label__7 From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15   I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is  all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail.  Thanks, - IL    ---- brought to you by your neighborhood Lerxst ----     
__label__4 From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Po

### Install fastText if you haven't!
Use the following commands to download and build fastText.
```
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
```

Note: `make` builds a binary named `fasttext` (lowercase) inside `fastText-0.1.0`. The cells below invoke it as `fastText`, which only resolves on a case-insensitive filesystem; on a case-sensitive one, call `~/fastText-0.1.0/fasttext` instead.

### Let's start training the fastText classifier and check its performance on the validation set.

In [22]:
# Train fasttext
!~/fastText-0.1.0/fastText supervised -input newsgroups.train -output model_newsgroup

Read 2M words
Number of words:  258366
Number of labels: 20
Progress: 100.0%  words/sec/thread: 2598655  lr: 0.000000  loss: 2.999971  eta: 0h0m


In [23]:
# Evaluate it on validation set
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.val

N	1314
P@1	0.105
R@1	0.105
Number of examples: 1314


Note that fastText reports precision and recall at 1 (P@1 and R@1), not accuracy!  
The **precision** is the fraction of the labels predicted by fastText that are correct.  
The **recall** is the fraction of the true labels that were successfully predicted.  
Here every post has exactly one label and only the top prediction is scored, so P@1 and R@1 coincide and both equal the accuracy.
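
To make the two definitions concrete, here is a minimal sketch that computes P@1 and R@1 by hand for the single-label case. The function name `precision_recall_at_1` and the toy labels are made up for illustration; this is not how fastText computes its report internally.

```python
def precision_recall_at_1(predicted, gold):
    """Compute P@1 and R@1 when every example has exactly one true label.

    predicted: list with the top-1 predicted label of each example
    gold:      list with the true label of each example
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    precision = correct / len(predicted)  # correct / number of predicted labels
    recall = correct / len(gold)          # correct / number of true labels
    return precision, recall


# Toy labels, made up for illustration only
predicted = [7, 4, 4, 1]
gold = [7, 4, 1, 1]
p_at_1, r_at_1 = precision_recall_at_1(predicted, gold)
print(p_at_1, r_at_1)  # 0.75 0.75
```

With one predicted label and one true label per example, both denominators are the number of examples, which is why the two numbers in the report match.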

## What a horrible model! Do some preprocessing to make it better!

In [25]:
def preprocess_sent(sent):
    # Replace line breaks with spaces, since fastText reads each sample as a single line
    temp_sent = ' '.join(sent.split('\n'))
    tokens = tokenizer(temp_sent)
    # Lowercase every token and drop punctuation tokens (substring match against string.punctuation)
    processed_toks = [tok.text.lower() for tok in tokens if (tok.text not in punctuations)]

    return ' '.join(processed_toks).strip()
    
    
temp = preprocess_sent(train_data[0])
temp

"from lerxst@wam.umd.edu where 's my thing subject what car is this nntp posting host rac3.wam.umd.edu organization university of maryland college park lines 15    i was wondering if anyone out there could enlighten me on this car i saw the other day it was a 2-door sports car looked to be from the late 60s/ early 70s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is   all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail   thanks il     ---- brought to you by your neighborhood lerxst ----"

In [26]:
def create_newsgroup_file(data, targets, outfile_name):
    with open(outfile_name, 'w') as fout:
        for i, sent in enumerate(data):
            proc_sent = preprocess_sent(sent)
            line = "__label__" + str(targets[i]) + " " + proc_sent + "\n"
            fout.write(line)
            
create_newsgroup_file(train_data, train_targets, 'newsgroups.proc.train') 
create_newsgroup_file(val_data, val_targets, 'newsgroups.proc.val') 
create_newsgroup_file(test_data, test_targets, 'newsgroups.proc.test') 

In [27]:
!~/fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3110110  lr: 0.000000  loss: 2.971143  eta: 0h0m


In [28]:
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

N	1314
P@1	0.123
R@1	0.123
Number of examples: 1314


We see a tiny improvement, but the model is still bad. Let's adjust the model's hyperparameters.
fastText trains for 5 epochs by default, which is not enough to learn from our data.
Let's try increasing the number of epochs to 30.

#### It is important to note that the two models above aren't strictly comparable.
Each model is randomly initialized at the beginning of the training. So, every time you re-train the model, you will notice that the precision and recall are different.
In practice, it's a good idea to train the model with different initializations at least 5 times, and report the min, max, mean, and median stats.
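
A minimal sketch of that reporting step, using only the standard library. The helper name `summarize_runs` is hypothetical, and the scores are placeholder numbers chosen for illustration, not results from this notebook; in practice you would collect one validation P@1 per re-training.

```python
import statistics


def summarize_runs(scores):
    """Return min, max, mean, and median of a list of per-run scores."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }


# Placeholder values, NOT real results: one validation P@1 per training run
example_scores = [0.70, 0.72, 0.71, 0.69, 0.73]
summary = summarize_runs(example_scores)
for name, value in summary.items():
    print("{}: {:.3f}".format(name, value))
```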

In [29]:
!~/fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup -epoch 30
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3013444  lr: 0.000000  loss: 1.249545  eta: 0h0m

Great! A huge improvement. 
The learning rate dictates how fast a model learns. For supervised training, fastText uses 0.1 by default. A model converges faster with a larger learning rate, though a larger learning rate doesn't always mean a better model.
Let's adjust it as well.
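
The `lr` value in the training logs ends at 0.000000, which is consistent with the learning rate being decayed over the course of training. As a rough sketch, assuming the decay is linear in training progress (the function name `decayed_lr` is made up for illustration and is not part of fastText):

```python
def decayed_lr(initial_lr, progress):
    """Linearly decay the learning rate as training progresses.

    initial_lr: the starting learning rate (the value passed as -lr)
    progress:   fraction of training completed, between 0.0 and 1.0
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    return initial_lr * (1.0 - progress)


# Assumed linear schedule: full rate at the start, zero at the end
for progress in (0.0, 0.5, 1.0):
    print(progress, decayed_lr(0.5, progress))
```

Under this assumption, `-lr` sets the starting point of the schedule rather than a constant step size.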

In [30]:
!~/fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup -epoch 30 -lr 0.5
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 3029077  lr: 0.000000  loss: 0.304259  eta: 0h0m

Nice, the results improve! 

Now, instead of using **bags of words**, let's try using **bags of N-grams**. We'll use **Bigrams (N=2)** here.  
N-grams provide a sense of word order. 

Sentence: "I love eating pizza"  
Bigrams for the above sentence: "I love", "love eating", "eating pizza".  
By looking at the N-grams, it is possible to reconstruct a sentence.
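
The bigram example can be reproduced with a few lines of Python. This only illustrates what a word N-gram is, using whitespace tokenization and a hypothetical helper named `word_ngrams`; it says nothing about how fastText represents N-grams internally.

```python
def word_ngrams(sentence, n=2):
    """Return the list of word N-grams of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("I love eating pizza", n=2)
print(bigrams)  # ['I love', 'love eating', 'eating pizza']
```

Consecutive bigrams overlap by one word, which is what lets them carry local word order that a plain bag of words discards.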

In [31]:
!~/fastText-0.1.0/fastText supervised -input newsgroups.proc.train -output model_newsgroup \
-epoch 30 -lr 0.5 -wordNgrams 2
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.val

Read 2M words
Number of words:  134132
Number of labels: 20
Progress: 100.0%  words/sec/thread: 1118522  lr: 0.000000  loss: 0.380091  eta: 0h0m

You can find the other hyperparameters you can adjust in the fastText repo: https://github.com/facebookresearch/fastText/blob/master/README.md

After we have chosen the best model based on validation performance, we can test how it performs on the actual test set.  
Remember the lecture? ***Never*** tune your model on the test set!
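
One way to keep model selection tied to the validation set is to collect the validation reports and compare them in code. The sketch below assumes the report is tab-separated `key<TAB>value` lines, as in the test output shown in this notebook; the helper `parse_test_output` and the run names are hypothetical. The two reports are copied from the validation runs with default settings earlier in this notebook.

```python
def parse_test_output(text):
    """Parse tab-separated `key<TAB>value` lines into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:  # skip lines without a tab, e.g. "Number of examples: ..."
            metrics[parts[0].strip()] = float(parts[1])
    return metrics


# Validation reports copied from the two default-setting runs in this notebook
validation_outputs = {
    "raw text": "N\t1314\nP@1\t0.105\nR@1\t0.105\nNumber of examples: 1314",
    "preprocessed text": "N\t1314\nP@1\t0.123\nR@1\t0.123\nNumber of examples: 1314",
}

# Pick the configuration with the highest validation P@1
scores = {name: parse_test_output(out)["P@1"] for name, out in validation_outputs.items()}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])  # preprocessed text 0.123
```

Only the configuration selected this way should then be evaluated on the test file.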

In [32]:
!~/fastText-0.1.0/fastText test model_newsgroup.bin newsgroups.proc.test

N	7532
P@1	0.764
R@1	0.764
Number of examples: 7532


## Exercise
Try training fastText on the IMDB Large Movie Review Dataset and tune the hyperparameters.