# Text Classification Using ULMFiT and the fastai Library



ULMFiT (Universal Language Model Fine-tuning) is a transfer learning method that can be applied to a wide range of NLP tasks. Transfer learning means taking a deep learning model that was pre-trained on a large corpus and adapting it to a new problem.

**Problem Statement**

Fine-tune a pre-trained language model and use it for text classification on a new dataset. Because the training set is small (fewer than 1,000 labeled instances), a neural network trained from scratch would likely overfit it.

**Dataset:** The 20 Newsgroups dataset, available through `sklearn.datasets`.

In [1]:
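# install PyTorch and fastai into the Colab environment
# note: the code below uses the fastai v1 API (fastai.text, TextLMDataBunch), so a 1.x release of fastai is required
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install "fastai<2"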

Looking in links: https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
Collecting torch_nightly
  Downloading https://download.pytorch.org/whl/nightly/cu92/torch_nightly-1.2.0.dev20190805%2Bcu92-cp36-cp36m-linux_x86_64.whl (704.8MB)
     |████████████████████████████████| 704.8MB 25kB/s 
Installing collected packages: torch-nightly
Successfully installed torch-nightly-1.2.0.dev20190805+cu92


In [0]:
# import libraries
import fastai
from fastai import *
from fastai.text import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os

In [3]:
# import dataset
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [4]:
df = pd.DataFrame({'label':dataset.target, 'text':dataset.data})
df.head()

,label,text
0,17,Well i'm not sure about the story nad it did s...
1,0,"\n\n\n\n\n\n\nYeah, do you expect people to re..."
2,17,Although I realize that principle is not one o...
3,11,Notwithstanding all the legitimate fuss about ...
4,10,"Well, I will have to change the scoring on my ..."


In [5]:
df.shape

(11314, 2)

In [0]:
# keep only two classes: 10 (rec.sport.hockey) and 15 (soc.religion.christian)
df = df[df['label'].isin([10,15])]
df = df.reset_index(drop = True)

In [8]:
df['label'].value_counts()

10    600
15    599
Name: label, dtype: int64
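The integer labels are indices into the dataset's list of newsgroup names. The sketch below shows the lookup for the two classes kept above. It hard-codes the 20 names in the alphabetical order that scikit-learn exposes as `dataset.target_names`, so that it runs without downloading anything; `TARGET_NAMES` and `names_for` are illustrative names, not part of the notebook.

```python
# The 20 newsgroup names, in the alphabetical order used by
# dataset.target_names (hard-coded here so the sketch runs offline).
TARGET_NAMES = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

def names_for(label_ids, target_names=TARGET_NAMES):
    """Map integer class labels to their newsgroup names."""
    return {i: target_names[i] for i in label_ids}

selected = names_for([10, 15])
print(selected)
```

In the notebook itself, the same lookup could be done with `dataset.target_names[10]` and `dataset.target_names[15]`.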

**Data Preprocessing**

In [0]:
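# replace every non-alphabetic character with a space
# (regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default)
df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True)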
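To make the effect of the pattern concrete, here is a minimal standalone sketch that applies the same character class with Python's `re` module. The helper name `keep_letters` and the sample sentence are made up for illustration.

```python
import re

# Same character class as in the cell above: anything that is not a-z or A-Z.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def keep_letters(text):
    """Replace every non-alphabetic character with a single space."""
    return NON_LETTERS.sub(" ", text)

sample = "Game 7: Oilers won 3-2!"
cleaned = keep_letters(sample)
print(repr(cleaned))
print(cleaned.split())
```

Digits and punctuation are replaced one-for-one, so the string keeps its length and only the words survive a later `split()`. This is also why fragments such as `th game` show up in the processed text further down: the digit in front of `th` has been blanked out.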

In [10]:
# download nltk package
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
# a set makes the membership test used during stop-word removal much faster than a list
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# tokenization 
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop-words (the NLTK list is lowercase, so capitalized words such as "The" are kept)
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization: join the remaining tokens back into one string per document
df['text'] = tokenized_doc.apply(' '.join)
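One detail of this step is easy to miss: the membership test is case-sensitive, and the NLTK English list is lowercase, so capitalized stop words pass through. The sketch below illustrates this with a tiny hypothetical stop-word set (`STOP_WORDS`) standing in for the NLTK list; the `lowercase` option is an assumption added for comparison and is not what the notebook does.

```python
# Hypothetical mini stop-word set standing in for NLTK's lowercase English list.
STOP_WORDS = {"the", "a", "is", "and", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS, lowercase=False):
    """Split on whitespace, drop stop words, and join the rest back together."""
    tokens = text.split()
    if lowercase:
        # Optional variant: lowercase first so "The" matches "the".
        tokens = [t.lower() for t in tokens]
    return " ".join(t for t in tokens if t not in stop_words)

sentence = "The puck is in the net"
case_sensitive = remove_stop_words(sentence)                    # notebook behaviour
case_insensitive = remove_stop_words(sentence, lowercase=True)  # alternative
print(case_sensitive)
print(case_insensitive)
```

Keeping the original casing, as the notebook does, leaves capitalization available to the fastai tokenizer, at the cost of a few sentence-initial stop words remaining in the text.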

In [0]:
from sklearn.model_selection import train_test_split

# split data into training and validation set
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.3, random_state = 12)
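As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. It assumes that, for a fractional `test_size`, scikit-learn rounds the held-out count up and assigns the remainder to training; `split_sizes` is an illustrative helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_valid) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest goes to training.
    """
    n_valid = math.ceil(test_size * n_samples)
    return n_samples - n_valid, n_valid

# 600 + 599 documents remain after filtering to two classes.
n_train, n_valid = split_sizes(600 + 599, 0.3)
print(n_train, n_valid)
```

The training count agrees with the `Train: LabelList (839 items)` line printed later by the classifier learner.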

In [13]:
# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)

**Fine-Tuning the Pre-Trained Model and Making Predictions**

In [16]:
# create a learner object that will create a model, download the pre-trained weights, and be ready for fine-tuning
learn = language_model_learner(data_lm, arch = AWD_LSTM, drop_mult=0.7)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd


In [17]:
# train the learner object with learning rate = 1e-2
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.110006,5.188154,0.262036,00:03
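Language-model losses are often easier to interpret as perplexity. Assuming `valid_loss` is the mean per-token cross-entropy measured in nats (the usual convention for PyTorch's cross-entropy loss), perplexity is simply its exponential. A minimal sketch:

```python
import math

def perplexity(cross_entropy_loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)

# Validation loss reported after one epoch of language-model fine-tuning.
valid_ppl = perplexity(5.188154)
print(f"validation perplexity: {valid_ppl:.1f}")
```

A perplexity in this range means the model is, on average, about as uncertain as if it were choosing uniformly among a couple of hundred candidate tokens at each position, which is plausible after a single epoch on fewer than 1,000 short documents.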


In [0]:
# save encoder 
learn.save_encoder('ft_enc')

In [20]:
# use data_clas object created earlier to build a classifier with the fine-tuned encoder
learn = text_classifier_learner(data_clas, arch = AWD_LSTM, drop_mult=0.7)
learn.load_encoder('ft_enc')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (839 items)
x: TextList
xxbos xxmaj from account sound like even saw goal xxmaj mike xxmaj smith came behind net fired xxunk pass hit xxmaj fuhr back leg xxmaj fuhr backing time never saw happened xxmaj the puck went straight xxmaj fuhr leg net xxmaj fuhr never chance xxmaj there play back goaltender fact xxmaj xxunk xxmaj xxunk xxmaj calgary dumped xxmaj smith xxunk xxmaj it unfortunate happened xxmaj smith nice guy rookie time birthday xxmaj but blame lies xxmaj starting pee wee coaches tell players never make cross ice pass front net xxmaj too much chance intercepted hitting goaltender whatever xxmaj and people say xxmaj smith cost xxmaj oilers series i say certainly cause team lose three games xxmaj there reason xxunk team like xxmaj edmonton tied late third period th game second round xxmaj everybody team take responsibility even situation,xxbos xxmaj yes xxmaj he also played xxmaj jesus xxmaj jesus xxmaj christ xxmaj superstar 

In [21]:
# train the classifier for one epoch
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.367993,0.19745,0.930556,00:06


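The classifier reaches about 93% accuracy on the validation set after a single epoch. (The 26% reported earlier was the language model's next-word prediction accuracy, so the two figures measure different tasks and are not directly comparable.) The validation loss is also lower than the training loss, which suggests the model is not overfitting; one likely reason is that dropout is applied only during training.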

In [22]:
# get predictions (by default, get_preds runs on the validation set)
preds, targets = learn.get_preds()

# rows: predicted class, columns: actual class
predictions = np.argmax(preds, axis = 1)
pd.crosstab(predictions, targets)

predicted (row_0) \ actual (col_0),0,1
0,159,4
1,21,176
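The cross-tabulation can be turned into summary metrics by hand. The sketch below copies the four counts from the table above (rows are predicted classes, columns are actual classes, following the argument order of `pd.crosstab(predictions, targets)`); the helper functions are illustrative and not part of fastai.

```python
# rows = predicted class, columns = actual class
confusion = [[159, 4],
             [21, 176]]

def accuracy(cm):
    """Fraction of all examples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def precision(cm, cls):
    """Of the examples predicted as cls (row sum), the fraction that are cls."""
    return cm[cls][cls] / sum(cm[cls])

def recall(cm, cls):
    """Of the examples that are actually cls (column sum), the fraction found."""
    return cm[cls][cls] / sum(row[cls] for row in cm)

print(f"accuracy: {accuracy(confusion):.6f}")
for cls in (0, 1):
    print(f"class {cls}: precision={precision(confusion, cls):.3f} "
          f"recall={recall(confusion, cls):.3f}")
```

The accuracy computed this way matches the value reported by `fit_one_cycle`. The table also shows where the errors fall: 21 documents of class 0 were predicted as class 1, against only 4 mistakes in the other direction.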


Adapted from: https://www.analyticsvidhya.com/blog/2018/11/tutorial-text-classification-ulmfit-fastai-library/