<a href="https://colab.research.google.com/github/bipinKrishnan/fastai_course/blob/master/NLP_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install fastai --upgrade

In [2]:
from fastai.text.all import *
import numpy as np
import pandas as pd

from IPython.display import display, HTML

In [3]:
path = untar_data(URLs.IMDB)
(path/'..').ls()

(#1) [Path('/root/.fastai/data/imdb/../imdb')]

## Tokenization

In [4]:
txt_files = get_text_files(path, folders=['train', 'test', 'unsup'])

In [5]:
txt = txt_files[0].open().read()
txt

'I still can\'t figure out why any self-respecting person would ever attempt to make a film that is as stupid as In The Woods. Or better yet, why any decent person would ever rent, let alone buy, this piece of utter garbage.<br /><br />I think the writer should win the award for the dumbest storyline ever made into an actual movie.<br /><br />Everything about this movie just screams of stupidity. The acting is very mechanical and fake, the special effects (if you can call them that) and the "scary monster" seem like they\'re from an old 80\'s PBS tv show.<br /><br />Well, the list goes on and on. I won\'t bore you with all the details, if you want to be super bored you can go out and rent this movie!'

In [6]:
tokenizer = WordTokenizer()
toks = first(tokenizer([txt]))

print(f'{coll_repr(toks, 30)}\n{toks}')

(#151) ['I','still','ca',"n't",'figure','out','why','any','self','-','respecting','person','would','ever','attempt','to','make','a','film','that','is','as','stupid','as','In','The','Woods','.','Or','better'...]
(#151) ['I','still','ca',"n't",'figure','out','why','any','self','-'...]


In [7]:
first(tokenizer(["I'm going to U.S.A tomorrow."]))

(#7) ['I',"'m",'going','to','U.S.A','tomorrow','.']

In [8]:
first(tokenizer([["I'm going to U.S.A tomorrow."], ["Hey there how are you ?"]]))

(#11) ['[','"','I',"'m",'going','to','U.S.A','tomorrow','.','"'...]

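fastai's **Tokenizer** class wraps a **WordTokenizer** and adds extra processing on top of it: special tokens such as **xxbos** and **xxmaj**, plus the text-cleaning rules listed in **defaults.text_proc_rules**.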

In [9]:
tkn = Tokenizer(tokenizer)

In [10]:
tkn(txt)

(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...]

In [11]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]
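
To get a feel for what rules like these do, here is a minimal plain-Python sketch of two of them. `mark_maj` and `mark_rep` are hypothetical stand-ins written for illustration only; they are not fastai's `replace_maj` and `replace_rep`, although they emit the same `xxmaj` and `xxrep` markers that appear in this notebook's outputs.

```python
import re

def mark_maj(text):
    # Hypothetical stand-in: prefix each capitalized word with 'xxmaj' and lowercase it.
    def _sub(m):
        return f"xxmaj {m.group(0).lower()}"
    return re.sub(r"\b[A-Z][a-z]+\b", _sub, text)

def mark_rep(text):
    # Hypothetical stand-in: collapse 3+ repeats of a character into 'xxrep <count> <char>'.
    def _sub(m):
        return f" xxrep {len(m.group(0))} {m.group(1)} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

print(mark_maj("There is so much tragedy"))
print(mark_rep("this is sooooo bad").split())
```

The idea is that the model keeps the information "this word was capitalized" or "this character was repeated" as a separate token, so the vocabulary itself can stay lowercase and compact.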

In [12]:
??fix_html

In [13]:
x = 'how are you, I && my sister'

In [14]:
x.replace('&&', '&').replace(',', ' ').replace('  ', ' ')

'how are you I & my sister'

In [15]:
L(['JDKSLK', 'djksdj', 'djlskdlk', 2])

(#4) ['JDKSLK','djksdj','djlskdlk',2]

In [16]:
sample_txt = L([files.open().read() for files in txt_files[:2000]])
sample_txt

(#2000) ['I still can\'t figure out why any self-respecting person would ever attempt to make a film that is as stupid as In The Woods. Or better yet, why any decent person would ever rent, let alone buy, this piece of utter garbage.<br /><br />I think the writer should win the award for the dumbest storyline ever made into an actual movie.<br /><br />Everything about this movie just screams of stupidity. The acting is very mechanical and fake, the special effects (if you can call them that) and the "scary monster" seem like they\'re from an old 80\'s PBS tv show.<br /><br />Well, the list goes on and on. I won\'t bore you with all the details, if you want to be super bored you can go out and rent this movie!',"There is so much tragedy that takes place in the world involving the military and others involved in physical conflict, yet it is rare that a soldier comes forward to tell the truth. In Shake Hands with the Devil: The Journey of Roméo Dallaire, we are lucky to have not just a so

In [17]:
!pip install "sentencepiece!=0.1.90,!=0.1.91"

Collecting sentencepiece!=0.1.90,!=0.1.91
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.86


In [18]:
def subword(sz):
  st = SubwordTokenizer(vocab_sz=sz)
  st.setup(sample_txt)

  return ' '.join(first(st([txt]))[:40])

In [19]:
subword(1000)

"▁I ▁still ▁can ' t ▁figure ▁out ▁why ▁any ▁self - re s p ect ing ▁person ▁would ▁ever ▁attempt ▁to ▁make ▁a ▁film ▁that ▁is ▁as ▁stupid ▁as ▁In ▁The ▁W ood s . ▁O r ▁better ▁yet ,"

Increasing the vocab size increases the number of characters per token (so each sentence needs fewer tokens), and vice versa.

In [20]:
subword(200)

"▁I ▁st i l l ▁c an ' t ▁f i g u re ▁ o u t ▁w h y ▁ an y ▁ s e l f - re s p e c t ing ▁p er s"

In [21]:
subword(10000)

"▁I ▁still ▁can ' t ▁figure ▁out ▁why ▁any ▁self - respect ing ▁person ▁would ▁ever ▁attempt ▁to ▁make ▁a ▁film ▁that ▁is ▁as ▁stupid ▁as ▁In ▁The ▁Wood s . ▁Or ▁better ▁yet , ▁why ▁any ▁decent ▁person ▁would"

Picking a **subword vocab size** represents a compromise: a **larger vocab** means **fewer tokens per sentence**, which means **faster training, less memory, and less state for the model to remember**; but on the **downside**, it means **larger embedding matrices**, which **require more data to learn**
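
The effect of vocab size on sequence length can be reproduced with a toy example. The sketch below uses a greedy longest-match tokenizer, which is an assumption made for simplicity and is not the algorithm SentencePiece uses; `greedy_tokenize` and both vocabularies are made up for illustration.

```python
def greedy_tokenize(text, vocab):
    # Toy longest-match-first segmentation; unknown characters become single-char tokens.
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

small_vocab = {"s", "t", "i", "l", "re", "ing"}
large_vocab = small_vocab | {"still", "respect", "respecting"}

text = "stillrespecting"
small = greedy_tokenize(text, small_vocab)
large = greedy_tokenize(text, large_vocab)
print(len(small), small)
print(len(large), large)
```

With the small vocab the text breaks into many short pieces; with the larger vocab the same text is covered by two long tokens, at the cost of more entries to learn embeddings for.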

## Numericalization


Numericalization is the process of mapping tokens to integers
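
As a rough sketch of where such a mapping comes from, the snippet here builds a vocabulary from token counts in plain Python. `build_vocab`, `SPECIALS`, and the two toy documents are made up for illustration; fastai's `Numericalize` exposes similar `min_freq` and `max_vocab` parameters, but this is not its implementation.

```python
from collections import Counter

# Assumed subset of special tokens, placed at the start of the vocab.
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]

def build_vocab(token_lists, min_freq=2, max_vocab=60000):
    # Count every token, keep the most frequent ones that appear at least min_freq times.
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, c in counts.most_common(max_vocab)
                if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + frequent

docs = [["xxbos", "i", "liked", "this", "movie"],
        ["xxbos", "i", "hated", "this", "film"]]
vocab = build_vocab(docs)
print(vocab)
```

Tokens that fall below `min_freq` are left out of the vocab, which is why rare words end up mapped to the unknown token.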

In [22]:
tokens = tkn(txt)

In [23]:
# map applies the tkn function to each text in sample_txt
toks200 = sample_txt[:200].map(tkn)
toks200

(#200) [(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...],(#149) ['xxbos','xxmaj','there','is','so','much','tragedy','that','takes','place'...],(#196) ['xxbos','xxmaj','this','was','a','terrible','movie','with','a','bad'...],(#85) ['xxbos','xxmaj','what','a','waste','of','talent','and','cinematography','.'...],(#73) ['xxbos','i','thought','i','could','never','find','a','completely','bad'...],(#228) ['xxbos','i','first','saw','this','film','when','my','mother','bought'...],(#138) ['xxbos','i','have','a','very','hard','time','picking','a','favorite'...],(#211) ['xxbos','a','group','of','friends','are','out','partying','one','night'...],(#917) ['xxbos','xxmaj','some','films','are','mediocre',',','some','films','are'...],(#110) ['xxbos','i','do',"n't",'mind','the','movies','not','being','the'...]...]

In [24]:
num = Numericalize()
num.setup(toks200)

In [25]:
num.vocab[:5]

['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld']

In [26]:
print(coll_repr(num.vocab, 30))

(#2024) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it','"','this','that',"'s",'\n\n','-','was','as','with','for'...]


In [27]:
num(tokens)[:20]

tensor([  2,  18, 138, 190,  39, 720,  62, 154, 113, 721,  25,   0, 295,  68,
        155, 502,  15, 122,  12,  30])

In [28]:
tokens

(#163) ['xxbos','i','still','ca',"n't",'figure','out','why','any','self'...]

Building a numericalization function

In [29]:
def token2idx(token):
  try: 
    return (num.vocab.index(token))
  except ValueError:
    return ("unk")

In [30]:
token2idx('xxbos'), token2idx('his'), token2idx('jdksjdj'), token2idx(4) 

(2, 43, 'unk', 'unk')
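
A possible variation on the function above, sketched under the assumption that unknown tokens should map to the index of `xxunk` instead of a string. It uses a dictionary for constant-time lookups; the small `vocab` list, `tok2idx`, `encode`, and `decode` are illustrative names, not fastai objects.

```python
# Toy vocabulary for illustration; index 0 is reserved for unknown tokens.
vocab = ["xxunk", "xxpad", "xxbos", "xxeos", "the", "movie", "was", "good"]
tok2idx = {tok: i for i, tok in enumerate(vocab)}
UNK_IDX = tok2idx["xxunk"]

def encode(tokens):
    # Unknown tokens fall back to the xxunk index.
    return [tok2idx.get(t, UNK_IDX) for t in tokens]

def decode(ids):
    return [vocab[i] for i in ids]

ids = encode(["xxbos", "the", "movie", "was", "brilliant"])
print(ids)
print(decode(ids))
```

Decoding is lossy: any token that was not in the vocab comes back as `xxunk`, matching the behaviour seen with `num(tokens)` earlier.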

In [31]:
' '.join(num.vocab[idx] for idx in num(tokens)[:20])

"xxbos i still ca n't figure out why any self - xxunk person would ever attempt to make a film"

In [32]:
original_text = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
original_preprocessed_text = "xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while ."

In [33]:
my_preprocessed_text = ' '.join(tkn(original_text))
my_preprocessed_text

"xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while ."

In [34]:
original_preprocessed_text == my_preprocessed_text

True

In [35]:
bs, seq_len = 6, 15
stream = tkn(original_text)

[stream[i*seq_len: (i+1)*seq_len] for i in range(bs)]

[(#15) ['xxbos','xxmaj','in','this','chapter',',','we','will','go','back'...],
 (#15) ['movie','reviews','we','studied','in','chapter','1','and','dig','deeper'...],
 (#15) ['first','we','will','look','at','the','processing','steps','necessary','to'...],
 (#15) ['how','to','customize','it','.','xxmaj','by','doing','this',','...],
 (#15) ['of','the','preprocessor','used','in','the','data','block','xxup','api'...],
 (#15) ['will','study','how','we','build','a','language','model','and','train'...]]

In [36]:
df = pd.DataFrame(np.array([stream[i*seq_len: (i+1)*seq_len] for i in range(bs)]))
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
1,movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
2,first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
3,how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
4,of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
5,will,study,how,we,build,a,language,model,and,train,it,for,a,while,.
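
The grid above is the first step of language-model batching: the token stream is cut into `bs` contiguous rows. A minimal sketch of the second step, assuming each mini-batch then takes a window of `seq_len` columns from every row, is shown here on a stream of integers. `lm_batches` is a hypothetical helper, not the `LMDataLoader` implementation.

```python
def lm_batches(stream, bs, seq_len):
    # Split the stream into bs contiguous rows, then cut each row into windows of seq_len.
    row_len = len(stream) // bs
    rows = [stream[r * row_len:(r + 1) * row_len] for r in range(bs)]
    for start in range(0, row_len, seq_len):
        yield [row[start:start + seq_len] for row in rows]

stream = list(range(24))
batches = list(lm_batches(stream, bs=3, seq_len=4))
for b in batches:
    print(b)
```

Row `r` of each batch continues exactly where row `r` of the previous batch stopped, which is what lets a recurrent model carry its hidden state from one batch to the next.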


Simple tests

In [41]:
display(HTML(df.to_html(index=False, header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [42]:
sample_df = pd.DataFrame([[33, 44, 25]], columns=['epoch', 'error', 'accuracy'])
sample_df

Unnamed: 0,epoch,error,accuracy
0,33,44,25


In [43]:
for i in range(5):
  sample_df = pd.DataFrame([[i+1, i*44, i*25]], columns=['epoch', 'error', 'accuracy'])
  display(HTML(sample_df.to_html(index=False)))

epoch,error,accuracy
1,0,0


epoch,error,accuracy
2,44,25


epoch,error,accuracy
3,88,50


epoch,error,accuracy
4,132,75


epoch,error,accuracy
5,176,100


In [44]:
for i in range(5):
  sample_df = pd.DataFrame([[i+1, i*44, i*25]], columns=['epoch', 'error', 'accuracy'])
  display(sample_df)

Unnamed: 0,epoch,error,accuracy
0,1,0,0


Unnamed: 0,epoch,error,accuracy
0,2,44,25


Unnamed: 0,epoch,error,accuracy
0,3,88,50


Unnamed: 0,epoch,error,accuracy
0,4,132,75


Unnamed: 0,epoch,error,accuracy
0,5,176,100


Back to business

In [37]:
nums200 = toks200.map(num)

In [38]:
dl = LMDataLoader(nums200, bs=32)

In [39]:
x, y = first(dl)
x.shape, y.shape

(torch.Size([32, 72]), torch.Size([32, 72]))

In [40]:
' '.join([num.vocab[i] for i in x[1][:20]]), ' '.join([num.vocab[i] for i in y[1][:20]])

('the xxmaj xxunk xxmaj party . i can understand why a lot of people look down on writing like his',
 'xxmaj xxunk xxmaj party . i can understand why a lot of people look down on writing like his ,')
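
The output above shows that the target `y` is the input `x` shifted by one token. A minimal sketch of that relationship, using a hypothetical helper `make_xy` on a short token list taken from the same sentence:

```python
def make_xy(tokens, seq_len):
    # Input is the first seq_len tokens; target is the same window moved one step right.
    x = tokens[:seq_len]
    y = tokens[1:seq_len + 1]
    return x, y

tokens = ["the", "xxmaj", "xxunk", "xxmaj", "party", ".", "i"]
x_toks, y_toks = make_xy(tokens, seq_len=5)
print(x_toks)
print(y_toks)
```

At every position the model sees `x_toks[i]` and is asked to predict `y_toks[i]`, that is, the next token.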

## Language model with data block

In [41]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

db = DataBlock(
    blocks=TextBlock.from_folder(path/'..', is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
)

A quick aside on **@classmethod**

In [127]:
class MyFirstClass:
  age = 26

  @classmethod
  def print_something(cls):
    return cls.age

  def show(self, num):
    return num

In [132]:
# Equivalent without the decorator: MyFirstClass.print_something = classmethod(MyFirstClass.print_something)

MyFirstClass.print_something()

26

In [131]:
MyFirstClass().show(4)

4

In [95]:
class Person:
    age = 25

    def printAge(cls):
        print('The age is:', cls.age)

# create printAge class method
Person.printAge = classmethod(Person.printAge)

Person.printAge()

The age is: 25
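
Class methods are often used as alternate constructors, which is the role `TextBlock.from_folder` plays in the data block above. A small sketch of that pattern, with a made-up `Review` class and `from_line` method used purely for illustration:

```python
class Review:
    def __init__(self, label, text):
        self.label, self.text = label, text

    @classmethod
    def from_line(cls, line):
        # Alternate constructor: parse "label<TAB>text" and build an instance via cls(...).
        label, text = line.split("\t", 1)
        return cls(label, text)

r = Review.from_line("pos\tA lovely film")
print(r.label, "|", r.text)
```

Because the method receives `cls` instead of an instance, it can be called on the class itself and still returns a fully constructed object.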


Back to business

In [42]:
dls = db.dataloaders(path, bs=32, seq_len=80)

In [43]:
dls.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj season after season , the players or characters in this show appear to be people who you 'd absolutely love to hate . xxmaj is this show rigged to be that or were they chosen for the same ? xxmaj each episode vilifies one single person specifically and he ends up getting killed off . xxmaj you enjoy seeing them get screwed although its totally wrong and sick . xxmaj you enjoy seeing them screwing others , getting","xxmaj season after season , the players or characters in this show appear to be people who you 'd absolutely love to hate . xxmaj is this show rigged to be that or were they chosen for the same ? xxmaj each episode vilifies one single person specifically and he ends up getting killed off . xxmaj you enjoy seeing them get screwed although its totally wrong and sick . xxmaj you enjoy seeing them screwing others , getting screwed"
1,are simply hilarious but most of the film is rather lame . xxmaj at least the music score is very good but the music always ends abruptly because of the editing . xxmaj there are also a few scenes that are not logical and film also contains a very obvious ( and therefore disturbing ) continuity error . xxmaj jean xxmaj reno gives a decent performance and xxmaj christian xxmaj clavier turned out to be a very talented comedy actor,simply hilarious but most of the film is rather lame . xxmaj at least the music score is very good but the music always ends abruptly because of the editing . xxmaj there are also a few scenes that are not logical and film also contains a very obvious ( and therefore disturbing ) continuity error . xxmaj jean xxmaj reno gives a decent performance and xxmaj christian xxmaj clavier turned out to be a very talented comedy actor .


In [None]:
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()]).to_fp16()
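
The `Perplexity()` metric passed to the learner is the exponential of the cross-entropy loss. A small worked sketch in plain Python, where `cross_entropy` and `perplexity` are illustrative helpers and the probabilities are made up:

```python
import math

def cross_entropy(probs_of_targets):
    # Mean negative log-probability assigned to the correct next token.
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

def perplexity(probs_of_targets):
    return math.exp(cross_entropy(probs_of_targets))

# A model that always spreads its probability evenly over 4 candidate tokens.
uniform = [1 / 4] * 10
ppl = perplexity(uniform)
print(round(ppl, 6))
```

A useful reading: a perplexity of `k` means the model is, on average, as uncertain as if it were choosing uniformly among `k` tokens, so lower is better.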

In [None]:
learn.fit_one_cycle(1, 2e-2)

In [55]:
learn.save('1epoch')

Path('models/1epoch.pth')

In [56]:
learn.load('1epoch')

<fastai.text.learner.LMLearner at 0x7f153a68fe80>

In [None]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

In [58]:
learn.save_encoder('finetuned')

In [46]:
text = 'I like the movie because'
num_words = 40

learn.predict(text, num_words, temperature=0.75)

"i like the movie because it does n't work . i can not imagine the movies being mainly filmed in England . It 's an adaptation of William Shakespeare 's play of the same name to play the King"

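The `temperature` argument controls how random the generated text is. The sketch below shows the usual formulation, dividing the logits by the temperature before the softmax; it is a plain-Python illustration with made-up logits and helper names, not a copy of fastai's `predict` code.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.75)
flat = softmax_with_temperature(logits, 1.5)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sample(logits, 0.75, random.Random(0)))
```

With a temperature below 1, as in the call above, the most likely token gets an even larger share of the probability, so the output is less surprising.
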
## Text classifier

In [97]:
db_classifier = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls.vocab), CategoryBlock),
    get_items = partial(get_text_files, folders=['train', 'test']),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
)
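
Both `parent_label` and `GrandparentSplitter` rely on the IMDB folder layout, where the parent folder names the class and the grandparent folder names the split. A sketch of that logic with `pathlib`, using hypothetical file names and helper functions (`label_of`, `is_valid`) that only mimic the idea:

```python
from pathlib import PurePosixPath

# Hypothetical file names that follow the train|test / pos|neg layout.
files = [PurePosixPath("imdb/train/pos/0_9.txt"),
         PurePosixPath("imdb/train/neg/1_2.txt"),
         PurePosixPath("imdb/test/pos/3_8.txt")]

def label_of(p):
    # The label is the name of the folder that directly contains the file.
    return p.parent.name

def is_valid(p, valid_name="test"):
    # The split is decided by the folder two levels up from the file.
    return p.parent.parent.name == valid_name

labels = [label_of(p) for p in files]
train = [p for p in files if not is_valid(p)]
valid = [p for p in files if is_valid(p)]
print(labels, len(train), len(valid))
```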

In [None]:
dls_cls = db_classifier.dataloaders(path, bs=32, seq_len=72)

In [None]:
dls_cls.show_batch(max_n=2)

In [105]:
learn = text_classifier_learner(dls_cls, AWD_LSTM, drop_mult=0.5, metrics=accuracy).to_fp16()

In [106]:
learn = learn.load_encoder('finetuned')

In [107]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.559645,0.344728,0.85136,12:34


In [None]:
# freeze everything except the last two parameter groups
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-3))
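
The `slice(lo, hi)` learning rates above give earlier layer groups smaller rates than later ones. A sketch of why the lower bound is divided by `2.6**4`, assuming the rates are spaced geometrically across five parameter groups (both the spacing and the group count are assumptions made for this illustration; `spread_lrs` is a hypothetical helper):

```python
def spread_lrs(lo, hi, n_groups):
    # Geometric spacing between lo and hi, one learning rate per layer group.
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

lrs = spread_lrs(1e-2 / 2.6 ** 4, 1e-2, n_groups=5)
ratios = [b / a for a, b in zip(lrs, lrs[1:])]
print(["%.2e" % lr for lr in lrs])
print([round(r, 3) for r in ratios])
```

Under these assumptions each group trains with a rate 2.6 times larger than the group before it, so the pretrained early layers change slowly while the new head adapts fastest.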