<a href="https://colab.research.google.com/github/paruliansaragi/cnn-fastai/blob/master/lesson10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

Where we are going:
We have seen in every lesson this idea of taking a pre-trained model, whipping some stuff off the top, replacing it with something new, and getting it to do something similar. We’ve dived in a little deeper to say that ConvLearner.pretrained had a standard way of sticking stuff on the top which does a particular thing (i.e. classification). Then we learned that we can actually stick any PyTorch module we like on the end and have it do anything we like with a custom_head, and so suddenly you discover there are some really interesting things we can do.

In fact, Yang Lu said “what if we did a different kind of custom head?” The different custom head was: let’s take the original pictures, rotate them, then make our dependent variable the opposite of that rotation and see if it can learn to un-rotate them. This is a super useful thing; in fact, I think Google Photos nowadays has this option where it will automatically rotate your photos for you. But the cool thing is, as he showed here, you can build that network right now by doing exactly the same as our previous lesson. Your custom head is one that spits out a single number, which is how much to rotate by, and your dataset has a dependent variable which is how much you rotated by.

So you suddenly realize with this idea of a backbone plus a custom head, you can do almost anything you can think about [16:30].

Today, we are going to look at the same idea and see how that applies to NLP.
In the next lesson, we are going to go further and say: if NLP and computer vision let you use the same basic ideas, how do we combine the two? We are going to learn about a model that can actually learn to find word structures from images, images from word structures, or images from images. That will form the basis, if you want to go further, of doing things like going from an image to a sentence (i.e. image captioning) or going from a sentence to an image, which we kind of started to do with a phrase to image.
From there, we’ve got to go deeper then into computer vision to think what other kinds of things we can do with this idea of pre-trained network plus a custom head. So we will look at various kinds of image enhancement like increasing the resolution of a low-res photo to guess what was missing or adding artistic filters on top of photos, or changing photos of horses into photos of zebras, etc.
Then finally that’s going to bring us all the way back to bounding boxes again. To get there, we’re going to first of all learn about segmentation, which is not just figuring out where a bounding box is, but figuring out what every single pixel in an image is a part of: this pixel is a part of a person, this pixel is a part of a car. Then we are going to use that idea, particularly an idea called UNet, which it turns out we can apply to bounding boxes, where it’s called feature pyramids. We’ll use that to get really good results with bounding boxes. That’s kind of our path from here. It’s all going to build on each other but take us into lots of different areas.

## IMDB

### Standardize format

The basic path for NLP is that we have to take sentences and turn them into numbers, and there are a couple of steps to get there. At the moment, somewhat intentionally, fastai.text does not provide that many helper functions. It’s really designed more to let you handle things in a fairly flexible way.

In [0]:
!pip install fastai==0.7.0



In [0]:
!pip install torchtext==0.2.3



In [0]:
from fastai.text import *
import html

In [0]:
DATA_PATH=Path('data/')
DATA_PATH.mkdir(exist_ok=True)

In [0]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2018-10-16 17:24:18--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.2’


2018-10-16 17:24:21 (27.4 MB/s) - ‘aclImdb_v1.tar.gz.2’ saved [84125825/84125825]



In [0]:
!tar xzfv aclImdb_v1.tar.gz 

In [0]:
!mv aclImdb data

In [0]:
BOS = 'xbos' # beginning-of-sentence tag
FLD = 'xfld' # data field tag

PATH=Path('data/aclImdb/')

In [0]:
CLAS_PATH=Path('data/imdb_clas/')
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH=Path('data/imdb_lm/')
LM_PATH.mkdir(exist_ok=True)
# the classification path will contain the info used to create a sentiment analysis model
# the language model (LM) path will contain the info used to create the LM


As you can see here [21:59], I wrote something called get_texts which goes through each thing in CLASSES. There are three classes in IMDb: negative, positive, and then there’s another folder “unsupervised” which contains the ones they haven’t gotten around to labeling yet — so we will just call that a class for now. So we just go through each one of those classes, then find every file in that folder, and open it up, read it, and chuck it into the end of the array. As you can see, with pathlib, it’s super easy to grab stuff and pull it in, and then the label is just whatever class we are up to so far. We will do that for both training set and test set.

In [0]:
CLASSES = ['neg', 'pos', 'unsup']

def get_texts(path):
    texts,labels = [],[]
    for idx,label in enumerate(CLASSES): # go through each class
        for fname in (path/label).glob('*.*'): # find every file in that folder with that name
            texts.append(fname.open('r', encoding='utf-8').read()) # open it and read it and chuck it into the end of this array
            labels.append(idx) # the label is just whatever class i'm up to so far
    return np.array(texts),np.array(labels)

trn_texts,trn_labels = get_texts(PATH/'train')
val_texts,val_labels = get_texts(PATH/'test')

In [0]:
len(trn_texts), len(val_texts) # 50,000 of the train texts are unsup

(75000, 25000)

In [0]:
col_names = ['labels', 'text']

One thing that’s always a good idea is to sort things randomly [23:19]. It is useful to know this simple trick for sorting things randomly, particularly when you’ve got multiple things you have to sort the same way. In this case, you have labels and texts. If you give np.random.permutation an integer, it gives you the numbers from 0 up to, and not including, the number you give it, in some random order.

In [0]:
np.random.seed(42) # next step is to sort randomly as its good idea
trn_idx = np.random.permutation(len(trn_texts)) # np rand perm if you give it an int it 
#gives you a random list from 0 up to 
# and not including the num you give it in some random order
val_idx = np.random.permutation(len(val_texts))  

You can then pass that in as an indexer to give you a list that’s sorted in that random order. So in this case, it is going to sort trn_texts and trn_labels in the same random way. So that’s a useful little idiom to use.

In [0]:
trn_texts = trn_texts[trn_idx] # you can pass that in as an indexer
val_texts = val_texts[val_idx] #to give you a list thats sorted in that random order

trn_labels = trn_labels[trn_idx] # its going to sort them both in the same random way
val_labels = val_labels[val_idx]
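
As a self-contained illustration of the idiom above, here is a minimal sketch on made-up data (the four toy texts and the local `RandomState` are assumptions for the example, not part of the notebook). It checks that indexing both arrays with the same permutation keeps every text paired with its label.

```python
import numpy as np

# Toy stand-ins for the review texts and their labels (made-up data).
texts = np.array(['bad film', 'great film', 'unlabelled review', 'awful'])
labels = np.array([0, 1, 2, 0])

rng = np.random.RandomState(42)      # local generator, so the global seed is untouched
idx = rng.permutation(len(texts))    # the integers 0..len(texts)-1 in a random order

shuffled_texts = texts[idx]          # index both arrays with the SAME permutation
shuffled_labels = labels[idx]

# Every text still sits next to its original label.
pairs_before = set(zip(texts.tolist(), labels.tolist()))
pairs_after = set(zip(shuffled_texts.tolist(), shuffled_labels.tolist()))
print(pairs_before == pairs_after)
```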

Now we have our texts and labels sorted, we can create a dataframe from them [24:07]. Why are we doing this? The reason is because there is a somewhat standard approach starting to appear for text classification datasets which is to have your training set as a CSV file with the labels first, and the text of the NLP documents second. So it basically looks like this:

In [0]:
df_trn = pd.DataFrame({'text': trn_texts, 'labels':trn_labels}, columns=col_names) # now that we have the texts and labels sorted
#we can then create a dataframe from them.
#why though? because, there is a standard approach starting to appear for text classification datasets 
#that is to have your train dataset in a csv file. With the labels first and the text second in their respective csv's
#and a file with classes.txt that lists the classes
df_val = pd.DataFrame({'text': val_texts, 'labels':val_labels}, columns=col_names)

So you have your labels and texts, and then a file called classes.txt which just lists the classes. I say somewhat standard because in a reasonably recent academic paper Yann LeCun and a team of researchers looked at quite a few datasets and they use this format for all of them. So that’s what I started using as well for my recent paper. You’ll find that if you put your data into this format, the whole notebook will work every time [25:17]. So rather than having a thousand different formats, I just said let’s just pick a standard format and your job is to put your data in that format, which is the CSV file. The CSV files have no header by default.

You’ll notice at the start, we have two different paths [25:51]. One was the classification path, and the other was the language model path. In NLP, you’ll see LM all the time. LM means language model. The classification path is going to contain the information that we are going to use to create a sentiment analysis model. The language model path is going to contain the information we need to create a language model. So they are a little bit different. One thing that is different is that when we create the train.csv in the classification path, we remove everything that has a label of 2 because label of 2 is “unsupervised” and we can’t use it.

In [0]:
df_trn[df_trn['labels']!=2].to_csv(CLAS_PATH/'train.csv', header=False, index=False) # for the classification train.csv,
# drop everything with a label of 2, because label 2 means "unsup" (unlabelled)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False) # the classification path keeps the actual labels;
# the LM path has no real labels, so zeros are written there later

(CLAS_PATH/'classes.txt').open('w', encoding='utf-8').writelines(f'{o}\n' for o in CLASSES)
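
To make the file layout concrete, here is a small sketch that writes and reads back a header-less CSV (label first, text second) plus a classes.txt, using only the standard library. The two reviews and the temporary directory are invented for the example; the notebook itself writes these files with pandas.

```python
import csv
import tempfile
from pathlib import Path

classes = ['neg', 'pos']
rows = [(0, 'Dull, and far too long.'), (1, 'A joy from start to finish.')]

out_dir = Path(tempfile.mkdtemp())   # throwaway directory for the example

# Label first, text second, no header row.
with (out_dir/'train.csv').open('w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# One class name per line; the line number is the integer label.
(out_dir/'classes.txt').write_text(''.join(f'{c}\n' for c in classes), encoding='utf-8')

with (out_dir/'train.csv').open(newline='', encoding='utf-8') as f:
    loaded = [(int(label), text) for label, text in csv.reader(f)]
class_names = (out_dir/'classes.txt').read_text(encoding='utf-8').split()

print(loaded, class_names)
```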


We start by creating the data for the Language Model (LM). The **LM's goal is to learn the structure of the English language.** It **learns language by trying to predict the next word given a set of previous words (ngrams)**. Since the LM does not classify reviews, the labels can be ignored.

The LM can benefit from all the textual data and there is no need to exclude the unsup/unclassified movie reviews.

We first concat all the train(pos/neg/unsup = 75k) and test(pos/neg=25k) reviews into a big chunk of 100k reviews. And then we use sklearn splitter to divide up the 100k texts into 90% training and 10% validation sets.

In [0]:
df_trn.head()
#you'll find that if you've put your data into this format, the whole notebook will work every time
#your job is to put it into a CSV file

Unnamed: 0,labels,text
0,2,This is one of the funniest films I've ever se...
1,0,I don't see much reason to get into this movie...
2,1,Wes Craven has been created a most successful ...
3,2,"The images are amazing! Clearly, the filmed cl..."
4,2,"I wasn't interested in the story, mainly becau..."


In [0]:
(CLAS_PATH/'classes.txt').open().readlines()

['neg\n', 'pos\n', 'unsup\n']

Now for the language model, we can create our own validation set. You’ve probably come across sklearn.model_selection.train_test_split by now, which is a really simple function that grabs a dataset and randomly splits it into a training set and a validation set according to whatever proportion you specify. In this case, we concatenate our classification training and validation sets together and hold out 10%, so we have 90,000 training and 10,000 validation texts for our language model. So that’s getting the data in a standard format for our language model and our classifier.

In [0]:
fastinotwork = np.concatenate([trn_texts,val_texts])

In [0]:
from sklearn.model_selection import train_test_split
trn_texts,val_texts = train_test_split(fastinotwork, test_size=0.1)


In [0]:
trn_texts,val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts,val_texts]), test_size=0.1)#simple function that splits data
#and we concat classification training and testing and split it by 10%.
#100,000 altogether so 90,000 train and 10,000 for validation

In [0]:
len(trn_texts), len(val_texts)

The second difference is the labels [26:51]. For the classification path, the labels are the actual labels, but for the language model, there are no labels so we just use a bunch of zeros and that just makes it a little easier because we can use a consistent dataframe/CSV format.

In [0]:
df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)#no labels 
#for the lm and we just use zeros - easier for csv format
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)


### Language model tokens
In this section, we start cleaning up the messy text. There are 2 main activities we need to perform:

1. Clean up extra spaces, tab characters, newline characters and other characters, and replace them with standard ones.
2. Use the awesome spacy library to tokenize the data. Since spacy does not provide a parallel/multicore version of the tokenizer, the fastai library adds this functionality. This parallel version uses all the cores of your CPUs and runs much faster than the serial version of the spacy tokenizer.

Tokenization is the process of splitting the text into separate tokens so that each token can be assigned a unique index. This means we can convert the text into integer indexes our models can use.

We use an appropriate chunksize as the tokenization process is memory intensive.

The next thing we need to do is tokenization. At this stage, for a document (i.e. a movie review), we have a big long string and we want to turn it into a list of tokens, which is similar to a list of words but not quite. For example, we want don’t to become do and n’t, we probably want a full stop to be a token, and so forth. Tokenization is something that we pass off to a terrific library called spaCy, partly terrific because an Australian wrote it and partly terrific because it’s good at what it does. We put a bit of stuff on top of spaCy, but the vast majority of the work has been done by spaCy.
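
To show what “a list of tokens, not quite a list of words” means, here is a toy regex tokenizer. It is only an illustration of the idea and is not how spaCy works; the function name `toy_tokenize` and the pattern are assumptions made up for this sketch.

```python
import re

# Alternatives, tried in order at each position:
#   n't            the contraction as its own token
#   \w+(?=n't)     the word stem that comes right before n't
#   \w+            an ordinary word
#   [^\w\s]        a single punctuation character
_TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def toy_tokenize(text):
    """Split text into word-like tokens, separating n't and punctuation (toy example)."""
    return _TOKEN_RE.findall(text)

tokens = toy_tokenize("I don't like it.")
print(tokens)
```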



In [0]:
#next thing to do is tokenization
#that means that at this stage we have, for a document, a big long string
#and we want to turn it into a list of tokens which are kind of a list of words but not quite
chunksize=24000


Before we pass it to spaCy, Jeremy wrote this simple fixup function, because each of the datasets he looked at (about a dozen in building this) had different weird things that needed to be replaced. So here are all the ones he’s come up with so far, and hopefully this will help you out as well. All the entities are HTML-unescaped and there are a bunch more things we replace. Have a look at the result of running this on text that you put in and make sure there are no more weird tokens in there.

In [0]:
re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x)) # html.unescape turns entities such as &amp; back into &; re1 collapses runs of spaces
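
To see the effect of this kind of cleanup on a single string, here is a cut-down sketch that handles only three of the cases (line-break tags, HTML entities, and runs of spaces). The function name `clean` and the sample string are made up for the example; the full fixup above covers many more cases.

```python
import html
import re

_multi_space = re.compile(r'  +')   # two or more spaces in a row

def clean(text):
    """Simplified cleanup: <br /> to newline, unescape HTML entities, collapse spaces."""
    text = text.replace('<br />', '\n')
    return _multi_space.sub(' ', html.unescape(text))

sample = 'Tom &amp; Jerry<br />was   great &quot;fun&quot;'
cleaned = clean(sample)
print(repr(cleaned))
```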

In [0]:
def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64) # grab the label columns and make them ints
    # grab the text, prefixed with the beginning-of-stream (BOS) token and the marker for field 1
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str) # allows multiple text fields in the CSV
    texts = list(texts.apply(fixup).values) # apply fixup to every text

    # tokenizing with spaCy is slow, so proc_all_mp ("process all, multiprocessing") tokenizes
    # each sublist on its own core; partition_by_cores splits the list into one sublist per core
    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)
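
Here is a minimal sketch of what the marker prefixes look like for one document with two text fields. The helper name `mark_fields` is hypothetical, and numbering the fields 1, 2, ... is this sketch's own choice rather than a statement about what the notebook code emits.

```python
BOS = 'xbos'  # beginning-of-stream marker, as defined earlier in the notebook
FLD = 'xfld'  # field marker, as defined earlier in the notebook

def mark_fields(fields):
    """Join the text fields of one document, marking the document start and each field."""
    marked = f'\n{BOS}'
    for number, field in enumerate(fields, start=1):
        marked += f' {FLD} {number} {field}'
    return marked

doc = mark_fields(['A short title', 'The body of the review.'])
print(repr(doc))
```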

The get_all function calls get_texts, and get_texts is going to do a few things [29:40], one of which is to apply the fixup that we just mentioned.

In [0]:
def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):#we go through each chunk of which each one is a dataframe
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)#and we call get texts
        tok += tok_
        labels += labels_
    return tok, labels

Let’s look through this because there are some interesting things to point out [29:57]. We are going to use pandas to open our train.csv from the language model path, but we are passing in an extra parameter you may not have seen before called **chunksize**. **Python and pandas can both be pretty inefficient** when it comes to storing and using text data. So you’ll see that very few people in NLP are working with large corpuses. Jeremy thinks part of the reason is that traditional tools made it really difficult: you run out of memory all the time. The process he is showing us today is one he has used successfully on corpuses of over a billion words with this exact code. One of the simple tricks is this thing called chunksize with pandas. What that means **is that pandas does not return a data frame, but an iterator that we can use to loop through chunks of a data frame.** That is why we don’t say tok_trn = get_texts(df_trn) but instead **call get_all, which loops through the data frame; what it’s really doing is looping through chunks of the data frame, each of which is basically a data frame representing a subset of the data** [31:05].

In [0]:
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)
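
A small sketch of what chunksize changes, using an in-memory CSV of five made-up rows instead of the real train.csv: the reader yields ordinary data frames of at most `chunksize` rows each, rather than one big data frame.

```python
import io
import pandas as pd

# Five made-up rows in the same header-less "label,text" layout.
csv_text = ('0,first review\n1,second review\n0,third review\n'
            '1,fourth review\n0,fifth review\n')

reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2)

chunk_sizes = []
for chunk in reader:                 # each chunk is a regular DataFrame
    assert isinstance(chunk, pd.DataFrame)
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
```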

Question: When I’m working with NLP data, many times I come across data with foreign texts/characters. Is it better to discard them or keep them [31:31]? No no, definitely keep them. This whole process is unicode and I’ve actually used this on Chinese text. This is designed to work on pretty much anything. In general, most of the time, it’s not a good idea to remove anything. Old-fashioned NLP approaches tended to do things like lemmatization and all these normalization steps to get rid of things, lowercase everything, etc. But that’s throwing away information which you don’t know ahead of time whether it’s useful or not. So don’t throw away information.

So we go through each chunk, each of which is a data frame, and we call get_texts [32:19]. get_texts will grab the labels and make them into integers, and it’s going to grab the texts. A couple of things to point out:

- Before we include the text, we have a “beginning of stream” (BOS) token which we defined in the beginning. There’s nothing special about these particular strings of letters: **they are just ones I figured don’t appear in normal texts very often**. So every text is going to start with **‘xbos’. Why is that? Because it’s often useful for your model to know when a new text is starting.** For example, if it’s a language model, **we are going to concatenate all the texts together. So it would be really helpful for it to know that this article has finished and a new one has started, so it should probably forget some of the context now.**

- Likewise, texts quite often have multiple fields, like a title, an abstract, and then a main document. So by the same token, we’ve got this thing here which lets us have multiple fields in our CSV. So this process is designed to be very flexible. Again, at the start of each one, we put a special “field starts here” token followed by the number of the field that’s starting here, for as many fields as we have. Then we apply fixup to it.

- Then most importantly [33:54], we tokenize it, by doing a “process all multiprocessing” (proc_all_mp). Tokenizing tends to be pretty slow, but we’ve all got multiple cores in our machines now, and some of the better machines on AWS can have dozens of cores. spaCy is not very amenable to multiprocessing, but Jeremy finally figured out how to get it to work. The good news is that it’s all wrapped up in this one function now. So all you need to pass to that function is a list of things to tokenize, and each part of that list will be tokenized on a different core. There is also a function called partition_by_cores which takes a list and splits it into sublists. The number of sublists is the number of cores that you have in your computer. On Jeremy’s machine without multiprocessing, this takes about an hour and a half, and with multiprocessing, it takes about 2 minutes. So it’s a really handy thing to have. Feel free to look inside it and take advantage of it for your own stuff. Remember, we all have multiple cores even in our laptops, and very few things in Python take advantage of them unless you make a bit of an effort to make it work.
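
The splitting step described in the last bullet can be sketched as follows. This is not fastai's partition_by_cores; `split_into_chunks` is a hypothetical helper written for this example, and it only shows how a list might be cut into roughly equal sublists, one per worker. Handing each sublist to a separate process is left out.

```python
import os

def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous sublists of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))   # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f'review {i}' for i in range(10)]      # made-up documents
chunks = split_into_chunks(texts, 4)
print([len(c) for c in chunks])

# In practice the number of chunks would come from the machine, e.g.:
n_cores = os.cpu_count() or 1
per_core = split_into_chunks(texts, n_cores)
```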

In [0]:
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)

Here is the result at the end [35:42]: the beginning of stream token (xbos), the beginning of field number 1 token (xfld 1), and the tokenized text. You’ll see that the punctuation is, on the whole, now a separate token.

In [0]:
(LM_PATH/'tmp').mkdir(exist_ok=True)

t_up: t_up mgm — MGM was originally capitalized. But the interesting thing is that normally people either lowercase everything or they leave the case as is. **Now if you leave the case as is, then “SCREW YOU” and “screw you” are two totally different sets of tokens that have to be learnt from scratch. Or if you lowercase them all, then there is no difference at all. So how do you fix this so that you both get a semantic impact of “I’M SHOUTING NOW” but not have to learn the shouted version vs. the normal version.** So the idea is to come up with **a unique token to mean the next thing is all uppercase. Then we lowercase it, so now whatever used to be uppercase is lowercased, and then we can learn the semantic meaning of all uppercase.**

tk_rep: Similarly, if you have 29 ! in a row, we don’t learn a separate token for 29 exclamation marks; instead we put in a **special token for “the next thing repeats lots of times” and then put the number 29 and an exclamation mark (i.e. tk_rep 29 !).** So there are a few tricks like that. If you are interested in NLP, have a look at the tokenizer code for these little tricks that Jeremy added in because some of them are kind of fun.
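
Here is a rough sketch of the two tricks just described. It is written from the description above rather than copied from the fastai tokenizer, so the function names and the thresholds (a character repeated 4 or more times, an uppercase token longer than 2 characters) are assumptions of this example.

```python
import re

_REPEAT_RE = re.compile(r'(\S)\1{3,}')   # the same non-space character 4+ times in a row

def mark_repeats(text):
    """Replace a long run of one character with 'tk_rep <count> <char>'."""
    def _replace(match):
        return f' tk_rep {len(match.group(0))} {match.group(1)} '
    return _REPEAT_RE.sub(_replace, text)

def mark_caps(tokens):
    """Lowercase every token, putting 't_up' before tokens that were all uppercase."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 2:
            out += ['t_up', tok.lower()]
        else:
            out.append(tok.lower())
    return out

tokens = mark_caps(mark_repeats('MGM rocks!!!!!').split())
print(tokens)
```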

In [0]:
' '.join(tok_trn[0])

The **nice thing with doing things this way is we can now just np.save that and load it back up later** [37:44]. We don’t have to recalculate all this stuff each time like we tend to have to do with torchtext or a lot of other libraries. **Now that we got it tokenized, the next thing we need to do is to turn it into numbers which we call numericalizing it. The way we numericalize it is very simple.**

- We make a list of all the words that appear, in some order
- Then we replace every word with its index into that list
- The list of all the tokens, we call that the vocabulary.

In [0]:
np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val)

In [0]:
tok_trn = np.load(LM_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(LM_PATH/'tmp'/'tok_val.npy')

Here is an example of some of the vocabulary [38:28]. The **Counter class in Python is very handy for this. It basically gives us a list of unique items and their counts.** Here are the 25 most common things in the vocabulary. **Generally speaking, we don’t want every unique token in our vocabulary. If a token doesn’t appear at least twice, it might just be a spelling mistake, or a word we can’t learn anything about because it doesn’t appear often enough.** Also the stuff we are going to be learning about so far in this part **gets a bit clunky once you’ve got a vocabulary bigger than 60,000**. Time permitting, we may look at some work Jeremy has been doing recently on handling larger vocabularies, otherwise that might have to come in a future course. But actually for classification, doing more than about 60,000 words doesn’t seem to help anyway.

The vocab is the unique set of all tokens in our dataset. The vocab provides us a way for us to simply replace each word in our datasets with a unique integer called an index.

In a large corpus of data one might find some rare words which are only used a few times in the whole dataset. We discard such rare words and avoid trying to learn meaningful patterns out of them.

Here we have set a minimum frequency of occurrence of 2 times. It has been observed by NLP practitioners that a maximum vocab of 60k usually yields good results for classification tasks. So we set max_vocab to 60000.

In [0]:
freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)
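
For readers who have not used Counter before, here is the same counting pattern on three tiny made-up documents.

```python
from collections import Counter

# Three tiny tokenized "documents" (made-up data).
docs = [
    ['the', 'film', 'was', 'good', '.'],
    ['the', 'plot', 'was', 'thin', '.'],
    ['the', 'end'],
]

# Flatten the documents into one stream of tokens and count each one.
freq = Counter(tok for doc in docs for tok in doc)

print(freq.most_common(1))   # the single most frequent token with its count
print(freq['missing'])       # tokens never seen simply count as 0
```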

So we are going to limit our vocabulary to 60,000 words, things that appear at least twice [39:33]. Here is a simple way to do that. **Use .most_common, pass in the max vocab size.** **That’ll sort it by the frequency and if it appears less often than a minimum frequency, then don’t bother with it at all.** That gives us itos, which is the same name that torchtext used and means integer-to-string. This is just the list of unique tokens in the vocab. **We’ll insert two more tokens: a vocab item for unknown (_unk_) and a vocab item for padding (_pad_).**

In [0]:
max_vocab = 60000
min_freq = 2

In [0]:
itos = [o for o,c in freq.most_common(max_vocab) if c>min_freq] # itos = int-to-string; note that c>min_freq keeps tokens seen more than min_freq times
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

We create a reverse mapping called stoi which is useful to look up the index of a given token. stoi also has the same number of elements as itos. We use a high performance container called collections.defaultdict to store our stoi mapping.

We can then create the dictionary which goes in the opposite direction (string to integer)[40:19]. That won’t cover everything because we intentionally truncated it down to 60,000 words. **If we come across something that is not in the dictionary, we want to replace it with zero for unknown so we can use defaultdict with a lambda function that always returns zero.**

In [0]:
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)
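
Putting the vocabulary steps together on toy data gives the sketch below. The documents are made up, and the filter here uses `count >= min_freq` ("at least min_freq times") as an assumption of this example, which is slightly different from the `c>min_freq` test in the cell above.

```python
from collections import Counter, defaultdict

docs = [
    ['the', 'film', 'was', 'good', '.'],
    ['the', 'plot', 'was', 'thin', '.'],
    ['the', 'end'],
]
freq = Counter(tok for doc in docs for tok in doc)

max_vocab, min_freq = 60000, 2
# Keep frequent tokens only (assumption: "at least min_freq occurrences").
itos = [tok for tok, count in freq.most_common(max_vocab) if count >= min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')          # '_unk_' ends up at index 0, '_pad_' at index 1

# Reverse mapping; any token missing from the vocab maps to 0, i.e. '_unk_'.
stoi = defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

ids = [[stoi[tok] for tok in doc] for doc in docs]
print(itos)
print(ids[2])
```

Rare tokens such as 'end' come out as index 0, so mapping the numbers back through itos gives '_unk_' rather than the original word.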

So now we have our stoi dictionary defined, we can then call that for every word for every sentence [40:50].

In [0]:
trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])

Here is our numericalized version:

In [0]:
' '.join(str(o) for o in trn_lm[0])

Of course, the nice thing is we can save that step as well. Each time we get to another step, we can save it. These are not very big files compared to what you are used to with images. Text is generally pretty small.

It is very important to also save that vocabulary (itos). **The list of numbers means nothing unless you know what each number refers to, and that’s what itos tells you.**

In [0]:
np.save(LM_PATH/'tmp'/'trn_ids.npy', trn_lm)
np.save(LM_PATH/'tmp'/'val_ids.npy', val_lm)
pickle.dump(itos, open(LM_PATH/'tmp'/'itos.pkl', 'wb'))

So you save those three things, and later on you can load them back up.

In [0]:
trn_lm = np.load(LM_PATH/'tmp'/'trn_ids.npy')
val_lm = np.load(LM_PATH/'tmp'/'val_ids.npy')
itos = pickle.load(open(LM_PATH/'tmp'/'itos.pkl', 'rb'))

Now our vocab size is 60,002 and our training language model has 90,000 documents in it.

In [0]:
vs=len(itos)
vs,len(trn_lm)

That’s the preprocessing you do [42:01]. We can probably wrap a little bit more of that in utility functions if we want to, but it’s all pretty straightforward, and that exact code will work for any dataset you have once you’ve got it in that CSV format.


## wikitext103 conversion
We are now going to build an English language model for the IMDB corpus. We could start from scratch and try to learn the structure of the English language. But we use a technique called transfer learning to make this process easier. In transfer learning (a fairly recent idea for NLP), a pre-trained LM that has been trained on a large generic corpus (like Wikipedia articles) can transfer its knowledge to a target LM, and the weights can then be fine-tuned.

Our source LM is the wikitext103 LM created by Stephen Merity @ Salesforce research. Link to dataset The language model for wikitext103 (AWD LSTM) has been pre-trained and the weights can be downloaded here: Link. Our target LM is the IMDB LM.

Here is kind of a new insight that’s not new at all, which is that **we’d like to pre-train something.** **We know from lesson 4 that if we pre-train our classifier by first creating a language model and then fine-tuning that as a classifier, that was helpful. It actually got us a new state-of-the-art result: we got the best IMDb classifier result that had been published, by quite a bit.** We are not going far enough though, because IMDb movie reviews are not that different to any other English document, compared to how different they are to a random string or even to a Chinese document. So just like ImageNet allowed us to train things that recognize stuff that kind of looks like pictures, and we could use it on stuff that had nothing to do with ImageNet, like satellite images, **why don’t we train a language model that’s good at English and then fine-tune it to be good at movie reviews?**

So this basic insight led Jeremy to try building a language model on Wikipedia. Stephen Merity has already processed Wikipedia and taken a subset covering most of it, throwing away the stupid little articles and leaving the bigger articles. He calls that wikitext103. Jeremy grabbed wikitext103 and trained a language model on it. He used exactly the same approach he’s about to show you for training an IMDb language model, but instead he trained a wikitext103 language model. He saved it and made it available for anybody who wants to use it at this URL. **The idea now is let’s train an IMDb language model which starts with these weights.** Hopefully to you folks, this is an extremely obvious, extremely non-controversial idea because it’s basically what we’ve done in nearly every class so far. But when Jeremy first mentioned this to people in the NLP community in June or July of last year, there couldn’t have been less interest, and he was told it was stupid [45:03]. Because Jeremy was obstreperous, he ignored them, even though they know much more about NLP, and tried it anyway. And let’s see what happened.

**wikitext103 conversion** [46:11]
Here is how we do it. Grab the wikitext models. If you do wget -r, it will recursively grab the whole directory which has a few things in it.

In [0]:
# ! wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/
#wikitext 103 language model

The pre-trained LM weights have an embedding size of 400, 1150 hidden units and just 3 layers. We need to match these values with the target IMDB LM so that the weights can be loaded up.

We need to make sure that our language model has exactly the same embedding size, number of hidden, and number of layers as Jeremy’s wikitext one did otherwise you can’t load the weights in.

In [0]:
em_sz,nh,nl = 400,1150,3

Here are our pre-trained path and our pre-trained language model path.

In [0]:
PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'

Let’s go ahead and torch.load in those weights from the forward wikitext103 model. **We don’t normally use torch.load, but that’s the PyTorch way of grabbing a file**. It basically gives you a dictionary containing the name of the layer and a tensor/array of those weights.

**Now the problem is that wikitext language model was built with a certain vocabulary which was not the same as ours** [47:14]. Our #40 is not the same as wikitext103 model’s #40. **So we need to map one to the other.** That’s very very simple because luckily Jeremy saved itos for the wikitext vocab.

In [0]:
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)

We calculate the mean of the layer0 encoder weights. This can be used to assign weights to unknown tokens when we transfer to target IMDB LM.

In [0]:
enc_wgts = to_np(wgts['0.encoder.weight'])
row_m = enc_wgts.mean(0)#average embedding weight across wikitext

**Here is the list of what each word is for wikitext103 model, and we can do the same defaultdict trick to map it in reverse.** We’ll use -1 to mean that it is not in the wikitext dictionary.

In [0]:
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})
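To make the reverse-lookup trick concrete, here is a small sketch with a made-up four-word vocabulary. The names `toy_itos` and `toy_stoi` and the words themselves are illustrative assumptions, not taken from the wikitext103 files.

```python
import collections

# Hypothetical toy vocabulary standing in for itos2 (index -> string).
toy_itos = ['_unk_', '_pad_', 'the', 'movie']

# Reverse it into string -> index; any string not in the vocab maps to -1.
toy_stoi = collections.defaultdict(lambda: -1, {v: k for k, v in enumerate(toy_itos)})

print(toy_stoi['movie'])      # index of a known word
print(toy_stoi['spielberg'])  # -1, because the word is not in the toy vocab
```

One thing to keep in mind: indexing a defaultdict with a missing key also stores that key, so the dictionary grows as you look up unknown words. If that matters, `toy_stoi.get(word, -1)` performs the same lookup without inserting anything.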


Before we try to transfer the knowledge from wikitext to the IMDB LM, we match up the vocab words and their indexes. The defaultdict returns -1 for any IMDB token that does not exist in wikitext103, and those tokens are then assigned the mean embedding weights.

**So now we can just say our new set of weights is just a whole bunch of zeros with vocab size by embedding size (i.e. we are going to create an embedding matrix)** [47:57]. **We then go through every one of the words in our IMDb vocabulary.** **We are going to look it up in stoi2 (string-to-integer for the wikitext103 vocabulary) and see if it’s a word there.** **If that is a word there, then we won’t get the -1. So r will be greater than or equal to zero, so in that case, we will just set that row of the embedding matrix to the weight which was stored inside the named element ‘0.encoder.weight’.** You can look at this dictionary wgts and it’s pretty obvious what each name corresponds to. It looks very similar to the names that you gave it when you set up your module, so here are the encoder weights.

**If we don’t find it [49:02], we will use the row mean, in other words, the average embedding weight across all of wikitext103. So we end up with an embedding matrix where every word that’s in both our IMDb vocabulary and the wikitext103 vocab uses the wikitext103 embedding weights; anything else just uses the average weight from the wikitext103 embedding matrix.**

In [0]:
new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i,w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r>=0 else row_m
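The remapping loop is easier to follow on a tiny example. The sketch below uses a hypothetical three-word "pre-trained" vocabulary with 2-dimensional embeddings; all names and numbers are made up for illustration, and only the logic mirrors the cell above.

```python
import numpy as np

# Hypothetical pre-trained vocab and embedding matrix (one row per word).
pre_itos = ['the', 'movie', 'was']
pre_emb = np.array([[1.0, 2.0], [3.0, 4.0], [8.0, 9.0]], dtype=np.float32)
pre_stoi = {w: i for i, w in enumerate(pre_itos)}
mean_row = pre_emb.mean(0)  # fallback row for words the source vocab lacks

# Target vocab: different ordering, plus a word unseen during pre-training.
tgt_itos = ['movie', 'spielberg', 'the']

new_emb = np.zeros((len(tgt_itos), pre_emb.shape[1]), dtype=np.float32)
for i, w in enumerate(tgt_itos):
    r = pre_stoi.get(w, -1)  # -1 means "not in the pre-trained vocab"
    new_emb[i] = pre_emb[r] if r >= 0 else mean_row

print(new_emb)
```

Row 0 is the pre-trained row for "movie", row 2 is the pre-trained row for "the", and row 1 ("spielberg") falls back to the column-wise mean of the pre-trained matrix.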

We now overwrite the weights into the wgts odict. The decoder module, which we will explore in detail, is also loaded with the same weights due to an idea called weight tying.

We will then replace the encoder weights with new_w turned into a tensor [49:35]. We haven’t talked much about **weight tying, but basically the decoder (the thing that turns the final prediction back into a word) uses exactly the same weights, so we pop it there as well.** Then there is a bit of a weird thing with how we do embedding dropout that ends up with a whole separate copy of them for a reason that doesn’t matter much. So we popped the weights back where they need to go. So this is now a set of torch state which we can load in.

In [0]:
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))
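As a rough illustration of what weight tying means, the NumPy sketch below uses a single matrix both to look up embeddings and to score vocabulary words. The function names, toy sizes, and the missing RNN in between are simplifying assumptions; this is not the fastai implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_size = 5, 3  # toy sizes, not the 400-dim model above

# One shared matrix: row i is the embedding of token i.
shared_w = rng.normal(size=(vocab_size, emb_size)).astype(np.float32)

def encode(token_ids):
    # Embedding lookup: pick the rows for the given token ids.
    return shared_w[token_ids]

def decode(hidden):
    # Tied decoder: score every vocab word with the same matrix (no bias here).
    return hidden @ shared_w.T

hidden = encode(np.array([2, 4]))  # pretend the RNN passed the embeddings through
logits = decode(hidden)
print(logits.shape)  # one score per vocab word for each of the two positions
```

Because the decoder reuses the embedding matrix, the final hidden state has to have the embedding size, and any change to the embeddings also changes the output scores.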

Now that we have the weights prepared, we are ready to create and start training our new IMDB language pytorch model!

## Language model
It is fairly straightforward to create a new language model using the fastai library. Like every other lesson, our model will have a backbone and a custom head. The backbone in our case is the IMDB LM pre-trained with wikitext and the custom head is a linear classifier. In this section we will focus on the backbone LM and the next section will talk about the classifier custom head.

bptt (the sequence length used for backpropagation through time, loosely comparable to the context window of ngrams in traditional NLP LMs) is not fixed in fastai LMs: it is randomly perturbed around 70 on a per-batch basis. This is akin to shuffling our data in computer vision, only that in NLP we cannot shuffle inputs and we have to maintain statefulness.

Since we are predicting words using ngrams, we want our next batch to line up with the end-points of the previous mini-batch's items. The batch size is constant, but the fastai library expands and contracts bptt each mini-batch using a clever stochastic implementation of a batch (original credits attributed to Smerity).
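A rough sketch of how such a per-batch sequence length could be drawn is shown below. The function name `sample_seq_len` and the specific constants (a 5% chance of halving, a standard deviation of 5, a minimum length of 5) are assumptions for illustration, in the spirit of the AWD-LSTM recipe; check the library source for the exact scheme.

```python
import random

def sample_seq_len(bptt=70, rng=random):
    # Mostly use the full bptt; occasionally halve it (assumed 5% of the time).
    base = bptt if rng.random() < 0.95 else bptt / 2.0
    # Jitter around the base length (assumed std. dev. of 5), never below 5.
    return max(5, int(rng.normalvariate(base, 5)))

rng = random.Random(42)  # seeded generator so the run is repeatable
lengths = [sample_seq_len(70, rng) for _ in range(10000)]
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

The batch size never changes here; only the number of time steps per mini-batch does, so each batch still starts exactly where the previous one ended.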

## Language model [50:18]
Let’s create our language model. Basic approach we are going to use is we are going to concatenate all of the documents together into a single list of tokens of length 24,998,320. That is going to be what we pass in as a training set. So for the language model:

- We take all our documents and just concatenate them back to back.
- We are going to be continuously trying to predict what’s the next word after these words.
- We will set up a whole bunch of dropouts.
- Once we have a model data object, we can grab the model from it, so that’s going to give us a learner.
- Then as per usual, we can call learner.fit. We do a single epoch on the last layer just to get that okay. The way it’s set up, the last layer is the embedding weights, because that’s obviously the thing that’s going to be the most wrong: a lot of those embedding weights didn’t even exist in the vocab. So we will train a single epoch of just the embedding weights.
- Then we’ll start doing a few epochs of the full model. How is it looking? In lesson 4, we had the loss of 4.23 after 14 epochs. In this case, we have 4.12 loss after 1 epoch. So by pre-training on wikitext103, we have a better loss after 1 epoch than the best loss we got for the language model otherwise.

Question: What is the wikitext103 model? Is it an AWD LSTM again [52:41]? Yes, we are about to dig into that. The way I trained it was literally the same lines of code that you see above, but without pre-training it on wikitext103.

In [0]:
wd=1e-7
bptt=70
bs=52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

In [0]:
t = len(np.concatenate(trn_lm))
t, t//64

The goal of the LM is to learn to predict a word/token given a preceding set of words (tokens). We take all the movie reviews in both the 90k training set and 10k validation set and concatenate them to form long strings of tokens. In fastai, we use the LanguageModelLoader to create a data loader which makes it easy to create and use bptt sized mini batches. The LanguageModelLoader takes a concatenated string of tokens and returns a loader.

We have a special modeldata object class for LMs called LanguageModelData to which we can pass the training and validation loaders and get in return the model itself.

This is the LanguageModelLoader and I really hope that by now, you’ve learned in your editor or IDE how to jump to symbols [1:02:37]. I don’t want it to be a burden for you to find out what the source code of LanguageModelLoader is. If your editor doesn’t make it easy, don’t use that editor anymore. There’s lots of good free editors that make this easy.

So this is the source code for LanguageModelLoader, and it’s interesting to notice that it’s not doing anything particularly tricky. It’s not deriving from anything at all. What makes something that’s capable of being a data loader is that it’s something you can iterate over.

![alt text](https://cdn-images-1.medium.com/max/1500/1*ttM96lLbHQn06byFwmHj0g.png)

Here is the fit function inside fastai.model [1:03:41]. This is where everything ends up eventually which goes through each epoch, creates an iterator from the data loader, and then just does a for loop through it. So anything you can do a for loop through can be a data loader. Specifically it needs to return tuples of independent and dependent variables for mini-batches.

![alt text](https://cdn-images-1.medium.com/max/1200/1*560U29nWI0xNGLsHgnWFNQ.png)

So anything with an __iter__ method is something that can act as an iterator [1:04:09]. yield is a neat little Python keyword you probably should learn about if you don’t already know it. Basically, it spits out a thing and waits for you to ask for another thing, normally in a for loop or something. In this case, we start by initializing the language model loader, passing in the numbers nums: this is the numericalized long list of all of our documents concatenated together. The first thing we do is to “batchify” it. This is the thing which quite a few of you got confused about last time. If our batch size is 64 and we have 25 million numbers in our list, we are not creating items of length 64; we are creating 64 items in total. So each of them is of size t divided by 64, which is about 390k.
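The batchify step and the "anything you can loop over is a data loader" idea can be sketched in a few lines. `TinyLMLoader` below is a hypothetical, simplified stand-in (fixed bptt, NumPy arrays instead of tensors), not the actual LanguageModelLoader source.

```python
import numpy as np

class TinyLMLoader:
    """Hypothetical, simplified stand-in for LanguageModelLoader (fixed bptt)."""
    def __init__(self, nums, bs, bptt):
        self.bs, self.bptt = bs, bptt
        self.data = self.batchify(np.asarray(nums))

    def batchify(self, nums):
        n = len(nums) // self.bs           # tokens per stream
        nums = nums[:n * self.bs]          # drop the ragged tail
        # bs rows of n tokens each, transposed so each column is one stream
        return nums.reshape(self.bs, n).T  # shape (n, bs)

    def __iter__(self):
        # Each step yields (inputs, targets); targets are inputs shifted by one.
        for i in range(0, len(self.data) - 1, self.bptt):
            seq_len = min(self.bptt, len(self.data) - 1 - i)
            x = self.data[i:i + seq_len]
            y = self.data[i + 1:i + 1 + seq_len]
            yield x, y

dl = TinyLMLoader(np.arange(26), bs=4, bptt=3)
for x, y in dl:
    print(x.shape, y.shape)
```

With 26 tokens and a batch size of 4, we get 4 streams of 6 tokens each (the last 2 tokens are dropped), and every mini-batch is a slice of up to bptt rows across all 4 streams. On the real data the same arithmetic gives 24,998,320 // 64 = 390,598 tokens per stream.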


Find the rest at this link (way too much text for this notebook):

https://medium.com/@hiromi_suenaga/deep-learning-2-part-2-lesson-10-422d87c3340c

In [0]:
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

We set up the dropouts for the model. These values have been chosen after experimentation. If you need to update them for custom LMs, you can change the weighting factor (0.7 here) based on the amount of data you have: with more data you can reduce the dropout factor, and for small datasets you can reduce overfitting by choosing a higher dropout factor. No other dropout value requires tuning.

In [0]:
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

We first tune the last embedding layer so that the missing tokens initialized with mean weights get tuned properly. So we freeze everything except the last layer.

We also keep track of the accuracy metric.

In [0]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.metrics = [accuracy]
learner.freeze_to(-1)

In [0]:
learner.model.load_state_dict(wgts)

We set learning rates and fit our IMDB LM. We first run one epoch to tune the last layer which contains the embedding weights. This should help the missing tokens in the wikitext103 learn better weights.

In [0]:
lr=1e-3
lrs = lr

In [0]:
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)


Note that we print out accuracy and keep track of how often we end up predicting the target word correctly. While this is a good metric to check, it is not part of our loss function as it can get quite bumpy. We only minimize cross-entropy loss in the LM.

The exponential of the cross-entropy loss is called the perplexity of the LM (lower perplexity is better).
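As a quick worked example, the two losses quoted earlier (4.23 in lesson 4, 4.12 here after one epoch) can be converted to perplexities. This assumes the loss is the average per-token cross-entropy measured in nats, which is what PyTorch's cross-entropy returns; the helper name `perplexity` is just for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    # Perplexity is e raised to the average per-token cross-entropy (in nats).
    return math.exp(cross_entropy_loss)

print(round(perplexity(4.23), 1))  # lesson 4 model, loss 4.23
print(round(perplexity(4.12), 1))  # this model after 1 epoch, loss 4.12
```

A drop of 0.11 in loss looks small, but since perplexity is exponential in the loss it corresponds to roughly a 10% reduction in perplexity.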

In [0]:
learner.save('lm_last_ft')

In [0]:
learner.load('lm_last_ft')

In [0]:
learner.unfreeze()

In [0]:
learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear=True)

In [0]:
learner.sched.plot()

In [0]:
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)

In [0]:
learner.save('lm1')

In [0]:
learner.save_encoder('lm1_enc')

In [0]:
learner.sched.plot_loss()

In [0]:
df_trn = pd.read_csv(CLAS_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(CLAS_PATH/'test.csv', header=None, chunksize=chunksize)

In [0]:
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)

In [0]:
(CLAS_PATH/'tmp').mkdir(exist_ok=True)

np.save(CLAS_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(CLAS_PATH/'tmp'/'tok_val.npy', tok_val)

np.save(CLAS_PATH/'tmp'/'trn_labels.npy', trn_labels)
np.save(CLAS_PATH/'tmp'/'val_labels.npy', val_labels)

In [0]:
tok_trn = np.load(CLAS_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(CLAS_PATH/'tmp'/'tok_val.npy')

In [0]:
itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)

In [0]:
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])

np.save(CLAS_PATH/'tmp'/'trn_ids.npy', trn_clas)
np.save(CLAS_PATH/'tmp'/'val_ids.npy', val_clas)

In [0]:
trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy')
val_clas = np.load(CLAS_PATH/'tmp'/'val_ids.npy')

trn_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'trn_labels.npy'))
val_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'val_labels.npy'))

bptt,em_sz,nh,nl = 70,400,1150,3
vs = len(itos)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48

min_lbl = trn_labels.min()
trn_labels -= min_lbl
val_labels -= min_lbl
c=int(trn_labels.max())+1

In [0]:
trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)

In [0]:
# part 1
# Baseline dropouts, kept for reference; the next line overrides them.
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])
dps = np.array([0.4,0.5,0.05,0.3,0.4])*0.5
m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=.25
learn.metrics = [accuracy]

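lr=3e-3
lrm = 2.6
# Geometric schedule of learning rates; the hand-picked values on the next line replace it.
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
# wd is assigned twice; the second value (no weight decay) is the one in effect.
wd = 1e-7
wd = 0
learn.load_encoder('lm1_enc')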

In [0]:
learn.freeze_to(-1)
learn.lr_find(lrs/1000)
learn.sched.plot()

In [0]:
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))


In [0]:
learn.save('clas_0')
learn.load('clas_0')
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

In [0]:
learn.save('clas_1')
learn.load('clas_1')

In [0]:
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))

In [0]:
learn.sched.plot_loss()

In [0]:
learn.save('clas_2')

In [0]:
learn.sched.plot_loss()