I want to illustrate how I come up with meaningful preprocessing when building deep learning NLP models.

I start with two golden rules:

1) Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings
2) Get your vocabulary as close to the embeddings as possible

Some of you might have used standard preprocessing steps such as removing stopwords or stemming when doing word-count-based feature extraction (e.g. TF-IDF). With pre-trained embeddings, skip them. The reason is simple: you lose valuable information that would help your NN figure things out.
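
As a rough illustration of that information loss, the sketch below uses made-up 2-d vectors in place of real pre-trained embeddings and a hypothetical `crude_stem` helper that is far simpler than a real stemmer. It only shows the general effect: words that have distinct vectors collapse onto one vector after stemming.

```python
# Toy 2-d vectors standing in for real pre-trained embeddings (made-up values).
toy_embeddings = {
    "work": (0.9, 0.1),
    "working": (0.7, 0.4),
    "worked": (0.2, 0.8),
}


def crude_stem(word):
    """Hypothetical suffix stripper, much simpler than a real stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


tokens = ["working", "worked", "work"]

# Distinct vectors the model would see with and without stemming.
raw_vectors = {toy_embeddings[t] for t in tokens}
stemmed_vectors = {toy_embeddings[crude_stem(t)] for t in tokens}

print(len(raw_vectors), "distinct vectors without stemming")
print(len(stemmed_vectors), "distinct vectors after stemming")
```

With the three toy words, stemming leaves a single vector, so whatever distinction the embeddings encoded between the word forms is gone before the model sees the text.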

In [10]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

In [6]:
train = pd.read_csv("../input/bug-data/bugs-data.csv")
print("Train shape : ",train.shape)

Train shape :  (2764, 3)


In [7]:
train.head()

Unnamed: 0,bugid,summary,component
0,808252,home and end key not working on web pages open...,Keyboard Navigation
1,1389748,[Linux] default browser firefox not working,Shell Integration
2,1588476,Full page scrolling screenshot not working for...,PDF Viewer
3,1609740,"Redhat Linux EL 6.10 Firefox v68, Set as Deskt...",Shell Integration
4,1617503,Bookmark button not working after system retur...,Bookmarks & History


I will use the following function to track our training vocabulary; it goes through all our text and counts the occurrences of each word.

In [8]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
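
The same counting can also be sketched with `collections.Counter` from the standard library. The name `build_vocab_counter` and the sample sentences are made up for illustration; this is an alternative spelling, not the function used in the rest of the notebook.

```python
from collections import Counter


def build_vocab_counter(sentences):
    """Count word occurrences across a list of tokenised sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # adds 1 for every word in the sentence
    return vocab


# Two made-up, already tokenised summaries.
sample = [["home", "key", "not", "working"], ["end", "key", "not", "working"]]
sample_vocab = build_vocab_counter(sample)
print(sample_vocab.most_common(3))
```

A `Counter` behaves like a dictionary, so it could be passed wherever a word-to-count dictionary is expected.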

So let's populate the vocabulary and display the first 5 elements and their counts. Note that we can now use progress_apply to see a progress bar.

In [11]:
sentences = train["summary"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:5]})

100%|██████████| 2764/2764 [00:00<00:00, 247656.67it/s]
100%|██████████| 2764/2764 [00:00<00:00, 269005.39it/s]

{'home': 15, 'and': 349, 'end': 5, 'key': 20, 'not': 1948}





Next we import the embeddings we want to use in our model later. For illustration I use GoogleNews here.

In [16]:
from gensim.models import KeyedVectors

news_path = '../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin'
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)

Next I define a function that checks the intersection between our vocabulary and the embeddings. It will output a list of out-of-vocabulary (OOV) words that we can use to improve our preprocessing.

In [17]:
import operator 

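def check_coverage(vocab,embeddings_index):
    """
    :param vocab: dictionary of words and their count
    :param embeddings_index: word -> vector lookup that raises KeyError for unknown words
    :return: list of (word, count) pairs without an embedding, most frequent first
    """
    a = {}    # words that have an embedding
    oov = {}  # out-of-vocabulary words and their counts
    k = 0     # number of tokens covered by the embeddings
    i = 0     # number of tokens not covered
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x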

In [18]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 5581/5581 [00:00<00:00, 168178.37it/s]

Found embeddings for 53.70% of vocab
Found embeddings for  81.97% of all text





Only about 54% of our vocabulary has embeddings, which leaves roughly 18% of all tokens in our text more or less useless to the model. So let's have a look and start improving. For this we can simply look at the top OOV words.
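
The two percentages measure different things: the first counts distinct words, the second weights each word by how often it occurs. A minimal sketch with made-up counts and a made-up embedding vocabulary (none of these values come from the bug data) shows how they can diverge:

```python
# Toy word counts and a toy embedding vocabulary (made-up values).
toy_vocab = {"not": 6, "working": 3, "working.": 1}
toy_embedding_words = {"not", "working"}

known = {w: c for w, c in toy_vocab.items() if w in toy_embedding_words}
unknown = {w: c for w, c in toy_vocab.items() if w not in toy_embedding_words}

# Share of distinct words that have an embedding.
vocab_coverage = len(known) / len(toy_vocab)
# Share of all tokens in the text that have an embedding.
text_coverage = sum(known.values()) / sum(toy_vocab.values())

print(f"vocab coverage: {vocab_coverage:.2%}")
print(f"text coverage: {text_coverage:.2%}")
print("oov:", unknown)
```

Here a single rare OOV word costs a third of the vocabulary coverage but only a tenth of the text coverage, which is why frequent OOV words are the ones worth fixing first.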

In [20]:
oov[:20]

[('and', 349),
 ('to', 291),
 ('a', 167),
 ('available/working', 133),
 ('of', 119),
 ('working.', 91),
 ('-', 72),
 ('working,', 43),
 ('3.5', 30),
 ('[e10s]', 21),
 ('add-on', 19),
 ('e10s', 19),
 ('/', 18),
 ('10', 17),
 ('properly.', 12),
 ('"Open', 12),
 ('3.6', 10),
 ('working)', 10),
 (',', 9),
 ('working:', 8)]

At the top are "and", "to", "a" and "of". Why? Simply because these words were removed when the GoogleNews embeddings were trained. We will fix this later; for now we take care of splitting off punctuation, as this also seems to be a problem ("working.", "working," and "working)" all count as separate words). But what do we do with the punctuation then: do we want to delete it or treat it as a token? I would say: it depends. If the token has an embedding, keep it; if it doesn't, we don't need it anymore. So let's check:

In [36]:
'?' in embeddings_index

False

In [37]:
'&' in embeddings_index

True

Interesting. While "&" is in the Google News Embeddings, "?" is not. So we basically define a function that splits off "&" and removes other punctuation.

In [21]:
def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x
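
To decide which characters to split off and which to drop, the membership test above can be run over every punctuation character at once. The sketch below assumes only that the embedding container supports `in`; the helper name `split_punctuation` and the toy set standing in for `embeddings_index` are made up for illustration.

```python
import string


def split_punctuation(embedding_words, candidates=string.punctuation):
    """Return (known, unknown) punctuation strings for an embedding vocabulary."""
    known = "".join(ch for ch in candidates if ch in embedding_words)
    unknown = "".join(ch for ch in candidates if ch not in embedding_words)
    return known, unknown


# Stand-in for embeddings_index: only "&" is assumed to have a vector here.
toy_embedding_words = {"&", "in", "for"}

keep, drop = split_punctuation(toy_embedding_words)
print("keep as tokens:", keep)
print("remove:", drop)
```

Run against the real embeddings, the two strings would tell you which characters belong in the "split off" loop and which in the "remove" loop of `clean_text`.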

In [22]:
train["clean_summary"] = train["summary"].progress_apply(lambda x: clean_text(x))
sentences = train["clean_summary"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 2764/2764 [00:00<00:00, 86782.82it/s]
100%|██████████| 2764/2764 [00:00<00:00, 286956.84it/s]


In [23]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 4390/4390 [00:00<00:00, 209145.88it/s]

Found embeddings for 74.05% of vocab
Found embeddings for  90.30% of all text





Nice! We were able to increase our embeddings ratio from 53% to 74% just by handling punctuation. Next, let's look at the embedding vocabulary itself for clues about what else is mismatched.

In [25]:
for i in range(10):
    print(embeddings_index.index_to_key[i])

</s>
in
for
that
is
on
##
The
with
said


Numbers also seem to be a problem: "3.5", "10" and "3.6" all showed up in the OOV list earlier. The top 10 embeddings printed above give a clue.
There is "##" in there, simply because as a preprocessing step all numbers bigger than 9 have been replaced by hashes. I.e. 15 becomes ## while 123 becomes ### and 15.80€ becomes ##.##€. So let's mimic this preprocessing step to further improve our embeddings coverage.

In [26]:
import re

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
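
To see what these substitutions do in practice, here is a small self-contained check. The function is repeated so the snippet runs on its own; the sample strings are made up for illustration, apart from "e10s", which appears in the OOV list.

```python
import re


def clean_numbers(x):
    # Same substitutions as above, longest digit runs first.
    x = re.sub(r"[0-9]{5,}", "#####", x)
    x = re.sub(r"[0-9]{4}", "####", x)
    x = re.sub(r"[0-9]{3}", "###", x)
    x = re.sub(r"[0-9]{2}", "##", x)
    return x


samples = ["e10s", "Firefox 68", "bug 1389748", "version 9"]
cleaned = [clean_numbers(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Single digits are left untouched, and any run of five or more digits collapses to exactly five hashes, so a long bug ID and a five-digit number end up as the same token.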

In [27]:
train["clean_summary"] = train["clean_summary"].progress_apply(lambda x: clean_numbers(x))
sentences = train["clean_summary"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 2764/2764 [00:00<00:00, 58342.76it/s]
100%|██████████| 2764/2764 [00:00<00:00, 35901.49it/s]
100%|██████████| 2764/2764 [00:00<00:00, 289421.22it/s]


In [28]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 4094/4094 [00:00<00:00, 194401.46it/s]

Found embeddings for 80.09% of vocab
Found embeddings for  92.65% of all text





Nice! Another 6 percentage points. Not as much as with handling the punctuation, but every bit helps. Let's check the OOV words.

In [32]:
oov[:10]

[('and', 353),
 ('to', 301),
 ('a', 169),
 ('of', 119),
 ('e##s', 40),
 ('E##S', 11),
 ('###a1', 9),
 ('firefox2', 6),
 ('userChromecss', 5),
 ('windowopen', 5)]

The top four are exactly the words that are missing from GoogleNews ("and", "to", "a", "of"). Since they have no embedding, we simply remove them from our sentences.

In [33]:
to_remove = {'a','to','of','and'}
sentences = [[word for word in sentence if word not in to_remove] for sentence in tqdm(sentences)]
vocab = build_vocab(sentences)

100%|██████████| 2764/2764 [00:00<00:00, 234843.64it/s]
100%|██████████| 2764/2764 [00:00<00:00, 312001.94it/s]


In [34]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 4090/4090 [00:00<00:00, 149079.29it/s]

Found embeddings for 80.17% of vocab
Found embeddings for  96.11% of all text





In [35]:
oov[:20]

[('e##s', 40),
 ('E##S', 11),
 ('###a1', 9),
 ('firefox2', 6),
 ('userChromecss', 5),
 ('windowopen', 5),
 ('Firefox##', 4),
 ('###a2', 4),
 ('windowfocus', 4),
 ('##b2', 4),
 ('PgUp', 4),
 ('Cannot', 4),
 ('setTimeout', 4),
 ('Quantumbar', 3),
 ('pdfjs', 3),
 ('PageUp', 3),
 ('Forecastfox', 3),
 ('ctrlT', 3),
 ('Autoscroll', 3),
 ('DownThemAll', 3)]

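This looks good now: our vocabulary is about as close to the embeddings as simple preprocessing will get it. What remains is mostly domain-specific terms such as "e##s", "pdfjs" or "setTimeout".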