# Aim: Preprocessing Steps for Pre-Trained Embeddings

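The standard preprocessing steps (stemming, stop-word removal, stripping punctuation) should not be applied blindly when using pre-trained embedding vectors. Our main objective should be to bring the dataset's tokens as close as possible to the vocabulary of the pre-trained embeddings being used.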


1. Blindly applying stemming or lemmatization changes the words, which can reduce word coverage.
2. Simply removing punctuation doesn't help. Take the following example:
   GloVe has separate vectors for it and 's, so removing ' from it's produces its, a different word.
   Other contractions end up as tokens with no embedding at all (shouldn't becomes shouldnt).
3. Numbers larger than 9 can be replaced by # characters, one per digit.
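
The first point can be illustrated with a toy example. The sketch below uses a tiny hand-made vocabulary as a stand-in for the real embedding index (an assumption made purely for illustration) and compares the coverage of raw tokens with that of their stemmed forms, as they appear in the stemmed column shown later in this notebook. The real GloVe vocabulary does contain some stems, so the actual drop is smaller than in this toy case.

```python
# Toy stand-in for the embedding vocabulary (hypothetical, for illustration only).
toy_vocab = {"the", "two", "cultures", "statistics", "machine", "learning"}

raw_tokens = ["the", "two", "cultures", "statistics", "machine", "learning"]
# Stemmed forms as they appear in the procesed_text column.
stemmed_tokens = ["the", "two", "cultur", "statist", "machin", "learn"]


def coverage(tokens, vocab):
    """Fraction of tokens that have an entry in vocab."""
    return sum(token in vocab for token in tokens) / len(tokens)


raw_cov = coverage(raw_tokens, toy_vocab)
stemmed_cov = coverage(stemmed_tokens, toy_vocab)
print(f"raw: {raw_cov:.0%}, stemmed: {stemmed_cov:.0%}")
```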


Things covered in this notebook:
1. Correcting misspellings (two methods)
2. Adding spaces around punctuation
3. Mapping punctuation and special characters
4. Mapping numbers larger than 9 to #'s
5. Expanding contractions

I have used the Stack Overflow dataset here.




In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import operator
import string 
from tqdm import tqdm
import re




glove_filepath = '../../../kaggle/embedding/glove.6B/glove.6B.100d.txt'
    
    

# Loading the embeddings from glove 100d 

In [12]:
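def load_embed(file):
    """Read a GloVe text file into a dict mapping token -> float32 vector."""

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    # The GloVe files are UTF-8; 'latin' never fails to decode but garbles
    # non-ASCII tokens. It is kept so that the counts below stay reproducible.
    with open(file, encoding='latin') as f:
        embeddings_index = dict(get_coefs(*line.rstrip('\n').split(" ")) for line in f)

    return embeddings_index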

In [13]:
# %%timeit
word_matrix = load_embed(glove_filepath)

In [14]:
print(len(word_matrix.keys()))    

400000


### We have vector embeddings for 400,000 tokens
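
To make the file layout concrete, here is a minimal sketch of the parsing step on two made-up lines. It assumes the usual "token v1 v2 ... vn" layout with single spaces, uses plain Python lists instead of NumPy arrays, and the values are invented for illustration.

```python
import io

# Two made-up lines in the "token v1 v2 ... vn" layout (values are illustrative).
sample = io.StringIO("the 0.1 -0.2 0.3\n's 0.4 0.5 -0.6\n")


def parse_embeddings(handle):
    """Map each token to its list of float components."""
    index = {}
    for line in handle:
        token, *values = line.rstrip("\n").split(" ")
        index[token] = [float(v) for v in values]
    return index


toy_index = parse_embeddings(sample)
print(len(toy_index), toy_index["'s"])
```

Note that 's is stored as a token of its own, which matters for the apostrophe handling later on.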

## Let's take a close look at the different words, digits and other special characters, if any, in the GloVe embeddings

In [15]:
'''
We need to know which of the following are present:

1. numbers
2. punctuation
3. emojis
4. words
5. characters
'''
numbers = []
words = []
chars = []
punctuations = []
not_ascii = []
words_with_apostrophe = []





def is_char(x):
    # True when x is a single ASCII letter (A-Z or a-z)
    x = ord(x)
    return 65 <= x <= 90 or 97 <= x <= 122


for word in tqdm(word_matrix.keys()):

    if word.isdigit():
        numbers.append(word)

    elif len(word) == 1 and not is_char(word):
        punctuations.append(word)

    elif len(word) == 1 and is_char(word):
        chars.append(word)

    elif not word.isascii():
        not_ascii.append(word)

    elif word[0] == "'":
        words_with_apostrophe.append(word)

    else:
        words.append(word)
    

100%|██████████| 400000/400000 [00:00<00:00, 564933.56it/s]


In [16]:
print(f'number of digits {len(numbers)}')
print(f'number of characters {len(chars)}')
print(f'number of not_ascii {len(not_ascii)}')
print(f'number of punctuations {len(punctuations)}')
print(f'number of words {len(words)}')
print(f'number of words with apostrophe {len(words_with_apostrophe)}')

number of digits 3124
number of characters 26
number of not_ascii 9966
number of punctuations 32
number of words 386743
number of words with apostrophe 109
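
The bucketing rules can be tried on a handful of tokens. The sketch below is a compact restatement of the loop, checking the conditions in the same order; the token list is made up for illustration and `classify` is a hypothetical helper name.

```python
from collections import Counter


def classify(token):
    """Return the bucket a token falls into, checking conditions in the loop's order."""
    if token.isdigit():
        return "number"
    if len(token) == 1:
        is_letter = token.isascii() and token.isalpha()
        return "char" if is_letter else "punctuation"
    if not token.isascii():
        return "not_ascii"
    if token.startswith("'"):
        return "apostrophe"
    return "word"


toy_tokens = ["42", "a", "?", "café", "'s", "model", "'70s"]
counts = Counter(classify(t) for t in toy_tokens)
print(counts)
```

Because the digit check comes first and the apostrophe check comes late, a token such as '70s lands in the apostrophe bucket rather than with the numbers.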


### Chars

## GloVe has all 26 letters

In [18]:
for i in chars:
    print(i,end = ' ')

a i e s b x c r d m t v g p q y n f k o l w h u j z 

### Words or letters starting with an apostrophe

In [19]:
for i in words_with_apostrophe:
    print(i,end = ' ')

's '' 're 've 'm 'll 'd 'em '70s '60s '80s '90s 'n' '50s 'n '96 '40s '98 '30s '97 '95 'cause '99 '94 'til '92 '20s '93 '08 '06 '04 '88 '91 '86 '07 '05 '89 '68 '87 '84 '90 '09 '03 '67 '72 '85 '69 '76 '02 '78 'twas '82 '74 '64 '79 '57 '01 '83 '66 '60 '75 '10 '65 'till '81 '73 '77 '70 '80 '62 '59 '71 '63 '58 '00 '56 '50 '61 '55 '48 '49 '54 '51 '30 '27 '11 '12 '53 '13 '47 '32 '52 '46 '20 '36 '45 '29 '34 '40 '39 '28 '38 '42 '41 '15 '37 '25 '14 '44 

## Method to Check Coverage

In [28]:
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

In [29]:
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab



###### Check whether unprocessed text improves our word coverage for the GloVe embeddings

In [30]:
def get_unknown_words(text):
    
    vocab_raw = build_vocab(text)
    unk_words_raw = check_coverage(vocab_raw,word_matrix)
    
    return unk_words_raw
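
The two numbers printed by check_coverage measure different things: vocab coverage counts unique tokens, while text coverage weights each token by how often it occurs. A minimal worked example, with made-up counts and a hypothetical two-word embedding index, shows how far apart they can be.

```python
toy_vocab_counts = {"the": 50, "model": 30, "mathbf": 3, "rnorm": 2, "lmer": 1}
toy_embeddings = {"the": [0.1], "model": [0.2]}  # hypothetical one-dimensional vectors

known = [w for w in toy_vocab_counts if w in toy_embeddings]
# Share of unique tokens that have a vector.
vocab_coverage = len(known) / len(toy_vocab_counts)
# Share of all token occurrences that have a vector.
text_coverage = sum(toy_vocab_counts[w] for w in known) / sum(toy_vocab_counts.values())
print(f"vocab: {vocab_coverage:.2%}, text: {text_coverage:.2%}")
```

A few very frequent known words are enough to push text coverage high even when most of the rare tokens are unknown.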
    

## Stackoverflow Dataset
### The procesed_text column is the result of applying the old preprocessing steps, which include removing stopwords and HTML tags and stemming. It is included for comparison with the new preprocessing pipeline.

In [25]:
import pickle as pk

dataset = pk.load(open('df_questions','rb'))


In [26]:
dataset.head()

| | Id | OwnerUserId | CreationDate | Score | Title | Body | Text | Tags | procesed_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 5.0 | 2010-07-19T19:14:44Z | 272 | The Two Cultures: statistics vs. machine learn... | Last year, I read a blog post from Brendan O'C... | The Two Cultures: statistics vs. machine learn... | [machine-learning] | the two cultur statist machin learn last year... |
| 1 | 21 | 59.0 | 2010-07-19T19:24:36Z | 4 | Forecasting demographic census | What are some of the ways to forecast demograp... | Forecasting demographic census What are some o... | [forecasting] | forecast demograph censu what way forecast de... |
| 2 | 22 | 66.0 | 2010-07-19T19:25:39Z | 208 | Bayesian and frequentist reasoning in plain En... | How would you describe in plain English the ch... | Bayesian and frequentist reasoning in plain En... | [bayesian] | bayesian frequentist reason plain english how... |
| 3 | 31 | 13.0 | 2010-07-19T19:28:44Z | 138 | What is the meaning of p values and t values i... | After taking a statistics course and then tryi... | What is the meaning of p values and t values i... | [hypothesis-testing, t-test, p-value, interpre... | what mean valu valu statist test after take s... |
| 4 | 36 | 8.0 | 2010-07-19T19:31:47Z | 58 | Examples for teaching: Correlation does not me... | There is an old saying: "Correlation does not ... | Examples for teaching: Correlation does not me... | [correlation] | exampl teach correl causat there correl causa... |


# Let's check the coverage on procesed_text

In [31]:
unk_words_raw = get_unknown_words(dataset['procesed_text'])

Found embeddings for 10.78% of vocab
Found embeddings for  74.24% of all text


# Now let's check the coverage on the unprocessed text

## The unprocessed text will consist of title + body


In [192]:
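# No separator is added here, so the last word of a title fuses with the first
# word of its body (a likely source of 'regressioni' and 'modeli' among the
# unknown words below). dataset['Title'] + ' ' + dataset['Body'] would avoid this.
unprocessed_text = dataset['Title'] + dataset['Body']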

In [193]:
unk_words_raw = get_unknown_words(unprocessed_text)

Found embeddings for 4.66% of vocab
Found embeddings for  70.01% of all text


### With the old preprocessing style we were able to increase text coverage by about 4 points (70.01% to 74.24%) and vocab coverage by about 6 points (4.66% to 10.78%)

## Let's take a look at the unknown words

In [36]:
unk_words_raw[0:4]

[('I', 280153), ('The', 40999), ('&lt;-', 28677), ("I'm", 23018)]

### Lowercasing all words

In [194]:
unprocessed_text_lower = unprocessed_text.map(lambda x : x.lower())
unk_words_raw = get_unknown_words(unprocessed_text_lower)

Found embeddings for 5.75% of vocab
Found embeddings for  78.42% of all text


### Just lowercasing the words increased text coverage by about 8 points (70.01% to 78.42%). We could have used a GloVe embedding with a larger vocabulary, but that would also increase the memory requirement.

In [38]:
for i in unk_words_raw[0:50]:
    print(i,end = ' ')

('&lt;-', 28677) ("i'm", 23332) ('0,', 18219) ('$$', 12435) ('however,', 11696) ("don't", 10849) ('1l,', 10619) ('1,', 10428) ('2l,', 8909) ('&gt;', 8486) ("i've", 8459) ('data.', 6631) ('example,', 6061) ("it's", 5969) ('data,', 5330) ('so,', 5258) ('model.', 4935) ('&lt;', 4852) ('$x$', 4734) ('2,', 4423) ('is,', 4208) ("can't", 4150) ('is:', 4109) ('\\\\', 4046) ('0l,', 4017) ("doesn't", 3995) ("i'd", 3970) ('this?', 3957) ('variables.', 3926) ('model,', 3925) ('3l,', 3719) ('(or', 3597) ('1)', 3512) ('(i', 3399) ('$n$', 3328) ('it.', 3282) ('na,', 3263) ('distribution.', 3259) ('$y$', 3221) ('now,', 3002) ('this:', 2964) ('\\sim', 2875) ('3,', 2777) ("let's", 2773) ('variable.', 2770) ('variables,', 2766) ('(the', 2749) ('(e.g.', 2731) ('(i.e.', 2712) ('(and', 2690) 

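### Adding spaces around punctuation

Spaces are added around every punctuation mark in the list below. The apostrophe only gets a space in front of it, so that it's becomes it 's and matches GloVe's 's token, and periods are removed rather than spaced.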

In [195]:
import re


puncts = ['´',',', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]


# no_spacing = ["'",'.','-']

to_replace = '´'


def clean_text(x):
    
    x = str(x)
    
    #will convert ´ to '
    
    if to_replace in x:
        x = x.replace(to_replace,"'")
    
    
    for punct in puncts:
        if punct  == "'":
            x = x.replace(punct, f' {punct}')
        else:
            x = x.replace(punct, f' {punct} ')
    
 
    #will convert e.g to eg
    if '.' in x:
        x = x.replace('.','')
        
       
    return x

In [197]:
cleaned_lower_text = unprocessed_text_lower.map(clean_text)
unk_words_raw = get_unknown_words(cleaned_lower_text)

Found embeddings for 14.20% of vocab
Found embeddings for  95.67% of all text
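
To see what the spacing step does to a single sentence, here is a reduced sketch. It assumes a much shorter punctuation list than the one used above, and `space_punctuation` is a hypothetical name for the same logic as clean_text.

```python
toy_puncts = [",", "(", ")", "'", "?", ":"]  # small subset, for illustration only


def space_punctuation(x):
    """Space out punctuation, keep the apostrophe attached to what follows, drop periods."""
    for punct in toy_puncts:
        if punct == "'":
            x = x.replace(punct, f" {punct}")
        else:
            x = x.replace(punct, f" {punct} ")
    return x.replace(".", "")


tokens = space_punctuation("however, it's fine (e.g. this).").split()
print(tokens)
```

Tokens such as however, and (e.g. were among the most frequent unknown words earlier; after spacing they split into pieces that are far more likely to have a vector.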


### Wow! We now have 95.67% text coverage, but we cover only 14.20% of the unique words. Let's take a look at the words with no embeddings.

In [74]:
unk_words_raw[0:10]

[('â', 27435),
 ('mathbf', 7528),
 ('\x80\x99', 5908),
 ('\x80\x98', 5323),
 ('sqrt', 5090),
 ('0l', 4190),
 ('mathbb', 3668),
 ('dataframe', 2783),
 ('infty', 2765),
 ('mathcal', 2670)]

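#### 'â' followed by '\x80\x98' or '\x80\x99' is not a word: it is a curly quote (‘ or ’) whose UTF-8 bytes were decoded with a single-byte encoding. These will need to be taken care of as well.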
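
The 'â', '\x80\x98' and '\x80\x99' tokens look like mojibake, that is, UTF-8 bytes of curly quotes that were read as Latin-1. A minimal sketch of one way to undo this is shown below; it assumes the text really was mis-decoded as Latin-1 (other encodings would need a different round trip), and `fix_mojibake` is a hypothetical helper, not part of the pipeline above. It should be applied before any punctuation spacing, while the three characters are still adjacent.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was decoded as Latin-1; return the input unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "donâ\x80\x99t"
repaired = fix_mojibake(broken)
print(repaired)
```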

# Another thing we could do is replace contractions with their expanded forms



# Contractions

In [51]:
contraction_mapping = {"ain't": "is not", 
                       "aren't": "are not",
                       "can't": "cannot", 
                       "'cause": "because",
                       "could've": "could have", 
                       "couldn't": "could not", 
                       "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": 
"what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [52]:


def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known



In [53]:
print("- Known Contractions -")
print("   Glove :")
print(known_contractions(word_matrix))


- Known Contractions -
   Glove :
["'cause", "ma'am", "o'clock"]


#### We have embeddings for only 3 contractions from our list 

# Contractions in raw data

In [76]:
vocab_raw = build_vocab(unprocessed_text)
contaction_list = known_contractions(vocab_raw)

In [77]:
len(contaction_list)

65

In [78]:
contaction_list[0:6]

["ain't", "aren't", "can't", "'cause", "could've", "couldn't"]

In [79]:
for i in contaction_list:
    if i in word_matrix:
        print(i)

'cause


# Only one contraction was found in GloVe

## Cleaning contractions

In [80]:


def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

text_no_contractions = unprocessed_text.map(lambda x : clean_contractions(x,contraction_mapping))

In [81]:
unk_words_raw = get_unknown_words(text_no_contractions)

Found embeddings for 4.66% of vocab
Found embeddings for  70.46% of all text


### Almost no improvement (70.01% to 70.46% of all text). Using embeddings that include more contractions may help.
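
It helps to see exactly which tokens the contraction lookup touches. The sketch below repeats the clean_contractions logic with a two-entry subset of the mapping (chosen for illustration) and shows that the lookup only fires on exact, whitespace-delimited matches.

```python
toy_mapping = {"can't": "cannot", "don't": "do not"}  # subset of contraction_mapping


def clean_contractions(text, mapping):
    """Normalise apostrophe variants, then expand tokens that match a mapping key exactly."""
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return " ".join(mapping.get(t, t) for t in text.split(" "))


expanded = clean_contractions("i can’t say, but don't, ok", toy_mapping)
print(expanded)
```

Here can’t is expanded, but don't, is left alone because of the trailing comma. Keys are also case-sensitive, so a token such as Don't would not match either.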

# PUNCTUATION MAPPING

The step below must not be confused with the earlier one: there we only separated punctuation from words with spaces, while here
we map special characters to equivalents that have embeddings. It is like looking up the embedding of a synonym of a word.
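
As a small illustration of the mapping idea, the sketch below applies a five-entry subset of such a mapping to a made-up string. The subset and the sample text are assumptions for illustration.

```python
toy_punct_mapping = {"’": "'", "–": "-", "²": "2", "∞": "infinity", "α": "alpha"}


def punct_mapper(text, mapping):
    """Replace every special character in mapping with its plain-text equivalent."""
    for p in mapping:
        text = text.replace(p, mapping[p])
    return text


mapped = punct_mapper("it’s α² – ∞", toy_punct_mapping)
print(mapped)
```

Each replacement produces characters or words that are far more likely to exist in the embedding vocabulary than the original symbol.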

In [92]:

# punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'

punct = punctuations

punct_mapping = {
    "‘": "'", "’": "'", "´": "'", "`": "'",
    '“': '"', '”': '"',
    "—": "-", "–": "-", "−": "-", "_": "-",
    "₹": "e", "€": "e", "£": "e", "™": "tm",
    "°": "", "∅": "",
    "√": " sqrt ", "×": "x", "÷": "/", "²": "2", "³": "3",
    "∞": "infinity", "θ": "theta", "α": "alpha", "β": "beta", "π": "pi",
    "•": ".", "à": "a",
}



In [93]:

def punct_mapper(text, mapping):
    
    for p in mapping:
        text = text.replace(p, mapping[p])    
    return text

In [94]:
text_punt_map = unprocessed_text_lower.map(lambda x : punct_mapper(x,punct_mapping))

In [95]:
unk_words_raw = get_unknown_words(text_punt_map)

Found embeddings for 14.41% of vocab
Found embeddings for  95.93% of all text


In [96]:
for i in unk_words_raw[0:40]:
    print(i,end = ' ')

('â', 27435) ('mathbf', 7528) ('\x80\x99', 5908) ('\x80\x98', 5323) ('sqrt', 5090) ('0l', 4190) ('mathbb', 3668) ('dataframe', 2783) ('infty', 2765) ('mathcal', 2670) ('nbsp', 2627) ('rnorm', 2620) ('boldsymbol', 2373) ('î', 2272) ('mathrm', 2145) ('leq', 2136) ('ldots', 2108) ('±', 2019) ('lmer', 2017) ('coef', 1576) ('datai', 1515) ('00000', 1492) ('covariate', 1374) ('mydata', 1319) ('8l', 1314) ('\x80\x9d', 1285) ('asfactor', 1266) ('varepsilon', 1264) ('cbind', 1196) ('signif', 1190) ('regressioni', 1164) ('nrow', 1150) ('\x95\x90', 1148) ('bmatrix', 1120) ('setseed', 1095) ('lme4', 1086) ('rightarrow', 1076) ('ï', 1067) ('\x80\x99s', 999) ('modeli', 977) 

# Correcting Misspelled Words

In [97]:
mispell_dict = {'colour': 'color', 
                'centre': 'center', 
                'favourite': 'favorite', 
                'travelling': 'traveling',
                'counselling': 'counseling',
                'theatre': 'theater', 'cancelled':
                'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

In [98]:

def correct_spelling(x, dic):
    for word in dic.keys():
        x = x.replace(word, dic[word])
        
    return x



In [99]:
text_mispelled = unprocessed_text_lower.map(lambda x : correct_spelling(x,mispell_dict))

In [100]:
oov_glove = get_unknown_words(text_mispelled)

Found embeddings for 14.40% of vocab
Found embeddings for  95.92% of all text


# Dictionary-based spelling correction doesn't help much; another method is used later
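
One thing worth knowing about correct_spelling is that str.replace works on substrings, so a dictionary key can also be rewritten inside a longer word. The sketch below contrasts that behaviour with a whole-word variant built on regex word boundaries; `replace_whole_words` is a hypothetical alternative, not the method used in this notebook.

```python
import re

toy_fixes = {"colour": "color", "centre": "center"}  # two entries from the dictionary above


def replace_substrings(text, fixes):
    """Plain substring replacement, as in correct_spelling."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


def replace_whole_words(text, fixes):
    """Replace only whole-word matches, using regex word boundaries."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fixes)) + r")\b")
    return pattern.sub(lambda m: fixes[m.group(1)], text)


sample = "the centred colour"
naive = replace_substrings(sample, toy_fixes)
bounded = replace_whole_words(sample, toy_fixes)
print(naive)
print(bounded)
```

The substring version turns centred into centerd, while the word-boundary version leaves it untouched and still fixes colour.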

# Taking care of numbers

In [128]:
def clean_numbers(x):
    
    
    # Longer runs are consumed three digits at a time, then two; single digits are kept.
#     x = re.sub(r'[0-9]{5}', '#####', x)
#     x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', ' ### ', x)
    x = re.sub(r'[0-9]{2}', ' ## ', x)
    
    return x
    


In [199]:
text_clean_numbers = cleaned_lower_text.map(clean_numbers)
ov_glove = get_unknown_words(text_clean_numbers)

Found embeddings for 31.52% of vocab
Found embeddings for  97.37% of all text
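
A short sketch on a made-up sentence shows what the two substitutions do, including one side effect: a four-digit number is split into a ### token and a leftover digit. The function body mirrors clean_numbers; the sample sentence is an assumption.

```python
import re


def clean_numbers(x):
    """Replace runs of three digits with ' ### ', then runs of two digits with ' ## '."""
    x = re.sub(r"[0-9]{3}", " ### ", x)
    x = re.sub(r"[0-9]{2}", " ## ", x)
    return x


number_tokens = clean_numbers("in 2010 we had 15 items and 7 left").split()
print(number_tokens)
```

Here 2010 becomes ### followed by 0, 15 becomes ##, and the single digit 7 is left as it is.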


In [130]:
ov_glove

[('â', 27435),
 ('mathbf', 7528),
 ('\x80\x99', 5908),
 ('\x80\x98', 5327),
 ('sqrt', 5092),
 ('0l', 4758),
 ('mathbb', 3668),
 ('dataframe', 2783),
 ('infty', 2765),
 ('mathcal', 2670),
 ('nbsp', 2627),
 ('rnorm', 2620),
 ('boldsymbol', 2373),
 ('î', 2272),
 ('leq', 2154),
 ('mathrm', 2145),
 ('ldots', 2111),
 ('lmer', 2043),
 ('±', 2019),
 ('8l', 1612),
 ('coef', 1576),
 ('datai', 1515),
 ('covariate', 1374),
 ('mydata', 1319),
 ('\x80\x9d', 1285),
 ('asfactor', 1266),
 ('varepsilon', 1264),
 ('²', 1202),
 ('cbind', 1196),
 ('signif', 1190),
 ('regressioni', 1164),
 ('nrow', 1150),
 ('\x95\x90', 1148),
 ('bmatrix', 1120),
 ('setseed', 1095),
 ('lme4', 1086),
 ('rightarrow', 1078),
 ('ï', 1067),
 ('\x80\x99s', 999),
 ('\x80\x93', 986),
 ('modeli', 977),
 ('rmse', 902),
 ('glmer', 886),
 ('cdots', 870),
 ('runif', 857),
 ('reml', 811),
 ('ncol', 803),
 ('garch', 798),
 ('dnorm', 790),
 ('\x80', 761),
 ('dfrac', 752),
 ('aov', 741),
 ('randomforest', 740),
 ('textbf', 737),
 ('loglik', 

## Preprocessing by Andrew Lukyanenko

The code in this cell is taken from a notebook by Andrew Lukyanenko.

In [121]:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

def clean_text(x):
    x = str(x)
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

def clean_numbers(x):
    
#     x = re.sub('[0-9]{5,}', '#####', x)
#     x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

mispell_dict = {"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll": "we will",
"didn't": "did not",
"tryin'":"trying"}

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispellings, mispellings_re = _get_mispell(mispell_dict)


def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)

# Clean the text
cleaned_text = unprocessed_text.apply(lambda x: clean_text(x.lower()))

# Clean numbers
clean_text_number = cleaned_text.apply(lambda x: clean_numbers(x))

# Clean spellings
clean_text_number_misspelling = clean_text_number.apply(lambda x: replace_typical_misspell(x))

ov_glove = get_unknown_words(clean_text_number_misspelling)


Found embeddings for 36.39% of vocab
Found embeddings for  96.45% of all text


Text coverage is lowered by about 1 point (97.37% to 96.45%), but vocab coverage has increased by almost 5 points (31.52% to 36.39%).

In [119]:
print(f'number of out of vocab words {len(ov_glove)}')
full_voacb = build_vocab(clean_text_number_misspelling)
print(f'entire  vocab words {len(full_voacb)}')

number of out of vocab words 69607
entire  vocab words 109825


# My old method without stemming or removing stopwords 
## This method handles more punctuation marks

In [180]:


def clean_text(x):
    
    # remove html tags
    regex = re.compile('<.*?>')
    text = re.sub(regex, '', x)

    # remove punctuation and digits; note that a lone '[' or ']' is not matched
    # by this pattern (see 'right]' and 'x[' among the unknown words below)
    text = re.sub('[!@#$%^&*()\n_:><?\-.{}|+-,;""``~`—]|[0-9]|/|=|\[\]|\[\[\]\|\\|//]',' ',text)
    text = re.sub('[“’\']','',text)   
    text = text.replace('\\','')

    return text.lower()




tmp_sent  = "AAAAAA <html> <h1> run <i>running</i> ban banned dancing dance 1 2 3  4   5 5  5 !@#$%^&*(){{:><<< MMM<>?PLOKIU}} </h1> </html>"


print(clean_text(tmp_sent))


processed_text_1 = unprocessed_text.map(clean_text)




aaaaaa   run running ban banned dancing dance                                  plokiu    


In [181]:
ov_glove = get_unknown_words(processed_text_1)

Found embeddings for 34.95% of vocab
Found embeddings for  97.06% of all text


In [182]:
ov_glove

[('mathbf', 7039),
 ('sqrt', 4979),
 ('mathbb', 3273),
 ('nbsp', 2627),
 ('rnorm', 2622),
 ('infty', 2515),
 ('mathcal', 2411),
 ('lmer', 2230),
 ('ldots', 2070),
 ('boldsymbol', 2057),
 ('mathrm', 2054),
 ('leq', 1793),
 ('coef', 1643),
 ('covariate', 1420),
 ('right]', 1277),
 ('mydata', 1254),
 ('datai', 1230),
 ('cbind', 1201),
 ('signif', 1191),
 ('varepsilon', 1163),
 ('nrow', 1145),
 ('bmatrix', 1120),
 ('glmnet', 963),
 ('aov', 961),
 ('glmer', 959),
 ('regressioni', 939),
 ('rmse', 926),
 ('â±', 890),
 (']]', 884),
 ('rightarrow', 857),
 ('runif', 857),
 ('garch', 833),
 ('reml', 813),
 ('cdots', 812),
 ('ncol', 801),
 ('dnorm', 791),
 ('modeli', 788),
 ('chisq', 767),
 ('randomforest', 754),
 ('x[', 745),
 ('shouldnt', 739),
 ('ggplot', 727),
 ('loglik', 724),
 ('dfrac', 721),
 ('surv', 717),
 ('textbf', 709),
 ('î²', 697),
 ('pmatrix', 692),
 ('multicollinearity', 650),
 ('newdata', 648),
 ('i]', 644),
 ('variablesi', 585),
 ('resid', 583),
 ('nlme', 580),
 ('nanâ±nan', 578)

# Looks like blindly applying stemming or lemmatization hurts the coverage the most

### glove 800 with lowercasing reached a final vocab coverage of 43% and a text coverage of 98.51%

## Final pipeline:

In [9]:

####helper functions 


def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab



def get_unknown_words(text):
    
    vocab_raw = build_vocab(text)
    unk_words_raw = check_coverage(vocab_raw,word_matrix)
    
    return unk_words_raw


numbers = []
words = []
chars = []
punctuations = []
not_ascii = []
words_with_apostrophe = []





def is_char(x):
    
    x = ord(x)
    
    if x >=65 and x<=90 or x>= 97 and x <= 122:
        return True
        #meaning its a character 
    else:
        return False
        #not a character 
        
        

for word in tqdm(word_matrix.keys()):
    
     if word.isdigit():
            numbers.append(word)
     
     elif  len(word) == 1 and not is_char(word):
            punctuations.append(word)
            
     elif len(word) == 1 and is_char(word):
            chars.append(word)
     
     elif not word.isascii():
            not_ascii.append(word)
            
     elif word[0] == "'":
          words_with_apostrophe.append(word)
     else:
            words.append(word)   
    


In [189]:


# adding space to punctuations 
import re


puncts = ['´',',', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]


# no_spacing = ["'",'.','-']

to_replace = '´'


def clean_text(x):
    
    x = str(x)
    
    #will convert ´ to '
    
    if to_replace in x:
        x = x.replace(to_replace,"'")
    
    
    for punct in puncts:
        if punct  == "'":
            x = x.replace(punct, f' {punct}')
        else:
            x = x.replace(punct, f' {punct} ')
    
 
    #will convert e.g to eg
    if '.' in x:
        x = x.replace('.','')
        
    
    # have added lower 
    return x.lower()


cleaned_text = unprocessed_text.map(clean_text)

# mapping punctuations 
punct_mapping = {"‘": "'", "₹": "e", "´": "'", 
                 "°": "", "€": "e", "™": "tm", 
                 "√": " sqrt ", "×": "x", "²": 
                 "2", "—": "-", "–": "-", "’": 
                 "'", "_": "-", 
                 "`": "'", '“': 
                 '"', '”': '"', 
                 '“': '"', "£": 
                 "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', }



def punct_mapper(text, mapping):
    
    for p in mapping:
        text = text.replace(p, mapping[p])    
    return text


# cleaned_text = cleaned_text.map(lambda x: punct_mapper(x,punct_mapping))

# mapping digits to #'s
def clean_numbers(x):
    
    
    # Longer runs are consumed three digits at a time, then two; single digits are kept.
#     x = re.sub(r'[0-9]{5}', '#####', x)
#     x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', ' ### ', x)
    x = re.sub(r'[0-9]{2}', ' ## ', x)
    
    return x
    

cleaned_text = cleaned_text.map(clean_numbers)
ov_glove = get_unknown_words(cleaned_text)


Found embeddings for 31.54% of vocab
Found embeddings for  97.37% of all text


## Spelling Correction Method 2 

This method requires loading a word2vec model, so it needs a lot of memory.
This way we can fall back on stemming, lemmatization and spelling correction when we don't find a vector embedding for a
given word.
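
The fallback idea can be sketched as a small lookup cascade: try the word as is, then try a list of transformed forms in order, and stop at the first one that has a vector. Everything below is a hypothetical stand-in (a toy index and two crude normalisers); in practice the fallbacks would be a stemmer, a lemmatizer and the spelling corrector defined in the next cells.

```python
def lookup_with_fallbacks(word, embeddings, fallbacks):
    """Return (matched_key, vector) for word or its first fallback form that has a vector."""
    if word in embeddings:
        return word, embeddings[word]
    for transform in fallbacks:
        candidate = transform(word)
        if candidate in embeddings:
            return candidate, embeddings[candidate]
    return None, None


# Hypothetical stand-ins: a toy index and two crude normalisers.
toy_embeddings = {"run": [0.1], "model": [0.2]}
fallbacks = [str.lower, lambda w: w.lower().rstrip("s")]

print(lookup_with_fallbacks("Models", toy_embeddings, fallbacks))
print(lookup_with_fallbacks("rnorm", toy_embeddings, fallbacks))
```

Ordering the fallbacks from least to most aggressive keeps the matched key as close as possible to the original token.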





In [212]:
#imports 
from __future__ import absolute_import, division

import os
import time
import numpy as np
import pandas as pd
import gensim
from tqdm import tqdm

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk import word_tokenize



### stemmers 


from nltk.stem import PorterStemmer
ps = PorterStemmer()


from nltk.stem.lancaster import LancasterStemmer
lc = LancasterStemmer()


from nltk.stem import SnowballStemmer
sb = SnowballStemmer("english")



wnl = WordNetLemmatizer()



import gc

import sys
from os.path import dirname

import spacy

'runing '

In [203]:
path = '../../../kaggle/embedding/GoogleNews-vectors-negative300/'
model = gensim.models.KeyedVectors.load_word2vec_format(path + 'GoogleNews-vectors-negative300.bin', 
                                                        binary=True)

In [204]:
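# index2word is assumed to list words from most to least frequent, which the
# rank-based P() below relies on (gensim < 4.0; renamed index_to_key in 4.0)
words = model.index2word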

print('loaded...')

w_rank = {}
for i,word in enumerate(words):
    w_rank[word] = i
WORDS = w_rank

print('done..')


# Use the word2vec vocabulary loaded above as the dictionary
def words(text): return re.findall(r'\w+', text.lower())
def P(word): 
    "Probability of `word`."
    # use inverse of rank as proxy
    # returns 0 if the word isn't in the dictionary
    return - WORDS.get(word, 0)

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)
def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or [word])
def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def singlify(word):
    
    return "".join([letter for i,letter in enumerate(word) if i == 0 or letter != word[i-1]])

loaded...
done..


# Correcting words 

In [205]:
correction('rbun')

'run'
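
To make the ranking idea concrete without loading the full word2vec model, here is a minimal sketch on a toy vocabulary. The names `toy_words`, `TOY_RANK`, `toy_edits1` and `toy_correction` are made up for this illustration and are not part of the notebook; the only assumption carried over is that a lower position in the word list means a more frequent word.

```python
# Toy illustration of rank-based spelling correction (hypothetical vocabulary).
toy_words = ["the", "run", "bun", "rub", "ruin"]   # assumed order: most frequent first
TOY_RANK = {w: i for i, w in enumerate(toy_words)}

def toy_edits1(word):
    "All strings that are one edit away from `word`."
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def toy_correction(word):
    "Return the known word itself, else the best-ranked known word one edit away."
    if word in TOY_RANK:
        return word
    candidates = [w for w in toy_edits1(word) if w in TOY_RANK]
    if not candidates:
        return word                            # nothing better found, keep the input
    return min(candidates, key=TOY_RANK.get)   # lowest rank = most frequent

print(toy_correction("rbun"))
```

Both `run` and `bun` are one deletion away from `rbun`; `run` wins because it sits earlier in the toy list.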

In [207]:
#modifying coverage to include correction


def check_coverage(vocab, embeddings_index,with_correction):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    
    for word in tqdm(vocab.keys()):
        
        try:
            
            if with_correction:
                if len(word)>1:
                    known_words[word] = embeddings_index[correction(word)]
                else:
                    known_words[word] = embeddings_index[word]
            else:
                known_words[word] = embeddings_index[word]
            
            nb_known_words += vocab[word]
       
    
    
        except KeyError:
            # No vector for this word (or for its corrected form).
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

In [208]:
vocab = build_vocab(cleaned_text)

unk_words1 = check_coverage(vocab,word_matrix,with_correction = True)

100%|██████████| 126838/126838 [00:53<00:00, 2349.62it/s]


Found embeddings for 53.53% of vocab
Found embeddings for  98.45% of all text


### Ok this is fantastic!!! Our vocab coverage has increased from 31.54% to 53.53%
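
The two percentages printed by `check_coverage` measure different things: vocab coverage counts each unique word once, while text coverage weights each word by how often it occurs. A small sketch with made-up counts (the `toy_vocab` and `toy_embeddings` values are illustrative only) shows why text coverage is usually much higher:

```python
# Hypothetical word counts and a hypothetical set of words that have embeddings.
toy_vocab = {"the": 50, "model": 30, "regresion": 15, "xgboostt": 5}
toy_embeddings = {"the", "model"}

known = {w: c for w, c in toy_vocab.items() if w in toy_embeddings}

vocab_coverage = len(known) / len(toy_vocab)                      # unique words found
text_coverage = sum(known.values()) / sum(toy_vocab.values())     # occurrences found

print(f"{vocab_coverage:.2%} of vocab, {text_coverage:.2%} of all text")
```

Rare misspellings make up a large share of the unique words but only a small share of the running text, which is the pattern seen in the numbers above.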

# The function below is used while creating the weight_matrix

#### The method is quite simple: search for the stemmed, lemmatized or spelling-corrected version of a word only when the word itself is not present in the embeddings.

#### E.g. say we are looking for the embedding of the word running, which we can't find. If we apply stemming we get run, whose vector is present in the embedding matrix. This approach is better than blindly applying stemming or lemmatization.
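
The fallback idea can be sketched in a few lines, assuming the embeddings behave like a plain dict. `naive_stem` is a deliberately crude, hypothetical stand-in for a real stemmer such as NLTK's `PorterStemmer`, and `lookup_with_fallback` and `toy_index` are names invented for this illustration.

```python
def naive_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lookup_with_fallback(word, embeddings_index, transforms):
    """Return (matched_key, vector) for the first variant that has a vector, else (None, None)."""
    for transform in transforms:
        candidate = transform(word)
        vector = embeddings_index.get(candidate)
        if vector is not None:
            return candidate, vector
    return None, None

toy_index = {"run": [0.1, 0.2], "python": [0.3, 0.4]}
# Order matters: the unchanged word is tried first, the most aggressive rewrite last.
transforms = [lambda w: w, str.lower, str.upper, str.capitalize, naive_stem]

print(lookup_with_fallback("Python", toy_index, transforms)[0])
print(lookup_with_fallback("running", toy_index, transforms)[0])
print(lookup_with_fallback("qwerty", toy_index, transforms)[0])
```

Because the original form is always tried first, words that already have a vector are never altered; only misses pay the cost of the later transforms.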


In [223]:

stem = []
correct = []
lemma = []

def load_glove(vocab,embeddings_index):

    '''
    Each word is looked up in the embeddings in this order:
    1. the actual word
    2. lower case
    3. upper case
    4. capitalized
    5. stemmed (Porter, Lancaster, Snowball)
    6. lemmatized
    7. spelling-corrected
    '''
    known_words = 0
    unk_words = 0




    for key in tqdm(vocab.keys()):

        word = key

        embedding_vector = embeddings_index.get(word)
        #checking if word is present.
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            known_words += 1
            continue

        #check if lower case of the word is present.
        word = key.lower()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            known_words += 1
            continue

        #cheking if the upper case of the word is present.
        word = key.upper()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            known_words += 1
            continue

        #checking if capital word is present.
        word = key.capitalize()
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            known_words += 1
            continue

        # Three stemmers are tried in turn:
        #   ps -> PorterStemmer
        #   lc -> LancasterStemmer
        #   sb -> SnowballStemmer
        # TODO: need to check on this as well



        word = ps.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            stem.append(word)
            known_words += 1
            continue


        word = lc.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            stem.append(word)
            known_words += 1
            continue


        word = sb.stem(key)
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            stem.append(word)
            known_words += 1
            
            continue

        # Using lemmatization (the word is treated as a verb)


        word = wnl.lemmatize(key,'v')
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
#             embedding_matrix[word_dict[key]] = embedding_vector
            lemma.append(key)
            known_words += 1
            continue

        # Checking for spelling mistakes



        if len(key) > 1:
            word = correction(key)
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
#                 embedding_matrix[word_dict[key]] = embedding_vector
                correct.append(word)
                known_words += 1
                continue

                
        #unknown words.    
#         embedding_matrix[word_dict[key]] = unknown_vector        
        unk_words += 1

    


    return unk_words,known_words 



In [224]:
vocab = build_vocab(cleaned_text)

unk_words,known_words = load_glove(vocab,word_matrix)

100%|██████████| 126838/126838 [00:59<00:00, 2145.97it/s]


In [238]:
len(vocab)

126838

### Words found by stemmers

In [225]:
len(stem)

11542

In [231]:
stem

['connor',
 'machin',
 'number',
 'method',
 'text',
 'components',
 'tests',
 'sets',
 'method',
 'sizes',
 'observations',
 'time',
 'crossing',
 'survey',
 'improv',
 'variance',
 'clustering',
 'regular',
 'score',
 'train',
 'respect',
 'background',
 'network',
 'ranks',
 'effect',
 'sign',
 'videos',
 'testing',
 'trend',
 'purus',
 'int',
 'analysis',
 'inverter',
 'regression',
 'pcl',
 'revise',
 'dredg',
 'data',
 'vector',
 'series',
 'period',
 'loss',
 'log',
 'correlated',
 'paradox',
 'subsample',
 'across',
 'tmt',
 'needed',
 'expectancy',
 'help',
 'best',
 'simulations',
 'stl',
 'distributions',
 'ind',
 'run',
 'pairs',
 'imperfect',
 'toler',
 'cov',
 'target',
 'l2',
 'lsa',
 'rpi',
 'interaction',
 'datapoint',
 'variables',
 'inflation',
 'matrix',
 'term',
 'distribution',
 'spec',
 'rade',
 'at',
 'sc',
 'exp',
 'cont',
 'population',
 'container',
 'rpo',
 'ric',
 'function',
 'auc',
 'sigma',
 'classification',
 'variable',
 'forecasting',
 'missing',
 're

### Words found by lemmatization

In [226]:
len(lemma)

49

### Words Found by Correction

In [233]:
len(correct)

19742

In [235]:
correct[0:5]

['at', 'march', 'leverage', 'acceptable', 'kin']

# Total new words found by the load_glove method

In [236]:
len(stem) + len(lemma) + len(correct)

31333

## Incredible! We found embeddings for over 31k additional words: 19,742 through spelling correction, 11,542 through stemming and 49 through lemmatization.


# New things learned
1. dealing with contractions
2. dealing with punctuation
3. dealing with quotes (not implemented)


Conclusion: The preprocessing steps will change for each dataset. The best thing to do is lowercase the words,
get the initial dataset coverage and the out-of-vocabulary words, then check for contractions and punctuation.

The GloVe embeddings contain about 3000 numbers and most punctuation marks.

If there are still words that could be found in the word embeddings, apply misspelling correction,
then check for acronyms.

## The Workflow


The first thing to do is to know which tokens are present in your embeddings. Separate them into

1. words 
2. chars 
3. punctuations 
4. non-ascii
5. numbers 


Then apply the following preprocessing steps


1. when using embeddings, adding spaces around punctuation works best, e.g. it's becomes it ' s
2. dealing with numbers by mapping numbers larger than 9 to #'s doubles word coverage
3. mapping misspelled words didn't help much in my case
4. mapping acronyms
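
The first two steps can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's own cleaning code: `space_punctuation` and `mask_numbers` are hypothetical helpers, and the choice to replace every digit of a multi-digit number with one `#` is an assumed mapping that should be checked against the number tokens actually present in the embeddings.

```python
import re
import string

def space_punctuation(text):
    # Surround every ASCII punctuation character with spaces, then collapse whitespace runs.
    spaced = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()

def mask_numbers(text):
    # Replace each digit of a multi-digit number with '#'; single digits 0-9 are left alone.
    return re.sub(r"\d{2,}", lambda m: "#" * len(m.group()), text)

print(space_punctuation("it's fine, really"))
print(mask_numbers("7 of 120 rows in 2019"))
```

If both helpers are used, spacing should run before masking, because `#` is itself a punctuation character and would otherwise be split apart.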



The best results obtained are a total text coverage of 96.71% and a unique word coverage of 36.62%.


The next thing to do is filter out the words which are OOV (out of vocabulary).
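
A minimal sketch of that filtering step, assuming the text is already tokenized and the embeddings behave like a dict; `drop_oov` and `toy_index` are hypothetical names used only for this example.

```python
def drop_oov(tokens, embeddings_index):
    """Keep only the tokens that have a vector in embeddings_index."""
    return [tok for tok in tokens if tok in embeddings_index]

# Toy embeddings: only these three tokens have vectors.
toy_index = {"linear": [0.1], "regression": [0.2], "#": [0.3]}
tokens = "linear regresion with # features".split()

print(drop_oov(tokens, toy_index))
```

Dropping tokens shortens the sequences; mapping them to a shared unknown vector instead is an alternative that keeps the sequence length unchanged.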

