### NLP - Preprocessor For Language Model

Our objective here is to Create a language model from scratch. The language model either can be at character or word level. Given the group of words or character predict the forthcoming character or a word.

There is a excellent explanation given here for RNN and LSTM. Please check it below

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/

In this session, let us take the csv as an input and outputs the numerical data which would be used with the neural net to build a language model or for sentiment classification 

### LanguageTokenizer is a wrapper on top of the spacy tokenizer which helps in preprocessing the text before giving to tokenization process to Spacy. 

The preprocess steps includes 

1) Replace the repeated characters with the t_rep tag. for example 22222 will be replaced as t_rep 2 5

2) Replace the repeated words with the t_wrep tag. for example 'Test Test Test Test' will be replaced as t_wrep test 4

3) Replace the Caps word with the tag . for example 'HOW are you?' will be replaced with 't_up how are you?'

4) Replace Html BR tags with /n

5) Replace Multi spaces to a single spaces

Thanks to the fast ai course which helps in deriving these common initutions 

Post this preprocessed text will be tokenized using Spacy.

In [1]:
import spacy
from spacy.symbols import ORTH
import re 

class LanguageTokenizer:
    def __init__(self,language='en'):
        self.processor = spacy.load('en')   
        for w in ('_eos_','_bos_','_unk_'):
            self.processor.tokenizer.add_special_case(w, [{ORTH: w}])
            
        self.re_repeatedCharacters = re.compile(r'(\S)(\1{3,})')
        self.re_repeatedWords = re.compile(r'(\b\w+\W+)(\1{3,})')
        self.re_breaks = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)  
 
    def substituteBreaks(self,text):           
        return self.re_breaks.sub("\n", text)   
    
    def _substituteRepeatedCharacters(self,r):
        TK_REP=  'tk_rep'      
        c,cc = r.groups()        
        return f' {TK_REP} {len(cc)+1} {c} '   
   
    def _substituteRepeatedWords(self,r):
        TK_WREP = 'tk_wrep'
        c,cc = r.groups()
        return f' {TK_WREP} {len(cc.split())+1} {c} '
        
    def substituteCaps(self,text):
        TOK_UP,TOK_SENT,TOK_MIX = ' t_up ',' t_st ',' t_mx '
        res = []
        prev='.'
        re_word = re.compile('\w')
        re_nonsp = re.compile('\S')
        for s in re.findall(r'\w+|\W+', text):
            res += ([TOK_UP,s.lower()] if (s.isupper() and (len(s)>2))   
                    else [s.lower()]) 
        return ''.join(res)
    
    #subtitute multi spaces with single space        
    def substituteSpaces(self,text):
        return re.sub(' {2,}', ' ', text)
    
    
    def processText(self, text):
        text = self.re_repeatedCharacters.sub(self._substituteRepeatedCharacters, text)
        text = self.re_repeatedWords.sub(self._substituteRepeatedWords, text)
        text = self.substituteCaps(text)
        text = self.substituteBreaks(text)
        text = self.substituteSpaces(text)
        
        return [token.text for token in self.processor.tokenizer(text)]
        
    
    def process(self, texts):        
        return [self.processText(text) for text in texts]        
        

### LanguagePreprocessor is the wrapper which accepts the CSV and it should have a column names 'text and labels' and returns the numerical representation for both training set and validation set. 

1) It creates the vocabalary dictionary using the max_vocab_size and min_word_freq as a hyper-parameter
2) It returns numercalize input based on the vocabalary dictionary and this in turn can be used for further processing for language model creation

In [17]:
import pandas as pd
import html
import pickle
import ipdb as pdb
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter
import collections


#Always assumes the CSV has text and labels as columns
class LanguagePreprocessor:
    def __init__(self,csvPath,chunkSize = 50000,validationSetSize = 0.1,labelReplace=None):
        self.tokenizer = LanguageTokenizer('en') 
        self.trn_texts, self.trn_labels, self.val_texts, self.val_labels =self._splitDataset(csvPath, validationSetSize,chunkSize,labelReplace)
        
    def apply(self, max_vocab_size = 60000, min_word_freq = 2):       
        self.trn_tokens= self.tokenizer.process(self.trn_texts)
        self.val_tokens= self.tokenizer.process(self.val_texts)      
        
        self._buildVocabDict(max_vocab_size,min_word_freq)
        return self._numericalizeByVocabDict()        
        
    def _getTokens(self):
        return self.tokenizer.process(trn_texts)
    
    def _buildVocabDict(self, max_vocab_size = 60000, min_word_freq = 2, bos_token = "_bos_", unk_token="_unk_", eos_token = "_eos_"):
        freq = Counter(p for o in self.trn_tokens for p in o)
        #pdb.set_trace()
        self.numbertotoken=  [o for  o,c in freq.most_common(max_vocab_size) if c> min_word_freq]
        self.numbertotoken.insert(0, bos_token)       
        self.numbertotoken.insert(0, eos_token)
        self.numbertotoken.insert(0, unk_token)        
        self.tokentonumber = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(self.numbertotoken)}) 
        
        return self.numbertotoken, self.tokentonumber
    
    def _numericalizeByVocabDict(self):
        self.trn_lm = np.array([[self.tokentonumber[o] for o in p] for p in self.trn_tokens])
        self.val_lm = np.array([[self.tokentonumber[o] for o in p] for p in self.val_tokens])
        return self.trn_lm, self.val_lm
        
        
    def _fixup(self,x):
        re1 = re.compile(r'  +')
        x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
        return re1.sub(' ', html.unescape(x))    
        
    def _parsedInput(self ,df,labelReplace,ini_texts,ini_labels):
        BOS = '_bos_'  # beginning-of-sentence tag
        if(labelReplace is not None):
             df.replace({'labels': {'neg': 0, 'pos': 1, 'unsup':2}},inplace=True)
        
        labels = df.iloc[:,-1].astype(np.int64)
        texts = f'\n{BOS} ' + df["text"].astype(str) 
        
        ini_texts =   ini_texts + list(texts.apply(self._fixup).values) if ini_texts is not None else list(texts.apply(self._fixup).values)
        ini_labels = ini_labels + labels.tolist() if ini_labels is not None else labels.tolist()  
        
        return ini_texts, ini_labels
        
    def _splitDataset(self,csvPath, validationSetSize,chunksize,labelReplace):
       
        trn_txt_df, trn_label_df, val_txt_df,val_label_df =[],[],[],[]        
        df =  pd.read_csv(csvPath, chunksize=chunksize) if chunksize is not None else pd.read_csv(csvPath) 
            
        trn_texts, trn_labels, val_texts, val_labels = None,None,None,None
        
        for i,data in enumerate(df):
            trn_data,val_data= train_test_split(
                data, test_size=validationSetSize)
            
            trn_texts, trn_labels = self._parsedInput(trn_data , labelReplace,trn_texts, trn_labels)
            val_texts, val_labels= self._parsedInput(val_data , labelReplace,val_texts, val_labels)
            
            
        return trn_texts, trn_labels, val_texts, val_labels

In [32]:
preprocessor = LanguagePreprocessor('data/imdb_master.csv',labelReplace={'labels': {'neg': 0, 'pos': 1, 'unsup':2}})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  regex=regex)


In [33]:
preprocessor.apply(max_vocab_size=50000)

(array([ [40, 41, 12, 14, 16, 9947, 5653, 1976, 705, 454, 29, 313, 717, 5, 6365, 31, 1533, 1579, 28, 5, 7, 7018, 566, 6, 563, 18, 327, 710, 140, 2894, 5, 500, 7355, 45, 3, 6438, 79, 8, 7, 208, 16063, 6, 36, 15359, 399, 31, 7075, 10933, 28, 4, 66, 12145, 1172, 8605, 58, 14, 412, 17, 31, 79, 28, 5, 216, 5, 6365, 686, 333, 1470, 101, 7, 37801, 770, 15, 6262, 7, 2424, 0, 56, 16, 1962, 9, 26, 17, 4021, 4206, 4, 20, 27, 81, 151, 32, 554, 891, 148, 805, 5, 6, 123, 3, 26, 32, 163, 90, 20, 3, 16, 106, 85, 35, 1639, 36, 3859, 56, 16, 289, 5, 23, 1625, 4, 10933, 3676, 55, 180, 134, 5, 151, 38, 7019, 139, 5, 6, 1457, 9, 16, 5726, 154, 16, 1329, 4, 11, 17, 13088, 4, 3, 313, 958, 8, 55, 1740, 6, 1883, 27532, 151, 38, 601, 5, 45, 3, 73, 495, 4, 1579, 1659, 9, 180, 796, 4, 873, 5, 55, 15873, 93, 38, 12, 890, 18, 25, 7, 3449, 93, 37, 38, 110, 763, 12, 374, 3, 16, 5588, 16, 53, 16, 395, 1954, 16, 154, 4, 3, 26, 71, 4095, 49, 7, 848, 18, 824, 18, 106, 58, 7, 16, 167, 17, 610, 16, 56, 20, 1579, 6, 3287, 3

### Store the preprocess file for the next session(building language model) 

In [35]:
import dill 

with open('data/imdbpreprocesseddata.file', "wb") as dill_file:
    dill.dump(preprocessor, dill_file)    