Links:
https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw
https://www.codeastar.com/word-embedding-in-nlp-and-python-part-1/
https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings/comments

Things to try/To-Do

import re
def punct_apo_fix(x):
    x = str(x)
    x = x.replace("_"," ")    
    for punct in "`’":
        x = x.replace(punct,"'")    
    for punct in '!,?()%":.$“/;#+*=>[]&-':
        x = x.replace(punct, f' {punct} ')
    apos = re.findall(r"'.*?\s", x)
    for apo in apos: 
        if apo.lower() in ["'t ","'re ", "' ", "'ve ", "'s ", "'ll ", "'d ", "'n ", "'clock ", "'m "]:
            x = x.replace(apo, f' {apo}')
        else:
            x = x.replace(apo, f" ' {apo[1:]}")
    if (x.endswith("'")): 
        x = x[:-1]+" '"        
    return x
    
    


In [2]:
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors

X_train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')



Before we do anything else in NLP, we need a vocabulary: the unique words in the corpus together with their counts. build_vocab takes care of that. It goes through the comments, splits each one into tokens, and counts how often each token occurs.

To get clean tokens from a document, we will have to separate punctuation from words and split contractions such as "y'all" or "you're".

In [4]:
def build_vocab(texts):
    sentences=texts.apply(lambda x: x.split()).values 
    vocab={}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word]+=1
            except KeyError:
                vocab[word]=1    
    return vocab
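
The counting loop above can also be written with collections.Counter from the standard library. The sketch below is an alternative, not the notebook's code; the name build_vocab_counter and the sample sentences are made up for illustration.

```python
from collections import Counter

def build_vocab_counter(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    vocab = Counter()
    for text in texts:
        # str.split() with no argument splits on any run of whitespace
        vocab.update(text.split())
    return vocab

# Made-up sample data, standing in for the comment_text column
sample = ["the cat sat", "the dog sat down"]
vocab_sample = build_vocab_counter(sample)
```

Since iterating over a pandas Series yields its values, passing X_train['comment_text'] should work the same way, assuming every entry is a string.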

In [29]:
X_train['comment_text']=X_train['comment_text'].apply(lambda x: x.lower())
vocab=build_vocab(X_train['comment_text'])
len(vocab)

635719

KeyedVectors (from gensim) stores the word vectors and supports similarity look-ups on them.

In this project we use universal embeddings: embeddings pre-trained on a large corpus, so that the general word and sentence representations learned there carry over to our task. The idea of an embedding is to encode words or sentences as fixed-length dense vectors, that is, as distributed representations of text in an n-dimensional space.
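
To make "fixed-length dense vectors" concrete, here is a toy sketch. The three 3-dimensional vectors are invented for illustration (the real fastText vectors have 300 dimensions), and cosine_similarity is a hand-written helper, not a gensim call.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
```

With these made-up numbers, "king" is much closer to "queen" than to "apple", which is the kind of relationship a similarity look-up relies on.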

Our first major goal is to get our vocabulary as close as possible to the vocabulary of the embedding.

In [6]:
ft_common_crawl = 'crawl-300d-2M.vec'
embeddings_index = KeyedVectors.load_word2vec_format(ft_common_crawl)

To check the overlap between our vocabulary, which we will improve step by step, and the pre-trained embedding, we define check_coverage. It reports how many words in our vocabulary exist in the embedding and returns the words that were not found, sorted by frequency. By addressing those words we can improve our vocabulary.

nb_known_words and nb_unknown_words count the word occurrences in the corpus that are and are not covered by the embedding.
The function also prints the share of the vocabulary and the share of all comment text covered by the embedding.

In [7]:
import operator
def check_coverage(vocab,embeddings_index):
    known_words={}
    unknown_words={}
    nb_known_words=0
    nb_unknown_words=0
    for word in vocab.keys():
        try:
            known_words[word]=embeddings_index[word]
            nb_known_words+=vocab[word]
        except KeyError:  # word has no vector in the embedding
            unknown_words[word]=vocab[word]
            nb_unknown_words+=vocab[word]
    print('Found embedding for {:.3%} of vocab'.format(len(known_words)/len(vocab)))
    print('Found embedding for {:.3%} of all text'.format(nb_known_words/(nb_known_words+nb_unknown_words)))
    unknown_words=sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]
    return unknown_words
unknown = check_coverage(vocab, embeddings_index)    

Found embedding for 13.109% of vocab
Found embedding for 91.197% of all text


The embedding covers only ~13% of the vocabulary, yet this accounts for ~91% of all comment text.
About 8.8% of our text is useless, since those words have no vector in the embedding. To improve this, we need to check which words are out-of-vocabulary (OOV).
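
The gap between the two percentages comes from word frequency: a few very common words make up most of the text. A toy sketch, with invented counts and a plain set standing in for the embedding (the helper name coverage is hypothetical):

```python
def coverage(vocab, known):
    """Return (share of distinct words covered, share of all tokens covered)."""
    covered_words = [w for w in vocab if w in known]
    covered_tokens = sum(vocab[w] for w in covered_words)
    total_tokens = sum(vocab.values())
    return len(covered_words) / len(vocab), covered_tokens / total_tokens

# Invented counts: one frequent word and several rare ones
toy_vocab = {"the": 90, "trump": 5, "khadr": 3, "sb21": 2}
toy_known = {"the", "trump"}  # stand-in for the words the embedding contains

vocab_share, text_share = coverage(toy_vocab, toy_known)
```

Here half of the distinct words are covered, but those words account for 95 of the 100 tokens.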

In [16]:
# Remove punctuation and special characters, then lowercase
def clean_punc(data):
    punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '...∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'
    def clean_special_chars(text, punct):
        for p in punct:
            text = text.replace(p, '')
        # lowercase once, after all characters have been removed
        return text.lower()

    data = data.astype(str).apply(lambda x: clean_special_chars(x, punct))
    return data

In [17]:
X_train['comment_text']=clean_punc(X_train['comment_text'])
vocab=build_vocab(X_train['comment_text'])
unknown=check_coverage(vocab,embeddings_index)

Found embedding for 25.724% of vocab
Found embedding for 98.846% of all text


In [18]:
unknown[:1000]

[('khadr', 5039),
 ('trudeaus', 5003),
 ('alaskas', 4341),
 ('murkowski', 3096),
 ('altright', 2367),
 ('sb21', 2095),
 ('hawaiis', 1847),
 ('siemian', 1826),
 ('putins', 1736),
 ('antitrump', 1692),
 ('mulroney', 1660),
 ('chretien', 1537),
 ('notley', 1384),
 ('oregons', 1191),
 ('manafort', 1179),
 ('ontarios', 1162),
 ('albertans', 1141),
 ('djou', 1088),
 ('alceste', 1049),
 ('m103', 1048),
 ('comeys', 1034),
 ('gorsuch', 1019),
 ('sloter', 948),
 ('guptas', 941),
 ('begich', 940),
 ('khadrs', 933),
 ('altleft', 932),
 ('horgan', 915),
 ('usccb', 874),
 ('sb91', 857),
 ('presidentelect', 854),
 ('kpmg', 834),
 ('pfds', 832),
 ('singlepayer', 800),
 ('tfsa', 797),
 ('antiamerican', 797),
 ('gabbard', 769),
 ('trumpster', 744),
 ('jpii', 737),
 ('kealoha', 692),
 ('klastri', 682),
 ('albertas', 668),
 ('chaput', 652),
 ('fptp', 649),
 ('imua', 641),
 ('farright', 641),
 ('massengill', 628),
 ('arpaio', 617),
 ('shannyn', 587),
 ('antimuslim', 586),
 ('eugenes', 586),
 ('punahou', 58

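Most of the words still missing at this point are misspelled or otherwise non-standard: proper nouns (khadr, murkowski), possessives that lost their apostrophe (trudeaus), and compounds that lost their hyphen (altright).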

In [14]:
# Expand contractions
contraction_mapping = {
        "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
        "could've": "could have", "couldn't": "could not","didn't": "did not",  "doesn't": "does not", 
        "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", 
        "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", 
        "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", 
        "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", 
        "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  
        "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", 
        "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", 
        "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
        "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", 
        "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", 
        "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", 
        "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", 
        "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", 
        "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", 
        "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", 
        "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", 
        "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
        "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  
        "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", 
        "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
        "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
        "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
        "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", 
        "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

def fixing_contraction(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text=text.replace(s,"'")
    text = " ".join(mapping[word] if word in mapping else word for word in text.split(" "))
    return text

In [15]:
X_train['comment_text']=X_train['comment_text'].apply(lambda x: fixing_contraction(x, contraction_mapping))    
vocab=build_vocab(X_train['comment_text'])
unknown=check_coverage(vocab,embeddings_index)

Found embedding for 25.724% of vocab
Found embedding for 98.846% of all text


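The figures did not change. Most likely this is because clean_punc had already stripped every apostrophe, so none of the contraction keys could match; to have an effect, this step has to run before punctuation removal. It also suggests that the unknown words we have left are mostly misspelled or non-standard words rather than contractions.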
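
A minimal sketch of how the order of the two cleaning steps matters, using a toy sentence and a two-entry mapping (both invented here, as are the helper names): once apostrophes are gone, the mapping keys no longer match.

```python
# Two-entry stand-in for contraction_mapping
toy_mapping = {"don't": "do not", "you're": "you are"}

def expand(text, mapping):
    """Replace each space-separated word that is a key in mapping."""
    return " ".join(mapping.get(word, word) for word in text.split(" "))

def strip_apostrophes(text):
    """Stand-in for the apostrophe removal done by punctuation cleaning."""
    return text.replace("'", "")

raw = "you're sure they don't know"

# Expanding first lets the mapping see the apostrophes
expand_first = strip_apostrophes(expand(raw, toy_mapping))
# Stripping first turns "don't" into "dont", which is not a key
strip_first = expand(strip_apostrophes(raw), toy_mapping)
```

Here expand_first contains "you are" and "do not", while strip_first keeps "youre" and "dont" untouched.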

Some words, such as home and up, are written in small-caps Unicode letters; we map these back to plain ASCII letters.

In [19]:
small_caps_mapping = { 
    "ᴀ": "a", "ʙ": "b", "ᴄ": "c", "ᴅ": "d", "ᴇ": "e", "ғ": "f", "ɢ": "g", "ʜ": "h", "ɪ": "i", 
    "ᴊ": "j", "ᴋ": "k", "ʟ": "l", "ᴍ": "m", "ɴ": "n", "ᴏ": "o", "ᴘ": "p", "ǫ": "q", "ʀ": "r", 
    "s": "s", "ᴛ": "t", "ᴜ": "u", "ᴠ": "v", "ᴡ": "w", "x": "x", "ʏ": "y", "ᴢ": "z"
}

def clean_small_caps(text):
    for char in small_caps_mapping:
        text = text.replace(char, small_caps_mapping[char])
    return text
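
As an alternative to calling replace once per character, the same substitution can be done in a single pass with str.maketrans and str.translate. This is a sketch, not the notebook's code; it repeats a subset of the mapping so that it runs on its own, and the names are hypothetical.

```python
# Subset of the small-caps mapping, repeated here so the snippet is self-contained
small_caps_subset = {"ʜ": "h", "ᴏ": "o", "ᴍ": "m", "ᴇ": "e", "ᴜ": "u", "ᴘ": "p"}

# maketrans accepts a dict whose keys are single characters
small_caps_table = str.maketrans(small_caps_subset)

def clean_small_caps_translate(text):
    """Map small-caps letters to ASCII in one pass over the string."""
    return text.translate(small_caps_table)

cleaned = clean_small_caps_translate("ʜᴏᴍᴇ ᴜᴘ")
```

On a dataset of this size the single pass may be noticeably faster, though that has not been measured here.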

In [20]:
X_train["comment_text"] = X_train["comment_text"].apply(lambda text: clean_small_caps(text))
vocab=build_vocab(X_train['comment_text'])
unknown=check_coverage(vocab,embeddings_index)

Found embedding for 25.729% of vocab
Found embedding for 98.855% of all text


We now cover 25.73% of the vocabulary, which amounts to 98.86% of all text.

We will need to apply the same cleaning steps to the test set as well.
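
One way to keep train and test in sync is to put the cleaning steps in a single ordered list and run both sets through it. The sketch below uses plain lists and two stand-in steps; apply_steps is a hypothetical helper, and in the notebook the steps would be the per-string functions defined above (clean_punc takes a whole Series, so it would need a small wrapper or a separate call).

```python
def apply_steps(texts, steps):
    """Run each cleaning function, in order, over every text."""
    cleaned = []
    for text in texts:
        for step in steps:
            text = step(text)
        cleaned.append(text)
    return cleaned

# Stand-in steps for illustration only
steps = [str.lower, lambda t: t.replace("!", "")]

# Made-up sample rows standing in for the train and test comments
train_clean = apply_steps(["Hello World!"], steps)
test_clean = apply_steps(["Another Comment!"], steps)
```

Because both sets pass through the same list, a step added later is applied to the test set automatically.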