## Pre-Processing Text

Modifying text cleaning code from [KevinLiao159 (GitHub)](https://github.com/KevinLiao159/Quora/blob/master/kernels/submission_v50.py), who in turn was modifying code from [Heng Zheng (Kaggle)](https://www.kaggle.com/hengzheng/attention-capsule-why-not-both-lb-0-694), who was actually using code written by [Theo Viel (Kaggle)](https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2). Love ya Kaggle, don't ever change.

Also very informative is [this kernel by Dieter (Kaggle)](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings) which discusses how to customize preprocessing to the embedding chosen.

I chose to work with Fasttext to start out, because this is the most recent text embedding available and thus included Trump and other more topical US political content. 

In [308]:
import pandas as pd
import numpy as np
import os
import re
import string
import unicodedata
from collections import Counter

pd.set_option('display.max_colwidth', -1)

In [6]:
def load_data():
    """
    Load test and train data from csv.
    
    Returns
    _______
    df_train: DataFrame
        Full raw training dataset
    df_test: DataFrame
        Full raw test dataset
    """
    
    # Select local path vs kaggle kernel
    path = os.getcwd()
    if 'data-projects/kaggle_quora/notebooks' in path:
        data_dir = '../data/raw/'
    else:
        data_dir = ''

    df_train = pd.read_csv(data_dir +'train.csv')
    df_test = pd.read_csv(data_dir +'test.csv')
    return df_train, df_test

### Misspellings, misspacings, contractions and punctuation

Embeddings are like children, if they don't know a word yet then they can't understand what you are trying to say. But we can correct to word to one they do recognize.

In [7]:
df_train_raw, df_test_raw = load_data()

df_train = df_train_raw
df_test = df_test_raw

In [8]:
def load_word_embedding(filepath):
    """
    given a filepath to embeddings file, return a word to vec
    dictionary, in other words, word_embedding
    E.g. {'word': array([0.1, 0.2, ...])}
    """
    def _get_vec(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    print('load word embedding ......')
    try:
        word_embedding = dict(_get_vec(*w.split(' ')) for w in open(filepath))
    except UnicodeDecodeError:
        word_embedding = dict(_get_vec(*w.split(' ')) for w in open(
            filepath, encoding="utf8", errors='ignore'))

    return word_embedding

In [9]:
fasttext = load_word_embedding('../data/raw/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec')

load word embedding ......


### Text / Document Coverage

In [11]:
def build_vocab(docs):
    """
    given a list or np.array of strings create a dictionary of unique words with frequencies.
    
    Parameters
    __________
    docs: list or np.array
        iterable of text
    
    Returns
    _______
    dict
        unique words as keys, frequencies as values
    
    """
    vocab = {}
    
    for doc in docs:
        for word in doc.split():
            vocab[word] = vocab.get(word, 0) + 1
                
    return vocab

def vocab_embedding_coverage(vocab, embedding, verbose = False):
    """
    given a dict representing the word frequency of a corpus, 
    calculate the percentage of unique words and 
    the percentage of the corpus matched in the embedding dict.
    
    Parameters
    __________
    vocab: dict
        word frequency of corpus
    embedding: dict
        embedding vector converted to dict
    verbse: bool
        print summary statistics
    
    Returns
    _______
    
    perc_words : float
        percentage of unique words identified in corpus
    perc_corpus : float
        percentage of corpus identified in corpus
    words_in_embedding: dict
        dictionary of unique words, frequency and whether found in embedding (true / false)
    """
    
    words_in_embedding = {}
    word_found_count = 0
    corpus_found_count = 0
    corpus_count = 0 
    
    for word, freq in vocab.items():
        corpus_count += freq
        words_in_embedding[word] = {
            'frequency': freq,
            'embedding': (word in embedding)
        }
        if word in embedding:
            word_found_count += 1
            corpus_found_count += freq
    
    perc_words = word_found_count / len(vocab)
    perc_corpus = corpus_found_count / corpus_count
    
    print('{}% of vocabulary words found in embedding files'.format(round(100*perc_words,2)))
    print('{}% of corpus found in embedding files'.format(round(100*perc_corpus,2)))
    
    return perc_words, perc_corpus, words_in_embedding

In [12]:
vocab = build_vocab(np.concatenate((df_train.question_text, df_test.question_text)))
w,c,words_in_embedding = vocab_embedding_coverage(vocab, fasttext, True)
df_words = pd.DataFrame.from_dict(words_in_embedding, orient = 'index')
df_words = df_words.sort_values(by='frequency', ascending=False)
df_words[np.logical_not(df_words.embedding)].head(10)

29.77% of vocabulary words found in embedding files
87.66% of corpus found in embedding files


Unnamed: 0,frequency,embedding
India?,17082,False
don't,15642,False
it?,13436,False
I'm,13344,False
What's,12985,False
do?,9112,False
life?,8074,False
can't,7375,False
you?,6553,False
me?,6485,False


Fasttext does include punctuation. Notable that there are variants on terms like Ph.D and PhD, or ie., i.e. and i.e. 
Also interesting that the embedding contains references to websites like Google and Amazon. This seems useful.

Stripping all punctuation shouldn't be strictly necessary. in the example of .NET or ASP.NET for example, I'd prefer to keep that period in so we can match. Hyphens may also be useulf, rather than figuring out whether to contract or space split the word.

Fixing contractions is also going to be useful. After that apostrophe's can be removed.

The presence of multiple or absence of question marks is associated with insincerity. The question mark is part of the embedding, so will space it so it.

[punc for punc in fasttext.keys() if '.' in punc]

['.',
 '...',
 'U.S.',
 'i.e.',  'ie.', 'i.e',
'index.php',
 'St.',
 'en.wikipedia.org',
 'p.m.',
 'P.M.',
 'D.C.',
 'a.m.',
 'Ph.D.',
 'Ph.D',
 'U.K.',
 'www.youtube.com',
 'Amazon.com',
 'U.S.A.',
 'N.Y.',
 '.jpg',
 'index.html',
 '.NET',
 'NYTimes.com',
 'www.facebook.com',
 '6.6',
 '8.2',
 '4.9',
 'Wikipedia-logo.png',
 '1.jpg',
 'www.google.com',
 'govt.',
 'gmail.com',
 'hotmail.com',
 'twitter.com',
 'CNN.com',
 'S.H.I.E.L.D.',
 'ASP.NET',
  '.Net',
 'Thanks.',
 'Salesforce.com',
 'msnbc.com',
 'FoxNews.com.',
 'www.nytimes.com',
 'M.I.T.',
 'Amazon.com.',
 'Last.fm',
 ...]
 
 
[punc for punc in fasttext.keys() if '?' in punc]

['?']

In [288]:
def normalize_unicode(text):
    """
    unicode string normalization
    """
    return unicodedata.normalize('NFKD', text)


def remove_newline(text):
    """
    remove \n and  \t
    """
    text = re.sub('\n', ' ', text)
    text = re.sub('\t', ' ', text)
    text = re.sub('\b', ' ', text)
    text = re.sub('\r', ' ', text)
    return text

def clean_latex(text):
    """
    convert r"[math]\vec{x} + \vec{y}" to English
    """
    # edge case
    text = re.sub(r'\[math\]', ' LaTex math ', text)
    text = re.sub(r'\[\/math\]', ' LaTex math ', text)
    text = re.sub(r'\\', ' LaTex ', text)

    pattern_to_sub = {
        r'\\mathrm': ' LaTex math mode ',
        r'\\mathbb': ' LaTex math mode ',
        r'\\boxed': ' LaTex equation ',
        r'\\begin': ' LaTex equation ',
        r'\\end': ' LaTex equation ',
        r'\\left': ' LaTex equation ',
        r'\\right': ' LaTex equation ',
        r'\\(over|under)brace': ' LaTex equation ',
        r'\\text': ' LaTex equation ',
        r'\\vec': ' vector ',
        r'\\var': ' variable ',
        r'\\theta': ' theta ',
        r'\\mu': ' average ',
        r'\\min': ' minimum ',
        r'\\max': ' maximum ',
        r'\\sum': ' + ',
        r'\\times': ' * ',
        r'\\cdot': ' * ',
        r'\\hat': ' ^ ',
        r'\\frac': ' / ',
        r'\\div': ' / ',
        r'\\sin': ' Sine ',
        r'\\cos': ' Cosine ',
        r'\\tan': ' Tangent ',
        r'\\infty': ' infinity ',
        r'\\int': ' integer ',
        r'\\in': ' in ',
    }
    # post process for look up
    pattern_dict = {k.strip('\\'): v for k, v in pattern_to_sub.items()}
    # init re
    patterns = pattern_to_sub.keys()
    pattern_re = re.compile('(%s)' % '|'.join(patterns))

    def _replace(match):
        """
        reference: https://www.kaggle.com/hengzheng/attention-capsule-why-not-both-lb-0-694 # noqa
        """
        return pattern_dict.get(match.group(0).strip('\\'), match.group(0))
    return pattern_re.sub(_replace, text)

def decontracted(text):
    """
    author: Kevin Liao
    
    de-contract the contraction
    """
    try:
        # specific
        text = re.sub(r"(W|w)on(\'|\’)t", "will not", text)
        text = re.sub(r"(C|c)an(\'|\’)t", "can not", text)
        text = re.sub(r"(Y|y)(\'|\’)all", "you all", text)
        text = re.sub(r"(Y|y)a(\'|\’)ll", "you all", text)

        # general
        text = re.sub(r"(I|i)(\'|\’)m", "i am", text)
        text = re.sub(r"(A|a)in(\'|\’)t", "is not", text)
        text = re.sub(r"n(\'|\’)t", " not", text)
        text = re.sub(r"(\'|\’)re", " are", text)
        text = re.sub(r"(\'|\’)s", " is", text)
        text = re.sub(r"(\'|\’)d", " would", text)
        text = re.sub(r"(\'|\’)ll", " will", text)
        text = re.sub(r"(\'|\’)t(?!h)", " not", text)
        text = re.sub(r"(\'|\’)ve", " have", text)
    except:
        print('error processing text:{}'.format(text))
        
    return text

def remove_string(text, string_to_omit=['']):
    """
    Substrings to delete if present.
    """    
    # light arg checking
    if type(string_to_omit) == str:
        string_to_omit = [string_to_omit]
    
    re_tok = re.compile(f'({string_to_omit})')
    return re_tok.sub(r'', text)    

def spacing_digit(text):
    """
    add space before and after digits
    """
    re_tok = re.compile('([0-9])')
    return re_tok.sub(r' \1 ', text)


def spacing_number(text):
    """
    add space before and after numbers
    """
    re_tok = re.compile('([0-9]{1,})')
    return re_tok.sub(r' \1 ', text)


def remove_number(text):
    """
    numbers are not toxic
    """
    return re.sub('\d+', ' ', text)

def remove_space(text):
    """
    remove extra spaces and ending space if any
    """
    text = re.sub('\s+', ' ', text)
    text = re.sub('\s+$', '', text)
    return text

def clean_misspell(text):
    """
    misspell list (quora vs. fasttext wiki-news-300d-1M)
    """
    misspell_to_sub = {
        '“': ' " ',
        '”': ' " ',
        '°C': 'degrees Celsius',
        '&amp;': ' & ',
        '2k17': '2017',
        '2k18': '2018',
        '9/11': 'terrorist attack',
        'Aadhar': 'Indian identification number',
        'aadhar': 'Indian identification number',
        ' adhar': 'Indian identification number',
        'Adityanath': 'Indian monk Yogi Adityanath',
        'AFCAT': 'Indian air force recruitment exam',
        'airhostess': 'air hostess',
        'Ambedkarite': 'Dalit Buddhist movement ',
        'AMCAT': 'Indian employment assessment examination',
        'and/or': 'and or',
        'antibrahmin': 'anti Brahminism',
        'articleship': 'chartered accountant internship',
        'Asifa': 'abduction rape murder case ',
        'AT&T': 'telecommunication company',
        'atrracted': 'attract',
        'Awadesh': 'Indian engineer Awdhesh Singh',
        'Awdhesh': 'Indian engineer Awdhesh Singh',
        'Babchenko': 'Arkady Arkadyevich Babchenko faked death',
        'Barracoon': 'Black slave',
        'Bathla': 'Namit Bathla',
        'bcom': 'bachelor of commerce',
        'beyon´çe': 'Beyoncé',
        'Bhakts': 'Bhakt',
        'bhakts': 'Bhakt',
        'bigdata': 'big data',
        'biharis': 'Biharis',
        'BIMARU': 'Bihar Madhya Pradesh Rajasthan Uttar Pradesh',
        'BITSAT': 'Birla Institute of Technology entrance examination',
        'BNBR': 'be nice be respectful',
        'bodycams': 'body cams',
        'bodyshame': 'body shaming',
        'bodyshoppers': 'body shopping',
        'Bolsonaro': 'Jair Bolsonaro',
        'Boshniak': 'Bosniaks ',
        'Boshniaks': 'Bosniaks',
        'bremainer': 'anti Brexit',
        'bremoaner': 'Brexit remainer',
        'Brexiteer': 'Brexit supporter',
        'Brexiteers': 'Brexit supporters',
        'Brexiter': 'Brexit supporter',
        'Brexiters': 'Brexit supporters',
        'brexiters': 'Brexit supporters',
        'Brexiting': 'Brexit',
        'Brexitosis': 'Brexit disorder',
        'Brexshit': 'Brexit bullshit',
        'C#': 'computer programming language',
        'c#': 'computer programming language',
        'C++': 'computer programming language',
        'c++': 'computer programming language',
        'Cananybody': 'Can any body',
        'cancelled': 'canceled',
        'Castrater': 'castration',
        'castrater': 'castration',
        'centre': 'center',
        'Chodu': 'fucker',
        'Chutiya': 'Tibet people ',
        'Chutiyas': 'Tibet people ',
        'cishet': 'cisgender and heterosexual person',
        'citicise': 'criticize',
        'cliché': 'cliche',
        'clichéd': 'cliche',
        'clichés': 'cliche',
        'Clickbait': 'click bait ',
        'clickbait': 'click bait ',
        'coinbase': 'bitcoin wallet',
        'Coinbase': 'bitcoin wallet',
        'colour': 'color',
        'COMEDK': 'medical engineering and dental colleges of Karnataka entrance examination',
        'counselling': 'counseling',
        'Crimean': 'Crimea people ',
        'currancies': 'currencies',
        'currancy': 'currency',
        'cybertrolling': 'cyber trolling',
        'D&D': 'dungeons & dragons game',
        'daesh': 'Islamic State of Iraq and the Levant',
        'deadbody': 'dead body',
        'deaddict': 'de addict',
        'demcoratic': 'Democratic',
        'demonetisation': 'demonetization',
        'demonetisation': 'demonetization',
        'Demonetization': 'demonetization',
        'demonitisation': 'demonetization',
        'demonitization': 'demonetization',
        'deplorables': 'deplorable',
        'doI': 'do I',
        'Doklam': 'disputed Indian Chinese border area',
        'Doklam': 'Tibet',
        'Dönmeh': 'Islam',
        'Dravidanadu': 'Dravida Nadu',
        'dropshipping': 'drop shipping',
        'Drumpf ': 'Donald Trump fool ',
        'Drumpfs': 'Donald Trump fools',
        'Dumbassistan': 'dumb ass Pakistan',
        'emiratis': 'Emiratis',
        'Eroupian': 'European',
        'Etherium': 'Ethereum',
        'Eurocentric': 'Eurocentrism ',
        'exboyfriend': 'ex boyfriend',
        'facetards': 'Facebook retards',
        'Fadnavis': 'Indian politician Devendra Fadnavis',
        'favourite': 'favorite',
        'Fck': 'Fuck',
        'fck': 'fuck',
        'Feku': 'The Man of India ',
        'feminazism': 'feminism nazi',
        'FIITJEE': 'Indian tutoring service',
        'fiitjee': 'Indian tutoring service',
        'fortnite': 'Fortnite ',
        'Fortnite': 'video game',
        'Gixxer': 'motorcycle',
        'Golang': 'computer programming language',
        'golang': 'computer programming language',
        'Gujratis': 'Gujarati',
        'Gurmehar': 'Gurmehar Kaur Indian student activist',
        'h1b': 'US work visa',
        'H1B': 'US work visa',
        'hairfall': 'hair loss',
        'harrase': 'harass',
        'he/she': 'he or she',
        'healhtcare': 'healthcare',
        'him/her': 'him or her',
        'Hindians': 'North Indian who hate British',
        'Hinduphobia': 'Hindu phobic',
        'hinduphobia': 'Hindu phobic',
        'Hinduphobic': 'Hindu phobic',
        'hinduphobic': 'Hindu phobic',
        'his/her': 'his or her',
        'Hongkongese': 'HongKong people',
        'hongkongese': 'HongKong people',
        'howcan': 'how can',
        'Howdo': 'How do',
        'howdo': 'how do',
        'howdoes': 'how does',
        'howmany': 'how many',
        'howmuch': 'how much',
        'HYPS': ' Harvard Yale Princeton Stanford',
        'HYPSM': ' Harvard Yale Princeton Stanford MIT',
        'ICOs': 'cryptocurrencies initial coin offering',
        'Idiotism': 'idiotism',
        'IITian': 'Indian Institutes of Technology student',
        'IITians': 'Indian Institutes of Technology students',
        'IITJEE': 'Indian Institutes of Technology entrance examination',
        ' incel': ' involuntary celibates',
        ' incels': ' involuntary celibates',
        'indans': 'Indian',
        'jallikattu': 'Jallikattu',
        'JEE MAINS': 'Indian university entrance examination',
        'Jewdar': 'Jew dar',
        'Jewism': 'Judaism',
        'jewplicate': 'jewish replicate',
        'JIIT': 'Jaypee Institute of Information Technology',
        'Kalergi': 'Coudenhove-Kalergi',
        'Kashmirians': 'Kashmirian',
        'Khalistanis': 'Sikh separatist movement',
        'Khazari': 'Khazars',
        'kompromat': 'compromising material',
        'koreaboo': 'Korea boo ',
        'KVPY': 'entrance examination',
        'labour': 'labor',
        'langague': 'language',
        'LGBTQ': 'lesbian  gay  bisexual  transgender queer',
        'LGBT': 'lesbian  gay  bisexual  transgender',
        'Machedo': 'Indian internet celebrity',
        'madheshi': 'Madheshi',
        'Madridiots': 'Real Madrid idiot supporters',
        'mailbait': 'mail bait',
        'MAINS': 'exam',
        'marathis': 'Marathi',
        'marksheet': 'university transcript',
        'mastrubate': 'masturbate',
        'mastrubating': 'masturbating',
        'mastrubation': 'masturbation',
        'mastuburate': 'masturbate',
        'meninism': 'male feminism',
        'MeToo': 'feminist activism campaign',
        'Mewani': 'Indian politician Jignesh Mevani',
        'MGTOWS': 'Men Going Their Own Way',
        'micropenis': 'tiny penis',
        'moeslim': 'Muslim',
        'mongloid': 'Mongoloid',
        'mtech': 'Master of Engineering',
        'muhajirs': 'Muslim immigrant',
        'Myeshia': 'widow of Green Beret killed in Niger',
        'mysoginists': 'misogynists',
        'naïve': 'naive',
        'narcisist': 'narcissist',
        'narcissit': 'narcissist',
        'narcissit': 'narcissist',
        'Naxali ': 'Naxalite ',
        'Naxalities': 'Naxalites',
        'NICMAR': 'Indian university',
        'Niggeriah': 'Nigger',
        'Niggerism': 'Nigger',
        'NMAT': 'Indian MBA exam',
        'Northindian': 'North Indian ',
        'northindian': 'north Indian ',
        'northkorea': 'North Korea',
        'Novichok': 'Soviet Union agents',
        'organisation': 'organization',
        'Padmavat': 'Indian Movie Padmaavat',
        'Pahul': 'Amrit Sanskar',
        'penish': 'penis',
        'pennis': 'penis',
        'Pizzagate': 'Pizzagate conspiracy theory',
        'Pribumi': 'Native Indonesian',
        'qouta': 'quota',
        'quorans': 'advice website user',
        'quoran': 'advice website user',
        'Quorans': 'advice website user',
        'Quoran': 'advice website user',
        'quoras': 'advice website',
        'Qoura ': 'advice website ',
        'Qoura': 'advice website',
        'Quora': 'advice website',
        'Quroa': 'advice website',
        'QUORA': 'advice website',
        'R&D': 'research and development',
        'r&d': 'research and development',
        'r-aping': 'raping',
        'raaping': 'rape',
        'rapefugees': 'rapist refugee',
        'Rapistan': 'Pakistan rapist',
        'rapistan': 'Pakistan rapist',
        'Rejuvalex': 'hair growth formula',
        'ReleaseTheMemo': 'cry for the right and Trump supporters',
        'Remainers': 'anti Brexit',
        'remainers': 'anti Brexit',
        'remoaner': 'remainer ',
        'rohingya': 'Rohingya ',
        'sallary': 'salary',
        'Sanghis': 'Sanghi',
        'sh*t': 'shit',
        'shithole': ' shithole ',
        'shitlords': 'shit lords',
        'shitpost': 'shit post',
        'shitslam': 'shit Islam',
        'sickular': 'India sick secular ',
        'signuficance': 'significance',
        'SJW': 'social justice warrior',
        'SJWs': 'social justice warrior',
        'Skripal': 'Sergei Skripal',
        'Strzok': 'Hillary Clinton scandal',
        'suckimg': 'sucking',
        'superficious': 'superficial',
        'Swachh': 'Swachh Bharat mission campaign ',
        'Tambrahms': 'Tamil Brahmin',
        'Tamilans': 'Tamils',
        'Terroristan': 'terrorist Pakistan',
        'terroristan': 'terrorist Pakistan',
        'Tharki': 'pervert',
        'tharki': 'pervert',
        'theatre': 'theater',
        'theBest': 'the best',
        'thighing': 'masturbate',
        'travelling': 'traveling',
        'trollbots': 'troll bots',
        'trollimg': 'trolling',
        'trollled': 'trolled',
        'Trumpers': 'Trump supporters',
        'Trumpanzees': 'Trump chimpanzee fool',
        'Turkified': 'Turkification',
        'turkified': 'Turkification',
        'UCEED': 'Indian Institute of Technology Bombay entrance examination',
        'unacadamy': 'Indian online classroom',
        'Unacadamy': 'Indian online classroom',
        'unoin': 'Union',
        'unsincere': 'insincere',
        'UPES': 'Indian university',
        'UPSEE': 'Indian university entrance examination',
        'vaxxer': 'vocal nationalist ',
        'VITEEE': 'Vellore institute of technology',
        'watsapp': 'Whatsapp',
        'whattsapp': 'Whatsapp',
        'WBJEE': 'West Bengal entrance examination',
        'weatern': 'western',
        'westernise': 'westernize',
        'Whatare': 'What are',
        'whatare': 'what are',
        'whst': 'what',
        'Whta': 'What',
        'whydo': 'why do',
        'Whykorean': 'Why Korean',
        'Wjy': 'Why',
        'WMAF': 'White male married Asian female',
        'wumao ': 'cheap Chinese stuff',
        'wumaos': 'cheap Chinese stuff',
        'wwii': 'world war 2',
        ' xender': ' gender',
        'XXXTentacion': 'Tentacion',
        'youtu ': 'youtube ',
        'Zerodha': 'online stock brokerage',
        'Žižek': 'Slovenian philosopher Slavoj Žižek',
        'Zoë': 'Zoe',
        '卐': 'Nazi Germany'
    }

    escape_cars = re.compile('(\+|\*)')
    misspell = '|'.join([escape_cars.sub(r"\\\1",i) for i in misspell_to_sub.keys()])
    misspell_re = re.compile(misspell)
    
    def _replace(match):
        return misspell_to_sub.get(match.group(0), match.group(0))
    
    return misspell_re.sub(_replace, text)

def space_chars(text, chars_to_space):
    """
    Takes a string and list of characters, insert space before and after 
    characters that appear in text.
    
    Parameters
    ----------
    text : str
        String to search
    chars_to_space : list
        list of characters to find and space
        
    Returns
    -------
    str
        modified text string    
    """
    
    # light arg checking
    if type(chars_to_space) == str:
        chars_to_space = [chars_to_space]
        
    chars_to_space = set(chars_to_space)
    chars_to_space = '|'.join(chars_to_space)
    re_tok = re.compile('({})'.format(chars_to_space))
    
    return re_tok.sub(r' \1 ', text)

def preprocess(text, remove_num=False):
    
    # 1. Normalize 
    # normalize_unicode(text)
    
    # 2. Remove new-lines
    # text = remove_newline(text)
    
    # 3. replace contractions (e.g. won't -> will not)
    text = decontracted(text)

    # 4. replace LateX with English
    text = clean_latex(text)
    
    # 5. space characters
    text = space_chars(text, ['\?', ',', '"', '\(', '\)', '%', ':', '\$', 
                              '\.', '\+', '\^', '/', '\{', '\}', '\!', 
                              '#', '=', '-','\|', '\[', '\]','\.'])
    
    # 6. handle number
    if remove_num:
        text = remove_number(text)
    else:
        text = spacing_digit(text)
    
    # 7. fix typos and swap terms that are not recognized by embedding
    text = clean_misspell(text)

    # 8. remove space
    text = remove_space(text)
    
    # 9. remove strings
    text = remove_string(text, '_')
    
    return text

In [289]:
df_train['question_text_pr'] = df_train['question_text'].apply(preprocess)
df_test['question_text_pr'] = df_test['question_text'].apply(preprocess)

vocab = build_vocab(np.concatenate((df_train.question_text_pr, df_test.question_text_pr)))
tokens,c,tokens_in_embedding = vocab_embedding_coverage(vocab, fasttext, True)
df_tokens = pd.DataFrame.from_dict(tokens_in_embedding, orient = 'index')
df_tokens = df_tokens.sort_values(by='frequency', ascending=False)
df_tokens[np.logical_not(df_tokens.embedding)].head(10)

66.54% of vocabulary words found in embedding files
99.48% of corpus found in embedding files


Unnamed: 0,frequency,embedding
L&T,74,False
LNMIIT,72,False
Kavalireddi,67,False
etc…,65,False
Vajiram,61,False
DAngelo,60,False
Unacademy,60,False
INFJs,58,False
Padmaavati,57,False
MUOET,56,False


In [294]:
df_train.to_csv('../data/processed/preprocessed_train.csv')
df_test.to_csv('../data/processed/preprocessed_test.csv')

### A Few Before & After Examples

In [286]:
df_train.loc[df_train.question_text.apply(lambda x:
                                          'sh*t' in x), ['question_text', 'question_text_pr', 'target']].head(6)

Unnamed: 0,question_text,question_text_pr,target
72837,"Is the ultimate Republican goal to make the US a sh*tty country ruled by oligarchs propped up by a religious right, with terrible education, a military ""cult"", no privacy rights, no human rights, and no middle class, like the phillipines or turkey?","Is the ultimate Republican goal to make the US a sh*tty country ruled by oligarchs propped up by a religious right , with terrible education , a military "" cult "" , no privacy rights , no human rights , and no middle class , like the phillipines or turkey ?",1
97480,How do you get out of your own head after a really sh*tty day?,How do you get out of your own head after a really sh*tty day ?,0
163578,Why hasn't democracy been digitalized? Are they just slow or is it borderline conspiracy sh*t? Because I feel like voting demographics would drastically change if you could vote via ex. Face ID and interact with your government directly online.,Why has not democracy been digitalized ? Are they just slow or is it borderline conspiracy sh*t ? Because I feel like voting demographics would drastically change if you could vote via ex . Face ID and interact with your government directly online .,1
212156,Why do some book smells like sh*t?,Why do some book smells like sh*t ?,1
238594,Whydo some creepy messages appear while surfing like - Virus Detected Download this sh*t. Are they serious? Go detail for image.,Whydo some creepy messages appear while surfing like - Virus Detected Download this sh*t . Are they serious ? Go detail for image .,0
364675,"People keep asking ""why do Chinese eat […] "" ten times a day. Chinese master English high enough to debate on Quora, while we can’t read sh*t on their zhihu forums. Am I the only one amazed?","People keep asking "" why do Chinese eat [ … ] "" ten times a day . Chinese master English high enough to debate on advice website , while we can not read sh*t on their zhihu forums . Am I the only one amazed ?",1


In [290]:
df_train.loc[df_train.question_text.apply(lambda x:
                                          'LGBTQ' in x), ['question_text', 'question_text_pr', 'target']].head(6)

Unnamed: 0,question_text,question_text_pr,target
2082,Do you know any Tumblr LGBTQ blogs that accept submissions?,Do you know any Tumblr lesbian gay bisexual transgender queer blogs that accept submissions ?,0
10965,"Researching who is murdering transgender people, a disproportionate number of the murderers are black. Why doesn't the LGBTQI community acknowledge this & why is it that this is happening? Is it due to homophobia & transphobia in the black community?","Researching who is murdering transgender people , a disproportionate number of the murderers are black . Why does not the lesbian gay bisexual transgender queerI community acknowledge this & why is it that this is happening ? Is it due to homophobia & transphobia in the black community ?",1
19625,"As the majority of the culture is against the LGBTQ community, can anyone provide proof that this community has committed a crime at any higher rate than society in general?","As the majority of the culture is against the lesbian gay bisexual transgender queer community , can anyone provide proof that this community has committed a crime at any higher rate than society in general ?",1
28182,Where can I read about LGBTQ+ history?,Where can I read about lesbian gay bisexual transgender queer + history ?,0
30531,Is the LGBTQ community to blame for August Ames suicide?,Is the lesbian gay bisexual transgender queer community to blame for August Ames suicide ?,1
34924,"Has any political party tried to confine you by your sex, race, age, wealth, gender or LGBTQ ever? How have they?","Has any political party tried to confine you by your sex , race , age , wealth , gender or lesbian gay bisexual transgender queer ever ? How have they ?",0


In [292]:
# More unmatched items
df_tokens[np.logical_not(df_tokens.embedding)].head(400)

Unnamed: 0,frequency,embedding
L&T,74,False
LNMIIT,72,False
Kavalireddi,67,False
etc…,65,False
Vajiram,61,False
DAngelo,60,False
Unacademy,60,False
INFJs,58,False
Padmaavati,57,False
MUOET,56,False


### Misc Utility Functions

In [None]:
def contains_punctuation(text, other_punct = ['']):
    """
    Does the string contain punctuation?
    
    Parameters
    ----------
    text : str
        String to search for punctuation
    other_punct : list
        list of additional punctuation characters to search
        
    Returns
    -------
    bool
        True if 1 or more instances of punctuation / special characters exist in string
    """
    
    # light arg checking
    if type(other_punct) == str:
        other_punct = [other_punct]
    
    regular_punct = list(string.punctuation)
    extra_punct = [
        ',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&',
        '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
        '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',
        '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”',
        '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾',
        '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕', '▼',
        '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲',
        'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»',
        '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
        '¹', '≤', '‡', '√', '«', '»', '´', 'º', '¾', '¡', '§', '£', '₤']
    
    all_punct = ''.join(sorted(list(set(regular_punct + extra_punct + other_punct))))
    re_tok = re.compile(f'([{all_punct}])')
    
    return re.search(re_tok, text) is not None

# Example
# vocab_with_punctuation = [word for word in vocab.keys() if contains_punctuation(word,['∈'])]
# random.sample(vocab_with_punctuation,10)

## Engineered Features for Use w/ Traditional ML

Engineered features aren't useful for deep learning models, but can be in traditional ML models, such as SVM or logistic regression applied to TF-IDF. A few potentials were identified in EDA.

In [519]:
def count_uppercase_words(text):
    tokens = text.split()
    upper = [1 if u.isupper() and len(u) > 1 else 0 for u in tokens]
    return(sum(upper))

def programming_related(text):
    programming_languages_frameworks_dbs = ['javascript', 'html', 'css', 'sql', 'java', 'bash', 'python',
                                        'c#', 'c++', 'c language', 'c programming', 'c programing', 'typescript', 'ruby', 'matlab', 'f#', 'clojure', 
                                        'haskell', 'erlang', 'coffeescript', 'cobol', 'fortran',
                                        'vba', '.net', 'asp.net', 'scala', 'perl', 'php', 'kotlin',
                                        'node.js', 'react.js', 'angular', 'django', 'cordova', 'tensorflow', 'keras',
                                        'xamarin', 'hadoop', 'pytorch', 'mongo', 'redis', 'elasticsearch', 'mariadb', 'azure',
                                        'dynamodb', ' rds', 'redshift', 'cassandra', 'apache hive', 'bigquery', 'hbase',
                                        'linux', 'raspberry pi', 'rpi ', 'arduino', 'heroku', 'drupal', 
                                        'visual studio', 'sublime text', 'rstudio', 'jupyter', 'pycharm', 'netbeans',
                                       'emacs', 'vim ', 'komodo', 'graphql', 'golang']
    
    for word in text.split():
        if word.lower() in programming_languages_frameworks_dbs:
            return True
    
    return False

class FeatureEngineering():
    def __init__(self, doc_column, max_words=10):
        self._most_common = None
        self._data = None
        self._doc_column = doc_column
        self._max_words = max_words
    
    def fit(self, df):
        # save leading tokens information from fit dataset
        leading_tokens = df[self._doc_column].apply(lambda x: re.match('\w+|\d+|.', x)[0].lower())
        leading_token_count = Counter(leading_tokens)
        max_count = min(self._max_words, len(leading_token_count)-1)
        self._most_common = [w for w,c in leading_token_count.most_common(max_count)]
        self._data = df
        
    def transform(self, df):
        # Leading tokens
        leading_tokens = df[self._doc_column].apply(lambda x: re.match('\w+|\d+|.', x)[0].lower())
        df_leading_tokens = leading_tokens.apply(lambda x: x.lower() if x.lower() in self._most_common else 'other')
        df_leading_tokens = pd.get_dummies(df_leading_tokens)
        for token in self._most_common:
            if token not in df_leading_tokens.columns:
                df_leading_tokens[token] = 0
        df_leading_tokens = df_leading_tokens.rename(columns = {c: 'leading_word_' + c for c in df_leading_tokens.columns})
        # Using 'other' as reference category
        if 'leading_word_other' in df_leading_tokens.columns:
            df_leading_tokens = df_leading_tokens.drop('leading_word_other', axis=1)
        df = pd.concat([df, df_leading_tokens], axis=1)
        
        # Word count
        df['word_count'] = df[self._doc_column].apply(lambda x: len(re.findall(r'\w+',x)))

        # Character count
        df['char_count'] = df[self._doc_column].apply(lambda x: len(x))

        # How many question marks
        df['question_mark_count'] = df[self._doc_column].apply(lambda x: len(re.findall(r'\?',x)))

        # LaTex or Math symbols
        # Programming questions
        df['programming'] = df[self._doc_column].apply(lambda x: programming_related(x))

        # ALL CAPS Words
        df['caps_count'] = df[self._doc_column].apply(lambda x: count_uppercase_words(x))
        
        return df

ef = FeatureEngineering('question_text', 20)
ef.fit(df_train)
df_train_plus = ef.transform(df_train)
df_test_plus = ef.transform(df_test)

df_train.to_csv('../data/processed/preprocessed_features_train.csv')
df_test.to_csv('../data/processed/preprocessed_features_test.csv')

## Conclusion

Can basically go on indefinitely doing cleaning and obviously there are diminishing rewards. One can't predict every strange misspelling, minor celebrity's name or creative ways to swear online.

Sitting at 66.54% of vocabulary words found in embedding files and 99.48% of corpus found in embedding files is more than a good enough place to work from. I find text cleaning surprisingly satisfying.

Next steps:
* Revisit topic and sentiment using processed data
* 