## Pre-Processing Text

Modifying text cleaning code from [KevinLiao159 (GitHub)](https://github.com/KevinLiao159/Quora/blob/master/kernels/submission_v50.py), who in turn was modifying code from [Heng Zheng (Kaggle)](https://www.kaggle.com/hengzheng/attention-capsule-why-not-both-lb-0-694), who was using code written by [Theo Viel (Kaggle)](https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2). Love ya Kaggle, don't ever change.

Also very informative is [this kernel by Dieter (Kaggle)](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings) which discusses how to customize preprocessing to the embedding chosen.

Because Fasttext

In [5]:
import pandas as pd
import numpy as np
import os
import re
import string

pd.set_option('display.max_colwidth', -1)

In [6]:
def load_data():
    """
    Load test and train data from csv.
    
    Returns
    _______
    df_train: DataFrame
        Full raw training dataset
    df_test: DataFrame
        Full raw test dataset
    """
    
    # Select local path vs kaggle kernel
    path = os.getcwd()
    if 'data-projects/kaggle_quora/notebooks' in path:
        data_dir = '../data/raw/'
    else:
        data_dir = ''

    df_train = pd.read_csv(data_dir +'train.csv')
    df_test = pd.read_csv(data_dir +'test.csv')
    return df_train, df_test

### Misspellings, misspacings, contractions and punctuation

Embeddings are like children, if they don't know a word yet then they can't understand what you are trying to say. But we can correct to word to one they do recognize.

In [7]:
df_train_raw, df_test_raw = load_data()

df_train = df_train_raw
df_test = df_test_raw

In [8]:
def load_word_embedding(filepath):
    """
    given a filepath to embeddings file, return a word to vec
    dictionary, in other words, word_embedding
    E.g. {'word': array([0.1, 0.2, ...])}
    """
    def _get_vec(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    print('load word embedding ......')
    try:
        word_embedding = dict(_get_vec(*w.split(' ')) for w in open(filepath))
    except UnicodeDecodeError:
        word_embedding = dict(_get_vec(*w.split(' ')) for w in open(
            filepath, encoding="utf8", errors='ignore'))

    return word_embedding

In [9]:
fasttext = load_word_embedding('../data/raw/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec')

load word embedding ......


### Text / Document Coverage

In [11]:
def build_vocab(docs):
    """
    given a list or np.array of strings create a dictionary of unique words with frequencies.
    
    Parameters
    __________
    docs: list or np.array
        iterable of text
    
    Returns
    _______
    dict
        unique words as keys, frequencies as values
    
    """
    vocab = {}
    
    for doc in docs:
        for word in doc.split():
            vocab[word] = vocab.get(word, 0) + 1
                
    return vocab

def vocab_embedding_coverage(vocab, embedding, verbose = False):
    """
    given a dict representing the word frequency of a corpus, 
    calculate the percentage of unique words and 
    the percentage of the corpus matched in the embedding dict.
    
    Parameters
    __________
    vocab: dict
        word frequency of corpus
    embedding: dict
        embedding vector converted to dict
    verbse: bool
        print summary statistics
    
    Returns
    _______
    
    perc_words : float
        percentage of unique words identified in corpus
    perc_corpus : float
        percentage of corpus identified in corpus
    words_in_embedding: dict
        dictionary of unique words, frequency and whether found in embedding (true / false)
    """
    
    words_in_embedding = {}
    word_found_count = 0
    corpus_found_count = 0
    corpus_count = 0 
    
    for word, freq in vocab.items():
        corpus_count += freq
        words_in_embedding[word] = {
            'frequency': freq,
            'embedding': (word in embedding)
        }
        if word in embedding:
            word_found_count += 1
            corpus_found_count += freq
    
    perc_words = word_found_count / len(vocab)
    perc_corpus = corpus_found_count / corpus_count
    
    print('{}% of vocabulary words found in embedding files'.format(round(100*perc_words,2)))
    print('{}% of corpus found in embedding files'.format(round(100*perc_corpus,2)))
    
    return perc_words, perc_corpus, words_in_embedding

In [12]:
vocab = build_vocab(np.concatenate((df_train.question_text, df_test.question_text)))
w,c,words_in_embedding = vocab_embedding_coverage(vocab, fasttext, True)
df_words = pd.DataFrame.from_dict(words_in_embedding, orient = 'index')
df_words = df_words.sort_values(by='frequency', ascending=False)
df_words[np.logical_not(df_words.embedding)].head(10)

29.77% of vocabulary words found in embedding files
87.66% of corpus found in embedding files


Unnamed: 0,frequency,embedding
India?,17082,False
don't,15642,False
it?,13436,False
I'm,13344,False
What's,12985,False
do?,9112,False
life?,8074,False
can't,7375,False
you?,6553,False
me?,6485,False


Fasttext does include punctuation. Notable that there are variants on terms like Ph.D and PhD, or ie., i.e. and i.e. 
Also interesting that the embedding contains references to websites like Google and Amazon. This seems useful.

Stripping all punctuation shouldn't be strictly necessary. in the example of .NET or ASP.NET for example, I'd prefer to keep that period in so we can match. Hyphens may also be useulf, rather than figuring out whether to contract or space split the word.

Fixing contractions is also going to be useful. After that apostrophe's can be removed.

The presence of multiple or absence of question marks is associated with insincerity. The question mark is part of the embedding, so will space it so it.

[punc for punc in fasttext.keys() if '.' in punc]

['.',
 '...',
 'U.S.',
 'i.e.',  'ie.', 'i.e',
'index.php',
 'St.',
 'en.wikipedia.org',
 'p.m.',
 'P.M.',
 'D.C.',
 'a.m.',
 'Ph.D.',
 'Ph.D',
 'U.K.',
 'www.youtube.com',
 'Amazon.com',
 'U.S.A.',
 'N.Y.',
 '.jpg',
 'index.html',
 '.NET',
 'NYTimes.com',
 'www.facebook.com',
 '6.6',
 '8.2',
 '4.9',
 'Wikipedia-logo.png',
 '1.jpg',
 'www.google.com',
 'govt.',
 'gmail.com',
 'hotmail.com',
 'twitter.com',
 'CNN.com',
 'S.H.I.E.L.D.',
 'ASP.NET',
  '.Net',
 'Thanks.',
 'Salesforce.com',
 'msnbc.com',
 'FoxNews.com.',
 'www.nytimes.com',
 'M.I.T.',
 'Amazon.com.',
 'Last.fm',
 ...]
 
 
[punc for punc in fasttext.keys() if '?' in punc]

['?']

In [None]:
def clean_latex(text):
    """
    convert r"[math]\vec{x} + \vec{y}" to English
    """
    # edge case
    text = re.sub(r'\[math\]', ' LaTex math ', text)
    text = re.sub(r'\[\/math\]', ' LaTex math ', text)
    text = re.sub(r'\\', ' LaTex ', text)

    pattern_to_sub = {
        r'\\mathrm': ' LaTex math mode ',
        r'\\mathbb': ' LaTex math mode ',
        r'\\boxed': ' LaTex equation ',
        r'\\begin': ' LaTex equation ',
        r'\\end': ' LaTex equation ',
        r'\\left': ' LaTex equation ',
        r'\\right': ' LaTex equation ',
        r'\\(over|under)brace': ' LaTex equation ',
        r'\\text': ' LaTex equation ',
        r'\\vec': ' vector ',
        r'\\var': ' variable ',
        r'\\theta': ' theta ',
        r'\\mu': ' average ',
        r'\\min': ' minimum ',
        r'\\max': ' maximum ',
        r'\\sum': ' + ',
        r'\\times': ' * ',
        r'\\cdot': ' * ',
        r'\\hat': ' ^ ',
        r'\\frac': ' / ',
        r'\\div': ' / ',
        r'\\sin': ' Sine ',
        r'\\cos': ' Cosine ',
        r'\\tan': ' Tangent ',
        r'\\infty': ' infinity ',
        r'\\int': ' integer ',
        r'\\in': ' in ',
    }
    # post process for look up
    pattern_dict = {k.strip('\\'): v for k, v in pattern_to_sub.items()}
    # init re
    patterns = pattern_to_sub.keys()
    pattern_re = re.compile('(%s)' % '|'.join(patterns))

    def _replace(match):
        """
        reference: https://www.kaggle.com/hengzheng/attention-capsule-why-not-both-lb-0-694 # noqa
        """
        return pattern_dict.get(match.group(0).strip('\\'), match.group(0))
    return pattern_re.sub(_replace, text)

def decontracted(text):
    """
    author: Kevin Liao
    
    de-contract the contraction
    """
    try:
        # specific
        text = re.sub(r"(W|w)on(\'|\’)t", "will not", text)
        text = re.sub(r"(C|c)an(\'|\’)t", "can not", text)
        text = re.sub(r"(Y|y)(\'|\’)all", "you all", text)
        text = re.sub(r"(Y|y)a(\'|\’)ll", "you all", text)

        # general
        text = re.sub(r"(I|i)(\'|\’)m", "i am", text)
        text = re.sub(r"(A|a)in(\'|\’)t", "is not", text)
        text = re.sub(r"n(\'|\’)t", " not", text)
        text = re.sub(r"(\'|\’)re", " are", text)
        text = re.sub(r"(\'|\’)s", " is", text)
        text = re.sub(r"(\'|\’)d", " would", text)
        text = re.sub(r"(\'|\’)ll", " will", text)
        text = re.sub(r"(\'|\’)t(?!h)", " not", text)
        text = re.sub(r"(\'|\’)ve", " have", text)
    except:
        print('error processing text:{}'.format(text))
        
    return text

def space_chars(text, chars_to_space):
    """
    Takes a string and list of characters, insert space before and after 
    characters that appear in text.
    
    Parameters
    ----------
    text : str
        String to search
    chars_to_space : list
        list of characters to find and space
        
    Returns
    -------
    str
        modified text string    
    """
    
    # light arg checking
    if type(chars_to_space) == str:
        chars_to_space = [chars_to_space]
    
    re_tok = re.compile(f'({chars_to_space})')
    return re_tok.sub(r' \1 ', text)

def clean_misspell(text):
    """
    misspell list (quora vs. fasttext wiki-news-300d-1M)
    """
    misspell_to_sub = {
        'quorans': 'Quran',
        'Quorans': 'Quran',
        'quoran': 'Quran',
        'Quoran': 'Quran',
        'C++': 'computer programming language',
        'c++': 'computer programming language', 
        'C#': 'computer programming language',
        'c#': 'computer programming language',
        'Golang': 'computer programming language',
        'golang': 'computer programming language',
        'and/or': 'and or',
        'his/her': 'his or her',
        'he/she': 'he or she',
        'him/her': 'him or her',
        'BITSAT': 'Birla Institute of Technology entrance examination',
        'COMEDK': 'medical engineering and dental colleges of Karnataka entrance examination',
        'KVPY': 'entrance examination',
        'WBJEE': 'West Bengal entrance examination',
        '9/11': 'terrorist attack',
        'mtech': 'Master of Engineering',
        'VITEEE': 'Vellore institute of technology',
        'articleship': 'chartered accountant internship',
        'aadhar': 'Indian identification number',
        'adhar': 'Indian identification number',
        'UPES': 'Indian university',
        'Fortnite': 'video game',
        'Quora': 'advice website',
        'Qoura': 'advice website',
        'bcom': 'bachelor of commerce',
        'AFCAT': 'Indian air force recruitment exam',
        'UCEED': 'Indian Institute of Technology Bombay entrance examination',
        'dropshipping': 'drop shipping',
        'marksheet': 'university transcript',
        '“': ' " ',
        '”': ' " ',
        'Machedo': 'Indian internet celebrity',
        'BNBR': 'be nice be respectful',
        'AMCAT': 'Indian employment assessment examination',
        'FIITJEE': 'Indian tutoring service',
        'fiitjee': 'Indian tutoring service',
        'IITians': 'Indian Institutes of Technology students',
        'IITian': 'Indian Institutes of Technology student',
        'IITJEE': 'Indian Institutes of Technology entrance examination',
        'JEE MAINS': 'Indian university entrance examination',
        'UPSEE': 'Indian university entrance examination',
        'ICOs': 'cryptocurrancies initial coin offering',
        'R&D': 'research and development',
        'r&d': 'research and development',
        'hairfall': 'hair loss',
        'Doklam': 'disputed Indian Chinese border area',
        'NICMAR': 'Indian university',
        'Zerodha': 'online stock brokerage',
        'Gixxer': 'motorcycle',
        '&amp;': ' & ',
        'D&D': 'dungeons & dragons game',
        'coinbase': 'bitcoin wallet',
        'Coinbase': 'bitcoin wallet',
        'h1b': 'US work visa',
        'H1B': 'US work visa',
        'currancies': 'currencies',
        'currancy': 'currancy',
        'r-aping': 'raping',
        'Howdo': 'How do',
        'MeToo': 'feminist activism campaign',
        'Whatare': 'What are',
        'Terroristan': 'terrorist Pakistan',
        'terroristan': 'terrorist Pakistan',
        'BIMARU': 'Bihar, Madhya Pradesh, Rajasthan, Uttar Pradesh',
        'Hinduphobic': 'Hindu phobic',
        'hinduphobic': 'Hindu phobic',
        'Hinduphobia': 'Hindu phobic',
        'hinduphobia': 'Hindu phobic',
        'Babchenko': 'Arkady Arkadyevich Babchenko faked death',
        'Boshniaks': 'Bosniaks',
        'Dravidanadu': 'Dravida Nadu',
        'mysoginists': 'misogynists',
        'MGTOWS': 'Men Going Their Own Way',
        'mongloid': 'Mongoloid',
        'unsincere': 'insincere',
        'meninism': 'male feminism',
        'jewplicate': 'jewish replicate',
        'unoin': 'Union',
        'daesh': 'Islamic State of Iraq and the Levant',
        'Kalergi': 'Coudenhove-Kalergi',
        'Bhakts': 'Bhakt',
        'bhakts': 'Bhakt',
        'Tambrahms': 'Tamil Brahmin',
        'Pahul': 'Amrit Sanskar',
        'SJW': 'social justice warrior',
        'SJWs': 'social justice warrior',
        ' incel': ' involuntary celibates',
        ' incels': ' involuntary celibates',
        'emiratis': 'Emiratis',
        'weatern': 'western',
        'westernise': 'westernize',
        'Pizzagate': 'Pizzagate conspiracy theory',
        'naïve': 'naive',
        'Skripal': 'Sergei Skripal',
        'Remainers': 'anti Brexit',
        'remainers': 'anti Brexit',
        'bremainer': 'anti Brexit',
        'antibrahmin': 'anti Brahminism',
        'HYPSM': ' Harvard, Yale, Princeton, Stanford, MIT',
        'HYPS': ' Harvard, Yale, Princeton, Stanford',
        'kompromat': 'compromising material',
        'Tharki': 'pervert',
        'tharki': 'pervert',
        'mastuburate': 'masturbate',
        'Zoë': 'Zoe',
        'indans': 'Indian',
        ' xender': ' gender',
        'Naxali ': 'Naxalite ',
        'Naxalities': 'Naxalites',
        'Bathla': 'Namit Bathla',
        'Mewani': 'Indian politician Jignesh Mevani',
        'clichéd': 'cliche',
        'cliché': 'cliche',
        'clichés': 'cliche',
        'Wjy': 'Why',
        'Fadnavis': 'Indian politician Devendra Fadnavis',
        'Awadesh': 'Indian engineer Awdhesh Singh',
        'Awdhesh': 'Indian engineer Awdhesh Singh',
        'Khalistanis': 'Sikh separatist movement',
        'madheshi': 'Madheshi',
        'BNBR': 'Be Nice, Be Respectful',
        'Bolsonaro': 'Jair Bolsonaro',
        'XXXTentacion': 'Tentacion',
        'Padmavat': 'Indian Movie Padmaavat',
        'Žižek': 'Slovenian philosopher Slavoj Žižek',
        'Adityanath': 'Indian monk Yogi Adityanath',
        'Brexiter': 'Brexit supporter',
        'Brexiters': 'Brexit supporters',
        'Brexiteer': 'Brexit supporter',
        'Brexiteers': 'Brexit supporters',
        'Brexiting': 'Brexit',
        'Brexitosis': 'Brexit disorder',
        'brexiters': 'Brexit supporters',
        'jallikattu': 'Jallikattu',
        'fortnite': 'Fortnite ',
        'Swachh': 'Swachh Bharat mission campaign ',
        'Quorans': 'Quoran',
        'Qoura ': 'Quora ',
        'quoras': 'Quora',
        'Quroa': 'Quora',
        'QUORA': 'Quora',
        'narcissit': 'narcissist',
        'Doklam': 'Tibet',
        'Drumpf ': 'Donald Trump fool ',
        'Drumpfs': 'Donald Trump fools',
        'Strzok': 'Hillary Clinton scandal',
        'rohingya': 'Rohingya ',
        'wumao ': 'cheap Chinese stuff',
        'wumaos': 'cheap Chinese stuff',
        'Sanghis': 'Sanghi',
        'Tamilans': 'Tamils',
        'biharis': 'Biharis',
        'Rejuvalex': 'hair growth formula',
        'Feku': 'The Man of India ',
        'deplorables': 'deplorable',
        'muhajirs': 'Muslim immigrant',
        'Gujratis': 'Gujarati',
        'Chutiya': 'Tibet people ',
        'Chutiyas': 'Tibet people ',
        'thighing': 'masturbate',
        '卐': 'Nazi Germany',
        'Pribumi': 'Native Indonesian',
        'Gurmehar': 'Gurmehar Kaur Indian student activist',
        'Novichok': 'Soviet Union agents',
        'Khazari': 'Khazars',
        'Demonetization': 'demonetization',
        'demonetisation': 'demonetization',
        'demonitisation': 'demonetization',
        'demonitization': 'demonetization',
        'demonetisation': 'demonetization',
        'cryptocurrencies': 'cryptocurrency',
        'Hindians': 'North Indian who hate British',
        'vaxxer': 'vocal nationalist ',
        'remoaner': 'remainer ',
        'bremoaner': 'Brexit remainer',
        'Jewism': 'Judaism',
        'Eroupian': 'European',
        'WMAF': 'White male married Asian female',
        'moeslim': 'Muslim',
        'cishet': 'cisgender and heterosexual person',
        'Eurocentric': 'Eurocentrism ',
        'Jewdar': 'Jew dar',
        'Asifa': 'abduction, rape, murder case ',
        'marathis': 'Marathi',
        'Trumpanzees': 'Trump chimpanzee fool',
        'Crimean': 'Crimea people ',
        'atrracted': 'attract',
        'LGBT': 'lesbian, gay, bisexual, transgender',
        'Boshniak': 'Bosniaks ',
        'Myeshia': 'widow of Green Beret killed in Niger',
        'demcoratic': 'Democratic',
        'raaping': 'rape',
        'Dönmeh': 'Islam',
        'feminazism': 'feminism nazi',
        'langague': 'language',
        'Hongkongese': 'HongKong people',
        'hongkongese': 'HongKong people',
        'Kashmirians': 'Kashmirian',
        'Chodu': 'fucker',
        'penish': 'penis',
        'micropenis': 'tiny penis',
        'Madridiots': 'Real Madrid idiot supporters',
        'Ambedkarite': 'Dalit Buddhist movement ',
        'ReleaseTheMemo': 'cry for the right and Trump supporters',
        'harrase': 'harass',
        'Barracoon': 'Black slave',
        'Castrater': 'castration',
        'castrater': 'castration',
        'Rapistan': 'Pakistan rapist',
        'rapistan': 'Pakistan rapist',
        'Turkified': 'Turkification',
        'turkified': 'Turkification',
        'Dumbassistan': 'dumb ass Pakistan',
        'facetards': 'Facebook retards',
        'rapefugees': 'rapist refugee',
        'superficious': 'superficial',
        'colour': 'color',
        'centre': 'center',
        'favourite': 'favorite',
        'travelling': 'traveling',
        'counselling': 'counseling',
        'theatre': 'theater',
        'cancelled': 'canceled',
        'labour': 'labor',
        'organisation': 'organization',
        'wwii': 'world war 2',
        'citicise': 'criticize',
        'youtu ': 'youtube ',
        'sallary': 'salary',
        'Whta': 'What',
        'narcisist': 'narcissist',
        'narcissit': 'narcissist',
        'howdo': 'how do',
        'whatare': 'what are',
        'howcan': 'how can',
        'howmuch': 'how much',
        'howmany': 'how many',
        'whydo': 'why do',
        'doI': 'do I',
        'theBest': 'the best',
        'howdoes': 'how does',
        'mastrubation': 'masturbation',
        'mastrubate': 'masturbate',
        'mastrubating': 'masturbating',
        'pennis': 'penis',
        'Etherium': 'Ethereum',
        'bigdata': 'big data',
        '2k17': '2017',
        '2k18': '2018',
        'qouta': 'quota',
        'exboyfriend': 'ex boyfriend',
        'airhostess': 'air hostess',
        'whst': 'what',
        'watsapp': 'whatsapp',
        'bodyshame': 'body shaming',
        'bodyshoppers': 'body shopping',
        'bodycams': 'body cams',
        'Cananybody': 'Can any body',
        'deadbody': 'dead body',
        'deaddict': 'de addict',
        'Northindian': 'North Indian ',
        'northindian': 'north Indian ',
        'northkorea': 'North Korea',
        'Whykorean': 'Why Korean',
        'koreaboo': 'Korea boo ',
        'Brexshit': 'British Exit bullshit',
        'shithole': ' shithole ',
        'shitpost': 'shit post',
        'shitslam': 'shit Islam',
        'shitlords': 'shit lords',
        'Fck': 'Fuck',
        'fck': 'fuck',
        'Clickbait': 'click bait ',
        'clickbait': 'click bait ',
        'mailbait': 'mail bait',
        'healhtcare': 'healthcare',
        'trollbots': 'troll bots',
        'trollled': 'trolled',
        'trollimg': 'trolling',
        'cybertrolling': 'cyber trolling',
        'sickular': 'India sick secular ',
        'suckimg': 'sucking',
        'Idiotism': 'idiotism',
        'Niggerism': 'Nigger',
        'Niggeriah': 'Nigger'
    }

    escape_cars = re.compile('(\+)')
    misspell = '|'.join([escape_cars.sub(r"\\\1",i) for i in misspell_to_sub.keys()])
    misspell_re = re.compile(misspell)
    
    def _replace(match):
        return misspell_to_sub.get(match.group(0), match.group(0))
    
    return misspell_re.sub(_replace, text)

def preprocess(text):
    
    # replace contractions (e.g. won't -> will not)
    text = decontracted(text)

    # replace LateX with English
    text = clean_latex(text)
    
    # space characters
    text = space_chars(text, ['?', ',', '"', '(', ')', '%', ':', '$', 
                              '.', '+', '^', '/', '{', '}', "_", '!', 
                              '#', '='])
    
    # fix typos and swap terms that are not recognized by embedding
    text = clean_misspell(text)
    
    return text

df_train['question_text_pr'] = df_train['question_text'].apply(preprocess)
df_test['question_text_pr'] = df_test['question_text'].apply(preprocess)

In [None]:
vocab = build_vocab(np.concatenate((df_train.question_text_pr, df_test.question_text_pr)))
tokens,c,tokens_in_embedding = vocab_embedding_coverage(vocab, fasttext, True)
df_tokens = pd.DataFrame.from_dict(tokens_in_embedding, orient = 'index')
df_tokens = df_tokens.sort_values(by='frequency', ascending=False)
df_tokens[np.logical_not(df_tokens.embedding)].head(10)

In [121]:
df_tokens[np.logical_not(df_tokens.embedding)].head(100)

Unnamed: 0,frequency,embedding
_,1434,False
AIndian,158,False
math],88,False
cryptocurrancies,79,False
NMAT,77,False
JIIT,74,False
L&T,74,False
LNMIIT,72,False
MAINS,66,False
Kavalireddi,66,False


In [120]:
df_train.loc[df_train.question_text_pr.apply(lambda x: 'MeToo' in x), ['question_text', 'question_text_pr', 'target']]

Unnamed: 0,question_text,question_text_pr,target
33846,"Are women in the US becoming more unpleasant, vicious even, as politics/culture give them more of the same 'power' they have been oppressed by? Just as a small percentage abused the 'MeToo' movement, will we become the very thing we fight against?","Are women in the US becoming more unpleasant , vicious even , as politics / culture give them more of the same ' power ' they have been oppressed by ? Just as a small percentage abused the ' MeToo ' movement , will we become the very thing we fight against ?",1
80257,Where is the #MeToo movement when conservative women get attacked verbally?,Where is the # MeToo movement when conservative women get attacked verbally ?,0
92723,Why do you disagree with #MeToo movement?,Why do you disagree with # MeToo movement ?,0
150944,Have any of the major allegations from the #MeToo movement been proven false?,Have any of the major allegations from the # MeToo movement been proven false ?,0
192276,Does the #MeToo movement feel as important and justified as the public lynchings of black people did in the US around the end of the 19th century?,Does the # MeToo movement feel as important and justified as the public lynchings of black people did in the US around the end of the 19th century ?,1
206821,Has the MeToo movement done more harm than good?,Has the MeToo movement done more harm than good ?,0
226403,How has the #MeToo movement affected your views toward Bill Clinton?,How has the # MeToo movement affected your views toward Bill Clinton ?,0
230206,Why is the #MeToo movement ganging up on art? How is censorship on artisitic expression going to solve anything?,Why is the # MeToo movement ganging up on art ? How is censorship on artisitic expression going to solve anything ?,0
245826,"Why is MeToo being misused by many women who conflate regret, feeling used, being manipulated, and being lied to, with lack of consent?","Why is MeToo being misused by many women who conflate regret , feeling used , being manipulated , and being lied to , with lack of consent ?",1
355826,What is the #MeToo campaign all about?,What is the # MeToo campaign all about ?,0


In [86]:
df_tokens[np.logical_not(df_tokens.embedding)]

Unnamed: 0,frequency,embedding
"""The",104,False
Machedo,103,False
BNBR,103,False
"""I",102,False
infty},99,False
AMCAT,97,False
IITian,94,False
2},92,False
IITJEE,90,False
math],87,False


### Misc Utility Functions

In [None]:
def contains_punctuation(text, other_punct = ['']):
    """
    Does the string contain punctuation?
    
    Parameters
    ----------
    text : str
        String to search for punctuation
    other_punct : list
        list of additional punctuation characters to search
        
    Returns
    -------
    bool
        True if 1 or more instances of punctuation / special characters exist in string
    """
    
    # light arg checking
    if type(other_punct) == str:
        other_punct = [other_punct]
    
    regular_punct = list(string.punctuation)
    extra_punct = [
        ',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&',
        '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
        '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',
        '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”',
        '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾',
        '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕', '▼',
        '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲',
        'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»',
        '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
        '¹', '≤', '‡', '√', '«', '»', '´', 'º', '¾', '¡', '§', '£', '₤']
    
    all_punct = ''.join(sorted(list(set(regular_punct + extra_punct + other_punct))))
    re_tok = re.compile(f'([{all_punct}])')
    
    return re.search(re_tok, text) is not None

# Example
# vocab_with_punctuation = [word for word in vocab.keys() if contains_punctuation(word,['∈'])]
# random.sample(vocab_with_punctuation,10)