<a href="https://colab.research.google.com/github/dohyun1411/Quora-Insincere-Questions-Classification/blob/preprocessing1/very_simple_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Fork https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings#

In this kernel I want to illustrate how I come up with meaningful preprocessing when building deep learning NLP models.

I start with two golden rules:

1.  **Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings** 

Some of you might have used standard preprocessing steps such as removing stopwords or stemming when doing word-count-based feature extraction (e.g. TF-IDF). With pre-trained embeddings, avoid them.
The reason is simple: you lose valuable information that would help your NN figure things out.

2. **Get your vocabulary as close to the embeddings as possible**

In this notebook I will focus on how to achieve that. As an example I take the GoogleNews pre-trained embeddings; there is no deeper reason for this choice.

We start with a neat little trick that lets us see a progress bar when applying functions to a pandas DataFrame.

In [None]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Let's load our data.

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
print("Train shape : ",train.shape)
print("Test shape : ",test.shape)

Train shape :  (1306122, 3)
Test shape :  (375806, 2)


I will use the following function to track our training vocabulary. It goes through all our text and counts the occurrences of each word.

In [None]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
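
The same counting can also be sketched with `collections.Counter` from the standard library. This is a minimal alternative, not the function used in the rest of the notebook; the name `build_vocab_counter` and the toy sentences are made up for illustration.

```python
from collections import Counter

def build_vocab_counter(sentences):
    """Count how often each word occurs in a list of tokenized sentences."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence)  # adds 1 for every word in the sentence
    return dict(vocab)

# Toy input, invented for this example
toy_sentences = [["How", "did", "it", "go"], ["How", "are", "you"]]
toy_vocab = build_vocab_counter(toy_sentences)
print(toy_vocab)
```

A `Counter` avoids the explicit `try`/`except`, at the cost of hiding the progress bar that `tqdm` provides in the version above.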

So let's populate the vocabulary and display the first 5 elements and their counts. Note that we can now use `progress_apply` to see a progress bar.

In [None]:
sentences = train["question_text"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:5]})

100%|██████████| 1306122/1306122 [00:07<00:00, 184714.57it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 193875.12it/s]

{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}





Next we import the embeddings we want to use in our model later. For illustration I use GoogleNews here.

In [None]:
%%time
import zipfile
from gensim.models import KeyedVectors

embeddings_path = '../input/embeddings.zip'
google = 'GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'

with zipfile.ZipFile(embeddings_path) as embedding_zip:
    print("Found embeddings as a zip file")
    google_embeddings = KeyedVectors.load_word2vec_format(embedding_zip.open(google), binary=True)

Found embeddings as a zip file
CPU times: user 2min 6s, sys: 4.94 s, total: 2min 11s
Wall time: 2min 26s


In [None]:
embeddings_index = google_embeddings

Next I define a function that checks the intersection between our vocabulary and the embeddings. It returns a list of out-of-vocabulary (oov) words, sorted by frequency, that we can use to improve our preprocessing.

In [None]:
import operator 

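def check_coverage(vocab,embeddings_index):
    a = {}    # words that have an embedding
    oov = {}  # out-of-vocabulary words and their counts
    k = 0     # number of tokens covered by the embeddings
    i = 0     # number of tokens without an embedding
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            # the word has no embedding
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # most frequent oov words first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x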
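
To see why the two percentages can differ so much, here is a small self-contained sketch with a plain Python set standing in for the embedding index. The function name `coverage`, the words and the counts are all invented for illustration.

```python
def coverage(vocab, known_words):
    """Return (share of distinct words covered, share of all tokens covered)."""
    covered_words = [w for w in vocab if w in known_words]
    covered_tokens = sum(vocab[w] for w in covered_words)
    total_tokens = sum(vocab.values())
    return len(covered_words) / len(vocab), covered_tokens / total_tokens

toy_vocab = {"the": 90, "India?": 5, "it?": 3, "do?": 2}
toy_known = {"the", "India", "it", "do"}  # stand-in for the embedding keys

vocab_share, text_share = coverage(toy_vocab, toy_known)
print(f"{vocab_share:.2%} of vocab, {text_share:.2%} of all text")
```

Only one of the four distinct words is covered, yet it accounts for most of the tokens, because frequent words dominate the text-level number.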

In [None]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 508823/508823 [00:01<00:00, 262648.61it/s]


Found embeddings for 24.31% of vocab
Found embeddings for  78.75% of all text


Ouch, only 24% of our vocabulary has embeddings, which leaves about 21% of our text more or less useless. So let's have a look and start improving. For this we can simply look at the top oov words.

In [None]:
oov[:10]

[('to', 403183),
 ('a', 402682),
 ('of', 330825),
 ('and', 251973),
 ('India?', 16384),
 ('it?', 12900),
 ('do?', 8753),
 ('life?', 7753),
 ('you?', 6295),
 ('me?', 6202)]

In first place there is "to". Why? Simply because "to" was removed when the GoogleNews embeddings were trained. We will fix this later; for now we take care of splitting off punctuation, as this also seems to be a problem. But what do we do with the punctuation then: do we want to delete it or treat it as a token? I would say: it depends. If the token has an embedding, keep it; if it doesn't, we don't need it anymore. So let's check:

In [None]:
'?' in embeddings_index

False

In [None]:
'&' in embeddings_index

True

Interesting. While "&" is in the GoogleNews embeddings, "?" is not. So we define a function that splits off "&", replaces "/", "-" and "'" with spaces, and removes other punctuation.
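
The membership check can be applied to all punctuation at once. The sketch below uses a small set as a hypothetical stand-in for the embedding keys; with the real embeddings you would pass the loaded index instead, and the resulting split would differ.

```python
import string

def split_punctuation(known_tokens, candidates=string.punctuation):
    """Partition punctuation into characters to keep and characters to drop."""
    keep = [c for c in candidates if c in known_tokens]
    drop = [c for c in candidates if c not in known_tokens]
    return keep, drop

toy_known = {"&", "the", "in"}  # hypothetical stand-in for the embedding keys
keep, drop = split_punctuation(toy_known)
print("keep:", keep)
print("drop:", "".join(drop))
```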

In [None]:
def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x
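
As a sketch of an alternative, the three loops can be folded into a single `str.translate` pass. The names `PUNCT_TABLE` and `clean_text_translate` are made up here; the character sets are copied from `clean_text` above, and the result should be the same because the three groups of replacements do not interfere with each other.

```python
SPACE_CHARS = "/-'"
REMOVE_CHARS = '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’'

# Build one translation table: mark characters for removal first, then
# override the ones that should become a space, then pad "&" so that it
# ends up as its own token.
_table = {c: None for c in REMOVE_CHARS}
_table.update({c: " " for c in SPACE_CHARS})
_table["&"] = " & "
PUNCT_TABLE = str.maketrans(_table)

def clean_text_translate(x):
    """Single-pass variant of clean_text based on str.translate."""
    return str(x).translate(PUNCT_TABLE)

cleaned = clean_text_translate("What's AT&T's well-known plan?")
print(cleaned)
```

Splitting the cleaned string on whitespace then yields `&` as a separate token, which is what the embedding lookup needs.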

In [None]:
train["question_text"] = train["question_text"].progress_apply(lambda x: clean_text(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 1306122/1306122 [00:16<00:00, 80199.06it/s]
100%|██████████| 1306122/1306122 [00:05<00:00, 229140.37it/s]


In [None]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 253623/253623 [00:01<00:00, 241230.11it/s]

Found embeddings for 57.38% of vocab
Found embeddings for  89.99% of all text





Nice! We were able to increase our embeddings ratio from 24% to 57% just by handling punctuation. OK, let's check on those oov words.

In [None]:
oov[:10]

[('to', 406298),
 ('a', 403852),
 ('of', 332964),
 ('and', 254081),
 ('2017', 8781),
 ('2018', 7373),
 ('10', 6642),
 ('12', 3694),
 ('20', 2942),
 ('100', 2883)]

Hmm, it seems numbers are also a problem. Let's check the top 10 embeddings to get a clue.

In [None]:
for i in range(10):
    print(embeddings_index.index2entity[i])

</s>
in
for
that
is
on
##
The
with
said


Hmm, why is "##" in there? Simply because, as a preprocessing step, all numbers bigger than 9 have been replaced by hashes, i.e. 15 becomes ## while 123 becomes ### and 15.80€ becomes ##.##€. So let's mimic this preprocessing step to further improve our embeddings coverage.

In [None]:
import re

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
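
The four substitutions can also be written as one pattern with a callback. This is only a sketch of an equivalent formulation (the name `clean_numbers_single_pass` is made up): every run of two or more digits becomes as many `#` as it has digits, capped at five, and single digits stay as they are.

```python
import re

def clean_numbers_single_pass(x):
    """Replace each run of 2+ digits with as many '#' as digits, capped at 5."""
    return re.sub(r"[0-9]{2,}", lambda m: "#" * min(len(m.group()), 5), x)

for sample in ["15", "123", "7", "1234567", "born in 2017"]:
    print(repr(sample), "->", repr(clean_numbers_single_pass(sample)))
```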

In [None]:
train["question_text"] = train["question_text"].progress_apply(lambda x: clean_numbers(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)

100%|██████████| 1306122/1306122 [00:25<00:00, 50585.11it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 216208.11it/s]
100%|██████████| 1306122/1306122 [00:05<00:00, 219533.84it/s]


In [None]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 242997/242997 [00:01<00:00, 195282.14it/s]

Found embeddings for 60.41% of vocab
Found embeddings for  90.75% of all text





Nice! Another 3% increase. Not as much as with handling the punctuation, but every bit helps. Let's check the oov words.

In [None]:
oov[:20]

[('to', 406298),
 ('a', 403852),
 ('of', 332964),
 ('and', 254081),
 ('favourite', 1247),
 ('bitcoin', 987),
 ('colour', 976),
 ('doesnt', 918),
 ('centre', 886),
 ('Quorans', 858),
 ('cryptocurrency', 822),
 ('Snapchat', 807),
 ('travelling', 705),
 ('counselling', 634),
 ('btech', 632),
 ('didnt', 600),
 ('Brexit', 493),
 ('cryptocurrencies', 481),
 ('blockchain', 474),
 ('behaviour', 468)]

OK, now we take care of common misspellings arising from American/British spelling differences and replace a few "modern" words with "social medium". For this task I use a multi-regex script I found some time ago on Stack Overflow. Additionally, we simply remove the words "a", "to", "and" and "of", since those have obviously been downsampled when training the GoogleNews embeddings.


In [None]:
def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color',
                'centre':'center',
                'didnt':'did not',
                'doesnt':'does not',
                'isnt':'is not',
                'shouldnt':'should not',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                'instagram': 'social medium',
                'whatsapp': 'social medium',
                'snapchat': 'social medium'

                }
mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)
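
Note that the pattern above matches anywhere inside a word, so a word like "centred" would turn into "centerd", and it is case-sensitive. A hedged variant with word boundaries is sketched below; `toy_dict` and `replace_whole_words` are illustrative names, not part of the original script.

```python
import re

toy_dict = {"colour": "color", "centre": "center", "didnt": "did not"}

# \b anchors keep the pattern from firing inside longer words; re.escape
# guards against keys that contain regex metacharacters.
toy_re = re.compile(r"\b(%s)\b" % "|".join(re.escape(k) for k in toy_dict))

def replace_whole_words(text):
    """Replace only whole-word matches of the dictionary keys."""
    return toy_re.sub(lambda match: toy_dict[match.group(0)], text)

print(replace_whole_words("a centred stage at the centre"))
```

Whether whole-word matching is preferable depends on the data: the substring behaviour also fixes plurals such as "colours" for free.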

In [None]:
train["question_text"] = train["question_text"].progress_apply(lambda x: replace_typical_misspell(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
to_remove = ['a','to','of','and']
sentences = [[word for word in sentence if not word in to_remove] for sentence in tqdm(sentences)]
vocab = build_vocab(sentences)

100%|██████████| 1306122/1306122 [00:09<00:00, 143147.53it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 213831.45it/s]
100%|██████████| 1306122/1306122 [00:06<00:00, 214454.78it/s]
100%|██████████| 1306122/1306122 [00:05<00:00, 240682.51it/s]


In [None]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 242935/242935 [00:01<00:00, 203938.67it/s]


Found embeddings for 60.43% of vocab
Found embeddings for  98.96% of all text


We see that although the share of vocabulary with embeddings barely moved (60.41% to 60.43%), the share of all text covered improved from about 91% to 99%. Let's check the oov words again.

In [None]:
oov[:20]

[('bitcoin', 987),
 ('Quorans', 858),
 ('cryptocurrency', 822),
 ('Snapchat', 807),
 ('btech', 632),
 ('Brexit', 493),
 ('cryptocurrencies', 481),
 ('blockchain', 474),
 ('behaviour', 468),
 ('upvotes', 432),
 ('programme', 402),
 ('Redmi', 379),
 ('realise', 371),
 ('defence', 364),
 ('KVPY', 349),
 ('Paytm', 334),
 ('grey', 299),
 ('mtech', 281),
 ('Btech', 262),
 ('bitcoins', 254)]

Looks good. No obvious oov words there that we could quickly fix.
Thank you for reading and happy kaggling!

From this line on, the code is my own.

In [None]:
import numpy as np

In [None]:
len(train[train['target'] == 1]) / len(train) # share of insincere questions; the classes are roughly 94 : 6

0.06187017751787352

Build the feature set.

In [None]:
feature_df = pd.DataFrame()

In [None]:
num_words = [len(sentence) for sentence in sentences]
feature_df['num_words'] = num_words

In [None]:
num_capital_words = [len([word for word in sentence if word.isupper()]) for sentence in sentences]
feature_df['num_capital_words'] = num_capital_words

In [None]:
ratio_capital_words = np.array(num_capital_words) / (np.array(num_words) + 1e-5)
feature_df['ratio_capital_words'] = ratio_capital_words

In [None]:
%%time
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

train_text = train['question_text']
polarity_score = [sid.polarity_scores(sentence)['compound'] for sentence in train_text]
feature_df['polarity_score'] = polarity_score

CPU times: user 4min 41s, sys: 33.8 ms, total: 4min 41s
Wall time: 4min 41s


In [None]:
len_sentence = [len(sentence) for sentence in train_text]
feature_df['len_sentence'] = len_sentence

In [None]:
oov_words = {word for word, _ in oov}
num_oov = [len([word for word in sentence if word in oov_words]) for sentence in sentences]
feature_df['num_oov'] = num_oov

In [None]:
ratio_oov = np.array(num_oov) / (np.array(num_words) + 1e-5)
feature_df['ratio_oov'] = ratio_oov

In [None]:
ratio_words = np.array(num_words) / (np.array(len_sentence) + 1e-5)
feature_df['ratio_words'] = ratio_words
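
For a single question, the count and ratio features above boil down to the following sketch. It is a simplification: it tokenizes the raw text directly, whereas the notebook counts words on the `sentences` list with the removed stop words. The function name `basic_features` and the sample question are made up.

```python
def basic_features(text, oov_words):
    """Compute a few count and ratio features for one question."""
    tokens = text.split()
    num_words = len(tokens)
    num_capital = sum(1 for t in tokens if t.isupper())
    num_oov = sum(1 for t in tokens if t in oov_words)
    eps = 1e-5  # avoids division by zero for empty questions
    return {
        "num_words": num_words,
        "num_capital_words": num_capital,
        "ratio_capital_words": num_capital / (num_words + eps),
        "len_sentence": len(text),
        "num_oov": num_oov,
        "ratio_oov": num_oov / (num_words + eps),
        "ratio_words": num_words / (len(text) + eps),
    }

features = basic_features("Why is KVPY so HARD", oov_words={"KVPY"})
print(features)
```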

In [None]:
# Please add more features..

In [None]:
feature_df['target'] = train['target']
feature_df.head(n=10)

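| | num_words | num_capital_words | ratio_capital_words | len_sentence | num_oov | ratio_oov | ratio_words | target | polarity_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 0 | 0.0 | 71 | 0 | 0.0 | 0.169014 | 0 | 0.0 |
| 1 | 14 | 0 | 0.0 | 79 | 0 | 0.0 | 0.177215 | 0 | 0.6124 |
| 2 | 10 | 0 | 0.0 | 65 | 0 | 0.0 | 0.153846 | 0 | 0.0 |
| 3 | 9 | 0 | 0.0 | 56 | 0 | 0.0 | 0.160714 | 0 | 0.0 |
| 4 | 13 | 2 | 0.153846 | 76 | 3 | 0.230769 | 0.171053 | 0 | 0.0 |
| 5 | 10 | 0 | 0.0 | 70 | 0 | 0.0 | 0.142857 | 0 | 0.0 |
| 6 | 18 | 0 | 0.0 | 111 | 0 | 0.0 | 0.162162 | 0 | -0.3182 |
| 7 | 14 | 1 | 0.071429 | 67 | 0 | 0.0 | 0.208955 | 0 | -0.34 |
| 8 | 16 | 0 | 0.0 | 99 | 0 | 0.0 | 0.161616 | 0 | 0.0 |
| 9 | 42 | 0 | 0.0 | 243 | 0 | 0.0 | 0.172839 | 0 | 0.4779 |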


In [None]:
from sklearn.preprocessing import RobustScaler

def feature_scaling(df):
    target = df.pop('target') # We will not consider target
    scaler = RobustScaler()
    scaler.fit(df)
    df = pd.DataFrame(scaler.transform(df), columns=df.columns, index=list(df.index.values))
    df['target'] = target
    return df
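
With its default settings, `RobustScaler` centers each column on its median and divides by the interquartile range. A minimal standard-library sketch of that idea for one column is shown below; `robust_scale` is a made-up name, the data is a toy example, and the sketch assumes a non-zero IQR.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Scale values as (x - median) / IQR, the idea behind robust scaling."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    return [(v - center) / iqr for v in values]

scaled = robust_scale([8, 9, 10, 13, 15])
print(scaled)
```

Because the median and the quartiles are barely affected by a few extreme values, very long questions do not distort the scale the way they would with mean and standard deviation.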

In [None]:
from sklearn.impute import SimpleImputer

def missing_data_handling(df):
    target = df.pop('target') # We will not consider target
    imp = SimpleImputer(strategy='mean')
    imp.fit(df)
    df = pd.DataFrame(imp.transform(df), columns=df.columns, index=list(df.index.values))
    df['target'] = target
    return df

In [None]:
def outlier_handling(df):
    target = df.pop('target') # We will not consider target
    for column in df.columns:
        Q1 = np.percentile(df[column], 25)
        Q3 = np.percentile(df[column], 75)
        IQR = Q3 - Q1
        df = df[(df[column] <= Q3 + 1.5 * IQR) & (df[column] >= Q1 - 1.5 * IQR)]
    df['target'] = target
    return df
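
The filter above keeps values inside the usual 1.5 × IQR fences. For one column of toy data, the rule can be sketched with the standard library as follows (`iqr_bounds` and the values are illustrative only):

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy_values = [1, 2, 3, 4, 100]
low, high = iqr_bounds(toy_values)
kept = [v for v in toy_values if low <= v <= high]
print(low, high, kept)
```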

In [None]:
# feature_df = outlier_handling(feature_df)
feature_df = feature_scaling(feature_df)
feature_df = missing_data_handling(feature_df)
feature_df.head(n=10)

| | num_words | num_capital_words | ratio_capital_words | len_sentence | num_oov | ratio_oov | ratio_words | polarity_score | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333333 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | -0.144277 | 0.0 | 0 |
| 1 | 0.666667 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.097251 | 1.69546 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.15 | 0.0 | 0.0 | -0.590983 | 0.0 | 0 |
| 3 | -0.166667 | 0.0 | 0.0 | -0.075 | 0.0 | 0.0 | -0.388712 | 0.0 | 0 |
| 4 | 0.5 | 2.0 | 2.153846 | 0.425 | 3.0 | 0.230769 | -0.084241 | 0.0 | 0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.275 | 0.0 | 0.0 | -0.914617 | 0.0 | 0 |
| 6 | 1.333333 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | -0.346071 | -0.880952 | 0 |
| 7 | 0.666667 | 1.0 | 1.0 | 0.2 | 0.0 | 0.0 | 1.032016 | -0.941307 | 0 |
| 8 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | -0.362151 | 0.0 | 0 |
| 9 | 5.333333 | 0.0 | 0.0 | 4.6 | 0.0 | 0.0 | -0.031615 | 1.32309 | 0 |


In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(feature_df, test_size=0.2)
y_train = train_df['target']
X_train = train_df.drop(['target'], axis=1)
y_test = test_df['target']
X_test = test_df.drop(['target'], axis=1)

In [None]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

def adj_r2(X, y):
    reg = LinearRegression()
    reg.fit(X, y)
    y_pred = reg.predict(X)
    
    r2 = r2_score(y, y_pred)
    n = len(y)
    p = len(X.columns)
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    
    return adjusted_r2
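
The correction itself is a one-line formula. The sketch below isolates it so the effect of the number of predictors `p` and samples `n` is easy to check by hand; `adjusted_r2_from` is a made-up helper name and the numbers are arbitrary.

```python
def adjusted_r2_from(r2, n, p):
    """Adjusted R^2 for n samples and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

value = adjusted_r2_from(0.5, n=11, p=2)
print(value)
```

With more than a million rows and only a handful of features, the factor `(n - 1) / (n - p - 1)` is very close to 1, so the adjusted score will be almost identical to the plain R².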

In [None]:
from copy import deepcopy

def forward_selection(data, target, n_features):
    remaining_features = data.columns.tolist()
    selected_features = []
    best_score = 0.
    
    prev_features = [] # selected features of previous step
    for _ in range(n_features):
        cur_best_score = 0. # best score of current step
        cur_best_feature = '' # best feature to select in current step
        for feature in remaining_features:
            
            cur_features = prev_features + [feature]
            X = data[cur_features]
            
            adjusted_r2 = adj_r2(X, target)
            # print(adjusted_r2)
            if adjusted_r2 >= cur_best_score:
                cur_best_score = adjusted_r2
                cur_best_feature = feature

        # update the selection once per step, inside the outer loop
        if cur_best_feature:
            prev_features.append(cur_best_feature)
            remaining_features.remove(cur_best_feature)

            # keep the feature set with the best score seen so far
            if cur_best_score > best_score:
                best_score = cur_best_score
                selected_features = deepcopy(prev_features)

    return selected_features
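
The greedy idea can be tried out without any data by plugging in a toy scoring function. The sketch below is a generic variant, not the function above: it stops as soon as no candidate improves the score, and `greedy_forward_selection`, `toy_score` and the gain values are invented for illustration.

```python
def greedy_forward_selection(features, score_fn, max_features):
    """Add the best-scoring feature at each step until the score stops improving."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    for _ in range(max_features):
        if not remaining:
            break
        # Score every candidate when added to the current selection.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        step_score, step_feature = max(scored, key=lambda pair: pair[0])
        if step_score <= best_score:
            break  # no candidate improves the score, so stop early
        selected.append(step_feature)
        remaining.remove(step_feature)
        best_score = step_score
    return selected

# Toy score: each feature adds a fixed gain, each extra feature costs 0.05.
gains = {"a": 0.3, "b": 0.2, "c": 0.0}

def toy_score(subset):
    return sum(gains[f] for f in subset) - 0.05 * len(subset)

chosen = greedy_forward_selection(["a", "b", "c"], toy_score, max_features=3)
print(chosen)
```

Here the uninformative feature `c` is never added, because it would lower the penalized score.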

In [None]:
fs = forward_selection(X_train, y_train, 8)
print(fs)

['len_sentence']


In [None]:
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))
print(f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.96    245227
           1       0.20      0.10      0.13     15998

   micro avg       0.92      0.92      0.92    261225
   macro avg       0.57      0.54      0.55    261225
weighted avg       0.90      0.92      0.91    261225

0.13319301199747421
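
The last number is the F1 score of the positive (insincere) class, i.e. the harmonic mean of its precision and recall. With only about 6% positives, an overall accuracy around 0.92 says little, which is why this score is the one to watch. A quick check with the rounded values from the report (`f1_from` is a made-up helper):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 precision and recall from the report above
approx_f1 = round(f1_from(0.20, 0.10), 2)
print(approx_f1)
```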


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid_params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100]
}
gs = GridSearchCV(SVC(), grid_params, verbose=2)
gs.fit(X_train, y_train)

best_params = gs.best_params_
print(gs.best_params_)

svm = SVC(**best_params)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] C=0.001, gamma=0.001 ............................................
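
The "36 candidates, totalling 108 fits" in the log follows directly from the grid: 6 values of `C` times 6 values of `gamma`, each fitted on 3 folds. The sketch below enumerates the combinations with `itertools.product`; it only illustrates the counting and does not reproduce how scikit-learn builds its grid internally.

```python
from itertools import product

grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}
# One dict per parameter combination
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 3  # taken from the log line above
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
print(candidates[0])
```

Since every fit trains an SVM on a large share of the training rows, trimming the grid or searching on a subsample first is worth considering.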
