# Foreword

Hey all. This is a fork of @bminixhofer's kernel. My only addition is to demonstrate the use of variable batch size to speed up training, and of course I use my pickled embeddings, which load faster and help with memory management.

# Preface

This kernel is a PyTorch version of the [Simple LSTM kernel](https://www.kaggle.com/thousandvoices/simple-lstm). All credit for architecture and preprocessing goes to @thousandvoices.
There is a lot of discussion about whether Keras, PyTorch, TensorFlow or the CUDA C API is best. But specifically between the PyTorch and Keras versions of the simple LSTM architecture, PyTorch has two clear advantages:
- Speed. The PyTorch version runs about 20 minutes faster.
- Determinism. The PyTorch version is fully deterministic. Especially when it gets harder to improve your score later in the competition, determinism is very important.

I was surprised to see that PyTorch is that much faster, so I'm not completely sure the steps taken are exactly the same. If you see any difference, we can discuss it in the comments :)

The most likely reason the score of this kernel is higher than the @thousandvoices version is that the optimizer is not reinitialized after every epoch and thus the parameter-specific learning rates of Adam are not discarded after every epoch. That is the only difference between the kernels that is intended.
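
To make the point about the optimizer concrete, here is a minimal sketch using a toy, pure-Python Adam for a single scalar parameter. `ScalarAdam` and `train` are hypothetical names made up for illustration, not code from either kernel, which use the framework optimizers. Reusing one optimizer object keeps the running moment estimates and the step counter across epochs, while re-creating it every epoch resets them.

```python
import math

class ScalarAdam:
    """Toy Adam for one float parameter (illustration only)."""
    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # number of steps taken so far

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias-corrected first moment
        v_hat = self.v / (1 - self.beta2 ** self.t)  # bias-corrected second moment
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

def train(epochs, steps_per_epoch, reinit_each_epoch):
    w = 5.0  # minimise f(w) = w**2, whose gradient is 2*w
    opt = ScalarAdam()
    for _ in range(epochs):
        if reinit_each_epoch:
            opt = ScalarAdam()  # discards m, v and t from the previous epoch
        for _ in range(steps_per_epoch):
            w = opt.step(w, 2 * w)
    return w, opt.t

w_kept, steps_kept = train(3, 10, reinit_each_epoch=False)
w_reset, steps_reset = train(3, 10, reinit_each_epoch=True)
print(steps_kept, steps_reset)
```

With the optimizer kept alive, the step counter reaches the total number of updates; with re-initialisation it only ever sees one epoch's worth, and the moment estimates restart from zero each time.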

# Imports & Utility functions

In [1]:
import numpy as np
import pandas as pd
import os, time, gc, pickle, random
from tqdm._tqdm_notebook import tqdm_notebook as tqdm
from keras.preprocessing import text, sequence
import torch
from torch import nn
from torch.utils import data
from torch.nn import functional as F
from pytorch_pretrained_bert import BertTokenizer, BertModel
from bert_embedding import BertEmbedding
import apex # used for 16 bit
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import mxnet as mx # used for GPU
#for getting num good and bad words
from wordcloud import STOPWORDS
from collections import defaultdict
import operator

# Logging for BERT
import logging
logging.basicConfig(level=logging.INFO)

Using TensorFlow backend.


In [2]:
# disable progress bars when submitting
def is_interactive():
    return 'SHLVL' not in os.environ

if not is_interactive():
    def nop(it, *a, **k):
        return it

    tqdm = nop

In [3]:
def seed_everything(seed=1234):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()
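
A quick way to sanity-check seeding is to draw numbers twice after re-seeding and compare them. The sketch below only covers Python's `random` module so that it runs without NumPy or a GPU. `seed_python` and `draw` are hypothetical helpers, and the same check can be repeated with `np.random` or `torch.rand` after calling `seed_everything`.

```python
import os
import random

def seed_python(seed=1234):
    # Same first two steps as seed_everything, limited to the standard library.
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

def draw(n=5):
    return [random.random() for _ in range(n)]

seed_python()
first = draw()
seed_python()
second = draw()
print(first == second)
```

Note that `PYTHONHASHSEED` is read when the interpreter starts, so setting it inside a running process only affects subprocesses launched afterwards.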

In [4]:
CRAWL_EMBEDDING_PATH = '../input/pickled-crawl300d2m-for-kernel-competitions/crawl-300d-2M.pkl'
GLOVE_EMBEDDING_PATH = '../input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl'

NUM_MODELS = 2
LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
MAX_LEN = 220

In [5]:
def build_matrix(word_index, emb_path, unknown_token='unknown'):
    with open(emb_path, 'rb') as fp:
        embedding_index = pickle.load(fp)
    
    # TODO: Build random token instead of using unknown
    unknown_token = embedding_index[unknown_token].copy()
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    unknown_words = []
    
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word].copy()
        except KeyError:
            embedding_matrix[i] = unknown_token
            unknown_words.append(word)
            
    del embedding_index; gc.collect()
    return embedding_matrix, unknown_words
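
To illustrate what `build_matrix` produces, here is a toy version that takes an in-memory dictionary instead of a pickle file and uses plain lists instead of NumPy. `build_toy_matrix` and the two-dimensional vectors are made up for the example. Row 0 stays all zeros because the word indices start at 1, and any word missing from the embeddings receives a copy of the `unknown` vector.

```python
def build_toy_matrix(word_index, embedding_index, dim, unknown_token='unknown'):
    unknown_vector = list(embedding_index[unknown_token])
    # one extra row: index 0 is reserved (e.g. for padding) and stays zero
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    unknown_words = []
    for word, i in word_index.items():
        if word in embedding_index:
            matrix[i] = list(embedding_index[word])
        else:
            matrix[i] = list(unknown_vector)
            unknown_words.append(word)
    return matrix, unknown_words

# hypothetical 2-d embeddings and vocabulary
embedding_index = {'unknown': [0.5, 0.5], 'cat': [1.0, 0.0], 'dog': [0.0, 1.0]}
word_index = {'cat': 1, 'dog': 2, 'zyzzyva': 3}

matrix, unknown_words = build_toy_matrix(word_index, embedding_index, dim=2)
print(matrix)
print(unknown_words)
```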

In [6]:
class SpatialDropout(nn.Dropout2d):
    def forward(self, x):
        x = x.unsqueeze(2)    # (N, T, 1, K)
        x = x.permute(0, 3, 2, 1)  # (N, K, 1, T)
        x = super(SpatialDropout, self).forward(x)  # (N, K, 1, T), some features are masked
        x = x.permute(0, 3, 2, 1)  # (N, T, 1, K)
        x = x.squeeze(2)  # (N, T, K)
        return x
    
class NeuralNet(nn.Module):
    def __init__(self, embedding_matrix, num_aux_targets):
        super(NeuralNet, self).__init__()
        embed_size = embedding_matrix.shape[1]
        
        # size the layer from the matrix itself; max_features is not defined at this point
        self.embedding = nn.Embedding(embedding_matrix.shape[0], embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = SpatialDropout(0.3)
        
        self.lstm1 = nn.LSTM(embed_size, LSTM_UNITS, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(LSTM_UNITS * 2, LSTM_UNITS, bidirectional=True, batch_first=True)
    
        self.linear1 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)
        self.linear2 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)
        
        self.linear_out = nn.Linear(DENSE_HIDDEN_UNITS, 1)
        self.linear_aux_out = nn.Linear(DENSE_HIDDEN_UNITS, num_aux_targets)
        
    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.embedding_dropout(h_embedding)
        
        h_lstm1, _ = self.lstm1(h_embedding)
        h_lstm2, _ = self.lstm2(h_lstm1)
        
        # global average pooling
        avg_pool = torch.mean(h_lstm2, 1)
        # global max pooling
        max_pool, _ = torch.max(h_lstm2, 1)
        
        h_conc = torch.cat((max_pool, avg_pool), 1)
        h_conc_linear1  = F.relu(self.linear1(h_conc))
        h_conc_linear2  = F.relu(self.linear2(h_conc))
        
        hidden = h_conc + h_conc_linear1 + h_conc_linear2
        
        result = self.linear_out(hidden)
        aux_result = self.linear_aux_out(hidden)
        out = torch.cat([result, aux_result], 1)
        
        return out
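
The `4 * LSTM_UNITS` in `DENSE_HIDDEN_UNITS` follows from the shapes in `forward`: each bidirectional LSTM emits `2 * LSTM_UNITS` features per time step, and the max-pooled and average-pooled vectors are concatenated. A small sketch of that arithmetic (the function name `pooled_feature_size` is made up for illustration):

```python
LSTM_UNITS = 128

def pooled_feature_size(lstm_units, bidirectional=True, num_pools=2):
    """Width of the vector that is fed to the dense layers."""
    directions = 2 if bidirectional else 1
    per_step = lstm_units * directions  # width of h_lstm2 along the last axis
    return per_step * num_pools         # max_pool and avg_pool concatenated

dense_hidden_units = pooled_feature_size(LSTM_UNITS)
print(dense_hidden_units)
```

Because `hidden` is the sum `h_conc + h_conc_linear1 + h_conc_linear2`, the two linear layers must map this width to itself, which is why both are declared as `DENSE_HIDDEN_UNITS` by `DENSE_HIDDEN_UNITS`.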

In [7]:
# helper functions
# characters treated as "English": printable ASCII plus common Spanish and French accented letters
ascii_chars = string.printable
ascii_chars += " áéíóúàèìòùâêîôûäëïöüñõç"

#checks if a string of text contains any non-English characters (punctuation and Spanish/French accented letters are allowed)
def contains_non_english(text):
    if all(char in ascii_chars for char in text):
        return 0
    else:
        return 1
    
#clean non english characters from string of text
def remove_non_english(text):
    return ''.join(filter(lambda x: x in ascii_chars, text))


def get_first_word(word):
    # missing comments show up as float NaN in pandas; return "-1" for those
    if not isinstance(word, float):
        return word.split(" ")[0]
    return "-1"

def get_cap_vs_length(row):
    if row["total_length"] == 0:
        return -1
    return float(row['capitals'])/float(row['total_length'])

def calc_max_word_len(sentence):
    maxLen = 0
    for word in sentence:
        maxLen = max(maxLen, len(word))
    return maxLen

def calc_min_word_len(sentence):
    minLen = 999999
    for word in sentence:
        minLen = min(minLen, len(word))
    return minLen

def calc_total_word_len(sentence):
    cnt = 0
    for x in sentence:
        cnt+=len(x)
    return cnt

def calc_total_unique_word_len(sentence):
    words = set(sentence)
    return calc_total_word_len(words)

#removes all single characters except for "I" and "a"
def remove_singles(text):
    return ' '.join( [w for w in text.split() if ((len(w)>1) or (w.lower() == "i") or (w.lower() == "a"))] )
    
#strips punctuation and symbols, then combines multiple whitespaces into single
def clean_text(x):
    x = str(x)
    for punct in "&/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&':
        x = x.replace(punct, '')
    x = re.sub(r'\s+', ' ', x).strip()
    return x
    
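# Text cleaning
# Inserts a space before '!', '?' or ',' when it directly follows a word character.
# NOTE: `punct` is currently unused; the pattern is hard-coded, so a single pass is enough.
def pad_chars(text,punct):
    return re.sub(r'(?<=\w)([!?,])', r' \1', text)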
    
symbols_iv = """?,./-()"$=…*&+′[ɾ̃]%:^\xa0\\{}–“”;!<`®ạ°#²|~√_α→>—£，。´×@π÷？ʿ€の↑∞ʻ℅в•−а年！∈∩⊆§℃θ±≤͡⁴™си≠∂³ி½△¿¼∆≥⇒¬∨∫▾Ω＾γµº♭ー̂ɔ∑εντσ日Γ∪φβ¹∘¨″⅓ɑː✅✓（）∠«»்ுλ∧∀،＝ɨʋδɒ¸☹μΔʃɸηΣ₅₆◦·ВΦ☺❤♨✌≡ʌʊா≈⁰‛：ﬁ„¾ρ⟨⟩˂⅔≅－＞¢⁸ʒは⬇♀؟¡⋅ɪ₁₂ɤ◌ʱ、▒ْ；☉＄∴✏ωɹ̅।ـ☝♏̉̄♡₄∼́̀⁶⁵¦¶ƒˆ‰©¥∅・ﾟ⊥ª†ℕ│ɡ∝♣／☁✔❓∗➡ℝ位⎛⎝¯⎞⎠↓ɐ∇⋯˚⁻ˈ₃⊂˜̸̵̶̷̴̡̲̳̱̪̗̣̖̎̿͂̓̑̐̌̾̊̕\x92"""        

def split_off_symbols_iv(x):
    for punct in symbols_iv:
        x = x.replace(punct, f' {punct} ')
    return x
    
def neutrailize_bad_words(train,test):
    train1_df = train[train["target"]==1]
    train0_df = train[train["target"]==0]

    ## custom function for ngram generation ##
    def generate_ngrams(text, n_gram=1):
        token = [token for token in text.lower().split(" ") if token != "" if token not in STOPWORDS]
        ngrams = zip(*[token[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    freq_dict_bad = defaultdict(int)
    for sent in train1_df["comment_text"]:
        for word in generate_ngrams(sent):
            freq_dict_bad[word] += 1
    freq_dict_bad = dict(freq_dict_bad)

    freq_dict_good = defaultdict(int)
    for sent in train0_df["comment_text"]:
        for word in generate_ngrams(sent):
            freq_dict_good[word] += 1
    freq_dict_good = dict(freq_dict_good)

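    # keep the 1000 most frequent tokens of each class as sets for fast lookup
    bad_words = set(sorted(freq_dict_bad, key=freq_dict_bad.get, reverse=True)[:1000])
    good_words = set(sorted(freq_dict_good, key=freq_dict_good.get, reverse=True)[:1000])

    # num_bad_words / num_good_words are not defined in the cells shown;
    # assumed meaning: number of tokens in a comment that appear in the word set
    def count_in(text, vocab):
        return sum(1 for token in text.lower().split() if token in vocab)

    print("--- Generating num_bad_words")
    train["num_bad_words"] = train["comment_text"].map(lambda x: count_in(x, bad_words))
    test["num_bad_words"] = test["comment_text"].map(lambda x: count_in(x, bad_words))

    print("--- Generating num_good_words")
    train["num_good_words"] = train["comment_text"].map(lambda x: count_in(x, good_words))
    test["num_good_words"] = test["comment_text"].map(lambda x: count_in(x, good_words))

    return train, test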
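
The `zip(*[token[i:] for i in range(n_gram)])` idiom in `generate_ngrams` lines up the token list with shifted copies of itself, so each tuple it yields is one n-gram. A standalone sketch, assuming a tiny hand-written stopword set in place of `wordcloud.STOPWORDS`:

```python
TOY_STOPWORDS = {"the", "a", "is"}  # hypothetical stand-in for wordcloud.STOPWORDS

def generate_ngrams(text, n_gram=1, stopwords=TOY_STOPWORDS):
    # lowercase, split on single spaces, drop empty tokens and stopwords
    tokens = [t for t in text.lower().split(" ") if t != "" and t not in stopwords]
    # tokens[0:], tokens[1:], ... zipped together give consecutive n-tuples
    shifted = [tokens[i:] for i in range(n_gram)]
    return [" ".join(ngram) for ngram in zip(*shifted)]

unigrams = generate_ngrams("The quick brown fox")
bigrams = generate_ngrams("The quick brown fox", n_gram=2)
print(unigrams)
print(bigrams)
```

Since `zip` stops at the shortest input, the last incomplete window is dropped automatically.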

In [8]:
def preprocess(data):
    '''
    Credit goes to https://www.kaggle.com/gpreda/jigsaw-fast-compact-solution
    '''
    
    # Feature generation
    
    def gen_feats(df):
        start_time = time.time()
        print("--- Generating non_eng")
        df["non_eng"] = df["comment_text"].map(lambda x: contains_non_english(x))

        print("--- Generating first_word")
        df["first_word"] = df["comment_text"].map(lambda x: get_first_word(x))

        print("--- Generating total_length (num chars)")
        df['total_length'] = df['comment_text'].apply(len)

        print("--- Generating capitals")
        df['capitals'] = df['comment_text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))

        print("--- Generating caps_vs_length")
        df['caps_vs_length'] = df.apply(lambda row: get_cap_vs_length(row),axis=1)

        #print("--- Generating num_exclamation_marks")
        #df['num_exclamation_marks'] = df['comment_text'].apply(lambda comment: comment.count('!'))

        print("--- Generating num_question_marks")
        df['num_question_marks'] = df['comment_text'].apply(lambda comment: comment.count('?'))

        print("--- Generating num_punctuation")
        df['num_punctuation'] = df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in '.,;:'))

        #print("--- Generating num_symbols")
        #df['num_symbols'] = df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in '*&$%'))

        print("--- Generating num_words")
        df['num_words'] = df['comment_text'].apply(lambda comment: len(re.sub(r'[^\w\s]','',comment).split(" ")))

        print("--- Generating num_unique_words")
        df['num_unique_words'] = df['comment_text'].apply(lambda comment: len(set(w for w in comment.split())))

        print("--- Generating unique_word_over_num_words")
        df['unique_word_over_num_words'] = df['num_unique_words'] / df['num_words']

        #print("--- Generating num_smilies")
        #df['num_smilies'] = df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in (':-)', ':)', ';-)', ';)')))

        print("--- Generating num_sentences")
        df['num_sentences'] = df['comment_text'].apply(lambda comment: len(re.split(r'[.!?]+', comment)))

        print("--- Generating max_word_len")
        df['max_word_len'] = df['comment_text'].apply(lambda comment: calc_max_word_len(re.sub(r'[^\w\s]','',comment).split(" ")))
        
        print("--- Generating min_word_len")
        df['min_word_len'] = df['comment_text'].apply(lambda comment: calc_min_word_len(re.sub(r'[^\w\s]','',comment).split(" ")))
        
        print("--- Generating total_word_length (num of chars in words)")
        df['total_word_length'] = df['comment_text'].apply(lambda comment: calc_total_word_len(re.sub(r'[^\w\s]','',comment).split(" ")))
        
        print("--- Generating avg_word_len")
        df['avg_word_len'] = df['total_word_length'] / df['num_words']
        
        print("--- Generating total_unique_word_length (num of chars in words)")
        df['total_unique_word_length'] = df['comment_text'].apply(lambda comment: calc_total_unique_word_len(re.sub(r'[^\w\s]','',comment).split(" ")))
        
        print("--- Generating avg_unique_word_len")
        df['avg_unique_word_len'] = df['total_unique_word_length'] / df['num_unique_words']
        
        print("--- Finished Gen Feats")
        print("--- %s seconds ---" % (time.time() - start_time))
        

    def cleanText(df):
        start_time = time.time()
        df['comment_text'] = df['comment_text'].apply(lambda x: split_off_symbols_iv(x)) #increase score
        """print("--- cleaning text")
        df["comment_text"] = df["comment_text"].apply(lambda x: clean_text(x))

        print("--- remove single characters")
        df["comment_text"] = df["comment_text"].apply(lambda x: remove_singles(x))

        print("--- cleaning numbers")
        df["comment_text"] = df["comment_text"].apply(lambda x: clean_numbers(x))

        print("--- cleaning misspellings")
        df["comment_text"] = df["comment_text"].apply(lambda x: replace_typical_misspell(x))

        print("--- filling missing values")
        #clean chinese, korean, japanese characters
        print('cleaning characters')
        df["comment_text"] = df["comment_text"].map(lambda x: remove_non_english(x))
        
        ## fill up the missing values
        df["comment_text"].fillna("").values"""
        print("--- %s seconds ---" % (time.time() - start_time))

        
    gen_feats(data)
    #data["comment_text"] = data["comment_text"].astype(str).apply(lambda x: pad_chars(x, punct))
    cleanText(data)
    # print("--- Neutralizing bad words")
    # neutrailize_bad_words(train,test)
    
    return data
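
To see what a few of the `gen_feats` columns contain, the sketch below computes them for a single string without pandas. `toy_features` is a hypothetical helper; the expressions are copied from the lambdas above.

```python
import re

def toy_features(comment):
    # same word splitting as gen_feats: strip punctuation, then split on spaces
    words = re.sub(r'[^\w\s]', '', comment).split(" ")
    return {
        "total_length": len(comment),
        "capitals": sum(1 for c in comment if c.isupper()),
        "num_question_marks": comment.count('?'),
        "num_words": len(words),
        "num_sentences": len(re.split(r'[.!?]+', comment)),
        "max_word_len": max(len(w) for w in words),
    }

feats = toy_features("Hello World! Is this OK?")
print(feats)
```

One thing this makes visible: `re.split` returns a trailing empty string when the comment ends in `.`, `!` or `?`, so `num_sentences` is 3 here even though the text has two sentences.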

# Preprocessing

In [9]:
train = pd.read_hdf('../input/train.h5')
test = pd.read_hdf('../input/test.h5')
SMALL_DATA = False
if (SMALL_DATA):
    train = train[:100]
    test = test[:100]

print("Preprocessing train data ...")
x_train = preprocess(train)
print("Preprocessing test data ...")
x_test = preprocess(test)

Preprocessing data ...
--- Generating non_eng
--- Generating first_word
--- Generating total_length (num chars)
--- Generating capitals
--- Generating caps_vs_length
--- Generating num_question_marks
--- Generating num_punctuation
--- Generating num_words
--- Generating num_unique_words
--- Generating unique_word_over_num_words
--- Generating num_sentences
--- Generating max_word_len
--- Generating min_word_len
--- Generating total_word_length (num of chars in words)
--- Generating avg_word_len
--- Generating total_unique_word_length (num of chars in words)
--- Generating avg_unique_word_len
--- Finished Gen Feats
--- 9.99587106704712 seconds ---
--- 4.029026985168457 seconds ---
--- Generating non_eng
--- Generating first_word
--- Generating total_length (num chars)
--- Generating capitals
--- Generating caps_vs_length


KeyboardInterrupt: 