Reimplementation of LSTM: GloVe + dropout in PyTorch

In [1]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import pathlib

import torch
import torchtext

from fastai.text import *

In [2]:
path = pathlib.Path('../../data/')
comp = pathlib.Path('competitions/jigsaw-toxic-comment-classification-challenge')
EMBEDDING_FILE  = pathlib.Path('glove/glove.6B.50d.txt')
TRAIN_DATA_FILE = pathlib.Path('train.csv')
TEST_DATA_FILE  = pathlib.Path('test.csv')

In [3]:
embed_size = 50
max_features = 20000
maxlen = 100

#### Read in data & replace missing values

In [4]:
train = pd.read_csv(path/comp/TRAIN_DATA_FILE)
test  = pd.read_csv(path/comp/TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = [col for col in train.columns[2:]]
y = train[list_classes].values
list_sentences_test  = test["comment_text"].fillna("_na_").values

In [5]:
train.head(3)

,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0


If a field is already numerical and not sequential (like the label columns here), set `sequential=False` and `use_vocab=False`:

In [6]:
TEXT  = torchtext.data.Field(lower=True, tokenize="spacy")
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

In [7]:
trn_datafields = [("id", None), ("comment_text", TEXT)]
trn_datafields.append([(col, LABEL) for col in train.columns[2:]])

tst_datafields = [("id", None), ("comment_text", TEXT)]

trn = torchtext.data.TabularDataset.splits(path=path/comp, train=TRAIN_DATA_FILE,
                                           format='csv', skip_header=True,
                                           fields=trn_datafields)
tst = torchtext.data.TabularDataset.splits(path=path/comp, test=TEST_DATA_FILE,
                                           format='csv', skip_header=True,
                                           fields=tst_datafields)

ValueError: too many values to unpack (expected 2)

In [41]:
trn = torchtext.data.TabularDataset.splits(path='../../data/competitions/jigsaw-toxic-comment-classification-challenge/',
                                           train='train.csv',
                                           format='csv', skip_header=True,
                                           fields=trn_datafields)

ValueError: too many values to unpack (expected 2)

In [42]:
trn_datafields

[('id', None),
 ('comment_text', <torchtext.data.field.Field at 0x7f6b33af56a0>),
 [('toxic', <torchtext.data.field.Field at 0x7f6b31114390>),
  ('severe_toxic', <torchtext.data.field.Field at 0x7f6b31114390>),
  ('obscene', <torchtext.data.field.Field at 0x7f6b31114390>),
  ('threat', <torchtext.data.field.Field at 0x7f6b31114390>),
  ('insult', <torchtext.data.field.Field at 0x7f6b31114390>),
  ('identity_hate', <torchtext.data.field.Field at 0x7f6b31114390>)]]
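
The nested list at the end of `trn_datafields` appears to explain the `ValueError`: `list.append` adds the whole list comprehension as a single element, and that element holds six pairs, so it cannot be unpacked into one `(name, field)` pair. A minimal sketch of the difference, using placeholder strings instead of real `Field` objects (`TEXT`, `LABEL`, and `label_cols` below are stand-ins for illustration only):

```python
# Placeholder strings stand in for the torchtext Field objects.
TEXT, LABEL = "TEXT_FIELD", "LABEL_FIELD"
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# append() adds the whole list as ONE nested element.
nested = [("id", None), ("comment_text", TEXT)]
nested.append([(col, LABEL) for col in label_cols])

# extend() adds each (name, field) pair as its own element.
flat = [("id", None), ("comment_text", TEXT)]
flat.extend([(col, LABEL) for col in label_cols])

print(len(nested), len(flat))  # 3 8

# Every entry of the flat list unpacks into exactly two values.
for name, field in flat:
    pass

# The last entry of the nested list holds six pairs, so unpacking fails.
try:
    for name, field in nested:
        pass
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)
```

Under that reading, building `trn_datafields` with `extend` instead of `append` should give the flat list of pairs.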

In [None]:
for (name, field), val in zip(field)

Looking at how to use fastai's tokenizer (a wrapper around spaCy):

In [5]:
temp_sent = list_sentences_train[0]
temp_sent

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [6]:
tokenizer = Tokenizer()
tokenizer.spacy_tok(temp_sent)[:10]

['Explanation',
 '\n',
 'Why',
 'the',
 'edits',
 'made',
 'under',
 'my',
 'username',
 'Hardcore']

The tokenizer works on a single string. Next: find a way to tokenize the entire corpus quickly, without a slow Python for-loop.

In [7]:
tokenizer = Tokenizer()

In [9]:
np.apply_along_axis(tokenizer.spacy_tok, 0, list_sentences_train)

<fastai.text.Tokenizer at 0x7ff150a6e748>

In [13]:
temp = ['a b', 'cb j']
tokenizer.spacy_tok(temp[1])

['cb', 'j']

In [57]:
spacy_tok = spacy.load('en')

In [8]:
import torchtext.data
import torchtext.vocab

In [9]:
TEXT = torchtext.data.Field(lower=True, tokenize='spacy')

Fighting with `np.apply_along_axis` to tokenize the corpus in a vectorized fashion.

In [55]:
np.apply_along_axis(tokenizer.spacy_tok, 0, list_sentences_train)

AxisError: axis 0 is out of bounds for array of dimension 0

In [44]:
temp = np.array([i for i in range(10)])
np.apply_along_axis(lambda x: x+2, 0, temp)

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
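
The toy example works because `np.apply_along_axis` passes 1-D slices of the array to the function, not individual elements. For a 1-D input there is only one slice, the whole array, and `x + 2` broadcasts over it. A small sketch that records what the function actually receives (`record_and_add` and `seen` are names made up for this illustration):

```python
import numpy as np

seen = []  # records every argument that NumPy passes to the function

def record_and_add(x):
    seen.append(x)
    return x + 2

arr = np.arange(10)
result = np.apply_along_axis(record_and_add, 0, arr)

print(len(seen))      # 1: the function is called once
print(seen[0].shape)  # (10,): it receives the whole array
print(result)
```

So a function that only accepts a single string would be handed the whole array of comments rather than one comment at a time.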

In [45]:
type(list_sentences_train)

numpy.ndarray

In [47]:
np.apply_along_axis(lambda x: x, 0, list_sentences_train)

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       ..., 'Spitzer \n\nUmm, theres no actual article for prostitution ring.  - Crunch Captain.',
       'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.',
       '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes 

In [48]:
temp.shape

(10,)

In [49]:
list_sentences_train.shape

(159571,)

In [53]:
tokenizer.spacy_tok(['a b'])

TypeError: expected string or bytes-like object
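
The `TypeError` suggests that `spacy_tok` expects one string per call, so the corpus has to be fed to it one comment at a time. A minimal sketch, assuming a hypothetical stand-in tokenizer `simple_tok` (plain whitespace splitting, used only so the example runs without spaCy or fastai) and a toy corpus in place of `list_sentences_train`:

```python
import numpy as np

def simple_tok(text):
    # Hypothetical stand-in for tokenizer.spacy_tok: takes ONE string.
    return text.split()

# Toy stand-in for list_sentences_train.
corpus = np.array(["a b", "cb j"], dtype=object)

# One call per string; the result is a list of token lists.
tokens = [simple_tok(s) for s in corpus]
print(tokens)  # [['a', 'b'], ['cb', 'j']]
```

NumPy's documentation describes `np.vectorize` as essentially a for loop provided for convenience rather than performance, so for string data a comprehension like this is unlikely to be slower than a NumPy wrapper around the same tokenizer.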