# PyTorch LSTM: GloVe + dropout

This is a PyTorch reimplementation of J. Howard's [Improved LSTM baseline: GloVe + dropout](https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout) Kaggle kernel. The original kernel achieves a private leaderboard score of **0.09783**, which ties for place 2747 of 4551.

-- Wayne Nixalo

---

Imports

In [1]:
import os
import pathlib

import numpy as np
import pandas as pd
import torchtext
from torchtext.data import Field, TabularDataset
# import spacy

Paths

In [2]:
data_path = pathlib.Path('../../data')
comp_path = pathlib.Path(data_path/'competitions/jigsaw-toxic-comment-classification-challenge')
EMBEDDING_FILE = 'glove/glove.6B.50d.txt'
TRAIN_DATA_FILE= 'train.csv'
TEST_DATA_FILE = 'test.csv'
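
A GloVe text file such as `glove.6B.50d.txt` stores one token per line, followed by its vector components separated by spaces. A minimal sketch of parsing that format is given here; the helper name `load_glove` and the choice to skip malformed lines are illustrative assumptions, not part of the original kernel, and the sample uses made-up 3-dimensional vectors instead of the real file.

```python
import io

def load_glove(lines, dim):
    """Parse GloVe-style text lines into a {token: vector} dict.

    `load_glove` is a hypothetical helper written for illustration.
    Lines that do not have exactly `dim` components are skipped.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # assumption: silently ignore malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Toy stand-in for the real file (made-up values, 3 dimensions)
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\n")
glove = load_glove(sample, dim=3)
print(glove["cat"])
```

With the real embeddings, the same function could be fed an open file handle and `dim=embed_sz`; where `EMBEDDING_FILE` lives relative to `data_path` is not stated in this notebook and would need to be checked.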

Config parameters

In [3]:
embed_sz = 50    # embedding vector columns (factors)
max_feat = 20000 # embedding vector rows    (words)
maxlen   = 100   # words in comment to use
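
To make the roles of `max_feat` and `maxlen` concrete, here is a plain-Python sketch of capping a vocabulary and padding or truncating a comment to a fixed length. The helper names, the `<unk>`/`<pad>` tokens, and the choice to pad on the right and keep the first `maxlen` tokens are assumptions made for illustration. In torchtext the corresponding knobs would likely be `Field(fix_length=...)` and `build_vocab(max_size=...)`.

```python
from collections import Counter

embed_sz = 50     # embedding vector columns (factors)
max_feat = 20000  # embedding vector rows    (words)
maxlen = 100      # words in comment to use

UNK, PAD = "<unk>", "<pad>"  # hypothetical special tokens

def build_vocab(tokenized_docs, max_size):
    """Map the `max_size` most frequent tokens to integer ids."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi, length):
    """Convert tokens to ids, truncating or right-padding to `length`."""
    ids = [stoi.get(tok, stoi[UNK]) for tok in tokens[:length]]
    return ids + [stoi[PAD]] * (length - len(ids))

docs = [["you", "are", "great"], ["you", "are", "not"]]
stoi = build_vocab(docs, max_size=max_feat)
row = numericalize(docs[0], stoi, maxlen)
print(len(row), row[:5])
```

Every comment then becomes a row of exactly `maxlen` ids, and the embedding matrix needs at most `max_feat` rows plus the special tokens, each of width `embed_sz`.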

Data Loading

In [4]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# train = pd.read_csv(comp_path/TRAIN_DATA_FILE)
# test  = pd.read_csv(comp_path/TEST_DATA_FILE)

In [13]:
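# Torchtext Field objects (preprocessing/numericalization)
TEXT = Field(sequential=True, tokenize="spacy", lower=True)
# SEE: Aside 1
LABEL = Field(sequential=False, use_vocab=False)

# Dataset construction (reads data from disk).
# Fields are matched to CSV columns by position; a None field skips that column.
trn_datafields = [("id",None), ("comment_text",TEXT)]
# extend (not append) so each (name, field) tuple becomes its own entry
trn_datafields.extend([(clss,LABEL) for clss in list_classes])

tst_datafields = [("id",None), ("comment_text",TEXT)]

# TabularDataset takes the path to a single file;
# the train=/test= keywords belong to TabularDataset.splits instead.
trn_dataset = TabularDataset(path=str(comp_path/TRAIN_DATA_FILE),
                             format='csv', skip_header=True,
                             fields=trn_datafields)
tst_dataset = TabularDataset(path=str(comp_path/TEST_DATA_FILE),
                             format='csv', skip_header=True,
                             fields=tst_datafields)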
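
When the fields are given as a list, `TabularDataset` pairs them with the CSV columns by position, so the list needs one `(name, field)` tuple per column. The sketch below builds the training field list with placeholder strings standing in for the real `Field` objects and lines it up against the header. The header order (`id`, `comment_text`, then the six labels) is an assumption about `train.csv` that this notebook does not show.

```python
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Placeholders standing in for the torchtext Field objects
TEXT, LABEL = "TEXT", "LABEL"

trn_datafields = [("id", None), ("comment_text", TEXT)]
# extend adds each tuple as its own entry (append would nest the whole list)
trn_datafields.extend((clss, LABEL) for clss in list_classes)

# Assumed column order of train.csv
csv_header = ["id", "comment_text"] + list_classes

for column, (name, field) in zip(csv_header, trn_datafields):
    print(f"{column:>14} -> {field}")
```

The `id` column maps to `None` and is therefore dropped, while each label column gets the same `LABEL` field.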

---

## Misc / Asides / Notes

### Aside 1

The labels are already binary-encoded (0 or 1), so they need no vocabulary lookup to be numericalized. That is why `LABEL` is created with `use_vocab=False`.

In [35]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv(comp_path/TRAIN_DATA_FILE)
# test  = pd.read_csv(comp_path/TEST_DATA_FILE)

In [40]:
train[list_classes][55:65]

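|    | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|----|-------|--------------|---------|--------|--------|---------------|
| 55 | 1 | 1 | 1 | 0 | 1 | 0 |
| 56 | 1 | 0 | 1 | 0 | 1 | 0 |
| 57 | 0 | 0 | 0 | 0 | 0 | 0 |
| 58 | 1 | 0 | 1 | 0 | 0 | 0 |
| 59 | 1 | 0 | 0 | 0 | 0 | 0 |
| 60 | 0 | 0 | 0 | 0 | 0 | 0 |
| 61 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62 | 0 | 0 | 0 | 0 | 0 | 0 |
| 63 | 0 | 0 | 0 | 0 | 0 | 0 |
| 64 | 0 | 0 | 0 | 0 | 0 | 0 |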
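
The rows above suggest that every label column holds only 0 or 1. A small sketch of checking that property programmatically follows. It uses the standard `csv` module on rows 55 to 59 copied from the output above, so it only demonstrates the check; confirming it for the whole dataset would mean running the same logic over all of `train.csv`.

```python
import csv
import io

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Rows 55-59 from the output above, label columns only
sample_csv = """toxic,severe_toxic,obscene,threat,insult,identity_hate
1,1,1,0,1,0
1,0,1,0,1,0
0,0,0,0,0,0
1,0,1,0,0,0
1,0,0,0,0,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Distinct values observed in each label column
seen = {cls: {int(r[cls]) for r in rows} for cls in list_classes}

# True when no column contains anything other than 0 or 1
all_binary = all(values <= {0, 1} for values in seen.values())
print(all_binary)
```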
