We will start by downloading the 20 Newsgroups text dataset:

```http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset```

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroup_train = fetch_20newsgroups(subset='train')
newsgroup_test = fetch_20newsgroups(subset='test') # we will use it later

In [2]:
# Split train data into actual train and validation sets

train_split = 10000
train_data = newsgroup_train.data[:train_split]
train_targets = newsgroup_train.target[:train_split]

val_data = newsgroup_train.data[train_split:]
val_targets = newsgroup_train.target[train_split:]

test_data = newsgroup_test.data
test_targets = newsgroup_test.target

print ("Train dataset size is {}".format(len(train_data)))
print ("Val dataset size is {}".format(len(val_data)))
print ("Test dataset size is {}".format(len(test_data)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532


In [7]:
val_targets

array([7, 6, 2, ..., 3, 1, 8])

In [3]:
# Random sample from train dataset
import random
print (train_data[random.randint(0, len(train_data) - 1)])

From: klf@druwa.ATT.COM (FranklinKL)
Subject: Re: Dumbest automotive concepts of all time
Summary: Continental the first - Don't think so!
Distribution: na
Lines: 30

In article <C5HHGM.1rM@constellation.ecn.uoknor.edu>, callison@uokmax.ecn.uoknor.edu (James P. Callison) writes:
| In article <1993Apr13.215605.26252@slcs.slb.com> dcd@se.houston.geoquest.slb.com (Dan Day) writes:
| >In article <1q4466INNb85@ctron-news.ctron.com> smith@ctron.com writes:
| >>
| >>It's a big aftermarket business.  Almost no cars come from the factory with
| >>vynal any more, and any fake "convertible" job _definitely_ came from some
| >>aftermarket place.  What amazes me is how much people are willing to pay for
| >>bad taste
| >
| >How about those really ugly fake wheel compartments stuck onto the
| >trunk or side (or both sides!) of some tacky luxury cars?
| 
| Some of 'em aren't fake (if you're talking about the Continental kit,
| named after the Lincoln Continental, the first car to sport one). I
| pers

Before training the classifier, we first tokenize the dataset using spaCy (spacy.io).

Run (shown in the cell below):

* ```pip install spacy```
* ```python -m spacy download en_core_web_sm```

In [4]:
!pip install spacy
!python -m spacy download en_core_web_sm


    Linking successful
    /Users/derin/miniconda3/lib/python3.6/site-packages/en_core_web_sm -->
    /Users/derin/miniconda3/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [6]:
# Let's write the tokenization function 

import spacy
import string

# Load English tokenizer, tagger, parser, NER and word vectors
tokenizer = spacy.load('en_core_web_sm')
punctuations = string.punctuation

# lowercase and remove punctuation
def tokenize(sent):
    tokens = tokenizer(sent)
    return [token.text.lower() for token in tokens if (token.text not in punctuations)]

# Example
tokens = tokenize(u'Apple is looking at buying U.K. startup for $1 billion')
print (tokens)

['apple', 'is', 'looking', 'at', 'buying', 'u.k.', 'startup', 'for', '1', 'billion']
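
One detail of the filter above is worth a quick check: `punctuations` is a single string, so `token.text not in punctuations` is a substring test rather than a per-character test. The standard-library sketch below illustrates the consequence on a hand-written token list. `keep_token_strict` is a hypothetical alternative introduced only for comparison; it is not part of the notebook.

```python
import string

punctuations = string.punctuation  # one string of ASCII punctuation characters

def keep_token(text):
    # Same test as in tokenize(): drop the token if it is a substring of the punctuation string
    return text not in punctuations

def keep_token_strict(text):
    # Hypothetical alternative: drop the token if every character is punctuation
    return not all(ch in punctuations for ch in text)

sample = ["hello", ",", "...", "--", "u.k.", "!"]
kept = [t for t in sample if keep_token(t)]
kept_strict = [t for t in sample if keep_token_strict(t)]

print(kept)         # ['hello', '...', '--', 'u.k.']
print(kept_strict)  # ['hello', 'u.k.']
```

Multi-character punctuation tokens such as `...` survive the substring test, so they can end up in the vocabulary. Whether that matters for this task is an open question; the sketch only shows the behavior.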


In [13]:
# This is the code cell that tokenizes train/val/test datasets
# However it takes about 15-20 minutes to run it
# For convenience we have provided the preprocessed datasets
# Please see the next code cell
import pickle as pkl

def tokenize_dataset(dataset):
    token_dataset = []
    # we are keeping track of all tokens in dataset 
    # in order to create vocabulary later
    all_tokens = []
    
    for sample in dataset:
        tokens = tokenize(sample)
        token_dataset.append(tokens)
        all_tokens += tokens

    return token_dataset, all_tokens

# val set tokens
print ("Tokenizing val data")
val_data_tokens, _ = tokenize_dataset(val_data)
pkl.dump(val_data_tokens, open("val_data_tokens.p", "wb"))

# test set tokens
print ("Tokenizing test data")
test_data_tokens, _ = tokenize_dataset(test_data)
pkl.dump(test_data_tokens, open("test_data_tokens.p", "wb"))

# train set tokens
print ("Tokenizing train data")
train_data_tokens, all_train_tokens = tokenize_dataset(train_data)
pkl.dump(train_data_tokens, open("train_data_tokens.p", "wb"))
pkl.dump(all_train_tokens, open("all_train_tokens.p", "wb"))

Tokenizing val data
Tokenizing test data
Tokenizing train data
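
The cell above passes `open(...)` directly to `pkl.dump`, which leaves closing the file to the garbage collector. A minimal sketch of the same save/load round trip with context managers follows. The helper names `save_pickle` and `load_pickle` and the toy data are assumptions made for illustration.

```python
import os
import pickle as pkl
import tempfile

def save_pickle(obj, path):
    # The with-block closes the file handle even if pkl.dump raises
    with open(path, "wb") as f:
        pkl.dump(obj, f)

def load_pickle(path):
    with open(path, "rb") as f:
        return pkl.load(f)

# Toy stand-in for a tokenized dataset (hypothetical data)
toy_tokens = [["hello", "world"], ["bag", "of", "words"]]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tokens.p")
    save_pickle(toy_tokens, path)
    restored = load_pickle(path)

print(restored == toy_tokens)  # True
```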


In [14]:
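# Alternatively, try running the following batched version of tokenization (it streams documents through tokenizer.pipe)
# Credit to Ilya Kulikov
from tqdm import tqdm_notebook  # progress bar used in the loop below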

def lower_case_remove_punc(parsed):
    return [token.text.lower() for token in parsed if (token.text not in punctuations)]

def tokenize_dataset(dataset):
    token_dataset = []
    # we are keeping track of all tokens in dataset
    # in order to create vocabulary later
    all_tokens = []

    for sample in tqdm_notebook(tokenizer.pipe(dataset, disable=['parser', 'tagger', 'ner'], batch_size=512, n_threads=1)):
        tokens = lower_case_remove_punc(sample)
        token_dataset.append(tokens)
        all_tokens += tokens

    return token_dataset, all_tokens

# val set tokens
#print ("Tokenizing val data")
#val_data_tokens, _ = tokenize_dataset(val_data)
#pkl.dump(val_data_tokens, open("val_data_tokens.p", "wb"))

# test set tokens
#print ("Tokenizing test data")
#test_data_tokens, _ = tokenize_dataset(test_data)
#pkl.dump(test_data_tokens, open("test_data_tokens.p", "wb"))

# train set tokens
#print ("Tokenizing train data")
#train_data_tokens, all_train_tokens = tokenize_dataset(train_data)
#pkl.dump(train_data_tokens, open("train_data_tokens.p", "wb"))
#pkl.dump(all_train_tokens, open("all_train_tokens.p", "wb"))

In [15]:
# First, download datasets from here
# Use your NYU account
#https://drive.google.com/open?id=1eR2LFI5MGliHlaL1S2nsX4ouIO1k_ip2
#https://drive.google.com/open?id=133QCWbiz_Xc7Qm4r6t-fJP1K669xjNlM
#https://drive.google.com/open?id=1SuUIUpJ1iznU707ktkpnEGSwt_XIqOYp
#https://drive.google.com/open?id=1UQsrZ2LVfcxdxxa47344fMs_qvya72KR
import pickle as pkl
# Then, load preprocessed train, val and test datasets
train_data_tokens = pkl.load(open("train_data_tokens.p", "rb"))
all_train_tokens = pkl.load(open("all_train_tokens.p", "rb"))

val_data_tokens = pkl.load(open("val_data_tokens.p", "rb"))
test_data_tokens = pkl.load(open("test_data_tokens.p", "rb"))

# double checking
print ("Train dataset size is {}".format(len(train_data_tokens)))
print ("Val dataset size is {}".format(len(val_data_tokens)))
print ("Test dataset size is {}".format(len(test_data_tokens)))

print ("Total number of tokens in train dataset is {}".format(len(all_train_tokens)))

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532
Total number of tokens in train dataset is 3433739


Now, we are going to create the vocabulary of most common 10,000 tokens in the training set.

In [16]:
from collections import Counter

max_vocab_size = 10000
# save index 0 for pad and 1 for unk
PAD_IDX = 0
UNK_IDX = 1

def build_vocab(all_tokens):
    # Returns:
    # id2token: list of tokens, where id2token[i] returns token that corresponds to index i
    # token2id: dictionary where keys represent tokens and corresponding values represent indices
    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab_size))
    id2token = list(vocab)
    token2id = dict(zip(vocab, range(2,2+len(vocab)))) 
    id2token = ['<pad>', '<unk>'] + id2token
    token2id['<pad>'] = PAD_IDX 
    token2id['<unk>'] = UNK_IDX
    return token2id, id2token

token2id, id2token = build_vocab(all_train_tokens)
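
To see what this vocabulary construction produces, here is a small standard-library sketch on a toy token list. `build_toy_vocab` and the toy data are hypothetical; the function follows the same idea as `build_vocab` (most frequent tokens, shifted by two reserved indices) but is not the notebook's function.

```python
from collections import Counter

PAD_IDX, UNK_IDX = 0, 1

def build_toy_vocab(all_tokens, max_vocab_size):
    # Keep only the max_vocab_size most frequent tokens
    counts = Counter(all_tokens)
    vocab = [tok for tok, _ in counts.most_common(max_vocab_size)]
    # Reserve index 0 for padding and index 1 for unknown tokens
    id2token = ["<pad>", "<unk>"] + vocab
    token2id = {tok: idx for idx, tok in enumerate(id2token)}
    return token2id, id2token

toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
toy_token2id, toy_id2token = build_toy_vocab(toy_tokens, max_vocab_size=2)

print(toy_id2token)  # ['<pad>', '<unk>', 'the', 'cat']
print(toy_token2id)  # {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3}
```

Note that the resulting vocabulary has `max_vocab_size + 2` entries because of the two reserved tokens, which is why the model later uses `len(id2token)` as the embedding table size.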

In [17]:
# Let's check the dictionary by loading a random token from it

random_token_id = random.randint(0, len(id2token)-1)
random_token = id2token[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id[random_token]))

Token id 87 ; token other
Token other; token id 87


In [18]:
# convert token to id in the dataset
def token2index_dataset(tokens_data):
    indices_data = []
    for tokens in tokens_data:
        index_list = [token2id[token] if token in token2id else UNK_IDX for token in tokens]
        indices_data.append(index_list)
    return indices_data

train_data_indices = token2index_dataset(train_data_tokens)
val_data_indices = token2index_dataset(val_data_tokens)
test_data_indices = token2index_dataset(test_data_tokens)

# double checking
print ("Train dataset size is {}".format(len(train_data_indices)))
print ("Val dataset size is {}".format(len(val_data_indices)))
print ("Test dataset size is {}".format(len(test_data_indices)))

######### !!!! Look at this

Train dataset size is 10000
Val dataset size is 1314
Test dataset size is 7532
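
The conversion keeps the number of documents unchanged but replaces every out-of-vocabulary token with `UNK_IDX`. A minimal sketch of that lookup on a toy vocabulary is shown below. `tokens_to_indices`, the toy vocabulary, and the `unk_rate` diagnostic are assumptions added for illustration.

```python
UNK_IDX = 1
toy_token2id = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3}  # hypothetical vocabulary

def tokens_to_indices(tokens, token2id):
    # dict.get falls back to UNK_IDX for any token missing from the vocabulary
    return [token2id.get(tok, UNK_IDX) for tok in tokens]

indices = tokens_to_indices(["the", "dog", "chased", "the", "cat"], toy_token2id)
unk_rate = indices.count(UNK_IDX) / len(indices)

print(indices)   # [2, 1, 1, 2, 3]
print(unk_rate)  # 0.4
```

Checking the fraction of `<unk>` indices on the validation set is one quick way to judge whether the vocabulary size is large enough.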


Now we are going to create a PyTorch DataLoader.

In [19]:
MAX_SENTENCE_LENGTH = 200

import numpy as np
import torch
from torch.utils.data import Dataset

class NewsGroupDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list):
        """
        @param data_list: list of newsgroup tokens 
        @param target_list: list of newsgroup targets 

        """
        self.data_list = data_list
        self.target_list = target_list
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        
        token_idx = self.data_list[key][:MAX_SENTENCE_LENGTH]
        label = self.target_list[key]
        return [token_idx, len(token_idx), label]

def newsgroup_collate_func(batch):
    """
    Customized function for DataLoader that dynamically pads the batch so that all 
    data have the same length
    """
    data_list = []
    label_list = []
    length_list = []
    #print("collate batch: ", batch[0][0])
    #batch[0][0] = batch[0][0][:MAX_SENTENCE_LENGTH]
    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
    # padding
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]), 
                                pad_width=((0,MAX_SENTENCE_LENGTH-datum[1])), 
                                mode="constant", constant_values=0)
        data_list.append(padded_vec)
    return [torch.from_numpy(np.array(data_list)), torch.LongTensor(length_list), torch.LongTensor(label_list)]

# create pytorch dataloader
#train_loader = NewsGroupDataset(train_data_indices, train_targets)
#val_loader = NewsGroupDataset(val_data_indices, val_targets)
#test_loader = NewsGroupDataset(test_data_indices, test_targets)

BATCH_SIZE = 32
train_dataset = NewsGroupDataset(train_data_indices, train_targets)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

val_dataset = NewsGroupDataset(val_data_indices, val_targets)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=True)

test_dataset = NewsGroupDataset(test_data_indices, test_targets)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=newsgroup_collate_func,
                                           shuffle=False)

#for i, (data, lengths, labels) in enumerate(train_loader):
#    print (data)
#    print (labels)
#    break
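
The dataset truncates each document to `MAX_SENTENCE_LENGTH` tokens, and the collate function then pads every document up to that same fixed length (not to the longest document in the batch). The dependency-free sketch below shows the same truncate-then-pad step on plain lists. `truncate_and_pad` and the small `MAX_LEN` value are hypothetical stand-ins.

```python
PAD_IDX = 0
MAX_LEN = 5  # hypothetical small value; the notebook uses MAX_SENTENCE_LENGTH = 200

def truncate_and_pad(indices, max_len=MAX_LEN, pad_idx=PAD_IDX):
    # Cut long sequences, then right-pad short ones with the padding index
    truncated = indices[:max_len]
    length = len(truncated)
    padded = truncated + [pad_idx] * (max_len - length)
    return padded, length

batch = [[4, 7, 9], [5, 3, 8, 2, 6, 11, 12]]
padded_batch, lengths = zip(*(truncate_and_pad(seq) for seq in batch))

print(list(padded_batch))  # [[4, 7, 9, 0, 0], [5, 3, 8, 2, 6]]
print(list(lengths))       # [3, 5]
```

The returned lengths are the true (post-truncation) lengths, which the model needs later to average only over real tokens.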

Here we will define a Bag-of-Words model in PyTorch.

In [20]:
# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWords(nn.Module):
    """
    BagOfWords classification model
    """
    def __init__(self, vocab_size, emb_dim):
        """
        @param vocab_size: size of the vocabulary. 
        @param emb_dim: size of the word embedding
        """
        super(BagOfWords, self).__init__()
        # pay attention to padding_idx 
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.linear = nn.Linear(emb_dim,20)
    
    def forward(self, data, length):
        """
        
        @param data: matrix of size (batch_size, max_sentence_length). Each row in data represents a 
            newsgroup post encoded as token indices. Note that rows are padded to have the same length.
        @param length: an int tensor of size (batch_size), which holds the non-padded
            length of each post in the data.
        """
        out = self.embed(data)
        out = torch.sum(out, dim=1)
        out /= length.view(length.size()[0],1).expand_as(out).float()
     
        # return logits
        out = self.linear(out.float())
        return out

emb_dim = 100
model = BagOfWords(len(id2token), emb_dim)
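
The forward pass sums the embeddings along the sequence dimension and divides by the true length. Because the padding row of the embedding table is zero, this equals the mean over the real tokens only. The NumPy sketch below checks that identity on toy values. The embedding matrix here is random and hypothetical, and NumPy stands in for PyTorch purely to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4
embedding = rng.normal(size=(vocab_size, emb_dim))
embedding[0] = 0.0  # row 0 plays the role of padding_idx=0 (a zero vector)

# Two padded sequences with true lengths 3 and 2
data = np.array([[2, 3, 5, 0, 0],
                 [4, 1, 0, 0, 0]])
lengths = np.array([3, 2])

embedded = embedding[data]            # shape (batch, max_len, emb_dim)
summed = embedded.sum(axis=1)         # padding rows add nothing to the sum
averaged = summed / lengths[:, None]  # divide each row by its true length

# Reference: mean over the non-padding tokens of the first sequence
expected_first = embedding[[2, 3, 5]].mean(axis=0)
print(np.allclose(averaged[0], expected_first))  # True
```

Dividing by the padded length instead of the true length would shrink the representation of short documents, which is why the lengths are passed through the data loader.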

In [22]:
learning_rate = 0.01
num_epochs = 10 # number of epochs to train

# Criterion and Optimizer
criterion = torch.nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Function for testing the model
def test_model(loader, model):
    """
    Helper function that tests the model's performance on a dataset
    @param: loader - data loader for the dataset to test against
    """
    correct = 0
    total = 0
    model.eval()
    for data, lengths, labels in loader:
        data_batch, length_batch, label_batch = data, lengths, labels
        outputs = F.softmax(model(data_batch, length_batch), dim=1)
        predicted = outputs.max(1, keepdim=True)[1]
        
        total += labels.size(0)
        correct += predicted.eq(labels.view_as(predicted)).sum().item()
    return (100 * correct / total)

for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        data_batch, length_batch, label_batch = data, lengths, labels
        optimizer.zero_grad()
        outputs = model(data_batch, length_batch)
        loss = criterion(outputs, label_batch)
        loss.backward()
        optimizer.step()
        # validate every 100 iterations
        if i > 0 and i % 100 == 0:
            # validate
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format( 
                       epoch+1, num_epochs, i+1, len(train_loader), val_acc))


Epoch: [1/10], Step: [101/313], Validation Acc: 47.03196347031963
Epoch: [1/10], Step: [201/313], Validation Acc: 71.91780821917808
Epoch: [1/10], Step: [301/313], Validation Acc: 80.441400304414
Epoch: [2/10], Step: [101/313], Validation Acc: 83.10502283105023
Epoch: [2/10], Step: [201/313], Validation Acc: 84.93150684931507
Epoch: [2/10], Step: [301/313], Validation Acc: 86.91019786910198
Epoch: [3/10], Step: [101/313], Validation Acc: 87.89954337899543
Epoch: [3/10], Step: [201/313], Validation Acc: 88.43226788432268
Epoch: [3/10], Step: [301/313], Validation Acc: 88.35616438356165
Epoch: [4/10], Step: [101/313], Validation Acc: 88.96499238964992
Epoch: [4/10], Step: [201/313], Validation Acc: 88.88888888888889
Epoch: [4/10], Step: [301/313], Validation Acc: 88.35616438356165
Epoch: [5/10], Step: [101/313], Validation Acc: 89.19330289193303
Epoch: [5/10], Step: [201/313], Validation Acc: 88.50837138508372
Epoch: [5/10], Step: [301/313], Validation Acc: 89.19330289193303
Epoch: [6/10

In [23]:
print ("After training for {} epochs".format(num_epochs))
print ("Val Acc {}".format(test_model(val_loader, model)))
print ("Test Acc {}".format(test_model(test_loader, model)))

After training for 10 epochs
Val Acc 88.66057838660578
Test Acc 79.68667020711631
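
The accuracy values above come from taking the highest-scoring class per example and comparing it with the label. `test_model` applies a softmax first, but softmax is monotonic, so the predicted class is the same as the argmax of the raw logits. A small standard-library sketch with hypothetical toy logits:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest value
    return max(range(len(values)), key=values.__getitem__)

def accuracy(batch_logits, labels):
    # Percentage of examples whose top-scoring class matches the label
    correct = sum(argmax(row) == y for row, y in zip(batch_logits, labels))
    return 100 * correct / len(labels)

toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.5, 1.0, 0.0], [-0.3, 0.9, 0.4]]
toy_labels = [0, 2, 1, 1]

print(accuracy(toy_logits, toy_labels))  # 75.0
print(all(argmax(softmax(r)) == argmax(r) for r in toy_logits))  # True
```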


## Exercise 1
### Try training the model with a larger embedding size and for a larger number of epochs
### Also plot the training curves of the model

## Exercise 2:
### Try downloading the IMDB Large Movie Review Dataset, which is used for Assignment 1: http://ai.stanford.edu/~amaas/data/sentiment/
### and tokenize it

## Exercise 3:
### If you have time, after tokenizing the dataset try training a Bag-of-Words model on it and report your initial results
### on the validation set.