## **_Using News Data to Predict Movements in the Financial Movements_**

We'll be using four apporaches here:

* Continuous Bag of Words Model
* Neural Network Model with Glove Word Embeddings
* RNN Models using Word Embeddings
* Character Level RNN Model

In [0]:
%load_ext autoreload
%autoreload 2

import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import torchtext
from torchtext import data

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

import matplotlib.pyplot as plt

from collections import Counter, defaultdict
import operator
import os, math
import random
import copy
import string
import multiprocessing as mp
import time

from split_data import split_data

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:
# set the random seeds so the experiments can be replicated exactly
random.seed(72689)
np.random.seed(72689)
torch.manual_seed(72689)
if torch.cuda.is_available():
    torch.cuda.manual_seed(72689)

# Global class labels.
POS_LABEL = 'up'
NEG_LABEL = 'down'


**Reading in all the Data**

In [0]:
all_data = pd.read_csv("ProcessedData/CombinedData.csv")
all_data.drop(columns=['Unnamed: 0'], inplace=True)
all_data.head()

Unnamed: 0,Title,Date,Content,OpenMove,CloseMove
0,Top U.S. General Praises Iran-Backed Shiite Mi...,2017-01-04,The top commander of the U.S.-led coalition ag...,1.0,1.0
1,Extremists Turn to a Leader to Protect Western...,2017-01-04,As the founder of the Traditionalist Worker Pa...,1.0,1.0
2,How Julian Assange evolved from pariah to paragon,2017-01-04,President-elect Donald Trump tweeted some pra...,1.0,1.0
3,House panel recommends cutting funding for Pla...,2017-01-04,A House panel formed by Republicans to invest...,1.0,1.0
4,Missouri Bill: Gun-Banning Businesses Liable f...,2017-01-04,As Missouri lawmakers convene for the 2017 leg...,1.0,1.0


*Using a Small Subset of Data fro Development*

In [0]:
data_sample = all_data.sample(10000, random_state=68)
data_sample.reset_index(drop=True, inplace=True)
data_sample.head()

## **Preprocessing the Data For Feeding Into The Model**

Preprocessing Involves (in our case):
* Turning All Words into lower/upper case, Normalization
* removing punctuations, accent marks and other diacritics
* removing stop words, sparse terms, and particular words
* Lemmatize using NLTK (It's generally better than Stemming, but way slower)

In [0]:
# Removing all Punctuation
def remove_punctuation(text):
    more_puncs = '—'+ '’'+ '“'+ '”'+ '…'
    return text.translate(str.maketrans('', '', string.punctuation+more_puncs))

# Removing all Stop Words
def remove_stopwords(text, stop_words):
    text = word_tokenize(text)
    return  " ".join([i for i in text if i not in stop_words])

def lemmetize(text, lemmatizer, pos_tag_dict):
    text = word_tokenize(text)
    pos = nltk.pos_tag(text)
    results = []
    for pair in pos:
        tag = pos_tag_dict.get(pair[1][0],wordnet.NOUN)
        results.append(lemmatizer.lemmatize(pair[0], tag))
        
    return " ".join(results)

In [0]:
sub_data = all_data[['Content', 'CloseMove']]

The pre_process function below performs all the preprocessing we defined above. 

In [0]:
def pre_process(df):
    # Normalization
#     df['Title'] = df['Title'].str.lower()
    df['Content'] = df['Content'].str.lower()

    # Removing Punctuation
#     df['Title'] = df['Title'].apply(remove_punctuation)
    df['Content'] = df['Content'].apply(remove_punctuation)
    
    STOP_WORDS = set(stopwords.words('english'))
    # Remove Stopwords
#     df['Title'] = df['Title'].apply(remove_stopwords, args=(STOP_WORDS, ))
    df['Content'] = df['Content'].apply(remove_stopwords, args=(STOP_WORDS, ))

    # Lemmetization
    lemmer = WordNetLemmatizer()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV
               }
#     df['Title'] = df['Title'].apply(lemmetize, args=(lemmer, tag_dict))
    df['Content'] = df['Content'].apply(lemmetize, args=(lemmer, tag_dict))
    
    return df

**We run the pre_process function in parallel to make it faster using the Multi-Processing Module**

In [0]:
# Processing in Parallel
n_threads = mp.cpu_count()-1
data_pieces = np.array_split(sub_data, n_threads)
startTime = time.time()
pool = mp.Pool(n_threads)
data_sample = pd.concat(pool.map(pre_process, data_pieces))
pool.close()
pool.join()

totalTime = time.time() - startTime
print("Time taken in Pre-Processing: {}m {}s".format(totalTime // 60, totalTime%60))
data_sample.head()

**We drop the rows which exceed default python's csv field max limit**

In [0]:
csv_max_len = 131072
to_drop = []
for i in range(data_sample.shape[0]):
    if len(data_sample.iloc[i,0]) >= csv_max_len-1:
        to_drop.append(i)
        
data_sample.drop(to_drop, inplace=True)

**Splitting the Data and Storing it such that torch text can easily ingest it.**

In [0]:
SEED = 68
split_data(df=data_sample,prefix='prod',seed=SEED)

## **Preparing Data**
* Building the Vocabulary (Using Spacy) | **MAX_VOCAB_SIZE** = 70000
* Splitting the data for Test and Training


In [0]:
TEXT = data.Field(tokenize = 'spacy')

LABEL = data.LabelField(dtype = torch.float)

### Reading in Data Using TorchText

In [0]:
dev_prod_sel = {1:"dev", 0:"prod"}
prefix = dev_prod_sel[0]
train, val, test = data.TabularDataset.splits(
        path='./ProcessedData/', train=prefix+'_train.csv',
        validation=prefix+'_val.csv', test=prefix+'_test.csv', format='csv',
        fields=[('Text', TEXT), ('Label', LABEL)])


In [0]:
MAX_VOCAB_SIZE = 70000

## **Bag Of Words Model**

### Setting up the dataloader

In [0]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class BOWDataLoader(tud.Dataset):
    def __init__(self, data, vocab_size, text, field):
        self.vocab_size = vocab_size
        self.TEXT = text
        self.LABEL = field
        self.TEXT.build_vocab(data, max_size = vocab_size)
        self.LABEL.build_vocab(data)
        self.data = data
        
    def __len__(self):
        '''
        Returns the number of Examples
        '''
        return len(self.data.examples)
    
    def __getitem__(self, idx):
        """
        Returns a tuple of text and label at the given index.
        If label is not present None is returned.
        """
        itm = torch.zeros(self.vocab_size+2)
        for word in self.data[idx].Text:
            itm[self.TEXT.vocab.stoi[word]] += 1
        
        # To Differentiate Train and Test data
        if len(self.data.fields) == 2:
            label = self.data[idx].Label
            return itm, label
        else:
            return itm, None

train_dataset = BOWDataLoader(train, MAX_VOCAB_SIZE, TEXT, LABEL)
val_dataset = BOWDataLoader(val, MAX_VOCAB_SIZE, TEXT, LABEL)
test_dataset = BOWDataLoader(test, MAX_VOCAB_SIZE, TEXT, LABEL)

### Bag of Words Model Training Module

Here we define the training and evaluation functions.


In [0]:
class BOWTrainingModule():
    
    def __init__(self, model, batch_size):
        
        self.model = model
        
        # Batch Size
        self.batch_size = batch_size
        
        # Cuda Availability
        self.cuda = torch.cuda.is_available()
                
        # Loss Function
        self.loss_fn = nn.CrossEntropyLoss()
        
        # Optimizer
        self.optimizer = torch.optim.Adam(model.parameters())

    def train_epoch(self, dataset):
        """
        Trains a logistic regression model across all examples in the dataset.
        """
        self.dataloader = tud.DataLoader(dataset, batch_size=self.batch_size, shuffle=True)
        self.model.train()
        for i, (X,y) in enumerate(self.dataloader):
            X = X.float()
            y = torch.Tensor(np.asarray(y, dtype=np.float64)).long()
            if self.cuda:
                X  = X.cuda()
                y = y.cuda()
            
            self.optimizer.zero_grad()
            
            predictions = self.model.forward(X)
            
            loss = self.loss_fn(predictions, y)
            
            loss.backward()
            
            if (i) % 25 == 0:
                print("Iteration : {:4d} | Loss : {:4.4f}".format(i, loss.item()))
            
            self.optimizer.step()
        
    def train_model(self, train_data, val_data, num_epocs = 2):
        """
        Trains the model and saves the best model according to the validation score
        """
        self.model.train()
        accuracy = [0.]
        for epoch in range(num_epocs):
            self.train_epoch(train_data)
            val_accuracy = self.evaluate(val_data)
            print("Validation Accuracy: {:4.4f}".format(val_accuracy))
            if val_accuracy > max(accuracy):
                best_model = copy.deepcopy(self.model)        
            accuracy.append(val_accuracy)
        
        return best_model
                
    def evaluate(self, data):
        self.model.eval()
        dataloader = tud.DataLoader(data, batch_size=self.batch_size, shuffle=False)
        correct = 0
        total = 0
        
        for i, (X,y) in enumerate(dataloader):
            X = X.float()
            if self.cuda:
                X = X.cuda()
            predictions = self.model.forward(X).max(1)[1].cpu().numpy().reshape(-1)
            correct += (predictions == np.asarray(y, dtype=np.float64)).sum()
            total += predictions.shape[0]
        
        return correct/total

### Model

In [0]:
class BOWClassifier(nn.Module):
    
    def __init__(self, input_size, output_size):
        """
        Constructing a Logistic Regression Model
        """
        super(BOWClassifier, self).__init__()
        
        # Linear layer
        self.fc = nn.Linear(input_size, output_size)
    
    def forward(self, text):
        """
        Passes the data through the network and return the output
        """
        result = self.fc(text)
        return (result)

### Initializing the Model

In [0]:
INPUT_DIM = MAX_VOCAB_SIZE + 2
OUTPUT_DIM = 2
BATCH_SIZE = 64
model = BOWClassifier(INPUT_DIM, OUTPUT_DIM)
if torch.cuda.is_available():
    model = model.cuda()


### Training the BOW Model

In [0]:
bow_trainer = BOWTrainingModule(model, BATCH_SIZE)
best_bow_model = bow_trainer.train_model(train_dataset, val_dataset, num_epocs=5)

Iteration :    0 | Loss : 0.6933
Iteration :   25 | Loss : 0.6823
Iteration :   50 | Loss : 0.7543
Iteration :   75 | Loss : 0.7001
Iteration :  100 | Loss : 0.7558
Iteration :  125 | Loss : 0.8163
Iteration :  150 | Loss : 0.7412
Iteration :  175 | Loss : 0.6882
Iteration :  200 | Loss : 0.7221
Iteration :  225 | Loss : 0.7458
Iteration :  250 | Loss : 0.7762
Iteration :  275 | Loss : 0.6335
Iteration :  300 | Loss : 0.6829
Iteration :  325 | Loss : 0.7780
Iteration :  350 | Loss : 0.6827
Iteration :  375 | Loss : 0.6732
Iteration :  400 | Loss : 0.9911
Iteration :  425 | Loss : 0.8563
Iteration :  450 | Loss : 0.7798
Iteration :  475 | Loss : 0.7535
Validation Accuracy: 0.5811
Iteration :    0 | Loss : 0.4825
Iteration :   25 | Loss : 0.5236
Iteration :   50 | Loss : 0.4894
Iteration :   75 | Loss : 0.5556
Iteration :  100 | Loss : 0.5200
Iteration :  125 | Loss : 0.4149
Iteration :  150 | Loss : 0.5863
Iteration :  175 | Loss : 0.6131
Iteration :  200 | Loss : 0.5814
Iteration :  22

### Evaluating on Test Set

In [0]:
bow_accuracy = BOWTrainingModule(best_bow_model, BATCH_SIZE).evaluate(test_dataset)
print("Bag Of Words Model Accuracy : {:4.4f}".format(bow_accuracy))

Bag Of Words Model Accuracy : 0.5826


-------

## **Glove Embeddings**

We use Glove Embeddings throughout the notebook. Below are a few functions that help us load and transform the Glove Encodings as we want.

**Refrence:**

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)

In [0]:
def load_glove(path_file):
    """
    Loads the Glove Pre-Trained Embeddings
    
    Args:
        path_file: Path to the official glove embedding text file
    
    Returns: Dictionary {Word: [Embedding]}
    
    """
    start_time = time.time()
    print("Loading Glove Model ...")
    glove = {}
    with open(path_file) as f:
        for line in f:
            tmp = line.split()
            glove[tmp[0]] = np.asarray(tmp[1:], dtype=np.float64)
    print("Glove Model Loaded in {} s".format(time.time()-start_time))
    return glove

def gloveWordIndex(glove):
    """
    Generates word to index mappings
    0 --> <unk>
    1 --> <pad>
    Args:
        Loaded Glove Model as a dict
        
    Returns:
        word to index map {word:idx} and index to word map{idx:word}
    
    """
    w_i = {k:v+2 for v,k in enumerate(glove.keys())}
    w_i['<unk>'] = 0
    w_i['<pad>'] = 1
    i_w = {v+2:k for v,k in enumerate(glove.keys())}
    i_w[0] = '<unk>'
    i_w[1] = '<pad>'
    return w_i, i_w

def getWeightMatrix(glove):
    embd_dim = glove['a'].shape[0]
    num_embeddings = len(glove.keys())
    w_m = np.zeros((num_embeddings+2, embd_dim))
    w_m[0] = np.random.rand(embd_dim)
    w_m[1] = np.zeros(embd_dim)
    for i, word in enumerate(glove.keys()):
        w_m[i+2] = glove[word]
    
    return w_m

In [0]:
glove = load_glove("Embeddings/glove.6B.100d.txt")
word_to_idx, idx_to_word = gloveWordIndex(glove)

Loading Glove Model ...
Glove Model Loaded in 11.158129692077637 s


In [0]:
weight_matrix = getWeightMatrix(glove)
weight_matrix.shape

(400002, 100)

--------
## **Data Loader**

We set up the data loader to pad the sequqnces and return us sequences of length 1200. If longer then trim them to 1200 words.

We can also use pad-packed-sequence functions from PyTorch

In [0]:
MAX_VOCAB_SIZE = len(glove.keys())

In [0]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class NeuralNetDataLoader(tud.Dataset):
    def __init__(self, data, word_to_idx, idx_to_word, vocab_size):
        self.vocab_size = vocab_size
        self.data = data
        self.word_to_idx = word_to_idx
        self.idx_to_Word = idx_to_word
        
    def __len__(self):
        '''
        Returns the number of Examples
        '''
        return len(self.data.examples)
    
    def __getitem__(self, idx):
        """
        Returns a tuple of text and label at the given index.
        If label is not present None is returned.
        """
        MAX_LEN = 1200
        itm = []
        l = 0
        for word in self.data[idx].Text:
            indx = self.word_to_idx.get(word,0)
            itm.append(indx)
            l += 1
            if l == MAX_LEN:
                break
        
        if len(itm) < MAX_LEN:
            itm  = itm + [1 for i in range(MAX_LEN-len(itm))]
        
        itm = torch.tensor(itm).long()
        # To Differentiate Train and Test data
        if len(self.data.fields) == 2:
            label = self.data[idx].Label
            return itm, label
        else:
            return itm, None

train_dataset = NeuralNetDataLoader(train, word_to_idx, idx_to_word, MAX_VOCAB_SIZE)
val_dataset = NeuralNetDataLoader(val, word_to_idx, idx_to_word, MAX_VOCAB_SIZE)
test_dataset = NeuralNetDataLoader(test, word_to_idx, idx_to_word, MAX_VOCAB_SIZE)

### Setting up the Data Iterators

Using the data loader we set-up above.

In [0]:
BATCH_SIZE = 64
train_iter = tud.DataLoader(train_dataset, batch_size= BATCH_SIZE, shuffle=False)
test_iter = tud.DataLoader(test_dataset, batch_size= BATCH_SIZE, shuffle=False)
val_iter = tud.DataLoader(val_dataset, batch_size= BATCH_SIZE, shuffle=False)

------
## **Training Module**
This module contains the evaluate and training functios.

This module will help train us all the future models we make

In [0]:
class TrainingModule():
    
    def __init__(self, model):
        self.model = model
        self.loss_fn = nn.BCEWithLogitsLoss()
        self.cuda = torch.cuda.is_available()
        self.optimizer = optim.Adam(self.model.parameters())
        
    def train_epoch(self, iterator):
        epoch_loss = 0
        epoch_acc = 0
        self.model.train()
        for i, (X,y) in enumerate(iterator):
            self.optimizer.zero_grad()
            X = X.long()
            y = torch.Tensor(np.asarray(y, dtype=np.float64)).float()
            if self.cuda:
                X = X.cuda()
                y = y.cuda()
            preds = self.model.forward(X).squeeze(1)
            
            loss = self.loss_fn(preds, y)
            
            acc = (torch.round(torch.sigmoid(preds))==y).sum().item()/y.shape[0]
            if i % 25 == 0:
                print("Iteration: {:4d} | Loss : {:4.4f} | Accuracy : {:4.4f}".format(i, loss.item(), acc))
                                
            loss.backward()
            
            epoch_loss += loss.item()
            epoch_acc += acc
            
            self.optimizer.step()
        
        return epoch_loss/len(iterator), epoch_acc/len(iterator)
    
        
    def train_model(self, train_iterator, dev_iterator, num_epocs = 5):

        val_acc = [0.]
        for epoch in range(num_epocs):
            ep_loss, ep_accu = self.train_epoch(train_iterator)
            dev_acc = self.evaluate(dev_iterator)
            print("Dev. Loss : {} | Dev. Accuracy : {}".format(dev_acc[0], dev_acc[1]))
            if dev_acc[1] > max(val_acc):
                best_model = copy.deepcopy(self.model)
            val_acc.append(dev_acc[1])

        return best_model
        
    
    def evaluate(self, iterator):
        epoch_loss  = 0
        epoch_acc = 0
        
        model.eval()
        
        with torch.no_grad():
            for i, (X,y) in enumerate(iterator):
                X = X.long()
                y = torch.Tensor(np.asarray(y, dtype=np.float64)).float()
                if self.cuda:
                    X = X.cuda()
                    y = y.cuda()
                preds = self.model.forward(X).squeeze(1)

                loss = self.loss_fn(preds, y)
                
                acc = (torch.round(torch.sigmoid(preds))==y).sum().item()/y.shape[0]

                epoch_loss += loss.item()
                epoch_acc += acc
        
        return epoch_loss/len(iterator), epoch_acc/len(iterator)      

## **Neural Network based Model with Word Embeddings**

We use a Neural Network now with Word Embeddings, whoose :
* Input : A sentence
* Output: Label : {UP, DOWN}

The basic structure of a model class is as above. Functions like classify, evaluate and train will be defined along with pretrained word-embeddings.

In [0]:
class NeuralNetClassifier(nn.Module):
    
    def __init__(self, input_dim, output_dim, pad_index, embedding_weights):
        
        super().__init__()
        embd_dim = embedding_weights.shape[1]
        self.embedding = nn.Embedding.from_pretrained(torch.tensor(embedding_weights), freeze=False, padding_idx = pad_index)
        
        hid_dim1 = 64
        hid_dim2 = 32
        
        self.drop_out = nn.Dropout()
        
        self.hd1 = nn.Linear(embd_dim, hid_dim1)
        self.hd2 = nn.Linear(hid_dim1, hid_dim2)
        self.out = nn.Linear(hid_dim2, output_dim)
        
        self.activate = nn.ReLU()
    
    def forward(self,text):
        
#         print("Text: ", text.shape)
        embds = self.embedding(text)
#         print("Embds: ", embds.shape)
        mean_embd = torch.mean(embds, 1)
#         print("Embedding:", mean_embd.shape)
        output = self.activate(self.hd1(mean_embd.float()))
#         print("Layer 1: ",output.shape)
        output = self.drop_out(output)
        output = self.activate(self.hd2(output))
        output = self.drop_out(output)
        output = self.out(output)
        return output
        

### Initializing the Neural Net Model 
with the appropriate dimensions

In [0]:
INPUT_DIM = weight_matrix.shape[0]
OUTPUT_DIM = 1
PAD_IDX = 1
model = NeuralNetClassifier(INPUT_DIM, OUTPUT_DIM, PAD_IDX, weight_matrix)
if torch.cuda.is_available():
    model = model.cuda()

### Training the Neural Net Model

In [0]:
neural_trainer = TrainingModule(model)
best_nn_model = neural_trainer.train_model(train_iter, val_iter)

Iteration:    0 | Loss : 0.6915 | Accuracy : 0.5000
Iteration:   25 | Loss : 0.6579 | Accuracy : 0.7188
Iteration:   50 | Loss : 0.6676 | Accuracy : 0.6562
Iteration:   75 | Loss : 0.7013 | Accuracy : 0.5156
Iteration:  100 | Loss : 0.6647 | Accuracy : 0.6562
Iteration:  125 | Loss : 0.6749 | Accuracy : 0.6250
Iteration:  150 | Loss : 0.6432 | Accuracy : 0.7031
Iteration:  175 | Loss : 0.6840 | Accuracy : 0.5938
Iteration:  200 | Loss : 0.6920 | Accuracy : 0.5469
Iteration:  225 | Loss : 0.6488 | Accuracy : 0.6875
Iteration:  250 | Loss : 0.7026 | Accuracy : 0.5000
Iteration:  275 | Loss : 0.6555 | Accuracy : 0.6719
Iteration:  300 | Loss : 0.6742 | Accuracy : 0.5938
Iteration:  325 | Loss : 0.7014 | Accuracy : 0.5469
Iteration:  350 | Loss : 0.6424 | Accuracy : 0.7031
Iteration:  375 | Loss : 0.6509 | Accuracy : 0.6562
Iteration:  400 | Loss : 0.6937 | Accuracy : 0.5312
Iteration:  425 | Loss : 0.6732 | Accuracy : 0.6094
Iteration:  450 | Loss : 0.6645 | Accuracy : 0.6094
Iteration:  

### Evaluating on Test Set

In [0]:
neural_accuracy = TrainingModule(best_nn_model).evaluate(test_iter)
print("Neural Network Model Accuracy : {:4.4f}".format(neural_accuracy[1]))

Neural Network Model Accuracy : 0.6086


## **Recurrent Neural Network (GRU) with Glove Embeddings**

We use GRU as a RNN model. 

In [0]:
class WordRNNClassifier(nn.Module):
    
    def __init__(self, input_dim, output_dim, hidden_dim, pad_index, embedding_weights, drop_out = 0):
        
        super().__init__()
        embd_dim = embedding_weights.shape[1]
        self.nhid = hidden_dim
        self.embedding = nn.Embedding.from_pretrained(torch.tensor(embedding_weights), freeze=False, padding_idx = pad_index)   
        self.rnn = nn.GRU(embd_dim, hidden_dim, dropout=drop_out)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout()

    def forward(self, text):
        
        embds = self.embedding(text).float()
        embds = embds.permute(1,0,2)
        hidden = torch.zeros((1,embds.size(1), self.nhid))
        if torch.cuda.is_available():
            hidden = hidden.cuda() 
        out, hid = self.rnn(embds, hidden)    
        hid = self.dropout(hid)
        out = self.output(hid.squeeze(0))
        
        return out
    
    

### Initializing the RNN (GRU) Model

with correct parameters.

In [0]:
INPUT_DIM = weight_matrix.shape[0]
OUTPUT_DIM = 1
PAD_IDX = 1
HIDDEN_DIM = 64
model = WordRNNClassifier(INPUT_DIM, OUTPUT_DIM, HIDDEN_DIM, PAD_IDX, weight_matrix)
if torch.cuda.is_available():
    model = model.cuda()

### Training the Model

In [0]:
RNN_Trainer = TrainingModule(model)
best_rnn_model = RNN_Trainer.train_model(train_iter, val_iter)

Iteration:    0 | Loss : 0.7071 | Accuracy : 0.4062
Iteration:   25 | Loss : 0.6321 | Accuracy : 0.7188
Iteration:   50 | Loss : 0.6533 | Accuracy : 0.6562
Iteration:   75 | Loss : 0.6945 | Accuracy : 0.5156
Iteration:  100 | Loss : 0.6567 | Accuracy : 0.6406
Iteration:  125 | Loss : 0.6712 | Accuracy : 0.6094
Iteration:  150 | Loss : 0.6371 | Accuracy : 0.7031
Iteration:  175 | Loss : 0.6807 | Accuracy : 0.5938
Iteration:  200 | Loss : 0.6889 | Accuracy : 0.5469
Iteration:  225 | Loss : 0.6445 | Accuracy : 0.6875
Iteration:  250 | Loss : 0.7060 | Accuracy : 0.5000
Iteration:  275 | Loss : 0.6463 | Accuracy : 0.6719
Iteration:  300 | Loss : 0.6830 | Accuracy : 0.5938
Iteration:  325 | Loss : 0.6964 | Accuracy : 0.5469
Iteration:  350 | Loss : 0.6373 | Accuracy : 0.7031
Iteration:  375 | Loss : 0.6511 | Accuracy : 0.6562
Iteration:  400 | Loss : 0.7018 | Accuracy : 0.5312
Iteration:  425 | Loss : 0.6720 | Accuracy : 0.6094
Iteration:  450 | Loss : 0.6647 | Accuracy : 0.6094
Iteration:  

### Evaluating on Test Set

In [0]:
rnn_accuracy = TrainingModule(best_rnn_model).evaluate(test_iter)
print("RNN Model Accuracy : {:4.4f}".format(rnn_accuracy[1]))

  self.dropout, self.training, self.bidirectional, self.batch_first)


RNN Model Accuracy : 0.6082


## **Chacracter Level RNN Model** _With Letter Embeddings_

### Data Loader

Here we need to do something different. We don't want to Normalize the data, remove punctuation or any lemmetization. We want the model to learn how all the differene characters work together and relate to each other. So we will manually create our own mappings from index to letters and use them in the data loader.

Also, we won'e be using TorchText here, we'll just be using pandas.

In [0]:
# train_data, val_data, test_data = split_data(df = data_sample[['Content', 'CloseMove']],prefix='char_dev', seed = 68, ret=1)
train_data = pd.read_csv("ProcessedData/char_prod_train.csv")
test_data = pd.read_csv("ProcessedData/char_prod_test.csv")
val_data = pd.read_csv("ProcessedData/char_prod_val.csv")

In [0]:
# Character to Index Mapping
char_to_idx = {v:i+2 for i,v in enumerate(string.printable)}
char_to_idx['<unk>'] = 0
char_to_idx['<pad>'] = 1

# Index to Character Mapping
idx_to_char = {char_to_idx[i]:i for i in char_to_idx}

In [0]:
MAX_CHAR_VOCAB = len(idx_to_char)

BATCH_SIZE = 64

class NeuralNetDataLoader(tud.Dataset):
    def __init__(self, data, char_to_idx, idx_to_char, vocab_size):
        self.vocab_size = vocab_size
        self.data = data
        self.char_to_idx = char_to_idx
        self.idx_to_char = idx_to_char
        
    def __len__(self):
        '''
        Returns the number of Examples
        '''
        return self.data.shape[0]
    
    def __getitem__(self, idx):
        """
        Returns a tuple of text and label at the given index.
        If label is not present None is returned.
        """
        MAX_LEN = 5600
        itm = []
        l = 0
        for char in self.data.iloc[idx,0].strip():
            indx = self.char_to_idx.get(char,0)
            itm.append(indx)
            l += 1
            if l == MAX_LEN:
                break
        
        if len(itm) < MAX_LEN:
            itm  = itm + [1 for i in range(MAX_LEN-len(itm))]
        
        itm = torch.tensor(itm).long()
        # To Differentiate Train and Test data
        if self.data.shape[1] == 2:
            label = self.data.iloc[idx,1]
            return itm, label
        else:
            return itm, None

train_dataset = NeuralNetDataLoader(train_data, char_to_idx, idx_to_char, MAX_CHAR_VOCAB)
val_dataset = NeuralNetDataLoader(val_data, char_to_idx, idx_to_char, MAX_CHAR_VOCAB)
test_dataset = NeuralNetDataLoader(test_data, char_to_idx, idx_to_char, MAX_CHAR_VOCAB)

In [0]:
BATCH_SIZE = 64
train_iter = tud.DataLoader(train_dataset, batch_size= BATCH_SIZE, shuffle=False)
test_iter = tud.DataLoader(test_dataset, batch_size= BATCH_SIZE, shuffle=False)
val_iter = tud.DataLoader(val_dataset, batch_size= BATCH_SIZE, shuffle=False)

### Model

In [0]:
class CharRNNClassifier(nn.Module):
    
    def __init__(self, input_dim, output_dim, hidden_dim, embd_dim, pad_index):
        
        super().__init__()
        self.nhid = hidden_dim
        self.embedding = nn.Embedding(num_embeddings = input_dim, embedding_dim=embd_dim, padding_idx=pad_index)   
        self.rnn = nn.GRU(embd_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout()
        
    def forward(self, text):
        
        embds = self.embedding(text).float()
        embds = embds.permute(1,0,2)
        hidden = torch.zeros((1,embds.size(1), self.nhid))
        if torch.cuda.is_available():
            hidden = hidden.cuda() 
        out, hid = self.rnn(embds, hidden)       
        hid = self.dropout(hid)
        out = self.output(hid.squeeze(0))
        
        return out
    
    

### Initializing the Model

In [0]:
INPUT_DIM = len(char_to_idx)
OUTPUT_DIM = 1
PAD_IDX = 1
HIDDEN_DIM = 64
EMBD_DIM = 128
model = CharRNNClassifier(INPUT_DIM, OUTPUT_DIM, HIDDEN_DIM, EMBD_DIM, PAD_IDX)
if torch.cuda.is_available():
    model = model.cuda()

### Training the Model

In [0]:
CharRNNTrainer = TrainingModule(model)
best_charnn_model = CharRNNTrainer.train_model(train_iter, val_iter)

Iteration:    0 | Loss : 0.7169 | Accuracy : 0.3906
Iteration:   25 | Loss : 0.6416 | Accuracy : 0.7188
Iteration:   50 | Loss : 0.6762 | Accuracy : 0.6250
Iteration:   75 | Loss : 0.6954 | Accuracy : 0.5781
Iteration:  100 | Loss : 0.6771 | Accuracy : 0.5781
Iteration:  125 | Loss : 0.6970 | Accuracy : 0.5312
Iteration:  150 | Loss : 0.7089 | Accuracy : 0.5156
Iteration:  175 | Loss : 0.6956 | Accuracy : 0.5781
Iteration:  200 | Loss : 0.6826 | Accuracy : 0.5781
Iteration:  225 | Loss : 0.5975 | Accuracy : 0.7812
Iteration:  250 | Loss : 0.6398 | Accuracy : 0.7031
Iteration:  275 | Loss : 0.6376 | Accuracy : 0.6719
Iteration:  300 | Loss : 0.6729 | Accuracy : 0.5938
Iteration:  325 | Loss : 0.6761 | Accuracy : 0.5625
Iteration:  350 | Loss : 0.6463 | Accuracy : 0.6719
Iteration:  375 | Loss : 0.5997 | Accuracy : 0.7969
Iteration:  400 | Loss : 0.6903 | Accuracy : 0.5781
Iteration:  425 | Loss : 0.6503 | Accuracy : 0.6719
Iteration:  450 | Loss : 0.6456 | Accuracy : 0.6719
Iteration:  

### Evaluating the Model

In [0]:
char_rnn_accuracy = TrainingModule(best_charnn_model).evaluate(test_iter)
print("Character Level Character Accuracy: {:4.4f}".format(char_rnn_accuracy[1]))

  self.dropout, self.training, self.bidirectional, self.batch_first)


Character Level Character Accuracy: 0.6029
