## Part 1:  Text Classification

## Part 1.a :- Neural Networks Model

## 1. Introduction

Neural Networks are made up of layers of neurons. These neurons are the core processing units of network. Neural Networks take input data, train themselves 
to identify the patterns in the input data and then predict the output. 

During this process, it consists of three layers namely, 

1. Input layer
2. Hidden layer
3. Output layer

If the neural networks consists of more than one hidden layer, then it is called "Recurrent  Neural Networks". 

Recurrent Neural Networks, play a very important role in word embeddings i.e, it takes the words from a document in a sequence and capture the context of a word in a document, identify the similiarity between semantic and syntactic words, relation with other words likewise.

We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$. 

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

In this task, we use two csv files namely, axcs_train, axcs_test, Where we train our model using "Abstract" column in axcs_train data and try to predict the
categorical values in three columns namely InfoTheory, CompVis, Math, by building three classifiers one for each prediction column using "Abstract" column.

## 2. Libraries Used

- **torch**:
It is an open source machine learning library used for developing and training neural network based on deep learning models.

- **TabularDataset**:
It is a subclass in torch.util.data.Dataset in pytorch, where it does feature embeddings with categorical variables.

- **torch.nn**:
A kind of tensor in neural networks, that is to be considered a module parameter.

- **torchtext**:
It is used to process a text using tokenizer.

- **time**:
To retrieve the time taken for each epoch.

- **optim**:
Rather than manually updating the weights of the model in SGD, we use the optim package to define an Optimizer that will update the weights for us.

- **sklearn.metrics**:
sklearn.metrics is used calculate the metrics of the drawn fit. In this assignment sklearn.metrics is used to calculate the r2-score for the drawn linear regression model.

- **numpy**:
numPy is a package in Python used for scientific computing to perform different operations.

## 3. Importing Libraries

In [0]:
import torch
from torchtext import data
from torchtext.data import TabularDataset
import torch.nn as nn
import time
import torch.optim as optim
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
import numpy as np
import nltk.data

## 4. Defining a class to build Neural Networks Model

Building the RNN model, is done through the below class RNN

The embedding layer is used to convert sparse vector into a dense embedding vector. The sparse vector is a vector which consists of most of the elements as 0 and dense vector is the vector, which is much smaller than sparse vector and consists of only real numbers. This embedding layer is simply a single fully connected layer. 

As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings.

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

In [0]:
class RNN(nn.Module):
    '''
    __init__ is a default method called , when RNN class is called, to define the layers of the neural networks module. It consists of three layers
    namely input, embedding(hidden), output. In the embedding layer, it converts one hot encoding representation into random distributed input representation.
    In the next stage, the embedding layer input size is given into hidden layer(RNN) and the output of this hidden layer size is given into Linear layer 
    and finally we get output dimension size in Linear layer.
    
    '''
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
      
      super().__init__()
      self.embedding = nn.Embedding(input_dim, embedding_dim)
      self.rnn = nn.RNN(embedding_dim, hidden_dim)
      self.fc = nn.Linear(hidden_dim, output_dim)
        
    
    def forward(self, text):
      '''
      When we feed different examples in our dataset, forward method is called. 
      The input text is passed into the embedding layer , which returns an dense vectors. The output of this embedded layer is passed into the hidden layer
      (rnn) and returns two values i.e tensors as output. First return value is the "output" which consists of sentence length, batch size, hidden dim. Batch
      size is present in the input of text. Next return value is the "hidden" which consists of 1, batch size, hidden dim. 
      
      output is the concatenation of every hidden state dimension with the respective input of "Abstract" for every step, where as hidden is just the final 
      hidden state of "RNN" layer.
      
      We verify the above results using assert function, and then finally remove dimension of size 1 and then return the obtained result.
      
      '''
      embedded = self.embedding(text)
      output, hidden = self.rnn(embedded)
      assert torch.equal(output[-1,:,:], hidden.squeeze(0))
      return self.fc(hidden.squeeze(0))

## 5.Defining a method to get no.of rounded predictions equal to the actual class

This function takes the input "preds", in sigmoid function, and make those values fit between 0 to 1. As sigmoid function value , takes the values between 0 and 1, the input "preds" is also squashed between the values 0 and 1, and is rounded to 1; if the value is greater than 0.5 and is equal to 0, if the value
is less than 0.5.

In [0]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  
    acc = correct.sum() / len(correct)
    return acc

## 6.Defining a function to know how long an epoch takes to compare training times between each epoch

The below function tells us the start and the end time of each epoch.

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

## 7. Initializing the required parameters

In [0]:
#seed is a pseudo random generator which helps us in retrieving the same output the each time we run. The same seed value will always return the same output.
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

#tokenizing using spacy attribute value for tokenize.
TEXT = data.Field(sequential=True, tokenize = 'spacy', lower=True)

#LABEL is defined by a LabelField, a special subset of the `Field` class specifically used for handling labels
LABEL = data.LabelField(dtype = torch.float, use_vocab=False, preprocessing=int)

## 8.1 Building a Neural Network classifier1 for InfoTheory

In the below cell, we are reading our data through Tabular Dataset and the fields are passed in the same order the columns are present in the dataset, where
the column which has to be predicted is labelled as 'LABEL' and the column which is used as the input to predict our required fiels is labelled as 'TEXT' and the remaining columns as 'None'.

The tabular dataset can take any format of data like 'csv', 'json', 'tsv'.

In [0]:
infotheory_datafields = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", LABEL),
                 ("CompVis", None),
                 ('Math',None),
                 ("Abstract", TEXT)]

train_data_infotheory, test_data_infotheory = TabularDataset.splits(path='',
                                              train = 'axcs_train.csv', 
                                              test = 'axcs_test.csv', 
                                              format = 'csv',
                                              skip_header = True,
                                              fields = infotheory_datafields)

In [0]:
#printing the length of training and testing datasets
print(f'Number of training examples for building InfoTheory classifier: {len(train_data_infotheory)}')
print(f'Number of testing examples for building InfoTheory classifier: {len(test_data_infotheory)}')

Number of training examples for building InfoTheory classifier: 54731
Number of testing examples for building InfoTheory classifier: 19679


In [0]:
#printing the first row of training dataset, after the dataset is read in the tabular format using pytorch
print(vars(train_data_infotheory.examples[0]))

{'InfoTheory': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


In [0]:
#MAX_VOCAB_SIZE is a parameter that is taking a random vocab size, of how much maximum the vocab size could be.
MAX_VOCAB_SIZE = 6000
#building the vocabulary for abstract with the maximum size which is mentioned above.
TEXT.build_vocab(train_data_infotheory, max_size = MAX_VOCAB_SIZE)

In [0]:
print(f"Unique tokens in TEXT vocabulary for building InfoTheory classifier: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary for building InfoTheory classifier: 6002


In [0]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 512010), ('of', 344238), ('.', 332875), (',', 307484), ('a', 218648), ('and', 213963), ('-', 184859), ('to', 176340), ('in', 174585), ('is', 130627), ('for', 120901), ('we', 110661), ('that', 89421), ('this', 74411), ('on', 68779), ('with', 66354), ('are', 57723), (' ', 54731), ('an', 49791), ('by', 49768)]


The final step of preparing the data after acquiring all the required parameters, is creating the iterators. We iterate over these in the training data, and they return a batch of examples at each iteration, which consists of indexing and they converting into tensors.

We'll use a "BucketIterator" which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using "torch.device", we then pass this device to the iterator.

In [0]:
BATCH_SIZE = 25

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_infotheory, test_data_infotheory), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [0]:
batch_info_theory = next(train_iterator.__iter__())
batch_info_theory


[torchtext.data.batch.Batch of size 25]
	[.InfoTheory]:[torch.cuda.FloatTensor of size 25 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 330x25 (GPU 0)]

Now, we will create an instance of our RNN class. 

Input dimension = dimension of the one-hot vectors = vocabulary size. 

Embedding dimension = size of the dense word vectors. 

It usually takes value between 50-250 , but depends on the size of the vocabulary.

Hidden dimension = size of the hidden states. 

It usually takes value between 100-500 , but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 250
HIDDEN_DIM = 200
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In [0]:
#Based on the vocabulary size, we will get the total number of trainable parameters.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,591,101 trainable parameters


In order to update the weights of the parameters in the model, we use stochastic gradient descent(SGD), which is used along with optim function in Python, 
which takes two parameters. The first argument model.parameters(), is updated by the optimizer and the second is learning rate, which is equal to lr.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)

In the below cell, the loss function is being defined, which is called commonly as criterion in PyTorch. This loss function is "Binary Cross Entropy" with logits.

Our model currently outputs an unbound real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the sigmoid (or) logit functions. 

We then use this this bound scalar to calculate the loss using binary cross entropy. 

The "BCEWithLogitsLoss" criterion carries out both the sigmoid and the binary cross entropy steps.

In [0]:
criterion = nn.BCEWithLogitsLoss()

Now,  ".to"   will help in placing our model, criterion to the device, i.e to our GPU's.

In [0]:
model = model.to(device)
criterion = criterion.to(device)

## 8.2 Training the InfoTheory model

The below "train" function is used to iterate over all examples(one batch at a time). 

"model.train()" is used to put the model in "training mode".

For each batch_info_theory in iterator, we first zero the gradients. Then, build our model for each example. Each parameter in the model will have a grad attribute, which is calculated by criterion, stored as grad. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Later, we calculate the gradient of each parameter in the model using loss.backward(), then update the parameters using gradient and the optimizer using optimizer.step().

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch_info_theory in iterator:
        optimizer.zero_grad()
        predictions = model(batch_info_theory.Abstract).squeeze(1)
        loss = criterion(predictions, batch_info_theory.InfoTheory)
        acc = binary_accuracy(predictions, batch_info_theory.InfoTheory)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## 8.3 Evaluating the InfoTheory Model

The below "evaluate" function is used to iterate over all examples(one batch at a time). 

"model.eval()" is used to put the training model in evaluation state.

Build our model for each example. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch_info_theory in iterator:
            predictions = model(batch_info_theory.Abstract).squeeze(1)
            loss = criterion(predictions, batch_info_theory.InfoTheory)
            acc = binary_accuracy(predictions, batch_info_theory.InfoTheory)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The below code is used to run for 5 epochs, and the start and end times are recorded, for each epoch and training loss, training accuracy is recorded.

In [0]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss_infotheory, train_acc_infotheory = train(model, train_iterator, optimizer, criterion)
    end_time = time.time()
    epoch_mins_infotheory, epoch_secs_infotheory = epoch_time(start_time, end_time)
    test_loss_infotheory, test_acc_infotheory = evaluate(model, test_iterator, criterion)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_infotheory}m {epoch_secs_infotheory}s')
    print(f'\tTrain Loss: {train_loss_infotheory:.3f} | Train Acc: {train_acc_infotheory*100:.2f}%')
    print(f'\tTest Loss: {test_loss_infotheory:.3f} | Test Acc: {test_acc_infotheory*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 15s
	Train Loss: 0.493 | Train Acc: 80.70%
	Test Loss: 0.506 | Test Acc: 81.25%
Epoch: 02 | Epoch Time: 2m 15s
	Train Loss: 0.490 | Train Acc: 80.71%
	Test Loss: 0.498 | Test Acc: 81.29%
Epoch: 03 | Epoch Time: 2m 16s
	Train Loss: 0.490 | Train Acc: 80.71%
	Test Loss: 0.495 | Test Acc: 81.32%
Epoch: 04 | Epoch Time: 2m 17s
	Train Loss: 0.490 | Train Acc: 80.73%
	Test Loss: 0.495 | Test Acc: 81.35%
Epoch: 05 | Epoch Time: 2m 16s
	Train Loss: 0.490 | Train Acc: 80.72%
	Test Loss: 0.492 | Test Acc: 81.36%


The below cell generates true and predicted values of each field.

In [0]:
y_predict_infotheory = []
y_test_infotheory = []

with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_infotheory += rounded_preds.tolist()
        y_test_infotheory += batch.InfoTheory.tolist()

## 8.4 Confusion Matrix for InfoTheory Model

Now, print the confusion matrix as well as based on the prediction values and the truth values, the efficiency of the model is calculated using the following metrics,

1. Accuracy
2. Macro Precision
3. Macro Recall
4. Macro F1 score
5. MCC

In [0]:
y_predict_infotheory = np.asarray(y_predict_infotheory)
print("Metrics for Neural Network Classifier1 for building InfoTheory")
y_test_infotheory = np.asarray(y_test_infotheory)
print(confusion_matrix(y_test_infotheory,y_predict_infotheory))
recall=recall_score(y_test_infotheory,y_predict_infotheory,average='macro')
precision=precision_score(y_test_infotheory,y_predict_infotheory,average='macro')
f1score=f1_score(y_test_infotheory,y_predict_infotheory,average='macro')
accuracy=accuracy_score(y_test_infotheory,y_predict_infotheory)
matthews = matthews_corrcoef(y_test_infotheory,y_predict_infotheory) 
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Metrics for Neural Network Classifier1 for building InfoTheory
[[15998    65]
 [ 3608     8]]
Accuracy: 0.8133543371106255
Macro Precision: 0.46278187135892146
Macro Recall: 0.49908291136834554
Macro F1 score:0.45068132350074164
MCC:-0.011684574903382008


## 8.5 Results of confusion matrix for InfoTheory

In [0]:
print("The number of true negatives for Info theory classifier of neural network model is: " + str(confusion_matrix(y_test_infotheory,y_predict_infotheory)[0][0]))
print("The number of false positives for Info theory classifier of neural network model is: " + str(confusion_matrix(y_test_infotheory,y_predict_infotheory)[0][1]))
print("The number of false negatives for Info theory classifier of neural network model is: " + str(confusion_matrix(y_test_infotheory,y_predict_infotheory)[1][0]))
print("The number of true positives for Info theory classifier of neural network model is: " + str(confusion_matrix(y_test_infotheory,y_predict_infotheory)[1][1]))


The number of true negatives for Info theory classifier of neural network model is: 15998
The number of false positives for Info theory classifier of neural network model is: 65
The number of false negatives for Info theory classifier of neural network model is: 3608
The number of true positives for Info theory classifier of neural network model is: 8


## 9. Building a Neural Network classifier2 for CompVis

In the below cell, we are reading our data through Tabular Dataset and the fields are passed in the same order the columns are present in the dataset, where
the column which has to be predicted is labelled as 'LABEL' and the column which is used as the input to predict our required fiels is labelled as 'TEXT' and the remaining columns as 'None'.

The tabular dataset can take any format of data like 'csv', 'json', 'tsv'.

In [0]:
compvis_datafields = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", None),
                 ("CompVis", LABEL),
                 ('Math',None),
                 ("Abstract", TEXT)]

train_data_compvis, test_data_compvis = TabularDataset.splits(path='',
                                              train = 'axcs_train.csv', 
                                              test = 'axcs_test.csv', 
                                              format = 'csv',
                                              skip_header = True,
                                              fields = compvis_datafields)

In [8]:
#printing the length of training and testing datasets
print(f'Number of training examples for building CompVis classifier: {len(train_data_compvis)}')
print(f'Number of testing examples for building CompVis classifier: {len(test_data_compvis)}')

Number of training examples for building CompVis classifier: 54731
Number of testing examples for building CompVis classifier: 19679


In [9]:
#printing the first row of training dataset, after the dataset is read in the tabular format using pytorch
print(vars(train_data_compvis.examples[0]))

{'CompVis': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


In [0]:
#printing the first row of training dataset, after the dataset is read in the tabular format using pytorch
MAX_VOCAB_SIZE = 6000

#building the vocabulary for abstract with the maximum size which is mentioned above.
TEXT.build_vocab(train_data_compvis, max_size = MAX_VOCAB_SIZE)

The final step of preparing the data after acquiring all the required parameters, is creating the iterators. We iterate over these in the training data, and they return a batch of examples at each iteration, which consists of indexing and they converting into tensors.

We'll use a "BucketIterator" which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using "torch.device", we then pass this device to the iterator.

In [0]:
BATCH_SIZE = 20

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_compvis, test_data_compvis), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [12]:
batch_compvis = next(train_iterator.__iter__())
batch_compvis


[torchtext.data.batch.Batch of size 20]
	[.CompVis]:[torch.cuda.FloatTensor of size 20 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 291x20 (GPU 0)]

Now, we will create an instance of our RNN class. 

Input dimension = dimension of the one-hot vectors = vocabulary size. 

Embedding dimension = size of the dense word vectors. 

It usually takes value between 50-250 , but depends on the size of the vocabulary.

Hidden dimension = size of the hidden states. 

It usually takes value between 100-500 , but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 250
HIDDEN_DIM = 200
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In order to update the weights of the parameters in the model, we use stochastic gradient descent(SGD), which is used along with optim function in Python, 
which takes two parameters. The first argument model.parameters(), is updated by the optimizer and the second is learning rate, which is equal to lr.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

## 9.1 Training the compvis model

The below "train" function is used to iterate over all examples(one batch at a time). 

"model.train()" is used to put the model in "training mode".

For each batch_info_theory in iterator, we first zero the gradients. Then, build our model for each example. Each parameter in the model will have a grad attribute, which is calculated by criterion, stored as grad. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Later, we calculate the gradient of each parameter in the model using loss.backward(), then update the parameters using gradient and the optimizer using optimizer.step().

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch_compvis in iterator:
        optimizer.zero_grad()
        predictions = model(batch_compvis.Abstract).squeeze(1)
        loss = criterion(predictions, batch_compvis.CompVis)
        acc = binary_accuracy(predictions, batch_compvis.CompVis)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## 9.2 Evaluating the compvis Model

The below "evaluate" function is used to iterate over all examples(one batch at a time). 

"model.eval()" is used to put the training model in evaluation state.

Build our model for each example. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch_compvis in iterator:
            predictions = model(batch_compvis.Abstract).squeeze(1)
            loss = criterion(predictions, batch_compvis.CompVis)
            acc = binary_accuracy(predictions, batch_compvis.CompVis)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The below code is used to run for 5 epochs, and the start and end times are recorded, for each epoch and training loss, training accuracy is recorded.

In [17]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss_compvis, train_acc_compvis = train(model, train_iterator, optimizer, criterion)
    end_time = time.time()
    epoch_mins_compvis, epoch_secs_compvis = epoch_time(start_time, end_time)
    test_loss_compvis, test_acc_compvis = evaluate(model, test_iterator, criterion)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_compvis}m {epoch_secs_compvis}s')
    print(f'\tTrain Loss: {train_loss_compvis:.3f} | Train Acc: {train_acc_compvis*100:.2f}%')
    print(f'\tTest Loss: {test_loss_compvis:.3f} | Test Acc: {test_acc_compvis*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 35s
	Train Loss: 0.180 | Train Acc: 95.90%
	Test Loss: 0.354 | Test Acc: 88.61%
Epoch: 02 | Epoch Time: 0m 34s
	Train Loss: 0.171 | Train Acc: 95.89%
	Test Loss: 0.370 | Test Acc: 88.60%
Epoch: 03 | Epoch Time: 0m 34s
	Train Loss: 0.171 | Train Acc: 95.90%
	Test Loss: 0.380 | Test Acc: 88.61%
Epoch: 04 | Epoch Time: 0m 34s
	Train Loss: 0.171 | Train Acc: 95.90%
	Test Loss: 0.385 | Test Acc: 88.62%
Epoch: 05 | Epoch Time: 0m 34s
	Train Loss: 0.171 | Train Acc: 95.90%
	Test Loss: 0.389 | Test Acc: 88.63%


The below cell generates true and predicted values of each field.

In [0]:
y_predict_compvis = []
y_test_compvis = []
with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_compvis += rounded_preds.tolist()
        y_test_compvis += batch.CompVis.tolist()

Now, print the confusion matrix as well as based on the prediction values and the truth values, the efficiency of the model is calculated using the following metrics,

1. Accuracy
2. Macro Precision
3. Macro Recall
4. Macro F1 score
5. MCC

In [19]:
y_predict_compvis = np.asarray(y_predict_compvis)
print("Metrics for Neural Network Classifier2 for building CompVis")
y_test_compvis = np.asarray(y_test_compvis)
print(confusion_matrix(y_test_compvis,y_predict_compvis))
recall=recall_score(y_test_compvis,y_predict_compvis,average='macro')
precision=precision_score(y_test_compvis,y_predict_compvis,average='macro')
f1score=f1_score(y_test_compvis,y_predict_compvis,average='macro')
accuracy=accuracy_score(y_test_compvis,y_predict_compvis)
matthews = matthews_corrcoef(y_test_compvis,y_predict_compvis) 
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Metrics for Neural Network Classifier2 for building CompVis
[[17429    98]
 [ 2140    12]]
Accuracy: 0.886274709080746
Macro Precision: 0.4998671367979968
Macro Recall: 0.49999241743434397
Macro F1 score:0.47514006243174917
MCC:-6.348051519816112e-05


## 9.3 Results of compvis model

In [20]:
print("The number of true negatives for compvis classifier of neural network model is: " + str(confusion_matrix(y_test_compvis,y_predict_compvis)[0][0]))
print("The number of false positives for compvis classifier of neural network model is: " + str(confusion_matrix(y_test_compvis,y_predict_compvis)[0][1]))
print("The number of false negatives for compvis classifier of neural network model is: " + str(confusion_matrix(y_test_compvis,y_predict_compvis)[1][0]))
print("The number of true positives for compvis classifier of neural network model is: " + str(confusion_matrix(y_test_compvis,y_predict_compvis)[1][1]))


The number of true negatives for compvis classifier of neural network model is: 17429
The number of false positives for compvis classifier of neural network model is: 98
The number of false negatives for compvis classifier of neural network model is: 2140
The number of true positives for compvis classifier of neural network model is: 12


## 10. Building a Neural Network classifier3 for Math

In the below cell, we are reading our data through Tabular Dataset and the fields are passed in the same order the columns are present in the dataset, where
the column which has to be predicted is labelled as 'LABEL' and the column which is used as the input to predict our required fiels is labelled as 'TEXT' and the remaining columns as 'None'.

The tabular dataset can take any format of data like 'csv', 'json', 'tsv'.

In [0]:
math_datafields = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", None),
                 ("CompVis", None),
                 ('Math',LABEL),
                 ("Abstract", TEXT)]

train_data_math, test_data_math = TabularDataset.splits(path='',
                                              train = 'axcs_train.csv', 
                                              test = 'axcs_test.csv', 
                                              format = 'csv',
                                              skip_header = True,
                                              fields = math_datafields)

In [0]:
#printing the length of training and testing datasets
print(f'Number of training examples for building Math classifier: {len(train_data_math)}')
print(f'Number of testing examples for building Math classifier: {len(test_data_math)}')

Number of training examples for building Math classifier: 54731
Number of testing examples for building Math classifier: 19679


In [0]:
#printing the first row of training dataset, after the dataset is read in the tabular format using pytorch
print(vars(train_data_math.examples[0]))

{'Math': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


In [0]:
#printing the first row of training dataset, after the dataset is read in the tabular format using pytorch
MAX_VOCAB_SIZE = 6000

#building the vocabulary for abstract with the maximum size which is mentioned above.
TEXT.build_vocab(train_data_math, max_size = MAX_VOCAB_SIZE)

The final step of preparing the data after acquiring all the required parameters, is creating the iterators. We iterate over these in the training data, and they return a batch of examples at each iteration, which consists of indexing and they converting into tensors.

We'll use a "BucketIterator" which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using "torch.device", we then pass this device to the iterator.

In [0]:
BATCH_SIZE = 20

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_math, test_data_math), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [0]:
batch_math = next(train_iterator.__iter__())
batch_math


[torchtext.data.batch.Batch of size 20]
	[.Math]:[torch.cuda.FloatTensor of size 20 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 334x20 (GPU 0)]

Now, we will create an instance of our RNN class. 

Input dimension = dimension of the one-hot vectors = vocabulary size. 

Embedding dimension = size of the dense word vectors. 

It usually takes value between 50-250 , but depends on the size of the vocabulary.

Hidden dimension = size of the hidden states. 

It usually takes value between 100-500 , but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 250
HIDDEN_DIM = 200
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

In order to update the weights of the parameters in the model, we use stochastic gradient descent(SGD), which is used along with optim function in Python, 
which takes two parameters. The first argument model.parameters(), is updated by the optimizer and the second is learning rate, which is equal to lr.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

## 10.1 Training the Math model

The below "train" function is used to iterate over all examples(one batch at a time). 

"model.train()" is used to put the model in "training mode".

For each batch_info_theory in iterator, we first zero the gradients. Then, build our model for each example. Each parameter in the model will have a grad attribute, which is calculated by criterion, stored as grad. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Later, we calculate the gradient of each parameter in the model using loss.backward(), then update the parameters using gradient and the optimizer using optimizer.step().

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch_math in iterator:
        optimizer.zero_grad()
        predictions = model(batch_math.Abstract).squeeze(1)
        loss = criterion(predictions, batch_math.Math)
        acc = binary_accuracy(predictions, batch_math.Math)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## 10.2 Evaluating the Math model

The below "evaluate" function is used to iterate over all examples(one batch at a time). 

"model.eval()" is used to put the training model in evaluation state.

Build our model for each example. Later, attain a loss value using criterion and calculate accuracy using binary_accuracy. 

Finally, the loss and accuracy of each epoch is recorded and returned based on the length of iterator.

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch_math in iterator:
            predictions = model(batch_math.Abstract).squeeze(1)
            loss = criterion(predictions, batch_math.Math)
            acc = binary_accuracy(predictions, batch_math.Math)
            epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The below code is used to run for 5 epochs, and the start and end times are recorded, for each epoch and training loss, training accuracy is recorded.

In [0]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss_math, train_acc_math = train(model, train_iterator, optimizer, criterion)
    end_time = time.time()
    epoch_mins_math, epoch_secs_math = epoch_time(start_time, end_time)
    test_loss_math, test_acc_math = evaluate(model, test_iterator, criterion)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_math}m {epoch_secs_math}s')
    print(f'\tTrain Loss: {train_loss_math:.3f} | Train Acc: {train_acc_math*100:.2f}%')
    print(f'\tTest Loss: {test_loss_math:.3f} | Test Acc: {test_acc_math*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 45s
	Train Loss: 0.617 | Train Acc: 69.41%
	Test Loss: 0.616 | Test Acc: 0.08%
Epoch: 02 | Epoch Time: 2m 42s
	Train Loss: 0.616 | Train Acc: 69.43%
	Test Loss: 0.615 | Test Acc: 0.08%
Epoch: 03 | Epoch Time: 2m 43s
	Train Loss: 0.616 | Train Acc: 69.42%
	Test Loss: 0.614 | Test Acc: 0.08%
Epoch: 04 | Epoch Time: 2m 44s
	Train Loss: 0.616 | Train Acc: 69.42%
	Test Loss: 0.614 | Test Acc: 0.08%
Epoch: 05 | Epoch Time: 2m 43s
	Train Loss: 0.616 | Train Acc: 69.43%
	Test Loss: 0.613 | Test Acc: 0.08%


The below cell generates true and predicted values of each field.

In [0]:
y_predict_math = []
y_test_math= []
with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_math += rounded_preds.tolist()
        y_test_math += batch.Math.tolist()

## 10.3 Confusion Matrix for Math

Now, print the confusion matrix as well as based on the prediction values and the truth values, the efficiency of the model is calculated using the following metrics,

1. Accuracy
2. Macro Precision
3. Macro Recall
4. Macro F1 score
5. MCC

In [0]:
y_predict_math = np.asarray(y_predict_math)
print("Metrics for Neural Network Classifier3 for building Math")
y_test_math = np.asarray(y_test_math)
print(confusion_matrix(y_test_math,y_predict_math))
recall=recall_score(y_test_math,y_predict_math,average='macro')
precision=precision_score(y_test_math,y_predict_math,average='macro')
f1score=f1_score(y_test_math,y_predict_math,average='macro')
accuracy=accuracy_score(y_test_math,y_predict_math)
matthews = matthews_corrcoef(y_test_math,y_predict_math) 
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Metrics for Neural Network Classifier3 for building Math
[[13670    79]
 [ 5906    24]]
Accuracy: 0.6958686925148636
Macro Precision: 0.46565687725409755
Macro Recall: 0.49915067255542855
Macro F1 score:0.41418067097579964
MCC:-0.010801584454214426


## 10.4 Results of confusion matrix for math model

In [0]:
print("The number of true negatives for math classifier of neural network model is: " + str(confusion_matrix(y_test_math,y_predict_math)[0][0]))
print("The number of false positives for math classifier of neural network model is: " + str(confusion_matrix(y_test_math,y_predict_math)[0][1]))
print("The number of false negatives for math classifier of neural network model is: " + str(confusion_matrix(y_test_math,y_predict_math)[1][0]))
print("The number of true positives for math classifier of neural network model is: " + str(confusion_matrix(y_test_math,y_predict_math)[1][1]))


The number of true negatives for math classifier of neural network model is: 13670
The number of false positives for math classifier of neural network model is: 79
The number of false negatives for math classifier of neural network model is: 5906
The number of true positives for math classifier of neural network model is: 24


## 11. Neural Network model with pre-processing

### Type 1:- Adding weight to the Abstract using Title column

The above pre processing is done using "pandas", then the dataframe is written into a csv file and that "csv" file is read using Tabular dataset, and the above code has been run for each classifier and the final accuracies has been checked.

In [0]:
import pandas as pd
train_data = pd.read_csv('axcs_train.csv')
train_data['Abstract'] = train_data['Title'] + " " + train_data['Title'] + " " + train_data['Abstract']

Above step is done, to add weight to the Abstract column using title, because what the document is dealing about can be summarized maximum by the tokens present in the Title. So, adding title 2-3 times to a abstract column, and getting nouns and adjectives, can make the tokens in a title to add more weight to the abstract, which there by can add more weight to the particular document. So, when RNN is read over these tokens, if the same token is repeated, then the tokens will have a higher weights, which there by can have a more probability to get the context of those words and predict from those whether the context of those words is dealing about "InfoTheory" (or) "CompVis" (or) "Math", finally which can make the accuracy of the training and testing data to increase to some extent.

### Type -2 :- Getting Nouns and Adjectives

Below are the pre-processing steps that are done, and the generated output is passed through the tabular dataset, which is passed as an input to neural network model.

In [0]:
import nltk.data
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer

In [0]:
#using the below function word segmentation has been performed.
def segmentation(sentences):
    
    if re.search(r'(.*)''(.*)',sentences):
        sentences = re.sub(r"''","'",sentences)
        
    if re.search(r'(.*)``(.*)',sentences):
        sentences = re.sub(r"``","'",sentences)
      
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentencee = sent_detector.tokenize(sentences.strip())
    return sentencee

In [0]:
#using the below function tokenization has been done, by taking only the words that contain alphabets and the length of the word must be greater than 1
def tokenization(sentee):
    l1 = []
    for sent in sentee:
        l = []
        tokenizer = RegexpTokenizer( r"[A-Za-z]\w+(?:[-'?]\w+)?")
        unigram_tokens = tokenizer.tokenize(sent)
        for word in unigram_tokens:
            if word == unigram_tokens[0]:
                word = word.lower()
            l.append(word)
        l1.extend(l)
    return l1

In [0]:
'''
For each sentence after doing the segmentation, tokenization, the tokenized words has been generated.
For these tokenized words, POS tagging has been done, and only the nouns and adjectives are obtained.
'''
for sentencee in range(0, len(train_data)):
    sentee = segmentation(train_data[sentencee])
    tokenize_sentee = tokenization(sentee)
    filtered_sentence = [w for w in tokenize_sentee if not w in stop_words] 
    pos_fitered_sentence = nltk.pos_tag(filtered_sentence)
    noun_adj_filtered_sentences = []
    for word in pos_fitered_sentence:
        if word[1] == 'NN':
            noun_adj_filtered_sentences.append(word[0])
        if word[1] == 'JJ':
            noun_adj_filtered_sentences.append(word[0])
    print(noun_adj_filtered_sentences)

Now, these words which contain only the nouns and adjectives in a trained data, is saved with the same "axcs_csv" file in anew column namely "Abstract_Filtered" and it is read and passed as the input to the Tabular dataset. This pre-processing has been done because, using the "Abstract" column, we are trying to predict whether the document is about "InfoTheory" (or) "CompVis" (or) "Math". So, the content of document will always be maximum a 'Noun' or 'Adjective', as there is no point of having the document name as verb, conjuction, interjection likewise, as it doesn't make any sense. Now, passing only the respective tokens to my Neural Network model, will increase the accuracy of my trained model, take less time to pre-process and can learn in a better way, as it contains the more appropriate tokens.

The below snippet shows, the increase of accuracy after doing pre-processing, for one of the classifiers i.e, compvis

<img src = "compvis_cl2.png"/>

### Type 3:- Tuning the different hyper parameters based on sampling

Sampling has been illustrated in the below cell only for compvis read data. Similiarly, for infotheory and math also the same process is repeated.

In [0]:
from torch.utils.data import Dataset, DataLoader, random_split, SubsetRandomSampler, WeightedRandomSampler
train_dataset, val_dataset = random_split(train_data_compvis, (54531, 200)) #splitting the read tabular data and then doing the pre-processing

The below snippet shows, the increase of accuracy after tuning hyper paremeters to the best possible value for one of the classifiers i.e Infotheory.

<img src = "new_inff.png"/>

Finally, it can be concluded that if we do "pre-processing" and "tuning of hyper-parameters" gives the better results.

### All the hyper-parameters are initialized with the best possible accuarcy values obtained above when we build the neural network.

## Part 1.b :- Machine Learning Model

### Libraries used

- **nltk**:
It is one of the platforms for working with human language data and python, the module nltk is used for natural language processing.

- **stopwords**:
It is a method in nltk package, which is used to remove stop words from nltk data.

- **word_tokenize**:
It is a method in nltk package, used for tokenization.

- **WordNetLemmatizer**:
It is used for doing lemmatization for a given words in nltk data.

- **Scikit-learn**:
It is an open source python library, that has many powerful packages, functions to deal with data analysis and data mining.

- **TfidfVectorizer**:
It is a type of vectorizer in sklearn, which transforms text into a sparse matrix.

- **sklearn.metrics**:
It is used to calculate the accuracy metrics of the build model.

- **sklearn.linear_model**:
Used to build linear models like linear regression, logistic regression.

- **sklearn.naive_bayes**:
Used to build naive models like Multinomial Naive Bayes, Gaussian Naive Bayes, Bernoulli Naive Bayes.

- **sklearn.svm**:
Used to build support vector machine models like linear svc, svc.

- **sklearn.ensemble**:
Used to build Ramdom Forest Classifier Model.

- **sklearn.model_selection**:
Used to do cross validation to check accuracy of the given model.


## 1. Importing Libraries

In [0]:
from nltk.corpus import stopwords
from nltk import word_tokenize    
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB,BernoulliNB
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import text
from sklearn import cross_validation
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
import re

## 2. Defining a class to do lemmatization

The below class is used to do lemmatization, for the tokenized words , which is tokenized using word_tokenize function from nltk. Lemmatization is the process of grouping the similiar words into one word, when there exists different forms of a single word. It is similiar process to stemming, but doing stemming will always loose the actual meaning of word and is a bad idea to do it always. Doing lemmatization is always a good one to do.

In [0]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [0]:
#reading train and test data
train_data = pd.read_csv("axcs_train.csv")
test_data = pd.read_csv("axcs_test.csv")

In [0]:
train_data[0:5]

Unnamed: 0,ID,URL,Date,Title,InfoTheory,CompVis,Math,Abstract
0,cs-9301111,arxiv.org/abs/cs/9301111,1989-12-31,Nested satisfiability,0,0,0,Nested satisfiability A special case of the s...
1,cs-9301112,arxiv.org/abs/cs/9301112,1990-03-31,A note on digitized angles,0,0,0,A note on digitized angles We study the confi...
2,cs-9301113,arxiv.org/abs/cs/9301113,1991-07-31,Textbook examples of recursion,0,0,0,Textbook examples of recursion We discuss pro...
3,cs-9301114,arxiv.org/abs/cs/9301114,1991-10-31,Theory and practice,0,0,0,Theory and practice The author argues to Sili...
4,cs-9301115,arxiv.org/abs/cs/9301115,1991-11-30,Context-free multilanguages,0,0,0,Context-free multilanguages This article is a...


In [0]:
test_data[0:5]

Unnamed: 0,ID,URL,Date,Title,InfoTheory,CompVis,Math,Abstract
0,no-150100335,arxiv.org/abs/1501.00335,01-01-2015,A Data Transparency Framework for Mobile Appli...,0,0,0,A Data Transparency Framework for Mobile Appl...
1,no-14024178,arxiv.org/abs/1402.4178,01-01-2015,A reclaimer scheduling problem arising in coal...,0,0,0,A reclaimer scheduling problem arising in coa...
2,no-150100263,arxiv.org/abs/1501.00263,01-01-2015,Communication-Efficient Distributed Optimizati...,0,0,1,Communication-Efficient Distributed Optimizat...
3,no-150100287,arxiv.org/abs/1501.00287,01-01-2015,Consistent Classification Algorithms for Multi...,0,0,0,Consistent Classification Algorithms for Mult...
4,no-11070586,arxiv.org/abs/1107.0586,01-01-2015,Managing key multicasting through orthogonal s...,0,0,0,Managing key multicasting through orthogonal ...


In [0]:
#converting the required columns into list using training data
train_abstract = train_data.Abstract.tolist()
train_InfoTheory = train_data.InfoTheory.tolist()
train_CompVis = train_data.CompVis.tolist()
train_Math = train_data.Math.tolist()

In [0]:
#converting the required columns into list using testing data
test_abstract = test_data.Abstract.tolist()
test_InfoTheory = test_data.InfoTheory.tolist()
test_CompVis = test_data.CompVis.tolist()
test_Math = test_data.Math.tolist()

### **TfidfVectorizer**:

Tfidf stands for term frequency inverse document frequency. It is used to convert a collection of raw documents into Tf-idf vectors. Here, each word is converted into a feature index in the matrix, which is passed as an input of an estimator to predict the predictor. Tfidf creates its own sets of vocabulary, with sorted words and the output being an integer for each word, which is index. Now, when we pass certain words in a particular document it gives, weighing values in the form of vector for those repeated words and the remaining words being 0, based on the term frequency of a word in a particular document with respect to all documents.

## 3. Preprocessing steps involved in Machine Learning for Text-Classification:-

The preprocessing steps involved are, 

1.Convert into lower case

2.Tokenization

3.Removal of stop words

4.Generating bigrams along with unigrams

5.Lemmatization

In [0]:
my_stop_words = text.ENGLISH_STOP_WORDS
'''
It is the type of vectorizer used to convert into sparse matrix by taking the given content / sentence.
It consists of following parameters like analyzer, which is a word. It means it analyzes word to word, by taking the total input.
lowercase parameter is used to convert all words to either lower/upper, it can be by instantiating True/False value to a lower case parameter.
token_pattern consists of identifying the pattern in the given content.
By default stop words parameters is not set, but adding the "stop_words = set(stop_words)" will remove stopwords from the generated tokens.
min_df deals with the term must present in atleast 3 documents.
ngram_range deals with generation of no.of grams like unigrams, bigrams,trigrams likewise. Here we generate only unigrams and bigrams.
tokenizer attribute is used to retrieve the words after called lemmatokenizer where for each word lemmatization has been performed.
'''
vectorizer = TfidfVectorizer(analyzer='word',input='content',
                           lowercase=True,
                           token_pattern=r"[A-Za-z]\w+(?:[-'?]\w+)?", stop_words = set(my_stop_words),
                           min_df=3,
                           ngram_range=(1,2),
                           tokenizer=LemmaTokenizer())

In [0]:
#Building the machine learning models
models = [
    LogisticRegression(),
    BernoulliNB(),
    LinearSVC(),
    RandomForestClassifier()
]

## 4. Building a Machine Learning Classifier 1 for InfoTheory

In [0]:
#convert abstract, infotheory columns of train data to a vector format using tfidf vectorizer
x_train = vectorizer.fit_transform(train_abstract)
y_train = np.asarray(train_InfoTheory)

  'stop_words.' % sorted(inconsistent))


In [0]:
'''
For each of the models, acquiring the accuracies using cross validation and appending their values into a dataframe with respect to each fold, where 
final dataframe consists of three columns namely, model_name, fold_idx indicates fold id of the cross validation and its respective accuracy, which iterates
through all folds for each model.
'''
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
     model_name = model.__class__.__name__
     accuracies = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=CV)
     for fold_idx, accuracy in enumerate(accuracies):
          entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])



In [0]:
#convert abstract, infotheory columns of test data to a vector format using tfidf vectorizer
x_test = vectorizer.transform(test_abstract)
y_test = np.asarray(test_InfoTheory)

## Results of Machine Learning classifier 1 for InfoTheory

In [0]:
#Building a confusion matrix and the accuracy metrics for each model
for clf in models:
    model_name = clf.__class__.__name__
    clf.fit(x_train, y_train)
    print(model_name)
    y_predict=clf.predict(x_test)
    print(confusion_matrix(y_test,y_predict))
    recall=recall_score(y_test,y_predict,average='macro')
    precision=precision_score(y_test,y_predict,average='macro')
    f1score=f1_score(y_test,y_predict,average='macro')
    accuracy=accuracy_score(y_test,y_predict)
    matthews = matthews_corrcoef(y_test,y_predict) 
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
[[15881   182]
 [  851  2765]]
Accuracy: 0.9475074952995579
Macro Precision: 0.9436908269701535
Macro Recall: 0.876663346521633
Macro F1 score:0.9055518821563093
MCC:0.8176113299301296
BernoulliNB
[[15658   405]
 [  594  3022]]
Accuracy: 0.9492352253671427
Macro Precision: 0.9226357433882932
Macro Recall: 0.9052584327804403
Macro F1 score:0.9136212996669142
MCC:0.8277117831770574
LinearSVC
[[15818   245]
 [  600  3016]]
Accuracy: 0.957060826261497
Macro Precision: 0.9441622083360464
Macro Recall: 0.9094091764782364
Macro F1 score:0.9255557225864601
MCC:0.8528636091137095




RandomForestClassifier
[[15917   146]
 [ 1426  2190]]
Accuracy: 0.9201178921693175
Macro Precision: 0.9276383122873781
Macro Recall: 0.7982761908447876
Macro F1 score:0.8444148409692677
MCC:0.7142949722583042


## 5. Building a Machine Learning Classifier 2 for CompVis

In [0]:
#convert abstract, compvis columns of train data to a vector format using tfidf vectorizer
x_train = vectorizer.fit_transform(train_abstract)
y_train = np.asarray(train_CompVis)

In [0]:
'''
For each of the models, acquiring the accuracies using cross validation and appending their values into a dataframe with respect to each fold, where 
final dataframe consists of three columns namely, model_name, fold_idx indicates fold id of the cross validation and its respective accuracy, which iterates
through all folds for each model.
'''
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
     model_name = model.__class__.__name__
     accuracies = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=CV)
     for fold_idx, accuracy in enumerate(accuracies):
          entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])



In [0]:
#convert abstract, compvis columns of test data to a vector format using tfidf vectorizer
x_test = vectorizer.transform(test_abstract)
y_test = np.asarray(test_CompVis)

## Results of Machine Learning classifier 2 for CompVis

In [0]:
#Building a confusion matrix and the accuracy metrics for each model
for clf in models:
    model_name = clf.__class__.__name__
    clf.fit(x_train, y_train)
    print(model_name)
    y_predict=clf.predict(x_test)
    print(confusion_matrix(y_test,y_predict))
    recall=recall_score(y_test,y_predict,average='macro')
    precision=precision_score(y_test,y_predict,average='macro')
    f1score=f1_score(y_test,y_predict,average='macro')
    accuracy=accuracy_score(y_test,y_predict)
    matthews = matthews_corrcoef(y_test,y_predict) 
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
[[17462    65]
 [  833  1319]]
Accuracy: 0.9543675999796738
Macro Precision: 0.9537515580396425
Macro Recall: 0.8046048258417231
Macro F1 score:0.8604861651286868
MCC:0.7435453296526738
BernoulliNB
[[17477    50]
 [ 1299   853]]
Accuracy: 0.9314497687890645
Macro Precision: 0.9377224748164642
Macro Recall: 0.6967613615997241
Macro F1 score:0.7606346709160439
MCC:0.5869475961197504
LinearSVC
[[17437    90]
 [  471  1681]]
Accuracy: 0.9714924538848518
Macro Precision: 0.9614400795230835
Macro Recall: 0.8879994471620312
Macro F1 score:0.9205826956552985
MCC:0.8462588156193356
RandomForestClassifier
[[17487    40]
 [ 1519   633]]
Accuracy: 0.9207784948422176
Macro Precision: 0.9303212530523324
Macro Recall: 0.645931394112493
Macro F1 score:0.7027339229485574
MCC:0.5011881098915327


## 6. Building a Machine Learning Classifier 3 for Math

In [0]:
#convert abstract, math columns of train data to a vector format using tfidf vectorizer
x_train = vectorizer.fit_transform(train_abstract)
y_train = np.asarray(train_Math)

In [0]:
'''
For each of the models, acquiring the accuracies using cross validation and appending their values into a dataframe with respect to each fold, where 
final dataframe consists of three columns namely, model_name, fold_idx indicates fold id of the cross validation and its respective accuracy, which iterates
through all folds for each model.
'''
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
     model_name = model.__class__.__name__
     accuracies = cross_val_score(model, x_train, y_train, scoring='accuracy', cv=CV)
     for fold_idx, accuracy in enumerate(accuracies):
          entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])



In [0]:
#convert abstract, math columns of test data to a vector format using tfidf vectorizer
x_test = vectorizer.transform(test_abstract)
y_test = np.asarray(test_Math) 

## Results of Machine Learning classifier 3 for Math

In [0]:
#Building a confusion matrix and the accuracy metrics for each model
for clf in models:
    model_name = clf.__class__.__name__
    clf.fit(x_train, y_train)
    print(model_name)
    y_predict=clf.predict(x_test)
    print(confusion_matrix(y_test,y_predict))
    recall=recall_score(y_test,y_predict,average='macro')
    precision=precision_score(y_test,y_predict,average='macro')
    f1score=f1_score(y_test,y_predict,average='macro')
    accuracy=accuracy_score(y_test,y_predict)
    matthews = matthews_corrcoef(y_test,y_predict) 
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
[[13018   731]
 [ 1702  4228]]
Accuracy: 0.8763656689872453
Macro Precision: 0.8684831241177657
Macro Recall: 0.8299086599215494
Macro F1 score:0.845551062705082
MCC:0.6973256733590067
BernoulliNB
[[12517  1232]
 [ 1306  4624]]
Accuracy: 0.8710300320138218
Macro Precision: 0.8475686361014687
Macro Recall: 0.8450786977363492
Macro F1 score:0.8463049300612113
MCC:0.6926428583906539
LinearSVC
[[12810   939]
 [ 1379  4551]]
Accuracy: 0.8822094618628995
Macro Precision: 0.8658868930638566
Macro Recall: 0.849578874784332
Macro F1 score:0.8570266673804481
MCC:0.7152798847321185
RandomForestClassifier
[[13391   358]
 [ 3131  2799]]
Accuracy: 0.8227044057116724
Macro Precision: 0.8485481505601021
Macro Recall: 0.7229842440173788
Macro F1 score:0.7503930353624455
MCC:0.5575688149683405


## Task - 2. Topic Modelling

Topic model is a type of statistical model used in Machine Learning and Natural Language Processing , used for the discovery of abstract "topics" that occur in a collection of documents.

Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

In [1]:
#read the dataframe using pandas
import pandas as pd
monash_df = pd.read_csv('Monash_crawled.csv')

In [2]:
monash_df[0:5]

Unnamed: 0,uri,url,date,title,body
0,1395271488,http://www.theguardian.com/environment/2020/ja...,2020-01-01,Canberra experiences worst air quality on reco...,Canberra\n has experienced its worst air qual...
1,1396563053,https://weather.com/news/news/2020-01-02-thous...,2020-01-02,Thousands Clog Roads Fleeing Australian Bushfi...,As\n dawn broke over a blackened Australi...
2,1397549175,https://www.businessinsider.com/baby-milestone...,2020-01-03,Key milestones your baby can reach in the firs...,Your baby's brain and body grow a lot during ...
3,1397689515,https://www.dailymail.co.uk/health/article-784...,2020-01-03,"Air pollution can break your BONES, study sugg...",Living in polluted cities may make your bones...
4,1397806413,https://www.independent.co.uk/life-style/gadge...,2020-01-03,'World's most efficient battery' can power a s...,Researchers have developed a new battery they...


In [3]:
len(monash_df)

366

In [4]:
#convert the body column to list
docs = monash_df['body'].tolist()
print(len(docs))
print(docs[0][0:500])

366
 Canberra
 has experienced its worst air quality on record, as bushfire smoke 
became trapped by atmospheric conditions and residents were told to stay
 indoors and brace for more smog in the coming days.   The ACTas 
acting chief health officer, Dr Paul Dugdale, said the smoke was the 
worst since the 2003 bushfires and was acertainly the worsta since air 
quality monitoring started in the city 15 years ago.   Air quality 
index readings in Canberra city were at 3,463 on Wednesday afternoon, 
a


## 2.1 Preprocessing required for LDA(Latent Dirichlet Allocation)

- **Latent Dirichlet Allocation**:-

It is one of the kinds used to build Topic Modelling, which is used to classify the text in a document to a particular topic. It builds a topic per document model as well as words per topic model, which are modeled as "Dirichlet" distributions.

The main goal of LDA is to map all documents to the topics in such a way that the words in each document are mostly captured by these imaginary topics.

The preprocessing required for LDA are,

1.Tokenization

2.Lemmatization

3.Generating Bigrams

4.Acquiring Nouns and Adjectives

5.Compute a bag-of-words 

### Tokenization

In [5]:
# importing the RegexpTokenizer of nltk
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens, based on tokenizer and convert the obtained tokens into a lower case
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  
    docs[idx] = tokenizer.tokenize(docs[idx])  

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

### Lemmatization

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

### Generating Bigrams

In [7]:
from gensim.models import Phrases
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

### Obtaining Nouns, Adjective tokens

In [8]:
import nltk

In [9]:
tagged_tokens = []
for word in docs:
    tagged_tokens.append(nltk.pos_tag(word))

In [10]:
#getting only nouns and adjectives tokens
new_list = []
for i in tagged_tokens:
    temp = [j[0] for j in i if j[1].startswith("NN") or j[1].startswith("JJ")]
    new_list.append(temp)

print(new_list)



In [11]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(new_list)

# Filter out words that occur less than 20 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
print(dictionary)

Dictionary(903 unique tokens: ['able', 'act', 'activity', 'advice', 'afternoon']...)


In [12]:
corpus = [dictionary.doc2bow(doc) for doc in new_list]
print(corpus)

[[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1), (5, 9), (6, 6), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 3), (13, 1), (14, 4), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 4), (23, 1), (24, 7), (25, 1), (26, 1), (27, 1), (28, 2), (29, 1), (30, 1), (31, 2), (32, 2), (33, 1), (34, 2), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 2), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 3), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 6), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 6), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 8), (90, 1), (91, 1), (92, 1), (93, 2), (94, 1), (95, 1), (96, 1), (97, 6), (98, 1), (99, 1), (100, 3), (101, 1)], [(5, 3), (6, 1), (10, 1), (12, 5), (14, 4), (18, 1), (24, 2), (37, 11), (40, 1), (46, 1

In [13]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 903
Number of documents: 366


In [14]:
import pickle
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


## 2.2 Training one turn of LDA model

In [15]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
NUM_TOPICS = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None 
# Make a index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

In [16]:
top_topics = model.top_topics(corpus) 

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.0447.
[([(0.038073655, 'ship'),
   (0.02877492, 'cruise'),
   (0.021672508, 'princess'),
   (0.021050977, 'cruise_ship'),
   (0.020372745, 'passenger'),
   (0.020066619, 'diamond'),
   (0.019980077, 'diamond_princess'),
   (0.017404117, 'japan'),
   (0.010913877, 'pandemic'),
   (0.010723525, 'board'),
   (0.009597029, 'medical'),
   (0.009423708, 'positive'),
   (0.008654467, 'tested_positive'),
   (0.008336073, 'quarantine'),
   (0.007957554, 'minister'),
   (0.0077833314, 'february_february'),
   (0.0077095474, 'yokohama'),
   (0.007400305, 'january_january'),
   (0.00690637, 'friday'),
   (0.0068566157, 'thursday')],
  -0.6394128144554867),
 ([(0.032558367, 'area'),
   (0.023451507, 'wuhan'),
   (0.014295968, 'chinese'),
   (0.01333323, 'symptom'),
   (0.012335805, 'hospital'),
   (0.010656995, 'human'),
   (0.009952416, 'data'),
   (0.009669711, 'authority'),
   (0.008704036, 'medical'),
   (0.0083388975, 'man'),
   (0.008269775, 'result'),
   (0.008191

## 2.3 Visualizing the LDA model for one turn

In [17]:
# !pip3 install pyLDAvis
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


## 2.4 Training second turn of LDA model

Second turn of LDA model is trained by changing the hyper parameters that are required for LDA model.

In [18]:
# Train LDA model.
from gensim.models import LdaModel
import random
# Set training parameters.
random.seed(150)
NUM_TOPICS = 50
chunksize = 3000
passes = 70
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make a index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)
outputfile = f'model{NUM_TOPICS}.gensim'
print("Saving model in " + outputfile)
print("")
model.save(outputfile)

Saving model in model50.gensim



  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


## 2.5 Visualizing the LDA model for second turn

In [19]:
lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))
