# Assignment 1 for FIT5212, Semester 1, 2020

**Student Name:**  VIVEK VEERI

**Student ID:**    30081092

Version: 1.0

Environment: Python 3.7.0 (64-bit)

Libraries used:
* [re 2.2.1 (for regular expression, included in Python 3.7)](https://docs.python.org/3/library/re.html)
* [pandas 0.25.0 (for data frames)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
* [NumPy](https://docs.scipy.org/doc/)
* [PyTorch](https://pytorch.org/docs/stable/torch.html)
* [Time](https://docs.python.org/3/library/time.html)
* [TorchText](https://pypi.org/project/torchtext/)
* [scikit-learn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)
* [gensim](https://pypi.org/project/gensim/) 
* [pyLDAvis](https://pypi.org/project/pyLDAvis/)

##  Introduction
 This assessment comprises two sections: 
 
 [__1. Text Classification__](#Text_Classification)
 
 [__2. Topic Modelling__](#Topic_Modelling)
 
 
  **`1. Text Classification`**:
  
 In __`Text Classification`__ the content was gathered from __`arXiv.org`__, the influential academic website that hosts papers from Mathematics, Physics and Computer Science. We are provided with two input files in `.csv` (Comma Separated Values) format: `axcs_train`, which has data from 1990 to 2014, and `axcs_test`, which has data from 2015 and part of 2016.
 
The dataset has the following columns: `ID`, `URL`, `Date`, `Title`, `InfoTheory`, `CompVis`, `Math` and `Abstract`.
The three classes are `InfoTheory`, `CompVis` and `Math`. They can occur in any combination: an article may belong to all three, two, one or none of them. The main task is to build text classifiers that predict these three classes using only the `Abstract` field. We have to build two different text classifiers, i.e., 

one [__Neural Network Model__](#NN) and

one [__Machine Learning Model__](#ML) for each of the three classes.

  **`2. Topic Modelling`**:
  
In __`Topic Modelling`__ the content was gathered from news sites that contain the word or label `Monash University`. We are provided with one input `.csv` (Comma Separated Values) file, `Monash_crawled.csv`.

Here we have to perform initial text pre-processing, followed by two runs of an LDA model. We have to choose appropriate model parameters and pre-processing techniques, and finally visualise the output and interpret the analysis.

More details for each task will be given in the following sections.

## Part 1:  Text Classification<a id='Text_Classification' ></a>

Here we need to build three text classifiers that predict these three classes using the `Abstract` field. For each of the three classes we need to build a Neural Network model using `PyTorch` and `TorchText`.

A neural network is a collection of algorithms that attempt to identify underlying associations in a data set through a mechanism that imitates how the human brain functions. We use neural networks to identify associations and hidden trends in raw data as well as to cluster and classify raw data, and to learn and grow continuously over time. There are many neural network strategies like `Perceptrons`, `Convolutional Neural Networks`, `Recurrent Neural Networks` etc. 

Here I am using a `Recurrent Neural Network`, as it is commonly used for analysing sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and for each word produces a hidden state $h$. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to generate the next hidden state, $h_t$. 

$$h_t = \text{RNN}(x_t, h_{t-1})$$ 


Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear layer, $f$ (also known as a fully connected layer), to receive our predicted class score, $\hat{y} = f(h_T)$.
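
To make the recurrence concrete, here is a minimal pure-Python sketch of a single-unit RNN with a tanh non-linearity (the default non-linearity of `nn.RNN`). The weights `w_x`, `w_h`, `b` and the input values are made up for illustration, and the hidden state is a single number rather than a vector.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrence step: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, h0=0.0):
    """Feed the sequence one element at a time and collect every hidden state."""
    hidden_states = []
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h)      # the new state depends on the input and the old state
        hidden_states.append(h)
    return hidden_states

states = run_rnn([1.0, -0.5, 0.25])   # hypothetical inputs x_1, x_2, x_3
h_T = states[-1]                      # final hidden state, passed on to the linear layer
print(states)
print(h_T)
```

The list `states` plays the role of the per-step outputs, while `h_T` is the final hidden state that the linear layer would receive.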

<img src = "Recurrent-Neural-Network.png"/>

### Import Libraries

1. __re__ : This library provides regular expression pattern matching operations used to extract the contents of the data. 
2. __pandas__ : This library is used to read the data and to convert extracted data into CSV format.
3. __numpy__ : This library is used for operating on arrays.
4. __torch__ : PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
5. __torch.nn__ : This module provides the building blocks for neural networks, including `nn.Module`, the base class for all neural network models.
6. __time__ : This library provides various time related functions.
7. __torchtext__ : This library provides tools for defining a text preprocessing pipeline.
8. __sklearn.metrics__ : This module includes score functions, performance metrics, pairwise metrics and distance computations.
9. __TabularDataset__ : This torchtext class reads data in csv, tsv and json formats.

In [0]:
import re
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import time
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from torchtext import data
from torchtext.data import TabularDataset

### Part 1a: Neural Network Model<a id='NN' ></a>

First we will create a Recurrent Neural Network (RNN) model, train it on the training data and evaluate it on the test data.

When building models in PyTorch there is some boilerplate to remember: our `RNN` class is a subclass of `nn.Module`, and its constructor calls `super().__init__()`.

The RNN module consists of three layers: 

**1. Embedding layer** - This layer is used to transform our sparse one-hot vector into a dense embedding vector. It is simply a single, fully connected layer.

**2. RNN layer** - This layer takes our dense vector and the previous hidden state, $h_{t-1}$, and uses them to calculate the next hidden state, $h_{t}$.

**3. Linear layer** - This layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.
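
The claim that the embedding layer is just a fully connected layer applied to a one-hot vector can be checked by hand. The sketch below uses a made-up 3-word vocabulary and 2-dimensional embeddings; multiplying the one-hot vector by the weight matrix gives the same result as looking up one row, which is why an index lookup is sufficient in practice.

```python
# Hypothetical embedding matrix: one row per word in the vocabulary, one column per embedding dimension
embedding_matrix = [[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]]

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix (a fully connected layer without bias)."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

word_index = 2
dense_via_matmul = vector_times_matrix(one_hot(word_index, 3), embedding_matrix)
dense_via_lookup = embedding_matrix[word_index]
print(dense_via_matmul, dense_via_lookup)
```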


Typically we have to feed the initial hidden state, $h_0$, into the RNN, but in PyTorch, if no initial hidden state is passed, it defaults to a tensor of all zeros.

RNN returns two tensors: 

**a.output** - concatenation of the hidden state from every time step

**b.hidden** - final hidden state

**Steps:**

     a. The `forward` method is called when we feed examples to our model.
     b. `text` is a tensor of size [sentence length, batch size], i.e. a batch of sentences in which 
     each word is represented by the index of its one-hot vector. 
     c. The input batch is then passed through the embedding layer, which gives us a 
     dense vector representation of our sentences, and this is fed into the RNN.
     d. The RNN returns two tensors, `output` and `hidden`. The `assert` statement verifies that the 
     last time step of `output` equals `hidden`.
     e. The `squeeze` method is used to remove a dimension of size 1.
Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction.

In [0]:
## Creating a class for RNN model
class RNN(nn.Module):
    '''
    __init__ method is created to define the layers present in RNN module. Usually the values for the parameters
    inside this method are initialised randomly, unless specified. 
    super() returns the delegate object to the parent class, so you call directly the method you want.
    '''
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):        
        super().__init__()        
        self.embedding = nn.Embedding(input_dim, embedding_dim)        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)        
        self.fc = nn.Linear(hidden_dim, output_dim)
  
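    def forward(self, text):
        # text: [sentence length, batch size]
        embedded = self.embedding(text)        # [sentence length, batch size, embedding dim]
        output, hidden = self.rnn(embedded)    # output: every time step, hidden: final time step
        # the last time step of output must equal the final hidden state
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        return self.fc(hidden.squeeze(0))      # [batch size, output dim]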

### Loading data

One of the main concepts in TorchText is the `Field`, which defines how the data should be processed.

Here we use two fields: 

**`TEXT`** - Defines how the abstract should be processed

**`LABEL`** - Defines how the class label should be processed

In [0]:
# Setting random seed for reproducibility
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Use the `spacy` tokenizer for the tokenize argument.
# If no tokenizer is passed, the text is split on whitespace.
TEXT = data.Field(sequential=True, tokenize = 'spacy', lower=True)
# `LabelField` is a special field used to handle labels.
LABEL = data.LabelField(dtype = torch.float, use_vocab=False, preprocessing=int)

To load the data we have to tell TorchText which field handles each column. The fields must be listed in the same order as the columns in the file.
We do not need every column, so we pass `None` as the field for the columns we are not using.

### a. Neural Network Model for Compvis Column

In [0]:
## Applying NN model on CompVis column
'''
For train data we have to process `Label`. All fields in the data should pass in the same order of columns.
We won't be needing the few columns, so we pass in `None` for the field.
'''
datafields_compvis = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", None),
                 ("CompVis", LABEL),
                 ('Math',None),
                 ("Abstract", TEXT)]

train_data_compvis, test_data_compvis = TabularDataset.splits(path='',
                                              train='axcs_train.csv', 
                                              test='axcs_test.csv', 
                                              format='csv',
                                              skip_header=True,
                                              fields=datafields_compvis)

In [5]:
'''
Checking the length of both train data & test data.
'''
print(f'Number of training examples: {len(train_data_compvis)}')
print(f'Number of testing examples: {len(test_data_compvis)}')

Number of training examples: 54731
Number of testing examples: 19679


In [6]:
print(vars(train_data_compvis.examples[0]))

{'CompVis': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


Next we have to build a vocabulary. This is effectively a lookup table in which every unique word in the data set has a corresponding index. 

We do this because our model cannot operate on strings. Each index corresponds to a one-hot vector: a vector in which all elements are 0 except one, which is 1, and whose dimensionality is the total number of unique words in the vocabulary.

There are two ways to reduce the size of the vocabulary: we can either keep only the top $n$ most common words or ignore words that appear fewer than $m$ times. Here we keep the top 6000 words.

In [7]:
MAX_VOCAB_SIZE = 6000

TEXT.build_vocab(train_data_compvis, max_size = MAX_VOCAB_SIZE)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 6002


We set the maximum vocabulary size to 6000 but got 6002 tokens. The two additional tokens are the `<unk>` (unknown word) token and the `<pad>` (padding) token.
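
As a rough illustration of what building a size-limited vocabulary involves, the sketch below counts tokens with `collections.Counter`, keeps the most frequent ones and reserves the first two indices for the special tokens. It is a simplified stand-in, not TorchText's actual implementation; `build_vocab`, `numericalise` and the toy documents are made up for this example.

```python
from collections import Counter

def build_vocab(tokenised_docs, max_size):
    """Keep the max_size most frequent tokens, placed after the two special tokens."""
    counts = Counter(tok for doc in tokenised_docs for tok in doc)
    itos = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Toy tokenised abstracts (hypothetical data)
docs = [['the', 'clauses', 'have', 'a', 'structure'],
        ['the', 'problem', 'is', 'solvable'],
        ['the', 'clauses', 'are', 'nested']]

itos, stoi = build_vocab(docs, max_size=2)

def numericalise(doc):
    """Map each token to its index; words outside the vocabulary map to <unk>."""
    return [stoi.get(tok, stoi['<unk>']) for tok in doc]

print(len(itos))                                   # max_size plus the two special tokens
print(numericalise(['the', 'nested', 'clauses']))
```

With `max_size=2` the vocabulary holds 4 entries, mirroring how a limit of 6000 yields 6002 tokens above.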

When feeding sentences to our model we batch them and send more than one at a time. All sentences within a batch must be the same length, so shorter sentences are padded with the `<pad>` token.

We can also inspect the vocabulary directly using either the `stoi` (**s**tring **to** **i**nt) or `itos` (**i**nt **to**  **s**tring) attribute.

In [8]:
print(TEXT.vocab.itos[:10])
print(LABEL)

['<unk>', '<pad>', 'the', 'of', '.', ',', 'a', 'and', '-', 'to']
<torchtext.data.field.LabelField object at 0x7f4259d64278>


Creating the iterators is the final step in preparing the data. We iterate over these in the training loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator`, a special type of iterator that returns batches in which the examples are of similar length. This minimizes the amount of padding needed while still producing freshly shuffled batches for each new epoch.
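
The effect of grouping examples of similar length can be sketched with a few made-up abstract lengths. This simplified version just sorts the lengths before batching; the real `BucketIterator` also shuffles, so the helper `padding_needed` below is an illustrative assumption rather than a description of its internals.

```python
def padding_needed(lengths, batch_size):
    """Total number of <pad> tokens when consecutive examples are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)   # every example is padded to the longest one
    return total

lengths = [227, 12, 150, 15, 200, 10, 140, 18]     # hypothetical abstract lengths in tokens

unsorted_pad = padding_needed(lengths, batch_size=4)
bucketed_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, bucketed_pad)                  # 936 208
```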

We also want to place the tensors returned by the iterator on the GPU when one is available. PyTorch handles this using `torch.device`, which we then pass to the iterator.

In [0]:
BATCH_SIZE = 16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_compvis, test_data_compvis), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [10]:
batch_compvis = next(iter(train_iterator))
batch_compvis


[torchtext.data.batch.Batch of size 16]
	[.CompVis]:[torch.cuda.FloatTensor of size 16 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 227x16 (GPU 0)]

We now create an instance of our RNN class. 

**Input dimension** - Dimension of the one-hot vectors, which is equal to the vocabulary size. 

**Embedding dimension** - Size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

**Hidden dimension** - Size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

**Output dimension** - Usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 350
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.

In [12]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 758,751 trainable parameters
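
The reported total can be reproduced by hand from the layer sizes. The arithmetic below assumes the standard parameter layout of `nn.Embedding`, a single-layer `nn.RNN` (input-to-hidden weights, hidden-to-hidden weights and two bias vectors) and `nn.Linear`.

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 6002, 100, 350, 1

embedding_params = vocab_size * embedding_dim       # one dense vector per token
rnn_params = (embedding_dim * hidden_dim            # input-to-hidden weights
              + hidden_dim * hidden_dim             # hidden-to-hidden weights
              + 2 * hidden_dim)                     # two bias vectors
fc_params = hidden_dim * output_dim + output_dim    # weights plus bias

total_params = embedding_params + rnn_params + fc_params
print(f'{embedding_params:,} + {rnn_params:,} + {fc_params:,} = {total_params:,}')
```

Most of the parameters (600,200 of 758,751) sit in the embedding layer, so the vocabulary size dominates the model size.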


### Train the model for Compvis 

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the model. Here, we'll use **_stochastic gradient descent_ (SGD)**. The first argument is the set of parameters that will be updated by the optimizer; the second is the learning rate, i.e. how much we change the parameters by when we do a parameter update.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)
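
As a minimal sketch of what a plain SGD update does (no momentum or weight decay, which is also the default for `optim.SGD`), each parameter is moved against its gradient by an amount scaled by the learning rate. The parameter and gradient values below are made up for illustration.

```python
def sgd_step(params, grads, lr):
    """Plain SGD update: p <- p - lr * grad, applied to every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [1.0, -2.0]     # hypothetical parameter values
grads = [0.5, -1.0]      # hypothetical gradients of the loss w.r.t. each parameter

updated = sgd_step(params, grads, lr=1e-3)
print(updated)
```

With a learning rate of `1e-3` each update is small, which is one reason the loss can change slowly from epoch to epoch.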

In PyTorch the loss function is commonly called the `criterion`. Here I use `Binary Cross Entropy with logits`.

As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the **_sigmoid_** function (the inverse of the **_logit_** function). 

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
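
The two steps can be written out in a few lines of plain Python. This is only a sketch of the underlying formula, averaged over the batch; PyTorch's own implementation uses a numerically more stable formulation, and the logits and targets below are made up.

```python
import math

def sigmoid(z):
    """Squash a raw score (logit) into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross entropy, applying the sigmoid to the raw scores first."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

undecided = bce_with_logits([0.0], [1.0])          # the model outputs probability 0.5
confident_right = bce_with_logits([4.0], [1.0])    # high score for a positive example
confident_wrong = bce_with_logits([4.0], [0.0])    # high score for a negative example
print(undecided, confident_right, confident_wrong)
```

A confident wrong prediction is penalised far more heavily than a confident correct one is rewarded.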

In [0]:
criterion = nn.BCEWithLogitsLoss()

Using `.to`, we can place the model and the criterion on the GPU.

In [0]:
model = model.to(device)
criterion = criterion.to(device)

The following function calculates the accuracy. It first passes the predictions through a sigmoid and rounds them to 0 or 1.

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

The `train` function iterates over all examples, one batch at a time. 

`model.train()` is used to put the model in "training mode".

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute that stores the gradient calculated from the criterion. PyTorch accumulates gradients rather than overwriting them, so they must be reset for every batch.

The loss and accuracy are then calculated using our predictions and the labels, `batch_compvis.CompVis`, with the loss being averaged over all examples in the batch.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.

The loss and accuracy are accumulated over the epoch; the `.item()` method extracts a scalar from a tensor that holds a single value.

The function returns the train loss and train accuracy averaged over the batches of an epoch.

In [0]:
def train(model, iterator, optimizer, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.train()    
    for batch_compvis in iterator:        
        optimizer.zero_grad()                
        predictions = model(batch_compvis.Abstract).squeeze(1)        
        loss = criterion(predictions, batch_compvis.CompVis)        
        acc = binary_accuracy(predictions, batch_compvis.CompVis)        
        loss.backward()        
        optimizer.step()        
        epoch_loss += loss.item()
        epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The `evaluate` function is similar to the `train` function.

`model.eval()` puts the model in "evaluation mode".

Inside the `with torch.no_grad()` block no gradients are calculated, which uses less memory and speeds up computation.

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating. It returns the test loss and test accuracy for each epoch.

In [0]:
def evaluate(model, iterator, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.eval()    
    with torch.no_grad():    
        for batch_compvis in iterator:
            predictions = model(batch_compvis.Abstract).squeeze(1)            
            loss = criterion(predictions, batch_compvis.CompVis)            
            acc = binary_accuracy(predictions, batch_compvis.CompVis)
            epoch_loss += loss.item()
            epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The following function tells us how long each epoch takes, so that we can compare training times between models.

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Calculation of accuracy for CompVis Column
We set the number of epochs to 5 and train the model, evaluating it on the test data after each epoch.

For each epoch I report the `Train Loss`, `Train Accuracy`, `Test Loss` and `Test Accuracy`.

In [20]:
N_EPOCHS = 5
for epoch in range(N_EPOCHS):
    start_time = time.time()    
    train_loss_compvis, train_acc_compvis = train(model, train_iterator, optimizer, criterion)    
    end_time = time.time()
    epoch_mins_compvis, epoch_secs_compvis = epoch_time(start_time, end_time)
    test_loss_compvis, test_acc_compvis = evaluate(model, test_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_compvis}m {epoch_secs_compvis}s')
    print(f'\tTrain Loss: {train_loss_compvis:.3f} | Train Acc: {train_acc_compvis*100:.2f}%')
    print(f'\tTest Loss: {test_loss_compvis:.3f} | Test Acc: {test_acc_compvis*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 39s
	Train Loss: 0.185 | Train Acc: 95.81%
	Test Loss: 0.352 | Test Acc: 88.87%
Epoch: 02 | Epoch Time: 2m 38s
	Train Loss: 0.171 | Train Acc: 95.92%
	Test Loss: 0.363 | Test Acc: 88.85%
Epoch: 03 | Epoch Time: 2m 39s
	Train Loss: 0.171 | Train Acc: 95.92%
	Test Loss: 0.377 | Test Acc: 88.88%
Epoch: 04 | Epoch Time: 2m 38s
	Train Loss: 0.171 | Train Acc: 95.92%
	Test Loss: 0.384 | Test Acc: 88.89%
Epoch: 05 | Epoch Time: 2m 38s
	Train Loss: 0.171 | Train Acc: 95.92%
	Test Loss: 0.388 | Test Acc: 88.89%


### Metrics calculation for CompVis Column
After calculating the train and test accuracy, I calculated the following metrics for the trained model:

- __`Confusion Matrix`__ : A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

While there are many different types of classification algorithms, the evaluation of classification models all share similar principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:

 - `True Positive (TP)` - label is positive and prediction is also positive
 - `True Negative (TN)` - label is negative and prediction is also negative
 - `False Positive (FP)` - label is negative but prediction is positive
 - `False Negative (FN)` - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics.

- __`Precision`__ : Measures the fraction of predicted positives that are actually positive. Also called the `positive predictive value`. 

\begin{gather*}
    \text{Precision} = \frac{TP}{TP + FP}
\end{gather*}

- __`Recall`__ : Measures the fraction of actual positives that are correctly predicted. Also called `sensitivity`. 

\begin{gather*}
    \text{Recall} = \frac{TP}{TP + FN}
\end{gather*}

Both precision and recall are therefore based on an understanding and measure of relevance. 

- __`F1 Score`__ : The harmonic mean of precision and recall, which balances the two. Also called the `F-score` or `F-measure`. 

\begin{gather*}
    \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{gather*}

- __`Accuracy Score`__ : The ratio of the number of correct predictions to the total number of input samples.

\begin{gather*}
    \text{Accuracy Score} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}}
\end{gather*}

- __`Matthews correlation coefficient`__ : Used in machine learning as a measure of the quality of binary (two-class) classifications. It ranges from -1 to +1, where 0 corresponds to a prediction no better than random.

\begin{gather*}
    \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{gather*}


In [0]:
y_predict_compvis = []
y_test_compvis = []

with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_compvis += rounded_preds.tolist()
        y_test_compvis += batch.CompVis.tolist()

In [22]:
y_predict_compvis = np.asarray(y_predict_compvis)
y_test_compvis = np.asarray(y_test_compvis)
confusion_matrix_cv = confusion_matrix(y_test_compvis,y_predict_compvis)
recall=recall_score(y_test_compvis,y_predict_compvis,average='macro')
precision=precision_score(y_test_compvis,y_predict_compvis,average='macro')
f1score=f1_score(y_test_compvis,y_predict_compvis,average='macro')
accuracy=accuracy_score(y_test_compvis,y_predict_compvis)
matthews = matthews_corrcoef(y_test_compvis,y_predict_compvis)
print('Full confusion matrix for method on CompVis:\n',str(confusion_matrix_cv))
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Full confusion matrix for method on CompVis:
 [[17490    37]
 [ 2149     3]]
Accuracy: 0.8889171197723461
Macro Precision: 0.4827874382606039
Macro Recall: 0.4996415116730151
Macro F1 score:0.4719600138813004
MCC:-0.004968099218445334


In [23]:
print('True Negative value in confusion matrix for CompVis Column:',confusion_matrix_cv[0][0])
print('True Positive value in confusion matrix for CompVis Column:',confusion_matrix_cv[1][1])
print('False Positive value in confusion matrix for CompVis Column:',confusion_matrix_cv[0][1])
print('False Negative value in confusion matrix for CompVis Column:',confusion_matrix_cv[1][0])

True Negative value in confusion matrix for CompVis Column: 17490
True Positive value in confusion matrix for CompVis Column: 3
False Positive value in confusion matrix for CompVis Column: 37
False Negative value in confusion matrix for CompVis Column: 2149
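
As a cross-check, the formulas above can be applied by hand to the four confusion-matrix counts reported for CompVis. The sketch below uses only the standard library; the macro averages are taken over the positive and negative class, which is what `average='macro'` does for a binary problem.

```python
import math

# Counts taken from the CompVis confusion matrix reported above
tn, fp, fn, tp = 17490, 37, 2149, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Per-class precision and recall
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

# Macro average: unweighted mean over the two classes
macro_precision = (precision_pos + precision_neg) / 2
macro_recall = (recall_pos + recall_neg) / 2

mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f'Accuracy: {accuracy:.4f}')
print(f'Positive-class precision: {precision_pos:.4f}, recall: {recall_pos:.4f}')
print(f'Macro precision: {macro_precision:.4f}, macro recall: {macro_recall:.4f}')
print(f'MCC: {mcc:.4f}')
```

The per-class values show why accuracy alone is misleading here: only 3 of the 2152 positive examples are recovered, while the large negative class keeps the accuracy near 89%.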


### b. Neural Network Model for InfoTheory Column

In [0]:
## Applying NN model on InfoTheory column
'''
For train data we have to process `Label`. All fields in the data should pass in the same order of columns.
We won't be needing the few columns, so we pass in `None` for the field.
'''
datafields_infotheory = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", LABEL),
                 ("CompVis", None),
                 ('Math',None),
                 ("Abstract", TEXT)]

train_data_infotheory, test_data_infotheory = TabularDataset.splits(path='',
                                              train='axcs_train.csv', 
                                              test='axcs_test.csv', 
                                              format='csv',
                                              skip_header=True,
                                              fields=datafields_infotheory)

In [25]:
'''
Checking the length of both train data & test data.
'''
print(f'Number of training examples: {len(train_data_infotheory)}')
print(f'Number of testing examples: {len(test_data_infotheory)}')

Number of training examples: 54731
Number of testing examples: 19679


In [26]:
print(vars(train_data_infotheory.examples[0]))

{'InfoTheory': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


In [27]:
MAX_VOCAB_SIZE = 6000

TEXT.build_vocab(train_data_infotheory, max_size = MAX_VOCAB_SIZE)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 6002


Creating the iterators is the final step in preparing the data. We iterate over these in the training loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator`, a special type of iterator that returns batches in which the examples are of similar length. This minimizes the amount of padding needed while still producing freshly shuffled batches for each new epoch.

We also want to place the tensors returned by the iterator on the GPU when one is available. PyTorch handles this using `torch.device`, which we then pass to the iterator.

In [0]:
BATCH_SIZE = 16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_infotheory, test_data_infotheory), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [29]:
batch_infotheory = next(iter(train_iterator))
batch_infotheory


[torchtext.data.batch.Batch of size 16]
	[.InfoTheory]:[torch.cuda.FloatTensor of size 16 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 227x16 (GPU 0)]

We now create an instance of our RNN class. 

**Input dimension** - Dimension of the one-hot vectors, which is equal to the vocabulary size. 

**Embedding dimension** - Size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

**Hidden dimension** - Size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

**Output dimension** - Usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 350
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.

In [31]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 758,751 trainable parameters


### Train the model for InfoTheory 

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the model. Here, we'll use **_stochastic gradient descent_ (SGD)**. The first argument is the set of parameters that will be updated by the optimizer; the second is the learning rate, i.e. how much we change the parameters by when we do a parameter update.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)

In PyTorch the loss function is commonly called the `criterion`. Here I use `Binary Cross Entropy with logits`.

As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the **_sigmoid_** function (the inverse of the **_logit_** function). 

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.

In [0]:
criterion = nn.BCEWithLogitsLoss()

Using `.to`, we can place the model and the criterion on the GPU.

In [0]:
model = model.to(device)
criterion = criterion.to(device)

The `train` function iterates over all examples, one batch at a time. 

`model.train()` is used to put the model in "training mode".

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute that stores the gradient calculated from the criterion. PyTorch accumulates gradients rather than overwriting them, so they must be reset for every batch.

The loss and accuracy are then calculated using our predictions and the labels, `batch_infotheory.InfoTheory`, with the loss being averaged over all examples in the batch.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.

The loss and accuracy are accumulated over the epoch; the `.item()` method extracts a scalar from a tensor that holds a single value.

The function returns the train loss and train accuracy averaged over the batches of an epoch.

In [0]:
def train(model, iterator, optimizer, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.train()    
    for batch_infotheory in iterator:        
        optimizer.zero_grad()                
        predictions = model(batch_infotheory.Abstract).squeeze(1)        
        loss = criterion(predictions, batch_infotheory.InfoTheory)        
        acc = binary_accuracy(predictions, batch_infotheory.InfoTheory)        
        loss.backward()        
        optimizer.step()        
        epoch_loss += loss.item()
        epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The `evaluate` function is similar to the `train` function.

`model.eval()` puts the model in "evaluation mode".

Inside the `with torch.no_grad()` block no gradients are calculated, which uses less memory and speeds up computation.

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating. It returns the test loss and test accuracy for each epoch.

In [0]:
def evaluate(model, iterator, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.eval()    
    with torch.no_grad():    
        for batch_infotheory in iterator:
            predictions = model(batch_infotheory.Abstract).squeeze(1)            
            loss = criterion(predictions, batch_infotheory.InfoTheory)            
            acc = binary_accuracy(predictions, batch_infotheory.InfoTheory)
            epoch_loss += loss.item()
            epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Calculation of accuracy for InfoTheory Column
We set the number of epochs to 5 and train the model, evaluating it on the test set after every epoch.

For each epoch I report `Train Loss`, `Train Accuracy`, `Test Loss` & `Test Accuracy`.

In [37]:
N_EPOCHS = 5
for epoch in range(N_EPOCHS):
    start_time = time.time()    
    train_loss_infotheory, train_acc_infotheory = train(model, train_iterator, optimizer, criterion)
    end_time = time.time()
    epoch_mins_infotheory, epoch_secs_infotheory = epoch_time(start_time, end_time)
    test_loss_infotheory, test_acc_infotheory = evaluate(model, test_iterator, criterion)

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_infotheory}m {epoch_secs_infotheory}s')
    print(f'\tTrain Loss: {train_loss_infotheory:.3f} | Train Acc: {train_acc_infotheory*100:.2f}%')
    print(f'\tTest Loss: {test_loss_infotheory:.3f} | Test Acc: {test_acc_infotheory*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 38s
	Train Loss: 0.493 | Train Acc: 80.67%
	Test Loss: 0.498 | Test Acc: 81.00%
Epoch: 02 | Epoch Time: 2m 38s
	Train Loss: 0.490 | Train Acc: 80.71%
	Test Loss: 0.494 | Test Acc: 81.03%
Epoch: 03 | Epoch Time: 2m 38s
	Train Loss: 0.490 | Train Acc: 80.68%
	Test Loss: 0.491 | Test Acc: 81.10%
Epoch: 04 | Epoch Time: 2m 38s
	Train Loss: 0.490 | Train Acc: 80.70%
	Test Loss: 0.494 | Test Acc: 81.12%
Epoch: 05 | Epoch Time: 2m 37s
	Train Loss: 0.490 | Train Acc: 80.70%
	Test Loss: 0.491 | Test Acc: 81.18%


### Metrics calculation for InfoTheory Column
After calculating the train & test accuracy, I computed the following metrics for the trained model:

- __`Confusion Matrix`__ : A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

While there are many different types of classification algorithms, the evaluation of classification models all share similar principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:

 - `True Positive (TP)` - label is positive and prediction is also positive
 - `True Negative (TN)` - label is negative and prediction is also negative
 - `False Positive (FP)` - label is negative but prediction is positive
 - `False Negative (FN)` - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics.

- __`Precision`__ : The fraction of examples predicted as positive that are actually positive. Also called the `positive predictive value`.

\begin{gather*}
    \text{Precision} = \frac{TP}{TP + FP}
\end{gather*}

- __`Recall`__ : The fraction of actual positive examples that are predicted as positive. Also called `Sensitivity`.

\begin{gather*}
    \text{Recall} = \frac{TP}{TP + FN}
\end{gather*}

Both precision and recall are therefore based on an understanding and measure of relevance. 

- __`F1 Score`__ : The harmonic mean of Precision & Recall, which balances the two. Also called the `F-score or F-measure`.

\begin{gather*}
    \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{gather*}

- __`Accuracy Score`__ : The ratio of the number of correct predictions to the total number of input samples.

\begin{gather*}
    \text{Accuracy Score} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} = \frac{TP + TN}{TP + TN + FP + FN}
\end{gather*}

- __`Matthews correlation coefficient (MCC)`__ : A measure of the quality of binary (two-class) classifications that uses all four confusion-matrix counts. It ranges from -1 to +1, where 0 corresponds to random prediction.

\begin{gather*}
    \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{gather*}


In [0]:
y_predict_infotheory = []
y_test_infotheory = []

with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_infotheory += rounded_preds.tolist()
        y_test_infotheory += batch.InfoTheory.tolist()

In [39]:
y_predict_infotheory = np.asarray(y_predict_infotheory)
y_test_infotheory = np.asarray(y_test_infotheory)
confusion_matrix_it = confusion_matrix(y_test_infotheory,y_predict_infotheory)
recall=recall_score(y_test_infotheory,y_predict_infotheory,average='macro')
precision=precision_score(y_test_infotheory,y_predict_infotheory,average='macro')
f1score=f1_score(y_test_infotheory,y_predict_infotheory,average='macro')
accuracy=accuracy_score(y_test_infotheory,y_predict_infotheory)
matthews = matthews_corrcoef(y_test_infotheory,y_predict_infotheory)
print('Full confusion matrix for method on InfoTheory: \n'+ str(confusion_matrix_it))
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Full confusion matrix for method on InfoTheory: 
[[15962   101]
 [ 3603    13]]
Accuracy: 0.8117790538137101
Macro Precision: 0.4649398541075408
Macro Recall: 0.4986536953637751
Macro F1 score:0.4515036671762517
MCC:-0.013740689496781608


In [40]:
print('True Negative value in confusion matrix for InfoTheory Column:',confusion_matrix_it[0][0])
print('True Positive value in confusion matrix for InfoTheory Column:',confusion_matrix_it[1][1])
print('False Positive value in confusion matrix for InfoTheory Column:',confusion_matrix_it[0][1])
print('False Negative value in confusion matrix for InfoTheory Column:',confusion_matrix_it[1][0])

True Negative value in confusion matrix for InfoTheory Column: 15962
True Positive value in confusion matrix for InfoTheory Column: 13
False Positive value in confusion matrix for InfoTheory Column: 101
False Negative value in confusion matrix for InfoTheory Column: 3603
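
As a cross-check, the four counts printed above can be plugged directly into the formulas from this section. The sketch below is plain Python (no scikit-learn); the helper `safe_div` and the variable names are my own and not part of the original notebook. It shows that recall for the positive class is only about 0.4%, i.e. the model predicts "not InfoTheory" for almost every abstract.

```python
import math

# Counts taken from the InfoTheory confusion matrix printed above.
tn, fp, fn, tp = 15962, 101, 3603, 13

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Per-class precision and recall, treating each class in turn as "positive".
precision_pos = safe_div(tp, tp + fp)
recall_pos = safe_div(tp, tp + fn)
precision_neg = safe_div(tn, tn + fn)
recall_neg = safe_div(tn, tn + fp)

# Macro averaging is the unweighted mean over the two classes.
macro_precision = (precision_pos + precision_neg) / 2
macro_recall = (recall_pos + recall_neg) / 2

mcc = safe_div(tp * tn - fp * fn,
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

print(f"Accuracy:          {accuracy:.4f}")
print(f"Positive recall:   {recall_pos:.4f}")
print(f"Macro precision:   {macro_precision:.4f}")
print(f"Macro recall:      {macro_recall:.4f}")
print(f"MCC:               {mcc:.4f}")
```

The values agree with the scikit-learn output above, which confirms how the macro-averaged scores are derived from the per-class ones.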


### c. Neural Network Model for Math Column

In [0]:
## Applying NN model on Math column
'''
The `Math` column is the label. Fields must be listed in the same order as the columns in the CSV file.
Columns we do not need are given `None` as their field, so they are skipped.
'''
datafields_math = [("ID", None), 
                 ("URL", None),
                 ("Date", None),
                 ("Title", None),
                 ("InfoTheory", None),
                 ("CompVis", None),
                 ('Math',LABEL),
                 ("Abstract", TEXT)]

train_data_math, test_data_math = TabularDataset.splits(path='',
                                              train='axcs_train.csv', 
                                              test='axcs_test.csv', 
                                              format='csv',
                                              skip_header=True,
                                              fields=datafields_math)

In [42]:
'''
Checking the length of both train data & test data.
'''
print(f'Number of training examples: {len(train_data_math)}')
print(f'Number of testing examples: {len(test_data_math)}')

Number of training examples: 54731
Number of testing examples: 19679


In [43]:
print(vars(train_data_math.examples[0]))

{'Math': 0, 'Abstract': [' ', 'nested', 'satisfiability', 'a', 'special', 'case', 'of', 'the', 'satisfiability', 'problem', ',', 'in', 'which', 'the', 'clauses', 'have', 'a', 'hierarchical', 'structure', ',', 'is', 'shown', 'to', 'be', 'solvable', 'in', 'linear', 'time', ',', 'assuming', 'that', 'the', 'clauses', 'have', 'been', 'represented', 'in', 'a', 'convenient', 'way', '.']}


In [44]:
MAX_VOCAB_SIZE = 6000

TEXT.build_vocab(train_data_math, max_size = MAX_VOCAB_SIZE)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")

Unique tokens in TEXT vocabulary: 6002


Creating the iterators is the final step in preparing the data. We iterate over these in the training loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator`, a special type of iterator that returns batches in which the examples have similar lengths. This minimizes the amount of padding needed while still producing freshly shuffled batches for each new epoch.
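
To illustrate why grouping by length reduces padding, here is a small plain-Python sketch. It is a simplification: it fully sorts the lengths, whereas `BucketIterator` groups examples of similar length and still shuffles. The function `total_padding` and the randomly generated lengths are hypothetical and only serve the illustration.

```python
import random

def total_padding(lengths, batch_size):
    """Padding tokens needed when each batch is padded to its longest example."""
    padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padding += sum(max(batch) - n for n in batch)
    return padding

random.seed(0)
# Hypothetical abstract lengths in tokens (not taken from the dataset).
lengths = [random.randint(20, 300) for _ in range(1024)]

unsorted_padding = total_padding(lengths, batch_size=16)
bucketed_padding = total_padding(sorted(lengths), batch_size=16)

print(f"Padding with arbitrary batches:      {unsorted_padding}")
print(f"Padding with length-sorted batches:  {bucketed_padding}")
```

Less padding means fewer wasted time steps for the RNN in every batch.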

We also want to place the tensors returned by the iterator on the GPU. PyTorch handles this using `torch.device`, which we then pass to the iterator.

In [0]:
BATCH_SIZE = 16

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data_math, test_data_math), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.Abstract),
    sort_within_batch = False)

In [46]:
batch_math = next(iter(train_iterator))
batch_math


[torchtext.data.batch.Batch of size 16]
	[.Math]:[torch.cuda.FloatTensor of size 16 (GPU 0)]
	[.Abstract]:[torch.cuda.LongTensor of size 227x16 (GPU 0)]

We now create an instance of our RNN class. 

**Input dimension** - Dimension of the one-hot vectors, which is equal to the vocabulary size. 

**Embedding dimension** - Size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

**Hidden dimension** - Size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

**Output dimension** - Usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 350
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.

In [48]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 758,751 trainable parameters
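
The figure of 758,751 can be reproduced by hand. The sketch below assumes the `RNN` class (defined outside this section) consists of an embedding layer, a single-layer `nn.RNN` and a final linear layer; the function name is my own. Under that assumption the arithmetic matches the printed count exactly.

```python
def count_rnn_classifier_parameters(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Trainable parameters of embedding -> single-layer RNN -> linear (assumed layout)."""
    embedding = vocab_size * embedding_dim
    rnn = (embedding_dim * hidden_dim   # input-to-hidden weights
           + hidden_dim * hidden_dim    # hidden-to-hidden weights
           + 2 * hidden_dim)            # two bias vectors (input and hidden)
    linear = hidden_dim * output_dim + output_dim  # weights plus bias
    return embedding + rnn + linear

total = count_rnn_classifier_parameters(
    vocab_size=6002, embedding_dim=100, hidden_dim=350, output_dim=1)
print(f"{total:,}")
```

Roughly 600,000 of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.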


### Train the model for Math 

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the model. Here, we'll use **_stochastic gradient descent_ (SGD)**. The first argument is the set of parameters that the optimizer will update; the second is the learning rate, i.e. how much we change the parameters by in each update.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)

The loss function, called `criterion` in PyTorch, is `Binary Cross Entropy with logits`.

As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the **_sigmoid_** function, which maps the model's raw output (the _logit_) to a probability.

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.

In [0]:
criterion = nn.BCEWithLogitsLoss()
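
The equivalence between "sigmoid followed by binary cross entropy" and a single fused loss can be checked numerically. The following is a plain-Python illustration of the math for a single example, not PyTorch's implementation; the function names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_from_probability(p, target):
    """Binary cross entropy for one example, given a probability p in (0, 1)."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Same loss computed directly from the logit in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    two_step = bce_from_probability(sigmoid(logit), target)
    fused = bce_with_logits(logit, target)
    print(f"logit={logit:+.1f} target={target:.0f} two-step={two_step:.6f} fused={fused:.6f}")
```

Working from the logit avoids taking the logarithm of a probability that has been rounded to 0 or 1, which is why a fused loss is generally preferred over applying the sigmoid separately.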

Using `.to`, we can place the model and the criterion on the GPU

In [0]:
model = model.to(device)
criterion = criterion.to(device)

`train` function iterates over all examples, one batch at a time. 

`model.train()` is used to put the model in "training mode".

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute that stores the gradient of the loss (computed by the criterion) with respect to that parameter. PyTorch accumulates gradients across backward passes, so they must be reset before each batch.

The loss and accuracy are then calculated from our predictions and the labels, `batch_math.Math`, with the loss averaged over all examples in the batch.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using those gradients and the optimizer algorithm with `optimizer.step()`.

Loss and accuracy are accumulated over the epoch. The `.item()` method extracts a Python scalar from a tensor that contains a single value.

The function returns the train loss and train accuracy for one epoch, averaged over the batches.

In [0]:
def train(model, iterator, optimizer, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.train()    
    for batch_math in iterator:        
        optimizer.zero_grad()                
        predictions = model(batch_math.Abstract).squeeze(1)        
        loss = criterion(predictions, batch_math.Math)        
        acc = binary_accuracy(predictions, batch_math.Math)        
        loss.backward()        
        optimizer.step()        
        epoch_loss += loss.item()
        epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Function `evaluate` is similar to the function `train`.

`model.eval()` puts the model in "evaluation mode".

`with torch.no_grad()` disables gradient tracking inside the block, which uses less memory and speeds up computation.

The rest of the function is the same as `train`, with `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()` removed, as we do not update the model's parameters when evaluating. The function returns the test loss and test accuracy for one epoch, averaged over the batches.

In [0]:
def evaluate(model, iterator, criterion):    
    epoch_loss = 0
    epoch_acc = 0    
    model.eval()    
    with torch.no_grad():    
        for batch_math in iterator:
            predictions = model(batch_math.Abstract).squeeze(1)            
            loss = criterion(predictions, batch_math.Math)            
            acc = binary_accuracy(predictions, batch_math.Math)
            epoch_loss += loss.item()
            epoch_acc += acc.item()        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Calculation of accuracy for Math Column
We set the number of epochs to 5 and train the model, evaluating it on the test set after every epoch.

For each epoch I report `Train Loss`, `Train Accuracy`, `Test Loss` & `Test Accuracy`.

In [54]:
N_EPOCHS = 5
for epoch in range(N_EPOCHS):
    start_time = time.time()    
    train_loss_math, train_acc_math = train(model, train_iterator, optimizer, criterion)    
    end_time = time.time()
    epoch_mins_math, epoch_secs_math = epoch_time(start_time, end_time)
    test_loss_math, test_acc_math = evaluate(model, test_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins_math}m {epoch_secs_math}s')
    print(f'\tTrain Loss: {train_loss_math:.3f} | Train Acc: {train_acc_math*100:.2f}%')
    print(f'\tTest Loss: {test_loss_math:.3f} | Test Acc: {test_acc_math*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 39s
	Train Loss: 0.617 | Train Acc: 69.16%
	Test Loss: 0.618 | Test Acc: 69.42%
Epoch: 02 | Epoch Time: 2m 38s
	Train Loss: 0.616 | Train Acc: 69.40%
	Test Loss: 0.617 | Test Acc: 69.45%
Epoch: 03 | Epoch Time: 2m 38s
	Train Loss: 0.616 | Train Acc: 69.38%
	Test Loss: 0.616 | Test Acc: 69.48%
Epoch: 04 | Epoch Time: 2m 38s
	Train Loss: 0.616 | Train Acc: 69.39%
	Test Loss: 0.617 | Test Acc: 69.50%
Epoch: 05 | Epoch Time: 2m 38s
	Train Loss: 0.616 | Train Acc: 69.41%
	Test Loss: 0.616 | Test Acc: 69.53%


### Metrics calculation for Math Column
After calculating the train & test accuracy, I computed the following metrics for the trained model:

- __`Confusion Matrix`__ : A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

While there are many different types of classification algorithms, the evaluation of classification models all share similar principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:

 - `True Positive (TP)` - label is positive and prediction is also positive
 - `True Negative (TN)` - label is negative and prediction is also negative
 - `False Positive (FP)` - label is negative but prediction is positive
 - `False Negative (FN)` - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics.

- __`Precision`__ : The fraction of examples predicted as positive that are actually positive. Also called the `positive predictive value`.

\begin{gather*}
    \text{Precision} = \frac{TP}{TP + FP}
\end{gather*}

- __`Recall`__ : The fraction of actual positive examples that are predicted as positive. Also called `Sensitivity`.

\begin{gather*}
    \text{Recall} = \frac{TP}{TP + FN}
\end{gather*}

Both precision and recall are therefore based on an understanding and measure of relevance. 

- __`F1 Score`__ : The harmonic mean of Precision & Recall, which balances the two. Also called the `F-score or F-measure`.

\begin{gather*}
    \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{gather*}

- __`Accuracy Score`__ : The ratio of the number of correct predictions to the total number of input samples.

\begin{gather*}
    \text{Accuracy Score} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} = \frac{TP + TN}{TP + TN + FP + FN}
\end{gather*}

- __`Matthews correlation coefficient (MCC)`__ : A measure of the quality of binary (two-class) classifications that uses all four confusion-matrix counts. It ranges from -1 to +1, where 0 corresponds to random prediction.

\begin{gather*}
    \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{gather*}


In [0]:
y_predict_math = []
y_test_math = []

with torch.no_grad():
    for batch in test_iterator:
        predictions = model(batch.Abstract).squeeze(1)
        rounded_preds = torch.round(torch.sigmoid(predictions))
        y_predict_math += rounded_preds.tolist()
        y_test_math += batch.Math.tolist()

In [56]:
y_predict_math = np.asarray(y_predict_math)
y_test_math = np.asarray(y_test_math)
confusion_matrix_math = confusion_matrix(y_test_math,y_predict_math)
recall=recall_score(y_test_math,y_predict_math,average='macro')
precision=precision_score(y_test_math,y_predict_math,average='macro')
f1score=f1_score(y_test_math,y_predict_math,average='macro')
accuracy=accuracy_score(y_test_math,y_predict_math)
matthews = matthews_corrcoef(y_test_math,y_predict_math)
print('Full confusion matrix for method on Math:\n'+ str(confusion_matrix_math))
print('Accuracy: '+ str(accuracy))
print('Macro Precision: '+ str(precision))
print('Macro Recall: '+ str(recall))
print('Macro F1 score:'+ str(f1score))
print('MCC:'+ str(matthews))

Full confusion matrix for method on Math:
[[13658    91]
 [ 5906    24]]
Accuracy: 0.6952589054321866
Macro Precision: 0.4534073231223276
Macro Recall: 0.4987142771812195
Macro F1 score:0.41396031728237603
MCC:-0.015479698685657956


In [57]:
print('True Negative value in confusion matrix for Math Column:',confusion_matrix_math[0][0])
print('True Positive value in confusion matrix for Math Column:',confusion_matrix_math[1][1])
print('False Positive value in confusion matrix for Math Column:',confusion_matrix_math[0][1])
print('False Negative value in confusion matrix for Math Column:',confusion_matrix_math[1][0])

True Negative value in confusion matrix for Math Column: 13658
True Positive value in confusion matrix for Math Column: 24
False Positive value in confusion matrix for Math Column: 91
False Negative value in confusion matrix for Math Column: 5906
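
One way to put the accuracy above in context is to compare it with a baseline that always predicts the more frequent class. The sketch below uses only the four counts printed above; the variable names are my own.

```python
# Counts taken from the Math confusion matrix printed above.
tn, fp, fn, tp = 13658, 91, 5906, 24

total = tn + fp + fn + tp
model_accuracy = (tp + tn) / total

# A baseline that always predicts the more frequent class.
negatives = tn + fp   # abstracts whose true label is 0
positives = tp + fn   # abstracts whose true label is 1
baseline_accuracy = max(negatives, positives) / total

print(f"Model accuracy:          {model_accuracy:.4f}")
print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

The baseline is slightly higher than the model's accuracy, which is consistent with the near-zero MCC: the accuracy mostly reflects the class imbalance rather than learned structure.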


### Part 1a: Neural Network Method with Pre-Processing

### i) Adding Title to the Abstract column:

* After reading the given dataset into a pandas dataframe, I first checked for null values. I found `1 Null value` in the `test data`, which is replaced before pre-processing.

* The main job is to build three text classifiers, one per class, based on the `Abstract Column`. On inspection, each entry in the abstract column starts with the `title`, followed by the abstract content.

* We need to predict three classes: `CompVis`, `InfoTheory` and `Math`. Much of the information about the subject is carried by the `title`. Hence adding the `title` multiple times to the abstract column gives those words more weight, which should make the classes easier to predict. The transformed data is then passed to our Neural Network model.

In [0]:
train_data = pd.read_csv('axcs_train.csv')
test_data = pd.read_csv('axcs_test.csv')

In [0]:
# Adding Title multiple times to the abstract column to add more weight
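# A space separator keeps the last word of one title from fusing with the first word of the next
train_data['Abstract'] = train_data['Title'] + ' ' + train_data['Title'] + ' ' + train_data['Abstract']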

The resulting dataframe is written to a new CSV file using `to_csv` in pandas, and that file is then loaded as tensor data for building the Neural Network model. This is one way of applying pre-processing techniques to our data.

### ii) Extracting Nouns & Adjectives from  Abstract column:

Here I apply basic pre-processing steps to the text before passing it to tensor data for building the Neural Network model.

Steps are:

**1. Segmentation**

**2. Tokenization**

**3. Removal of stop words**

**4. POS Tagging**

* The motive for extracting only `Nouns` & `Adjectives` is that the classes we are going to predict, `CompVis`, `InfoTheory` & `Math`, are all names of subjects, so the words that identify them are mostly nouns or adjectives. Other parts of speech such as `Verb`, `Adverb` and `Preposition` are unlikely to help our predictions.

First I put all the abstract contents into a list and then started the pre-processing steps.

In [0]:
train_abstract_list = train_data['Abstract'].tolist()
train_abstract_list[0:2]

Required libraries for pre-processing data

In [0]:
import re
from functools import reduce  # used in `tokenization` to flatten the token lists
import pandas as pd
import nltk.data
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

#### 1. Segmentation

In [0]:
def segmentation(sentence):
    sent_dect = nltk.data.load('tokenizers/punkt/english.pickle')
    segment_token = sent_dect.tokenize(sentence)
    segmented_list = []
    for sent in segment_token: # for loop to iterate the `segment_token` sentence wise
        segmented_list.append(sent.lower()) # converting the sentence to lower case
    return segmented_list

#### 2. Tokenization

In [0]:
def tokenization(sentencee):
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?") # re for RegexpTokenizer
    token_list = []
    for sent in sentencee: # for loop to iterate over the segmented sentences
        tokens = tokenizer.tokenize(sent) # tokenizing the sentence
        token_list.append(tokens)
    # Flatten once after the loop; the [] initial value also handles empty input
    return reduce(lambda x,y: x+y, token_list, [])

In [0]:
stopwords_list = stopwords.words('english')

#### 3. Removal of Stopwords & POS Tagging

In [0]:
noun_adj = []
for abstract in train_abstract_list:
    sentences = segmentation(abstract)
    tokens = tokenization(sentences)
    nostop_words = [token for token in tokens if token.lower() not in stopwords_list] # Removal of stop words
    pos_tag_words = nltk.pos_tag(nostop_words) # POS Tagging
    # Keep only words tagged NN (singular noun) or JJ (adjective)
    final_word = [word for word, tag in pos_tag_words if tag in ('NN', 'JJ')]
    noun_adj.append(' '.join(final_word))
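
Note that the loop above keeps only tokens tagged exactly `NN` or `JJ`. In the Penn Treebank tag set used by `nltk.pos_tag`, plural nouns (`NNS`), proper nouns (`NNP`, `NNPS`) and comparative or superlative adjectives (`JJR`, `JJS`) carry different tags and are therefore dropped. The sketch below contrasts the exact match with a prefix match on hand-tagged example pairs; the function name and the example sentence are hypothetical, and whether the broader filter helps on this dataset has not been tested here.

```python
# Hand-tagged (word, tag) pairs in the format returned by nltk.pos_tag;
# the words and tags are illustrative only.
tagged = [("deep", "JJ"), ("networks", "NNS"), ("learn", "VBP"),
          ("better", "JJR"), ("features", "NNS"), ("quickly", "RB"),
          ("image", "NN")]

def keep_nouns_and_adjectives(tagged_words, exact=True):
    """Filter (word, tag) pairs down to nouns and adjectives."""
    if exact:
        # Same rule as the notebook: only the tags NN and JJ.
        return [word for word, tag in tagged_words if tag in ("NN", "JJ")]
    # Broader rule: any tag that starts with NN or JJ.
    return [word for word, tag in tagged_words if tag.startswith(("NN", "JJ"))]

print(keep_nouns_and_adjectives(tagged, exact=True))
print(keep_nouns_and_adjectives(tagged, exact=False))
```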

In [0]:
train_data['noun_adj'] = noun_adj # Adding the obtained nouns & adjectives to the dataframe
train_data = train_data.drop('Abstract', axis=1) # Dropping the abstract column 
train_data.to_csv('train_data_noun_adj.csv',index = False) # Writing it to a new CSV file

After obtaining the output, it is passed as input to the neural network model to check the accuracy.

After applying my pre-processing techniques, I see a slight increase in my accuracy values. Of the three columns, the `F1-Score` is highest for the **CompVis Column.**

<img src = "After_PreProcessing1.JPG"/>

### Part 1b: Machine Learning Method<a id='ML' ></a>

### Import Libraries

1. __re__: This library provides regular expression pattern matching operations used to extract the contents of the data. 
2. __pandas__ : This library was used to read the contents & convert the contents of the extracted data into CSV format.
3. __numpy__ : This library is used for operating on arrays.
4. __nltk__: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming.
5. __sklearn.metrics__ : This library includes score functions, performance metrics and pairwise metrics and distance computations.
6. __sklearn.feature_extraction.text__ : This library converts a collection of raw documents to a matrix of TF-IDF features.
7. __sklearn.linear_model__ : This library implements regularized logistic regression
8. __sklearn.naive_bayes__ : This library implements MultinomialNB, GaussianNB, BernoulliNB
9. __sklearn.svm__ : This library implements Linear SVC, SVC
10. __sklearn.ensemble__ : This library implements RandomForest Classifier
11. __sklearn.model_selection__ : This library evaluates a score by cross-validation

In [1]:
# Load required Libraries
from nltk.corpus import stopwords
from nltk import word_tokenize    
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB,BernoulliNB
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import text 
import pandas as pd
import numpy as np
import re

In [2]:
# Loading both the train & test datasets into dataframes
train_data = pd.read_csv('axcs_train.csv')
test_data = pd.read_csv('axcs_test.csv')

In [3]:
# Viewing first few contents of train data
train_data.head()

Unnamed: 0,ID,URL,Date,Title,InfoTheory,CompVis,Math,Abstract
0,cs-9301111,arxiv.org/abs/cs/9301111,1989-12-31,Nested satisfiability,0,0,0,Nested satisfiability A special case of the s...
1,cs-9301112,arxiv.org/abs/cs/9301112,1990-03-31,A note on digitized angles,0,0,0,A note on digitized angles We study the confi...
2,cs-9301113,arxiv.org/abs/cs/9301113,1991-07-31,Textbook examples of recursion,0,0,0,Textbook examples of recursion We discuss pro...
3,cs-9301114,arxiv.org/abs/cs/9301114,1991-10-31,Theory and practice,0,0,0,Theory and practice The author argues to Sili...
4,cs-9301115,arxiv.org/abs/cs/9301115,1991-11-30,Context-free multilanguages,0,0,0,Context-free multilanguages This article is a...


In [4]:
# Viewing first few contents of test data
test_data.head()

Unnamed: 0,ID,URL,Date,Title,InfoTheory,CompVis,Math,Abstract
0,no-150100335,arxiv.org/abs/1501.00335,01-01-2015,A Data Transparency Framework for Mobile Appli...,0,0,0,A Data Transparency Framework for Mobile Appl...
1,no-14024178,arxiv.org/abs/1402.4178,01-01-2015,A reclaimer scheduling problem arising in coal...,0,0,0,A reclaimer scheduling problem arising in coa...
2,no-150100263,arxiv.org/abs/1501.00263,01-01-2015,Communication-Efficient Distributed Optimizati...,0,0,1,Communication-Efficient Distributed Optimizat...
3,no-150100287,arxiv.org/abs/1501.00287,01-01-2015,Consistent Classification Algorithms for Multi...,0,0,0,Consistent Classification Algorithms for Mult...
4,no-11070586,arxiv.org/abs/1107.0586,01-01-2015,Managing key multicasting through orthogonal s...,0,0,0,Managing key multicasting through orthogonal ...


Storing the abstract contents and each of the three class columns in separate lists using the `tolist()` function

In [5]:
# Converting the dataframe contents to a list in train data
abstract_train = train_data.Abstract.tolist()
cv_train = train_data.CompVis.tolist()
it_train = train_data.InfoTheory.tolist()
math_train = train_data.Math.tolist()

In [6]:
# Converting the dataframe contents to a list in test data
abstract_test = test_data.Abstract.tolist()
cv_test = test_data.CompVis.tolist()
it_test = test_data.InfoTheory.tolist()
math_test = test_data.Math.tolist()

In [7]:
'''
Checking the length of both train data & test data in abstract column.
'''
print(f'Number of training samples in abstract column: {len(abstract_train)}')
print(f'Number of testing samples in abstract column: {len(abstract_test)}')

Number of training samples in abstract column: 54731
Number of testing samples in abstract column: 19679


Before building Machine Learning Models on our data, I performed the following pre-processing steps:

**1. Tokenization** - Process of breaking a document down into words, punctuation marks, numeric digits etc.

**2. Conversion to lower case** - Converting all the words into lower case.

**3. Removal of stopwords** - Stop words usually carry little lexical content. They are often function words in English, for example articles, pronouns and particles. In NLP and IR, we usually exclude stop words from the vocabulary to keep its dimensionality down. There are some exceptions, such as syntactic analysis (e.g. parsing), where function words are kept. Here, however, we remove all the stop words using a stop word list.

**4. Unigram & Bigram generation** - Both unigrams & bigrams are extracted from the tokens and included as features.
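
Step 4 can be sketched in a few lines of plain Python. The helper `word_ngrams` is hypothetical and only illustrates what a setting such as `ngram_range=(1,2)` means: every single token plus every pair of adjacent tokens becomes a feature.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of the token list for n between min_n and max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["nested", "satisfiability", "problem"]
print(word_ngrams(tokens))
# ['nested', 'satisfiability', 'problem', 'nested satisfiability', 'satisfiability problem']
```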


Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it performs a morphological analysis of the words and returns a valid dictionary form (the lemma), which is why lemmatization is usually preferred.

Here I create a class `LemmaTokenizer`, which is used as the tokenizer when vectorizing the text. It applies `WordNetLemmatizer`, which returns the input word unchanged if the word is not found in WordNet.

In [8]:
# Class for tokenizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

`Tf-idf` stands for `term frequency-inverse document frequency`. The tf-idf weight is often used in information retrieval and text mining: it increases with the number of times a term appears in a document and decreases with the number of documents that contain the term. Because very common words receive low weights, tf-idf also helps to down-weight stop words in tasks such as text summarization and classification.
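
As a rough illustration of the weighting, the sketch below computes tf-idf by hand on a three-document toy corpus. It uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the default scheme documented for scikit-learn's `TfidfVectorizer`. The corpus and the function names are made up for the example.

```python
import math
from collections import Counter

def smooth_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def tfidf(doc, idf):
    """L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Toy corpus: "model" occurs in every document, the other words in one each.
docs = [["image", "segmentation", "model"],
        ["channel", "capacity", "model"],
        ["convex", "optimization", "model"]]

idf = smooth_idf(docs)
weights = tfidf(docs[0], idf)
for term, weight in sorted(weights.items()):
    print(f"{term:<14}{weight:.3f}")
```

The word that appears in every document ("model") ends up with the lowest weight, while the words specific to one document are weighted higher.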

In [9]:
'''
TfidfVectorizer transforms text into feature vectors. The text is converted to lowercase,
stop words (scikit-learn's built-in English list) are removed, and both unigrams & bigrams
are extracted from the tokens produced by LemmaTokenizer.
Note: because a custom `tokenizer` is supplied, scikit-learn ignores `token_pattern`,
which is why punctuation tokens appear in the vocabulary listed below.
'''
stopwords_list = text.ENGLISH_STOP_WORDS
vectorizer = TfidfVectorizer(analyzer='word',input='content',
                           lowercase=True,
                           token_pattern=r"[A-Za-z]\w+(?:[-'?]\w+)?", stop_words = set(stopwords_list),
                           min_df=3,
                           ngram_range=(1,2),
                           tokenizer=LemmaTokenizer())

### a. Machine Learning Model for CompVis Column

In [10]:
# Use `TfidfVectorizer` to convert the training abstracts into tf-idf vectors; the labels come from the CompVis column
x_train_cv = vectorizer.fit_transform(abstract_train)
y_train_cv = np.asarray(cv_train)



In [11]:
# Number of features (unigrams & bigrams) in the fitted vocabulary
len(vectorizer.get_feature_names())

286125

In [12]:
vectorizer.get_feature_names()

['!',
 '! )',
 '! ,',
 '! -',
 '! :',
 '! =',
 '! =np',
 '! allows',
 '! paper',
 '! possible',
 '! }',
 '#',
 '# ,',
 '# .',
 '# 1',
 '# 64257',
 '# 8211',
 '# 8217',
 '# bi',
 '# csp',
 '# csps',
 '# p',
 '# p-complete',
 '# p-completeness',
 '# p-hard',
 '# p.',
 '# rhpi_1',
 '# sat',
 '$',
 '$ $',
 '$ (',
 '$ ,',
 '$ .',
 '$ \\varphi\\in\\alpha',
 '$ {',
 '%',
 '% (',
 '% )',
 '% +/-',
 '% ,',
 '% -',
 '% -30',
 '% .',
 '% 1',
 '% 10',
 '% 100',
 '% 19',
 '% 20',
 '% 200',
 '% 21',
 '% 25',
 '% 3',
 '% 30',
 '% 40',
 '% 45',
 '% 50',
 '% 6',
 '% 60',
 '% 70',
 '% 71',
 '% 84',
 '% 9',
 '% 90',
 '% 99',
 '% ;',
 '% accuracy',
 '% accurate',
 '% achieved',
 '% applying',
 '% area',
 '% average',
 '% best',
 '% better',
 '% business',
 '% capacity',
 '% case',
 '% character',
 '% citation',
 '% classification',
 '% compared',
 '% comparison',
 '% computational',
 '% confidence',
 '% correct',
 '% corresponding',
 '% coverage',
 '% current',
 '% data',
 '% decrease',
 '% depending',
 '

Four models are trained for each class, and 10-fold cross validation is used to find which model is better.

The following models are developed:

**1. Logistic Regression** - A supervised classification algorithm that models the probability of the target class as a logistic function of a linear combination of the features.

**2. Bernoulli NaiveBayes** - A Naive Bayes classifier that assumes all features are binary, i.e. they take only two values (e.g. a nominal categorical feature that has been one-hot encoded). scikit-learn's `BernoulliNB` binarizes the tf-idf weights at a threshold of 0 by default.

**3. Linear Support Vector Classifier** - Fits a "best fit" hyperplane that divides the data into the two classes with the largest possible margin.

**4. RandomForest Classifier** - Builds a set of decision trees, each from a randomly selected subset of the training set, and then aggregates the votes of the trees to decide the final class of a test object.

In [13]:
# Performing 10-fold cross validation on all 4 models for the CompVis column
models = [
    LogisticRegression(),
    BernoulliNB(),
    LinearSVC(),
    RandomForestClassifier()
]
CV = 10
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, x_train_cv, y_train_cv, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
# One row per (model, fold) holding the accuracy on that fold
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
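
The per-fold accuracies collected in `entries` still need to be summarised before the models can be compared. The following is a minimal sketch in plain Python of one way to do that; the accuracy values are hypothetical placeholders, not results from this notebook, and the variable names `by_model`, `summary` and `best_model` are illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (model_name, fold_idx, accuracy) tuples in the same shape as `entries`.
entries = [
    ("LinearSVC", 0, 0.96), ("LinearSVC", 1, 0.97), ("LinearSVC", 2, 0.95),
    ("BernoulliNB", 0, 0.92), ("BernoulliNB", 1, 0.93), ("BernoulliNB", 2, 0.91),
]

# Group the fold accuracies by model.
by_model = defaultdict(list)
for name, _fold, acc in entries:
    by_model[name].append(acc)

# Mean and standard deviation of the accuracy across folds, per model.
summary = {name: (mean(accs), stdev(accs)) for name, accs in by_model.items()}
best_model = max(summary, key=lambda name: summary[name][0])

for name, (avg, sd) in summary.items():
    print(f"{name}: mean accuracy {avg:.3f} (std {sd:.3f})")
print("Best by mean accuracy:", best_model)
```

With the `cv_df` data frame, the same per-model mean can be obtained with `cv_df.groupby('model_name')['accuracy'].mean()`.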



In [14]:
# Using `TfidfVectorizer` converting the test dataset of Compvis column into vector form
x_test_cv = vectorizer.transform(abstract_test)
y_test_cv = np.asarray(cv_test)

**Metrics Calculation for all the models developed in Compvis Column**

**1. Accuracy** - The ratio of the number of correct predictions to the total number of input samples.

**2. Precision** - The fraction of the samples predicted as a class that truly belong to that class. Also called `positive predictive value`.

**3. F1 Score** - The harmonic mean of Precision & Recall. Also called `F-score` or `F-measure`.

**4. Recall** - The fraction of the samples that truly belong to a class that are predicted as that class. Also called `sensitivity`.

**5. Matthews Correlation Coefficient** - A measure of the quality of binary (two-class) classifications that uses all four cells of the confusion matrix and ranges from -1 to +1.
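
The metrics above can be reproduced by hand from a 2x2 confusion matrix. The sketch below is an illustration in plain Python, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). The helper name `binary_metrics` is hypothetical, and the matrix used is the one reported for `LinearSVC` on `CompVis` in the output further down.

```python
import math

def binary_metrics(cm):
    """Compute accuracy, macro precision/recall/F1 and MCC from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Per-class precision and recall; the macro average is their unweighted mean.
    prec_pos, prec_neg = tp / (tp + fp), tn / (tn + fn)
    rec_pos, rec_neg = tp / (tp + fn), tn / (tn + fp)
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "macro_precision": (prec_pos + prec_neg) / 2,
        "macro_recall": (rec_pos + rec_neg) / 2,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "mcc": mcc,
    }

# Confusion matrix reported for LinearSVC on the CompVis column.
metrics = binary_metrics([[17437, 90], [471, 1681]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The values agree with the figures printed for `LinearSVC` on `CompVis`, which confirms that the reported precision, recall and F1 score are unweighted averages over the positive and negative class.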

In [16]:
# Calculating the above mentioned metrics for Compvis Column
for compvis in models:
    model_name = compvis.__class__.__name__
    compvis.fit(x_train_cv, y_train_cv)
    print(model_name)
    # Do the prediction
    y_predict_cv=compvis.predict(x_test_cv)
    confusion_matrix_cv = confusion_matrix(y_test_cv,y_predict_cv)
    recall=recall_score(y_test_cv,y_predict_cv,average='macro')
    precision=precision_score(y_test_cv,y_predict_cv,average='macro')
    f1score=f1_score(y_test_cv,y_predict_cv,average='macro')
    accuracy=accuracy_score(y_test_cv,y_predict_cv)
    matthews = matthews_corrcoef(y_test_cv,y_predict_cv)
    print('Full confusion matrix for method on CompVis: \n'+ str(confusion_matrix_cv))
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
Full confusion matrix for method on CompVis: 
[[17462    65]
 [  833  1319]]
Accuracy: 0.9543675999796738
Macro Precision: 0.9537515580396425
Macro Recall: 0.8046048258417231
Macro F1 score:0.8604861651286868
MCC:0.7435453296526738
BernoulliNB
Full confusion matrix for method on CompVis: 
[[17477    50]
 [ 1299   853]]
Accuracy: 0.9314497687890645
Macro Precision: 0.9377224748164642
Macro Recall: 0.6967613615997241
Macro F1 score:0.7606346709160439
MCC:0.5869475961197504
LinearSVC
Full confusion matrix for method on CompVis: 
[[17437    90]
 [  471  1681]]
Accuracy: 0.9714924538848518
Macro Precision: 0.9614400795230835
Macro Recall: 0.8879994471620312
Macro F1 score:0.9205826956552985
MCC:0.8462588156193356
RandomForestClassifier
Full confusion matrix for method on CompVis: 
[[17496    31]
 [ 1528   624]]
Accuracy: 0.9207784948422176
Macro Precision: 0.9361760797128897
Macro Recall: 0.6440970627791895
Macro F1 score:0.7009750234839818
MCC:0.5014047943176014


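From the above metrics, `Linear SVC` is the best choice for the `CompVis` column: it has the highest accuracy (0.971), macro precision (0.961), macro recall (0.888), macro F1 score (0.921) and MCC (0.846) of the four models.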

### b. Machine Learning Model for InfoTheory Column

In [17]:
# Using `TfidfVectorizer` converting the train dataset of InfoTheory column into vector form
x_train_it = vectorizer.fit_transform(abstract_train)
y_train_it = np.asarray(it_train)

In [18]:
# Number of features (unigrams and bigrams) in the fitted vocabulary
len(vectorizer.get_feature_names())

286125

In [19]:
vectorizer.get_feature_names()

['!',
 '! )',
 '! ,',
 '! -',
 '! :',
 '! =',
 '! =np',
 '! allows',
 '! paper',
 '! possible',
 '! }',
 '#',
 '# ,',
 '# .',
 '# 1',
 '# 64257',
 '# 8211',
 '# 8217',
 '# bi',
 '# csp',
 '# csps',
 '# p',
 '# p-complete',
 '# p-completeness',
 '# p-hard',
 '# p.',
 '# rhpi_1',
 '# sat',
 '$',
 '$ $',
 '$ (',
 '$ ,',
 '$ .',
 '$ \\varphi\\in\\alpha',
 '$ {',
 '%',
 '% (',
 '% )',
 '% +/-',
 '% ,',
 '% -',
 '% -30',
 '% .',
 '% 1',
 '% 10',
 '% 100',
 '% 19',
 '% 20',
 '% 200',
 '% 21',
 '% 25',
 '% 3',
 '% 30',
 '% 40',
 '% 45',
 '% 50',
 '% 6',
 '% 60',
 '% 70',
 '% 71',
 '% 84',
 '% 9',
 '% 90',
 '% 99',
 '% ;',
 '% accuracy',
 '% accurate',
 '% achieved',
 '% applying',
 '% area',
 '% average',
 '% best',
 '% better',
 '% business',
 '% capacity',
 '% case',
 '% character',
 '% citation',
 '% classification',
 '% compared',
 '% comparison',
 '% computational',
 '% confidence',
 '% correct',
 '% corresponding',
 '% coverage',
 '% current',
 '% data',
 '% decrease',
 '% depending',
 '

In [20]:
# Performing 10-fold cross validation on all 4 models for the InfoTheory column
models = [
    LogisticRegression(),
    BernoulliNB(),
    LinearSVC(),
    RandomForestClassifier()
]
CV = 10
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, x_train_it, y_train_it, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
# One row per (model, fold) holding the accuracy on that fold
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])



In [21]:
# Using `TfidfVectorizer` converting the test dataset of InfoTheory column into vector form
x_test_it = vectorizer.transform(abstract_test)
y_test_it = np.asarray(it_test)

In [23]:
# Calculating the metrics for InfoTheory Column
for infotheory in models:
    model_name = infotheory.__class__.__name__
    infotheory.fit(x_train_it, y_train_it)
    print(model_name)
    # Do the prediction
    y_predict_it=infotheory.predict(x_test_it)
    confusion_matrix_it = confusion_matrix(y_test_it,y_predict_it)
    recall=recall_score(y_test_it,y_predict_it,average='macro')
    precision=precision_score(y_test_it,y_predict_it,average='macro')
    f1score=f1_score(y_test_it,y_predict_it,average='macro')
    accuracy=accuracy_score(y_test_it,y_predict_it)
    matthews = matthews_corrcoef(y_test_it,y_predict_it)
    print('Full confusion matrix for method on InfoTheory: \n'+ str(confusion_matrix_it))
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
Full confusion matrix for method on InfoTheory: 
[[15881   182]
 [  851  2765]]
Accuracy: 0.9475074952995579
Macro Precision: 0.9436908269701535
Macro Recall: 0.876663346521633
Macro F1 score:0.9055518821563093
MCC:0.8176113299301296
BernoulliNB
Full confusion matrix for method on InfoTheory: 
[[15658   405]
 [  594  3022]]
Accuracy: 0.9492352253671427
Macro Precision: 0.9226357433882932
Macro Recall: 0.9052584327804403
Macro F1 score:0.9136212996669142
MCC:0.8277117831770574
LinearSVC
Full confusion matrix for method on InfoTheory: 
[[15818   245]
 [  600  3016]]
Accuracy: 0.957060826261497
Macro Precision: 0.9441622083360464
Macro Recall: 0.9094091764782364
Macro F1 score:0.9255557225864601
MCC:0.8528636091137095
RandomForestClassifier
Full confusion matrix for method on InfoTheory: 
[[15896   167]
 [ 1398  2218]]
Accuracy: 0.9204736013008791
Macro Precision: 0.9245708755160174
Macro Recall: 0.801494196110558
Macro F1 score:0.8461467212470114
MCC:0.7155575582876823

From the above metrics, `Linear SVC` is the best choice for the `InfoTheory` column: it has the highest accuracy (0.957), macro F1 score (0.926) and MCC (0.853) of the four models.

### c. Machine Learning Model for Math Column

In [24]:
# Using `TfidfVectorizer` converting the train dataset of Math column into vector form
x_train_math = vectorizer.fit_transform(abstract_train)
y_train_math = np.asarray(math_train)

In [25]:
# Number of features (unigrams and bigrams) in the fitted vocabulary
len(vectorizer.get_feature_names())

286125

In [26]:
vectorizer.get_feature_names()

['!',
 '! )',
 '! ,',
 '! -',
 '! :',
 '! =',
 '! =np',
 '! allows',
 '! paper',
 '! possible',
 '! }',
 '#',
 '# ,',
 '# .',
 '# 1',
 '# 64257',
 '# 8211',
 '# 8217',
 '# bi',
 '# csp',
 '# csps',
 '# p',
 '# p-complete',
 '# p-completeness',
 '# p-hard',
 '# p.',
 '# rhpi_1',
 '# sat',
 '$',
 '$ $',
 '$ (',
 '$ ,',
 '$ .',
 '$ \\varphi\\in\\alpha',
 '$ {',
 '%',
 '% (',
 '% )',
 '% +/-',
 '% ,',
 '% -',
 '% -30',
 '% .',
 '% 1',
 '% 10',
 '% 100',
 '% 19',
 '% 20',
 '% 200',
 '% 21',
 '% 25',
 '% 3',
 '% 30',
 '% 40',
 '% 45',
 '% 50',
 '% 6',
 '% 60',
 '% 70',
 '% 71',
 '% 84',
 '% 9',
 '% 90',
 '% 99',
 '% ;',
 '% accuracy',
 '% accurate',
 '% achieved',
 '% applying',
 '% area',
 '% average',
 '% best',
 '% better',
 '% business',
 '% capacity',
 '% case',
 '% character',
 '% citation',
 '% classification',
 '% compared',
 '% comparison',
 '% computational',
 '% confidence',
 '% correct',
 '% corresponding',
 '% coverage',
 '% current',
 '% data',
 '% decrease',
 '% depending',
 '

In [27]:
# Performing 10-fold cross validation on all 4 models for the Math column
models = [
    LogisticRegression(),
    BernoulliNB(),
    LinearSVC(),
    RandomForestClassifier()
]
CV = 10
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, x_train_math, y_train_math, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
# One row per (model, fold) holding the accuracy on that fold
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])



In [28]:
# Using `TfidfVectorizer` converting the test dataset of Math column into vector form
x_test_math = vectorizer.transform(abstract_test)
y_test_math = np.asarray(math_test)

In [29]:
# Calculating the metrics for Math Column
for math in models:
    model_name = math.__class__.__name__
    math.fit(x_train_math, y_train_math)
    print(model_name)
    # Do the prediction
    y_predict_math=math.predict(x_test_math)
    confusion_matrix_math = confusion_matrix(y_test_math,y_predict_math)
    recall=recall_score(y_test_math,y_predict_math,average='macro')
    precision=precision_score(y_test_math,y_predict_math,average='macro')
    f1score=f1_score(y_test_math,y_predict_math,average='macro')
    accuracy=accuracy_score(y_test_math,y_predict_math)
    matthews = matthews_corrcoef(y_test_math,y_predict_math)
    print('Full confusion matrix for method on Math: \n'+ str(confusion_matrix_math))
    print('Accuracy: '+ str(accuracy))
    print('Macro Precision: '+ str(precision))
    print('Macro Recall: '+ str(recall))
    print('Macro F1 score:'+ str(f1score))
    print('MCC:'+ str(matthews))



LogisticRegression
Full confusion matrix for method on Math: 
[[13018   731]
 [ 1702  4228]]
Accuracy: 0.8763656689872453
Macro Precision: 0.8684831241177657
Macro Recall: 0.8299086599215494
Macro F1 score:0.845551062705082
MCC:0.6973256733590067
BernoulliNB
Full confusion matrix for method on Math: 
[[12517  1232]
 [ 1306  4624]]
Accuracy: 0.8710300320138218
Macro Precision: 0.8475686361014687
Macro Recall: 0.8450786977363492
Macro F1 score:0.8463049300612113
MCC:0.6926428583906539
LinearSVC
Full confusion matrix for method on Math: 
[[12810   939]
 [ 1379  4551]]
Accuracy: 0.8822094618628995
Macro Precision: 0.8658868930638566
Macro Recall: 0.849578874784332
Macro F1 score:0.8570266673804481
MCC:0.7152798847321185




RandomForestClassifier
Full confusion matrix for method on Math: 
[[13393   356]
 [ 3131  2799]]
Accuracy: 0.8228060368921185
Macro Precision: 0.8488406336688967
Macro Recall: 0.723056976579747
Macro F1 score:0.7504976844270614
MCC:0.5578936710676944


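From the above metrics, `Linear SVC` is the best choice for the `Math` column: it has the highest accuracy (0.882), macro recall (0.850), macro F1 score (0.857) and MCC (0.715), although `LogisticRegression` has a marginally higher macro precision (0.868 versus 0.866).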

## Part 2: Topic Modelling<a id='Topic_Modelling' ></a>

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.

Topic modelling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

Here I am using **`Latent Dirichlet Allocation (LDA)`**, through Python's `Gensim` package, to discover the topics in a collection of documents.

### Import Libraries

In [30]:
# Load required libraries
import pandas as pd
import nltk
import pickle
nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim import matutils, models
from pprint import pprint
from nltk import word_tokenize, pos_tag
import pyLDAvis.gensim
import random
%matplotlib inline

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fetch\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [31]:
# Load given data
monash_data = pd.read_csv('Monash_crawled.csv')
monash_data.head()

Unnamed: 0,uri,url,date,title,body
0,1395271488,http://www.theguardian.com/environment/2020/ja...,2020-01-01,Canberra experiences worst air quality on reco...,Canberra\n has experienced its worst air qual...
1,1396563053,https://weather.com/news/news/2020-01-02-thous...,2020-01-02,Thousands Clog Roads Fleeing Australian Bushfi...,As\n dawn broke over a blackened Australi...
2,1397549175,https://www.businessinsider.com/baby-milestone...,2020-01-03,Key milestones your baby can reach in the firs...,Your baby's brain and body grow a lot during ...
3,1397689515,https://www.dailymail.co.uk/health/article-784...,2020-01-03,"Air pollution can break your BONES, study sugg...",Living in polluted cities may make your bones...
4,1397806413,https://www.independent.co.uk/life-style/gadge...,2020-01-03,'World's most efficient battery' can power a s...,Researchers have developed a new battery they...


Before building the topic model, I performed the following pre-processing steps on the data:

**1. Tokenization** - Process of breaking a document down into words, punctuation marks, numeric digits etc.

**2. Conversion to lower case** - Converting all the words into lower case.

**3. Unigram & Bigram generation** - Both unigrams & bigrams are extracted and included in the tokenization process.

**4. POS Tagging** - The main motive is to extract only `Nouns` & `Adjectives`. Other parts of speech like `Verb`, `Adverb`, `Preposition` etc. carry little topical information, so they are dropped.

**5. Compute a bag-of-words representation of the data.**


In [32]:
# Function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    tokenized = word_tokenize(text)
    # Penn Treebank tags: nouns start with 'N' (NN, NNS, NNP, NNPS), adjectives with 'J' (JJ, JJR, JJS)
    kept = [word for (word, pos) in pos_tag(tokenized) if pos.startswith(('N', 'J'))]
    return ' '.join(kept)

In [33]:
# Apply the nouns_adj function to the article bodies to keep only nouns and adjectives
monash_data_nouns_adj = pd.DataFrame(monash_data.body.apply(nouns_adj))
monash_data_nouns_adj.head()

Unnamed: 0,body
0,Canberra worst air quality record bushfire smo...
1,dawn blackened Australian landscape Sunday pic...
2,baby brain body lot first months baby own pace...
3,polluted cities bones weaker easier research s...
4,Researchers new battery power phone continuous...


In [34]:
# Extracting the contents of body to a list
monash_body_data = monash_data_nouns_adj['body'].tolist()
print(len(monash_body_data))
print(monash_body_data[0][0:500])

366
Canberra worst air quality record bushfire smoke atmospheric conditions residents indoors brace more smog coming days ACTas chief health officer Dr Paul Dugdale smoke worst bushfires worsta air quality monitoring city years Air quality index readings Canberra city Wednesday afternoon ACT Health website Ratings more hazardous suburb Monash levels Florey ACT health spokesperson AQI reading fine particles Wednesday Monash monitoring site Canberra-based University NSW climate scientist Dr Sophie Lew


Here the text is tokenized using a `RegexpTokenizer`. Number tokens and words with only one character are also removed.

In [35]:
# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(monash_body_data)):
    monash_body_data[idx] = monash_body_data[idx].lower()  # Convert to lowercase.
    monash_body_data[idx] = tokenizer.tokenize(monash_body_data[idx])  # Split into words.
    

# Remove numbers, but not words that contain numbers.
monash_body_data = [[token for token in monash_doc if not token.isnumeric()] for monash_doc in monash_body_data]

# Remove words that are only one character.
monash_body_data = [[token for token in monash_doc if len(token) > 1] for monash_doc in monash_body_data]

After tokenization, I used `WordNetLemmatizer`. WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.

In [36]:
lemmatizer = WordNetLemmatizer()
monash_body_data = [[lemmatizer.lemmatize(token) for token in monash_doc] for monash_doc in monash_body_data]

Bigrams are sets of two adjacent words. Using bigrams we can get phrases like "machine_learning" in our output
(spaces are replaced with underscores); without bigrams we would only get "machine" and "learning".

Note that in the code below, we find bigrams and then add them to the original data, because we would like to keep the words "machine" and "learning" as well as the bigram "machine_learning".
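
The idea of keeping the original words and appending the bigrams can be illustrated with a simplified stand-in in plain Python. This sketch only counts adjacent word pairs and keeps those seen at least `min_count` times; gensim's `Phrases` additionally applies a score threshold, so it is stricter than this. The helper name `add_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times (simplified)."""
    # Count every adjacent word pair across the whole corpus.
    pair_counts = Counter(
        (a, b) for doc in docs for a, b in zip(doc, doc[1:])
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}
    result = []
    for doc in docs:
        bigrams = [f"{a}_{b}" for a, b in zip(doc, doc[1:]) if (a, b) in frequent]
        # Keep the original unigrams and add the bigrams at the end.
        result.append(doc + bigrams)
    return result

# Hypothetical toy corpus.
toy_docs = [
    ["climate", "change", "policy"],
    ["climate", "change", "risk"],
    ["air", "quality"],
]
with_bigrams = add_frequent_bigrams(toy_docs, min_count=2)
print(with_bigrams[0])
```

Each document keeps `climate` and `change` as separate tokens and also gains `climate_change`, while the pair `air quality`, which occurs only once, is not joined.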

In [37]:
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(monash_body_data, min_count=20)
for idx in range(len(monash_body_data)):
    for token in bigram[monash_body_data[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            monash_body_data[idx].append(token)

Rare words and very common words are removed based on their document frequency. Below, terms that occur in fewer than 20 documents or in more than 50% of the documents are removed.
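
The document-frequency filter described above can be sketched in plain Python. This is a minimal sketch, not gensim's implementation: the helper `keep_terms` and the toy documents are hypothetical, and gensim's `filter_extremes` additionally caps the vocabulary size through its `keep_n` argument.

```python
from collections import Counter

def keep_terms(docs, no_below, no_above):
    """Return terms found in at least no_below docs and at most a no_above fraction of docs."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term for term, count in df.items()
        if count >= no_below and count / n_docs <= no_above
    }

# Hypothetical toy corpus of four tokenised documents.
toy_docs = [
    ["virus", "mask", "yokohama"],
    ["virus", "mask"],
    ["virus", "flight"],
    ["virus", "school"],
]
kept = keep_terms(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here `virus` is dropped because it appears in every document, the terms that appear in a single document are dropped as too rare, and only `mask` survives.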

In [38]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(monash_body_data)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
print(dictionary)

Dictionary(849 unique tokens: ['able', 'act', 'activity', 'advice', 'afternoon']...)


Finally, each document is converted into a bag-of-words vector: a list of `(token_id, count)` pairs giving the frequency of each dictionary token (unigram or bigram) in that document.
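
As an illustration of what a bag-of-words vector looks like, here is a minimal sketch in plain Python that mimics the shape of the `(token_id, count)` output. The helpers `build_token_ids` and `to_bow` and the toy documents are hypothetical, and the id assignment is an assumption made for the example rather than gensim's exact behaviour.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to every distinct token (illustrative id order)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a tokenised document into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())

# Hypothetical toy corpus.
toy_docs = [
    ["air", "quality", "air", "smoke"],
    ["smoke", "fire"],
]
token2id = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(toy_corpus)
```

Tokens that are not in the dictionary are silently skipped, which is why filtering the dictionary first also shrinks every document vector.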

In [39]:
corpus = [dictionary.doc2bow(monash_doc) for monash_doc in monash_body_data]
print(corpus)

[[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1), (5, 9), (6, 6), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 3), (14, 5), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 4), (22, 1), (23, 1), (24, 7), (25, 1), (26, 1), (27, 1), (28, 2), (29, 1), (30, 1), (31, 2), (32, 2), (33, 1), (34, 2), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 2), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 3), (57, 2), (58, 1), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 6), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 8), (83, 1), (84, 1), (85, 1), (86, 2), (87, 1), (88, 1), (89, 1), (90, 1), (91, 7), (92, 1), (93, 1), (94, 3), (95, 1), (96, 1)], [(5, 3), (6, 1), (11, 1), (13, 5), (14, 4), (17, 1), (24, 2), (37, 11), (40, 2), (44, 1), (48, 1), (52, 1), (54, 1), (56, 1), (57, 1),

In [40]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 849
Number of documents: 366


Save the corpus and dictionary so that they can be loaded later to drive the visualisation.

In [41]:
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')


### Training of LDA Model

Before training the LDA model, here is what the training parameters mean.

`NUM_TOPICS` - The number of latent topics to be extracted from the training corpus.

`chunksize` - Controls how many documents are processed at a time in the training algorithm. A larger chunksize speeds up training, as long as the chunk of documents fits into memory. If chunksize is greater than the number of documents, all the documents are processed in a single chunk.

`passes` - Controls how often the model is trained on the entire corpus; it is similar to `epochs`.

`iterations` - Controls how often a particular loop over each document is repeated, i.e. the maximum number of inference iterations per document.

`eval_every` - How often model perplexity is evaluated. Setting it to `None` disables the evaluation; setting it to one slows down the training process.
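
The interaction between `chunksize` and `passes` can be worked out with simple arithmetic. The sketch below uses the corpus size and parameter values from this notebook and assumes one model update per chunk, which is an assumption about the default training schedule rather than something shown in the notebook.

```python
import math

num_docs = 366     # number of documents in the corpus
chunksize = 2000   # documents processed at a time
passes = 25        # full sweeps over the corpus

# Number of chunks needed to cover the corpus once.
chunks_per_pass = math.ceil(num_docs / chunksize)
# Assumption: the model is updated once per chunk.
total_chunk_updates = chunks_per_pass * passes

print("Chunks per pass:", chunks_per_pass)
print("Total chunk updates:", total_chunk_updates)
```

Because the chunksize exceeds the corpus size, each pass handles all 366 documents in a single chunk, so the number of updates is driven by `passes` alone.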

In [42]:
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

In [43]:
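# Set training parameters.
# Note: random.seed() only seeds Python's `random` module; gensim's LdaModel is seeded through its `random_state` argument.
random.seed(1234)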
NUM_TOPICS = 10
chunksize = 2000
passes = 25
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

ldarun1 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)
ldarun1.print_topics()

[(0,
  '0.038*"fire" + 0.025*"patient" + 0.023*"climate" + 0.017*"study" + 0.015*"change" + 0.013*"bushfires" + 0.013*"climate_change" + 0.013*"research" + 0.013*"level" + 0.012*"risk"'),
 (1,
  '0.024*"wuhan" + 0.018*"chinese" + 0.015*"symptom" + 0.014*"patient" + 0.013*"melbourne" + 0.012*"hospital" + 0.012*"outbreak" + 0.010*"sydney" + 0.010*"city" + 0.009*"medical"'),
 (2,
  '0.025*"school" + 0.018*"home" + 0.015*"social" + 0.013*"child" + 0.012*"state" + 0.012*"food" + 0.010*"covid" + 0.009*"family" + 0.008*"right" + 0.008*"spread"'),
 (3,
  '0.037*"flight" + 0.027*"wuhan" + 0.023*"island" + 0.017*"christmas" + 0.017*"christmas_island" + 0.013*"january_january" + 0.012*"qantas" + 0.012*"february_february" + 0.012*"chinese" + 0.011*"mr"'),
 (4,
  '0.106*"area" + 0.040*"data" + 0.030*"analysis" + 0.030*"amp" + 0.019*"cell" + 0.019*"result" + 0.019*"network" + 0.016*"image" + 0.014*"human" + 0.014*"site"'),
 (5,
  '0.024*"mask" + 0.018*"face" + 0.016*"face_mask" + 0.013*"sydney" + 0.

In [44]:
top_topics = ldarun1.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS
print('Average topic coherence: %.4f.' % avg_topic_coherence)

pprint(top_topics)

Average topic coherence: -1.1563.
[([(0.038750097, 'student'),
   (0.037633628, 'ship'),
   (0.028114667, 'cruise'),
   (0.024229938, 'passenger'),
   (0.021559853, 'ban'),
   (0.020556295, 'princess'),
   (0.018869197, 'diamond'),
   (0.018494595, 'japan'),
   (0.018342553, 'cruise_ship'),
   (0.016369142, 'diamond_princess'),
   (0.015672153, 'travel'),
   (0.012099682, 'travel_ban'),
   (0.011207089, 'board'),
   (0.011146187, 'chinese'),
   (0.009647067, 'february_february'),
   (0.008890872, 'quarantine'),
   (0.0087652095, 'january_january'),
   (0.0083975075, 'positive'),
   (0.008276098, 'yokohama'),
   (0.008245844, 'international')],
  -0.701063580310877),
 ([(0.023561489, 'wuhan'),
   (0.01808191, 'chinese'),
   (0.01474589, 'symptom'),
   (0.01381103, 'patient'),
   (0.012730829, 'melbourne'),
   (0.011931883, 'hospital'),
   (0.011833912, 'outbreak'),
   (0.010187934, 'sydney'),
   (0.010117869, 'city'),
   (0.008799089, 'medical'),
   (0.008763373, 'man'),
   (0.008175428

### Visualisation for LDA first run

In [45]:
lda_display = pyLDAvis.gensim.prepare(ldarun1, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)


In [46]:
# Train LDA model run2

# Set training parameters.
random.seed(1234)
NUM_TOPICS = 100
chunksize = 2000
passes = 25
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make a index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

ldarun1 = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)

Here I compute the coherence of each topic and the average topic coherence.

Finally, the topics are printed in order of coherence.
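
The structure that is averaged here is a list of `(word_weights, coherence)` pairs, one per topic. The sketch below shows the averaging step on a hypothetical three-topic list in plain Python; the weights, words and coherence scores are placeholders and are not results from this notebook.

```python
# Hypothetical output in the same shape as top_topics():
# each item is ([(weight, word), ...], coherence_score).
top_topics = [
    ([(0.04, "ship"), (0.03, "cruise")], -0.70),
    ([(0.02, "wuhan"), (0.02, "chinese")], -1.10),
    ([(0.03, "school"), (0.02, "home")], -1.50),
]

# Pull out the coherence score of every topic.
coherences = [score for _words, score in top_topics]
# Dividing by len(top_topics) avoids relying on a separate NUM_TOPICS constant.
avg_topic_coherence = sum(coherences) / len(top_topics)
# Scores closer to zero indicate more coherent topics.
ranked = sorted(coherences, reverse=True)

print("Average topic coherence: %.4f." % avg_topic_coherence)
```

Coherence scores of this kind are negative, and values closer to zero indicate more coherent topics, so two runs can be compared by their average score.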

In [47]:
top_topics = ldarun1.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS
print('Average topic coherence: %.4f.' % avg_topic_coherence)

pprint(top_topics)

Average topic coherence: -1.4129.
[([(0.0748142, 'february_february'),
   (0.06957495, 'january_january'),
   (0.066171646, 'aedt'),
   (0.062933266, 'aedt_february'),
   (0.05017088, 'flight'),
   (0.04775899, 'passenger'),
   (0.039658397, 'february_western'),
   (0.03963761, 'february_queensland'),
   (0.03955139, 'february_japan'),
   (0.039138146, 'wale_january'),
   (0.039132394, 'january_victoria'),
   (0.039128035, 'january_february'),
   (0.03625549, 'advertisement'),
   (0.03496465, 'board'),
   (0.03471341, 'japan'),
   (0.030066708, 'wuhan'),
   (0.028640406, 'qantas'),
   (0.023097038, 'plane'),
   (0.021380931, 'woman'),
   (0.02073909, 'quarantine')],
  -0.6737483587365319),
 ([(0.06257448, 'island'),
   (0.04810967, 'wuhan'),
   (0.04769135, 'christmas'),
   (0.046819087, 'christmas_island'),
   (0.02902559, 'evacuee'),
   (0.023261884, 'flight'),
   (0.01853921, 'mr'),
   (0.017899128, 'city'),
   (0.016061883, 'centre'),
   (0.016038945, 'epicentre'),
   (0.015572328,

 ([(0.038192626, 'staff'),
   (0.031827845, 'online'),
   (0.029442193, 'student'),
   (0.023723511, 'face'),
   (0.021587515, 'class'),
   (0.017114421, 'response'),
   (0.01659303, 'crisis'),
   (0.016418662, 'covid'),
   (0.01489756, 'march'),
   (0.012953493, 'possible'),
   (0.012902389, 'monday'),
   (0.012490859, 'state'),
   (0.012322944, 'executive'),
   (0.012260662, 'chief_executive'),
   (0.01223727, 'education'),
   (0.012223595, 'term'),
   (0.0118321935, 'to'),
   (0.011686459, 'economic'),
   (0.011493718, 'concern'),
   (0.009668668, 'others')],
  -1.426240739690007),
 ([(0.025463639, 'patient'),
   (0.025200646, 'hospital'),
   (0.021977011, 'lot'),
   (0.021959115, 'new_zealand'),
   (0.021959035, 'zealand'),
   (0.021080667, 'doctor'),
   (0.017979598, 'family'),
   (0.014629168, 'treatment'),
   (0.014360374, 'emergency'),
   (0.014194119, 'disease'),
   (0.013193236, 'world'),
   (0.010996396, 'part'),
   (0.010995811, 'department'),
   (0.010995476, 'report'),
  

   (0.013584566, 'group'),
   (0.013192159, 'minister'),
   (0.013060805, 'access'),
   (0.0129015725, 'first'),
   (0.012647118, 'education'),
   (0.012537422, 'morrison'),
   (0.0124088, 'economy')],
  -1.9993820247243559),
 ([(0.0267735, 'medical'),
   (0.026173264, 'baby'),
   (0.022261618, 'first'),
   (0.020431383, 'official'),
   (0.016967481, 'month'),
   (0.01672625, 'heat'),
   (0.015981635, 'world'),
   (0.013113213, 'medium'),
   (0.012937296, 'smoke'),
   (0.012562053, 'good'),
   (0.012388101, 'event'),
   (0.012082161, 'policy'),
   (0.010917401, 'who'),
   (0.010731802, 'head'),
   (0.010726733, 'pandemic'),
   (0.0102893775, 'monday'),
   (0.010274992, 'contact'),
   (0.010045576, 'animal'),
   (0.00995735, 'due'),
   (0.009842419, 'doctor')],
  -2.128600923484081),
 ([(0.05885406, 'indonesia'),
   (0.040938452, 'melbourne'),
   (0.032766484, 'trade'),
   (0.024388513, 'customer'),
   (0.022143887, 'traveller'),
   (0.018947333, 'australiaas'),
   (0.0185711, 'morrison

### Visualisation for LDA run 2

In [48]:
lda_display1 = pyLDAvis.gensim.prepare(ldarun1, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display1)

## Conclusion :
From this assessment, I learnt how to implement Text Classification by building classifiers for three different classes from a specific column, using a Neural Network model and Machine Learning models. I also learnt the concept of Topic Modelling, which discovers the abstract topics in a collection of documents. I used an unsupervised learning method called Latent Dirichlet Allocation (LDA) to find the topics in the documents, and interpreted the output of this topic model with the help of visualisations built with the `gensim` and `pyLDAvis` packages in Python, which helped me identify novel findings from the topic model.

## References :
https://monkeylearn.com/text-classification/ 

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

https://radimrehurek.com/gensim/models/ldamodel.html

http://bl.ocks.org/AlessandraSozzi/raw/ce1ace56e4aed6f2d614ae2243aab5a5/