<a href="https://colab.research.google.com/github/praveentn/hgwxx7/blob/master/nlp/grammar/grammar_checker_ULMFIT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grammar checker using ULMFiT

## Installing wget and importing the fastai library

In [1]:
!pip install wget
from fastai.text import *

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=2b21382438ab9cf39e96e474bc3e4d9a1cd0ab4820d3335700e55c4ad9dbb476
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


## Downloading the dataset

We'll use the Corpus of Linguistic Acceptability (CoLA) dataset for single-sentence classification. It is a set of sentences labeled as grammatically acceptable or unacceptable. It was first published in May 2018 and is one of the tasks in the GLUE benchmark, on which models such as BERT compete.

In [2]:
import wget
import os
%tensorflow_version 2.x

import tensorflow as tf
print(tf.__version__)
device_name = tf.test.gpu_device_name()
#if device_name != '/device:GPU:0':
#  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

2.3.0
Found GPU at: /device:GPU:0
Downloading dataset...


In [3]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


## Parsing the training and testing data

In [4]:
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



Unnamed: 0,sentence_source,label,label_notes,sentence
5492,b_73,1,,I gave her so much.
3753,ks08,1,,The child broke the teapot by accident.
8310,ad03,0,*,Owners of a pig loves to eat truffles
233,cj99,1,,So much did you eat that everyone gasped.
2308,l-93,0,*,Linda taped the label and the cover.
2676,l-93,1,,Carla slid the book to Dale.
60,gj04,1,,This building got taller and taller.
1117,r-67,0,*,He threw into the wastebasket the letter.
3873,ks08,1,,This is the box in which John put his gold.
7677,sks13,1,,the rice was cooked by Bill.


The two properties we actually care about are the `sentence` and its `label`, which records whether the sentence is grammatically acceptable (0=unacceptable, 1=acceptable).
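As a minimal sketch of the four-column layout, the snippet below parses two rows copied from the sample above with Python's standard `csv` module and keeps only the label and the sentence. The in-memory string `SAMPLE_TSV` and the helper `parse_cola` are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# Two rows copied from the sample output above, in the raw TSV layout:
# sentence_source, label, label_notes, sentence (the file has no header row).
SAMPLE_TSV = (
    "ks08\t1\t\tThe child broke the teapot by accident.\n"
    "r-67\t0\t*\tHe threw into the wastebasket the letter.\n"
)

def parse_cola(text):
    """Return (label, sentence) pairs from CoLA-style TSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(int(label), sentence) for _source, label, _notes, sentence in reader]

pairs = parse_cola(SAMPLE_TSV)
for label, sentence in pairs:
    print("acceptable" if label == 1 else "unacceptable", "|", sentence)
```

The `label_notes` column only carries the linguists' original annotation (such as `*`), so dropping it loses nothing the classifier needs.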

In [5]:
from sklearn.model_selection import train_test_split

train_sentences = df.sentence
train_labels = df.label
train_set = pd.concat([train_labels,train_sentences], axis=1)
new_train_set, new_val_set= train_test_split(train_set,test_size=0.10,shuffle=False)

In [6]:
df = pd.read_csv("./cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
test_sentences = df.sentence
test_labels = df.label
new_test_set = pd.concat([test_labels,test_sentences], axis=1)
new_val_set

Unnamed: 0,label,sentence
7695,1,The moon glows in the darkness.
7696,1,The moon glows.
7697,1,I sang a song with Mary while you did so with ...
7698,1,What Mary did with Bill was sing a song.
7699,1,She tried to leave
...,...,...
8546,0,Poseidon appears to own a dragon
8547,0,Digitize is my happiest memory
8548,1,It is easy to slay the Gorgon.
8549,1,I had the strangest feeling that I knew you.
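The validation set above starts at row 7695 because the split was made with `shuffle=False`, so the last 10% of rows are held out in their original order. The sketch below reproduces that arithmetic in plain Python. It assumes the held-out share is rounded up, which is consistent with the first index shown above; `sequential_split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def sequential_split_sizes(n_rows, test_size):
    """Sizes of an unshuffled split, assuming the held-out part is rounded up."""
    n_val = math.ceil(n_rows * test_size)
    return n_rows - n_val, n_val

# 8,551 training sentences, 10% held out for validation.
n_train, n_val = sequential_split_sizes(8551, 0.10)
print(n_train, n_val)

# Without shuffling, the training rows are the first block and the
# validation rows are the block that follows it.
rows = list(range(8551))
train_rows, val_rows = rows[:n_train], rows[n_train:]
print(val_rows[0])
```

Because CoLA rows are grouped by source, an unshuffled split means the validation set may come from different sources than most of the training set; shuffling would give a more representative sample.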


## Data preprocessing

Build the data bunches for the language model and the classifier from the training and validation splits. The classifier reuses the language model's vocabulary.

In [7]:
import os
print('getcwd:', os.getcwd())
path = os.getcwd()
data_lm = TextLMDataBunch.from_df(path,train_df=new_train_set,valid_df= new_val_set)
data_clas = TextClasDataBunch.from_df(path,train_df=new_train_set, valid_df= new_val_set, vocab=data_lm.train_ds.vocab, bs=32)
data_lm.train_ds.vocab

getcwd: /content


<fastai.text.transform.Vocab at 0x7fd97e5934e0>

In [8]:
data_lm.save('data_lm_export.pkl')
data_clas.save('data_clas_export.pkl')

In [9]:
bs=32
data_lm = load_data(path, 'data_lm_export.pkl', bs=bs)
data_clas = load_data(path, 'data_clas_export.pkl', bs=bs)

In [10]:
torch.cuda.set_device(0)

## Language Modelling

In [11]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd.tgz


In [12]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.356718,3.636376,0.298326,00:09


In [13]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,3.173735,3.464684,0.36808,00:11


In [14]:
learn.save('cola_language_model')
learn.save_encoder('cola_language_model_encoder')

## Classifier (here, a grammar verifier)

In [15]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('cola_language_model_encoder')
learn.freeze()

In [16]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.611312,0.594696,0.721963,00:07


In [17]:
data_clas.show_batch()



text,target
"xxbos xxmaj everybody who has ever , worked in any office which contained any xxunk which had ever been used to type any letters which had to be signed by any xxunk who ever worked in any department like mine will know what i xxunk .",1
"xxbos xxmaj will put a picture of xxmaj bill on your desk before tomorrow , this girl in the red coat will put a picture of xxmaj bill on your desk before tomorrow .",0
xxbos xxmaj the xxunk man in the room said that xxmaj john danced an xxmaj xxunk jig from xxmaj county xxmaj kerry to xxmaj county xxmaj xxunk on xxmaj xxunk .,1
xxbos i watched the xxmaj xxunk who the man who had been my xxunk in my xxunk year had xxunk me to study when i got to xxmaj xxunk talk .,0
xxbos a xxunk xxunk of potatoes with xxunk xxunk fell on the professor of linguistics with the terrible taste in t - shirts from the twelfth story .,1
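The batch above shows fastai's special tokens: `xxbos` marks the beginning of a text, `xxmaj` signals that the next word was capitalized before lowercasing, and `xxunk` replaces words missing from the vocabulary. The following is a deliberately simplified sketch of that marking, not fastai's actual tokenizer; `mark_tokens`, the regular expression, and the tiny vocabulary are assumptions made for illustration.

```python
import re

def mark_tokens(sentence, vocab):
    """Simplified imitation of the special-token marking (hypothetical helper)."""
    tokens = ["xxbos"]  # beginning-of-text marker
    for word in re.findall(r"\w+|[^\w\s]", sentence):
        # Flag capitalized words of more than one letter, then lowercase them.
        if len(word) > 1 and word[0].isupper() and word[1:].islower():
            tokens.append("xxmaj")
        word = word.lower()
        # Words outside the vocabulary collapse to the unknown token.
        tokens.append(word if word in vocab else "xxunk")
    return " ".join(tokens)

# Toy vocabulary, chosen only for this example.
vocab = {"the", "moon", "glows", "in", "darkness", ".", "i", "saw", "bill"}
print(mark_tokens("The moon glows.", vocab))
print(mark_tokens("I saw Bill in Kerry.", vocab))
```

Frequent `xxunk` tokens, as in the batch above, mean the model never sees those words, which limits what it can learn from a corpus as small as CoLA.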


In [18]:
learn.freeze_to(-1)
learn.fit_one_cycle(3, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.618353,0.596541,0.714953,00:07
1,0.606092,0.602281,0.721963,00:07
2,0.59784,0.596577,0.719626,00:07


In [19]:
learn.freeze_to(-2)
learn.fit_one_cycle(3, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.612826,0.589351,0.718458,00:09
1,0.590064,0.595709,0.720794,00:09
2,0.573383,0.598853,0.723131,00:09


In [20]:
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.578054,0.589864,0.724299,00:17
1,0.564798,0.628624,0.704439,00:18
2,0.480399,0.635295,0.727804,00:18
3,0.395562,0.708104,0.73014,00:17
4,0.297598,0.781073,0.735981,00:17


## Testing

Testing for custom input

In [21]:
learn.predict('he said if else')


(Category tensor(1), tensor(1), tensor([0.3916, 0.6084]))
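The returned tuple holds the predicted category, its class index, and one probability per class. As a small sketch of how to read it, the code below picks the class with the largest probability from the vector printed above. `pick_class` and the `LABELS` mapping are hypothetical names, and the plain list stands in for the tensor.

```python
def pick_class(probs):
    """Index of the largest probability, i.e. the predicted class."""
    return max(range(len(probs)), key=probs.__getitem__)

# Label meanings as defined for the CoLA data (0=unacceptable, 1=acceptable).
LABELS = {0: "unacceptable", 1: "acceptable"}

# Probability vector copied from the output above.
probs = [0.3916, 0.6084]
pred = pick_class(probs)
print(pred, LABELS[pred], f"confidence={probs[pred]:.4f}")
```

A probability close to 0.5, as here, indicates a low-confidence prediction rather than a firm judgement that the sentence is acceptable.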

Evaluating on the validation set (`get_preds` uses the validation split by default).

In [22]:
preds,targs = learn.get_preds(ordered=True)
accuracy(preds,targs)

tensor(0.7360)

In [23]:
# Check whether the dataset is imbalanced.
train_labels.value_counts()


1    6023
0    2528
Name: label, dtype: int64



As the counts show, the dataset is imbalanced, so plain accuracy can be misleading. Performance on the CoLA benchmark is therefore reported with the Matthews correlation coefficient (MCC).

Now we'll evaluate on the held-out out-of-domain dev set loaded earlier. We score the predictions with the MCC because it is the metric the wider NLP community uses for CoLA. With this metric, +1 is the best score and -1 is the worst, so we can compare our result against state-of-the-art models for this task.
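To make the motivation concrete, here is a minimal sketch of the MCC computed from confusion-matrix counts, applied to a classifier that always answers "acceptable" on the training label counts printed above (6023 acceptable, 2528 unacceptable). The helper `mcc` is hypothetical, and returning 0 when the denominator is zero is a convention assumed here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: a degenerate confusion matrix scores 0.
    return numerator / denominator if denominator else 0.0

# Always predicting "acceptable": every positive is right, every negative wrong.
majority_accuracy = 6023 / (6023 + 2528)
majority_mcc = mcc(tp=6023, tn=0, fp=2528, fn=0)
print(round(majority_accuracy, 3), majority_mcc)

# A perfect classifier on the same counts, for comparison.
perfect_mcc = mcc(tp=6023, tn=2528, fp=0, fn=0)
print(perfect_mcc)
```

The constant classifier reaches about 70% accuracy while its MCC is 0, which is why accuracy alone says little on this dataset.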

In [24]:
from sklearn.metrics import matthews_corrcoef

test_labels_set = []
pred_labels_set = []

# Score the out-of-domain dev set with the Matthews correlation coefficient.
print('Calculating Matthews Corr. Coef...')

# Predict one sentence at a time.
for i in range(len(test_labels)):
  # learn.predict returns (category, class index, class probabilities).
  _, pred_idx, _ = learn.predict(test_sentences[i])

  test_labels_set.append(test_labels[i])
  pred_labels_set.append(int(pred_idx))

# Calculate the coefficient over the whole test set.
matthews = matthews_corrcoef(test_labels_set, pred_labels_set)

print(matthews)


Calculating Matthews Corr. Coef...
0.09249247627260596


## Export the model

In [25]:
learn.export()

## Import the saved model and test


To run on a local machine (even CPU-only), we can simply load the exported model and experiment with it.

In [26]:
from fastai.text import *
learn = load_learner('/content')

In [27]:
learn.predict('He said if else')

(Category tensor(1), tensor(1), tensor([0.3279, 0.6721]))

In [28]:
learn.predict('I am going to the store.')

(Category tensor(1), tensor(1), tensor([0.1742, 0.8258]))

In [33]:
learn.predict('I wonder whom us to trust')

(Category tensor(1), tensor(1), tensor([0.2523, 0.7477]))

In [39]:
learn.predict('he herself the job')

(Category tensor(0), tensor(0), tensor([0.5493, 0.4507]))

In [38]:
preds[23:33], targs[23:33]

(tensor([[0.7517, 0.2483],
         [0.0272, 0.9728],
         [0.0281, 0.9719],
         [0.1329, 0.8671],
         [0.8400, 0.1600],
         [0.9942, 0.0058],
         [0.0126, 0.9874],
         [0.0164, 0.9836],
         [0.2335, 0.7665],
         [0.0051, 0.9949]]), tensor([1, 1, 1, 0, 0, 0, 1, 1, 1, 1]))