<a href="https://colab.research.google.com/gist/sayakmisra/3f5a3fc7eb18e0a6f93dac4a08b08dd8/grammar-checker-ulmfit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grammar checker using ULMFiT

## Installing the fastai library

In [None]:
!pip install wget
from fastai.text import *

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=3f1dc374b637eb77b2d84efebf756ac8a277bdb5fe0fdd44c8b170f8eb784ba1
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


## Downloading the Dataset.

We'll use the Corpus of Linguistic Acceptability (CoLA) dataset for single-sentence classification. It is a set of sentences labeled as grammatically correct or incorrect. It was first published in May 2018 and is one of the tasks in the GLUE benchmark, on which models like BERT compete.

In [None]:
import wget
import os
#%tensorflow_version 2.x

import tensorflow as tf
print(tf.__version__)
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

2.8.2
Found GPU at: /device:GPU:0
Downloading dataset...


In [None]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


## Parsing the training and testing data.

In [None]:
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



index,sentence_source,label,label_notes,sentence
5089,ks08,1,,What I've always tended to do is to do my own ...
6153,c_13,0,*,To Louis was sent a book.
6423,d_98,0,*,Any lion is rare.
7133,sks13,1,,This girl in the red coat will put a picture o...
4038,ks08,1,,Joe warned the class that the exam would be di...
7182,sks13,0,*,"Put a picture of Bill on your desk, this girl ..."
5300,b_82,1,,"It's obvious that, although he's a nice guy, J..."
1135,r-67,0,*,I loaned my binoculars a man who was watching ...
5558,b_73,0,*,John is taller than six feet is.
8379,ad03,1,,Gilgamesh might not have been reading the cune...


The two properties we actually care about are the `sentence` and its `label`, which indicates whether the sentence is grammatically acceptable (0 = unacceptable, 1 = acceptable).

In [None]:
from sklearn.model_selection import train_test_split

train_sentences = df.sentence
train_labels = df.label
train_set = pd.concat([train_labels,train_sentences], axis=1)
new_train_set, new_val_set= train_test_split(train_set,test_size=0.10,shuffle=False)

In [None]:
df = pd.read_csv("./cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
test_sentences = df.sentence
test_labels = df.label
new_test_set = pd.concat([test_labels,test_sentences], axis=1)
new_val_set

index,label,sentence
7695,1,The moon glows in the darkness.
7696,1,The moon glows.
7697,1,I sang a song with Mary while you did so with ...
7698,1,What Mary did with Bill was sing a song.
7699,1,She tried to leave
...,...,...
8546,0,Poseidon appears to own a dragon
8547,0,Digitize is my happiest memory
8548,1,It is easy to slay the Gorgon.
8549,1,I had the strangest feeling that I knew you.


## Data preprocessing

Build the data bunches for the language model and the classifier from the training and validation splits. The classifier reuses the language model's vocabulary.

In [None]:
import os
print('getcwd:', os.getcwd())
path = os.getcwd()
data_lm = TextLMDataBunch.from_df(path,train_df=new_train_set,valid_df= new_val_set)
data_clas = TextClasDataBunch.from_df(path,train_df=new_train_set, valid_df= new_val_set, vocab=data_lm.train_ds.vocab, bs=32)
data_lm.train_ds.vocab

getcwd: /content




<fastai.text.transform.Vocab at 0x7fd552910290>

In [None]:
data_lm.save('data_lm_export.pkl')
data_clas.save('data_clas_export.pkl')

In [None]:
bs=32
data_lm = load_data(path, 'data_lm_export.pkl', bs=bs)
data_clas = load_data(path, 'data_clas_export.pkl', bs=bs)



In [None]:
torch.cuda.set_device(0)

## Language Modelling

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd.tgz


In [None]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,4.338213,3.614888,0.303348,00:03


In [None]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,3.175818,3.421913,0.362277,00:04


In [None]:
learn.save('cola_language_model')
learn.save_encoder('cola_language_model_encoder')

## Classifier (here, a grammar verifier)

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('cola_language_model_encoder')
learn.freeze()

In [None]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.619741,0.598677,0.719626,00:03




In [None]:
data_clas.show_batch()



text,target
"xxbos xxmaj everybody who has ever , worked in any office which contained any xxunk which had ever been used to type any letters which had to be signed by any xxunk who ever worked in any department like mine will know what i xxunk .",1
xxbos xxmaj that xxmaj bill tried to discover which drawer xxmaj alice put the money in made us realize that we should have left him in xxmaj seoul .,1
"xxbos xxmaj it is n't because xxmaj sue said anything bad about me that i 'm angry , although she did say some bad things about me .",0
"xxbos xxmaj the folks up at corporate headquarters are the sort of people who the sooner you solve this problem , the more easily you 'll satisfy .",1
xxbos xxmaj we have many graduate students but this year the graduate director met with any student in the graduate program individually to discuss their progress .,0


In [None]:
learn.freeze_to(-1)
learn.fit_one_cycle(3, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.610586,0.586806,0.719626,00:03
1,0.608504,0.591892,0.71729,00:03
2,0.585614,0.594739,0.727804,00:03




In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(3, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.612267,0.59915,0.719626,00:04
1,0.601884,0.603755,0.724299,00:03
2,0.563958,0.604955,0.709112,00:04




In [None]:
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-4, 1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.557525,0.651875,0.683411,00:08
1,0.55446,0.632902,0.714953,00:08
2,0.49356,0.66993,0.738318,00:08
3,0.391956,0.671845,0.741822,00:08
4,0.2874,0.759385,0.740654,00:08




## Testing

Testing on a custom input.

In [None]:
learn.predict('he said if else was')


(Category tensor(0), tensor(0), tensor([0.7215, 0.2785]))
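
The tuple returned by `learn.predict` holds the predicted category, its class index, and the per-class probabilities. As a minimal sketch of turning that probability vector into a readable verdict, the helper below picks the most probable class. The name `interpret_probs` and the label strings are hypothetical and not part of fastai. The probabilities are copied from the output above.

```python
def interpret_probs(probs, labels=("unacceptable", "acceptable")):
    """Return (label, confidence) for the most probable class.

    `labels` is an assumed naming for CoLA's classes 0 and 1.
    """
    # Index of the largest probability, i.e. the predicted class.
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return labels[idx], probs[idx]


# Probabilities copied from the prediction output above.
label, confidence = interpret_probs([0.7215, 0.2785])
print(label, confidence)
```

With a real prediction, the third element of the tuple is a tensor, so it would presumably be passed as `probs.tolist()`.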

Evaluating on the validation set (`get_preds` uses the validation set by default).

In [None]:
preds,targs = learn.get_preds(ordered=True)
accuracy(preds,targs)

tensor(0.7407)

In [None]:
# Check whether the dataset is imbalanced.
train_labels.value_counts()


1    6023
0    2528
Name: label, dtype: int64



Performance on the CoLA benchmark is measured with the Matthews correlation coefficient (MCC) rather than plain accuracy because, as the counts above show, the dataset is imbalanced.

Now we'll take the holdout dataset loaded earlier and evaluate predictions on it using the Matthews correlation coefficient, since this is the metric the wider NLP community uses to evaluate performance on CoLA. With this metric, +1 is the best score and -1 is the worst. This lets us compare our results against state-of-the-art models for this task.
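
To see why accuracy can mislead on imbalanced data, here is a small sketch that computes both metrics from confusion-matrix counts for a hypothetical baseline that labels every training sentence as acceptable. The class counts come from the `value_counts()` output above. The helper names `mcc` and `accuracy_from_counts` are illustrative, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the score is 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy_from_counts(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical baseline: predict class 1 (acceptable) for every sentence.
n_pos, n_neg = 6023, 2528  # class counts from the training labels
tp, tn, fp, fn = n_pos, 0, n_neg, 0

baseline_accuracy = accuracy_from_counts(tp, tn, fp, fn)
baseline_mcc = mcc(tp, tn, fp, fn)
print(f"accuracy = {baseline_accuracy:.4f}, MCC = {baseline_mcc:.4f}")
```

The baseline reaches roughly 70% accuracy without learning anything, while its MCC is 0, which is why MCC is the more informative score here.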

In [None]:
from sklearn.metrics import matthews_corrcoef

test_labels_set = []
pred_labels_i_set = []

# Evaluate the test set using the Matthews correlation coefficient.
print('Calculating Matthews Corr. Coef...')

# For each test sentence...
for i in range(len(test_labels)):
  # learn.predict returns (category, class index, class probabilities).
  _, pred_idx, _ = learn.predict(test_sentences[i])
  pred_labels_i = int(pred_idx)

  test_labels_set.append(test_labels[i])
  pred_labels_i_set.append(pred_labels_i)

# Calculate the coefficient for the test data.
matthews = matthews_corrcoef(test_labels_set, pred_labels_i_set)

print(matthews)
              

Calculating Matthews Corr. Coef...
0.19833624079258313


## Export the model

In [None]:
learn.export()

## Import the saved model and test


To run on a local machine (even CPU-only), we can simply load the exported model and experiment with it.

In [None]:
from fastai.text import *
learn = load_learner('/content')

In [None]:
# predict returns (category, class index, class probabilities).
_, pred_idx, _ = learn.predict('he said if else was')
# Class index 1 corresponds to label 1 (acceptable).
if int(pred_idx) == 1:
  print('Grammatically Correct')
else:
  print('Grammatically Incorrect')

Grammatically Incorrect


In [None]:
# predict returns (category, class index, class probabilities).
_, pred_idx, _ = learn.predict('you are doing great .')
# Class index 1 corresponds to label 1 (acceptable).
if int(pred_idx) == 1:
  print('Grammatically Correct')
else:
  print('Grammatically Incorrect')

Grammatically Correct
