<a href="https://colab.research.google.com/github/GDO-Galileo/do-voice-interaction/blob/error_correction/gdo_voicebot/grammar_correction_service/grammar_checker_model/Pre_Processing_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pre-Processing for Lang-8 and CoLA datasets**
This notebook will produce four files for use in training our [Grammar Checker Model](https://colab.research.google.com/drive/1_7RQQPkUHyF3ip5vCI0b2aOxSejOZcTv?usp=sharing):

*   lang-8-train.tsv
*   cola-train.tsv
*   cola-validate.tsv
*   cola-test.tsv

Each one will contain lines of labeled sentences with columns separated by tab characters in the format:

```
  Column     Description
 ------------------------------------------------------------------------------------------
    1        the acceptability judgment label (0 = unacceptable, 1 = acceptable).
    2        the parsed, lowercase sentence with no punctuation apart from apostrophes.
```
For example, a sample from 'cola-train.tsv' reads:
```
  1     john and the man went to the store
  0     i loved intensely the policeman with all my heart
  0     i'm sure we got any tickets
  1     the umpire called the game off
```
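
As a rough illustration of how a downstream consumer might read one of these files, the sketch below splits each line on the first tab into an integer label and a sentence. The helper name `read_labeled_tsv` and the in-memory sample are assumptions for illustration and are not part of this notebook.

```python
import io

def read_labeled_tsv(handle):
    """Yield (label, sentence) pairs from a tab-separated file-like object."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the sentence is kept whole
        label, sentence = line.split("\t", 1)
        yield int(label), sentence

# Hypothetical in-memory sample in the output format described above
sample = io.StringIO(
    "1\tjohn and the man went to the store\n"
    "0\ti'm sure we got any tickets\n"
)
pairs = list(read_labeled_tsv(sample))
print(pairs)
```

Passing an open file handle, for example `open('cola-train.tsv')`, instead of the `StringIO` object should work the same way once the files have been generated.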

The two datasets used for this model are the [Lang-8 Corpus of Learner English](https://sites.google.com/site/naistlang8corpora/) and the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/).


In [None]:
###################################
#             Imports             #
###################################

import sys
import io
import re
import os
import math
import random
from random import randint

In [None]:
###################################
#      Download raw datasets      #
###################################

## Lang-8 ##
# source: https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k

# Upload 'lang-8-en-1.0.zip'
from google.colab import files
uploaded = files.upload()

!unzip lang-8-en-1.0.zip
!cp ./lang-8-en-1.0/entries.train .

## Cola ##
# source: https://nyu-mll.github.io/CoLA

!wget https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
!unzip cola_public_1.1.zip 
!cp ./cola_public/raw/in_domain_train.tsv .
!cp ./cola_public/raw/out_of_domain_dev.tsv .

## Check all files are present ##
# Folder should contain:
#   - entries.train
#   - in_domain_train.tsv
#   - out_of_domain_dev.tsv
print("")
print("Files List:")
%ls

unzip:  cannot find or open lang-8-en-1.0.zip, lang-8-en-1.0.zip.zip or lang-8-en-1.0.zip.ZIP.
cp: cannot stat './lang-8-en-1.0/entries.train': No such file or directory
--2021-11-29 19:53:38--  https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
Resolving nyu-mll.github.io (nyu-mll.github.io)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to nyu-mll.github.io (nyu-mll.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255330 (249K) [application/zip]
Saving to: ‘cola_public_1.1.zip’


2021-11-29 19:53:38 (11.3 MB/s) - ‘cola_public_1.1.zip’ saved [255330/255330]

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/

In [None]:
###################################
#        General Constants        #
###################################

# File names
LANG_INPUT = 'entries.train'
LANG_OUTPUT = 'lang-8-train.tsv'

COLA_TRAIN_INPUT = 'in_domain_train.tsv'
COLA_TEST_INPUT = 'out_of_domain_dev.tsv'
COLA_TRAIN_OUTPUT = 'cola-train.tsv'
COLA_VAL_OUTPUT = 'cola-validate.tsv'
COLA_TEST_OUTPUT = 'cola-test.tsv'

# Indexes of output columns
NEW_LABEL = 0
PARSED = 1

# Output acceptability label definitions
CORRECT = '1'
INCORRECT = '0'

In [None]:
###################################
#        General Functions        #
###################################

# Count all lines in a file
def count_lines(file_name):
  num_lines = 0
  # Use a context manager so the file is closed after counting
  with open(file_name) as f:
    for line in f:
      # Skip empty lines
      if line != "\n":
        num_lines += 1

  return num_lines


# Removes all punctuation apart from apostrophes, and tidies up token spacing
def remove_punc(text):
  text = re.sub(r'  ', ' ', text)       # collapse double spaces
  text = re.sub(r" '", "'", text)       # re-attach tokenized apostrophes
  text = re.sub(r" n't", "n't", text)   # re-attach tokenized "n't"
  text = re.sub(r' \.', '.', text)
  text = re.sub(r' ,', ',', text)
  text = re.sub(r' \?', '?', text)
  text = re.sub(r' !', '!', text)
  text = re.sub(r'\n', '', text)        # drop newlines

  return text.translate({ord(i): None for i in '?!"`[]{}()#~/\\.,><@:;+=_-*&^%$£'})


# Write a list of parsed_lines to file_name
def write_parsed(file_name, parsed_lines):
  sys.stdout.write("Writing to {}...".format(file_name))
  sys.stdout.flush()

  # Use a context manager so the file is flushed and closed on exit
  with open(file_name, "w") as o:
    for line in parsed_lines:
      o.write(line)

  sys.stdout.write("\rWriting to {}... Complete\n".format(file_name))
  sys.stdout.flush()

## **Lang-8 Parsing**

This part will produce the file 'lang-8-train.tsv' from the input file 'entries.train', which can then be used for pretraining our model.

The Lang-8 dataset is presented in the format (as described in the dataset's documentation):
```
  Column           Description
 ----------------------------------------------------------------------------------
    1              the acceptability judgment label (1=unacceptable, 0=acceptable).
    2              Serial number
    3              the URL of the entry
    4              Sentence number. 0 is the title
    5              Sentence written by a learner of English
 anything after 6  Corrected Sentences (If exists)
```
During parsing, we remove all titles and all sentences that are initially marked as correct. This ensures that every sentence labeled as correct in the output was written by a corrector rather than by a learner of English, which makes the output dataset more reliable.
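
The filtering rule described above can be sketched on a few rows as follows. The rows, the serial number and the `example.com` URLs are made up for illustration, and `keep_entry` is a hypothetical helper name; the column positions are assumed from the table above.

```python
# Column positions assumed from the table above (0-indexed)
LABEL, SENT_NUM, LEARNER_SENT, CORRECTED_SENT = 0, 3, 4, 5

def keep_entry(columns):
    """Return True for rows that are neither titles nor marked as correct."""
    return columns[LABEL] != '0' and columns[SENT_NUM] != '0'

# Hypothetical rows for illustration only, not taken from the corpus
rows = [
    "1\t123\thttp://example.com/a\t0\tMy trip\tMy Trip",                        # title
    "0\t123\thttp://example.com/a\t1\tI went to Kyoto.",                        # marked correct
    "1\t123\thttp://example.com/a\t2\tI have went there.\tI have gone there.",  # kept
]

kept = [cols for cols in (r.split('\t') for r in rows) if keep_entry(cols)]
for cols in kept:
    print(cols[LEARNER_SENT], '->', cols[CORRECTED_SENT])
```

Only the last row survives: the first is a title (sentence number 0) and the second is already marked as correct.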

In [None]:
##################################
#        Lang-8 Constants        #
##################################

# Indexes of input columns
LABEL = 0
SERIAL = 1
URL = 2
SENT_NUM = 3
LEARNER_SENT = 4
CORRECTED_SENT = 5

# Sentence number title marker
TITLE = '0'

# Input acceptability label definitions
OLD_CORRECT = '0'
OLD_INCORRECT = '1'

# Values used to sample data randomly
#   each sentence is kept with probability SAMPLE_SIZE/SAMPLE_NUM (about 2%)
SAMPLE_SIZE = 205
SAMPLE_NUM = 10000

In [None]:
###################################
#           Count lines           #
###################################

num_lines = count_lines(LANG_INPUT)
print("Total Entries: %i" % num_lines)

Total Entries: 1037561


In [None]:
###################################
#    Parse into correct format    #
###################################

f = open(LANG_INPUT, "r")

entry_count = 0
num_processed_entries = 0
num_parsed_lines = 0
parsed_lines = []
for line in f:
  # Skip empty lines
  if line == "\n":
    continue

  # Calculate percentage of lines processed
  entry_count += 1
  percent = entry_count / num_lines * 100
  sys.stdout.write("\rProcessing sentence %i/%i: %i%%" % (entry_count, num_lines, percent))
  sys.stdout.flush()

  # Split by tab character
  columns = line.split('\t')

  # Remove entries marked as correct and titles
  if columns[LABEL] == OLD_CORRECT or columns[SENT_NUM] == TITLE:
    continue

  num_processed_entries += 1

  # Clean both sentences up front so that they can be compared
  learner = remove_punc(columns[LEARNER_SENT]).lower()
  corrected = None
  if len(columns) > CORRECTED_SENT:
    corrected = remove_punc(columns[CORRECTED_SENT]).lower()

  # Skip pairs where the correction changed only punctuation or case
  if learner == corrected:
    continue

  # Keep each learner sentence with probability SAMPLE_SIZE/SAMPLE_NUM
  if randint(1, SAMPLE_NUM) <= SAMPLE_SIZE:
    num_parsed_lines += 1
    parsed_lines.append(INCORRECT + '\t' + learner + '\n')

  # Keep each corrected sentence with probability SAMPLE_SIZE/SAMPLE_NUM
  if corrected is not None and randint(1, SAMPLE_NUM) <= SAMPLE_SIZE:
    num_parsed_lines += 1
    parsed_lines.append(CORRECT + '\t' + corrected + '\n')

f.close()

print("\nTotal Usable Entries: %i" % num_processed_entries)
print("Total Output Entries: %i" % num_parsed_lines)

# Shuffle lines
random.shuffle(parsed_lines)

Processing sentence 1037561/1037561: 100%
Total Processed Entries: 494247
Total Output Entries: 20240


In [None]:
###################################
#  Balance correct and incorrect  #
###################################

# Split into incorrect and correct lines
correct = []
incorrect = []
for line in parsed_lines:
  # Separate by tab character
  columns = line.split('\t')

  # Count correct and incorrect lines
  if columns[NEW_LABEL] == CORRECT:
    correct.append(line)
  else:
    incorrect.append(line)

num_correct = len(correct)
num_incorrect = len(incorrect)

print("Correct Entries: %i" % num_correct)
print("Incorrect Entries: %i" % num_incorrect)

if num_correct > num_incorrect:
  difference = num_correct - num_incorrect
  correct = correct[:-difference]
elif num_incorrect > num_correct:
  difference = num_incorrect - num_correct
  incorrect = incorrect[:-difference]

# Add lines together and shuffle
parsed_lines = correct + incorrect
random.shuffle(parsed_lines)

print("")
print("New Correct Entries: %i" % len(correct))
print("New Incorrect Entries: %i" % len(incorrect))

Correct Entries: 10127
Incorrect Entries: 10113

New Correct Entries: 10113
New Incorrect Entries: 10113


In [None]:
##################################
#      Write to output file      #
##################################

write_parsed(LANG_OUTPUT, parsed_lines)

Writing to lang-8-train.tsv... Complete


## **CoLA Parsing**

This part will produce the files 'cola-train.tsv', 'cola-validate.tsv' and 'cola-test.tsv', which can be used for training, validating and testing our model respectively. 'cola-train.tsv' and 'cola-validate.tsv' are both generated from CoLA's 'in_domain_train.tsv' training file, while 'cola-test.tsv' is generated from 'out_of_domain_dev.tsv'.

The CoLA dataset is presented in the format (as described in the dataset's documentation):
```
  Column     Description
 ----------------------------------------------------------------------------
    1	    the code representing the source of the sentence.
    2	    the acceptability judgment label (0=unacceptable, 1=acceptable).
    3	    the acceptability judgment as originally notated by the author.
    4 	   the sentence.
```
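
As a minimal sketch of how one row in this format maps to the output format, the snippet below keeps only the label and sentence columns. The input row and its source code `xx00` are hypothetical, `to_output_line` is an illustrative name, and the cleaning step is a simplified stand-in for the notebook's `remove_punc` (it drops every ASCII punctuation mark except the apostrophe and does not touch spacing).

```python
import string

# Column positions assumed from the table above (0-indexed)
LABEL, SENTENCE = 1, 3

# All ASCII punctuation except the apostrophe
DROP = string.punctuation.replace("'", "")

def to_output_line(raw_line):
    """Convert one raw CoLA row to 'label<TAB>cleaned lowercase sentence'."""
    columns = raw_line.rstrip('\n').split('\t')
    sentence = columns[SENTENCE].translate(str.maketrans('', '', DROP))
    return (columns[LABEL] + '\t' + sentence).lower()

# Hypothetical row: source code, label, empty author notation, sentence
line = to_output_line("xx00\t1\t\tThe umpire called the game off.\n")
print(repr(line))
```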

In [None]:
##################################
#         CoLA Constants         #
##################################

# Indexes of input columns
SOURCE = 0
LABEL = 1
AUTHOR_LABEL = 2
SENTENCE = 3

# Fraction of entries to take as a validation set (0.1 = 10%)
VALIDATION_PERC = 0.1

In [None]:
##################################
#         CoLA Functions         #
##################################

# Parse lines in file_name to be in the output format with sentences lowercase
#   and without punctuation apart from apostrophes
def cola_parse(file_name, num_lines):
  f = open(file_name, "r")

  entry_count = 0
  parsed_lines = []
  for line in f:
    # Calculate percentage of lines processed
    entry_count += 1
    percent = entry_count / num_lines * 100
    sys.stdout.write("\rProcessing sentence %i/%i: %i%%" % (entry_count, num_lines, percent))
    sys.stdout.flush()

    # Split by tab character and reformat line
    columns = line.split('\t')
    parsed_line = columns[LABEL] + '\t' + remove_punc(columns[SENTENCE]) + '\n'
    parsed_lines.append(parsed_line.lower())

  f.close()

  return parsed_lines


# Oversample incorrect entries if needed
def cola_oversample(parsed_lines):
  # Separate correct and incorrect sentences
  incorrect = []
  correct = []
  for line in parsed_lines:
    # Skip empty lines
    if line == "\n":
      continue

    columns = line.split('\t')

    if columns[NEW_LABEL] == INCORRECT:
      incorrect.append(line)
    else:
      correct.append(line)

  num_incorrect = len(incorrect)
  num_correct = len(correct)

  print("Total Correct Entries: %i" % num_correct)
  print("Total Incorrect Entries: %i" % num_incorrect)

  # Oversample incorrect entries if needed
  total_incorrect = list(incorrect)  # copy, so that += below does not also extend incorrect
  if num_incorrect < num_correct:

    # Find percent needed to add
    difference = num_correct - num_incorrect
    percent_addition = difference/num_incorrect

    # Adding full set
    for _ in range(math.floor(percent_addition)):
      total_incorrect += incorrect

    # Add extra entries taken at random to fill the rest of the total
    percent_addition -= math.floor(percent_addition)
    num_extra_lines = round(num_incorrect * percent_addition)

    random.shuffle(incorrect)
    total_incorrect += incorrect[:num_extra_lines]

  num_total_incorrect = len(total_incorrect)

  total_lines = correct + total_incorrect
  num_total_lines = len(total_lines)

  print("")
  print("New Total Incorrect Entries: %i" % num_total_incorrect)
  print("New Total Entries: %i" % num_total_lines)

  # Shuffle final set
  random.shuffle(total_lines)

  return total_lines

### CoLA Train and Validation Parsing

This part of the code produces 'cola-train.tsv' and 'cola-validate.tsv' from the input file 'in_domain_train.tsv'.
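
A minimal sketch of the shuffle-then-split step, assuming a 10% validation fraction and the 8551 entries of the CoLA training file. The function name `split_validation`, the `seed` parameter and the placeholder lines are assumptions for illustration.

```python
import random

def split_validation(lines, validation_fraction, seed=None):
    """Shuffle a copy of lines and return (train_lines, validation_lines)."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    num_val = round(len(shuffled) * validation_fraction)
    # The two slices share the boundary index, so no entry is dropped
    return shuffled[num_val:], shuffled[:num_val]

# Placeholder lines standing in for the parsed CoLA training entries
lines = ["1\tsentence %i\n" % i for i in range(8551)]
train, val = split_validation(lines, 0.1, seed=0)
print(len(train), len(val))  # 7696 855
```

Slicing both halves at the same index keeps the two sets disjoint while still covering every entry.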



In [None]:
###################################
#           Count lines           #
###################################

num_lines = count_lines(COLA_TRAIN_INPUT)
print("Total Entries: %i" % num_lines)

Total Entries: 8551


In [None]:
###################################
#    Parse into correct format    #
###################################

parsed_lines = cola_parse(COLA_TRAIN_INPUT, num_lines)

Processing sentence 8551/8551: 100%

In [None]:
##########################################
#  Split into train and validation sets  #
##########################################

# Get number of validation lines based on VALIDATION_PERC
num_validation_lines = round(len(parsed_lines) * VALIDATION_PERC)

# Shuffle before splitting
random.shuffle(parsed_lines)

# Split into two lists
validation_lines = parsed_lines[:num_validation_lines]
train_lines = parsed_lines[num_validation_lines:]

print("Total Train Output Entries: %i" % (len(parsed_lines) - num_validation_lines))
print("Total Validation Output Entries: %i" % num_validation_lines)

Total Train Output Entries: 7696
Total Validation Output Entries: 855


In [None]:
###################################
#    Duplicate incorrect lines    #
###################################

print("Train Oversampling:")
total_train_lines = cola_oversample(train_lines)

print("\nValidation Oversampling:")
total_val_lines = cola_oversample(validation_lines)


Train Oversampling:
Total Correct Entries: 5414
Total Incorrect Entries: 2281

New Total Incorrect Entries: 5414
New Total Entries: 10828

Validation Oversampling:
Total Correct Entries: 608
Total Incorrect Entries: 247

New Total Incorrect Entries: 608
New Total Entries: 1216


In [None]:
###################################
#      Write to output files      #
###################################

# Write to train file
write_parsed(COLA_TRAIN_OUTPUT, total_train_lines)

# Write to validation file
write_parsed(COLA_VAL_OUTPUT, total_val_lines)

Writing to cola-train.tsv... Complete
Writing to cola-validate.tsv... Complete


### CoLA Test Parsing

This part produces the file 'cola-test.tsv' using the file 'out_of_domain_dev.tsv' as input.
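
The test set is balanced in the same way as the training and validation sets: unacceptable sentences are repeated until both classes are the same size. The arithmetic behind that step can be sketched as below, using the class counts (354 correct, 162 incorrect) from this notebook's recorded output. `oversample_plan` is a hypothetical helper, not a function defined in this notebook.

```python
def oversample_plan(num_correct, num_incorrect):
    """Return (full_copies, extra_lines) of the minority class to add."""
    difference = num_correct - num_incorrect
    if difference <= 0 or num_incorrect == 0:
        return 0, 0  # nothing to add, or nothing to copy from
    # Whole extra copies of the minority set, plus a random remainder
    return divmod(difference, num_incorrect)

full_copies, extra_lines = oversample_plan(354, 162)
new_incorrect = 162 * (1 + full_copies) + extra_lines
print(full_copies, extra_lines, new_incorrect)  # 1 30 354
```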

In [None]:
###################################
#           Count lines           #
###################################

num_lines = count_lines(COLA_TEST_INPUT)
print("Total Entries: %i" % num_lines)

Total Entries: 516


In [None]:
###################################
#    Parse into correct format    #
###################################

parsed_lines = cola_parse(COLA_TEST_INPUT, num_lines)

Processing sentence 516/516: 100%

In [None]:
###################################
#    Duplicate incorrect lines    #
###################################

total_lines = cola_oversample(parsed_lines)

Total Correct Entries: 354
Total Incorrect Entries: 162

New Total Incorrect Entries: 354
New Total Entries: 708


In [None]:
####################################
#       Write to output file       #
####################################

# Write to test file
write_parsed(COLA_TEST_OUTPUT, total_lines)

Writing to cola-test.tsv... Complete


## **Download Files**

In [None]:
####################################
#         Download Locally         #
####################################

files.download(LANG_OUTPUT)
files.download(COLA_TRAIN_OUTPUT)
files.download(COLA_VAL_OUTPUT)
files.download(COLA_TEST_OUTPUT)


In [None]:
####################################
#     Download to Google Drive     #
####################################

# Folder path to download the files to
# Note: folder must already exist and will not be created when downloading
folder_path = 'Galileo'

from google.colab import drive
drive.mount('/content/drive')

os.system('cp {} "./drive/My Drive/{}/{}"'.format(LANG_OUTPUT, folder_path, LANG_OUTPUT))
os.system('cp {} "./drive/My Drive/{}/{}"'.format(COLA_TRAIN_OUTPUT, folder_path, COLA_TRAIN_OUTPUT))
os.system('cp {} "./drive/My Drive/{}/{}"'.format(COLA_VAL_OUTPUT, folder_path, COLA_VAL_OUTPUT))
os.system('cp {} "./drive/My Drive/{}/{}"'.format(COLA_TEST_OUTPUT, folder_path, COLA_TEST_OUTPUT))