![](http://i67.tinypic.com/2jcbwcw.png)

## Naive Grammar Checker

NLTK - Build a syntax tree and validate inputs against it.

**Author List**: Sindhuja Jeyabal

**Original Sources**: http://nltk.org

**License**: Feel free to do whatever you want to with this code

This notebook uses the nltk package to implement a primitive grammar checker. The grammar is built on the fly from a training set; here, the web text corpus that ships with nltk. You can check the validity of an input sentence with the function

                        test_input(INPUT STRING TO BE TESTED)  

The checker is primitive because it treats the syntax tree built from the training set as a universal grammar: a transition from a known tag that never occurred in the training set is treated as an error. This IPython notebook is only a simple example of how the nltk package can be used for text mining, and the code can certainly be made more sophisticated.
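
To make that limitation concrete, here is a minimal sketch that uses no nltk calls at all. The tags are typed in by hand (an assumption, not tagger output), and the helper `follows_training` is a hypothetical name that does not appear in the notebook. It treats the adjacent tag pairs of one training sentence as the whole grammar, so a grammatical sentence with unseen tag pairs is rejected.

```python
# Hand-written Penn Treebank tags; assumed for illustration, not tagger output.
training_tags = ["PRP", "VBP", "DT", "NN"]   # "I am a student"
candidate_tags = ["PRP", "VBZ", "JJ"]        # "She is happy"

# Every adjacent (tag, next_tag) pair seen in training counts as "allowed".
allowed_pairs = set(zip(training_tags, training_tags[1:]))

def follows_training(tags, allowed):
    """Return True if every adjacent tag pair in `tags` was seen in training."""
    return all(pair in allowed for pair in zip(tags, tags[1:]))

print(follows_training(training_tags, allowed_pairs))
print(follows_training(candidate_tags, allowed_pairs))
```

The second sentence is fine English, yet it fails the check because the pair PRP followed by VBZ never occurred in the one-sentence training set. A larger training corpus shrinks this problem but does not remove it.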

In [1]:
import nltk
from nltk.corpus import brown
from nltk.corpus import webtext
from nltk import word_tokenize, sent_tokenize
from nltk.tokenize import RegexpTokenizer
import string

In [2]:
# Built-in regex tokenizer: the pattern describes the tokens to keep (runs of word characters), so punctuation is not captured.
tokenizer = RegexpTokenizer(r'\w+')
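
As a rough illustration of what the `\w+` pattern keeps and drops, the sketch below uses the standard library's `re.findall`. It assumes that `RegexpTokenizer(r'\w+')` returns the same matches for this pattern, and the sample string is made up.

```python
import re

sample = "I don't know, really."  # hypothetical input

# Keep only runs of word characters; everything else acts as a separator.
tokens = re.findall(r"\w+", sample)
print(tokens)
```

Punctuation disappears, and the contraction is split at the apostrophe into `don` and `t`. This can affect the POS tags assigned later.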

In [3]:
# Utility function to tag Part of Speech (POS) in the training corpus
def tag_corpus(corpus_sents):
    sents = list()
    for sent in corpus_sents:
        sents.append(list(sent))  # punctuation tokens are kept; filter on string.punctuation to drop them
    tag_sents = nltk.pos_tag_sents(sents)
    return tag_sents

In [4]:
# Utility function to create the syntax tree for the grammar:
# a dict mapping each POS tag to the list of tags observed immediately after it
def create_tag_matrix(sentence_tags):
    tag_matrix = dict()
    for item in sentence_tags:
        for i in (range(len(item)-1)):
            t = item[i][1]
            tag_matrix[t] = tag_matrix.get(t, [])
            tag_matrix[t].append(item[i+1][1])
    for k in tag_matrix.keys():
        tag_matrix[k] = list(set(tag_matrix[k]))
    return tag_matrix
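
To show the shape of the resulting structure, here is a small sketch of the same idea on hand-tagged input. The function name `build_tag_table` and the tags are assumptions made for illustration. It collects followers in sets rather than de-duplicating lists afterwards, which is one possible variation and not the notebook's own code.

```python
from collections import defaultdict

def build_tag_table(tagged_sents):
    """Map each POS tag to the sorted list of tags seen directly after it."""
    followers = defaultdict(set)
    for sent in tagged_sents:
        # Walk over adjacent (word, tag) pairs within one sentence.
        for (_, tag), (_, next_tag) in zip(sent, sent[1:]):
            followers[tag].add(next_tag)
    return {tag: sorted(nexts) for tag, nexts in followers.items()}

# Hand-tagged sentences in (word, tag) format; the tags are assumed for illustration.
tagged = [
    [("I", "PRP"), ("am", "VBP"), ("a", "DT"), ("student", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("score", "VB")],
]

table = build_tag_table(tagged)
print(table)
```

The last tag of each sentence never becomes a key unless it also occurs mid-sentence elsewhere, which is why sentence-final tags such as `NN` or `VB` have no entry here.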

In [5]:
# Utility function that extracts the Part of Speech (POS) tags from the input
def process_input(raw_text):
    # Split into sentences, then tokenize each sentence into words
    input_sents = [tokenizer.tokenize(s) for s in sent_tokenize(raw_text)]
    input_tags = list()
    for item in nltk.pos_tag_sents(input_sents):
        # Keep only the tag from each (word, tag) pair
        input_tags.append([t for (w, t) in item])
    print(input_tags)
    return input_tags

In [6]:
# Function to validate every sentence in the input
# Input: Model Tag matrix created from the training set, List of POS tags
# Output: True if the tags form a valid path in the syntax tree. False otherwise 
def validate_sent(tag_matrix, input_tags):
    is_valid = True
    for i in range(len(input_tags) - 1):
        if input_tags[i] not in tag_matrix:
            # Tag never seen as a predecessor in training: skip this position
            print("No rule found for " + input_tags[i] + " continuing...")
            continue
        if input_tags[i+1] not in tag_matrix[input_tags[i]]:
            print("sequence not found: " + input_tags[i] + " ---> " + input_tags[i+1])
            print("Incorrect sentence detected")
            is_valid = False
            break
    print("Complete")
    if is_valid:
        print("Valid Sentence")
    return is_valid

# Function to validate the POS tags in the entire input text.
# Input: Model Tag matrix created from the training set, nested list of POS tags
# Output: None
def validate_input(tag_matrix, input_tags):
    valid_vector = [validate_sent(tag_matrix, tags) for tags in input_tags]
    if all(valid_vector):
        print("###Valid Input###")
    else:
        print("!!!Invalid Input!!!")
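
One consequence of the validation logic is easy to miss: a tag with no rule is skipped rather than rejected. The sketch below makes this visible by returning the outcome instead of printing it. The function name `check_tags`, the returned tuple layout, and the hand-written table are all assumptions made for illustration.

```python
def check_tags(tag_table, tags):
    """Return (is_valid, first_bad_pair, skipped_tags) for one tag sequence."""
    skipped = []
    for current, following in zip(tags, tags[1:]):
        if current not in tag_table:
            skipped.append(current)  # no rule for this tag, so nothing is checked
            continue
        if following not in tag_table[current]:
            return False, (current, following), skipped
    return True, None, skipped

# Hand-written table; assumed for illustration.
tag_table = {"PRP": ["VBP"], "VBP": ["DT"], "DT": ["NN"]}

print(check_tags(tag_table, ["PRP", "VBP", "DT", "NN"]))
print(check_tags(tag_table, ["PRP", "DT"]))
print(check_tags(tag_table, ["JJ", "RB", "UH"]))
```

The third call is reported as valid even though none of its tags were ever seen, because each unknown tag is skipped. A stricter checker could treat a non-empty `skipped_tags` list as "unknown" rather than "valid".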

In [7]:
# Function to be called to test the validity of an input
# Input: Raw string
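# Output: None. Prints whether the string is grammatically correct (so far..)
# Note: uses the global web_matrix, which must be created before this function is called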
def test_input(raw_text):
    validate_input(web_matrix, process_input(raw_text))

In [14]:
# Create a grammar and a syntax tree with the webtext corpus available in nltk.

In [8]:
web_sents = webtext.sents()

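# Only the first 10 sentences of the corpus are tagged here; widen the slice for broader tag coverage
tagged_sents = tag_corpus(web_sents[:10])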
web_matrix = create_tag_matrix(tagged_sents)

In [11]:
# Test if user input is valid.
test_input("I am.")

[['PRP', 'VBP']]
Complete
Valid Sentence
###Valid Input###


A grammar can also be built from a training set typed in by the user.

In [12]:
train = "I am a student. I am working on a project on mining text using nltk. I want to score input text for correctness."
train_sents = [tokenizer.tokenize(s) for s in sent_tokenize(train)]
train_tags = nltk.pos_tag_sents(train_sents)
model_matrix = create_tag_matrix(train_tags)

In [13]:
validate_input(model_matrix, process_input('I a '))

[['PRP', 'DT']]
sequence not found: PRP ---> DT
Incorrect sentence detected
Complete
!!!Invalid Input!!!
