<a href="https://colab.research.google.com/github/bharathkumar-kancharla/Natural-Language-Processing/blob/master/Rule_Based_Approaches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rule Based Approach

## Regular Expressions

[Python re module – Documentation](https://docs.python.org/3/library/re.html)

[regex cheat sheet](https://github.com/tartley/python-regex-cheatsheet/blob/master/cheatsheet.rst)

[Validate regex patterns](https://pythex.org/)

In [0]:
import re

In [2]:
# Extract the model name from a product string
model_string = "SAMSUNG Galaxy M21 4GB RAM 64GB"
model_string = model_string.replace('SAMSUNG', '')  # drop the brand
# Remove memory specs such as "4GB RAM", "64GB" or "4 G" (raw string keeps \d and \s intact)
model = re.sub(r"\d+GB\s{1,5}RAM|\d+GB|\d+\s{0,5}G",
               "", model_string).strip()
model

'Galaxy M21'
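
The same idea can be turned around: instead of deleting the memory specs, capture them with named groups. This is a minimal sketch, assuming every product string follows the `<brand> <model> <n>GB RAM <n>GB` layout of the example above. The helper `parse_product`, the pattern `PRODUCT_RE` and the field names are hypothetical, not part of the original notebook.

```python
import re

# Hypothetical helper: split a product string into brand, model, RAM and storage.
# Assumes the layout "<BRAND> <model words> <n>GB RAM <n>GB" seen in the example.
PRODUCT_RE = re.compile(
    r"^(?P<brand>\S+)\s+"          # first word is the brand
    r"(?P<model>.+?)\s+"           # model: everything up to the RAM spec (non-greedy)
    r"(?P<ram>\d+)\s*GB\s+RAM\s+"  # RAM size in GB
    r"(?P<storage>\d+)\s*GB$"      # storage size in GB
)

def parse_product(product_string):
    match = PRODUCT_RE.match(product_string.strip())
    if match is None:
        return None  # string does not follow the assumed layout
    fields = match.groupdict()
    fields["ram"] = int(fields["ram"])
    fields["storage"] = int(fields["storage"])
    return fields

parsed = parse_product("SAMSUNG Galaxy M21 4GB RAM 64GB")
print(parsed)
```

Strings that do not match the assumed layout return `None` rather than a partial result, which makes rule failures easy to spot.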

**Syntax:** The study of the rules that govern how words combine to form sentences in a language

> Sentences are composed of discrete units combined by rules

A syntax tree is a tree representation of the syntactic structure of a sentence or string
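
A syntax tree can be written in bracketed form and inspected without any drawing backend. The sketch below uses `nltk.Tree.fromstring`; the sentence and its bracketing are made up for illustration.

```python
from nltk import Tree

# Bracketed notation: each "(LABEL child ...)" is one node of the syntax tree.
tree = Tree.fromstring("(S (NP John) (VP (V saw) (NP (Det the) (N dog))))")

print(tree.label())    # category at the root
print(tree.leaves())   # the words, read left to right
for subtree in tree.subtrees():
    # every node with the words it spans
    print(subtree.label(), "->", " ".join(subtree.leaves()))
```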

## Context-free grammar

- List of rules that define the set of well-formed sentences in a language
- A context-free grammar specifies which rules can be applied to produce a sequence of words
- Each rule has a left-hand side, which identifies a syntactic category, and a right-hand side, which defines its alternative component parts, reading from left to right

<img src='http://www.bowdoin.edu/~allen/nlp/fig2.GIF'>

To read more on context-free grammars, [click here](http://www.bowdoin.edu/~allen/nlp/nlp1.html)
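
The rule format described above (a left-hand category and right-hand alternatives read left to right) can be sketched in plain Python. The toy grammar `RULES` and the function `expand` are hypothetical; the sketch assumes the grammar is not recursive, otherwise the expansion would never terminate.

```python
from itertools import product

# Toy grammar (hypothetical): each left-hand side maps to a list of alternatives,
# and each alternative is a sequence of symbols read left to right.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["John"], ["Mary"]],
    "V": [["saw"], ["met"]],
}

def expand(symbol):
    """Return every word sequence that `symbol` can produce."""
    if symbol not in RULES:          # not a category, so it is a word (terminal)
        return [[symbol]]
    results = []
    for alternative in RULES[symbol]:
        # Expand each symbol of the alternative, then combine left to right.
        parts = [expand(s) for s in alternative]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(sentences)
```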


[Automata theory](https://www.tutorialspoint.com/automata_theory/introduction_to_grammars.htm) studies abstract machines and the formal grammars they recognize, which gives a formal basis for how a system can process natural language.

To draw the syntax tree in the notebook, download Ghostscript [here](https://www.ghostscript.com/download/gsdnld.html) and add its `bin` folder to the `PATH` variable

- Ghostscript is a PostScript and PDF interpreter; NLTK calls it to render the tree as an image

### Chunking

Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a *chunk*. We can also define patterns for what kinds of words should not be in a *chunk*. These unchunked words are known as *chinks*.

**chunking** creates chunks, while **chinking** breaks up those chunks

In [0]:
import os
path_to_gs = ""  # path to the Ghostscript bin folder

os.environ['PATH'] += os.pathsep + path_to_gs  # append it to the PATH environment variable
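
Before relying on the modified `PATH`, it can help to check whether a Ghostscript executable is actually visible. This is a small sketch using only the standard library. The executable names in `CANDIDATES` are an assumption about common installations, and `find_ghostscript` is a hypothetical helper.

```python
import shutil

# Assumption: the Ghostscript executable is called "gs" (Linux/macOS) or
# "gswin64c"/"gswin32c" (Windows); adjust the names for your installation.
CANDIDATES = ("gs", "gswin64c", "gswin32c")

def find_ghostscript(candidates=CANDIDATES):
    """Return the full path of the first Ghostscript executable on PATH, else None."""
    for name in candidates:
        location = shutil.which(name)
        if location is not None:
            return location
    return None

gs_path = find_ghostscript()
print(gs_path if gs_path else "Ghostscript not found on PATH")
```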

In [0]:
import nltk
import nltk.corpus
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.data import load  # to access the built-in data

In [0]:
sent = 'we are learning about chunking, which is sub-topic in NLP'

In [6]:
nltk.download('punkt')  #punkt resource is required to tokenize the sentence
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [8]:
sent_tokenize = word_tokenize(sent)
sent_tokenize

['we',
 'are',
 'learning',
 'about',
 'chunking',
 ',',
 'which',
 'is',
 'sub-topic',
 'in',
 'NLP']

In [9]:
for token in sent_tokenize:
    print(nltk.pos_tag([token]))

[('we', 'PRP')]
[('are', 'VBP')]
[('learning', 'VBG')]
[('about', 'IN')]
[('chunking', 'VBG')]
[(',', ',')]
[('which', 'WDT')]
[('is', 'VBZ')]
[('sub-topic', 'NN')]
[('in', 'IN')]
[('NLP', 'NN')]


In [10]:
nltk.download('gutenberg')
nltk.download('abc')
s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
print(s)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package abc to /root/nltk_data...
[nltk_data]   Package abc is already up-to-date!
PM denies knowledge of AWB kickbacks
The Prime Minister has 


In [11]:
print(os.listdir(nltk.data.find("corpora"))) # checks the files loaded to corpora

['abc', 'gutenberg', 'gutenberg.zip', 'abc.zip']


In [12]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [13]:
caesar = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')
len(caesar)

25833

In [0]:
caesar_pos = []

# Tag the first 2000 words one at a time; each word is tagged without its context
for word in caesar[:2000]:
  word_pos = nltk.pos_tag([word])
  caesar_pos.append(word_pos)

In [15]:
caesar_pos

[[('[', 'NN')],
 [('The', 'DT')],
 [('Tragedie', 'NN')],
 [('of', 'IN')],
 [('Julius', 'NN')],
 [('Caesar', 'NN')],
 [('by', 'IN')],
 [('William', 'NNP')],
 [('Shakespeare', 'NN')],
 [('1599', 'CD')],
 [(']', 'NN')],
 [('Actus', 'NN')],
 [('Primus', 'NN')],
 [('.', '.')],
 [('Scoena', 'NN')],
 [('Prima', 'NN')],
 [('.', '.')],
 [('Enter', 'NN')],
 [('Flauius', 'NN')],
 [(',', ',')],
 [('Murellus', 'NN')],
 [(',', ',')],
 [('and', 'CC')],
 [('certaine', 'NN')],
 [('Commoners', 'NNS')],
 [('ouer', 'NN')],
 [('the', 'DT')],
 [('Stage', 'NN')],
 [('.', '.')],
 [('Flauius', 'NN')],
 [('.', '.')],
 [('Hence', 'NN')],
 [(':', ':')],
 [('home', 'NN')],
 [('you', 'PRP')],
 [('idle', 'JJ')],
 [('Creatures', 'NNS')],
 [(',', ',')],
 [('get', 'VB')],
 [('you', 'PRP')],
 [('home', 'NN')],
 [(':', ':')],
 [('Is', 'NN')],
 [('this', 'DT')],
 [('a', 'DT')],
 [('Holiday', 'NN')],
 [('?', '.')],
 [('What', 'WP')],
 [(',', ',')],
 [('know', 'VB')],
 [('you', 'PRP')],
 [('not', 'RB')],
 [('(', '(')],
 [('

In [16]:
sent = "we are here to learn about natural language processing"
sent_tokens = nltk.pos_tag(word_tokenize(sent))

# Rules can be written as below - Noun phrase chunk using regular expression
grammar_np = r"NP:{<DT>?<JJ>*<NN>}"

# Create chunk parser and pass Noun Phrase string to it
chunk_parser = nltk.RegexpParser(grammar_np)

# parse() function to parse our sentence
chunk_result = chunk_parser.parse(sent_tokens)
chunk_result

TclError: ignored

Tree('S', [('we', 'PRP'), ('are', 'VBP'), ('here', 'RB'), ('to', 'TO'), ('learn', 'VB'), ('about', 'IN'), Tree('NP', [('natural', 'JJ'), ('language', 'NN')]), Tree('NP', [('processing', 'NN')])])

The `TclError` above is discussed in this Stack Overflow thread: https://stackoverflow.com/questions/49478228/tclerror-no-display-name-and-no-display-environment-variable-in-googles-colab

The cause is `tkinter`, which NLTK uses here to draw the tree.

`Tk` normally creates a GUI (such as a new window) for its output. Colab, however, runs on a server in the cloud and cannot open a window on your machine. You can interact with it only through the notebook interface, so the tree is shown in its text form instead.
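
Where no display is available, the chunk tree can still be inspected as text. The sketch below reuses the tagged tokens from the output above, so it needs no tagger download. The variable names are illustrative.

```python
import nltk

# POS-tagged tokens copied from the parse output above.
tagged = [('we', 'PRP'), ('are', 'VBP'), ('here', 'RB'), ('to', 'TO'),
          ('learn', 'VB'), ('about', 'IN'), ('natural', 'JJ'),
          ('language', 'NN'), ('processing', 'NN')]

chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# print() uses the bracketed text form and never touches tkinter.
print(tree)

# Collect just the noun-phrase chunks as plain strings.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```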

### Chinking

In [0]:
chink_grammar = r"""
    chk_name: #chunk name
    {<PRP>?<VB|VBD|VBZ|VBG>*<RB|RBR>?}   #Chunk regex sequence
    }<RB>+{ #chink regex sequence - adverb

    """

In [18]:
chink_parser = nltk.RegexpParser(chink_grammar)
chink_parser.parse(sent_tokens)

TclError: ignored

Tree('S', [Tree('chk_name', [('we', 'PRP')]), ('are', 'VBP'), ('here', 'RB'), ('to', 'TO'), Tree('chk_name', [('learn', 'VB')]), ('about', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN')])

### CFG:

In [0]:
CFG_Grammar = nltk.CFG.fromstring("""

s -> NP VP
VP -> V N
V -> "saw"|"met"
NP -> "John"|"Jim"
N -> "dog"|"cat"
"""
)

In [20]:
from nltk.parse.generate import generate
# Print every sentence that can be generated using the rules:
for sentence in generate(CFG_Grammar):
  print(" ".join(sentence))

John saw dog
John saw cat
John met dog
John met cat
Jim saw dog
Jim saw cat
Jim met dog
Jim met cat


In [21]:
# List the grammar's production rules with productions():
CFG_Grammar.productions()

[s -> NP VP,
 VP -> V N,
 V -> 'saw',
 V -> 'met',
 NP -> 'John',
 NP -> 'Jim',
 N -> 'dog',
 N -> 'cat']
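
A grammar can also be used in the other direction, to check whether a given sentence is well-formed and to recover its syntax tree. This is a minimal sketch with `nltk.ChartParser`. The grammar is repeated so the block runs on its own, and the helper `parses` is hypothetical.

```python
import nltk

# Same toy grammar as above, repeated so the block is self-contained.
grammar = nltk.CFG.fromstring("""
s -> NP VP
VP -> V N
V -> "saw" | "met"
NP -> "John" | "Jim"
N -> "dog" | "cat"
""")

parser = nltk.ChartParser(grammar)

def parses(sentence):
    """Return all syntax trees the grammar assigns to a whitespace-split sentence."""
    return list(parser.parse(sentence.split()))

good = parses("John saw dog")
bad = parses("dog saw John")   # known words, but the order breaks s -> NP VP

print(good[0])     # bracketed syntax tree for the grammatical sentence
print(len(bad))    # number of parses for the ungrammatical one
```

Note that the parser raises a `ValueError` if the sentence contains a word the grammar does not cover, so only words from the rules are used here.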

**Automating Text Paraphrasing:**

In [0]:
def cfg_parse(sentence):
  # Tag the sentence, then keep the last proper noun, past-tense verb and noun found
  sent_tk = nltk.pos_tag(word_tokenize(sentence))
  s_NP = s_V = s_N = None
  for word, tag in sent_tk:
    if tag == 'NNP':
      s_NP = "'" + word + "'"
    elif tag in ('VBD', 'VBN'):
      s_V = "'" + word + "'"
    elif tag == 'NN':
      s_N = "'" + word + "'"
  if None in (s_NP, s_V, s_N):
    raise ValueError("sentence needs a proper noun, a past-tense verb and a noun")
  # Build a grammar from the extracted words
  cfg_grammar = nltk.CFG.fromstring("""

  s -> NP VP
  VP -> V N
  NP -> {}
  V -> {}
  N -> {}
  """.format(s_NP, s_V, s_N))
  # Generate from the grammar built above, not from the global CFG_Grammar
  for generated in generate(cfg_grammar):
    print(" ".join(generated))
  return

In [23]:
cfg_parse("John saw a long white boat")


[Sample codes for CFG](http://www.nltk.org/howto/generate.html)