# NATURAL LANGUAGE PARSING WITH REGULAR EXPRESSIONS

By using Python’s regular expression module re and the Natural Language Toolkit (NLTK), you can find keywords of interest, discover where and how often they are used, and discern the parts-of-speech patterns in which they appear, revealing the sometimes hidden meaning in a piece of writing.

## Compiling and Matching

In [54]:
import re

# Define two character names
character_1 = "Dorothy"
character_2 = "Henry"

# Compile a regular expression object that matches exactly 7 ASCII letters (A-Z or a-z)
regular_expression = re.compile("[A-Za-z]{7}")

# check for a match to character_1 here
result_1 = regular_expression.match(character_1)
print(result_1)

# store and print the matched text here
match_1 = result_1.group(0)
print(match_1)

# reuse the compiled regular expression to check for a match to character_2 ("Henry" has only 5 letters, so the result is None)
result_2 = regular_expression.match(character_2)
print(result_2)

<re.Match object; span=(0, 7), match='Dorothy'>
Dorothy
None
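
Because .match() anchors only at the start of the string, a longer name can still match on its first seven letters. The sketch below assumes the same pattern as above and adds the name "Scarecrow" purely for illustration. It contrasts .match() with .fullmatch(), which requires the whole string to fit the pattern.

```python
import re

# Same pattern as above: exactly seven ASCII letters.
seven_letters = re.compile(r"[A-Za-z]{7}")

# "Scarecrow" is an illustrative extra name, not part of the original cell.
names = ["Dorothy", "Henry", "Scarecrow"]

# .match() succeeds if the first seven characters are letters.
prefix_matches = {name: bool(seven_letters.match(name)) for name in names}

# .fullmatch() succeeds only if the entire string is seven letters.
exact_matches = {name: bool(seven_letters.fullmatch(name)) for name in names}

print(prefix_matches)
print(exact_matches)
```

Both methods return None when there is no match, so calling .group() on the result without checking first would raise an AttributeError for "Henry".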


## Searching and Finding

Unlike .match(), which only finds a match at the start of a string, .search() scans the entire text from left to right and returns a match object for the first match to the given regular expression. If no match is found, .search() returns None.

In [55]:
result = re.search(r"\w{8}","Are you a Munchkin?")
print(result.group(0))

Munchkin
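
To make the difference between the two methods concrete, here is a minimal sketch that runs both on the same sentence and guards against a None result before reading the match. The variable names (at_start, anywhere, found) are illustrative.

```python
import re

sentence = "Are you a Munchkin?"
pattern = re.compile(r"\w{8}")

# .match() looks only at the start: "Are you " contains spaces, so no match.
at_start = pattern.match(sentence)

# .search() scans the whole string and stops at the first match.
anywhere = pattern.search(sentence)

# Check for None before calling methods on the match object.
found = None
if anywhere is not None:
    found = (anywhere.group(0), anywhere.start(), anywhere.end())

print(at_start)
print(found)
```

The start and end offsets answer the "where" question for a keyword, not just the "whether".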




Given a regular expression as its first argument and a string as its second argument, .findall() will return a list of all non-overlapping matches of the regular expression in the string. Consider the following piece of text:

In [56]:
text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."

list_of_matches = re.findall(r"\w{8}",text)

print(list_of_matches)

['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']
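
Note that \w{8} also matches the first eight characters of longer words, which is why 'Everythi' appears in the list. As a hedged sketch of two common refinements, the snippet below uses the word-boundary anchor \b to keep only words that are exactly eight characters long, and collections.Counter plus re.finditer() to see how often and where a keyword occurs. The names whole_words, mention_counts, and positions are illustrative.

```python
import re
from collections import Counter

text = "Everything is green here, while in the country of the Munchkins blue was the favorite color. But the people do not seem to be as friendly as the Munchkins, and I'm afraid we shall be unable to find a place to pass the night."

# \b anchors the pattern at word boundaries, so only whole 8-character words match.
whole_words = re.findall(r"\b\w{8}\b", text)

# Count how often the keyword appears, with or without a trailing "s".
mention_counts = Counter(re.findall(r"Munchkins?", text))

# finditer() yields match objects, which carry the position of each match.
positions = [match.start() for match in re.finditer(r"Munchkin", text)]

print(whole_words)
print(mention_counts)
print(positions)
```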




## Part of speech tagging

You can often find more meaning by analyzing text on a word-by-word basis, focusing on the part of speech of each word in a sentence.

- Noun: the name of a person (Ramona, class), place, thing (textbook), or idea (NLP)
- Pronoun: a word used in place of a noun (her, she)
- Determiner: a word that introduces, or “determines”, a noun (the)
- Verb: expresses action (studying) or being (are, has)
- Adjective: modifies or describes a noun or pronoun (new)
- Adverb: modifies or describes a verb, an adjective, or another adverb (happily)
- Preposition: a word placed before a noun or pronoun to form a phrase modifying another word in the sentence (on)
- Conjunction: a word that joins words, phrases, or clauses (and)
- Interjection: a word used to express emotion (Wow)

In [57]:
from nltk import pos_tag

word_sentence = ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?']

part_of_speech_tagged_sentence = pos_tag(word_sentence)

print(part_of_speech_tagged_sentence)


[('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'), ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('heart', 'NN'), ('?', '.')]
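
Once a sentence is tagged, ordinary Python is enough to summarize it. As a small sketch, the snippet below copies the tagged output shown above by hand (so pos_tag and its model are not needed to run it), counts how often each tag occurs, and pulls out the verbs. The names tag_counts and verbs are illustrative.

```python
from collections import Counter

# Tagged output copied from the cell above.
tagged = [('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
          ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
          ('heart', 'NN'), ('?', '.')]

# Count how many times each part-of-speech tag appears.
tag_counts = Counter(tag for _, tag in tagged)

# All verb tags start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(tag_counts.most_common(3))
print(verbs)
```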


## Chunking

With chunking in nltk, you can define a pattern of parts-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or chunks of words, in the part-of-speech tagged sentences of a text.

The regular expression you build to find chunks is called chunk grammar. A piece of chunk grammar can be written as follows:

chunk_grammar = "AN: {\<JJ>\<NN>}"

- AN is a user-defined name for the kind of chunk you are searching for. You can use whatever name makes sense given your chunk grammar. In this case AN stands for adjective-noun.
- A pair of curly braces {} surrounds the actual chunk grammar.
- \<JJ> operates similarly to a regex character class, matching any word tagged JJ (an adjective).
- \<NN> matches any word tagged NN (a singular noun). Plural nouns are tagged NNS, so a pattern such as \<NN.*> is needed to match them as well.

The chunk grammar above will thus match any adjective that is immediately followed by a singular noun.



In [58]:
from nltk import RegexpParser, Tree

# Match any adjective followed by a noun
chunk_grammar = "AN: {<JJ><NN>}"

# Create a nltk RegexpParser object and give it a piece of chunk grammar as an argument
chunk_parser = RegexpParser(chunk_grammar)


pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

# use the RegexpParser object’s .parse() method, 
# which takes a list of part-of-speech tagged words as an argument, 
# and identifies where such chunks occur in the sentence!
chunked = chunk_parser.parse(pos_tagged_sentence)

print(chunked)

Tree.fromstring(str(chunked)).pretty_print()



(S where/WRB is/VBZ the/DT (AN emerald/JJ city/NN) ?/.)
                   S                              
     ______________|____________________           
    |       |      |     |              AN        
    |       |      |     |       _______|_____     
where/WRB is/VBZ the/DT ?/. emerald/JJ     city/NN



### Chunking noun phrases

Noun phrase chunking is particularly useful for determining meaning and bias in text.

In [59]:
pos_tagged_sentence = [('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]

chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# Rebuild the parser so that it uses the NP grammar instead of the earlier AN grammar
chunk_parser = RegexpParser(chunk_grammar)

chunked = chunk_parser.parse(pos_tagged_sentence)
print(chunked)
Tree.fromstring(str(chunked)).pretty_print()

(S
  we/PRP
  are/VBP
  so/RB
  grateful/JJ
  to/TO
  you/PRP
  for/IN
  having/VBG
  killed/VBN
  (NP the/DT wicked/JJ witch/NN)
  of/IN
  (NP the/DT east/NN)
  ,/,
  and/CC
  for/IN
  setting/VBG
  our/PRP$
  people/NNS
  free/VBP
  from/IN
  (NP bondage/NN)
  ./.)
[pretty-printed tree omitted: too wide to display]

- NP is the user-defined name of the chunk you are searching for. In this case NP stands for noun phrase.
- \<DT> matches any determiner.
- ? is an optional quantifier, matching either 0 or 1 determiners.
- \<JJ> matches any word tagged JJ (an adjective).
- \* is the Kleene star quantifier, matching 0 or more occurrences of an adjective.
- \<NN> matches any word tagged NN (a singular noun). Plural nouns are tagged NNS and are not matched by this pattern.
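
A tag pattern matches the whole tag, so <NN> covers the tag NN but not NNS. The sketch below checks this on the tail of the sentence above, comparing the grammar as written with a variant that uses <NN.*> to include plural nouns. The helper chunk_phrases and the variant grammar are illustrative additions, not part of the original notebook.

```python
from nltk import RegexpParser

def chunk_phrases(grammar, tagged_sentence, label="NP"):
    """Return the text of every chunk with the given label (hypothetical helper)."""
    tree = RegexpParser(grammar).parse(tagged_sentence)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

# Tail of the tagged sentence used above.
tagged = [('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'),
          ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]

# <NN> matches only the tag NN, so the plural "people" (NNS) is skipped.
singular_only = chunk_phrases("NP: {<DT>?<JJ>*<NN>}", tagged)

# <NN.*> matches NN followed by any characters, which also covers NNS.
with_plurals = chunk_phrases("NP: {<DT>?<JJ>*<NN.*>}", tagged)

print(singular_only)
print(with_plurals)
```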

### Chunking Verb Phrases

Another popular type of chunking is VP-chunking, or verb phrase chunking. A verb phrase is a phrase that contains a verb and its complements, objects, or modifiers.

Verb phrases can take a variety of structures, and here you will consider two. The first structure begins with a verb VB of any tense, followed by a noun phrase, and ends with an optional adverb RB of any form. The second structure switches the order of the verb and the noun phrase, but also ends with an optional adverb.

- VP is the user-defined name of the chunk you are searching for. In this case VP stands for verb phrase.
- \<VB.*> matches any verb, using . as a wildcard and the * quantifier to match 0 or more occurrences of any character. This ensures matching every verb tag (e.g. VB for the base form, VBD for past tense, or VBN for past participle).
- \<DT>?\<JJ>*\<NN> matches a noun phrase: an optional determiner, any number of adjectives, and a singular noun.
- \<RB.?> matches any adverb, using . as a wildcard and the optional quantifier to match 0 or 1 occurrence of any character. This ensures matching any form of adverb (regular RB, comparative RBR, or superlative RBS).
- ? is an optional quantifier, matching either 0 or 1 adverbs.

In [60]:
verb1 = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"

verb2 = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"
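
As a rough illustration of how the two grammars behave, the sketch below runs each one on a short sentence. Both sentences are hypothetical and hand-tagged for this example (they do not come from pos_tag), and the helper vp_chunks is an illustrative name.

```python
from nltk import RegexpParser

verb1 = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"
verb2 = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

def vp_chunks(grammar, tagged_sentence):
    """Return the text of every VP chunk found (hypothetical helper)."""
    tree = RegexpParser(grammar).parse(tagged_sentence)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "VP")
    ]

# Hand-tagged example sentences (assumed tags, for illustration only).
verb_first = [('she', 'PRP'), ('melted', 'VBD'), ('the', 'DT'),
              ('wicked', 'JJ'), ('witch', 'NN'), ('completely', 'RB')]
noun_first = [('the', 'DT'), ('scarecrow', 'NN'), ('laughed', 'VBD'),
              ('loudly', 'RB')]

# verb1: verb, then noun phrase, then optional adverb.
verb_first_chunks = vp_chunks(verb1, verb_first)

# verb2: noun phrase, then verb, then optional adverb.
noun_first_chunks = vp_chunks(verb2, noun_first)

print(verb_first_chunks)
print(noun_first_chunks)
```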

## Chunk filtering

Chunk filtering lets you define what parts of speech you do not want in a chunk and remove them.

A popular method for performing chunk filtering (called chinking in NLTK's documentation) is to chunk an entire sentence together and then indicate which parts of speech are to be filtered out. If the filtered parts of speech are in the middle of a chunk, the chunk is split into two separate chunks. The chunk grammar you can use to perform chunk filtering is given below:

In [61]:
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""


- NP is the user-defined name of the chunk you are searching for. In this case NP stands for noun phrase.
- The curly braces {} indicate what parts of speech you are chunking. <.*>+ matches every part of speech in the sentence.
- The inverted curly braces }{ indicate which parts of speech you want to filter from the chunk. <VB.?|IN>+ will filter out any verbs or prepositions.
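
To see the splitting behavior in practice, here is a minimal sketch that applies this grammar to the "emerald city" sentence used earlier. The whole sentence is first chunked as one NP, then the verb is filtered out, which leaves two separate chunks. The name np_chunks is illustrative.

```python
from nltk import RegexpParser

chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""

chunk_parser = RegexpParser(chunk_grammar)

pos_tagged_sentence = [('where', 'WRB'), ('is', 'VBZ'), ('the', 'DT'),
                       ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')]

filtered = chunk_parser.parse(pos_tagged_sentence)

# Collect the words of each remaining NP chunk.
np_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in filtered.subtrees(filter=lambda t: t.label() == "NP")
]

print(filtered)
print(np_chunks)
```

Because the grammar chunks every tag first, punctuation such as ?/. stays inside a chunk unless its tag is added to the filter pattern.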