# Regular Expression and Word Tokenization

### Natural Language Processing: 
Making sense of language, for example:
- Topic identification
- Text classification

Applications:
- Chatbots
- Translation
- Sentiment analysis


### Regular Expressions: 
Strings with a special syntax that describe patterns to match in other strings.

Applications:
- Find all web links in a document
- Parse information or replace unwanted characters
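As a rough illustration of the two applications above, the sketch below pulls links out of a short made-up string and strips a few unwanted characters. The sample text and the link pattern are assumptions chosen for this example; the pattern is deliberately simple and will not cover every valid URL.

```python
import re

# Hypothetical sample text, not taken from the notes
doc = "Docs: https://example.com/a and http://example.org/b?x=1 *** (draft)"

# Simplified link pattern: "http" or "https", then "://", then non-space characters
links = re.findall(r"https?://\S+", doc)
print(links)

# Replace unwanted characters (here: asterisks and parentheses) with nothing
cleaned = re.sub(r"[*()]", "", doc)
print(cleaned)
```

`re.findall` returns every non-overlapping match as a list, while `re.sub` returns a new string with each match replaced.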

In [3]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"
print(my_string)

Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?


In [6]:
import re

# Match sentence endings: sentence_endings
sentence_endings = r"[.?!]"
print(re.split(sentence_endings, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
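The split result above keeps the leading spaces of each piece and ends with an empty string, because the text finishes on a sentence-ending character. One possible cleanup, sketched here with a hypothetical variable name `clean_parts`, strips each piece and drops the empty ones:

```python
import re

my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Split on sentence-ending punctuation, then strip whitespace and drop empty pieces
clean_parts = [part.strip() for part in re.split(r"[.?!]", my_string) if part.strip()]
print(clean_parts)
print(len(clean_parts))  # 5
```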


In [7]:
# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

['Let', 'RegEx', 'Won', 'Can', 'Or']


In [8]:
# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']


In [9]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

['4', '19']


### Tokenization: 
Converting a string or document into tokens (smaller units such as words and punctuation marks).
- You can create your own tokenization rules, for example separating punctuation from words.
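A custom rule for separating punctuation can be sketched with `re.findall` alone. The pattern below is only one possible rule, chosen for illustration: a token is either a run of word characters or a single character that is neither a word character nor whitespace.

```python
import re

# Assumed rule for this sketch: words, or single punctuation characters
token_pattern = r"\w+|[^\w\s]"

tokens = re.findall(token_pattern, "Hi there!")
print(tokens)  # ['Hi', 'there', '!']

# The same rule splits contractions at the apostrophe
print(re.findall(token_pattern, "Won't"))  # ['Won', "'", 't']
```

The second call shows a limitation of such a simple rule: contractions are broken into three pieces, which is one reason to prefer a purpose-built tokenizer for real text.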

In [12]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
print(word_tokenize("Hi there!"))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/samanvitha/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['Hi', 'there', '!']


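Other tokenizers:
- sent_tokenize: splits a document into sentences
- regexp_tokenize: splits a string or document into tokens based on a regular expression pattern
- TweetTokenizer: designed for tweets, so it keeps hashtags, mentions and emoticons together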

### Search vs match: 
re.search() = scans through the string and returns the first place where the pattern matches  
re.match() = matches only if the pattern is found at the beginning of the string
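The difference is easiest to see by running both functions on the same string. This is a small sketch using one line of dialogue as sample input:

```python
import re

text = "SOLDIER #1: Where'd you get the coconuts?"

# re.match only tries the pattern at position 0
print(re.match(r"coconuts", text))         # None
print(re.match(r"SOLDIER", text).span())   # (0, 7)

# re.search scans forward and returns the first match anywhere in the string
found = re.search(r"coconuts", text)
print(found.start(), found.end())          # 32 40
```

Both functions return `None` when there is no match, so check the result before calling methods such as `.start()` on it.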

In [14]:
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  
That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

In [15]:
# Import necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize


# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{'my', "'", 'Where', "n't", 'SOLDIER', 'right', 'No', 'held', 'point', 'You', 'Oh', 'there', 'they', 'is', 'lord', 'may', 'ounce', 'length', "'m", 'Pendragon', 'if', 'That', 'KING', 'What', 'son', 'King', 'me', 'search', 'snows', 'its', 'interested', 'African', 'use', 'of', 'found', '!', 'you', 'suggesting', 'court', 'martin', "'s", 'he', 'Pull', '#', 'these', 'other', 'coconut', 'carry', 'through', 'at', ',', 'an', 'weight', 'creeper', '.', 'are', 'kingdom', '2', 'where', 'They', 'England', 'plover', 'wants', 'grips', 'Well', 'forty-three', 'seek', 'five', 'anyway', 'tell', 'and', 'swallow', 'covered', '?', 'defeator', 'knights', 'question', 'In', 'it', 'wings', 'but', 'got', 'carried', 'Britons', 'that', 'velocity', 'bangin', 'bird', 'maybe', 'Who', 'carrying', 'house', 'all', 'European', 'We', 'Found', 'non-migratory', '...', 'will', 'in', 'Mercea', 'breadth', '--', 'second', 'guiding', 'Saxons', 'wind', 'agree', 'beat', 'Whoa', 'I', 'why', 'line', 'Yes', 'together', 'It', 'not', 'b

In [16]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588


In [17]:
# Write a regular expression to search for anything in square brackets: pattern1
# Note: .* is greedy, so the match runs from the first "[" to the last "]" on the line
pattern1 = r"\[.*\]"

# Use re.search to find the first match for the pattern
print(re.search(pattern1, scene_one))

<_sre.SRE_Match object; span=(9, 32), match='[wind] [clop clop clop]'>
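The match above covers both bracketed groups because `.*` is greedy: it consumes as much as it can before the final `]` on the line. A non-greedy variant, `.*?`, stops at the first closing bracket. The sketch below compares the two on the first line of the scene, copied here so the example is self-contained:

```python
import re

# First line of the scene, repeated here as sample input
line = "SCENE 1: [wind] [clop clop clop] "

greedy = re.search(r"\[.*\]", line)
lazy = re.search(r"\[.*?\]", line)

print(greedy.group())  # [wind] [clop clop clop]
print(lazy.group())    # [wind]

# With the non-greedy pattern, re.findall returns each bracketed group separately
print(re.findall(r"\[.*?\]", line))  # ['[wind]', '[clop clop clop]']
```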


In [18]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

<_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>
