## Introduction to tokenization
Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing. 

#### Understanding Tokenization
What are Tokens?

- Tokens are the basic units that NLP tools operate on. A token is a piece of a larger text, such as a word or punctuation mark within a sentence, or a sentence within a paragraph.

Types of Tokenization:

- Word Tokenization: This involves breaking text into individual words. For example, splitting "Hello world!" on whitespace yields two tokens, "Hello" and "world!", whereas NLTK's word_tokenize also separates punctuation and returns "Hello", "world", and "!".
- Sentence Tokenization: This involves breaking text into individual sentences. It’s useful for processing tasks that require understanding the context of each sentence.

Why Tokenization?

- Tokenization is used to structure text for further analysis or processing, like part-of-speech tagging, sentiment analysis, or input for machine learning models.
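To make the two granularities concrete, here is a minimal sketch that uses only Python's built-in re module. The sample text and both regular expressions are illustrative assumptions; they are rough stand-ins for NLTK's tokenizers, not their actual algorithms.

```python
import re

text = "Hello world! Tokenization is fun."

# Whitespace splitting keeps punctuation attached to the neighbouring word.
whitespace_tokens = text.split()

# A simple regex tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# A naive sentence splitter: break on whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens)
print(regex_tokens)
print(naive_sentences)
```

The naive sentence splitter breaks on abbreviations such as "Dr.", which is one reason to prefer a trained tokenizer like sent_tokenize for real text.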

In [4]:
from nltk.tokenize import word_tokenize 
word_tokenize("Hi There!")

['Hi', 'There', '!']

In [5]:
# Path
file_path = 'scene_one.txt'

# Reading the file
with open(file_path, 'r') as file:
    scene_one = file.read()

In [6]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

In [7]:
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)
sentences

['SCENE 1: \n[wind] [clop clop clop] \\nKING ARTHUR: Whoa there!',
 '[clop clop clop] \\nSOLDIER #1: Halt!',
 'Who goes there?\\nARTHUR: It is I, Arthur, son of \n...\ncreeper!\\nSOLDIER #1: What, held under the dorsal guiding feathers?\\nSOLDIER #2: Well, why not?\\n']

In [8]:
# Use word_tokenize to tokenize the 3rd sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[2])
tokenized_sent

['Who',
 'goes',
 'there',
 '?',
 '\\nARTHUR',
 ':',
 'It',
 'is',
 'I',
 ',',
 'Arthur',
 ',',
 'son',
 'of',
 '...',
 'creeper',
 '!',
 '\\nSOLDIER',
 '#',
 '1',
 ':',
 'What',
 ',',
 'held',
 'under',
 'the',
 'dorsal',
 'guiding',
 'feathers',
 '?',
 '\\nSOLDIER',
 '#',
 '2',
 ':',
 'Well',
 ',',
 'why',
 'not',
 '?',
 '\\n']

In [9]:
# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))
 
# Print the unique tokens result
print(unique_tokens)

{'guiding', 'Well', ',', 'there', 'goes', 'Arthur', 'of', 'the', 'ARTHUR', '\\nARTHUR', 'Who', 'wind', 'is', 'clop', 'feathers', '#', 'SCENE', '\\n', 'I', 'under', 'It', 'not', '...', '1', ']', '!', '2', '\\nSOLDIER', 'What', 'Whoa', 'creeper', ':', 'son', 'held', 'why', 'dorsal', 'Halt', '[', '\\nKING', '?'}


Use re.search() to search for the first occurrence of the word "wind" in scene_one. Store the result in match.
Print the start and end indexes of match using its .start() and .end() methods, respectively.

In [10]:
import re
# Search for the first occurrence of "wind" in scene_one: match
match = re.search("wind", scene_one)

In [11]:
# Print the start and end indexes of match
print(match.start(), match.end())

11 15
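The start and end indexes can be used directly as a slice. The sketch below assumes a short made-up string in the style of the scene (not the real file) and shows how .start(), .end(), .span(), and .group() relate to each other.

```python
import re

# Hypothetical sample text; the real notebook reads scene_one from a file.
sample = "SCENE 1: \n[wind] [clop clop clop]"

match = re.search("wind", sample)

# start() is inclusive and end() is exclusive, like Python slice bounds.
print(match.start(), match.end())
print(match.span())
print(sample[match.start():match.end()])
print(match.group())
```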


Write a regular expression called pattern1 to find anything in square brackets.
Use re.search() with the pattern to find the first text in square brackets in scene_one. Print the result.

In [12]:
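# Write a regular expression to search for anything in square brackets: pattern1
# Note: .* is greedy, so the match extends to the last ] on the line.
# A non-greedy r"\[.*?\]" would stop at the first closing bracket instead.
pattern1 = r"\[.*]"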

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

<re.Match object; span=(10, 78), match='[wind] [clop clop clop] \\nKING ARTHUR: Whoa ther>
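The long span in this result comes from greedy matching. As a small sketch on a made-up sample string (an assumption, not the contents of scene_one), the following compares the greedy pattern with a non-greedy version and with a negated character class.

```python
import re

# Hypothetical sample in the style of the scene.
sample = "[wind] [clop clop clop] KING ARTHUR: Whoa there!"

# Greedy: .* grabs as much as possible, so the match ends at the LAST ].
greedy = re.search(r"\[.*\]", sample).group()

# Non-greedy: .*? grabs as little as possible, so the match ends at the FIRST ].
lazy = re.search(r"\[.*?\]", sample).group()

# Negated class: anything except ] between the brackets; finds every bracketed span.
all_bracketed = re.findall(r"\[[^\]]*\]", sample)

print(greedy)
print(lazy)
print(all_bracketed)
```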


Create a pattern to match the script notation (e.g. Character:), assigning the result to pattern2. Remember that you will want to match any words or spaces that precede the : (such as the space within SOLDIER #1:).
Use re.match() with your new pattern to find and print the script notation at the beginning of the first sentence (sentences[0]). The tokenized sentences are available in your namespace as sentences.

In [13]:
# Find the script notation at the beginning of the 1st sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2,sentences[0]))

<re.Match object; span=(0, 8), match='SCENE 1:'>
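One thing worth checking is how this pattern behaves on a line that contains a #. The sketch below uses a made-up line in the script's style (an assumption) to contrast re.match(), which only tries position 0, with re.search(), which scans forward. The variable with_hash is a hypothetical variant of the pattern, not part of the exercise.

```python
import re

pattern2 = r"[\w\s]+:"        # the pattern from the cell above
line = "SOLDIER #1: Halt!"    # hypothetical sample line

# re.match anchors at the start; '#' is neither \w nor \s, so nothing matches.
matched = re.match(pattern2, line)

# re.search scans forward and finds the first position where the pattern fits.
searched = re.search(pattern2, line).group()

# Adding '#' to the character class lets the whole speaker label match.
with_hash = r"[\w\s#]+:"
full_label = re.match(with_hash, line).group()

print(matched)
print(searched)
print(full_label)
```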


## Advanced tokenization with NLTK and regex

One regex feature you will find useful for advanced tokenization is alternation (OR), written with the pipe character |. To limit what the | applies to, wrap the alternatives in a group using parentheses (). A group can contain any pattern. You can also define explicit character classes using square brackets [], which match any single character from the listed set.
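As a quick illustration of these three building blocks, here is a minimal sketch on a made-up sentence (the text and patterns are assumptions chosen for the example).

```python
import re

text = "he has 11 cats and 2 dogs"

# Alternation: match a run of digits OR a run of lowercase letters.
digits_or_words = re.findall(r"\d+|[a-z]+", text)

# Groups: with several capturing groups, findall returns one tuple per match.
counts = re.findall(r"(\d+) (cats|dogs)", text)

# Character class: any single character from the set 0-9, repeated.
numbers = re.findall(r"[0-9]+", text)

print(digits_or_words)
print(counts)
print(numbers)
```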

In [14]:
# Raw string so \d and \w reach the regex engine unchanged
digit_letter = r'(\d+|\w+)'
re.findall(digit_letter, scene_one)

['SCENE',
 '1',
 'wind',
 'clop',
 'clop',
 'clop',
 'nKING',
 'ARTHUR',
 'Whoa',
 'there',
 'clop',
 'clop',
 'clop',
 'nSOLDIER',
 '1',
 'Halt',
 'Who',
 'goes',
 'there',
 'nARTHUR',
 'It',
 'is',
 'I',
 'Arthur',
 'son',
 'of',
 'creeper',
 'nSOLDIER',
 '1',
 'What',
 'held',
 'under',
 'the',
 'dorsal',
 'guiding',
 'feathers',
 'nSOLDIER',
 '2',
 'Well',
 'why',
 'not',
 'n']

In [15]:
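# Note: the range A-z also covers [ \ ] ^ _ and `, which is why brackets and
# backslashes show up in this output; use [a-zA-Z] to match letters only.
all_letter = r'[a-zA-z]+'
re.findall(all_letter, scene_one)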

['SCENE',
 '[wind]',
 '[clop',
 'clop',
 'clop]',
 '\\nKING',
 'ARTHUR',
 'Whoa',
 'there',
 '[clop',
 'clop',
 'clop]',
 '\\nSOLDIER',
 'Halt',
 'Who',
 'goes',
 'there',
 '\\nARTHUR',
 'It',
 'is',
 'I',
 'Arthur',
 'son',
 'of',
 'creeper',
 '\\nSOLDIER',
 'What',
 'held',
 'under',
 'the',
 'dorsal',
 'guiding',
 'feathers',
 '\\nSOLDIER',
 'Well',
 'why',
 'not',
 '\\n']

In [16]:
num = r'[0-9]'
re.search(num,scene_one)

<re.Match object; span=(6, 7), match='1'>

In [17]:
special = r'[\d#]'  # inside [], + and | are literal characters, so they are left out
re.findall(special, scene_one)

['1', '#', '1', '#', '1', '#', '2']
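Character classes have two pitfalls that are easy to miss: a range such as A-z follows code-point order, and metacharacters like + and | lose their special meaning inside the brackets. The sketch below demonstrates both on a made-up string (an assumption for illustration).

```python
import re

sample = "[clop] A_b #1 a+b|c"

# A-z spans code points 0x41-0x7A, which includes [ \ ] ^ _ and `.
too_wide = re.findall(r"[a-zA-z]+", sample)

# A-Z stops at the last uppercase letter, so only letters match.
letters_only = re.findall(r"[a-zA-Z]+", sample)

# Inside a class, + and | are ordinary characters and get matched literally.
literal_meta = re.findall(r"[\d+|#]", sample)

# Listing only what is wanted: digits and the # sign.
digits_and_hash = re.findall(r"[\d#]", sample)

print(too_wide)
print(letters_only)
print(literal_meta)
print(digits_and_hash)
```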

#### Choosing a tokenizer

Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have '#1' remain a single token.

You can use regexp_tokenize(string, pattern) with my_string and one of the patterns as arguments to experiment for yourself and see which is the best tokenizer.

In [21]:
from nltk.tokenize import regexp_tokenize

In [22]:
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

In [23]:
pattern1 = '(\\w+|\\?|!)'
pattern2 = '(\\w+|#\\d|\\?|!)'
pattern3 = '(#\\d\\w+\\?!)'
pattern4 = '\\s+'

In [24]:
regexp_tokenize(my_string, pattern2)

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

In [25]:
regexp_tokenize(my_string, pattern4)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
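To compare all four candidates side by side, the sketch below runs each pattern with re.findall as a stand-in for regexp_tokenize. That substitution is an assumption made so the example needs only the standard library; for these particular patterns it gives the same results as the two regexp_tokenize outputs shown above.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

patterns = {
    "pattern1": r"(\w+|\?|!)",
    "pattern2": r"(\w+|#\d|\?|!)",
    "pattern3": r"(#\d\w+\?!)",
    "pattern4": r"\s+",
}

# Apply every candidate pattern to the same string.
results = {name: re.findall(p, my_string) for name, p in patterns.items()}

for name, tokens in results.items():
    print(name, tokens)
```

Only pattern2 keeps '#1' as a single token while still emitting '?' and '!' separately. pattern1 drops the '#', pattern3 requires the pieces to appear back to back and so matches nothing, and pattern4 returns only the whitespace.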

#### Regex with NLTK tokenization
Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using nltk and regex. The nltk.tokenize.TweetTokenizer class gives you some extra methods and attributes for parsing tweets.

Here, you're given some example tweets to parse using both TweetTokenizer and regexp_tokenize from the nltk.tokenize module. These example tweets have been pre-loaded into the variable tweets. Feel free to explore it in the IPython Shell!

Unlike re functions such as re.findall(pattern, string), regexp_tokenize() takes the string first and the pattern as the second argument.

In [26]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

In [27]:
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

In [28]:
tweets = ['This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @datacamp :) #nlp #python']
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

['#nlp', '#python']


- Write a new pattern called pattern2 to match mentions and hashtags. A mention is something like @DataCamp.

Then, call regexp_tokenize() with your new pattern on the last tweet in tweets and assign the result to mentions_hashtags.

You can access the last element of a list using -1 as the index, for example, tweets[-1].

In [29]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

['@datacamp', '#nlp', '#python']


Create an instance of TweetTokenizer called tknzr and use it inside a list comprehension to tokenize each tweet into a new list called all_tokens.
To do this, use the .tokenize() method of tknzr, with t as your iterator variable.

In [30]:
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
tknzr

<nltk.tokenize.casual.TweetTokenizer at 0x1c8c5caef10>

In [31]:
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]
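For intuition about what a tweet-aware tokenizer has to handle, here is a rough regex approximation built with the standard library. The pattern tweet_pattern is a hypothetical sketch, not TweetTokenizer's actual implementation; it only covers mentions, hashtags, a few emoticons, words, and single punctuation marks.

```python
import re

tweets = [
    "This is the best #nlp exercise ive found online! #python",
    "#NLP is super fun! <3 #learning",
    "Thanks @datacamp :) #nlp #python",
]

# Order matters: try mentions/hashtags and emoticons before plain words
# and single punctuation marks.
tweet_pattern = r"[@#]\w+|[:;]-?[()DP]|<3|\w+|[^\w\s]"

# One token list per tweet, mirroring the list comprehension above.
nested_tokens = [re.findall(tweet_pattern, t) for t in tweets]

# A single flat list, if one list for all tweets is preferred.
flat_tokens = [tok for tweet_tokens in nested_tokens for tok in tweet_tokens]

print(nested_tokens)
print(len(flat_tokens))
```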


# Non-ASCII tokenization
In this exercise, you'll practice advanced tokenization by tokenizing some non-ASCII text. You'll be using German with emoji!


Unicode ranges for emoji are:

(\U0001F300-\U0001F5FF), (\U0001F600-\U0001F64F), (\U0001F680-\U0001F6FF), and (\u2600-\u26FF, \u2700-\u27BF).
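To see which of these ranges a given character falls into, one option is to compare its code point against each range. The helper emoji_block and the EMOJI_BLOCKS table below are hypothetical names introduced for this sketch; the block names follow the Unicode block names for the ranges listed above.

```python
# Ranges from the list above, keyed by their Unicode block names.
EMOJI_BLOCKS = {
    "Miscellaneous Symbols and Pictographs": (0x1F300, 0x1F5FF),
    "Emoticons": (0x1F600, 0x1F64F),
    "Transport and Map Symbols": (0x1F680, 0x1F6FF),
    "Miscellaneous Symbols": (0x2600, 0x26FF),
    "Dingbats": (0x2700, 0x27BF),
}


def emoji_block(char):
    """Return the name of the listed block containing char, or None."""
    code_point = ord(char)
    for name, (low, high) in EMOJI_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None


# Pizza slice, taxi, and a plain German letter for contrast.
for char in "\U0001F355\U0001F695ü":
    print(char, hex(ord(char)), emoji_block(char))
```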

In [32]:
german_text = "Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕"

In [35]:
from nltk.tokenize import regexp_tokenize, word_tokenize

In [39]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']


In [40]:
# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

['Wann', 'Pizza', 'Und', 'Über']


In [41]:
# Tokenize and print only emoji
# Inside [...] the ranges are simply listed; quotes and | would be matched literally
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

['🍕', '🚕']


# Email and Emoji Extractor from Text

### Objective
Develop a Python script that can take a large string (such as the content of a document) and extract all the email addresses and emojis it contains.

Understanding Regex for Emails and Emojis:


The regular expression r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b' is designed to match email addresses in a text. Let's break it down to understand each part:

- \b: A word boundary. It requires the match to start at the edge of a word, that is, between a word character and a non-word character or the start of the string.

- [A-Za-z0-9._%+-]+: Matches the username of an email address. It can include uppercase and lowercase letters (A-Za-z), digits (0-9), and the special characters dot (.), underscore (_), percent (%), plus (+), and hyphen (-). The + means that the preceding character set can appear one or more times.

- @: The literal "at" symbol that appears in all email addresses.

- [A-Za-z0-9.-]+: Matches the domain name of the email address. It can include uppercase and lowercase letters, digits, dots, and hyphens. The + again means one or more occurrences of the preceding character set.

- \.: A literal dot. In regex, an unescaped dot is a special character that matches almost any character, so it is escaped with a backslash to denote an actual dot.

- [A-Z|a-z]{2,}: Matches the top-level domain (like .com, .org, .net) using uppercase and lowercase letters. The {2,} specifies that this part must be at least two characters long, with no upper limit. The pipe | inside the brackets does not act as "or" here; it is interpreted literally, which is a mistake in the pattern. It should be removed, giving [A-Za-z]{2,}.

- \b: Another word boundary, so the match must end at the edge of a word.

For emojis, the regex can be more complex because emoji are spread over several Unicode blocks. A basic pattern such as r'[\U0001F600-\U0001F64F]' matches only the Emoticons block (smileys and similar faces); broader coverage needs additional ranges.
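The effect of the stray pipe in the email pattern's top-level-domain class can be checked directly. The sketch below runs the pattern with and without the | on a made-up sentence; the text and the malformed address are assumptions chosen to trigger the difference.

```python
import re

# Pattern as written in the breakdown, with a literal | in the TLD class.
with_pipe = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')

# Same pattern with the | removed from the class.
without_pipe = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Hypothetical text: one valid address and one with a | in the TLD.
text = "Write to contact@fakemail.com or bad@example.c|m today."

print(with_pipe.findall(text))
print(without_pipe.findall(text))
```

Both versions find the valid address, but only the version with the pipe accepts the malformed one.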

In [52]:
import re

# Regular expressions for emails and emojis
# The TLD class is [A-Za-z]; a | inside [] would be matched literally
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emoji_pattern = re.compile(
    r'[\U0001F600-\U0001F64F'  # emoticons
    r'\U0001F300-\U0001F5FF'  # symbols & pictographs
    r'\U0001F680-\U0001F6FF'  # transport & map symbols
    r'\U0001F1E0-\U0001F1FF'  # flags (iOS)
    r'\U00002702-\U000027B0'  # dingbats
    r'\U000024C2-\U0001F251'  # very broad: also covers many non-emoji characters
    r'\U0001f926-\U0001f937'
    r'\U00010000-\U0010ffff'  # all supplementary planes
    r'\u2640-\u2642'
    r'\u2600-\u2B55'
    r'\u200d'  # zero-width joiner
    r'\u23cf'
    r'\u23e9'
    r'\u231a'
    r'\ufe0f'  # variation selector-16
    r'\u3030'
    ']+', flags=re.UNICODE)

# Function to extract emails
def extract_emails(text):
    return email_pattern.findall(text)

# Function to extract emojis
def extract_emojis(text):
    return emoji_pattern.findall(text)

In [53]:
# Path
file_path1 = 'fake2.txt'

# Reading the file
with open(file_path1, 'r',encoding='utf-8') as file:
    example_text = file.read()

In [54]:
# Extracting emails and emojis
emails = extract_emails(example_text)
emojis = extract_emojis(example_text)

In [55]:
print(emails)

['contact@fakemail.com', 'hr@learningcorp.net', 'helpdesk@techsolutions.org', 'orders@online-shop.com', 'community@socialplatform.io', 'unsubscribe@newsmail.com', 'partners@b2bmarket.net', 'security@accountsafety.com', 'feedback@userinput.com', 'service@homefixers.org', 'press@mediarelations.com', 'sales@techproducts.net', 'legal@lawconsultants.net', 'techsupport@troubleshooters.org', 'travel@adventures.com']


In [56]:
print(emojis)

['😊', '📅', '📞', '🛒', '🌐', '🚫', '🤝', '🔒', '📝', '🛠️', '🎤', '💻', '⚖️', '💡', '✈️']
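Some of these results, such as '🛠️' and '✈️', consist of more than one code point: a base symbol followed by the variation selector U+FE0F. The sketch below uses a simplified class (an assumption, much narrower than emoji_pattern above) on a made-up string to show how the trailing + keeps such sequences together, and how it also merges adjacent emoji into one match.

```python
import re

# Simplified emoji class for illustration; not the notebook's full pattern.
emoji_run = re.compile("[\U0001F300-\U0001F6FF\u2600-\u27BF\ufe0f\u200d]+")

# Hammer-and-wrench + VS16, airplane + VS16, then two smileys in a row.
text = "Fixing \U0001F6E0\uFE0F and flying \u2708\uFE0F soon \U0001F60A\U0001F60A"

matches = emoji_run.findall(text)

for m in matches:
    # len() counts code points, not visible glyphs.
    print(m, len(m), [hex(ord(c)) for c in m])
```

If one match per emoji is needed, dropping the + gives single code points instead, at the cost of separating each variation selector from its base symbol.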
