# Word processing

## Processing using regex
The previous lesson demonstrated using regex to answer Eliza questions. In this lesson, we will use regex to process text. 

<font color="red"> **Tokenisation** </font> is defined as the process of breaking up a string into tokens. 
Tokens are the basic building blocks of a sentence. In English, tokens are words, punctuation, and numbers.

For German tokenisation, we need to consider the following:
- German has compound words, e.g. "Krankenhaus" (hospital) is made up of "Kranken" (sick) and "Haus" (house).
- German has umlauts ("ä", "ö", "ü") and the Eszett ("ß").
- German has a different punctuation system, e.g. "„", "“", "–", "…", "»", "«".
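
As a rough illustration of the last two points, here is a minimal sketch (the function name `tokenize_de` and the example sentence are made up for this demo). It relies on the fact that `\w` in Python 3 string patterns is Unicode-aware, so umlauts and "ß" count as letters, while German quotation marks fall into the punctuation branch. Compound words are not split; that would need a dictionary or a trained model.

```python
import re

def tokenize_de(text):
    # \w is Unicode-aware for str patterns in Python 3, so ä, ö, ü and ß are word characters.
    # [^\w\s] matches one remaining character at a time, e.g. „ “ » « or a comma.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_de("„Das Krankenhaus ist groß“, sagte er."))
# ['„', 'Das', 'Krankenhaus', 'ist', 'groß', '“', ',', 'sagte', 'er', '.']
```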

Let us focus first on English tokenization. That in itself is not trivial. 
Web-based text can be:
1. formal (e.g. news articles, when we want to build a news summarizer)
2. informal (e.g. Twitter, when we want to build a sentiment analyzer)

Interesting things happen when we deal with the modern vocabulary found in informal texts such as tweets.

Especially when we work with web-based texts, we need to consider the following (with examples):
- HTML tags (e.g. \<br\>)
- URLs (http://www.example.com)
- Emoticons (:-))
- Abbreviations (e.g. "Mr.", "Mrs.", "Dr.", "Prof.")
- Punctuation (e.g. "!", "?", ".", ",", ";", ":", "(", ")", "[", "]", "{", "}", "<", ">")
- Numbers (e.g. "1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
- Contractions (e.g. "didn't" = "did not", "capp'n" = "capping")
- Currency symbols (e.g. "$", "€", "£", "¥")
- Hyphenated words (e.g. "self-driving")
- Words with apostrophes (e.g. "it's")
- Words with underscores (e.g. "@this_is_a_twitter_handle")
- Special characters (e.g. "@", "#", "%", "&", "*", "+", "=", "~", "_")
- Non-ASCII characters (e.g. "é", "ñ", "ç", "ß")
- Words with periods (e.g. "U.S.", "U.N.", "U.d.S.", "L.S.T.")
and much more... 

Today we will cover a subset of them.
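
To give a feeling for how a few of these cases interact, here is a minimal sketch of a pattern that keeps handles, dotted abbreviations, a handful of emoticons, and hyphenated or apostrophe words together. The names `WEB_TOKEN` and `tokenize_web`, the handle `@nlp_fan`, and the example sentence are all made up for this demo, and the pattern is far from complete.

```python
import re

WEB_TOKEN = re.compile(
    r"@\w+"                  # handles such as @this_is_a_twitter_handle
    r"|(?:[A-Za-z]\.){2,}"   # dotted abbreviations such as U.S.
    r"|[:;]-?[()DP]"         # a few Western-style emoticons such as :-)
    r"|\w+(?:[-']\w+)*"      # words, including self-driving and it's
    r"|[^\w\s]"              # any other single punctuation mark
)

def tokenize_web(text):
    # Alternatives are tried left to right at each position, so order matters.
    return WEB_TOKEN.findall(text)

print(tokenize_web("@nlp_fan says it's a self-driving car from the U.S. :-)"))
# ['@nlp_fan', 'says', "it's", 'a', 'self-driving', 'car', 'from', 'the', 'U.S.', ':-)']
```

One visible limitation: when an abbreviation such as "U.S." ends a sentence, the pattern cannot tell whether the last period also closes the sentence.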

### Tokenization using regex

Consider that we are interested in knowing the words spoken by Eliza. 
Or we want to know the words spoken by the user in a YouTube comment. We can use regex to extract the words from the text.

```
sentence = "Today is 2023-07-20 and we are learning #NLP in class. Check out the website: www.example.com for more information! The price is $100.50."
```

Exercise 1: Word Tokenization in NLP

In this exercise, we will focus on word tokenization using space-based tokenization and handle special cases such as punctuation, dates, URLs, and hashtags.

<br>
<br>

Step 1: Simple <font color="blue">**Space-Based Tokenization**</font>
Write a Python function called simple_tokenizer that takes a sentence as input and returns a list of words, tokenized based on spaces.


```
def simple_tokenizer(sentence):
    # TODO: Implement space-based tokenization
    return tokens
```

Step 2: Handling <font color="blue">**Punctuation**</font>
Update the simple_tokenizer function to handle punctuation marks as well. Punctuation marks should be treated as separate tokens.

Step 3: Handling <font color="blue">**Dates**</font>
Write a Python function called handle_dates that takes a list of tokens and identifies dates in the format "YYYY-MM-DD" and tokenizes them as a single unit.

```
def handle_dates(tokens):
    # TODO: Identify dates (YYYY-MM-DD) and merge them into a single token
    return modified_tokens
```

Step 4: Handling <font color="blue">**URLs**</font>
Write a Python function called handle_urls that takes a list of tokens and identifies URLs and tokenizes them as a single unit.

```
def handle_urls(tokens):
    # TODO: Identify URLs and merge them into a single token
    return modified_tokens
```

Step 5: Handling <font color="blue">**Hashtags**</font>
Write a Python function called handle_hashtags that takes a list of tokens and identifies hashtags and tokenizes them as a single unit.

```
def handle_hashtags(tokens):
    # TODO: Identify hashtags and merge them into a single token
    return modified_tokens
```

Step 6: Handling <font color="blue">**Currency Values**</font>
Write a Python function called handle_currency that takes a list of tokens and identifies currency values (e.g., $100.50) and tokenizes them as a single unit.
                                                                                                           
```
def handle_currency(tokens):
    # TODO: Identify currency values and merge them into a single token
    return modified_tokens
```


Step 7: Putting It All Together
Now, combine all the functions to create a comprehensive tokenizer that handles spaces, punctuation, dates, URLs, hashtags, and currency values.

```
def comprehensive_tokenizer(sentence):
    tokens = simple_tokenizer(sentence)
    tokens = handle_dates(tokens)
    tokens = handle_urls(tokens)
    tokens = handle_hashtags(tokens)
    tokens = handle_currency(tokens)
    return tokens
```

Step 8: Test the Tokenizer
Test your comprehensive tokenizer using various sentences containing dates, URLs, hashtags, currency values, and punctuation marks.

Example:
```
sentence = "Today is 2023-07-20 and we are learning #NLP in class. Check out the website: www.example.com for more information! The price is $100.50."
tokens = comprehensive_tokenizer(sentence)
print(tokens)
```



Expected Output:
['Today', 'is', '2023-07-20', 'and', 'we', 'are', 'learning', '#NLP', 'in', 'class', '.', 'Check', 'out', 'the', 'website', ':', 'www.example.com', 'for', 'more', 'information', '!', 'The', 'price', 'is', '$100.50', '.']



# Hint:

Simple Space-Based Tokenization:

```
import re # Regular Expression library. Will be useful in this exercise

def simple_tokenizer(sentence):
    return sentence.split()

sentence = 'This is a simple sentence that can be tokenized using spaces . Note that even punctuation marks are treated as separate tokens .'
tokens = simple_tokenizer(sentence)
print(tokens)
```
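
For Step 2, one possible sketch (not the only solution; the function name `tokenize_with_punctuation` is made up here) is to put spaces around every punctuation mark and then reuse the space-based split:

```python
import re

def tokenize_with_punctuation(sentence):
    # Put spaces around every character that is neither a word character nor whitespace...
    padded = re.sub(r"([^\w\s])", r" \1 ", sentence)
    # ...then reuse the space-based split from Step 1.
    return padded.split()

print(tokenize_with_punctuation("Check out the website: it is great!"))
# ['Check', 'out', 'the', 'website', ':', 'it', 'is', 'great', '!']

print(tokenize_with_punctuation("2023-07-20"))
# ['2023', '-', '07', '-', '20']
```

The second call shows why dates, URLs, hashtags, and currency values need special treatment: a blanket punctuation split tears them apart.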

# Hint Answer 1:

Handling Dates:

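```
import re

def handle_dates(tokens):
    # The capturing group makes re.split keep the date itself in its result.
    date_pattern = r'(\d{4}-\d{2}-\d{2})'
    modified_tokens = []
    for token in tokens:
        # Tokens without a date come back unchanged; empty strings are dropped.
        modified_tokens.extend(part for part in re.split(date_pattern, token) if part)
    return modified_tokens
```

Remember to import the `re` library.


We wrap the date pattern in a capturing group and pass it to the re.split function. 
If a token contains a date, the date is kept as a single token and any text attached to it (for example a trailing comma) becomes a separate token. 
If it does not, the token is returned unchanged, so that URLs, hashtags, and currency values stay intact for the later steps. Splitting every non-date token on non-word characters here would break "#NLP" into "#" and "NLP" and leave empty strings in the result.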

*Also note that we only handle dates in the format YYYY-MM-DD. Can you think of a way to handle dates in the format MM-DD-YYYY?*
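
One possible answer, as a sketch only: add a second alternative to the pattern. The names `DATE_PATTERN` and `is_date` are made up for this illustration, and the check looks at digit counts only, so it does not validate that the month or day actually exists.

```python
import re

# Either YYYY-MM-DD or MM-DD-YYYY; digit counts only, no calendar validation.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d{2}-\d{2}-\d{4}")

def is_date(token):
    # fullmatch requires the whole token to be a date, so "2023-07-20abc" is rejected.
    return DATE_PATTERN.fullmatch(token) is not None

print(is_date("2023-07-20"))     # True
print(is_date("07-20-2023"))     # True
print(is_date("2023-07-20abc"))  # False
```

A regex cannot tell MM-DD-YYYY from DD-MM-YYYY; "03-04-2023" matches either way, and deciding which is meant needs knowledge about the source of the text.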

# Answer 1:

Handling URLs:

```
import re

def handle_urls(tokens):
    # The capturing group makes re.split keep the URL itself in its result.
    url_pattern = r'(www\.\S+|https?://\S+)'
    modified_tokens = []
    for token in tokens:
        # Tokens without a URL come back unchanged; empty strings are dropped.
        modified_tokens.extend(part for part in re.split(url_pattern, token) if part)
    return modified_tokens
```


Handling Hashtags:

```
import re

def handle_hashtags(tokens):
    # The capturing group makes re.split keep the hashtag itself in its result.
    hashtag_pattern = r'(#\w+)'
    modified_tokens = []
    for token in tokens:
        # Tokens without a hashtag come back unchanged; empty strings are dropped.
        modified_tokens.extend(part for part in re.split(hashtag_pattern, token) if part)
    return modified_tokens
```


Handling Currency Values:

```
import re

def handle_currency(tokens):
    # Outer group: kept by re.split. Inner group is non-capturing so it adds no extra items.
    currency_pattern = r'(\$\d+(?:\.\d+)?)'
    modified_tokens = []
    for token in tokens:
        # '$100.50.' becomes '$100.50' and '.'; other tokens come back unchanged.
        modified_tokens.extend(part for part in re.split(currency_pattern, token) if part)
    return modified_tokens
```


Combine All Functionalities:

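```
import re

def comprehensive_tokenizer(sentence):
    tokens = simple_tokenizer(sentence)
    tokens = handle_dates(tokens)
    tokens = handle_urls(tokens)
    tokens = handle_hashtags(tokens)
    tokens = handle_currency(tokens)
    # Finally, split punctuation off every token that is not a protected one.
    protected = r'\d{4}-\d{2}-\d{2}|www\.\S+|https?://\S+|#\w+|\$\d+(?:\.\d+)?'
    result = []
    for token in tokens:
        if re.fullmatch(protected, token):
            result.append(token)
        else:
            result.extend(re.findall(r'\w+|[^\w\s]', token))
    return result
```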


Applying it to a new sentence:
```
sentence = "I bought these shoes for $123.50 from www.shoes.com #shoes #shopping"
result = comprehensive_tokenizer(sentence)
print(result)
```
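
An alternative to a chain of handlers is a single pass with one combined pattern. The sketch below is one way to do it, not the reference solution; the names `TOKEN_PATTERN` and `regex_tokenizer` are made up here. Each alternative describes one token type, and the more specific types are listed before the generic word and punctuation alternatives.

```python
import re

# One alternative per token type; earlier alternatives win at each position.
TOKEN_PATTERN = re.compile(
    r"""
    \d{4}-\d{2}-\d{2}          # dates such as 2023-07-20
    | https?://\S+ | www\.\S+  # URLs (these also swallow punctuation glued to their end)
    | \#\w+                    # hashtags
    | \$\d+(?:\.\d+)?          # dollar amounts such as $100.50
    | \w+                      # ordinary words and numbers
    | [^\w\s]                  # any other single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenizer(sentence):
    return TOKEN_PATTERN.findall(sentence)

sentence = ("Today is 2023-07-20 and we are learning #NLP in class. "
            "Check out the website: www.example.com for more information! "
            "The price is $100.50.")
print(regex_tokenizer(sentence))
```

On the exercise sentence this prints the list shown under "Expected Output". Because the whole sentence is scanned once, no step can accidentally split a token that another step wanted to keep.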

### Tokenization using NLTK

NLTK is a Python library for natural language processing. It offers tokenization, stemming, lemmatization, POS tagging, and more.
Today we will see how to use NLTK for tokenization.

NLTK offers several tokenizers for different use cases. We will look at three of them:
1. Word Tokenizer
2. Sentence Tokenizer
3. Regexp Tokenizer

#### Word Tokenizer
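* The word tokenizer splits text into words and punctuation marks.
* `word_tokenize` does more than split on spaces: it first splits the text into sentences and then applies Treebank-style rules, so punctuation marks and contractions become separate tokens (for example, "don't" becomes "do" and "n't").
* These rules are tuned for English prose, so they can still go wrong on informal web text. Let’s see an example of how it works.
```
from nltk.tokenize import word_tokenize
# Needs the Punkt models once: nltk.download('punkt') ('punkt_tab' on recent NLTK releases)
text = "I am learning NLP"
word_tokenize(text)
```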


#### Sentence Tokenizer
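* The sentence tokenizer splits text into sentences.
* It is used to split a paragraph or a longer text into its sentences.
* `sent_tokenize` uses the pre-trained Punkt model, which looks at more than periods: it has learned to tell sentence-final periods from periods that belong to abbreviations.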

```
from nltk.tokenize import sent_tokenize
text = "I am learning NLP. It is very interesting and exciting. I am enjoying this course."
sent_tokenize(text)
```
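
To see why splitting on periods alone is fragile, here is a minimal sketch with plain `re` (the helper name `naive_sent_split` is made up for this demo). It splits wherever ".", "!" or "?" is followed by whitespace:

```python
import re

def naive_sent_split(text):
    # Split at whitespace that directly follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("I am learning NLP. It is very interesting."))
# ['I am learning NLP.', 'It is very interesting.']

print(naive_sent_split("Dr. Smith teaches NLP. The course is fun."))
# ['Dr.', 'Smith teaches NLP.', 'The course is fun.']
```

The second call breaks after "Dr.", which is exactly the kind of case a trained sentence tokenizer is meant to handle.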

#### Regex Tokenizer
* The regexp tokenizer (`RegexpTokenizer`) tokenizes text with a regular expression of your choice.
* By default the pattern describes the tokens to keep; with `gaps=True` it describes the separators between tokens instead.
* Text that the pattern does not match is dropped from the output.

```
from nltk.tokenize import RegexpTokenizer

# Create a regular expression tokenizer that keeps URLs and numbers
# (raw string, so that \. and \d reach the regex engine unchanged)
tokenizer = RegexpTokenizer(r"www\.[^\s]+|https?://[^\s]+|\d+\.?\d*")

# Tokenize the text
text = "I am learning NLP. It is very interesting and exciting. I am enjoying this course. You can learn more about NLP at https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/"
tokenizer.tokenize(text)
```
We first initialized RegexpTokenizer with a regex pattern and then called the tokenize() method on the text. Because the pattern only describes URLs and numbers, the result for this text contains just the URL; the ordinary words are dropped.

### Tokenization using spaCy (a short demo)

```
import spacy
# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.pos_, token.dep_)
```

* Note how spaCy identifies the part of speech of each token and the dependency relations between the tokens in the sentence. 
* It goes beyond tokenization and provides a lot of useful information about the sentence.

### Tokenizing YouTube comments *IN PROGRESS*

Read the youtube.csv file and tokenize the comments.

```
# Read the file
import pandas as pd

df = pd.read_csv('youtube.csv')

# Retain only the CONTENT column
df = df[['CONTENT']]
df.head()
```

```
# Use nltk to tokenize the comments
import nltk
df['TOKENS'] = df['CONTENT'].apply(nltk.word_tokenize)
df.head()
```

**TODO:** Complete this section. Demonstrate how easy it is to tokenize the comments using nltk. Handle currency, abbreviations, word filtering, URLs, etc.
- Word normalization: ask in the comments how you would handle "don't" ("don t" or "do not"), "U.S.A.", and what about less obvious abbreviations where periods are unacceptable, like "L.S.T.M."?
- Then see how NLTK handles it and read how it does so. A code snippet can be found in the source of the most suitable tokenizer.
- See how case handling is done by the tokenizer.

- Move on to English lemmatization. Use the example in the slides and show how nltk handles it. Maybe write a simple Python dictionary that does lemmatization. Show pitfalls. 
- This stemmer happens to be the Porter stemmer. Give a short intro to how rules work with the Porter stemmer.

- A very short intro to BPE. Only use the most common BPETokenizer. 

- Note: All library installs need to go in requirement.txt

- Find a good example for German word tokenization. How do your rules stand up against nltk? 
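
As a placeholder for the dictionary idea mentioned in the notes above, here is a minimal sketch. The lookup table `LEMMAS` and the function `lemmatize` are made up for this demo and are not how NLTK works internally; a real lemmatizer needs a full lexicon and part-of-speech information.

```python
# Hypothetical toy lookup table, only large enough for the demo below.
LEMMAS = {"is": "be", "are": "be", "was": "be", "mice": "mouse", "running": "run"}

def lemmatize(tokens):
    # Lower-case for the lookup; words missing from the table are returned unchanged.
    return [LEMMAS.get(token.lower(), token) for token in tokens]

print(lemmatize(["The", "mice", "are", "running"]))
# ['The', 'mouse', 'be', 'run']

print(lemmatize(["a", "running", "joke", "about", "geese"]))
# ['a', 'run', 'joke', 'about', 'geese']
```

The second call shows two pitfalls: "running" is an adjective here and should probably stay as it is, and "geese" is missed because it is not in the table.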

Tokens can also be filtered, with regex or with a simple lookup. For example, we can remove all the abusive words from the comments.


```
# The comments contain a lot of punctuation and abusive words. This step removes the abusive words.

# List of abusive words
abusive_word_list = ['']  # KAVERI: Add the abusive words here

# Remove the abusive words
def remove_abusive_words(tokens):
    return [word for word in tokens if word not in abusive_word_list]

df['TOKENS'] = df['TOKENS'].apply(remove_abusive_words)


# There are a lot of URLs in the comments. We need to remove them.
# (In some cases we may want to retain them, such as in the case of a web crawler. We may want to extract the URLs and crawl them further.)

# Remove the URLs (the abusive-word check is repeated so this function also works on its own)
def remove_urls_and_abusive_words(tokens):
    return [word for word in tokens if not word.startswith('http') and word not in abusive_word_list]


df['TOKENS'] = df['TOKENS'].apply(remove_urls_and_abusive_words)
```
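
The same filtering can be done with regex on the raw comment text, before tokenizing. The sketch below is one possible approach under stated assumptions: the names `URL_RE`, `remove_urls`, and `make_word_filter` are made up here, and the word "heck" merely stands in for an entry of the real word list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    # Replace each URL with a space, then collapse repeated whitespace.
    return re.sub(r"\s+", " ", URL_RE.sub(" ", text)).strip()

def make_word_filter(words):
    # \b keeps a listed word from matching inside a longer word;
    # re.escape guards against regex special characters in the list.
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

print(remove_urls("Check this out http://example.com now"))
# Check this out now

blocked = make_word_filter(["heck"])  # placeholder word list
print(blocked.sub("***", "What the heck, I checked"))
# What the ***, I checked
```

Filtering before tokenization avoids depending on how a particular tokenizer happens to split URLs.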