
In [2]:
import string
from collections import defaultdict

## Read Text Data
A tagged dataset taken from the Wall Street Journal is provided in the file WSJ_02-21.pos.

In [4]:
# Read lines from the 'WSJ_02-21.pos' file and save them into the 'lines' variable
with open("WSJ_02-21.pos",  'r') as f:
  lines = f.readlines()

In [5]:
# Print columns for reference
print("\t\tWord", "\tTag\n")

# Print first five lines of the dataset
for i in range(5):
    print(f'line number {i+1}: {lines[i]}')

		Word 	Tag

line number 1: In	IN

line number 2: an	DT

line number 3: Oct.	NNP

line number 4: 19	CD

line number 5: review	NN



In [6]:
# Print first line (unformatted)
lines[0]

'In\tIN\n'
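As the raw line shows, each non-empty line holds a word and its tag separated by a tab and ends with a newline. The following is a minimal sketch of splitting such a line into its two fields; the helper name `parse_line` is hypothetical and not part of the notebook.

```python
def parse_line(line):
    """Split a raw tab-separated line into a (word, tag) pair.

    Hypothetical helper for illustration; returns None for blank lines.
    """
    stripped = line.rstrip("\n")
    if not stripped.strip():
        # Blank line: nothing to split
        return None
    word, tag = stripped.split("\t")
    return word, tag


print(parse_line("In\tIN\n"))  # ('In', 'IN')
print(parse_line("\n"))        # None
```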

## Creating a vocabulary
Here, the vocabulary consists of every word that appears at least twice in the dataset.

In [7]:
# Get the words from each line in the dataset
words = [line.split('\t')[0] for line in lines]

In [8]:
# Define defaultdict of type 'int'
freq = defaultdict(int)

# Count frequency of occurrence for each word in the dataset
for word in words:
    freq[word] += 1

In [9]:
# Create the vocabulary by filtering the 'freq' dictionary:
# keep words that appeared at least 2 times and skip the blank-line entry '\n'
vocab = [k for k, v in freq.items() if (v > 1 and k != '\n')]

In [10]:
# Sort the vocabulary
vocab.sort()

# Print a few sample entries from the sorted vocabulary
for i in range(4000, 4005):
    print(vocab[i])

Early
Earnings
Earth
Earthquake
East
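The same counting and filtering steps can also be written with `collections.Counter`. This is only a sketch on a toy list of lines (the `sample_*` names and data are assumptions for illustration, not the WSJ file), and it also shows converting the vocabulary to a set for later membership checks.

```python
from collections import Counter

# Toy stand-in for the lines read from the dataset (assumed data)
sample_lines = ["the\tDT\n", "cat\tNN\n", "\n", "the\tDT\n", "dog\tNN\n", "\n", "cat\tNN\n"]

# Take the text before the first tab on each line
sample_words = [line.split("\t")[0] for line in sample_lines]

# Count occurrences of each word in one pass
counts = Counter(sample_words)

# Keep words seen at least twice, skip the blank-line entry, then sort
sample_vocab = sorted(w for w, c in counts.items() if c > 1 and w != "\n")
print(sample_vocab)  # ['cat', 'the']

# A set gives fast membership checks compared with scanning a list
vocab_set = set(sample_vocab)
print("dog" in vocab_set)  # False
```

Looking up a word in a list scans it element by element, so a set may be worth considering once the vocabulary is checked for every line of a large file.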


## Processing new text sources

### Dealing with unknown words

Now that you have a vocabulary, you will use it when processing new text sources. A new text will contain words that are not in the current vocabulary. The simplest approach is to map every such word to a single unknown token, but you can do better with a function that tries to classify the type of each unknown word and assigns it a corresponding unknown token.
This function performs the following checks and returns the appropriate token:

* Check if the unknown word contains any character that is a digit: return --unk_digit--
* Check if the unknown word contains any punctuation character: return --unk_punct--
* Check if the unknown word contains any upper-case character: return --unk_upper--
* Check if the unknown word ends with a suffix that could indicate it is a noun, verb, adjective, or adverb: return --unk_noun--, --unk_verb--, --unk_adj--, --unk_adv-- respectively


If a word matches none of these conditions, its token is a plain --unk--. The conditions are evaluated in the order listed here, so a word that contains a punctuation character but no digits falls under the second condition. This behaviour can be achieved with if/elif statements and early returns.

In [11]:
def assign_unk(word):
    """
    Assign tokens to unknown words
    """
    
    # Punctuation characters
    # Try printing them out in a new cell!
    punct = set(string.punctuation)
    
    # Suffixes
    noun_suffix = ["action", "age", "ance", "cy", "dom", "ee", "ence", "er", "hood", "ion", "ism", "ist", "ity", "ling", "ment", "ness", "or", "ry", "scape", "ship", "ty"]
    verb_suffix = ["ate", "ify", "ise", "ize"]
    adj_suffix = ["able", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "ly", "ous"]
    adv_suffix = ["ward", "wards", "wise"]

    # Loop over the characters in the word, check if any is a digit
    if any(char.isdigit() for char in word):
        return "--unk_digit--"

    # Loop over the characters in the word, check if any is a punctuation character
    elif any(char in punct for char in word):
        return "--unk_punct--"

    # Loop over the characters in the word, check if any is an upper case character
    elif any(char.isupper() for char in word):
        return "--unk_upper--"

    # Check if word ends with any noun suffix
    elif any(word.endswith(suffix) for suffix in noun_suffix):
        return "--unk_noun--"

    # Check if word ends with any verb suffix
    elif any(word.endswith(suffix) for suffix in verb_suffix):
        return "--unk_verb--"

    # Check if word ends with any adjective suffix
    elif any(word.endswith(suffix) for suffix in adj_suffix):
        return "--unk_adj--"

    # Check if word ends with any adverb suffix
    elif any(word.endswith(suffix) for suffix in adv_suffix):
        return "--unk_adv--"
    
    # If none of the previous criteria is met, return plain unknown
    return "--unk--"
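
The "first matching condition wins" ordering can also be expressed as an ordered list of rules. The sketch below is a hypothetical, table-driven variant written only to illustrate precedence: the name `classify_unknown`, the `RULES` table, and the shortened suffix tuples are assumptions, not the notebook's function or its full suffix lists.

```python
import string

PUNCT = set(string.punctuation)

# Ordered (token, predicate) rules; the first predicate that matches wins.
# The suffix tuples are shortened for illustration.
RULES = [
    ("--unk_digit--", lambda w: any(c.isdigit() for c in w)),
    ("--unk_punct--", lambda w: any(c in PUNCT for c in w)),
    ("--unk_upper--", lambda w: any(c.isupper() for c in w)),
    ("--unk_noun--", lambda w: w.endswith(("ness", "ment", "ion"))),
    ("--unk_verb--", lambda w: w.endswith(("ate", "ify", "ise", "ize"))),
]


def classify_unknown(word):
    """Return the token of the first matching rule, or plain --unk--."""
    for token, matches in RULES:
        if matches(word):
            return token
    return "--unk--"


# "Covid-19" has digits, punctuation and an upper-case letter: the digit rule wins
for w in ["Covid-19", "e-mail", "Zorblat", "greatness", "scrutinize", "tardigrade"]:
    print(w, classify_unknown(w))
```

Keeping the rules in a list makes the evaluation order explicit, at the cost of being less direct to read than the if/elif chain.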

A POS tagger will inevitably encounter words that are not in the vocabulary being used. By augmenting the dataset with these unknown-word tokens, you give the tagger a better idea of the appropriate tag for such words.

### Getting the correct tag for a word

In [12]:
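def get_word_tag(line, vocab):
    """
    Return the (word, tag) pair for a dataset line, replacing
    out-of-vocabulary words with an unknown-word token
    """
    # If line is empty return placeholders for word and tag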
    if not line.split():
        word = "--n--"
        tag = "--s--"
    else:
        # Split line to separate word and tag
        word, tag = line.split()
        # Check if word is not in vocabulary
        if word not in vocab: 
            # Handle unknown word
            word = assign_unk(word)
    return word, tag

In [13]:
get_word_tag('\n', vocab)

('--n--', '--s--')

In [14]:
get_word_tag('In\tIN\n', vocab)

('In', 'IN')

In [15]:
get_word_tag('tardigrade\tNN\n', vocab)

('--unk--', 'NN')

In [16]:
get_word_tag('scrutinize\tVB\n', vocab)

('--unk_verb--', 'VB')
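
To preprocess a whole file, the same logic can be applied to every line in turn. The sketch below is self-contained, so it uses a toy vocabulary and a simplified stand-in, `toy_assign_unk`, in place of `assign_unk`; these names, `tag_pairs`, and the sample data are assumptions for illustration. The stand-in only checks for digits, so its tokens differ from what the full function would return.

```python
def toy_assign_unk(word):
    """Simplified stand-in for assign_unk (assumption: digit rule only)."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    return "--unk--"


def tag_pairs(lines, vocab):
    """Turn raw dataset lines into (word, tag) pairs, masking unknown words."""
    pairs = []
    for line in lines:
        parts = line.split()
        if not parts:
            # Blank line: same placeholders as get_word_tag
            pairs.append(("--n--", "--s--"))
            continue
        word, tag = parts
        if word not in vocab:
            word = toy_assign_unk(word)
        pairs.append((word, tag))
    return pairs


toy_vocab = {"In", "an", "review"}  # toy vocabulary stored as a set
raw = ["In\tIN\n", "an\tDT\n", "Oct.\tNNP\n", "19\tCD\n", "review\tNN\n", "\n"]
for pair in tag_pairs(raw, toy_vocab):
    print(pair)
```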