# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer: implement two functions that tokenize text using regular expressions.
2. Get part-of-speech tags from Trump's speech using NLTK.
3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: a function that splits text into sentences,
- ``tokenize_words``: a function that splits text into words.

In [16]:
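from typing import List
import re

def tokenize_words(text: str) -> List[str]:
    """Split text into word, punctuation and emoticon tokens."""
    # Join lines that were broken with a backslash followed by a newline.
    text = re.sub(r'\\\s*\n', ' ', text)

    # Insert a space after an ":o" emoticon that is glued to the next word.
    text = re.sub(r'(:o)(?=\w)', r'\1 ', text)

    # Alternatives, tried left to right: times such as "10.p.m.", emoticons
    # such as ":o", single letters, words with inner hyphens or apostrophes,
    # and finally any single punctuation character.
    pattern = r'\d+\.[ap]\.m\.|\:\w|\b[a-zA-Z]\b|\b\w+(?:[-\']?\w+)*\b|[^\w\s]'
    return re.findall(pattern, text)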

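def tokenize_sentence(text: str) -> List[str]:
    """Split text into sentences, keeping "10.p.m." and ":o" intact."""
    # Join lines that were broken with a backslash followed by a newline.
    text = re.sub(r'\\\s*\n\s*', ' ', text)

    # Protect the dots in times such as "10.p.m." so they do not end a sentence.
    text = re.sub(r'(\d+)\.([ap])\.m\.', r'\1<TIMEABBR>\2<TIMEABBR>m<TIMEABBR>', text)

    # Mark a sentence break after an ":o" emoticon that follows a full stop.
    text = re.sub(r'(\. :o) ([A-Z])', r'\1<SENTBREAK>\2', text)

    # A sentence runs up to the next ., ! or ?, plus an optional trailing ":o".
    pattern = r'[^.!?]*[.!?](?:\s+:o)?(?:<SENTBREAK>)?'
    sentences = re.findall(pattern, text)

    # Drop empty matches, restore the protected dots and remove the markers.
    result = []
    for s in sentences:
        if s.strip():
            s = s.strip()
            s = s.replace('<TIMEABBR>', '.')
            s = s.replace('<SENTBREAK>', '')
            result.append(s)

    return result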
    
text = "Here we go again. I was supposed to add this text later.\
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try.\
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', 'I was supposed to add this text later.', "Well, it's 10.p.m. here, and I'm actually having fun making this course. :o", 'I hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again', '.', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', '.', 'Well', ',', "it's", '10.p.m.', 'here', ',', 'and', 'I', "'", 'm', 'actually', 'having', 'fun', 'making', 'this', 'course', '.', ':o', 'I', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', ',', 'I', 'really', 'did', 'try', '.', 'And', 'one', 'last', 'sentence', ',', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better', '.']
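
The output above splits "I'm" into three tokens while "it's" stays whole. A likely cause is the order of the alternatives in the pattern: the single-letter branch `\b[a-zA-Z]\b` is tried before the general word branch, so "I" is consumed on its own. The sketch below shows one possible adjustment. It is not part of the original solution: `tokenize_words_v2` and `PATTERN_V2` are hypothetical names, and the variant simply drops the single-letter branch, since the general word branch already matches single letters.

```python
import re
from typing import List

# Hypothetical variant of tokenize_words: same branches as the original
# pattern, minus the single-letter branch that split "I'm" apart.
PATTERN_V2 = r"\d+\.[ap]\.m\.|:\w|\b\w+(?:[-']?\w+)*\b|[^\w\s]"

def tokenize_words_v2(text: str) -> List[str]:
    """Tokenize text, keeping contractions such as "I'm" in one piece."""
    return re.findall(PATTERN_V2, text)

sample = "Well, it's 10.p.m. here, and I'm fine. :o"
tokens_v2 = tokenize_words_v2(sample)
print(tokens_v2)
```

Because `re.findall` takes the first alternative that matches at each position, the time pattern still has to come before the general word branch, otherwise "10.p.m." would be split at its dots.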


## Exercise 2. Get tags from Trump's speech using NLTK

Read the ``trump.txt`` file and use NLTK to find the part-of-speech tag of each word.

In [17]:
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/ga1ile0/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/ga1ile0/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ga1ile0/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/ga1ile0/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [18]:
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Read the speech; the with-block closes the file automatically.
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()
words = word_tokenize(trump)

tagged_words = pos_tag(words)

print("First 20 tagged words:")
for word, tag in tagged_words[:20]:
    print(f"{word}: {tag}")

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_words)

print("\nTag frequency distribution:")
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")

First 20 tagged words:
Thank: NNP
you: PRP
very: RB
much: RB
.: .
Mr.: NNP
Speaker: NNP
,: ,
Mr.: NNP
Vice: NNP
President: NNP
,: ,
Members: NNP
of: IN
Congress: NNP
,: ,
the: DT
First: NNP
Lady: NNP
of: IN

Tag frequency distribution:
NN: 667
IN: 549
DT: 456
JJ: 381
NNS: 358
NNP: 322
,: 312
.: 287
PRP: 284
VB: 283
CC: 241
RB: 231
PRP$: 175
VBP: 172
TO: 150
VBN: 127
MD: 114
VBZ: 99
VBG: 95
VBD: 86
CD: 60
NNPS: 36
WDT: 31
WP: 26
JJR: 26
WRB: 21
POS: 19
RP: 19
:: 17
JJS: 9
RBR: 8
PDT: 5
$: 5
'': 5
``: 3
EX: 3
UH: 2
WP$: 1
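
The tags printed above are Penn Treebank tags, in which related classes share a prefix (NN, NNS, NNP and NNPS are all noun tags, for example). As a rough sketch of how the distribution could be summarized, the snippet below collapses some of the reported counts by prefix. The `reported_counts` dictionary is copied from the output above; `coarse_counts` and the prefix-to-label mapping are illustrative assumptions, not part of NLTK.

```python
from collections import Counter

# Counts copied from the tag frequency output above (noun, verb, adjective
# and adverb tags only).
reported_counts = {
    "NN": 667, "NNS": 358, "NNP": 322, "NNPS": 36,
    "VB": 283, "VBP": 172, "VBN": 127, "VBZ": 99, "VBG": 95, "VBD": 86,
    "JJ": 381, "JJR": 26, "JJS": 9,
    "RB": 231, "RBR": 8,
}

# Illustrative mapping from a Penn Treebank tag prefix to a coarse class.
PREFIX_TO_CLASS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_counts(tag_counts: dict) -> Counter:
    """Sum the counts of all tags that share a two-letter prefix."""
    totals = Counter()
    for tag, count in tag_counts.items():
        label = PREFIX_TO_CLASS.get(tag[:2])
        if label is not None:
            totals[label] += count
    return totals

summary = coarse_counts(reported_counts)
for label, count in summary.most_common():
    print(f"{label}: {count}")
```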


## Exercise 3. Get the nouns in the last 10 sentences of Trump's speech, grouped by sentence. Use spaCy.

Use Python list features to get the last 10 sentences and display the nouns found in each of them.
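
The list feature that fits here is presumably slicing with a negative start index. A minimal sketch on a plain list of strings (the `sentences_demo` data is made up for illustration):

```python
# Made-up stand-in for a list of sentences.
sentences_demo = [f"Sentence number {i}." for i in range(1, 26)]

# A negative start index counts from the end, so this keeps the last 10 items.
last_ten = sentences_demo[-10:]

# Slicing never raises on short lists: it returns whatever is available.
short = ["Only one sentence."]
last_ten_of_short = short[-10:]

print(last_ten[0], "...", last_ten[-1])
print(last_ten_of_short)
```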

In [19]:
import spacy

# Read the speech; the with-block closes the file automatically.
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

nlp = spacy.load("en_core_web_sm")

doc = nlp(trump)

sentences = list(doc.sents)

last_ten_sentences = sentences[-10:]

print(f"Last {len(last_ten_sentences)} sentences from Trump's speech:\n")

for i, sentence in enumerate(last_ten_sentences):
    nouns = [token.text for token in sentence if token.pos_ == "NOUN"]
    
    print(f"Sentence {i+1}: {sentence.text.strip()}")
    if nouns:
        print(f"Nouns: {', '.join(nouns)}")
    else:
        print("No nouns in this sentence.")
    print()

Last 10 sentences from Trump's speech:

Sentence 1: When we fulfill this vision, when we celebrate our 250 years of glorious freedom, we will look back on tonight as when this new chapter of American greatness began.
Nouns: vision, years, freedom, tonight, chapter, greatness

Sentence 2: The time for small thinking is over.
Nouns: time, thinking

Sentence 3: The time for trivial fights is behind us.
Nouns: time, fights

Sentence 4: We just need the courage to share the dreams that fill our hearts, the bravery to express the hopes that stir our souls, and the confidence to turn those hopes and those dreams into action.
Nouns: courage, dreams, hearts, bravery, hopes, souls, confidence, hopes, dreams, action

Sentence 5: From now on, America will be empowered by our aspirations, not burdened by our fears; inspired by the future, not bound by failures of the past; and guided by our vision, not blinded by our doubts.
Nouns: aspirations, fears, future, failures, past, vision, doubts

Sentenc

## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement the following methods:

- ``fit_transform``: takes a list of strings and returns a matrix with its BoW representation,
- ``get_feature_names``: returns the list of words corresponding to the columns of the BoW matrix.
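
To make the target representation concrete, here is a toy sketch of the two steps involved: build a sorted vocabulary, then count each word per document. It uses `str.split` instead of the tokenizer from Exercise 1 and plain lists instead of a NumPy matrix, so `toy_bow` is an illustration under those assumptions, not the expected solution.

```python
from typing import List, Tuple

def toy_bow(corpus: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    docs = [doc.lower().split() for doc in corpus]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for doc in docs for word in doc})
    column_of = {word: idx for idx, word in enumerate(vocabulary)}

    matrix = [[0] * len(vocabulary) for _ in docs]
    for row, doc in zip(matrix, docs):
        for word in doc:
            row[column_of[word]] += 1
    return vocabulary, matrix

vocab_demo, matrix_demo = toy_bow(["see below", "really see below see"])
print(vocab_demo)
print(matrix_demo)
```

Each row describes one document and each column one vocabulary word, so the second row records that "see" occurs twice in the second document.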

In [20]:
import numpy as np

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        self.__vocabulary = []
        self.__feature_matrix = None

    def fit_transform(self, corpus: list) -> np.ndarray:
        """Learn the vocabulary of ``corpus`` and return its document-term count matrix."""
        # Tokenize each document once, keeping lowercased alphanumeric tokens only.
        tokenized_docs = [
            [word.lower() for word in tokenize_words(doc) if word.isalnum()]
            for doc in corpus
        ]

        # The vocabulary is the sorted set of all tokens; each token owns one column.
        self.__vocabulary = sorted({word for doc in tokenized_docs for word in doc})
        column_of = {word: idx for idx, word in enumerate(self.__vocabulary)}

        matrix = np.zeros((len(corpus), len(self.__vocabulary)), dtype=int)
        for doc_idx, doc in enumerate(tokenized_docs):
            for word in doc:
                matrix[doc_idx, column_of[word]] += 1

        self.__feature_matrix = matrix
        return matrix

    def get_feature_names(self) -> list:
        """Return the vocabulary words in column order."""
        return self.__vocabulary

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
len(vectorizer.get_feature_names())

[[0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1]
 [1 0 0 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 2 1 0 0 1 0 1 0 0]]


31

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and the closing ``.``, ``!`` or ``?``.

Hint: for 2. and 3. implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It is probably easier to fix the text after it has been generated than to change the while loop.
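
As a sketch of what changing the model order means: an N-gram model predicts the next token from the previous N-1 tokens, so a 5-gram model uses four tokens of context. The helper below, `context_counts`, is a hypothetical name used only for illustration. It is run with N=3 on a made-up token list so that the result stays readable.

```python
from collections import defaultdict
from typing import Dict, List

def context_counts(tokens: List[str], n: int) -> Dict[str, Dict[str, int]]:
    """Map each (n-1)-token context to the counts of the tokens following it."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        *context, following = tokens[i:i + n]
        counts[" ".join(context)][following] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {ctx: dict(nxt) for ctx, nxt in counts.items()}

demo_tokens = "the cat sat on the cat sat down".split()
demo_counts = context_counts(demo_tokens, 3)
print(demo_counts["cat sat"])
```

Here the context "cat sat" has two possible continuations, while "the cat" is always followed by "sat". Longer contexts tend to leave fewer continuations to choose from.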

In [21]:
from nltk.book import *

wall_street = text7.tokens

import random
import re

tokens = wall_street

def cleanup():
    """Keep only tokens that start with a letter, a digit or sentence punctuation."""
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match, tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    """Return every run of N consecutive tokens (N is set in the next cell)."""
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    """Map each (N-1)-token context to the counts of the tokens that follow it."""
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

def next_word(text, N, counts):
    """Sample the next token, weighted by how often it followed the last N-1 tokens."""
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto > r: return choice
    assert False # should not reach here
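
The cumulative-weight loop in `next_word` is a hand-written weighted random choice. The standard library offers the same operation as `random.choices`; the sketch below shows it on a made-up `followers` dictionary. `sample_next` is a hypothetical helper, not part of the original notebook.

```python
import random
from typing import Dict

def sample_next(followers: Dict[str, int], rng: random.Random) -> str:
    """Pick a key with probability proportional to its count."""
    words = list(followers)
    weights = list(followers.values())
    # random.choices returns a list of k items; k defaults to 1.
    return rng.choices(words, weights=weights)[0]

rng = random.Random(0)  # seeded only to make the demo repeatable
followers = {"on": 3, "down": 1}
draws = [sample_next(followers, rng) for _ in range(1000)]
print(draws.count("on"), draws.count("down"))
```

With weights 3 and 1, "on" should come up roughly three times as often as "down".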

In [22]:
import random

def clean_generated(text: str) -> str:
    """Fix spacing before sentence punctuation and capitalize each sentence."""
    # Remove the whitespace the generator leaves before ., ! or ?.
    text = re.sub(r'\s+([.!?])', r'\1', text)

    # Capitalize the first letter of the text and of every sentence,
    # keeping the space that follows the sentence-ending punctuation.
    return re.sub(
        r'(^|[.!?]\s+)([a-z])',
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

N=5

SEP=" "

sentence_count=5

ngrams = build_ngrams()
counts = ngram_freqs(ngrams)

some_keys = random.sample(list(counts.keys()), 5)
print("Available sequences:", some_keys)

start_seq = some_keys[0]

generated = start_seq

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

generated = clean_generated(generated)

print(generated)

Available sequences: ['a bill because of', 'amphobiles according to Brooke', 'as the 1980s bull', 'night if the magazine', '31 1993 at a']
A bill because of what it views as an undesirable intrusion into the affairs of industry but the 300-113 vote suggests that supporters have the potential 0 to override a veto. The broader question is where the Senate stands on the issue. While the Senate Commerce Committee has approved legislation similar to the House bill on airline leveraged buy-outs the measure has n't yet come to the full floor. Although the legislation would apply to acquisitions involving any major airline it is aimed at giving the Transportation Department the chance to review in advance transactions financed by large amounts of debt. The rating concern said 0 the coupon rate has n't yet been fixed but will probably be set at around 8.
