# NLP. Lab 2. Text processing basics

In this lab, we cover core text-processing concepts in NLP: sentence segmentation, lowercasing, stop word removal, lemmatization, stemming, Byte-Pair Encoding (BPE), and edit distance. Each section gives a short theoretical overview followed by a practical example.


## Sentence Segmentation

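Sentence segmentation splits a text into individual sentences. Boundaries are usually signalled by punctuation such as `.`, `!`, and `?`, but punctuation alone is ambiguous (for example, the period in an abbreviation), so NLTK's `sent_tokenize` relies on the pre-trained Punkt model rather than on a fixed rule.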


In [1]:
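import nltk

# sent_tokenize needs the Punkt model; newer NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt", quiet=True)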

text = "This is a sample text. It contains multiple sentences. Can we segment it?"
sentences = nltk.sent_tokenize(text)

print(sentences)

['This is a sample text.', 'It contains multiple sentences.', 'Can we segment it?']
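
To see why punctuation alone is not a reliable boundary signal, here is a minimal sketch of a naive regex splitter. `naive_split` is a hypothetical helper written for this illustration, not part of NLTK; it handles the simple text above but breaks on abbreviations.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = "This is a sample text. It contains multiple sentences. Can we segment it?"
tricky = "Dr. Smith arrived at 5 p.m. yesterday. He was late."

print(naive_split(simple))  # three sentences, as expected
print(naive_split(tricky))  # four fragments: 'Dr.' is split off on its own
```

The second text contains two sentences, yet the naive rule produces four fragments. Trained segmenters such as Punkt are designed to reduce this kind of error.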


## Lowercasing

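Lowercasing converts all text to lowercase so that variants such as "Text" and "text" map to the same token. The trade-off is that case distinctions that carry meaning (e.g., "US" vs. "us") are lost.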


In [2]:
text = "ThIs Is AN ExaMple Text."
lowercased_text = text.lower()

print(lowercased_text)

this is an example text.
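
For text that is not purely ASCII, `str.lower()` may not be enough to make equivalent spellings compare equal. A small sketch, assuming German input as the example, compares it with Python's built-in `str.casefold()`:

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the "ß", so the three spellings do not all coincide.
print([w.lower() for w in words])

# casefold() maps "ß" to "ss", so all three spellings become identical.
print([w.casefold() for w in words])
```

Whether this extra normalization is desirable depends on the language and the task; for plain English text the two methods generally give the same result.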


## Stop Words Removal

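Stop words are very common words (e.g., "the", "and") that carry little content on their own and are often removed so that processing focuses on the more informative words. Note that splitting on whitespace, as in the cell that follows, leaves punctuation attached to tokens (`'words.'` in its output).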


In [3]:
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "This is an example sentence with some stop words."
stop_words = set(stopwords.words("english"))

filtered_words = [word for word in text.split() if word.lower() not in stop_words]

print(filtered_words)

['example', 'sentence', 'stop', 'words.']
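
One way to avoid the trailing period is to tokenize with a regular expression instead of `split()`. The sketch below is self-contained, so it uses a tiny hand-written stop list (an assumption for illustration); in the lab you would use `stopwords.words("english")` as above.

```python
import re

# Tiny illustrative stop list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "an", "with", "some"}

text = "This is an example sentence with some stop words."

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", text.lower())
filtered = [t for t in tokens if t not in STOP_WORDS]

print(filtered)  # ['example', 'sentence', 'stop', 'words']
```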


## Lemmatization

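Lemmatization reduces a word to its base, or dictionary, form (its lemma) using a vocabulary and morphological analysis. NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed via the `pos` argument.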


In [4]:
import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except LookupError:  # raised by nltk.data.find when the resource is missing
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

[nltk_data] Downloading package wordnet to /kaggle/working/...
Archive:  /kaggle/working/corpora/wordnet.zip
   creating: /kaggle/working/corpora/wordnet/
  inflating: /kaggle/working/corpora/wordnet/lexnames  
  inflating: /kaggle/working/corpora/wordnet/data.verb  
  inflating: /kaggle/working/corpora/wordnet/index.adv  
  inflating: /kaggle/working/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/corpora/wordnet/index.verb  
  inflating: /kaggle/working/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/corpora/wordnet/data.adj  
  inflating: /kaggle/working/corpora/wordnet/index.adj  
  inflating: /kaggle/working/corpora/wordnet/LICENSE  
  inflating: /kaggle/working/corpora/wordnet/citation.bib  
  inflating: /kaggle/working/corpora/wordnet/noun.exc  
  inflating: /kaggle/working/corpora/wordnet/verb.exc  
  inflating: /kaggle/working/corpora/wordnet/README  
  inflating: /kaggle/working/corpora/wordnet/index.sense  
  inflating: /kaggle/working/corpora/wordnet/data.

In [5]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["rocks", "corpora", "cries"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)

['rock', 'corpus', 'cry']
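
The lemma of a word can depend on its part of speech. The sketch below illustrates the idea with a hypothetical hand-written lookup table (`LEMMAS` and `toy_lemmatize` are invented for this example); a real lemmatizer consults a lexical resource such as WordNet instead.

```python
# Hypothetical lemma table keyed by (word, part of speech): "n" = noun, "v" = verb.
LEMMAS = {
    ("saw", "n"): "saw",       # the tool
    ("saw", "v"): "see",       # past tense of "see"
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

for word in ["saw", "leaves"]:
    print(word, "->", toy_lemmatize(word, "n"), "(noun),", toy_lemmatize(word, "v"), "(verb)")
```

With NLTK, the counterpart is passing a tag to the lemmatizer, for example `lemmatizer.lemmatize(word, pos="v")` for verbs.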


## Stemming

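Stemming reduces words to a stem by stripping suffixes with heuristic rules, without consulting a dictionary. It is faster but cruder than lemmatization, and the resulting stem need not be a real word (e.g., "beauti" in the example output).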


In [6]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "rocks", "beautifully"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['run', 'rock', 'beauti']
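
To make "heuristic suffix stripping" concrete, here is a deliberately crude sketch. The rule list and `toy_stem` are assumptions made for illustration and are far simpler than the Porter algorithm, which applies several ordered phases of rules with extra conditions.

```python
# Ordered (suffix, replacement) rules; the first applicable rule wins.
RULES = [("ies", "i"), ("ing", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem=2):
    """Strip one suffix if the remaining stem has at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word

print([toy_stem(w) for w in ["running", "rocks", "beautifully", "cries"]])
# ['runn', 'rock', 'beautiful', 'cri']
```

Compared with the Porter output above, this toy version leaves the doubled consonant in "runn" and strips only one suffix from "beautifully", which hints at why practical stemmers need more elaborate rule sets.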


## Byte-Pair Encoding (BPE)

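Byte-Pair Encoding (BPE) originated as a data compression technique and is used in NLP for subword tokenization. Starting from individual characters, it repeatedly merges the most frequent pair of adjacent symbols into a new symbol, so frequent words end up as single tokens while rare words are split into smaller subword units.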
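
Before turning to the `tokenizers` library, the merge procedure can be sketched in a few lines of plain Python. This is a toy version under simplifying assumptions: `learn_bpe` is a hypothetical helper, there is no end-of-word marker, and ties are broken by the first pair encountered. It is not how the library is implemented.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)       # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(vocab))  # words as sequences of the learned symbols
```

After two merges the frequent word "low" is already a single symbol, while rarer words remain split into smaller pieces.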


In [7]:
# !pip install tokenizers

In [8]:
from tokenizers.processors import TemplateProcessing

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
temp_proc = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", special_tokens.index("[CLS]")),
        ("[SEP]", special_tokens.index("[SEP]")),
    ],
)

In [9]:
from tokenizers import Tokenizer
from tokenizers.normalizers import Sequence, Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.decoders import BPEDecoder

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = BPEDecoder()
tokenizer.post_processor = temp_proc

In [10]:
from tokenizers.trainers import BpeTrainer

In [11]:
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg", quiet=True)
nltk.download("punkt", quiet=True)

trainer = BpeTrainer(vocab_size=5000, special_tokens=special_tokens)
shakespeare = [" ".join(s) for s in gutenberg.sents("shakespeare-macbeth.txt")]
tokenizer.train_from_iterator(shakespeare, trainer=trainer)






In [12]:
print(
    tokenizer.encode(
        "BPE is a data compression technique used in NLP for tokenization."
    ).tokens
)
print(
    tokenizer.encode(
        "Is this a danger which I see before me, the handle toward my hand?"
    ).tokens
)

['[CLS]', 'b', 'pe', 'is', 'a', 'd', 'at', 'a', 'com', 'pre', 'ss', 'ion', 'te', 'ch', 'ni', 'que', 'use', 'd', 'in', 'n', 'lp', 'for', 'to', 'ken', 'iz', 'ation', '.', '[SEP]']
['[CLS]', 'is', 'this', 'a', 'danger', 'which', 'i', 'see', 'before', 'me', ',', 'the', 'handle', 'toward', 'my', 'hand', '?', '[SEP]']


## Levenshtein edit distance

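Edit distance measures how dissimilar two strings are by counting the minimum number of operations needed to transform one into the other. The Levenshtein distance allows three single-character operations: insertion, deletion, and substitution. For example, turning "kitten" into "sitting" takes two substitutions (k→s, e→i) and one insertion (g), so the distance is 3.

For a worked example, see [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance#Example).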


In [13]:
# !pip install python-Levenshtein

In [14]:
import Levenshtein

word1 = "kitten"
word2 = "sitting"
distance = Levenshtein.distance(word1, word2)
print(f"Edit distance between '{word1}' and '{word2}': {distance}")

Edit distance between 'kitten' and 'sitting': 3
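
The same value can be computed without the library using the standard dynamic-programming recurrence. The function below is a minimal sketch with unit costs for all three operations; `levenshtein` is a name chosen for this example and is not the `python-Levenshtein` implementation.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit-cost insertion, deletion and substitution."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table brings memory use down to the length of the second string, while the running time stays proportional to the product of the two lengths.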


# Task


[Competition](https://www.kaggle.com/t/6dcb6f9def724f9f82050e9092952dd6)

The aim of the competition is to find the 10 most frequent words in each of the texts listed in the `data.txt` file and to report their counts.

To count the words correctly, you must lemmatize the text and remove stop words.


In [15]:
with open("/kaggle/input/nlp-week-2/data.txt") as f:
    data = f.read()
plays = data.splitlines()  # unlike split("\n"), yields no empty entry for a trailing newline
plays

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'shakespeare-macbeth.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-caesar.txt']

In [16]:
plays_dict = {}

for play in plays:
    plays_dict[play] = gutenberg.raw(play)
    print(play, len(plays_dict[play]))

austen-emma.txt 887071
austen-persuasion.txt 466292
austen-sense.txt 673022
shakespeare-macbeth.txt 100351
shakespeare-hamlet.txt 162881
shakespeare-caesar.txt 112310


In [17]:
nltk.download("stopwords", quiet=True)
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import regexp_tokenize
from collections import Counter

# The r'\w+' tokenizer used here never emits punctuation tokens,
# so the stop list needs no punctuation entries.
stop_words = set(stopwords.words("english"))

lemmatizer = WordNetLemmatizer()

def top_frequent_words(text, topk=10):
    """Return the topk (word, count) pairs after stop word removal and lemmatization."""
    tokens = regexp_tokenize(text.lower(), r'\w+')
    # Filter before lemmatizing: the default noun lemmatizer turns "was" into "wa",
    # which would otherwise slip past the stop list and be counted as a word.
    content_tokens = [word for word in tokens if word not in stop_words]
    lemmatized_text = [lemmatizer.lemmatize(word) for word in content_tokens]
    # Filter again in case a lemma is itself a stop word.
    filtered_text = [word for word in lemmatized_text if word not in stop_words]

    # Count the occurrences of each word and return the topk most frequent
    return Counter(filtered_text).most_common(topk)

In [19]:
top_words = {}
for play, text in plays_dict.items():
    top_words[play] = top_frequent_words(text)

In [20]:
with open("submission.csv", "w") as f:
    f.write("id,count\n")
    for play, counts in top_words.items():
        # Each entry is a (word, count) pair; only the count is submitted.
        for i, (_, count) in enumerate(counts):
            f.write(f"{play}_{i},{count}\n")