# Text Processing Exercises

## Exercise 1: Text Analysis Basics
**Objective:** Get comfortable with basic string operations and text manipulation.

- Write a Python script to count the number of words in a given text.
- Create a function that identifies and counts the frequency of each unique word in a text.
- Develop a script to find and replace specific words in a text with another word of your choice.

In [1]:
sample_text = """
In the heart of an ancient forest, a mysterious library stood untouched by time. Its shelves were laden with books of every conceivable subject, from the arcane arts to the natural sciences. The air was thick with the scent of old paper and whispers of knowledge long forgotten. Scholars from distant lands would journey for months to study its tomes, delving into secrets that were as old as the forest itself.

One day, a young wanderer stumbled upon the library. With eyes wide with wonder, she explored its vast halls, her fingers brushing against the spines of books that had not been touched in centuries. The library seemed to welcome her, its dimly lit corridors flickering to life as she passed. In this haven of knowledge, the wanderer found not just the answers to her questions, but also questions she had never thought to ask.
"""



In [2]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /Users/azagar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/azagar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
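# Exercise 1.1: Count the number of words in the given text
# Note: nltk.word_tokenize emits punctuation marks as separate tokens,
# so the count below includes them.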
def count_words(text):
    words = nltk.word_tokenize(text)
    return len(words)

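# Exercise 1.2: Identify and count the frequency of each unique word
# Note: splitting on whitespace leaves punctuation attached, so "forest,"
# and "forest" are counted as different words.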
def word_frequencies(text):
    words = text.lower().split()
    frequencies = {}
    for word in words:
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies

# Exercise 1.3: Find and replace specific words
# Note: str.replace is case-sensitive and matches substrings, not whole words.
def find_and_replace(text, old_word, new_word):
    replaced_text = text.replace(old_word, new_word)
    return replaced_text

# Solutions
word_count = count_words(sample_text)
print(word_count)
frequencies = word_frequencies(sample_text)
print(frequencies)
replaced_text = find_and_replace(sample_text, "library", "sanctuary")
print(replaced_text)

162
{'in': 3, 'the': 11, 'heart': 1, 'of': 6, 'an': 1, 'ancient': 1, 'forest,': 1, 'a': 2, 'mysterious': 1, 'library': 2, 'stood': 1, 'untouched': 1, 'by': 1, 'time.': 1, 'its': 4, 'shelves': 1, 'were': 2, 'laden': 1, 'with': 4, 'books': 2, 'every': 1, 'conceivable': 1, 'subject,': 1, 'from': 2, 'arcane': 1, 'arts': 1, 'to': 6, 'natural': 1, 'sciences.': 1, 'air': 1, 'was': 1, 'thick': 1, 'scent': 1, 'old': 2, 'paper': 1, 'and': 1, 'whispers': 1, 'knowledge': 1, 'long': 1, 'forgotten.': 1, 'scholars': 1, 'distant': 1, 'lands': 1, 'would': 1, 'journey': 1, 'for': 1, 'months': 1, 'study': 1, 'tomes,': 1, 'delving': 1, 'into': 1, 'secrets': 1, 'that': 2, 'as': 3, 'forest': 1, 'itself.': 1, 'one': 1, 'day,': 1, 'young': 1, 'wanderer': 2, 'stumbled': 1, 'upon': 1, 'library.': 1, 'eyes': 1, 'wide': 1, 'wonder,': 1, 'she': 3, 'explored': 1, 'vast': 1, 'halls,': 1, 'her': 2, 'fingers': 1, 'brushing': 1, 'against': 1, 'spines': 1, 'had': 2, 'not': 2, 'been': 1, 'touched': 1, 'centuries.': 1, 's
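
The frequency output above keeps punctuation attached to words, and `str.replace` also rewrites matches inside longer words. A minimal sketch of one way to avoid both, assuming a simple regex tokenizer is acceptable; the helper names `tokenize_words`, `word_frequencies_clean` and `replace_whole_word` are illustrative and not part of the original notebook:

```python
import re
from collections import Counter

def tokenize_words(text):
    """Return lowercase word tokens, dropping surrounding punctuation."""
    # \w+ matches a run of letters/digits; the optional group keeps
    # apostrophe forms such as "john's" together as one token.
    return re.findall(r"\w+(?:'\w+)?", text.lower())

def word_frequencies_clean(text):
    """Count word frequencies on punctuation-free tokens."""
    return Counter(tokenize_words(text))

def replace_whole_word(text, old_word, new_word):
    """Replace old_word only where it appears as a whole word."""
    pattern = r"\b" + re.escape(old_word) + r"\b"
    # A function replacement inserts new_word literally (no backslash escapes).
    return re.sub(pattern, lambda match: new_word, text)

demo = "The heart of the arts is art."
print(word_frequencies_clean(demo).most_common(2))
print(replace_whole_word(demo, "art", "craft"))
```

With `str.replace`, the same call would also change "heart" and "arts"; the `\b` word boundaries restrict the match to the standalone word.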

## Exercise 2: Regular Expressions
**Objective:** Practice using regular expressions for pattern matching and text manipulation.

- Write a Python function that uses regular expressions to find all email addresses in a given text.
- Create a script that extracts all dates (in the format xx/xx/xxxx) from a text.
- Develop a regular expression that identifies all occurrences of Slovene phone numbers in a text.

In [4]:
import re

# Define the sample text
sample_text = """
John's email is john.doe@example.com, and he started working with us on 3rd April 2021. For inquiries, you can also reach out to Jane at jane_doe123@workmail.com. Our office was established on 15/08/1999, and since then, we have been located at 123 Baker Street. Remember to mark the important date, 01-Jan-2023, for our annual meeting. For more information, visit our website or contact admin@ourwebsite.org.
"""

# Regular expression for finding email addresses
email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Regular expression for finding dates in the format xx/xx/xxxx
date_regex = r'\b\d{1,2}/\d{1,2}/\d{4}\b'


# Find all matches in the text
email_addresses = re.findall(email_regex, sample_text)
dates = re.findall(date_regex, sample_text)

# date_regex has no capture groups, so re.findall already returns each date as a full-match string

(email_addresses, dates)


(['john.doe@example.com', 'jane_doe123@workmail.com', 'admin@ourwebsite.org'],
 ['15/08/1999'])
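
The third task, Slovene phone numbers, is not covered by the cell above. The following is a rough sketch under stated assumptions: numbers are written with a `+386` or `00386` country prefix or a leading `0`, followed by eight digits that may be separated by spaces, hyphens or slashes. The function name `find_slovene_phone_numbers` and the demo text are made up for illustration, and the pattern does not validate area or operator codes.

```python
import re

# Prefix (+386, 00386 or trunk 0) followed by exactly eight digits,
# each optionally preceded by a single space, hyphen or slash.
phone_regex = re.compile(r"(?<![\w+])(?:\+386|00386|0)(?:[\s\-/]?\d){8}(?!\d)")

def find_slovene_phone_numbers(text):
    """Return all substrings that look like Slovene phone numbers."""
    return phone_regex.findall(text)

demo_text = "Call 01 234 56 78 or +386 41 123 456, or 041/123-456. Order 12345."
print(find_slovene_phone_numbers(demo_text))
```

The lookbehind and lookahead keep the pattern from matching inside longer digit runs, and a date such as 15/08/1999 is rejected because too few digits follow the leading `0`.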

## Exercise 3: Text Preprocessing Techniques for Slovene
**Objective:** Deepen understanding of text preprocessing techniques.

- Use the classla pipeline to process the given text.
- Iterate through the text and print each word with its lemma and POS tag, one word per line.
- Find a list of Slovene stopwords on the web and filter them out of the given text.

In [5]:
slovene_text = """
V središču starega mesta Ljubljana stoji mogočna Ljubljanska katedrala, ki privablja obiskovalce iz vseh koncev sveta. Zgrajena v baročnem slogu, ta arhitekturni biser razkriva zgodovino in kulturo slovenske prestolnice. Njene veličastne freske in izdelano rezbarstvo vzbujajo občudovanje in spoštovanje med vsemi, ki prestopijo njen prag.

Le nekaj ulic stran, ob bregovih reke Ljubljanice, se razteza živahna tržnica, kjer lokalni pridelovalci vsak dan ponujajo sveže sadje, zelenjavo in druge domače izdelke. Ta kraj je središče mestnega vrveža in priljubljeno zbirališče tako za domačine kot turiste. Sprehajalci lahko uživajo v prijetnem vzdušju, ki ga ustvarjajo številne kavarne in restavracije, ki obdajajo tržnico.
"""

In [6]:
import classla

# Initialize the classla pipeline for Slovene
nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma,depparse')

2024-03-03 11:46:16 INFO: Loading these models for language: sl (Slovenian):
| Processor | Package  |
------------------------
| tokenize  | standard |
| pos       | standard |
| lemma     | standard |
| depparse  | standard |
| ner       | standard |

2024-03-03 11:46:16 INFO: Use device: cpu
2024-03-03 11:46:16 INFO: Loading: tokenize
2024-03-03 11:46:16 INFO: Loading: pos
2024-03-03 11:46:22 INFO: Loading: lemma
2024-03-03 11:46:29 INFO: Loading: depparse
2024-03-03 11:46:30 INFO: Loading: ner
2024-03-03 11:46:30 INFO: Done loading processors!


In [7]:
# Process the text
doc = nlp(slovene_text)

In [8]:
# Iterate through sentences and tokens to extract information
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"Word: {word.text}\tLemma: {word.lemma}\tPart of Speech: {word.upos}")

Word: V	Lemma: v	Part of Speech: ADP
Word: središču	Lemma: središče	Part of Speech: NOUN
Word: starega	Lemma: star	Part of Speech: ADJ
Word: mesta	Lemma: mesto	Part of Speech: NOUN
Word: Ljubljana	Lemma: Ljubljana	Part of Speech: PROPN
Word: stoji	Lemma: stati	Part of Speech: VERB
Word: mogočna	Lemma: mogočen	Part of Speech: ADJ
Word: Ljubljanska	Lemma: ljubljanski	Part of Speech: ADJ
Word: katedrala	Lemma: katedrala	Part of Speech: NOUN
Word: ,	Lemma: ,	Part of Speech: PUNCT
Word: ki	Lemma: ki	Part of Speech: SCONJ
Word: privablja	Lemma: privabljati	Part of Speech: VERB
Word: obiskovalce	Lemma: obiskovalec	Part of Speech: NOUN
Word: iz	Lemma: iz	Part of Speech: ADP
Word: vseh	Lemma: ves	Part of Speech: DET
Word: koncev	Lemma: konec	Part of Speech: NOUN
Word: sveta	Lemma: svet	Part of Speech: NOUN
Word: .	Lemma: .	Part of Speech: PUNCT
Word: Zgrajena	Lemma: zgrajen	Part of Speech: ADJ
Word: v	Lemma: v	Part of Speech: ADP
Word: baročnem	Lemma: baročen	Part of Speech: ADJ
Word: slogu	Lem

In [9]:
import requests

# URL of the Slovene stopwords list
url = "https://raw.githubusercontent.com/stopwords-iso/stopwords-sl/master/stopwords-sl.txt"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Decode the content and split by new lines to get a list of stopwords
    stopwords = response.content.decode('utf-8').splitlines()
    print("Slovene Stopwords:")
    print(stopwords)
else:
    print(f"Failed to fetch stopwords. Status code: {response.status_code}")

Slovene Stopwords:
['a', 'ali', 'april', 'avgust', 'b', 'bi', 'bil', 'bila', 'bile', 'bili', 'bilo', 'biti', 'blizu', 'bo', 'bodo', 'bojo', 'bolj', 'bom', 'bomo', 'boste', 'bova', 'boš', 'brez', 'c', 'cel', 'cela', 'celi', 'celo', 'd', 'da', 'daleč', 'dan', 'danes', 'datum', 'december', 'deset', 'deseta', 'deseti', 'deseto', 'devet', 'deveta', 'deveti', 'deveto', 'do', 'dober', 'dobra', 'dobri', 'dobro', 'dokler', 'dol', 'dolg', 'dolga', 'dolgi', 'dovolj', 'drug', 'druga', 'drugi', 'drugo', 'dva', 'dve', 'e', 'eden', 'en', 'ena', 'ene', 'eni', 'enkrat', 'eno', 'etc.', 'f', 'februar', 'g', 'g.', 'ga', 'ga.', 'gor', 'gospa', 'gospod', 'h', 'halo', 'i', 'idr.', 'ii', 'iii', 'in', 'iv', 'ix', 'iz', 'j', 'januar', 'jaz', 'je', 'ji', 'jih', 'jim', 'jo', 'julij', 'junij', 'jutri', 'k', 'kadarkoli', 'kaj', 'kajti', 'kako', 'kakor', 'kamor', 'kamorkoli', 'kar', 'karkoli', 'katerikoli', 'kdaj', 'kdo', 'kdorkoli', 'ker', 'ki', 'kje', 'kjer', 'kjerkoli', 'ko', 'koder', 'koderkoli', 'koga', 'komu',

In [10]:
# Filter out tokens whose lemma is a stopword (the original word forms are kept, not the lemmas)
lemmatized_filtered_tokens = []

for sentence in doc.sentences:
    for word in sentence.words:
        if word.lemma.lower() not in stopwords:  
            lemmatized_filtered_tokens.append(word.text)

print("Lemmatized and Filtered Tokens:")
print(lemmatized_filtered_tokens)

Lemmatized and Filtered Tokens:
['središču', 'starega', 'mesta', 'Ljubljana', 'stoji', 'mogočna', 'Ljubljanska', 'katedrala', ',', 'privablja', 'obiskovalce', 'koncev', 'sveta', '.', 'Zgrajena', 'baročnem', 'slogu', ',', 'arhitekturni', 'biser', 'razkriva', 'zgodovino', 'kulturo', 'slovenske', 'prestolnice', '.', 'veličastne', 'freske', 'izdelano', 'rezbarstvo', 'vzbujajo', 'občudovanje', 'spoštovanje', ',', 'prestopijo', 'prag', '.', 'ulic', ',', 'bregovih', 'reke', 'Ljubljanice', ',', 'razteza', 'živahna', 'tržnica', ',', 'lokalni', 'pridelovalci', 'ponujajo', 'sveže', 'sadje', ',', 'zelenjavo', 'domače', 'izdelke', '.', 'kraj', 'središče', 'mestnega', 'vrveža', 'priljubljeno', 'zbirališče', 'domačine', 'turiste', '.', 'Sprehajalci', 'uživajo', 'prijetnem', 'vzdušju', ',', 'ustvarjajo', 'številne', 'kavarne', 'restavracije', ',', 'obdajajo', 'tržnico', '.']
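
The filtered list above still contains punctuation and surface forms rather than lemmas. A small sketch of a stricter filter, assuming each word is available as a `(text, lemma, upos)` triple like the ones printed earlier; the sample triples are copied from that output, while `filter_lemmas` and the three-word stopword subset are hypothetical stand-ins so the example runs without classla:

```python
def filter_lemmas(words, stopword_set):
    """Keep lemmas of words that are neither punctuation nor stopwords."""
    kept = []
    for text, lemma, upos in words:
        if upos == "PUNCT":
            continue  # drop commas, periods, etc.
        if lemma.lower() in stopword_set:
            continue  # drop stopwords, compared by lemma
        kept.append(lemma)
    return kept

# (text, lemma, upos) triples taken from the pipeline output shown earlier
sample_words = [
    ("V", "v", "ADP"),
    ("središču", "središče", "NOUN"),
    ("starega", "star", "ADJ"),
    ("mesta", "mesto", "NOUN"),
    (",", ",", "PUNCT"),
    ("ki", "ki", "SCONJ"),
    ("iz", "iz", "ADP"),
]

# A set gives constant-time membership tests, unlike a list
stopword_subset = {"v", "ki", "iz"}

print(filter_lemmas(sample_words, stopword_subset))
```

With the real pipeline, the triples could be built from `word.text`, `word.lemma` and `word.upos`, and the set from the downloaded stopword list.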


## Exercise 4: Rule-Based Systems
**Objective:** Understand and apply rule-based systems for text processing.

- Develop a script that can extract named entities (like names of people, places, etc.) from a list of messages using rule-based patterns.
- Design a simple rule-based system that can classify text messages as "spam" or "not spam" based on specific keywords.

In [11]:
# Sample data: List of text messages
messages = [
    "Win a FREE iPhone! Click here to claim now!",
    "Dear John, your subscription to 'Tech Today' has been confirmed.",
    "You have won $1000 in the Global Lottery! Send your bank details to claim.",
    "Reminder: Meeting with the marketing team at 10 AM in the Tesla Conference Room.",
    "This is your final reminder to pay your Verizon phone bill.",
    "Congratulations, Sarah! You've been selected for a chance to win a Bahamas cruise!",
    "Exclusive offer for Amazon Prime members: Unlock 50% discount on your next purchase.",
    "Your FedEx package has been shipped and is on its way to 123 Elm Street!",
    "Reminder: Your dental appointment with Dr. Anderson is scheduled for tomorrow at 3 PM.",
    "Claim your FREE trial of Adobe Photoshop today.",
    "Urgent: Your Chase Bank account has been compromised! Change your password immediately.",
    "Join Microsoft's webinar on the future of artificial intelligence.",
    "Get rid of debt now! Consolidate your loans with Goldman Sachs into one low monthly payment.",
    "Happy Birthday, Emily! Enjoy a complimentary dinner at Olive Garden.",
    "Your Netflix membership renewal is due. Please update your billing information.",
    "You're invited to Google's exclusive networking event this Friday in San Francisco.",
    "Act now to extend your Toyota car warranty at a special discounted rate.",
    "Final notice: Your eBay account will be deactivated unless action is taken.",
    "Congratulations, Dave! You've earned a reward from Starbucks! Click to redeem your points.",
    "Survey invitation from Airbnb: Share your feedback and receive a $10 gift card."
]

In [12]:
# Define a list of spammy keywords
spam_keywords = ['win', 'free', 'claim', 'congratulations', 'lottery', 'click here']

# Function to classify messages
def classify_messages(messages, spam_keywords):
    classifications = []
    for message in messages:
        # Convert message to lowercase and check for spam keywords
        if any(spam_keyword in message.lower() for spam_keyword in spam_keywords):
            classifications.append("Spam")
        else:
            classifications.append("Not Spam")
    return classifications

# Classify the messages
classifications = classify_messages(messages, spam_keywords)

# Print results
for message, classification in zip(messages, classifications):
    print(f"Message: '{message}'\nClassification: {classification}\n")


Message: 'Win a FREE iPhone! Click here to claim now!'
Classification: Spam

Message: 'Dear John, your subscription to 'Tech Today' has been confirmed.'
Classification: Not Spam

Message: 'You have won $1000 in the Global Lottery! Send your bank details to claim.'
Classification: Spam

Message: 'Reminder: Meeting with the marketing team at 10 AM in the Tesla Conference Room.'
Classification: Not Spam

Message: 'This is your final reminder to pay your Verizon phone bill.'
Classification: Not Spam

Message: 'Congratulations, Sarah! You've been selected for a chance to win a Bahamas cruise!'
Classification: Spam

Message: 'Exclusive offer for Amazon Prime members: Unlock 50% discount on your next purchase.'
Classification: Not Spam

Message: 'Your FedEx package has been shipped and is on its way to 123 Elm Street!'
Classification: Not Spam

Message: 'Reminder: Your dental appointment with Dr. Anderson is scheduled for tomorrow at 3 PM.'
Classification: Not Spam

Message: 'Claim your FREE 
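
The classifier above tests keywords as substrings, so "win" would also match inside words such as "twin" or "winter". A minimal variant using word boundaries, offered as a sketch; `build_spam_pattern` and `is_spam` are illustrative names, not part of the original notebook:

```python
import re

def build_spam_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word or phrase."""
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

def is_spam(message, pattern):
    """Return True if the message contains at least one spam keyword."""
    return pattern.search(message) is not None

keywords = ['win', 'free', 'claim', 'congratulations', 'lottery', 'click here']
pattern = build_spam_pattern(keywords)

for text in ["Win a FREE iPhone! Click here to claim now!", "Twin peaks in winter"]:
    print(text, "->", "Spam" if is_spam(text, pattern) else "Not Spam")
```

Word boundaries reduce false positives, but inflected forms such as "won" or "winning" still need their own keywords.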

In [13]:
import re

def extract_named_entities(message):
    # Match runs of capitalized words, including initials and abbreviations with a trailing period.
    # The lookbehinds skip a match at the very start of the message or right after ". ",
    # where capitalization only marks the start of a sentence.
    # The lookahead (?=\s) requires whitespace after the match, aiming to exclude possessives and
    # contractions; as a side effect, an entity directly followed by punctuation loses its last word.
    entity_pattern = r'(?<!^)(?<!\.\s)(?:[A-Z][a-z]*\.?\s?)+(?=\s)'

    # Find all matches in the message
    potential_entities = re.findall(entity_pattern, message)

    # Post-processing to trim any trailing spaces or periods from the matches
    potential_entities = [entity.strip('. ') for entity in potential_entities]

    return potential_entities

# Apply the extraction function to each message
for message in messages:
    extracted_entities = extract_named_entities(message)
    print(f"Message: '{message}'\nExtracted Entities: {extracted_entities}\n")


Message: 'Win a FREE iPhone! Click here to claim now!'
Extracted Entities: ['FREE', 'Click']

Message: 'Dear John, your subscription to 'Tech Today' has been confirmed.'
Extracted Entities: ['Tech']

Message: 'You have won $1000 in the Global Lottery! Send your bank details to claim.'
Extracted Entities: ['Global', 'Send']

Message: 'Reminder: Meeting with the marketing team at 10 AM in the Tesla Conference Room.'
Extracted Entities: ['Meeting', 'AM', 'Tesla Conference']

Message: 'This is your final reminder to pay your Verizon phone bill.'
Extracted Entities: ['Verizon']

Message: 'Congratulations, Sarah! You've been selected for a chance to win a Bahamas cruise!'
Extracted Entities: ['Bahamas']

Message: 'Exclusive offer for Amazon Prime members: Unlock 50% discount on your next purchase.'
Extracted Entities: ['Amazon Prime', 'Unlock']

Message: 'Your FedEx package has been shipped and is on its way to 123 Elm Street!'
Extracted Entities: ['FedEx', 'Elm']

Message: 'Reminder: Your d

## Exercise 5: Corpus Analysis
**Objective:** Gain experience in working with and analyzing text corpora.

- Download a text corpus in Slovene (ccKres): https://www.clarin.si/repository/xmlui/handle/11356/1034.
- The plain-text version of the corpus contains a large number of documents. Sample and store 1000 of them.
- Analyze the corpus for collocations (frequent word pairs or triplets) and report your findings. NOTE: Since the dataset is relatively large, you can use the .split() method instead of classla.

In [14]:
import os

# Specify the path to the directory containing the files you want to concatenate
directory_path = 'cckresV1_0-text'

# Initialize an empty string to hold the concatenated content and number of documents
concatenated_content = ''
doc_num = 1000

# Loop through the first doc_num directory entries
# (os.listdir returns them in arbitrary order, so this is not a random sample)
for idx, filename in enumerate(os.listdir(directory_path)):
    if idx == doc_num:
        break
    # Construct the full file path
    file_path = os.path.join(directory_path, filename)
    
    # Check if it is a file (and not a directory/subdirectory)
    if os.path.isfile(file_path):
        # Open the file for reading
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read the file's content and concatenate it
            concatenated_content += file.read() + '\n'  # Adding a newline for separation between files

# Optional: Print or save the concatenated content
print(concatenated_content[:300])

# Optional: To save the concatenated content to a new file:
output_file_path = f'concatenated_output_{doc_num}.txt'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    output_file.write(concatenated_content)


20. člen je bil po mnenju pritožnika kršen, ker so v članku razkriti številni osebni podatki kršitelja, hkrati pa tudi otroka. 
V prispevku je objavljeno, kje je družina živela, polno ime in priimek očeta, njegove poklicne dejavnosti ter govorice o tem, da je bil v preteklosti že obsojen za kazniva 
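
The loop above takes the first `doc_num` directory entries rather than drawing a random sample. A sketch of one way to sample reproducibly, assuming every document is a regular file directly inside the corpus directory; `sample_file_paths` and `read_documents` are hypothetical helpers:

```python
import os
import random

def sample_file_paths(directory, k, seed=0):
    """Return up to k randomly chosen regular-file paths from directory."""
    paths = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # A seeded generator makes the sample reproducible across runs
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))

def read_documents(paths):
    """Read each file as UTF-8 and return the contents as a list of strings."""
    documents = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as file:
            documents.append(file.read())
    return documents

# Usage with the directory name from the cell above:
# sampled = sample_file_paths('cckresV1_0-text', 1000)
# concatenated_content = '\n'.join(read_documents(sampled))
```

Joining a list once at the end also avoids repeatedly growing one large string inside the loop.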


In [15]:
tokenized_content = nltk.word_tokenize(concatenated_content, language='slovene')

In [16]:
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.tokenize import word_tokenize

# Ensure you have the 'punkt' tokenizer models downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/azagar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
bigram_finder = BigramCollocationFinder.from_words(tokenized_content)

# Ignore bigrams that occur fewer than 10 times; PMI otherwise favors very rare pairs
bigram_finder.apply_freq_filter(10)

bigram_scores = bigram_finder.score_ngrams(BigramAssocMeasures.pmi)
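
For reference, the PMI score used above compares how often a pair occurs with how often it would occur if the two words were independent, on a base-2 log scale. The sketch below recomputes that quantity from raw counts; the function name `pmi` and the counts are made up for illustration and are not taken from the corpus:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) of a word pair from raw counts.

    PMI = log2( p(x, y) / (p(x) * p(y)) )
        = log2( count_xy * total / (count_x * count_y) )
    """
    return math.log2(count_xy * total / (count_x * count_y))

# Two words that occur 10 times each and always together, in 10240 tokens
print(pmi(10, 10, 10, 10240))

# Two frequent words that rarely co-occur get a negative score
print(pmi(10, 1000, 1000, 10240))
```

Pairs whose words almost never appear apart get the highest scores, which is why a frequency filter is usually applied before ranking by PMI.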

In [18]:
print("Top Bigram Collocations:")
for score in bigram_scores[:25]:  # Adjust the slice for the number of collocations you want to see
    print(score)

Top Bigram Collocations:
(('PASTOR', 'MANDERS'), 16.946640486865206)
(('TROSNEGA', 'PRAHU'), 16.946640486865206)
(('BARVA', 'TROSNEGA'), 16.82110960478135)
(('RAZŠIRJENOST', 'Razširjena'), 16.705632387361412)
(('Naško', 'Križnar'), 16.406072105502506)
(('Ustavna', 'pritožba'), 16.318609264252167)
(('ogljikovih', 'hidratov'), 16.318609264252167)
(('srednjem', 'veku'), 16.318609264252167)
(('Ulica', 'Ristori'), 16.258173410390192)
(('Ruske', 'federacije'), 16.236147104060194)
(('Zdenka', 'Čebašek'), 16.236147104060194)
(('MALO', 'BERILO'), 16.158144592058925)
(('vrednostnih', 'papirjev'), 16.120669886640258)
(('Black/Process', 'Black'), 16.084144010615145)
(('lokalno', 'samoupravo'), 16.08414401061514)
(('Evropsko', 'unijo'), 16.013754682723746)
(('naribanega', 'parmezana'), 15.917136492555125)
(('Black', 'plate'), 15.89149893267275)
(('Toneta', 'Čufarja'), 15.762215915727783)
(('VELIKOST', 'Klobuk'), 15.762215915727781)
(('blagovno', 'znamko'), 15.691826587836383)
(('denarno', 'kaznijo'

In [19]:
trigram_finder = TrigramCollocationFinder.from_words(tokenized_content)

# Ignore trigrams that occur fewer than 10 times
trigram_finder.apply_freq_filter(10)

trigram_scores = trigram_finder.score_ngrams(TrigramAssocMeasures.pmi)

In [20]:
print("\nTop Trigram Collocations:")
for score in trigram_scores[:25]:  # Adjust the slice for the number of collocations you want to see
    print(score)


Top Trigram Collocations:
(('BARVA', 'TROSNEGA', 'PRAHU'), 33.76775009164656)
(('Black/Process', 'Black', 'plate'), 32.49021611611765)
(('Â', 'Â', 'Â'), 30.170256675219033)
(('uporabljeni', 'materiali', 'uvrščajo'), 30.07356936397715)
(('Ustavna', 'pritožba', 'zoper'), 30.052256027783177)
(('novimatajur', '@', 'spin.it'), 29.63846707470159)
(('Slovenskem', 'etnografskem', 'muzeju'), 29.52789846117534)
(('POGLED', 'OD', 'STRANI'), 29.14745451781312)
(('_', '_', '_'), 29.125862555860586)
(('mlet', 'črni', 'poper'), 28.897894996230697)
(('neprečiščenega', 'olivnega', 'olja'), 28.755863025002952)
(('OD', 'STRANI', 'ZGORAJ'), 28.372521073447896)
(('sveže', 'mlet', 'črni'), 28.346485055220565)
(('izdelka', 'franko', 'tovarna'), 28.18180427040053)
(('Državna', 'revizijska', 'komisija'), 27.807518798322256)
(('mag.', 'Blaž', 'Kavčič'), 27.494320514450386)
(('PRVI', 'IN', 'DRUGI'), 27.24829253392228)
(('cene', 'izdelka', 'franko'), 26.799334633578116)
(('d.', 'o.', 'o.'), 26.157532371115224)
(