# Natural Language Processing

## Exercise Sheet 3

In [1]:
#imports for all exercises


### Exercise 1

Rewrite the following loop as a list comprehension:

In [2]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper'] 
result = [] 
for word in sent: 
    word_len = (word, len(word)) 
    result.append(word_len) 
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

In [3]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
print(result)

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
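
As a quick sanity check, the loop and the comprehension can be compared directly. This is a minimal sketch; the names `loop_result` and `comp_result` are introduced here for illustration only.

```python
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# Original loop version
loop_result = []
for word in sent:
    loop_result.append((word, len(word)))

# List comprehension version
comp_result = [(word, len(word)) for word in sent]

print(loop_result == comp_result)  # True
```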


### Exercise 2

Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append "ay", e.g. "string" $\rightarrow$ "ingstray". If a word starts with a vowel, just add "way" to the end, e.g. "idle" $\rightarrow$ "idleway". 

Write a function to convert a word to Pig Latin. Test it with the words "pig", "cheers", and "omelet".


In [4]:
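def pig_latin(word):
    """Convert a single word to Pig Latin."""
    vowels = "aeiouAEIOU"
    if not word:
        return word  # nothing to convert; also avoids an IndexError on word[0]
    if word[0] in vowels:
        # Vowel-initial words just take "way".
        return word + "way"
    # Move the leading consonant cluster to the end, then append "ay".
    for i, letter in enumerate(word):
        if letter in vowels:
            return word[i:] + word[:i] + "ay"
    # No vowel at all (e.g. "hmm"): there is nothing to move.
    return word + "ay"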

words = ["pig", "cheers", "omelet"]
for word in words:
    piglatin_word = pig_latin(word)
    print(f'"{word}" in Pig Latin: "{piglatin_word}"')

"pig" in Pig Latin: "igpay"
"cheers" in Pig Latin: "eerschay"
"omelet" in Pig Latin: "omeletway"
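
The two examples given in the exercise statement ("string" and "idle") make a convenient extra check. The sketch below repeats the conversion logic so that it runs on its own; the `expected` dictionary is an illustrative name, not part of the exercise.

```python
def pig_latin(word):
    vowels = "aeiouAEIOU"
    if word[0] in vowels:
        return word + "way"
    # Find the first vowel and rotate the leading consonants to the end.
    for i, letter in enumerate(word):
        if letter in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"

# Input/output pairs taken from the exercise text.
expected = {"string": "ingstray", "idle": "idleway"}
for word, target in expected.items():
    print(word, "->", pig_latin(word), pig_latin(word) == target)
```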


### Exercise 3

Python's `random` module includes a function `choice()` which randomly chooses an item from a sequence, e.g. `choice('aehh ')` will produce one of four possible characters, with the letter "h" being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string "aehh ", and put this expression inside a call to the `''.join()` function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: "he  haha ee  heheeh eha". Use `split()` and `join()` again to normalize the whitespace in this string.

In [5]:
import random

# Build a string of 500 letters drawn at random from "aehh "
random_letters = ''.join(random.choice("aehh ") for _ in range(500))

# Normalize the whitespace: split() drops runs of spaces, join() re-inserts single ones
normalized_text = ' '.join(random_letters.split())

print(normalized_text)

eeehehhaeha hhhhhah eaehhhhhhhe aehh ahheheeahahhahhehh hahh ahhehhhea hhe h ee heheah he ahh a ea eehhehah aha eehhhaaeh a ahah ee hahahh h hehhhhh he ehehhheah haeeaheaeehhhehehhaa a h ehaaa hhehhhhaeaeehaaeae hh aheaahha ahhe aaaaeheehhahe aeheheheaehhaahah e ha a heh ahhhahheaeaahhheaaeaheheahhh aea h aaee hah hheeehhe hheaaehaeh ee hh he ehhhah h ha ehahhhhae hehhhheeh h ee hh haeh haeahea h he eaehe ehaehhhehhhahhhaa hhahah ee eehe hhahhh hhhhhhhhh ehh aaha ahhhhhehhahea h
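
Because the output is random, each run prints a different string. The following is a minimal sketch, assuming a repeatable run is wanted: seeding the generator with `random.seed()` (the seed value 0 is arbitrary) fixes the sequence, and `collections.Counter` makes it possible to check that "h" turns up roughly twice as often as each of the other characters. The names `raw`, `counts`, and `normalized` are illustrative.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, chosen only to make the run repeatable

raw = ''.join(random.choice("aehh ") for _ in range(500))
counts = Counter(raw)
print(counts)  # 'h' should account for roughly 2/5 of the 500 characters

normalized = ' '.join(raw.split())
print('  ' not in normalized)  # True: no run of two or more spaces remains
```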


### Exercise 4

Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define $\mu_w$ to be the average number of letters per word, and $\mu_s$ to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: $4.71 \mu_w + 0.5 \mu_s - 21.43$. Compute the ARI score for the "lore" and "learned" genres of the Brown Corpus. Make use of the fact that `nltk.corpus.brown.words()` produces a sequence of words, while `nltk.corpus.brown.sents()` produces a sequence of sentences.


In [6]:
import nltk
from nltk.corpus import brown

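def calculate_ari(words, sentences):
    """Return the Automated Readability Index for a tokenized text."""
    # Every token counts as a word here, punctuation tokens included.
    total_letters = sum(len(word) for word in words)
    total_words = len(words)
    total_sentences = len(sentences)

    mu_w = total_letters / total_words      # average letters per word
    mu_s = total_words / total_sentences    # average words per sentence

    ari_score = 4.71 * mu_w + 0.5 * mu_s - 21.43
    return ari_score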

lore_words = brown.words(categories='lore')
lore_sentences = brown.sents(categories='lore')
ari_lore = calculate_ari(lore_words, lore_sentences)

learned_words = brown.words(categories='learned')
learned_sentences = brown.sents(categories='learned')
ari_learned = calculate_ari(learned_words, learned_sentences)

print(f"ARI score for 'lore' genre: {ari_lore:.2f}")
print(f"ARI score for 'learned' genre: {ari_learned:.2f}")


ARI score for 'lore' genre: 10.25
ARI score for 'learned' genre: 11.93
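
To see how the formula behaves without downloading the corpus, here is a small worked example on a hand-made text. The two sentences and the names `toy_sents` and `toy_words` are made up for illustration and are not taken from the Brown Corpus.

```python
# Two invented sentences, already tokenized (no punctuation tokens).
toy_sents = [['The', 'dog', 'barked'], ['It', 'was', 'loud']]
toy_words = [word for sent in toy_sents for word in sent]

mu_w = sum(len(word) for word in toy_words) / len(toy_words)  # 21 / 6 = 3.5
mu_s = len(toy_words) / len(toy_sents)                        # 6 / 2 = 3.0

ari = 4.71 * mu_w + 0.5 * mu_s - 21.43
print(f"{ari:.3f}")
```

By hand, $4.71 \cdot 3.5 + 0.5 \cdot 3 - 21.43 = 16.485 + 1.5 - 21.43 = -3.445$. Short words in short sentences give a low (here negative) score, which fits the higher scores obtained for the two Brown genres.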


### Exercise 5

Define a variable `silly` to contain the string: 'newly formed bland ideas are inexpressible in an infuriating way'. Now write code to perform the following tasks:

a) Split `silly` into a list of strings, one per word, using Python's `split()` operation, and save this to a variable called `bland`.  
b) Extract the second letter of each word in `silly` and join them into a string, to get 'eoldrnnnna'.  
c) Combine the words in `bland` back into a single string, using `join()`. Make sure the words in the resulting string are separated with whitespace.  
d) Print the words of `silly` in alphabetical order, one per line.  

In [7]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'  # define the 'silly' string

bland = silly.split() # task A

second_letters = ''.join(word[1] for word in bland) # task B

combined_words = ' '.join(bland) # task C

sorted_words = sorted(bland) # task D

# Print the results
print("Split 'silly' into a list of words:")
print(bland)

print("\nExtract the second letter of each word and join them into a string:")
print(second_letters)

print("\nCombine the words in 'bland' back into a single string:")
print(combined_words)

print("\nPrint the words of 'silly' in alphabetical order:")
for word in sorted_words:
    print(word)


Split 'silly' into a list of words:
['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible', 'in', 'an', 'infuriating', 'way']

Extract the second letter of each word and join them into a string:
eoldrnnnna

Combine the words in 'bland' back into a single string:
newly formed bland ideas are inexpressible in an infuriating way

Print the words of 'silly' in alphabetical order:
an
are
bland
formed
ideas
in
inexpressible
infuriating
newly
way


### Exercise 6

Rewrite the following nested loop as a nested list comprehension:

In [8]:
words = ['attribution', 'confabulation', 'tenacious', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']
vsequences = set()
for word in words:
    vowels = []
    for char in word:
        if char in 'aeiou':
            vowels.append(char)
    vsequences.add(''.join(vowels))
sorted(vsequences)

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

In [9]:
words = ['attribution', 'confabulation', 'tenacious', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']

vsequences = sorted({''.join(char for char in word if char in 'aeiou') for word in words})
print(vsequences)

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
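
The solution above wraps a generator expression in a set comprehension. If the exercise is read literally as asking for nested list comprehensions, one possible variant (a sketch, not the only valid answer) builds the inner and outer lists explicitly and removes duplicates with `set()` afterwards. The name `vowel_strings` is introduced here for illustration.

```python
words = ['attribution', 'confabulation', 'tenacious', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']

# Inner comprehension: the vowels of one word; outer: one string per word.
vowel_strings = [''.join([char for char in word if char in 'aeiou'])
                 for word in words]

vsequences = sorted(set(vowel_strings))  # drop the duplicate from 'tenacious'
print(vsequences)
```

Both versions produce the same sorted list; the set comprehension simply avoids building the intermediate lists.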
