<div style="text-align: right">
    <i>
        LING 5981/6080: Fundamentals of Python <br>
        Fall 2020 <br>
        Aniello De Santo
    </i>
</div>

# Notebook 4: range, zip, enumerate, and useful string methods

This notebook expands on for-loops by introducing `range`, a way to iterate over numbers within a certain range, which in turn gives access to index-based iteration over containers. It also shows how to use `zip` and `enumerate`, and discusses several additional string methods such as `split` and `join`.
Finally, the homework will lead you to use what you have learned so far (specifically, for-loops, if statements, and lists) to implement $n$-gram extraction.

## For-loops: reminder

A _for-loop_ iterates over some object (an **iterable**) and considers the sub-elements of that object in order.

In [None]:
for letter in "apple":
    print(letter)

In [None]:
for letter in "apple":
    print("hello")

In [None]:
indexes = [0, 1, -1, -4]
word = "linguistics"

for index in indexes:
    print(word[index], end="")

In [None]:
cities = ["NYC", "LA", "SF"]
for city in cities:
    print("The current city is", city)
    for ch in city:
        print("\t", ch)

In order to print indexes of items in iterables, we can implement a **counter**, i.e. a variable that will increase every time some condition is met. In this case, we will set the counter to $0$ and increase it with every iteration.

In [None]:
index = 0
for letter in "linguistics":
    print(letter, "\t index:", index)
    index += 1

**Example:** Let's say we are given three lists: a list of states (`states`), a list of average temperatures for those states in the same order (`temperatures`), and a list of the states that are considered New England (`new_england`).

In [None]:
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

temperatures = [62.8, 26.6, 60.3, 60.4, 59.4, 45.1, 49, 55.3, 70.7, 63.5,
                70, 44.4, 51.8, 51.7, 47.8, 54.3, 55.6, 66.4, 41, 54.2, 
                47.9, 44.4, 41.2, 63.4, 54.5, 42.7, 48.8, 49.9, 43.8, 52.7, 
                53.4, 45.4, 59, 40.4, 50.7, 59.6, 48.4, 48.8, 50.1, 62.4, 
                45.2, 57.6, 64.8, 48.6, 42.9, 55.1, 48.3, 51.8, 43.1, 42]

new_england = ["Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut",
               "Rhode Island"]

The code below prints average temperatures for New England states. The variable `index` stores the index of an item we are currently looking at.

In [None]:
index = 0
for state in states:
    if state in new_england:
        print(state+":", temperatures[index])
    index += 1

**Practice:** A helpful function for the following practice exercise is `sum`, which takes a list as an argument and returns the sum of all numbers in that list. FYI, the functions `min` and `max` are available as well.

In [None]:
numbers = [1, 18.3, 9, 0, 3.14]
print("Sum of those numbers is", sum(numbers))
print("The smallest number is", min(numbers))
print("The largest number is", max(numbers))

Modify the code above to print the average temperature in New England. (You can use the `round` function to make the resulting number prettier.)

### Modifying strings

String indexes cannot be reassigned, i.e. the existing parts of a string cannot be modified directly:

In [None]:
string = "hello"
string[-1] = "a"
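
Running the cell above stops with a `TypeError`. The sketch below shows the same failure and one possible workaround, namely building a new string from a slice of the old one. The names `error_message` and `new_string` are only illustrative, and `try`/`except` (not covered in these notebooks so far) is used here only so that the cell keeps running after the error.

```python
string = "hello"

# Assigning to an index of a string raises a TypeError.
try:
    string[-1] = "a"
    error_message = None
except TypeError as error:
    error_message = str(error)
print("Error:", error_message)

# Workaround: create a new string from a slice of the old one.
new_string = string[:-1] + "a"
print(new_string)

# The original string is unchanged.
print(string)
```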

If we have a task to "mask" all vowels from a text, we will need to create a new string based on the old one.

**Practice** Without looking at the code in the next cell, can you think of how to do it?

In [None]:
vowels = "aoiue"
text = "This is a sentence that should contain no vowels."

#try it here by yourself!

In [None]:
vowels = "aoiue"
text = "This is a sentence that should contain no vowels."

masked_text = ""
for char in text:
    if char not in vowels:
        masked_text += char
    else:
        masked_text += "*"
print(masked_text)

**Practice:** You are given a string `alphabet` that contains all English letters, and a string `text`.

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."

Write code that makes this string lowercase and removes all punctuation from the text.

## Range

Say that you want to print the word "hello" ten times. How would you do it? The most trivial answer is "I'll write _print("Hello")_ ten times". But how would you do it with a for-loop? Can you think of a way to make the loop iterate exactly $10$ times?

In [None]:
# Try it here!
print(list(range(1, 1000)))

**Range** is a numeric iterable defined by three arguments: _start_, _end_, and _step_. These arguments behave exactly as they do in slices: _start_ defines the initial numerical value, _end_ is the first value not included in the range, and _step_ defines the difference between each value and the next one.

In [None]:
for value in range(500, 1000):
    print(value, end=" ")

In [None]:
for value in range(1, 10, 2):
    print(value, end=" ")

If only one argument is provided, it is considered to be _end_, and the initial value is assumed to be $0$.

In [None]:
for value in range(10):
    print(value, end=" ")
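
Since the arguments of `range` mirror those of slices, _step_ can also be negative, and the same _start_/_end_/_step_ triple can be used either to produce indexes or to slice directly. A small sketch (the variable names `countdown` and `every_other_index` are made up for this illustration):

```python
word = "linguistics"

# Counting down: start at 10, stop before reaching 0, move by -2.
countdown = list(range(10, 0, -2))
print(countdown)

# The same start/end/step triple, once as a range of indexes...
every_other_index = list(range(0, len(word), 2))
print(every_other_index)

# ...and once as a slice of the word itself.
every_other_letter = word[0:len(word):2]
print(every_other_letter)
```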

A range object does not show its values when printed, but it can easily be converted to a list using the `list` function.
(If you are curious about the nature of the range object, read [this article](https://treyhunner.com/2018/02/python-range-is-not-an-iterator/), but a safe way is to just call it an iterable, or a range object).

In [None]:
print("Printing range object:", range(10))
print("Typecasting range to a list:", list(range(10)))

In order to iteratively get indexes available in some iterable, we can use the following trick: `range(len(iterable))`.

In [None]:
word = "linguist"
for i in range(len(word)):
    print("index:", i, "\tsymbol:", word[i])

**Practice** OK, now you know all you need to know about for-loops! Can you write code that asks the user for $10$ favorite foods, one at a time? Add those foods to a list, and once the user is done, print them back!

In [None]:
# Try it here!

In [None]:
word = input()
bigrams= []

for i in range(len(word)-1):
    bi = word[i:i+2]
    if bi not in bigrams:
        bigrams.append(bi)

print(bigrams)

## N-grams

$n$-gram models are a very basic, fundamental concept in computational linguistics!
Intuitively, $n$-grams are sequences of $n$ consecutive symbols.

    word:   banana
    n:      2
    ngrams: ba, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist

$n$-grams with $n=2$ are called _bigrams_, and $n$-grams with $n=1$ are called _unigrams_.

For computational linguistics and NLP, **$n$-gram models** are extremely important: symbol-level $n$-gram models define which sequences of characters are (im)possible in a certain language, word-level $n$-gram models tell us which words can be adjacent to each other, and so on.
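
To make the word-level case concrete, here is a minimal sketch that collects pairs of adjacent words from a sentence. The example sentence and the variable names are invented for illustration, and it relies on the `split` method, which is introduced later in this notebook.

```python
sentence = "the cat sat on the mat"

# Turn the sentence into a list of words.
words = sentence.split()

# Every word-level bigram is a slice of two adjacent words.
word_bigrams = []
for i in range(len(words) - 1):
    word_bigrams.append(words[i:i+2])

print(word_bigrams)
```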

**Practice:** write code that extracts _bigrams_ from a given word.

In [None]:
word = input("Word: ")
    
bigrams = []
for i in range(len(word) - 1):
    bigram = word[i:i+2]
    if bigram not in bigrams:
        bigrams.append(bigram)
print(bigrams)

## Enumerate and Zip

Object-defining functions that can sometimes be very useful are `enumerate` and `zip`.

**`enumerate`** takes a list (or any other iterable) as input and produces _tuples_ of the form `(index, item)`, pairing every item of the input with its index. Just like `range`, this function creates its own object, which can easily be converted into a list.

In [None]:
input_list = ["NY", "CA", "RI", "CO"]
print(list(enumerate(input_list[1:])))
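
In practice, `enumerate` is most often used directly in a for-loop, where it can replace a hand-written counter. The following is a sketch under that assumption; the list `lines` and the use of the optional `start` argument are additions made only for this illustration.

```python
states = ["NY", "CA", "RI", "CO"]

# Each (index, item) pair is unpacked into two loop variables.
lines = []
for index, state in enumerate(states):
    print(index, state)
    lines.append(str(index) + ": " + state)

# The optional start argument changes the first index.
numbered_from_one = list(enumerate(states, start=1))
print(numbered_from_one)
```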

In [None]:
z = (0,1,2,3,4,4)
print(z[0])

A **tuple** is another basic data type in Python. While tuples share the majority of their functionality with lists, the main difference is that a tuple cannot be modified once it has been created. Tuples can be thought of as "protected lists", but read [here](https://realpython.com/python-lists-tuples/) to learn more.
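
The sketch below illustrates what "protected" means here: reading from a tuple works just as with a list, while assigning to an index fails. The flag `modified` and the unpacking example are illustrative additions, and `try`/`except` is used only so that the cell keeps running after the error.

```python
z = (0, 1, 2, 3, 4, 4)

# Indexing, slicing and len work exactly as with lists.
print(z[0], z[-1], z[1:3], len(z))

# Assigning to an index of a tuple raises a TypeError.
try:
    z[0] = 10
    modified = True
except TypeError:
    modified = False
print("Tuple was modified:", modified)

# A tuple can be unpacked into separate variables.
town, zip_code = ("Stony Brook", 11790)
print(town, zip_code)
```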

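**`zip`** takes an arbitrary number of lists as input and produces tuples, where every tuple is an index-wise combination of items from those lists (i.e. `[(list1[0], list2[0]), (list1[1], list2[1]), ...]` once converted to a list). Like `enumerate`, it creates its own object that can easily be converted into a list.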

In [None]:
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]
zip_codes = [11777, 11790, 11755]
print(list(zip(towns, zip_codes)))
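
Like `enumerate`, `zip` is typically used directly in a for-loop, which avoids index bookkeeping when two lists are aligned. A minimal sketch (the list `pairs` and the `short` example are illustrative additions):

```python
towns = ["Port Jeff", "Stony Brook", "Lake Grove"]
zip_codes = [11777, 11790, 11755]

# Each tuple is unpacked into two loop variables.
pairs = []
for town, zip_code in zip(towns, zip_codes):
    print(town, "has ZIP code", zip_code)
    pairs.append((town, zip_code))

# If the inputs differ in length, zip stops at the shortest one.
short = list(zip("abc", [1, 2]))
print(short)
```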

## Several useful string methods

There are multiple methods that simplify working with strings and lists. In this section, I illustrate the following ones: `replace`, `split`, `strip`, `join`, `startswith`, and `endswith`.

**`replace`** returns a string in which some replacement was performed.

    string.replace(old_substring, new_substring)

In [None]:
string = "Hi friend. It is very nice to see you, friend!"
string = string.replace("friend", "Alex")
print(string)

**Practice:** Using the template provided below, greet everybody whose name is listed in the list `guests`.

In [None]:
template = "Hi, [guest], it is very nice to meet you!"
guests = ["Pearl", "Garnet", "Peridot"]

# your code

**`split`** takes a string and splits it into a list based on the provided separator. If no separator is provided, `split` splits the string on whitespace. An optional second argument limits the number of splits that are performed.

    string.split(separator, maxsplit)

In [None]:
text = "A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen."
parsed_text = text.split(" ",2)
print(parsed_text)

In [None]:
text = "Achessboardappeared"
parsed_text = text.split()
print(parsed_text)

In [None]:
names = "Anna and Mary and John and Sebastian"
list_of_names = names.split(" and ")
print(list_of_names)

In [None]:
names = "Anna, and , Mary and John and Sebastian"
list_of_names1 = names.split(",",1)
list_of_names2 = names.split(",",2)
print(list_of_names1)
print(list_of_names2)
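
One detail worth keeping in mind: splitting on an explicit `" "` is not the same as splitting with no argument when the text contains several spaces in a row. The sketch below uses a made-up string to show the difference, along with the effect of the second argument.

```python
text = "one  two   three"

# An explicit " " separator yields an empty string between adjacent spaces.
by_space = text.split(" ")
print(by_space)

# With no argument, any run of whitespace counts as a single separator.
by_whitespace = text.split()
print(by_whitespace)

# The second argument limits how many splits are performed.
limited = "a,b,c,d".split(",", 2)
print(limited)
```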

**`strip`** removes invisible symbols from both ends of the string. The invisible symbols that `strip` removes are whitespace characters such as ` `, `\n` and `\t`. It is an extremely useful method when working with "dirty" user input, or when processing text files.

    string.strip()

In [None]:
string = "\nHello world!   \t"
string = string.strip()
print("-->" + string + "<--")

**`startswith`** and **`endswith`** are string methods that return booleans depending on the string starting or ending with a certain substring.

    string.startswith(substring)
    string.endswith(substring)

In [None]:
print("'hello' starts with 'hell':", "hello".startswith("hell"))
print("'hello' starts with 'hi':", "hello".startswith("hi"))
print("'hello' starts with 'hello':", "hello".startswith("hello"))

In [None]:
print("'linguistics' ends with 'cs':", "linguistics".endswith("cs"))
print("'linguistics' ends with '':", "linguistics".endswith(""))

**`join`** is a string method that takes a list of strings as its argument and concatenates them, placing the given string between consecutive items. All items in the list must be strings; otherwise, Python raises an error.

    conjunction_string.join(list_to_concatenate)

In [None]:
names = ['Anna', 'Mary', 'John', 'Sebastian']
print(" and ".join(names))

In [None]:
letters = ['P', 'y', 't', 'h', 'o', 'n']
print("".join(letters))
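
`split` and `join` are often used together: one takes a string apart, the other puts the pieces back with a different connector. The sketch below shows this round trip and one way to handle a list of numbers, which have to be converted to strings first. All variable names are illustrative.

```python
sentence = "Anna and Mary and John"

# Take the string apart, then reassemble it with a different connector.
names = sentence.split(" and ")
rejoined = ", ".join(names)
print(rejoined)

# Numbers must be converted to strings before they can be joined.
numbers = [1, 2, 3]
number_strings = []
for number in numbers:
    number_strings.append(str(number))
joined_numbers = "-".join(number_strings)
print(joined_numbers)
```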

# Homework 4

**Due on Sunday, October 4, 11.59pm**

Send your notebook (don't forget to save your solutions!) to <aniello.desanto@utah.edu> with the subject **\[LING 5981/6080\] Homework 4**.

**Problem 1. (3 points)** You are given the following list of English vowels.

In [1]:
vowels = ["a", "o", "i", "u", "e"]

Using the idea of a counter, implement a program that asks the user for a word, and then prints the number of consonants in that word. (For simplicity, we assume that "y" always behaves as a consonant, even though [it is not true](https://www.rd.com/culture/letter-y-vowel-consonant/).)

In [3]:
#instantiate a counter
consonants = 0

word = input("Word: ")
#for every character not in the vowels list
for c in word:
    if c not in vowels:
        #increase the counter
        consonants += 1
        
        
print(consonants)

Word: yellow
4


**Problem 2. (5 points)**
Implement a program that asks the user for the value of $n$ and for a word, and extracts $n$-grams from that word for any $n$ provided by the user.

    word:   banana
    n:      2
    ngrams: ba, an, na, an, na
    
    word:   linguist
    n:      3
    ngrams: lin, ing, ngu, gui, uis, ist
    
    
*Hint 1* If you didn't do it for practice before, start by implementing code that extracts all bigrams. Then think about how you can generalize it to arbitrary $n$.

*Hint 2* Be careful with the *edges* (i.e., the last $n$-gram in each word). And what happens if the word is shorter than the $n$-gram (e.g., the word is "hi" and n=3)? You still need to list "hi"!

**Important** There are multiple libraries for extracting $n$-grams already available in Python. But for this homework you **must** use the concepts we have studied so far. Any solution involving an external library will be counted as 0.

In [4]:
n = int(input("Value of n: "))
word = input("Word: ")

#we don't want ngrams shorter than n
if len(word) < n:
    print("The word is too short.")

#instantiate the ngram list
ngrams = []
# we need n-1 to stop iterating at the right index
for i in range(len(word) - (n - 1)):
    ngram = word[i:i+n]
    if ngram not in ngrams:
        ngrams.append(ngram)
print(ngrams)

Value of n: 2
Word: hello
['he', 'el', 'll', 'lo']


**Problem 3. (12 points - 3 points per part)** You are given the following text.

In [5]:
text = "It was dark, like the bottom of a well. There was a pattern of skulls and bones around \
the frame, for the sake of appearances; Death could not look himself in the skull in a mirror \
with cherubs and roses around it. The Death of Rats climbed the frame in a scrabble of claws and \
looked at Death expectantly from the top. Quoth fluttered over and pecked briefly at his own \
reflection, on the basis that anything was worth a try. Show me, said Death, show me my thoughts. \
A chessboard appeared, but it was triangular, and so big that only the nearest point could be seen. \
Right on this point was the world - turtle, elephants, the little orbiting sun and all. It was the \
Discworld, which existed only just this side of total improbability and, therefore, in border country. \
In border country the border gets crossed, and sometimes things creep into the universe that have \
rather more on their mind than a better life for their children and a wonderful future in the \
fruit picking and domestic service industries. On every other black or white triangle of the \
chessboard, all the way to infinity, was a small grey shape, rather like an empty hooded robe."

You are also given a string that contains all symbols of English alphabet.

In [6]:
alphabet = "abcdefghijklmnopqrstuvwxyz"

_Part 1._ Write some code that generates the list `unique_words`, containing all and only the unique lowercase words from `text`.

You should see the following output (the order can differ!):
    
    ['a', 'infinity', 'reflection', 'with', 'like', 'big', 'briefly', 'into', 'children', 'which', 'fruit', 'picking', 'there', 'try', 'little', 'around', 'appearances', 'appeared', 'all', 'crossed', 'basis', 'improbability', 'their', 'discworld', 'black', 'to', 'death', 'future', 'only', 'my', 'robe', 'things', 'for', 'it', 'existed', 'said', 'sake', 'sometimes', 'right', 'way', 'that', 'country', 'chessboard', 'quoth', 'well', 'domestic', 'skull', 'wonderful', 'hooded', 'or', 'empty', 'bottom', 'mirror', 'himself', 'rather', 'over', 'every', 'triangle', 'roses', 'border', 'orbiting', 'was', 'from', 'show', 'be', 'pecked', 'bones', 'just', 'universe', 'me', 'triangular', 'gets', 'worth', 'have', 'climbed', 'service', 'fluttered', 'top', 'but', 'grey', 'claws', 'at', 'rats', 'creep', 'own', 'pattern', 'point', 'white', 'than', 'dark', 'therefore', 'frame', 'this', 'not', 'the', 'could', 'mind', 'turtle', 'scrabble', 'better', 'industries', 'looked', 'an', 'cherubs', 'life', 'anything', 'more', 'small', 'and', 'of', 'his', 'on', 'skulls', 'elephants', 'in', 'thoughts', 'seen', 'nearest', 'expectantly', 'other', 'side', 'shape', 'total', 'so', 'world', 'look', 'sun']

In [7]:
#we want to filter out punctuation marks
new_text = ""
for s in text.lower():
    if s in alphabet + " ":
        new_text += s
        
#you can use this print to look at the new text       
#print(new_text)

#create a list of words
#(split() with no argument skips repeated spaces, so no empty "words" are created)
text_split = new_text.split()

unique_words = []
#create a copy with only a single instance of each word
for w in text_split:
    if w not in unique_words:
        unique_words.append(w)
        
print(unique_words)

['it', 'was', 'dark', 'like', 'the', 'bottom', 'of', 'a', 'well', 'there', 'pattern', 'skulls', 'and', 'bones', 'around', 'frame', 'for', 'sake', 'appearances', 'death', 'could', 'not', 'look', 'himself', 'in', 'skull', 'mirror', 'with', 'cherubs', 'roses', 'rats', 'climbed', 'scrabble', 'claws', 'looked', 'at', 'expectantly', 'from', 'top', 'quoth', 'fluttered', 'over', 'pecked', 'briefly', 'his', 'own', 'reflection', 'on', 'basis', 'that', 'anything', 'worth', 'try', 'show', 'me', 'said', 'my', 'thoughts', 'chessboard', 'appeared', 'but', 'triangular', 'so', 'big', 'only', 'nearest', 'point', 'be', 'seen', 'right', 'this', 'world', 'turtle', 'elephants', 'little', 'orbiting', 'sun', 'all', 'discworld', 'which', 'existed', 'just', 'side', 'total', 'improbability', 'therefore', 'border', 'country', 'gets', 'crossed', 'sometimes', 'things', 'creep', 'into', 'universe', 'have', 'rather', 'more', 'their', 'mind', 'than', 'better', 'life', 'children', 'wonderful', 'future', 'fruit', 'p

_Part 2._ Write a program that generates the list `bigrams`, which collects all attested bigrams in `unique_words`. Ignore words that are shorter than $2$ characters. **Make sure that the list `bigrams` does not contain duplicates**.

*Hint. You can use the code to extract bigrams you wrote above. Then, you need to have that code iterate over each word in the `unique_words` list and add a check for duplicates!*

In [None]:
bigrams = []

for w in unique_words:
    #again, we want to avoid words that are too short
    if len(w) > 1:
        
        for i in range(len(w) - 1):
            bigram = w[i:i+2]
            if bigram not in bigrams:
                bigrams.append(bigram)
                
print(bigrams)

_Part 3._ Based on the variable `alphabet`, generate all possible bigrams of English. (Hint: look at the second exercise of the previous homework!)

In [8]:
#let's copy alphabet here so as not to forget :)
alphabet = "abcdefghijklmnopqrstuvwxyz"
possible_bigrams = []

#iterate over alphabet once to get the first element of the bigram
for s in alphabet:
    #get the second element of the bigram
    for a in alphabet:
        bigram = s+a
        #just double check to make them unique
        if bigram not in possible_bigrams:
            possible_bigrams.append(bigram)

print(possible_bigrams)

['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay', 'az', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bk', 'bl', 'bm', 'bn', 'bo', 'bp', 'bq', 'br', 'bs', 'bt', 'bu', 'bv', 'bw', 'bx', 'by', 'bz', 'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'cg', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm', 'cn', 'co', 'cp', 'cq', 'cr', 'cs', 'ct', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz', 'da', 'db', 'dc', 'dd', 'de', 'df', 'dg', 'dh', 'di', 'dj', 'dk', 'dl', 'dm', 'dn', 'do', 'dp', 'dq', 'dr', 'ds', 'dt', 'du', 'dv', 'dw', 'dx', 'dy', 'dz', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'eg', 'eh', 'ei', 'ej', 'ek', 'el', 'em', 'en', 'eo', 'ep', 'eq', 'er', 'es', 'et', 'eu', 'ev', 'ew', 'ex', 'ey', 'ez', 'fa', 'fb', 'fc', 'fd', 'fe', 'ff', 'fg', 'fh', 'fi', 'fj', 'fk', 'fl', 'fm', 'fn', 'fo', 'fp', 'fq', 'fr', 'fs', 'ft', 'fu', 'fv', 'fw', 'fx', 'fy', 'fz', 'ga', 'gb', 'gc', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gj', 'gk

_Part 4._ Collect all unattested bigrams of English in the list `unattested_bigrams`. 

*Hint* The unattested bigrams are those bigrams that are possible but not attested in the word sample (you collected all attested bigrams before)!

In [None]:
unattested_bigrams = []

for b in possible_bigrams:
    if b not in bigrams:
        unattested_bigrams.append(b)
        
print(unattested_bigrams)

Don't be surprised that some bigrams from `unattested_bigrams` are actually present in other English words: the text that we are working with is very small! If you are curious, take a larger text, and run your code on it. :)