# Topic 0: Introduction to Python (Part 2)

This is the second part of the Introduction to Python for Natural Language Engineering course.

These notebooks are designed to give you the working knowledge of Python necessary to complete the lab sessions for Natural Language Engineering. 

From the last notebook you should be familiar with Python types, basic operators, identifiers, booleans and conditions, lists and strings. This notebook introduces defining functions, using comments and docstrings, and working with more data structures such as sets, tuples and dictionaries.

As in the last session:-

- Run all of the code cells as you work through the notebook. 
- Try to understand what is happening in each code cell and predict the output before running it.
- Complete all of the exercises.
- Solutions to all exercises are provided, but please avoid loading the solution until you have had a go at solving it yourself.


Run the following cell twice: first to load some setup code, then again to run it.

In [None]:
%load ../setup


## Functions
Functions are defined using the keyword `def`, followed by a function name, and a list of parameters in parentheses. Don't forget the `:` after the closing parenthesis.

The body of the function starts on the next line, and must be indented.

In [None]:
def double(number):
    return number * 2

In [None]:
double(13)

In [None]:
type(double)

In [None]:
def add_question_mark(string):
    return string + "?"

In [None]:
add_question_mark("what's your name")

In [None]:
def print_first_half(string):
    half_length_of_string = len(string)//2 #use floor division as indices must be integers
    return string[:half_length_of_string]

In [None]:
print_first_half('hi how are you doing?')

### Exercise
In the empty cell below define a function called `square` that returns an input parameter squared. 

Hint: check the 'basic functions' section above for the Python syntax for exponentiation.

In [None]:
# %load solutions//square_function


### Exercise
In the empty cell below define a function `makelist` that takes a sentence string as an input, and returns a list of the words in the sentence.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//make_list

### Comments and docstrings

Look at the code in the code cell below.

- The first block of text shows how docstrings are used in Python. By convention, every function definition should begin with a block of documentation (a docstring) in triple quotes, like the first block in the program below.
- Comments in the code itself are introduced by a `#`, either on a separate line or appended to the end of a line. Python ignores the rest of a line after a `#`.

When you press Shift-Enter to execute the cell, the function is defined in the kernel and can be called from any cell in the notebook. The function definition ends as soon as the indentation ceases (here, at the comment `# Here is the argument:`). After creating the function, the kernel continues to execute the rest of the cell, thereby calling the function.

Notice how the program splits a character string at newline (`"\n"`) characters. This works because `split` is a built-in method of the string (`str`) data type, so every string can be split in this way.
- `\n` is the newline character.
- `\t` can be used similarly for reading tab-separated data.
- If you leave the argument empty, `split` treats any run of whitespace as a single delimiter. This has the advantage that a double space is treated as one delimiter.
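
The delimiter options listed above can be compared side by side. The following is a small sketch; the strings and variable names are made up for illustration and are not part of the notebook's data.

```python
# Illustrative strings (assumptions, not taken from the course files)
two_lines = "first line\nsecond line"
tab_row = "word\tcount\ttag"
spaced = "a  double   spaced    string"

by_newline = two_lines.split("\n")   # split at newline characters
by_tab = tab_row.split("\t")         # split at tab characters
by_space = spaced.split(" ")         # a single-space delimiter leaves empty strings behind
by_whitespace = spaced.split()       # no argument: a run of whitespace is one delimiter

print(by_newline)
print(by_tab)
print(by_space)
print(by_whitespace)
```

Comparing the last two lines of output shows why calling `split()` without an argument is usually the more convenient choice for text with irregular spacing.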

In [None]:

def count_paragraphs(input_text):
    """
    A paragraph is defined as the text before a newline character, i.e. "\n".
    Take a character string, split it into paragraphs, count them
    and return the count.
    :param input_text: a character string containing paragraph marks.
    :return: integer, the number of paragraphs.
    """
    
    # The following statement creates a list of strings by breaking
    # up input_text wherever a "\n" character occurs
    
    paragraphs = input_text.split("\n")  
    
    # The len() function counts the number of elements in the list
    
    return len(paragraphs)


# Here is the argument:

sample_text = "This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."
print (sample_text)

# Here is the function call:

print("Number of paragraphs:", count_paragraphs(sample_text))


### Exercise
In the blank cell below examine the contents of the variable `sample_text` with and without the print function.
- Notice that execution of the first cell means the variable is now in the kernel and accessible to any cell.
- Use the same box to try printing the variable `input_text`. What happens? Why?

In [None]:
sample_text



## Sets  
Sets are *unordered* collections of *unique* elements.

Note the use of curly brackets rather than the square brackets used for lists.

In [None]:
unique_numbers = {1, 2, 2, 2, 3}
unique_numbers

In [None]:
type(unique_numbers)

To initialise an empty set, use `set()`

In [None]:
new_set = set()
type(new_set)

To add an element to a set use the method `add`.

In [None]:
unique_numbers.add(5)

Use `len` to give the number of elements in a set.

In [None]:
len(unique_numbers)

To check whether an element is present in a set, use the keyword `in`.

This is similar to the use of `in` for lists and strings.

In [None]:
2 in unique_numbers

Iterating over a set

The syntax for iterating over a set is similar to that used when iterating over a list. 

Remember to use `for`, `in`, `:` and indentation.

In [None]:
for number in unique_numbers:
    print(number * 3)

In [None]:
for number in unique_numbers:
    print (double(number))

### Exercise
In the empty cell below create a function called `get_vocabulary` that takes a *list* of words as input, and returns a *set* of the words in the list.

Use your function `get_vocabulary` to create the set `dickens_vocab`, a set of the unique words in `opening_line` (see above).

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//get_vocabulary

    
The code in the next cell shows how to create a vocabulary with the set datatype.

In [None]:
def get_vocabulary(input_text):
    """
    A word is defined as a character string delimited by
    whitespace (spaces, tabs or newlines).
    Given an input string, split it into words and
    return the set of unique words in the input.
    :param input_text: Character string with some text
    :return: The set of unique words in the input.
    """
    
    list_of_words = input_text.split()
    
    # The following line takes the list of words,
    # removes repetitions and creates a set:

    return set(list_of_words)

# The following loop is repeated for every word in the set
# Note that a set is just one of many iterable objects:


for word in get_vocabulary(sample_text):
    print (word)

## Dictionaries
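A dictionary is a collection of key:value pairs. Values are looked up by key rather than by position (since Python 3.7 a dictionary does remember the order in which its keys were inserted).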

Keys are used to index the dictionary.

The main operations are storing a value with a key, and then extracting a specific value using its key. 

Each key in a given dictionary must be unique. 

A dictionary is initialised with curly braces. This can contain comma-separated key:value pairs. 

Note the use of ':' to map a key to a value.

In [None]:
simpsons_ages = {"Bart":10, "Lisa":8, "Homer" : "thirty something"}
simpsons_ages

In [None]:
type(simpsons_ages)

Accessing the values of keys in a dictionary

In [None]:
simpsons_ages["Homer"]

In [None]:
simpsons_ages['Bart']

Getting the number of elements in a dictionary.

Just like getting the length of a list, we use the built-in function `len`.

In [None]:
len(simpsons_ages)

Checking the presence of a key in a dictionary.

In [None]:
"Marge" in simpsons_ages

In [None]:
"Bart" in simpsons_ages

Accessing a key that does not exist raises a `KeyError`.

In [None]:
simpsons_ages["Krusty"]

A key that may not exist can be looked up with the `get` method, whose second argument is the default value to return in that case.

In [None]:
simpsons_ages.get("Krusty","I don't know")

Adding a new key:value entry to the dictionary.

In [None]:
simpsons_ages["Marge"] = 34
simpsons_ages["Marge"]

### Exercise
In the blank cell below add two extra key-value pairs to the dictionary `simpsons_ages`, each consisting of a name and corresponding age.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//more_simpsons

Use a `for` loop to iterate over *keys* in the dictionary.

In [None]:
for person in simpsons_ages:
    print(person)

Use the `items` method to iterate over the key-value pairs of a dictionary.

In [None]:
for item in simpsons_ages.items():
    print(item)

In [None]:
# Note that 'person' and 'age' here are arbitrary variable names, and can be replaced with any two names, e.g. 'key' and 'value'
for person, age in simpsons_ages.items():
    print(person, "is", age, "years old")

### Exercise
In the blank cell below make a new dictionary called `polygons` where the keys are names of shapes and the values are the corresponding number of sides.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//polygons

### Exercise
In the blank cell below iterate over the keys and values, printing each key and value in a sentence (eg 'a triangle has 3 sides').

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//print_polygons

### Exercise
In the empty cell below write code that will print, one word per line, each word in `dickens_words` together with the number of times that word appears in `dickens_words`.

In [None]:
# %load solutions//print_dickens_counts
           

## Files
Files have a file path and in the cell below we use the variable `input_file_path` to hold a string that contains a file path.

In [None]:
#Make sure the file path points to a valid file
#input_file_path = "N:/nle_notebooks/sample_text.txt"
input_file_path = "/Users/davidw/Documents/teach/NLE/NLE Notebooks/Topic 0/sample_text.txt"

We now use the file path variable to *open* the file. We need to do this before reading/writing to it.

In [None]:
input_file = open(input_file_path)
type(input_file)

Use the `read` method to read the entire file contents into a `str` variable called `input_text`.

In [None]:
input_text = input_file.read()
type(input_text)

In [None]:
input_text

When you are done with the file, close it.

In [None]:
input_file.close()

After the file has been closed it can no longer be read: the next cell raises a `ValueError`.

In [None]:
input_text = input_file.read()
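
The open, read and close steps above can also be tried without relying on `sample_text.txt`. The sketch below writes a throwaway file first; the names `demo_path`, `demo_file` and `demo_text` are hypothetical and used only here. It also shows a `with` block, which is a common way to make sure a file is closed automatically (an addition to the approach used in this notebook, not a replacement for it).

```python
import os
import tempfile

# Create a small temporary file so the example does not depend on sample_text.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("It was the best of times")
    demo_path = tmp.name  # hypothetical path, used only for this demonstration

demo_file = open(demo_path)
demo_text = demo_file.read()
demo_file.close()

try:
    demo_file.read()  # reading after close() is not allowed
except ValueError as error:
    print("Could not read the closed file:", error)

# A with-block closes the file automatically when the block ends
with open(demo_path) as demo_file:
    demo_text_again = demo_file.read()

print(demo_file.closed)
os.remove(demo_path)  # tidy up the temporary file
```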

### Exercise
In the blank cell below write a function, `print_word_counts`, that will take a file path as an argument, open the file, then print, one word per line, each word in the file together with the number of times that word appears in the file.

Test your function by running it on `sample_text.txt`.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//print_word_counts

## Tuples

A tuple consists of a number of values separated by commas. These can be of different types. A tuple is usually written in parentheses, with its elements separated by commas.

In [None]:
person = ("Jon", 14, "jon@thewall.com")
person

In [None]:
type(person)

Use `len` to count the number of elements in a tuple.

In [None]:
len(person)

Indexing into a tuple is similar to indexing into a list.

In [None]:
person[0]

In [None]:
person[-2:]
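
Two further properties of tuples are worth seeing in a short sketch: a tuple can be unpacked into separate variables, and, unlike a list, it cannot be changed once created. The name `demo_person` is made up for this example so that it does not overwrite `person`.

```python
demo_person = ("Jon", 14, "jon@thewall.com")

# Unpacking: one variable per element, in order
name, age, email = demo_person
print(name, age, email)

# Tuples are immutable, so assigning to an index raises a TypeError
try:
    demo_person[1] = 15
except TypeError as error:
    print("Tuples cannot be changed:", error)
```

Unpacking is the same mechanism that lets a `for` loop assign two names at once when iterating over the pairs returned by `items`.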

It can be useful to use tuples as values in dictionaries.

In [None]:
#Note that each key is a string, and each value is a tuple
people = {"Joffrey":(12, "Baratheon", "joff@kingslanding.com"), "Jon":(14, "Snow", "jon@thewall.com")}
people["Joffrey"]

In [None]:
# Jon's age: we access this using the dictionary key, and then indexing within the value:
people["Jon"][0]

In [None]:
# Joffrey's email
people["Joffrey"][2]

In [None]:
# List everyone's first and last names:
for person, record in people.items():
    print(person, record[1])

### Exercise
In the blank cell below create a dictionary called `address_book`, with at least 3 key-value entries. Each should consist of a person's name in string format (the key), and a tuple with corresponding pieces of information about them (the value).

Once you've done that, iterate over the address book, printing the information about each person in a sentence.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//prime_ministers

### Exercise
Make sure that you understand the code in the following cell. It calculates the number of sentences in each paragraph of a text. 

Can you see where tuples are being used?

In [None]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(1, 1), (2, 2), (3, 3), (4, 1)]
    """
    
    paragraphs = input_text.split("\n")
    sentences_per_paragraph = list()  # create an empty list
    paragraph_number = 0
    for paragraph in paragraphs:
        paragraph_number += 1
        sentences = paragraph.split(". ")
        number_of_sentences = len(sentences)
        
        # create a tuple with the paragraph number and the number of sentences
        # in it, then append the tuple to the list:
        
        sentences_per_paragraph.append((paragraph_number, number_of_sentences))
    return sentences_per_paragraph


print (sample_text)
for para, count in count_sentences_per_paragraph(sample_text):
    print("paragraph {0} contains {1} sentence(s)".format(para,count))

## The range function

This produces a `range` object: a sequence of numbers in a specified range, which are generated on demand rather than stored as a list.

In [None]:
indices = range(0,5)
indices

In [None]:
type(indices)

In [None]:
len(indices)
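
Displaying a `range` does not show the numbers it contains. A quick way to inspect them is to convert the range to a list, as in this small sketch (the name `demo_range` is chosen only for this example):

```python
demo_range = range(0, 5)

as_list = list(demo_range)             # the stop value 5 is not included
third = demo_range[2]                  # a range can be indexed like a list
stepped = list(range(2, 11, 3))        # an optional third argument sets the step

print(as_list)   # [0, 1, 2, 3, 4]
print(third)     # 2
print(stepped)   # [2, 5, 8]
```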

The output from a `range` can be used as a set of indices.

In [None]:
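# Define `words` here in case it is not already in the kernel
words = 'It was the best of times, it was the worst of times'.split()
for i in indices:
    print(words[i])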

If `range` is given a single parameter, it will create a range from zero.

In [None]:
for i in range(10):
    print (i)

### Exercise
In the blank cell below use `range` to print a list of the first 10 integers.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//first_ten_ints

### Exercise
In the cell below use `range` to print a list of the first 10 cubes.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//first_ten_cubes

## The zip function

The `zip` function is used to pair up the corresponding elements of multiple iterables.

It takes multiple iterables as arguments and returns an iterator of tuples, where the i-th tuple consists of the i-th element from each of the input iterables.

In the example below, we 'zip together' `words` and `indices` into a series of tuples called `word_positions`. For example, the 3rd element of `word_positions` contains the 3rd element of `words` and the 3rd element of `indices`.

In [None]:
words = 'It was the best of times, it was the worst of times'.split()
indices = range(len(words))
word_positions = zip(words, indices)
type(word_positions)

In [None]:
for word, position in word_positions:
    print("'{0}' is in position {1}".format(word,position))
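
One consequence of `zip` returning an iterator is that the pairs can only be consumed once. The sketch below illustrates this with made-up names (`demo_words`, `pairs`); converting to a list with `list(...)` is one way to keep the pairs for repeated use.

```python
demo_words = ["the", "cat", "sat"]
pairs = zip(demo_words, range(len(demo_words)))

first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left the second time

print(first_pass)    # [('the', 0), ('cat', 1), ('sat', 2)]
print(second_pass)   # []
```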


### Exercise
In the blank cell below write a function, `show_word_positions`, that takes a file path as its argument. The function should read the text from the file, split the text on whitespace, and then print out each word and its position as in the above example.

Test your function out on `sample_text.txt`.

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//show_word_positions

The cell below contains a new version of the code that calculates the number of sentences in each paragraph of a text. 

This version makes use of `range` and `zip`.

In [None]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(0, 1), (1, 3), (2, 3), (3, 1)]
    """
    
    paragraphs = input_text.split("\n")
    sentence_counts = []
    for paragraph in paragraphs:
        number_of_sentences = count_sentences(paragraph)
        sentence_counts.append(number_of_sentences)
    
    # Create the paragraph numbers we need:
    
    paragraph_numbers = range(len(paragraphs))
    
    # Pair each paragraph number with its sentence count and
    # collect the pairs in a list of tuples:
    sentences_per_paragraph = list(zip(paragraph_numbers, sentence_counts))
    return sentences_per_paragraph

def count_sentences(paragraph):
    """
    A sentence is a character string delimited by a period "."
    Given an input paragraph, return the number of sentences
    in it.
    :param paragraph: Character string with sentences.
    :return: number of sentences in the input paragraph
    """
    
    sentences = paragraph.split(".")
    return len(sentences)

for para, count in count_sentences_per_paragraph(sample_text):
    print("paragraph {0} contains {1} sentence(s)".format(para,count))

In situations where you are zipping lists that may be of different lengths, and want to pad out any 'missing' elements, you will find `zip_longest` useful.

In [None]:
from itertools import zip_longest

listA = ["the","cat","sat"]
listB = ["a","dog","lay","down"]
for elem in zip_longest(listA,listB):
    print(elem)
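
For comparison, the sketch below puts plain `zip` next to `zip_longest`, and also uses the `fillvalue` argument of `zip_longest` to choose the padding value. The padding string `"<PAD>"` and the variable names are arbitrary choices for this illustration.

```python
from itertools import zip_longest

list_a = ["the", "cat", "sat"]
list_b = ["a", "dog", "lay", "down"]

short_pairs = list(zip(list_a, list_b))                                # stops at the shorter list
padded_pairs = list(zip_longest(list_a, list_b))                       # pads with None
custom_pairs = list(zip_longest(list_a, list_b, fillvalue="<PAD>"))    # pads with a chosen value

print(short_pairs)
print(padded_pairs)
print(custom_pairs)
```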

### Enumerate
Python provides the built-in function `enumerate`, which can be used instead of combining `range` and `zip`.

In [None]:
for a,b in enumerate(['The','Holy','Grail']): 
    print(a,b)

In [None]:
for a,b in enumerate(['The','Holy','Grail'],1): 
    print(a,b)
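
To make the relationship with `range` and `zip` concrete, this short sketch builds the same pairs both ways and checks that they match. The variable names are chosen only for this example.

```python
title_words = ['The', 'Holy', 'Grail']

# Pair each position with its word using range and zip
with_zip = list(zip(range(len(title_words)), title_words))

# The same pairs produced by enumerate
with_enumerate = list(enumerate(title_words))

print(with_zip)
print(with_enumerate)
print(with_zip == with_enumerate)   # True
```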

### Exercise
In the empty cell below, adapt the code that calculates the number of sentences in each paragraph of a text so that it uses `enumerate` rather than `range` and `zip`.


In [None]:
# uncomment the next line and then run the cell to load a solution
# %load solutions//count_sentences_per_paragraph