### Before Starting

**Make sure you have the following files in the default directory. This is easy to do using the notebook. The directory you were in when you opened the notebook is the default directory. By placing the files in the same directory you will not have to think about filepaths at all for these exercises.** 

* Exercise10.py

* sample_corpus.txt

* sample_corpus_2.txt

### Exercise 1

**We start with a simple exercise to see how to create a function and how to call it with an argument.**

* **The first block of text shows how doc strings are used by convention in Python. All functions should begin with a block of documentation (docstring) of the form given by the first block in triple quotes in the programme below.**
* **Comments in the code itself are introduced by a # either as a separate line or appended to the end of a line. Python will ignore the rest of a line after a #.**

**Type shift-enter to execute the cell. The function now exists in the kernel and can be called by any cell in the notebook. The function definition ends as soon as the indentation ceases (here, at the unindented code that follows the comment "Here is the argument:"). After creating the function the kernel continues to execute the rest of the cell, thereby calling the function.**

**Notice how the programme splits a character string at newline characters. This works because split is an inbuilt method of the string data type. Therefore all strings can be split in this way.**
* "\n" **is the newline character (often loosely called a carriage return).**
* **Use **"\t"** for reading tab separated files.**
* **If you leave the argument empty it will treat any string of whitespace as a delimiter to be split. This has the advantage that a double space will be treated as a single delimiter.**
* **Use the empty cell below to test what happens when you use **split(' ')** and encounter a double space.**
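
**The difference between the two forms of split can be sketched as follows. This is a minimal illustration on a made-up string (the variable names are our own, not part of the exercise files):**

```python
text = "one  two three"  # note the double space between "one" and "two"

# With no argument, any run of whitespace counts as a single delimiter
no_argument = text.split()

# With an explicit single space, every space is a delimiter, so the
# double space leaves an empty string between "one" and "two"
single_space = text.split(" ")

print(no_argument)
print(single_space)
```

**The empty string in the second result is easy to overlook and will be counted as a word by any code that simply takes the length of the list.**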

**Next examine the contents of the variable** sample_text **with and without the print function.**
* **Notice that execution of the first cell means the variable is now in the kernel and accessible to any cell.**
* **Use the same box to try printing the variable** input_text**. What happens? Why?**
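
**As a rough sketch of what to expect, the cell below uses a short made-up string and a hypothetical function count_lines (neither is part of the exercise). It assumes that no variable called input_text has been created in the kernel:**

```python
sample = "first line\nsecond line"

# repr() gives the form the notebook shows when a cell ends with a bare
# variable name: quotes are included and the newline is written as \n
shown_without_print = repr(sample)
print(shown_without_print)

# print() writes the text itself, so the newline starts a new line
print(sample)


def count_lines(input_text):
    # input_text is a parameter: it only exists while the function runs
    return len(input_text.split("\n"))


print(count_lines(sample))

# Outside the function the parameter name is unknown to the kernel
try:
    input_text
except NameError as error:
    print("NameError:", error)
```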

In [6]:

def count_paragraphs(input_text):
    """
    A paragraph is defined as the text before a newline character, i.e. "\n".
    Take a character string, split it into paragraphs, count them
    and return the count.
    :param input_text: a character string containing paragraph marks
    :return: integer, the number of paragraphs
    """
    
    # The following statement creates a list of strings by breaking
    # up input_text wherever a "\n" character occurs
    
    paragraphs = input_text.split("\n")  
    
    # The len() function counts the number of elements in the list
    
    return len(paragraphs)


# Here is the argument:

sample_text = "This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."
print (sample_text)

# Here is the function call:

print ("Number of paragraphs: ", count_paragraphs(sample_text))


"This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."

### Exercise 2:

    
**The next programme introduces the set datatype and shows how to create a vocabulary with it.**

**A set is an unordered collection of unique objects. Like a list, a set is iterable. That means Python's powerful capabilities for performing actions on successive elements can be used on it. Here we see one of the most basic and important of these, iterating with a for loop.**

**Check you are comfortable with what the for loop is doing then:**
* **change the function by adding a line so that the set returned by the function call is stored in a variable.**
* **Spend a moment considering the shortcomings of this function.**
* **Is the space character by itself a good word delimiter?**
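
**The two properties of a set mentioned above (unique elements, no fixed order) can be seen in a small sketch on a made-up sentence. The variable names are our own:**

```python
words = "the cat saw the other cat".split()

# Converting the list to a set drops the repeated words
vocabulary = set(words)
print(len(words), len(vocabulary))

# A set has no defined order; sorted() returns a list in a repeatable order
for word in sorted(vocabulary):
    print(word)
```

**Storing the set in a variable, as here, means it can be reused without calling the function again.**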

In [8]:
def get_vocabulary(input_text):
    """
    A word is defined as a character string delimited by
    a space " " character.
    Given an input string, split it into words and
    return the set of unique words in the input.
    :param input_text: Character string with some text
    :return: The set of unique words in the input.
    """
    
    list_of_words = input_text.split()
    
    # The following line takes the list of words,
    # removes repetitions and creates a set:

    return set(list_of_words)

# The following loop is repeated for every word in the set
# Note that a set is just one of many iterable objects:


for word in get_vocabulary(sample_text):
    print (word)

it's
that
You
not
it;
of
if
This
numeric,
It
looks
can
Pythonic
flavour
types:
is
code,
token
smell
it
sentence01
7
smell'?
CamelCase
sentence
alphanumeric,
you
different.
Google
showing
like
and
alphabetic,
different
to
UPPERCASE,
A
haven't.
REAL
know?
a
it.
artificial.
has
sample
They're
exist.
Have
heard
Sentences
should
'code
too
punctuation!
Title,


### Exercise 3

**Another important data type is the dictionary, which maps keys onto values. Both keys and values can be a wide range of objects.  The programme below uses a special form of dictionary imported from a library with a few special features. First spend a moment experimenting with the native basic dictionary form in the empty cell below. One way to create a dictionary is:**

my_dictionary = dict( )

**You can then assign values to keys like this (red is the key and colour is the value):**

* my_dictionary['red'] = 'colour'
* my_dictionary['banana'] = 'noun'

**Keys must be unique. There is no requirement for keys or values to be of the same type as other keys or values.
Another way to create a dictionary is:**

* my_dictionary = {'red': 'colour', 'banana': 'noun'}

**Here we load the sample text from file. (It is the same text as before but loading from file is the usual way to access text, both as a matter of good practice and because text files can be very large.)**
* **Notice carefully how the programme opens a file. If the file isn't in the default directory you will need to set the file path.**
* **Instead of using this form you can open a file with explicit open and close statements. Change the programme to do this. The format is **file_handle = open(   )** and then **file_handle.close( )**. The latter needs no argument.**

* **The advantage of the first method is that the file will close automatically without you having to remember.**
* **The indentation also shows you clearly when the file is in use. This could be an advantage or a disadvantage depending on how long the file is open and how many levels of indentation you are working with.**
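
**The two ways of opening a file can be compared in a small self-contained sketch. It writes its own throwaway file (the name example.txt is hypothetical) so that it does not depend on the exercise files:**

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
folder = tempfile.mkdtemp()
example_path = os.path.join(folder, "example.txt")  # hypothetical file name
with open(example_path, "w") as file_handle:
    file_handle.write("one line\nanother line")

# Form 1: the with statement closes the file when the indented block ends
with open(example_path) as file_handle:
    text_from_with = file_handle.read()
print(file_handle.closed)

# Form 2: explicit open and close; forgetting close() leaves the file open
file_handle = open(example_path)
text_from_open = file_handle.read()
file_handle.close()
print(file_handle.closed)

# Tidy up the throwaway file and folder
os.remove(example_path)
os.rmdir(folder)
```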

**The programme shows how to add and update entries in a dictionary and how to iterate over a dictionary extracting key-value pairs. Notice the import statement. What happens if you delete it?**

**The special imported version of the dictionary has a default value. Once you are comfortable with what it does, convert the programme to use the basic dictionary type. You will now need to introduce code to check if a word is in the dictionary as you can no longer rely on the dictionary structure to automatically assign a 0 to unknown entries.**

**To test whether an item is present use this:**

* if 'red' in my_dictionary:
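
**One possible shape for the membership test when counting with a basic dictionary is sketched below. The function name count_words_plain is our own and this is only one of several ways to write it:**

```python
def count_words_plain(input_text):
    word_counts = dict()
    for word in input_text.split():
        if word in word_counts:
            # the word has been seen before: add one to its count
            word_counts[word] += 1
        else:
            # first occurrence: create the entry ourselves
            word_counts[word] = 1
    return word_counts


counts = count_words_plain("it is what it is")
print(counts)
```

**Without the test, the line that adds one to the count would raise a KeyError the first time a new word is met.**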
      
**Finally, look at the results obtained for the word "it" and think for a moment about the problem caused by the occurrence of "it's".**

In [12]:
# Identifiers that are not built-in and are defined somewhere
# else need to be imported into this module, that is, included
# as if they were part of this source module:

import collections


def count_words_from_file(input_text):
    """
    A word is defined as a character string delimited by
    whitespace.
    Given a character string containing text, count how many
    times each unique word occurs and return the counts.
    :param input_text: Character string with some text
    :return: A dictionary containing entries {word:count}
    """
   
    # The dictionary with default values is defined in the
    # collections module we imported. The int parameter
    # tells Python that when a key is accessed that has not
    # been previously stored in the dictionary, it should
    # create an entry for this key with a value of 0:
    
    word_counts = collections.defaultdict(int)
    for word in input_text.split():
    
        # increment the word's count, starting with 0 if it has no entry

        word_counts[word] += 1  
    return word_counts


input_file_path = "sample_corpus.txt"
with open(input_file_path) as input_file_handle:  # open the file for reading
    sample_text = input_file_handle.read()  # read the entire file
word_counts = count_words_from_file(sample_text)  

# The items of a dictionary are key:value tuples, in this case they
# are the word and its corresponding count; they can be accessed thusly:


for word, count in word_counts.items():
    print (word, count)

This 1
is 1
a 1
sample 1
sentence 2
showing 1
7 1
different 1
token 1
types: 1
alphabetic, 1
numeric, 1
alphanumeric, 1
Title, 1
UPPERCASE, 1
CamelCase 1
and 1
punctuation! 1
Sentences 1
like 2
that 1
should 1
not 1
exist. 1
They're 1
too 1
artificial 1
A 1
REAL 1
looks 1
different. 1
It 1
has 1
flavour 1
to 1
it. 1
You 1
can 1
smell 1
it; 1
it's 1
Pythonic 1
code, 1
you 3
know? 1
Have 1
heard 1
of 1
'code 1
smell'? 1
Google 1
it 1
if 1
haven't. 1


### Exercise 4:

**Exercises 4 - 7 look at various different versions of a programme to count the sentences in our sample text.**

**Notice how to create an empty list. You can also do it with **= [ ]**, which is equivalent to **= list( )** and is the more common idiom.**

**It will help your programming enormously if you can understand the difference between two commonly used inbuilt list methods, append and extend. append puts a single object at the end of a list. extend takes an iterable and adds each of its elements to the end of the list.**

**The argument of extend is often a list but it can be any iterable. Experiment with this in the empty cell below. Make sure you try appending and extending by both text and numbers, both single items and a list. The results are sometimes surprising. This is a common source of errors.**
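
**A short sketch of the kind of experiment meant here, using made-up values:**

```python
items = [1, 2]

items.append(3)        # adds the single object 3
items.append([4, 5])   # adds the list itself as ONE element
items.extend([6, 7])   # adds each element of the list
items.extend("ab")     # a string is iterable, so each character is added
print(items)

# A number is not iterable, so it cannot be used with extend
try:
    items.extend(8)
except TypeError as error:
    print("TypeError:", error)
```

**The nested list produced by appending a list, and the separate characters produced by extending with a string, are the two results that most often surprise people.**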

**A tuple is an immutable sequence of objects. It is like a list but you cannot change it. It can be of arbitrary length including 0 and 1. For example:**

* a = tuple( )
* a = (3,)
* a = 3,

**The last two are the same.**

**When should you use a tuple and when a list? If the position of an item in a list matters it is probably better to use a tuple. If it doesn't matter, use a list.** 
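
**The tuple forms listed above, and the fact that a tuple cannot be changed, can be checked with a small sketch (the values are made up):**

```python
empty = tuple()
single = (3,)
also_single = 3,
not_a_tuple = (3)   # without the comma this is just the integer 3

print(single == also_single)
print(type(not_a_tuple))

# Position matters in this pair: (paragraph number, sentence count)
pair = (2, 5)
paragraph_number, sentence_count = pair  # unpack by position
print(paragraph_number, sentence_count)

# Tuples are immutable, so assigning to an element fails
try:
    pair[0] = 10
except TypeError as error:
    print("TypeError:", error)
```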

**Now check what the programme below is doing and run it. Try to understand what is happening.**

In [13]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(1, 1), (2, 3), (3, 3), (4, 1)]
    """
    
    paragraphs = input_text.split("\n")
    sentences_per_paragraph = list()  # create an empty list
    paragraph_number = 0
    for paragraph in paragraphs:
        paragraph_number += 1
        sentences = paragraph.split(". ")
        number_of_sentences = len(sentences)
        
        # create a tuple with the paragraph number and the number of sentences
        # in it, then append the tuple to the list:
        
        sentences_per_paragraph.append((paragraph_number, number_of_sentences))
    return sentences_per_paragraph


print (sample_text)
print (count_sentences_per_paragraph(sample_text))

This is a sample sentence showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!
Sentences like that should not exist. They're too artificial
A REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?
Have you heard of 'code smell'? Google it if you haven't.
[(1, 1), (2, 2), (3, 3), (4, 1)]


### Exercise 5

**Study the code below. This is a reworking of the previous programme with the actual counting of sentences moved out of the main function into a separate function.**

* **One of the reasons for doing this is to isolate a task which will be used in different places. That reason doesn't really apply here as the new function is only being called in one place.**
* **Another reason for making a separate function is readability. Do you think this change improves readability?**
    

In [2]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(1, 1), (2, 3), (3, 3), (4, 1)]
    """
    
    paragraphs = input_text.split("\n")
    sentences_per_paragraph = []
    paragraph_number = 0
    for paragraph in paragraphs:
        paragraph_number += 1
        number_of_sentences = count_sentences(paragraph)
        sentences_per_paragraph.append((paragraph_number, number_of_sentences))
    return sentences_per_paragraph
      

def count_sentences(paragraph):
    """
    A sentence is a character string delimited by a period "."
    Given an input paragraph, return the number of sentences
    in it.
    :param paragraph: Character string with sentences.
    :return: number of sentences in the input paragraph
    """
    
    sentences = paragraph.split(".")
    return len(sentences)


print(count_sentences_per_paragraph(sample_text))

NameError: name 'sample_text' is not defined

### Exercise 6

    

**First experiment with **zip( )**, a function which often proves useful and is worth remembering. Try copying these lines into the empty cell:**
* print(list(zip(['a','b','c'], range(5))))
* for a,b in zip(['a','b','c'], range(5)): print(a,b)
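
**The same experiment written out as a cell, assuming Python 3, where zip( ) produces its pairs lazily and has to be wrapped in list( ) before it can be printed as a list:**

```python
letters = ['a', 'b', 'c']

# zip pairs up elements by position and stops at the shorter input,
# so the numbers 3 and 4 from range(5) are never used
pairs = list(zip(letters, range(5)))
print(pairs)

# In a for loop no list() is needed: the loop consumes the pairs one by one
for letter, number in zip(letters, range(5)):
    print(letter, number)
```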

**Study the next version of the same programme.**
* **We now make a sequence of consecutive integers using **range( )**.**
    
* **We then combine two sequences into one.**
* **In this case the use of **range( )** in combination with **zip( )** is quite neat. However, it does create an unnecessary sequence of integers and there are situations where this approach gets messy.**

**Instead of counting the paragraphs we can get python to do it for us using an inbuilt function with precisely that purpose:**
* enumerate( ) **performs the combination of **range( )** and **zip( )** automatically.**
* **Experiment with **enumerate( )** in an empty cell. For example you could try:**
* for a,b in enumerate(['The','Holy','Grail']): print(a,b)

**Change the programme below to use **enumerate( )** instead of **range( )** and **zip( )**. The code you need to insert is:**

* sentences_per_paragraph = [(ind,val) for ind,val in enumerate(sentence_counts, 1)]

**There are two ways this can be done. You can either insert the enumerate command in the function (which will require fewer changes) or you can make the function return a simple list instead of a list of tuples and use enumerate in the print command instead of in the function itself. This would dedicate the function to its real task of getting the list, leaving the presentation as a separate task.**

**Try running it without the second argument (i.e. just use** enumerate(sentence_counts) **without the 1). Check you understand the difference.**
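
**The effect of the second argument can be seen on a short list of made-up counts (these numbers are for illustration only and are not the output of the programme):**

```python
sentence_counts = [1, 2, 3, 1]  # made-up values for illustration

# Without a second argument, numbering starts at 0
from_zero = list(enumerate(sentence_counts))
print(from_zero)

# The second argument sets the number given to the first element
from_one = list(enumerate(sentence_counts, 1))
print(from_one)
```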
    

In [None]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(0, 1), (1, 3), (2, 3), (3, 1)]
    """
    
    paragraphs = input_text.split("\n")
    sentence_counts = []
    for paragraph in paragraphs:
        number_of_sentences = count_sentences(paragraph)
        sentence_counts.append(number_of_sentences)
    
    # Create the paragraph numbers we need:
    
    paragraph_numbers = range(len(paragraphs))
    
    # Make a list of tuples by combining the two sequences
    # (in Python 3 zip is lazy, so list() is needed to build the list):
    sentences_per_paragraph = list(zip(paragraph_numbers, sentence_counts))
    return sentences_per_paragraph


print(count_sentences_per_paragraph(sample_text))

### Exercise 7

**The final version of the programme demonstrates the use of the** map( ) **function to apply a function to every element of a list. The advantage of this is that it is no longer necessary to initialize the list and append elements to it in a loop.**

**When used simply, as here, the map function is good Python, but it can be used to write complicated code which is difficult to read and is considered poor style. Where possible it is usually good practice to use list comprehensions.**

**Change the programme below to use a list comprehension instead of** map( )

**The code you need is:**
* sentence_counts = [count_sentences(paragraph) for paragraph in paragraphs]
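
**To compare the two styles side by side without touching the programme, here is a sketch using a hypothetical helper count_words on made-up paragraphs. It assumes Python 3, where map( ) is lazy and needs list( ) to produce a list:**

```python
def count_words(paragraph):
    # hypothetical helper, used only for this comparison
    return len(paragraph.split())


paragraphs = ["one two", "three", "four five six"]

# map applies the function to every element
with_map = list(map(count_words, paragraphs))

# the list comprehension says the same thing in one readable expression
with_comprehension = [count_words(paragraph) for paragraph in paragraphs]

print(with_map)
print(with_comprehension)
```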

In [None]:
def count_sentences_per_paragraph(input_text):
    """
    Given an input text:
     - assign a number to each paragraph,
     - count the number of sentences in each paragraph,
     - output a list of all paragraph numbers together
       with the number of sentences in it.

    :param input_text: A character string possibly containing
                        periods "." to separate sentences and
                        paragraph marks "\n" to separate
                        paragraphs.
    :return: A list of ordered pairs (tuples) where the first
            element of the pair is the paragraph number and
            the second element is the number of sentences in
            that paragraph.
            Sample output: [(0, 1), (1, 3), (2, 3), (3, 1)]
    """
    
    paragraphs = input_text.split("\n")
    
    # Apply the count_sentences function to every element of paragraphs
    # and call the results sentence_counts:
    
    sentence_counts = map(count_sentences, paragraphs)
    paragraph_numbers = range(len(paragraphs))
    
    # list() collects the zipped pairs into a list (zip is lazy in Python 3):
    sentences_per_paragraph = list(zip(paragraph_numbers, sentence_counts))
    return sentences_per_paragraph


print(count_sentences_per_paragraph(sample_text))

### Exercise 7a

**We have now done the following:**

* **Written a programme and checked it worked**

* **Written several more versions of it and checked they work too**

**Now it is time to think about checking and testing. There is in fact a problem with the programme. In a real programming situation you would not know whether there is a problem, so try to imagine you haven't been told. How hard would you have checked? The purpose of this exercise is to consider different ways of checking for problems and to show how difficult it can be. The same problem exists in all our versions of this programme.**

* **First look at the code and see if you can see where the problem lies. It is often impossible to find problems by inspection.**

* **Next, do some experimenting with the split function. Do you really understand how it works?**

* **The best way to find problems is usually to test the programme thoroughly. See if you can find the problem by testing it more carefully.**

* **Load the file **sample_corpus_2.txt** and run the programme on it. (It is best to carry out a separate experiment in a new cell but if you find this difficult, you can change the file name in exercise 3, execute the cell again, then run the programme with the new file stored in the old variable name, taking care to change it back and execute it again afterwards).**

* **Study the input and output until you understand what the problem was. Notice the number of outputs generated**

### Exercise 8

**The next few exercises develop a programme for making tokens. A token is a specific occurrence of a basic unit of lexical processing, typically a word or an item of punctuation.**

* **Study the programme, in particular the string methods. These are very useful in NLP.**
* **Experiment with the string methods using the empty cell until you understand how they work in special cases such as a single space and a single punctuation mark.**
* **The programme will only assign one feature to each token. Are there any cases where more than one feature should be assigned?**

* **How does the programme delimit tokens (i.e. find where one ends and the next begins)? Look at the results and note how poorly this works.**
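
**A sketch of the kind of experiment suggested above, run on a few made-up tokens including the special cases. It also checks one detail of the** in **test used by the programme: on strings it is a substring test.**

```python
examples = ["word", "7", "A27", " ", ",", "", "it's"]

for token in examples:
    # repr() makes the space and the empty string visible in the output
    print(repr(token), token.isalpha(), token.isdigit(), token.isalnum())

# The empty string counts as a substring of every string, so an empty
# token passes a test such as: token in ",:;"
empty_is_punctuation = "" in ",:;"
print(empty_is_punctuation)
```

**Empty tokens arise whenever the text contains a double space and is split with split(" "), so this case is worth keeping in mind when you look at the results.**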

In [None]:
def make_tokens(input_text):
    """
    Take an input text, split it into tokens, find the
    token's shape, make a feature
    vector with the token itself and its shape, return
    a list of all token feature vectors found in the input.
    :param input_text: A character string containing spaces
    :return: A list of token feature vectors (token, shape).
        Sample output: [('a', 'alpha'), ('7', 'digit'), ('A27', 'alnum')]
    """
    
    # Here we define a token as being delimited by a space character:
    
    tokens = input_text.split(" ")
    return list(map(make_token_feature_vector, tokens))


def make_token_feature_vector(token):
    """
    Given a token, extract its shape and return a
    vector with the token itself and its shape
    :param token: A character string
    :return: A tuple (token, shape)
    """
    
    if token.isalpha():
        return (token, "alpha")
    elif token.isdigit():
        return (token, "digit")
    elif token.isalnum():
        return (token, "alnum")
    elif token in ",:;":  
        return (token, "punctuation")
    elif token in ".!?":  
        return (token, "sentence_end")
    elif token == "\n":  
        return (token, "paragraph_end")
    else:
        return (token, "other")


for token in make_tokens(sample_text):
    print(token)

### Prelim to exercise 9

**Exercise 9 introduces lazy generators, an important form of function in Python. A lazy generator does not calculate its results all at once but returns them one at a time for iteration. The **enumerate** function which we saw in Exercise 6 is lazy in the same way.**

**You can define lazy generator functions by using **yield** instead of **return**. When the function reaches a **yield** command it yields the argument, suspends execution without terminating, and returns control to the level that called the function. The next time a value is requested it resumes from the place where it left off. There is no requirement to have a single yield command. You can yield in one place the first time and another place the next time (as you will see from the programme in the exercise).**
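
**The suspend-and-resume behaviour can be made visible with the inbuilt next( ) function, which asks a generator for one value. The function two_step below is a made-up example with two separate yield commands:**

```python
def two_step():
    print("running up to the first yield")
    yield "first"
    print("resumed after the first yield")
    yield "second"
    print("resumed after the second yield, nothing left to yield")


generator = two_step()       # creates the generator; the body has not run yet

first = next(generator)      # runs the body as far as the first yield
print(first)

second = next(generator)     # resumes and runs as far as the second yield
print(second)

remaining = list(generator)  # runs to the end; there are no more values
print(remaining)
```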

**The cell below shows a simple function using both forms so that you can see the difference. Notice that you cannot use the result in the same way. A result that is returned is passed directly as a value whereas a result that is yielded must be consumed by iterating over it.**

In [None]:
def return_count_to_ten():
    return list(range(1, 11))  # build the whole list at once


def yield_count_to_ten():
    for i in range(1, 11):
        yield i  # hand back one number at a time


l = return_count_to_ten()
print(l)

i = yield_count_to_ten()
print('yield')
print(i)  # prints a generator object, not the numbers

l = list(yield_count_to_ten())
print(l)

for i in yield_count_to_ten():
    print(i)


### Exercise 9

**The previous programme delimited tokens by looking for spaces between them. You should have noticed that it doesn't work very well because it doesn't account for punctuation symbols. We need a better way to do this and, ideally, a separate function to do it.** 

**Because it is hard to follow, here is a summary of the logic of the new function,** split_tokens(input_text):

**The function reads the whole string one character at a time, adding characters to the token variable.**
* **When it encounters a delimiter it yields the token.**
* **If the token is empty it yields the delimiter character - unless it is a space - because the delimiter is an item of punctuation which is itself a token.**
* **After returning a token the variable is reset to an empty string.**

**Notice how the function yields the result instead of returning it. This means that it continues from the same point next time it is called.**
* **Try calling the function using the empty cell. What happens?**
* **Notice that the programme does not make a simple function call, it uses it in a list comprehension which iterates over it. Another common way to collect the yields would be with a for loop.**

**To make sure you understand the logic tests, experiment with statements of the form:**
* if variable_name: print("True")

**Test this with the variable set to different types of data, including an empty list and an empty string.** 
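
**A compact way to run this test on several values at once is sketched below (the list of values is our own choice):**

```python
values = ["", "text", [], [0], 0, 1, None, " "]

# Keep only the values that count as True in an if statement
truthy = [value for value in values if value]
print(truthy)

for value in values:
    print(repr(value), bool(value))
```

**Note that a string containing a single space and a list containing a single zero both count as True, because neither is empty.**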

In [None]:
def make_tokens(input_text):
    """
    Take an input text, split it into tokens, find the
    token's shape, make a feature
    vector with the token itself and its shape, return
    a list of all token feature vectors found in the input.
    :param input_text: A character string containing spaces
    :return: A list of token feature vectors (token, shape).
        Sample output: [('a', 'alpha'), ('7', 'digit'), ('A27', 'alnum')]
    """
    
    # Now it's up to the split_tokens function to decide what a token is.
    # List comprehension creates a list by extracting elements from
    # an iterable object; in this case calling split_tokens returns an
    # iterable generator object because the function uses the "yield" statement:
    
    tokens = [token for token in split_tokens(input_text)]
    return list(map(make_token_feature_vector, tokens))


def split_tokens(input_text):
    """
    This function decides how to delimit a token. It takes an input
    string, iterates over it character by character; it collects
    constituent characters in the output token; punctuation characters
    are considered delimiters therefore become tokens of their own; the
    space character is removed from tokens. Yield the tokens one at
    a time.
    :param input_text: A character string containing a mix of text and delimiter characters.
    :yield: A character string which is either free from delimiters or
        is a delimiter itself.
    """

    DELIMITERS = ",:!?.\n"
    token = ""
    for char in input_text:
        if char in DELIMITERS:  # test if the input character is a delimiter (substring presence)
            
            # Character strings, lists, etc, have a logical truth value in Python;
            # an empty string is False, if it has characters it is True.
            
            if not token:  # same as token == ""
                yield char
            else:
                
                # Return token to the calling program, but next time this function
                # is called, continue from
                # the next statement rather than from the beginning of the function:
                
                yield token  # After yielding control to the calling program,
                             # this function will execute the next statement:
                token = ""  # Pick up execution from here.
                yield char
        elif char == " ":
            if token:  # same as token != ""
                yield token
                token = ""
        else:
            token += char

for token in make_tokens(sample_text):
    print(token)

### Exercise 10

**The purpose of this exercise is to learn the difference between three different ways of running a python programme.** 

**The first is the way used in these exercises: simply typing or pasting the code into a notebook (or console) and running it.** 

**Very similar to the first way is to import the code from a file or module into a notebook (or console). If you import a module, python will automatically run it. That means it reads every line in the file and executes. If the module contains function definitions, executing them means creating the functions. If it contains code that calls functions, python will make those calls and run the functions.**  

**The third way is to run the module from the command line by typing python (or ipython) followed by the module name including the .py suffix.**

**Python behaves the same for the second and third method. However, it is often useful to have a module that runs using the third method but doesn't run using the second i.e. you can import the functions, and perhaps some variables, without running anything. To achieve this, modules often include the line**
* if __name__ == "__main__": **as in the cell below.**

**Code placed under this line will run when the module is called from the command line but not when the file is imported.**
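
**A minimal sketch of a module laid out in this way. The function greet is a made-up example and is not part of Exercise10.py:**

```python
def greet(name):
    # Importing the module creates this function but does not call it
    return "Hello, " + name


if __name__ == "__main__":
    # This block only runs when the file is executed directly,
    # for example from the command line
    print(greet("world"))
```

**When a file is run directly Python sets the variable __name__ to "__main__". When it is imported, __name__ is set to the module name instead, so the test fails and the indented block is skipped.**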

**The cell below contains the programmes for the tokens exercise we just looked at. It is also stored in a file named "Exercise10.py" You don't need to read the code as nothing has changed (apart from the addition of one line for testing which was added only to the saved file). Try all three methods:**

**1. Execute the cell below**

**2. In the empty cell execute:**

* import Exercise10 

* **(Note the capital E in the module name). It should not run the programme. To see what has happened, run the following commands:**

* print(noone)

* print(Exercise10.noone)

* from Exercise10 import noone

* print(noone)

**The variable **noone** did not exist in the original programme (it was assigned in the test line that was added to the file).**
* **Notice the difference between the two types of import. Using the second type is more convenient as you don't have to specify the namespace to access functions and variables.**
* **For this reason people sometimes use the command** from module import \*. **However, this is dangerous as you can easily overwrite existing names and python will not warn you. Using the import command in this way is considered bad practice. You can sometimes get away with it when importing your own module but avoid it with library modules.**
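
**The two types of import, and the way an import can silently replace a name you already use, can be tried out with the standard math module. This sketch uses math rather than Exercise10 so that it runs anywhere:**

```python
import math

# First type: the name stays inside the module's namespace
print(math.pi)

# Second type: the name is copied into the current namespace
from math import pi
print(pi)

# Danger: an import replaces an existing name without any warning
e = "my own variable"
from math import e
print(e)  # no longer the string assigned above
```

**With from module import \* the same replacement can happen for every public name in the module at once, which is why it is best avoided.**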

**3. Run it from the command line** (ipython Exercise10.py)

In [None]:
def make_tokens(input_text):
    """
    Take an input text, split it into tokens, find the
    token's shape, make a feature
    vector with the token itself and its shape, return
    a list of all token feature vectors found in the input.
    :param input_text: A character string containing spaces
    :return: A list of token feature vectors (token, shape).
        Sample output: [('a', 'alpha'), ('7', 'digit'), ('A27', 'alnum')]
    """
    
    # Now it's up to the split_tokens function to decide what a token is.
    # A list comprehension creates a list by extracting elements from
    # an iterable object. Because split_tokens uses the "yield" statement,
    # calling it returns a generator, which is an iterable object:
    
    tokens = [token for token in split_tokens(input_text)]
    return map(make_token_feature_vector, tokens)


def make_token_feature_vector(token):
    
    """
    Given a token, extract its shape and return a
    vector with the token itself and its shape
    :param token: A character string
    :return: A tuple (token, shape)
    """
    
    if token.isalpha():
        return (token, "alpha")
    elif token.isdigit():
        return (token, "digit")
    elif token.isalnum():
        return (token, "alnum")
    elif token in ",:;":  
        return (token, "punctuation")
    elif token in ".!?":  
        return (token, "sentence_end")
    elif token == "\n":  
        return (token, "paragraph_end")
    else:
        return (token, "other")



def split_tokens(input_text):
    
    """
    This function decides how to delimit a token. It takes an input
    string, iterates over it character by character; it collects
    constituent characters in the output token; punctuation characters
    are considered delimiters and therefore become tokens of their own;
    the space character separates tokens and is discarded. Yield one
    token at a time.
    :param input_text: A character string containing a mix of text and delimiter characters.
    :yield: A character string which is either free from delimiters or
        is a delimiter itself.
    """
    
    # First decide what characters delimit a token:
    DELIMITERS = ",:!?.\n"
    
    token = ""
    for char in input_text:
        
        if char in DELIMITERS:  # test if the input character is a delimiter (substring presence)
            
            # Character strings, lists, etc, have a logical truth value in Python;
            # an empty string is False, if it has characters it is True.
            
            if not token:  # same as token == ""
                yield char
            else:
                
                # Hand the token to the calling program. When the caller asks
                # for the next value, execution resumes at the statement after
                # this yield rather than at the beginning of the function:
                
                yield token  # After yielding control to the calling program,
                             # this function will execute the next statement:
                token = ""  # Pick up execution from here.
                yield char
        elif char == " ":
            if token:  # same as token != ""
                yield token
                token = ""
        else:
            token += char

    # Emit the final token if the text does not end with a delimiter or a space:
    if token:
        yield token

sample_text = "This is a sample sentence01 showing 7 different token types: alphabetic, numeric, alphanumeric, Title, UPPERCASE, CamelCase and punctuation!\nSentences like that should not exist. They're too artificial.\nA REAL sentence looks different. It has flavour to it. You can smell it; it's like Pythonic code, you know?\nHave you heard of 'code smell'? Google it if you haven't."            

if __name__ == "__main__":
    for token in make_tokens(sample_text):
        print token

### Exercise 11

**Note on terminology. The word "parse" means to read and process sequentially. In NLP it also has the more specific meaning of analysing text to determine its syntax. To avoid confusion, please be aware that in this exercise the first meaning is used.**

**The programme below constructs a nested list by reading some input text and looking for delimiters.**
* **First it runs our **make_tokens** function.**
* **Then it reads one token at a time to construct sentences. Each sentence is a list.**
* **When the end of a sentence is reached, a new empty list is created and a new sentence read.**
* **When it reaches the end of a paragraph, all the sentences in that paragraph are kept together in a list and a new paragraph is created.**
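**To make the target structure concrete, here is a small hand-written nested list of the kind the steps above build: paragraphs contain sentences, and sentences contain (token, shape) pairs. The data is invented for illustration; it is the sort of result we would expect for an input like "Hello world. It is 7.\nBye.", where the sentence-end and paragraph-end tokens act only as separators.**

```python
# text -> list of paragraphs -> list of sentences -> list of (token, shape) tuples
example_parsed_text = [
    [  # paragraph 0
        [("Hello", "alpha"), ("world", "alpha")],             # sentence 0
        [("It", "alpha"), ("is", "alpha"), ("7", "digit")],   # sentence 1
    ],
    [  # paragraph 1
        [("Bye", "alpha")],                                   # sentence 0
    ],
]

paragraph_count = len(example_parsed_text)
sentences_per_paragraph = [len(paragraph) for paragraph in example_parsed_text]
tokens_in_first_sentence = [token for token, shape in example_parsed_text[0][0]]
```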

**There are various ways this could be done. The method below uses the generator technique we have seen before, where results are delivered using the yield statement instead of the return statement. This means the function does not exit but pauses, and resumes from the same place the next time a value is requested.**
* **Using generators is often a good way to write clear, simple code.**
* **Another advantage of generators is that data is processed only as it is needed, making it possible to process very large inputs that would otherwise use up too much memory.**
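**The following sketch shows the pause-and-resume behaviour in isolation. The generator** count_up **is a made-up example, not part of the exercise. It could in principle produce numbers forever, yet only the values that are actually requested are ever computed.**

```python
from itertools import islice


def count_up():
    """Yield 0, 1, 2, ... without end; each value is produced only on request."""
    n = 0
    while True:
        yield n   # pause here and hand n to the caller
        n += 1    # execution resumes here when the next value is requested


numbers = count_up()                    # creates the generator; the body has not run yet
first = next(numbers)                   # runs the body up to the first yield
second = next(numbers)                  # resumes after the yield and loops once
next_three = list(islice(numbers, 3))   # take just three more values
```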

**First, execute the cell and study the code until you understand how it works.** 

**Second, consider how else this task could have been written. What do you think of this method? Is it easy to read?**

In [None]:
def parse_text(input_text):
    """
    A parsed text is defined as a list of parsed paragraphs.
    Given an input text, parse its paragraphs and return a list
    with the results.
    :param input_text: A character string with paragraphs
    :return: A list of parsed paragraphs
    """
    
    return [paragraph for paragraph in parse_paragraphs(input_text)]


def parse_paragraphs(input_text):
    """
    A parsed paragraph is defined as a list of parsed sentences.
    Given an input text, parse its sentences; if the sentence is
    actually the end of a paragraph, then yield the previous
    sentences packed as a list.
    :param input_text: a character string containing paragraphs
                       and sentences.
    :yield: A list of sentences up to the end of the paragraph.
    """
    
    paragraph = list()
    for sentence in parse_sentences(input_text):
        
        # We expect parse_sentences to yield the string "paragraph_end"
        # when it encounters an end of paragraph mark.
        
        if sentence == "paragraph_end":
            yield paragraph
            paragraph = list()
        else:
            paragraph.append(sentence)
    yield paragraph


def parse_sentences(input_text):
    """
    A parsed sentence is defined as a list of token vectors.
    :param input_text: a character string containing paragraphs
                       and sentences.
    :yield: A list of token vectors up to the end of a sentence, or the
            string "paragraph_end" when a paragraph ends.
    """
    
    token_vectors = make_tokens(input_text)  
    sentence = list()
    
    # Since a token vector is a tuple (token, shape) we can unpack it
    # automatically as we iterate over the list of token vectors:
    
    for token, shape in token_vectors:
        if shape == "sentence_end":
            yield sentence
            sentence = list()
        elif shape == "paragraph_end":
            if sentence:
                yield sentence
                sentence = list()
            yield "paragraph_end"
        else:
            sentence.append((token, shape))
    if sentence:
        yield sentence



print "************************** SENTENCES IN THE PARSED TEXT:"
for sentence in parse_sentences(sample_text):
    print sentence
print "************************** PARAGRAPHS IN THE PARSED TEXT:"

for paragraph in parse_paragraphs(sample_text):
    print paragraph

print "************************** PARSED TEXT:"
print parse_text(sample_text)

### Exercise 12

**The programme in this exercise selects a character at random from the nested list generated by the previous programme. Run the programme and make sure you can understand what it is doing.**

**1. Recall that we defined a token vector to be an ordered pair (token, shape). Accessing the token or the shape with the code** token_vector[0] **or** token_vector[1] **is difficult to read. It is better to define the indices as constants. By convention, constants are given upper-case names and are defined at the top level of the module. Do you agree that this improves readability?**

**2. Notice how to index into the nested list and the character string. The line indexing the character string could have been written as:**

character = parsed_text[paragraph_coord][sentence_coord][token_coord][TOKEN][character_coord]

**Do you think this would have made the programme more readable?**
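**To compare the styles side by side, here is a small sketch on invented data. The nested list** tiny_parsed_text **is made up for this illustration; the constants have the same meaning as in the cell below.**

```python
TOKEN = 0   # position of the token within a token vector
SHAPE = 1   # position of the shape within a token vector

token_vector = ("A27", "alnum")
by_number = token_vector[0]       # works, but the reader must remember what 0 means
by_name = token_vector[TOKEN]     # same value, and the intent is stated

# One paragraph containing one sentence containing two token vectors:
tiny_parsed_text = [[[("Hello", "alpha"), ("7", "digit")]]]

# paragraph 0, sentence 0, token 1, then its shape:
shape = tiny_parsed_text[0][0][1][SHAPE]

# paragraph 0, sentence 0, token 0, its token string, then character 0:
first_character = tiny_parsed_text[0][0][0][TOKEN][0]
```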

In [None]:
import random

TOKEN = 0  
SHAPE = 1

def get_random_character_coordinates_in_text(parsed_text):
    """
    Given a parsed text, such as the one produced by parse_text,
    return a random character within the text, together with its
    coordinates.
    :param parsed_text: A nested list with token vectors within
        sentence lists within paragraph lists.
    :return: A vector where the elements are: the random character,
        the paragraph, sentence, token and character coordinates.
        Sample output: ('f', 3, 1, 2, 1)
    """

    # Generate a random index within a valid range:
    
    paragraph_coord = random.randrange(len(parsed_text))
    sentence_coord = random.randrange(len(parsed_text[paragraph_coord]))
    token_coord = random.randrange(len(parsed_text[paragraph_coord][sentence_coord]))
    token = parsed_text[paragraph_coord][sentence_coord][token_coord][TOKEN]
    character_coord = random.randrange(len(token))
        
    
    # With the obtained random coordinates, access the input parsed text:
    
    character = token[character_coord]
    
    return character, paragraph_coord, sentence_coord, token_coord, character_coord


parsed_text = parse_text(sample_text)
for _ in range(10):
    print get_random_character_coordinates_in_text(parsed_text)