In [2]:
import sys
sys.path.append(r'/Users/davidw/Documents/teach/NLE/resources')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re    #import regex module

def tokenise(sentence):
    sentence = re.sub(r"'(s|m|(re)|(ve)|(ll)|(d))\s", r" '\g<1> ", sentence + " ")
    sentence = re.sub(r"s'\s", "s ' ", sentence)
    sentence = re.sub(r"n't\s", " n't ", sentence)
    sentence = re.sub("gonna", "gon na", sentence)
    sentence = re.sub(r'"(.+?)"', r"`` \g<1> ''", sentence)
    sentence = re.sub(r"([.,?!])", r" \g<1> ", sentence)
    return sentence.split()

def make_same_length(listA,listB):
    # extends the shorter list (with empty strings) to make the lists have the same length
    if len(listA) < len(listB):
        listA.extend('' for _ in range(len(listB)-len(listA)))
    else:
        listB.extend('' for _ in range(len(listA)-len(listB)))
    return listA,listB

testsentence = "After saying \"I won't help, I'm gonna leave it all to you!\", on his parents' arrival, the boy's behaviour suddenly improved."


# Topic 1: Preprocessing Text

## Preliminaries 

*During the lab you should work your way through this document.*

*Do each of the **Something for you to do** activities that appear along the way.*


### Something for you to do

*The first thing you need to do is run the following cell. This will give you access to the Sussex NLTK package.*

In [86]:
# Edit this cell to uncomment one line and remove the one that follows

import sys
#sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
sys.path.append(r'/Users/davidw/Documents/teach/NLE/resources')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections

## Overview 

A raw text document is just a sequence of characters. There are a number of basic steps that are often performed when processing natural language text. In this lab session we will cover some of the basic text pre-processing methods. In particular, you will be looking at:
- <b>tokenisation</b> - roughly speaking, this involves grouping characters into words;
- <b>case normalisation</b> - this involves converting all of the text into lower case; 
- <b>stemming</b> - this involves removing a word's inflections to find the stem; and 
- <b>punctuation and stop-word removal</b> - stop-words are common function words that in some situations can be ignored.

Note that we do not always apply all of the above preprocessing methods; it depends on the application. One of the things that you will be learning about in this module is when the application of each of these methods is, and is not, appropriate.

### Available corpora

We have provided simple interfaces to each of the following corpora, which interact well with NLTK tools.

- The NLTK texts
- Amazon product reviews (~78k documents, ~640k sentences)
- Wall Street Journal text (~2k documents, ~51k sentences)
- Reuters articles (~61k documents, ~740k sentences)
  - Reuters / Finance (~47k documents, ~550k sentences)
  - Reuters / Sport (~13k documents, ~185k sentences)
- Medline abstracts (~985k documents, ~6100k sentences)
- Twitter posts (~962k documents, ~1720k sentences)

## Getting raw sentences from a corpus

The corpora are too large to easily process with some of the functions you will be using, so we have provided a way for you to work on a randomly selected sample of each corpus.

The Reuters, Twitter and Medline corpora have a function called <code style="background-color: #F5F5F5;">sample_raw_sents</code>, which returns a specified number of random sentences, where each sentence is an un-tokenised string.

The code in the next cell shows you how to iterate over a random sample of 10 sentences. When you are using a tokeniser, you will replace
`# do something with sentence`
with code that tokenises each sentence and prints the results.

In [None]:
from sussex_nltk.corpus_readers import ReutersCorpusReader

rcr = ReutersCorpusReader()    #Create a new reader

sample_size = 10

for sentence in rcr.sample_raw_sents(sample_size): #get a sample of random sentences, where each sentence is a string
    # do something with sentence
    pass    # placeholder so the loop has a body; remove it once you add your own code

### Something for you to do

- Make a copy of the cell above and move the copied cell so that it is positioned below this cell. 
- Adapt the code in the new cell so that it prints a sample of **20** sentences from the **Twitter** corpus.

In [None]:
from sussex_nltk.corpus_readers import TwitterCorpusReader

tcr = TwitterCorpusReader()    #Create a new reader

sample_size = 20

for sentence in tcr.sample_raw_sents(sample_size): #get a sample of random sentences, where each sentence is a string
    print(sentence)


### Something for you to do

- Point your browser at [Sussex NLTK package documentation](http://www.sussex.ac.uk/Users/davidw/courses/nle/SussexNLTK-API/) and have a look around. This provides information about the above corpora. Take a particularly careful look at the [`corpus_readers` Module](http://www.sussex.ac.uk/Users/davidw/courses/nle/SussexNLTK-API/sussex_nltk.html#module-sussex_nltk.corpus_readers)

### Something for you to do

- In the code cell below write code that will establish whether there are systematic differences between the  average sentence length (as measured in terms of the number of characters in the sentence) of the sentences in the Reuters, Twitter and Medline corpora.

In [None]:
# put your own solution in this cell

### Model solutions available (full and partial)

In [None]:
# uncomment the next line and then run the cell to load a partial solution
#%load ../Solutions/3/average_sentence_length_part

In [None]:
# uncomment the next line and then run the cell to load a full solution
# %load ../Solutions/3/average_sentence_length

## DIY Tokenisation with Regular Expressions

Text doesn't come in neat tokens ready for analysis; it must first undergo sentence segmentation and tokenisation.  
We have already sentence segmented the corpora.  
In this lab you will be focusing on tokenisation. In particular, you will be comparing the merits of the following tokenisers:  
- Your own regular expression based tokeniser
- The (NLTK implemented) Penn Treebank style regular expression based tokeniser
- A Twitter-specific CMU tokeniser

### Issues to consider

Your goal when working through this next section should be to investigate the strengths and weaknesses of each of the 3 tokenisers on three rather different kinds of corpora: 
- the Reuters corpus, 
- the Twitter corpus and 
- the Medline corpus.

### Making your own tokeniser

In this section, you will write your own Python function, which takes as input a single string representing a sentence, and returns a <b>list of strings</b> obtained by splitting the sentence into tokens.

Let's start by simply splitting by whitespace. 

In [49]:
print("   What    is the    air-speed   velocity of  an unladen swallow?   ".split()) 

['What', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow?']
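
Calling `split()` with no argument splits on any run of whitespace and ignores whitespace at either end, which is why the extra spaces above leave no trace. As a small illustration (the example string is made up), compare this with passing an explicit separator:

```python
text = "  What    is the   air-speed velocity?  "

# No argument: a run of whitespace counts as one separator and the ends are trimmed
tokens_default = text.split()
print(tokens_default)

# Explicit single-space separator: every space is a boundary, so empty strings appear
tokens_space = text.split(" ")
print(tokens_space)
```

For tokenisation, the no-argument form is almost always the one you want.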


### Something for you to do

- In the empty code cell below write a [function](http://docs.python.org/tutorial/controlflow.html#defining-functions), `tokenise` which takes a sentence as input and returns a list of the tokens making up the sentence. Your first version of this function should tokenise only on whitespace, as shown in the cell above. Show that your function works on the sentence shown above.


In [None]:
# put your code here

### Model solution available

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load ../Solutions/3/simple_tokenise

### Something for you to do

- In the empty code cell below write code that applies your tokenise function to each sentence in a sample of 30 sentences taken from  the Reuters, Twitter and Medline corpora, 10 sentences from each corpus.

In [None]:
# put your code here

### Model solution available

In [None]:
# uncomment the next line and then run the cell to load a solution
# %load ../Solutions/3/tokenise_samples

In most tokenisation policies (e.g. in the Wall Street Journal corpus), contractions like "I'm" tend to be split into "I" and "'m".  

When it comes to more than just splitting by whitespace, it can be convenient to use [regular expressions](http://docs.python.org/library/re.html) to process the string in some way. The following code cell illustrates this. Try running it and then read on to discover how it works.

In [2]:
import re    #import regex module

print(re.sub("([.?!'])", " \g<1>", "You're using coconuts!").split())   

['You', "'re", 'using', 'coconuts', '!']


Let's look at how the above code works by breaking it down.  

First, run the following cell.

In [56]:
import re    #import regex module

print(re.sub("'", " '", "You're using coconuts!")   )

You 're using coconuts!


As you can see, this code takes the string "You're using coconuts!" and inserts a space before the apostrophe, the `'` character. 

Let's see how it works...

The first argument of `re.sub`, i.e. `"'"`, is a regular expression that in this case is extremely simple, since it only matches the apostrophe character, `'`.

The second argument of `re.sub`, where we see `" '"`, indicates that an apostrophe should be substituted by a space followed by an apostrophe.

Now let's make it slightly more complicated. We also want to insert a space before the `"!"`, so let's look at how to do that. 

Run the following code cell.

In [3]:
import re    #import regex module

print(re.sub("(['!])", " \g<1>", "You're using coconuts!")   )

You 're using coconuts !


The first argument of `re.sub` has been changed to `"(['!])"`, which is a regular expression that matches either an apostrophe character, `'`, or an exclamation mark, `!`.

This is achieved with the regular expression `"['!]"`, where the square brackets enclose the alternative characters. 

Why does the regular expression contain parentheses? 

It has to do with what we need to put as the second argument of `re.sub` where the substitution is specified. 

To understand this, you need to appreciate that we want to add a space before an apostrophe and also a space before an exclamation mark. How can we specify that in the second argument of `re.sub`? 

The answer is that we need to make use of the idea of a **group**.

The parentheses in `"(['!])"` define the start and end of a group. In this case the whole regular expression is a group. In general, however, there can be several sets of parentheses defining several groups. For example, the regular expression `"([Tt]h)e (m*n)"` has two groups. Groups are numbered from left to right, so the group in the regular expression `"(['!])"` is group 1. 

Defining this group allows us to refer to the string that matches the regular expression `"(['!])"`, which will be either an apostrophe or an exclamation mark. This is then used in the second argument of `re.sub`, where we see `" \g<1>"`, which indicates that the matched apostrophe or exclamation mark should be substituted by a space followed by the symbol that was matched. The `1` in `\g<1>` tells us that it is group one.
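
To see group numbering in action, here is a minimal sketch that uses the two-group regular expression mentioned above. The test string is made up for illustration.

```python
import re

pattern = "([Tt]h)e (m*n)"    # group 1 is ([Tt]h), group 2 is (m*n)
match = re.search(pattern, "Call the mmn now")

print(match.group(1))    # the text matched by group 1
print(match.group(2))    # the text matched by group 2

# In a replacement string, \g<1> and \g<2> refer back to those groups.
# The r prefix keeps the backslashes intact for re.sub.
swapped = re.sub(pattern, r"\g<2> \g<1>e", "Call the mmn now")
print(swapped)
```

Here the replacement puts group 2 first and group 1 second, so the two matched pieces change places.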

We are now ready to look at the original code, which is reproduced below and should now make sense. 

In [51]:
import re    #import regex module

print(re.sub("([.?!'])", " \g<1>", "You're using coconuts!").split())   

['You', "'re", 'using', 'coconuts', '!']


First, the spaces are added before any full stop, question mark, exclamation mark or apostrophe.
The resulting string is then split on white space.
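
The sketch below separates these two steps so that you can inspect the intermediate string. It also writes the strings with an `r` prefix (raw strings), which is an addition to the original cell: recent versions of Python warn about unrecognised escapes such as `\g` in ordinary string literals, and raw strings avoid that warning without changing the result.

```python
import re

# Step 1: insert a space before each full stop, question mark,
# exclamation mark or apostrophe
spaced = re.sub(r"([.?!'])", r" \g<1>", "You're using coconuts!")
print(spaced)

# Step 2: split the resulting string on whitespace
tokens = spaced.split()
print(tokens)
```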

### Something for you to do

- Create an empty code cell below, and write a new version of your `tokenise` function that uses `re.sub` in the way we've just considered. 

In [None]:
# put your code here

### Model solution available

In [222]:
# uncomment the next line and then run the cell to load a solution
# %load ../Solutions/3/tokenise_with_re.sub

['What', 'is', 'the', 'air-speed', '.', 'velocity', 'of', 'an', 'unladen', 'swallow', '?']


### Something for you to do


- Create an empty code cell below, and extend your tokeniser function to cater for the following guidelines. 
- Test out your new tokeniser on the string  
`"After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."`  
  Notice that the `"` characters in the test sentence have been escaped, appearing as `\"`.

### Guidelines

- punctuation is split from adjoining words
- opening double quotes are changed to two single forward quotes (two backtick characters).
- closing double quotes are changed to two single backward quotes (two apostrophe characters).
- the Anglo-Saxon genitive of nouns is split into its component morphemes, and each morpheme is tagged separately.
  - e.g. `"children's"` produces `"children 's"`
  - e.g. `"parents'"` produces `"parents '"`
- contractions should be split into component morphemes
  - e.g. `"won't"` produces `"wo n't"`
  - e.g. `"gonna"` produces `"gon na"`
  - e.g. `"I'm"` produces `"I 'm"`
  
  
These tokenisation guidelines are a subset of those found [here](http://www.cis.upenn.edu/~treebank/tokenization.html).



### Hints:

- Use multiple calls to `re.sub` to deal with different cases one at a time. As in...

```
    sentence = re.sub(<pattern1>, <replacement1>,sentence)
    sentence = re.sub(<pattern2>, <replacement2>,sentence)
    sentence = re.sub(<pattern3>, <replacement3>,sentence)
```

- Order your calls to `re.sub` so that you deal with the specific cases first and the more general cases later.

- In dealing with the replacement of start and end `"`, you will find the following useful:

>The `'*'`, `'+'`, and `'?'` qualifiers are all *greedy*; they match
>as much text as possible.  Sometimes this behaviour isn't desired; if the RE
>`<.*>` is matched against `<a> b <c>`, it will match the entire
>string, and not just `<a>`.  Adding `'?'` after the qualifier makes it
>perform the match in *non-greedy* or *minimal* fashion; as *few*
>characters as possible will be matched.  Using the RE `<.*?>` will match
>only `<a>`.  
(taken from https://docs.python.org/2/library/re.html).
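
The difference between greedy and non-greedy matching is easy to see with quoted text. The following is a small sketch on a made-up sentence; it is not the model solution, only an illustration of the two behaviours.

```python
import re

text = 'He said "yes" and she said "no".'

# Greedy: .+ runs on to the LAST double quote in the string
greedy = re.findall(r'"(.+)"', text)

# Non-greedy: .+? stops at the FIRST double quote it can
non_greedy = re.findall(r'"(.+?)"', text)

print(greedy)
print(non_greedy)
```

The greedy pattern treats everything between the first and last quote mark as one quotation, whereas the non-greedy pattern finds each quotation separately.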


### Model solution available

In [225]:
# uncomment the next line and then run the cell to load a solution
# %load ../Solutions/3/my_tokeniser

['After', 'saying', '``', 'I', 'wo', "n't", 'help', ',', 'I', "'m", 'gon', 'na', 'leave', '!', "''", ',', 'on', 'his', 'parents', "'", 'arrival', ',', 'the', 'boy', "'s", 'behaviour', 'improved', '.']


## The NLTK regular expression tokeniser

NLTK implements a regular expression tokeniser, `word_tokenize`, that is based on the above tokenisation guidelines. 

**Function**: `word_tokenize`

- Arguments
 - a single string, representing a sentence
- Returns
 - a list of strings, where each string is a token within the sentence

### Something for you to do

- Make sure you understand the code in the cell below and then run it so that you can compare the way that the test sentence has been tokenised by the two tokenisers.

In [45]:
from nltk.tokenize import word_tokenize

# Define a useful function for printing two sequences in a way that makes it easier to compare them
def print_lists_in_columns(sequence1,sequence2,heading1,heading2):
    
    def make_lists_same_length(listA,listB):
        # Extends the shorter list (with empty strings) so that the lists have the same length
        if len(listA) < len(listB):
            listA.extend('' for _ in range(len(listB)-len(listA)))
        else:
            listB.extend('' for _ in range(len(listA)-len(listB)))
        return listA,listB

    sequence1,sequence2 = make_lists_same_length(sequence1,sequence2)
    datadict = {heading1 : sequence1,
                heading2: sequence2}
    df = pd.DataFrame(datadict,columns=[heading1,heading2])
    print(df,"\n")

testsentence = "After saying \"I won't help, I'm gonna leave!\", on his parents' arrival, the boy's behaviour improved."

# run the nltk tokeniser and your tokeniser on the test sentence
nltk_toks = word_tokenize(testsentence) # run the nltk tokeniser
my_toks = tokenise(testsentence) # run your tokeniser

# print the sequences side by side
print_lists_in_columns(nltk_toks,my_toks,'NLTK','MINE')

         NLTK       MINE
0       After      After
1      saying     saying
2          ``         ``
3           I          I
4          wo         wo
5         n't        n't
6        help       help
7           ,          ,
8           I          I
9          'm         'm
10        gon        gon
11         na         na
12      leave      leave
13          !          !
14         ``         ''
15          ,          ,
16         on         on
17        his        his
18    parents    parents
19          '          '
20    arrival    arrival
21          ,          ,
22        the        the
23        boy        boy
24         's         's
25  behaviour  behaviour
26   improved   improved
27          .          . 



### Something for you to do

- In the code cell below write code to run both the NLTK tokeniser, `word_tokenize`, and your own `tokenise` function on a sample of 10 sentences from the Reuters corpus.
- Use the `print_lists_in_columns` function (defined above) to show the two tokenisations of each sentence.
- Look for differences in the output of the two tokenisers.


In [None]:
# %load ../Solutions/3/nltk_vs_mine

## The Twitter-specific Tokeniser

The third tokeniser for you to explore is a Twitter-specific tokeniser that has been developed by [Gimpel et al.](http://ttic.uchicago.edu/~kgimpel/papers/gimpel+etal.acl11.pdf) as part of a Twitter-specific part-of-speech tagger (featured in later lab classes).

**Function**: `twitter_tokenize`
- Arguments
 - a single string, representing a sentence
- Returns
 - a list of strings, where each string is a token within the sentence

`twitter_tokenize` can be quite slow, so we have provided the following function to tokenise an entire sample of sentences at once.  

**Function**: `twitter_tokenize_batch`
- Arguments
 - a list of strings, where each string represents a sentence
- Returns
 - a list of sentences, where each sentence is a list of tokens

### Something for you to do
- Create a new code cell and write code to run both `twitter_tokenize` and the NLTK tokeniser, `word_tokenize`, on each sentence in a sample of 10 sentences from the Twitter corpus.
- Display each sentence tokenised by the two tokenisers using the `print_lists_in_columns` function defined above.
- Once you have done this, look for differences in the output of the two tokenisers.


In [None]:
# %load ../Solutions/3/nltk_vs_twitter

### Something for you to do
- Re-using the code from the cell above, use both the NLTK and Twitter tokenisers on a sample of 10 sentences from the **Medline** corpus.
- Look for situations where the  tokenisers do not tokenise appropriately.
- Try to figure out the differences in tokenisation policies of the tokenisers.
- Think about possible motivations for the differences in tokenisation policy, by considering how the tokens may be used in subsequent (down-stream) language processing steps.


In [None]:
# %load ../Solutions/3/nltk_vs_twitter_medline

## Normalising text and removing unimportant tokens

### Number and case normalisation

Without any kind of normalisation, the tokens `"help"` and `"Help"` are two distinct types. This may or may not be what you want. 

Similarly, `"1998"` and `"1999"` count as distinct types, although there are situations where there is no need to distinguish between different numbers.

Python strings provide a [number of methods](http://docs.python.org/library/stdtypes.html#string-methods), which you can call in order to analyse a string's content or to produce a new string from it.

The following code performs case normalisation and replaces tokens that consist entirely of digits with "NUM". 

It uses [list comprehensions](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) to build a new list by looping through and filtering items.

In [63]:
tokens = ["The","cake","is","a","LIE"]      #a list of tokens, some of which contain uppercase letters
print([token.lower() for token in tokens])   #print newly created list of all lowercase tokens

numbers = ['in', 'the', 'year', '120', 'of', 'the', 'fourth', 'age', ',', 'after', '120', 'years', 'as', 'king', ',' , 'aragorn', 'died', 'at', 'the', 'age', 'of', '210']
print(["NUM" if token.isdigit() else token for token in numbers])  #replace all number tokens with "NUM" in a new list of tokens

['the', 'cake', 'is', 'a', 'lie']
['in', 'the', 'year', 'NUM', 'of', 'the', 'fourth', 'age', ',', 'after', 'NUM', 'years', 'as', 'king', ',', 'aragorn', 'died', 'at', 'the', 'age', 'of', 'NUM']
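
Note that `isdigit` is true only when every character in the token is a digit. The following sketch, using made-up tokens, shows which kinds of number-like token the code above would and would not replace.

```python
candidates = ["120", "1999", "1,200", "3.14", "-5", "4th"]

# Print each token alongside the result of isdigit()
for token in candidates:
    print(token, token.isdigit())

# Only tokens made up entirely of digits pass the test
only_digit_tokens = [token for token in candidates if token.isdigit()]
print(only_digit_tokens)
```

Tokens containing commas, decimal points, signs or letters are left untouched, so a fuller number normaliser would need additional rules.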


### Something for you to do
- Create a new code cell below into which you should put code that normalises tokens such as `"4th"`, `"1st"` and `"22nd"` to `"Nth"`.
- Try to adapt this code from the cell above: `["NUM" if token.isdigit() else token for token in numbers]`
- Test your code on the list `["The", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]`. 
- Check that the token `"and"` isn't changed to `"Nth"`.
- You will find [this page](http://docs.python.org/library/stdtypes.html#string-methods) useful.


In [None]:
# %load ../Solutions/3/normalise_to_Nth

### Something for you to do
- Complete the code in the cell below. You have just two lines to complete. The goal is to use a large sample of the Reuters corpus to establish the extent to which vocabulary size is reduced when number and case normalisation is applied.
- For each of the two incomplete lines you should use nested list comprehensions. This is described in Section 5.1.4 in [this document](http://docs.python.org/tutorial/datastructures.html#list-comprehensions)
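
As a reminder of the shape of a nested list comprehension, here is a minimal sketch on toy data. It computes token lengths rather than doing any normalisation, so you will still need to work out the two lines for yourself.

```python
sentences = [["The", "cake", "is", "a", "LIE"],
             ["Aragorn", "died"]]

# The outer comprehension loops over sentences;
# the inner one builds a new list from the tokens of each sentence
token_lengths = [[len(token) for token in sentence] for sentence in sentences]
print(token_lengths)
```

The result has the same list-of-lists structure as the input, which is what `vocabulary_size` expects.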


In [None]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize

def vocabulary_size(sentences):
    tok_counts = collections.defaultdict(int)
    for sentence in sentences: 
        for token in sentence:
            tok_counts[token] += 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()    

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

############################################
lowered_sentences = # complete this line
normalised_sentences = # complete this line
############################################

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


In [None]:
# %load ../Solutions/3/impact_of_normalisation

## Stemming

A considerable amount of the apparent lexical variation found in documents results from the use of morphological variants which do not have a major impact on the topic of the document. An easy way to remove these varied forms is to use a stemmer. NLTK includes a number of stemmers in the <code style="background-color: #F5F5F5;">nltk.stem</code> package.

[NLTK stem module API](http://nltk.org/api/nltk.stem.html)

[NLTK Porter stemmer](http://nltk.org/api/nltk.stem.html?highlight=stemmer#nltk.stem.porter.PorterStemmer)

### Something for you to do
- Complete the code below to show how the NLTK implementation of the Porter stemmer in `nltk.stem.porter.PorterStemmer` stems a sample of sentences in the Reuters corpus. All you need to do is to provide the missing first two arguments to the call to `print_lists_in_columns`.
- Have a close look at the differences between the columns. This will give you a good indication of what the stemmer does.

In [None]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

rcr = ReutersCorpusReader() 
st = PorterStemmer()

sample_size = 10

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

for sentence in tokenised_sentences:
    print_lists_in_columns(<Argument 1>,<Argument 2>,"BEFORE","AFTER")

In [None]:
# %load ../Solutions/3/show_stemmer_sample

### Something for you to do
- By looking at the impact on a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by stemming.
- Create a new code cell for this. You should be able to re-use a lot of the code you wrote when measuring the impact of lower case and number normalisation.

In [None]:
# %load ../Solutions/3/impact_of_stemming

### Punctuation and stop-word removal

A stop-word is a word that occurs so often that it loses its usefulness in some tasks. We may get more meaningful information from our corpus analysis if we remove stop-words and punctuation.

The code below takes a list of tokens and creates a new list, which contains only those strings which are alphabetic and are not stop-words.

In [None]:
from nltk.corpus import stopwords

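# use a new name so that the imported stopwords corpus reader is not overwritten
stop_words = set(stopwords.words('english'))
# tokens is the list of tokens defined in an earlier cell
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]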

**NOTE**

`isalpha` returns `True` only if the string is composed entirely of alphabetic characters. If you want a test that also returns `True` when a word contains digits, then you should use `isalnum`.
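
The difference between the two methods can be checked directly. This is a small sketch on made-up tokens:

```python
examples = ["swallow", "B12", "n't", ","]

# Map each token to the pair (isalpha result, isalnum result)
checks = {token: (token.isalpha(), token.isalnum()) for token in examples}

for token, (alpha, alnum) in checks.items():
    print(repr(token), "isalpha:", alpha, "isalnum:", alnum)
```

Notice that a token such as `"n't"` fails both tests because of the apostrophe, so filtering with `isalpha` removes it along with the punctuation.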

### Something for you to do
- By looking at the impact on a large sample of the Medline corpus, establish what proportion of tokens are stop-words.

In [None]:
# %load ../Solutions/3/impact_of_stopword_removal

### Organising your code

As you progress through these notebooks, you are creating (and we are providing) code snippets that perform various tasks that you may want to re-use at a later point. Instead of copying and pasting code from lab sheets, it is better to build a library of functions that is easy for you to access.

To do this, create another Python module (see below for how this is done) that will contain frequently used functions (you can call it <code style="background-color: #F5F5F5;">nle_utils</code>). For instance, the stop-word removal code from this lab session is something you will probably need later on in the module. Instead of copying and pasting the code every time you need it, you can put the code into a function in <code style="background-color: #F5F5F5;">nle_utils</code> and use it by importing the module.

<code style="background-color: #F5F5F5;">import nle_utils</code>  
<code style="background-color: #F5F5F5;">nle_utils.remove_stopwords('some text with stopwords')</code>

For more information see:  
- [Python functions](http://docs.python.org/2/tutorial/controlflow.html#defining-functions)
- [Python modules](http://docs.python.org/2/tutorial/modules.html)
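
As a rough sketch of what such a module might contain, the code below shows one possible `remove_stopwords` function. Everything here is an assumption made for illustration: the return type (a list of tokens), the whitespace tokenisation, and in particular the tiny placeholder stop-word list, which in practice you would replace with NLTK's English stop-word list.

```python
# Hypothetical contents of nle_utils.py

# Placeholder stop-word list for illustration only
DEFAULT_STOP_WORDS = frozenset(["the", "a", "an", "of", "is", "with", "some"])

def remove_stopwords(text, stop_words=DEFAULT_STOP_WORDS):
    """Return the alphabetic, non-stop-word tokens of text (split on whitespace)."""
    return [w for w in text.split()
            if w.isalpha() and w.lower() not in stop_words]

result = remove_stopwords("some text with stopwords")
print(result)
```

Passing the stop-word list as a parameter keeps the function easy to test and lets you swap in a different list for a different task.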


### Saving your code with IPython magic

You can save the code in one or more of the cells in a notebook using the IPython magic command <code style="background-color: #F5F5F5;">save</code>. For example:

- <code style="background-color: #F5F5F5;">%save nle_utils.py 14</code> will save the contents of cell 14 to file <code style="background-color: #F5F5F5;">nle_utils.py</code>.
- <code style="background-color: #F5F5F5;">%save nle_utils.py 1-10 12 15</code> will save contents of cells 1-10, 12 and 15 to file <code style="background-color: #F5F5F5;">nle_utils.py</code>.

Note that a cell is assigned a number when it is executed; this is the number shown in the `In [ ]:` prompt beside the cell.

Note that this will overwrite the file; use the `-a` option if you want to append to it instead.

To execute an IPython magic command, place it in a cell and run that cell.

If you want to edit this code, load it back into a notebook cell. To do this, place the following in a code cell and run the cell. The code will then appear in a new code cell.

<code style="background-color: #F5F5F5;">%load nle_utils.py</code>