# Data

**ENGSCI233: Computational Techniques and Computer Systems** 

*Department of Engineering Science, University of Auckland*

# 0 What's in this Notebook?

Data provides the critical link between what we do on a computer (modelling, analysis) and what is happening out in the real world. This module will develop your understanding of some fundamental concepts that are preliminary to actually working with data: (i) file I/O and file system manipulation, and (ii) representation of data in a data-type.

The important things to know with file I/O are: how to create, read and write text files from a computer program, and how to copy, move and delete files from the file system. Other really cool things you can do, but we won't cover them here: add and extract files from zip archives, download files from the internet, and broadcast a command to your friend's laptop that [**formats their hard-drive**](https://www.youtube.com/watch?v=dQw4w9WgXcQ).

We'll also start looking at data-types, a gentle introduction to the Python concept of a `Class`. Here, we'll start with something called a linked list, which has a few limited applications. Later, we'll generalise this to a network where we'll really start to see the power of dedicated data types.

You need to know:
- The basic file system manipulation commands - read, write, copy, delete, move - and how to apply them in different situations.
- The attributes and methods of a linked list: delete, insert, pop, and pointers.

In [None]:
# imports and environment: this cell must be executed before any other in the notebook
%matplotlib inline
from data233 import *

## 1 File I/O and File System Operations

<mark>***Read, write, copy, move and delete files from within your code.***</mark>

Before we can look at, analyse, visualise and understand data, we must first ***load it*** into our workspace. 

Usually, data will exist as one or more files somewhere on the file system. The key operations are:
1. ***Opening*** files, particularly those not in the current folder.
2. ***Reading*** both data and metadata from the file.
3. ***Writing*** to files.
4. ***Manipulating*** files (copy, delete, etc.) and the file system (folder creation, deletion).

### 1.1 Opening files

As with MATLAB, we can open a file, assign a ***file pointer*** and use this to extract the information. For example, open the file ```example_file.txt```:

In [None]:
# opening a file
fp = open('example_file.txt','r')     # the first argument is the file name, the second indicates we are opening in "read-only"
print(fp)

# closing a file
fp.close()

***Be careful***: if you open an *existing* file in write mode (by passing the argument ```'w'```), its contents are deleted and a new file is started from scratch.
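
The difference between the file modes can be seen in a small sketch. It is not part of the course files: the file name `modes_demo.txt` is hypothetical, the work happens in a temporary folder, and the `with` statement (not used elsewhere in this notebook) is assumed here as a convenient way to close each file automatically.

```python
import os
import tempfile

# work inside a throwaway folder so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'modes_demo.txt')   # hypothetical file name

    # 'w' creates the file (or wipes it if it already exists)
    with open(path, 'w') as fp:
        fp.write('first line\n')

    # 'a' appends to the end, keeping what is already there
    with open(path, 'a') as fp:
        fp.write('second line\n')

    with open(path, 'r') as fp:
        after_append = fp.read()

    # opening in 'w' again discards both lines
    with open(path, 'w') as fp:
        fp.write('fresh start\n')

    with open(path, 'r') as fp:
        after_rewrite = fp.read()

print(after_append)
print(after_rewrite)
```

If you want to add to an existing file rather than replace it, append mode (`'a'`) is usually the safer choice.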

### 1.2 Reading from files

***Execute the cell below*** to show the content of the file we just opened:

In [None]:
%pycat example_file.txt

It contains both ***data*** (the numbers) as well as ***metadata*** (information about the numbers, in the top row, often called ***headers***).

We can use preexisting knowledge about the file's structure to write a short piece of code to read it:
1. The file contains a ***single header*** line. It is the ***first line***.
2. The file contains ***two columns*** of data.
3. Assume we ***don't know*** how much data there is in the file (don't know how long the columns are).
4. Data in the columns are ***comma separated***.

This is a common data format called a CSV file (for comma-separated values). 
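
The four assumptions listed above can be turned into code in more than one way. The sketch below is one possibility: it loops over the lines with `for` rather than `while`, and it reads from an in-memory string (`io.StringIO`) whose contents are made up for illustration, so it does not depend on `example_file.txt`.

```python
import io

# hypothetical contents, standing in for a file with one header line
# and two comma-separated columns
text = 'x, T\n0.0, 15.5\n1.0, 16.25\n2.0, 18.0\n'

xs, Ts = [], []
with io.StringIO(text) as fp:        # behaves like an open text file
    hdr = fp.readline().strip()      # the first line is the header
    for ln in fp:                    # the loop ends at the end of the file
        x, T = ln.strip().split(',') # two columns, comma separated
        xs.append(float(x))          # float() ignores surrounding spaces
        Ts.append(float(T))

print(hdr)
print(xs, Ts)
```

Because the loop stops when the lines run out, the amount of data does not need to be known in advance.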

***Run the code below***. Uncomment each of the ```print``` statements in turn to help you understand what is going on.

In [None]:
# open the file
import numpy as np
fp = open('example_file.txt','r')

# read the header
hdr = fp.readline()         # this function pulls out the next line as a string
#print(hdr)

# enter a while loop to read the data
ln = fp.readline()         # read the first line of the data
xs = []                    # empty lists for the data to be stored in
Ts = []                
while ln != '':            # check: is it the end of the file?
    # "strip" off the "newline" character at the end
    ln = ln.strip()
    #print(ln)
    
    # "split" the string into two substrings, using the comma as a delimiter
    x,T = ln.split(',')
    #print(x)
    
    # save the data in a list
    xs.append(float(x))    # adds the x value to the end of the list of x's
    Ts.append(float(T))    # adds the T value to the end of the list of T's
    #print(xs)
    
    # read the next line before going back to the start of the loop
    ln = fp.readline()         

# close the file
fp.close()

# convert the data from lists to arrays
#print(Ts)
xs = np.array(xs)
Ts = np.array(Ts)
#print(Ts)

***- - - - CLASS CODING EXERCISE - - - -***

In [None]:
# PART ONE
# --------
# OPEN example_file.txt and keep a running total of the SUM of 
# temperature values.
# (hint: copy-paste code from the cell above)

# **your code here**

In [None]:
# PART TWO
# --------
# Extend PART ONE by OPENING a second file called 'cumulative_temperature.txt',
# and WRITE out depth and cumulative temperature data.

# **your code here**

In [None]:
# OPTIONAL CHALLENGE
# ------------------
# Write out temperature AND cumulative temperature in SCIENTIFIC NOTATION 
# accurate to 7DP. Make sure the header accurately reflects the column data.


### 1.3 Writing to files

The trickiest part of writing data to a file is getting your head around Python's [```format```](http://thepythonguru.com/python-string-formatting/) method for strings. The best way to learn is through example:
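
A compact summary of the format specifiers that appear in this section might look like the sketch below. The variable names and numbers are arbitrary illustrations, not values from the course files.

```python
# each placeholder {} can carry a format spec after the colon
as_int = '{:d}'.format(42)             # plain integer
padded = '{:03d}'.format(7)            # integer, zero-padded to width 3
fixed = '{:.2f}'.format(3.14159)       # float with two decimal places
sci = '{:.3e}'.format(1500.0)          # scientific notation, three decimal places
both = '{:d}, {:.1f}'.format(2, 0.5)   # several values, filled in order

for s in [as_int, padded, fixed, sci, both]:
    print(s)
```

The part before the dot sets a minimum width and the part after it sets the number of decimal places, so `{:8.7e}` asks for at least 8 characters with 7 digits after the decimal point.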

In [None]:
# the print function displays a string as output to the screen
print("Hello, I am string.")

In [None]:
# the string class has a format method that modifies the string
print("Hola, soy {}.".format("string"))

In [None]:
# we use format to insert numbers into strings
print("{:d} is an integer".format(3))

In [None]:
# all kinds of numbers
print("and {:f} is a float.".format(3))

In [None]:
# we can control how the numbers look, and how many numbers to insert
print("{:d} decimal places is readable but not precise for computation, e.g., {:3.2f}".format(2, np.pi))

In [None]:
# scientific notation is handy when you don't know how large or small your number is going to be
print("this can be a good way to represent long ugly numbers, e.g., {:8.7e}".format(np.sqrt(np.pi*1.e12)))

In [None]:
# writing data to a file
fp = open('my_file.txt','w')

# write some string data
fp.write('this is the header')
fp.write('\n')          # new line characters are important (comment this line and see for yourself)
hdr = 'it should contain human readable information for anyone who happens to open the file'
fp.write(hdr+'\n')

# write some other data
case = 1
for i in range(10):
    if case == 1:
        # some integers
        fp.write('{:03d}\n'.format(i))      # what happens when you take out the '03' part?
    elif case == 2:
        # some floats in scientific notation
            # generate some random numbers between 0 and 1
        r1 = np.random.random()
        r2 = np.random.random()
            # what does this code do?
        if r1>r2:
            r1,r2 = r2,r1
            # write the pair of numbers
        fp.write('{:8.7e}, {:8.7e}\n'.format(r1, r2))
    elif case == 3:
        # **to be completed by you**
        #  - in the first column, put the index
        #  - in the second column, write a random number between 3 and 8
        #  - only write out data for even indices (use the % operator to get "remainder when divided by")
        ___
        

# close the file
fp.close()

# view the file we just created
%pycat my_file.txt

***Complete the code above for `case == 3`.***

***Write code in the cell below to:***
1. ***read the contents of `example_file.txt`***
2. ***multiply the second column by 2***
3. ***write out a new file `example_file2.txt` with the modified column 2, preserving the metadata***

In [None]:
# **your code here**

### 1.4 File system manipulation

Housekeeping is important, especially on a computer. You should keep your files and data in an organised, logical folder structure. To assist with this, Python offers commands to:
- ***copy, move and delete*** individual files
- ***create and delete*** directories
- a ***wildcard*** notation for file and folder selection
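
As a rough sketch of how these commands fit together, the example below builds paths with `os.path.join` and creates a folder with `os.makedirs(..., exist_ok=True)`, which does not raise an error when the folder already exists. The file and folder names are hypothetical, and everything happens in a temporary folder that is removed afterwards.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical file and folder names
    src = os.path.join(tmp, 'notes.txt')
    folder = os.path.join(tmp, 'archive')
    dst = os.path.join(folder, 'notes.txt')

    # create a small file to work with
    with open(src, 'w') as fp:
        fp.write('some data\n')

    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    os.makedirs(folder, exist_ok=True)   # so the call is safe to repeat

    # move the file into the new folder
    shutil.move(src, dst)

    moved = os.path.isfile(dst)
    source_gone = not os.path.exists(src)

print(moved, source_gone)
```

`os.path.join` inserts the correct directory separator for the operating system, which is an alternative to concatenating strings with `os.sep`.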

In [None]:
# create a copy of the file 'example_file.txt' named 'example_file_copy.txt'
import os, shutil               # a bunch of useful file system operations
?shutil.copyfile
# **to do**

In [None]:
# create a new subdirectory and move the copied file there
os.makedirs('temporary_folder')
shutil.move('example_file_copy.txt','temporary_folder'+os.sep+'example_file_copy.txt')

# os.sep returns a string to indicate a directory separator - it is different on different operating systems
# e.g., os.sep returns '\\' on windows and '/' on unix/mac
# don't worry too much about this...

# note, there are errors when we execute this cell a second time
# 1. we cannot 'create' a directory that already exists
# 2. we cannot move a file from the 'source' location more than once

# modify the code above with try/except blocks to handle these errors
#try:
#    **some commands**
#except:
#    # an error? do this instead
#    # option 1
#    pass    # <- do nothing
#    # option 2
#    print('folder already exists') # <- print some info

#### 1.4.1 Deleting files and folders

It goes without saying, you should ***exercise caution*** when automating delete commands within a script.

In [None]:
# delete a file
os.remove('example_file_copy.txt')

# delete a folder and everything in it
shutil.rmtree('temporary_folder')

# both of these commands raise errors if the files and folders are not found
# **modify the code to handle these errors**

#### 1.4.2 Using wildcards

The `glob` module is useful to select files and folders conforming to a pattern. Complete the comments in the code below.

In [None]:
# import package for wildcards
from glob import glob

# list all files and folders in the current directory
files = glob('*')        # finds all files and folders matching the string, the * means "any text here"
#for file in files: print(file)  # print the list of files and folders

# loop through the list, if a folder is encountered, print its name and list the folder's contents
for file in files:             # for each item in the list
    if os.path.isdir(file):       # **your comment**
        print(file)                  
        subfiles = glob(file+os.sep+'*')  # **your comment**
        if len(subfiles)>0:          # **your comment**
            for subfile in subfiles:    
                print('  '+subfile.split(os.sep)[-1])    
        else:                           
            print('  *empty folder*')      

***Write code below to search for the two subfolders and swap their contents.***

In [None]:
# **your code here**

For more practice working on the file system, open the [**input_output_example.ipynb**](input_output_example.ipynb) notebook and complete the exercises.
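
Returning briefly to the wildcards of section 1.4.2: a pattern can also be restricted to a file extension. The sketch below assumes a temporary folder with a few hypothetical, empty files and selects only those ending in `.txt`.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # create a few empty files with hypothetical names
    for name in ['a.txt', 'b.txt', 'c.csv']:
        open(os.path.join(tmp, name), 'w').close()

    # '*.txt' matches any name ending in .txt
    txt_paths = glob(os.path.join(tmp, '*.txt'))

    # keep only the file names and sort them, since glob does not promise an order
    txt_files = sorted(os.path.basename(p) for p in txt_paths)

print(txt_files)
```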

## 2 Linked Lists

***Before starting this section, work through the [notebook](object_oriented.ipynb) on object-oriented programming.***

In ENGGEN 131 you were introduced to the concept of an array: a contiguous block of memory in which a series of integer or floating point numbers could be stored. This form of memory management is limited with regard to **insertion** or **deletion** of elements from the interior of the array.

To get around these limitations, we use a data-type called a ***[Linked List](https://www.cs.cmu.edu/~adamchik/15-121/lectures/Linked%20Lists/linked%20lists.html)***.

Each item in the linked list is a container - called a **node** - that contains two pieces of information:
1. The **value**, the thing being stored in the list (a number, a string, some other object, another list).
2. A **pointer** to the next node in the list.

To access the list, we only need to know where the first node is, called the **head**. Subsequent items in the list are found by following the trail of pointers. The last node in the list has a **null pointer** (in Python, we shall use `None`).
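
The idea of a head, a trail of pointers and a null pointer can be sketched in a few lines. The class `TinyNode` below is a hypothetical stand-in used only for this illustration; it is not the `Node` class defined later in this notebook.

```python
class TinyNode:
    '''Hypothetical stand-in for a list node: a value plus a pointer.'''
    def __init__(self, value, pointer=None):
        self.value = value
        self.pointer = pointer

# build the chain back to front: 'a' -> 'b' -> 'c' -> None
third = TinyNode('c')               # last node, null pointer
second = TinyNode('b', third)
head = TinyNode('a', second)

# follow the trail of pointers, starting from the head
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.pointer

print(values)
```

Only `head` is needed to reach every value; the other two names exist just to make the construction easier to read.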

Note, in Python the [List](https://www.tutorialspoint.com/python/python_lists.htm) class is actually [implemented as an array](https://docs.python.org/3/faq/design.html#how-are-lists-implemented).

### 2.1 Linked lists versus arrays

This is not a case of one being better than the other, but rather each type being more suited to a particular application. The strengths and weaknesses of these data-types are discussed below.

#### Memory

Linked lists require more memory than arrays. This is because, in addition to allocating memory to store an item's value, they also require memory to store the pointer to the next item.

#### Ease of access

Because arrays store their values sequentially in memory, to access any value, we need only know its index and the pointer to the beginning of the array. This is called *random access*, i.e., we can jump directly to any value without visiting the others.

To access a particular item in a linked list, we must begin at the head and then follow the trail of pointers until the correct node is arrived at. This is called *sequential access*.

#### Flexibility of manipulation

Insertion or deletion of values into or from the middle of a linked list is handled by a reasonably straightforward reassignment of pointers. Memory allocation and deallocation for these tasks is simple. 

When inserting or deleting from an array, the *tail* (items occurring after the insertion/deletion) has to be shifted. If there are a large number of insertions, memory may have to be reallocated and the array copied.
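
To make the cost of shifting the tail concrete, the sketch below mimics a fixed-size array with a Python list that has spare slots at the end. The helper `insert_with_shift` is hypothetical and written only for this illustration; Python's own `list.insert` does this work for you.

```python
def insert_with_shift(buffer, count, index, value):
    '''Insert VALUE at INDEX in a fixed-size BUFFER holding COUNT items.

    Hypothetical helper: returns the new item count and the number of
    items that had to be moved.
    '''
    if count >= len(buffer):
        raise ValueError('buffer is full, a larger one would need allocating')
    if not 0 <= index <= count:
        raise IndexError('index out of range')

    # shift the tail one slot to the right, starting from the end
    for i in range(count, index, -1):
        buffer[i] = buffer[i-1]

    # the slot at INDEX is now free
    buffer[index] = value
    return count + 1, count - index

buffer = [10, 20, 30, 40, None, None]   # four items, two spare slots
count, moved = insert_with_shift(buffer, 4, 1, 15)
print(buffer, count, moved)
```

The closer the insertion is to the start of the array, the more items have to move, whereas a linked list only needs its pointers reassigned.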

#### [When should I use which data-type?](http://stackoverflow.com/questions/393556/when-to-use-a-linked-list-over-an-array-array-list)

### 2.2 Python implementation

A simple implementation of the linked list is given below. Two classes are defined - `LinkedList` and `Node` - as well as methods to `append`, `insert`, `get_length`, `get_value`, and `get_node`.

***Execute the cell below to make linked list classes and methods available.***

In [None]:
class LinkedList(object):
    '''A class with methods to implement linked list behavior.
    '''
    def __init__(self):
        '''Initialise an empty list.
        '''
        self.head = None
        
    def __repr__(self):
        '''Print out values in the list.
        '''
        # special case, the list is empty
        if self.head is None:
            return '[]'

        # print the head node
        ret_str = '['                        # open brackets
        node = self.head
        ret_str += '{}, '.format(node.value) # add value, comma and white space

        # print the nodes that follow, in order
        while node.pointer is not None:      # stop looping when reach null pointer
            node = node.pointer               # get the next node
            ret_str += '{}, '.format(node.value)
        ret_str = ret_str[:-2] + ']'         # discard final white space and comma, close brackets
        return ret_str
    
    def append(self, value):
        '''Insert a new node with VALUE at the end of the list.
        '''
        # insert value at final index in list        
        self.insert(self.get_length(), value)
        
    def insert(self, index, value):
        '''Insert a new node with VALUE at position INDEX.
        '''
        # create new node with null pointer
        new_node = Node(value, None)
        
        # special case, inserting at the beginning
        if index == 0:
            # new node points to old head
            new_node.pointer = self.head
            # overwrite list head with new node
            self.head = new_node
            return
        
        # get the node immediately prior to index
        node = self.get_node(index-1)
        
        # logic to follow
        if node is None:                    # special case, out of range
            print("cannot insert at index {:d}, list only has {:d} items".format(index, self.get_length()))
        elif node.pointer is None:           # special case, inserting as last node
            # ** your comment here **
            node.pointer = new_node
        else:
            # ** your comment here **
            new_node.pointer = node.pointer
            # ** your comment here **
            node.pointer = new_node
            
    def get_length(self):
        '''Return the length of the linked list.
        '''
        # special case, empty list
        if self.head is None:
            return 0
        
        # initialise counter
        length = 1
        node = self.head
        while node.pointer is not None:
            node = node.pointer
            length += 1
        
        # output computed length    
        return length
        
    def get_node(self, index):
        '''Return the node at INDEX, or None if INDEX is out of range.
        '''
        # special case: empty list, there are no nodes to return
        if self.head is None:
            return None
        
        # special case: index = -1, retrieve last node
        if index == -1:
            # begin at head
            node = self.head
            
            # loop through until Null pointer
            while node.pointer is not None:
                node = node.pointer
            return node
        
        # begin at head, use a counter to keep track of index
        node = self.head
        current_index = 0
        
        # loop through to correct index
        while current_index < index:
            node = node.pointer
            if node is None:
                return node
            current_index += 1
        
        # output located node
        return node
        
    def get_value(self, index):
        '''Return the value at INDEX.
        '''
        # get the node at INDEX
        node = self.get_node(index)
        
        # return its value (special case if node is None)
        if node is None:
            return None
        else: 
            return node.value
        
class Node(object):
    '''A class with methods for node object.
    '''
    def __init__(self, value, pointer):
        '''Initialise a new node with VALUE and POINTER
        '''
        self.value = value
        self.pointer = pointer
        
    def __repr__(self):
        '''Print out value of node.
        '''
        return '{}'.format(self.value)
 

#### 2.2.1 Linked list exercises


***Execute the cells below one at a time.***

In [None]:
# 1. Create an empty list
ll = LinkedList()
print(ll)

In [None]:
# 2. Add some values
ll = LinkedList()
ll.append(value=1)      # note: append = "insert value at end of list"
ll.append(value=2)
ll.append(value=3)      
print(ll)

In [None]:
# 3. Add some more values of different 'types'
ll.append(value=-3.5)
ll.append(value=None)
ll.append(value='engsci233 - so great')      
ll.append(value=[1,2,3])                     
print(ll)

In [None]:
# 4. Insert a value into the middle of the list 
ll.insert(index=4, value='middle of list')
print(ll)

In [None]:
# 5. Insert a value at the head of the list
ll.insert(index=0, value='new head')
print(ll)

***The method `append` takes one input, called `value`, whereas the method `insert` takes two inputs, `index` and `value`. Why?***

> <mark>*~ your answer here ~*</mark>

***We can OMIT `index=` and `value=` from the call to `insert` - i.e., `ll.insert(0,'new head')`. How will the method know which is which? (hint: refer to the method definition above).***

> <mark>*~ your answer here ~*</mark>

#### 2.2.2 Visualising the list

The code below visualises a simple linked list. ***Modify the list and rerun the code.***

In [None]:
# imports
%matplotlib inline
from data233 import *

# create a simple linked list with three values (dropping the 'value=' part)
ll = LinkedList()
ll.append(1); ll.append(2); ll.append(3)

# **your modifications here**


# visualize the list
f,ax = plt.subplots(1,1)
f.set_size_inches(12,1)

show_list(ll,ax)


#### 2.2.3 Accessing the list

The `get_node()` method returns a `Node` object. For example, ***execute the commands below*** to obtain the 5th and 6th nodes from the list and inspect their values.

In [None]:
# create a list
ll = LinkedList()
for i in range(22,45,3):
    ll.append(i**2-10*i+5)
    
# get a node
nd = ll.get_node(index=5)

# get the next node along
nd_next = ll.get_node(index=6)

In the cell below, add code to extract **the next list node** from the previous.

In [None]:
# get the SECOND node, assuming we have ONLY the first node
# *hint* use the POINTER attribute
nd_next2 = ___


Execute the code below which demonstrates the `get_node` method.

In [None]:
# imports
%matplotlib inline
from ipywidgets import interact

# create a list
ll = LinkedList()
for i in range(22,45,3):
    ll.append(i**2-10*i+5)

# defining an interactive plot function - don't worry about this
def interactive_access(index=0):
    # specify an index to find
    ll.get_value(index);

    # visualize
    f,ax = plt.subplots(1,1)
    f.set_size_inches(12,1)

    show_list(ll,ax,highlight=index)
    
interact(interactive_access, index=(0,7,1));

In the diagram above, **arrows** are *pointers*, **boxes** are *nodes*, and the **numbers** are *values*.

***Consider the SECOND node (containing 509). Which pointer (arrow) BELONGS to this node - the one pointing TO it or the one pointing FROM it?***

> <mark>*~ your answer here ~*</mark>

***Starting from the ZEROTH (head) node, how many "pointer jumps" are required to reach the SECOND node?***

> <mark>*~ your answer here ~*</mark>

***The $N^{th}$ node?***

> <mark>*~ your answer here ~*</mark>

***- - - - CLASS CODING EXERCISE - - - -***

In [None]:
# PART ONE
# --------
# create an EMPTY linked list and then
# WRITE a for loop to APPEND consecutive 
# integers up to 20

# **your code here**

In [None]:
# PART TWO
# --------
# nd = ll.head   assigns the FIRST node in a linked list to `nd`. 
# `nd` has attributes "value" and "pointer". "pointer" is the NEXT 
# node object in the list. 
#
# WRITE a for loop to iterate over your list from part one above USING 
# the "pointer" attribute to move to each subsequent node. 
# PRINT each node's value.

nd = ll.head     # get the first, head node

# **your code here**

In [None]:
# OPTIONAL CHALLENGE
# ------------------
# Instead of using a FOR loop in PART TWO, use a WHILE loop with the 
# appropriate stopping condition (node.pointer is None). Hence compute 
# the LENGTH of the linked list.

# **your code here**

#### 2.2.4 Inserting an item in the list

Execute the code below to visualise the use of the `insert` method.

In [None]:
# imports
%matplotlib inline

def interactive_insert(index=0):
    # visualize the list
    f,(ax1,ax2) = plt.subplots(2,1)
    f.set_size_inches(12,2.5)

    # create a simple linked list with five values
    ll = LinkedList()
    for i in range(1,6):
        ll.append(i)
    show_list(ll,ax1,label='original list')

    # modify the linked list to insert a new item
    ll.insert(index,'(o-o)')
    show_list(ll, ax2, highlight=index, label='with insertion at position {:d}'.format(index))

interact(interactive_insert, index = (0,5,1));

In the diagram above, set the INDEX slider to 3.

***Which pointer in the ORIGINAL list is SEVERED?***

> <mark>*~ your answer here ~*</mark>

***Which pointers in the MODIFIED list are NEW?***

> <mark>*~ your answer here ~*</mark>

***Which lines of code in the definition of `insert` above achieve the pointer reassignment?***

> <mark>*~ your answer here ~*</mark>


#### 2.2.5 Deleting and popping items (HOMEWORK EXERCISE)

**Popping** refers to removing an item from a list and returning it to the user. For example, if new items are always added at the end, repeatedly popping the first item operates the list as a **queue** (first in, first out). Equally, repeatedly popping the last item operates it as a **stack** (last in, first out).
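
The two popping orders can be illustrated with Python's built-in containers before implementing them on a linked list. This is only a sketch: it uses `collections.deque` and a plain `list`, not the `LinkedList` class from this notebook, and the item names are arbitrary.

```python
from collections import deque

items = ['first', 'second', 'third']   # order of arrival

# queue: pop from the front, so the earliest arrival leaves first
queue = deque(items)
queue_order = [queue.popleft() for _ in range(len(items))]

# stack: pop from the end, so the latest arrival leaves first
stack = list(items)
stack_order = [stack.pop() for _ in range(len(items))]

print(queue_order)
print(stack_order)
```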

In [None]:
# SKETCH the operations for a POP method, the opposite of INSERT, including:
# - LOCATING the item to be popped
# - reassigning pointers so that the linked list BYPASSES the popped item
# - RETURNING the popped item

***Complete the ```pop``` method of the ```LinkedListPop``` class in the cell below.*** 

In [None]:
class LinkedListPop(LinkedList):
    '''Modify the LinkedList class to implement pop and delete methods.
    '''
    def __init__(self):
        '''Initialise an empty list.
        '''
        super(LinkedListPop,self).__init__()
    def pop(self, index):
        '''Delete node at INDEX and return its value.
        '''
        # special case, index == 0 (delete head)
        if index == 0:
            # popped value
            pop = self.head.value
            # set new head as second node
            self.head = self.head.pointer
            return pop
        
        # get the node immediately prior to index
        ___
        
        # logic to follow
        if node is None:                    # special case, out of range
            print("cannot access index {:d}, list only has {:d} items".format(index, self.get_length()))
            return None
        elif node.pointer is None:          # special case, out of range
            print("cannot access index {:d}, list only has {:d} items".format(index, self.get_length()))
            return None
        elif node.pointer.pointer is None:  # special case, deleting last node
            # popped value
            pop = node.pointer.value
            
            # make prior node the last node
            node.pointer = None
        else:
            # popped value
            ___
            
            # set this node's pointer so that it bypasses the deleted node
            ___
        
        return pop
        
    def delete(self,index):
        '''Delete node at INDEX.        
        '''
        # use pop method and discard output
        self.pop(index)
            

***Run the cell below to visualise your `pop` method.***

In [None]:
# imports
%matplotlib inline
from ipywidgets import interact

def interactive_insert_and_delete(insert=0, pop=0):
    # visualize the list
    f,(ax1,ax2,ax3) = plt.subplots(3,1)
    f.set_size_inches(12,3.5)

    # create a simple linked list with five values
    ll = LinkedListPop()
    for i in range(1,6):
        ll.append(i)
    show_list(ll,ax1,label='original list')

    # modify the linked list to insert a new item
    ll.insert(insert,'[o-o]')
    show_list(ll, ax2, highlight=insert,label='with insertion at position {:d}'.format(insert))
    
    # modify the linked list to delete an item
    value = ll.pop(pop)
    show_list(ll,ax3, popped=value, label='after popping item {:d}'.format(pop))

interact(interactive_insert_and_delete, insert = (0,5,1), pop = (0,5,1));


***At which index should insertion occur, and which index should be popped so the list that remains is: `1->2->3->4->[o-o]`?***

> <mark>*~ your answer here ~*</mark>

***The linked list implemented here is singly linked (pointers to next node). What is required for a doubly linked list (pointers to next AND previous nodes)? How will this affect memory requirements?***

> <mark>*~ your answer here ~*</mark>


## 3 Networks

A network is a generalisation of the linked list concept. 

The network comprises a set of nodes connected to other nodes by **arcs**. Each arc is assigned a **weight** (which in a transport network might represent a traffic capacity). Arcs are directional, i.e., they go **from** one node **to** another. Each node can have multiple arcs entering and exiting it.
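
One way to picture these definitions is with nested dictionaries, as in the sketch below. This is an assumption made for illustration only: the node names, weights and helper functions are hypothetical, and the lab asks for a class-based implementation rather than this one.

```python
# hypothetical network: network[from_node][to_node] = arc weight
network = {
    'A': {'B': 4, 'C': 1},
    'B': {'D': 2},
    'C': {'B': 2, 'D': 5},
    'D': {},
}

def arcs_out(network, name):
    '''Return (destination, weight) pairs for arcs leaving NAME.'''
    return sorted(network[name].items())

def arcs_in(network, name):
    '''Return (source, weight) pairs for arcs entering NAME.'''
    return sorted((src, arcs[name]) for src, arcs in network.items() if name in arcs)

out_of_a = arcs_out(network, 'A')
into_b = arcs_in(network, 'B')
print(out_of_a)
print(into_b)
```

Note that the arc from `'A'` to `'B'` does not imply an arc from `'B'` to `'A'`, and that node `'B'` has two arcs entering it, something a singly linked list cannot represent.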

In the first lab, you will be generalising the linked list implementation given above to describe a network.