# P04: dictionaries, sets, and files
- **Concepts**: ordered vs not, hash
- **Python dict, set**: hashable, set, dict, .update(), del
- **Python files**: open, close, with open as fp

## Concepts

### Hashable, immutable

- [**Immutable**](https://en.wikipedia.org/wiki/Immutable_object) variables/objects cannot be changed after they are created.  For instance, in Python, a string cannot be modified; it can only be replaced with a new string.  In contrast, a list can be modified: you can add new items to it without creating a new list object.

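- **Hashable** variables/objects support the calculation of a [hash](https://en.wikipedia.org/wiki/Hash_function) that does not change during their lifetime.  Among the built-in types, these are the immutable ones: integers, floats, strings, booleans, and tuples (provided that every element of the tuple is itself hashable).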
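
As a small illustration of the definition above, the built-in `hash()` function returns an integer for hashable values and raises a `TypeError` for unhashable ones.  This is only a sketch; the variable names are made up for the example.

```python
# Immutable built-in values are hashable: hash() returns an integer.
print(isinstance(hash('panda'), int))      # True
print(hash((1, 2)) == hash((1, 2)))        # True: equal tuples have equal hashes

# A list is mutable, so Python refuses to hash it.
try:
    hash([1, 2])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
print(list_is_hashable)                    # False

# A tuple is hashable only if everything inside it is hashable too.
try:
    hash((1, [2, 3]))
    nested_is_hashable = True
except TypeError:
    nested_is_hashable = False
print(nested_is_hashable)                  # False
```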

### Sets

### Dictionary: Hash-table / mapping

### Files

#### Paths

#### Formats

## Python

### Sets

Sets are *unordered*, *mutable*, *collections*, of *distinct*, *hashable* elements.

In Python sets are like lists in that they:
 - are *mutable* (so you can change them),
 - are *collections* of other objects (so you can iterate over them, get their `len()`, check membership with `in`).

However, they are unlike lists in that:
- they are *unordered* so they cannot be indexed or sliced.
- they may contain only *hashable* items (so they cannot contain things like lists, dictionaries, sets).
- every item they contain must be *distinct*.

Sets are created with `set(vals)` where `vals` is typically a list or some other kind of iterable entity.

In [21]:
x = set(['a', 'b', 3, 4])
print(x)

{3, 'b', 4, 'a'}
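
A few more ways of creating sets, sketched with made-up variable names.  It illustrates the *distinct* and *hashable* requirements described above, and one common pitfall: empty curly braces create a dictionary, not a set.

```python
# Duplicates in the input collapse to a single element.
letters = set(['a', 'b', 'a', 'b', 'c'])
print(len(letters))              # 3

# Curly braces with values inside also build a set...
same_letters = {'a', 'b', 'c'}
print(letters == same_letters)   # True

# ...but empty braces make a dictionary, so use set() for an empty set.
print(type({}))                  # <class 'dict'>
print(type(set()))               # <class 'set'>

# Unhashable items, such as lists, are rejected with a TypeError.
try:
    bad = {[1, 2], [3, 4]}
except TypeError as err:
    print('TypeError:', err)
```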


Sets can be modified with the methods `.add()` or `.remove()`


In [22]:
print(x)
x.add('abs')
x.add(10)
print(x)

x.add(3) # no effect
print(x)

{3, 'b', 4, 'a'}
{3, 4, 10, 'b', 'a', 'abs'}
{3, 4, 10, 'b', 'a', 'abs'}


Note: sets are printed in some order, but that order is arbitrary and may change when the set is changed, so you should not rely on it.

Note: adding an existing item has no effect, since sets have only distinct items.


In [None]:
print(x)
x.remove(5)  # 5 is not in x, so this line raises a KeyError
print(x)     # not reached, because of the error above
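
The sketch below shows how `.remove()` behaves for items that are and are not in a set.  It also uses `.discard()`, a standard set method not covered above, which silently ignores missing items.  The set `nums` is a made-up example, so the set `x` used in these notes is left unchanged.

```python
nums = {3, 4, 10, 'a', 'b', 'abs'}

nums.remove(10)          # 10 is present, so it is removed
print(10 in nums)        # False

try:
    nums.remove(5)       # 5 is absent, so .remove() raises KeyError
except KeyError:
    print('5 was not in the set')

nums.discard(5)          # .discard() does nothing when the item is absent
print(len(nums))         # 5
```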

Set membership can be evaluated with `in`

In [None]:
6 in x

Sets can be iterated over with a `for` loop

In [23]:
for item in x:
    print(item)

3
4
10
b
a
abs


Sets are useful because:
1. they are efficient for keeping track of unique things.
2. they support *set operations* (union, intersection, difference).
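
To make the first point concrete, here is a minimal sketch of using a set to keep track of unique things.  The `visits` list and the other names are hypothetical.

```python
visits = ['ann', 'bob', 'ann', 'cat', 'bob', 'ann']

# Converting to a set keeps one copy of each name.
unique_visitors = set(visits)
print(len(unique_visitors))        # 3

# A set also works as a record of what has been seen so far.
seen = set()
first_visits = []
for name in visits:
    if name not in seen:           # membership checks on sets are fast
        seen.add(name)
        first_visits.append(name)
print(first_visits)                # ['ann', 'bob', 'cat']
```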

#### Set operations

In [26]:
A = set('panda')
B = set('conga')
print(A)
print(B)

{'p', 'd', 'n', 'a'}
{'o', 'c', 'g', 'a', 'n'}


Set union: items in A or B

In [27]:
print(A | B)

{'p', 'o', 'c', 'g', 'a', 'd', 'n'}


Set intersections: items in A and B

In [28]:
print(A & B)

{'n', 'a'}


Set difference: items in A but not B

In [29]:
print(A - B)

{'p', 'd'}


Set symmetric difference: items in either set but *not both* sets.  (items in union but not in intersection)

In [30]:
print(A ^ B)

{'p', 'o', 'd', 'c', 'g'}
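
As a check on the descriptions above, the sketch below confirms that the symmetric difference equals the union minus the intersection, and that each operator has an equivalent named method.  Results are sorted before printing because sets have no reliable order.

```python
A = set('panda')
B = set('conga')

# Symmetric difference: items in the union but not in the intersection.
print((A ^ B) == (A | B) - (A & B))            # True

# Each operator has an equivalent method.
print((A | B) == A.union(B))                   # True
print((A & B) == A.intersection(B))            # True
print((A - B) == A.difference(B))              # True
print((A ^ B) == A.symmetric_difference(B))    # True

# Difference depends on the order of the operands.
print(sorted(A - B))                           # ['d', 'p']
print(sorted(B - A))                           # ['c', 'g', 'o']
```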



### Dictionaries


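Dictionaries are *mutable* *collections* of *key*-*value* pairs.  They are a mapping from *distinct*, *hashable* keys, onto values.  Since Python 3.7 dictionaries remember the order in which keys were inserted, but items are still looked up by key, not by position.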

In Python dictionaries are like lists in that they:
 - are *mutable* (so you can change them),
 - are *collections* of other objects (so you can iterate over them, get their `len()`, check membership with `in`).
 - you can get items with square brackets `[]` (using a key rather than an integer position)

However, they are unlike lists in that:
- items are looked up by key rather than by position, so they cannot be indexed by position or sliced.
- they are *mappings* between *distinct*, *hashable* keys, and values.

Dictionaries are created with `dict()`, or with `{key:value}` notation.

In [53]:
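# start with an empty dictionary, then add key-value pairs one at a time
courses = dict()
courses['CSS2'] = 'Data/Model Python'
courses['CSS1'] = 'Intro Python'
print(courses)

# the same dictionary, written with {key: value} notation
courses = {'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python'}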

{'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python'}


Dictionary elements can be accessed via their keys.

In [54]:
print(courses['CSS1'])

Intro Python


Items can be added to dictionaries by assigning to new keys:

In [55]:
courses['ABB'] = 'Is this a course?'
print(courses)

{'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python', 'ABB': 'Is this a course?'}


Dictionaries can be updated with `.update()`, which will add new keys, and update the values of existing keys.

In [56]:
new_courses = {'ABB': 'this is not a course', 'CSS100':'Analytic Programming'}
courses.update(new_courses)
print(courses)

{'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python', 'ABB': 'this is not a course', 'CSS100': 'Analytic Programming'}


Elements of dictionaries can be deleted with the `del` keyword:

In [57]:
del courses['ABB']
print(courses)

{'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python', 'CSS100': 'Analytic Programming'}


You can check if a key exists in a dictionary with `in`:

In [58]:
print('CSS1' in courses)
print('ABB' in courses)

True
False
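
Two details worth seeing in code: `in` looks only at the keys, and checking with `in` before a lookup avoids the `KeyError` that square brackets raise for a missing key.  The sketch also uses `.get()`, a standard dictionary method not covered above, which returns a fallback value instead of raising an error.  The fallback string is made up for the example.

```python
courses = {'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python'}

# `in` checks the keys, not the values.
print('Intro Python' in courses)            # False
print('Intro Python' in courses.values())   # True

# Guard a lookup so that a missing key does not raise KeyError.
key = 'ABB'
if key in courses:
    title = courses[key]
else:
    title = 'unknown course'
print(title)                                # unknown course

# .get() does the same check-and-fallback in one step.
print(courses.get('ABB', 'unknown course')) # unknown course
```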




#### Keys, Values, Items

You can get (or iterate over) just the keys with `.keys()`.

In [59]:
print(courses.keys())
print('')
for k in courses.keys():
    print(k)

dict_keys(['CSS2', 'CSS1', 'CSS100'])

CSS2
CSS1
CSS100


You can get just the values with `.values()`.

In [60]:
print(courses.values())
print('')
for v in courses.values():
    print(v)

dict_values(['Data/Model Python', 'Intro Python', 'Analytic Programming'])

Data/Model Python
Intro Python
Analytic Programming


You can get key-value pairs (as tuples) with `.items()`.

In [61]:
print(courses.items())
print('')
for pair in courses.items():
    print(pair)

dict_items([('CSS2', 'Data/Model Python'), ('CSS1', 'Intro Python'), ('CSS100', 'Analytic Programming')])

('CSS2', 'Data/Model Python')
('CSS1', 'Intro Python')
('CSS100', 'Analytic Programming')


It is often useful to do assignment unpacking, to unpack the (key,value) tuple into two variables:

In [62]:
for k,v in courses.items():
    print(f'Course number {k} is titled {v}')

Course number CSS2 is titled Data/Model Python
Course number CSS1 is titled Intro Python
Course number CSS100 is titled Analytic Programming


#### Sorting

As you saw above, the order of key-value pairs in a dictionary is determined by when they were added.  Often we want to sort the contents either by the keys, or by the values.   In either case, to get a sorted dictionary, we will end up making a new dictionary by inserting key-value pairs in a sorted order.

**By keys**

In [63]:
sorted_courses = dict()
for k in sorted(courses.keys()):
    sorted_courses[k] = courses[k]

print(courses)
print(sorted_courses)

{'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python', 'CSS100': 'Analytic Programming'}
{'CSS1': 'Intro Python', 'CSS100': 'Analytic Programming', 'CSS2': 'Data/Model Python'}


**By values**

Sorting by values is a bit tricky: we can sort the values, but we have no reliable way to figure out which keys were associated with the sorted values.  Consequently, we have to sort the key-value items.  But doing so requires that we can tell the `sorted` function to use the second element of the pair to sort.  This is all doable, but involves either writing an anonymous function (not hard, we just haven't covered it yet), or importing a function that does the same job for us.

We will show you how to do this using the `itemgetter` function from the `operator` module.

In [64]:
from operator import itemgetter # this imports the itemgetter function
print("before sorting, items in order of insertion:  ")
for item in courses.items():
    print(item)

print("default sorting sorts by first element of pair (key):  ")
for item in sorted(courses.items()):
    print(item)

print("sorting by the second element (index=1) of the pair with key=itemgetter(1):  ")
for item in sorted(courses.items(), key=itemgetter(1)):
    print(item)

before sorting, items in order of insertion:  
('CSS2', 'Data/Model Python')
('CSS1', 'Intro Python')
('CSS100', 'Analytic Programming')
default sorting sorts by first element of pair (key):  
('CSS1', 'Intro Python')
('CSS100', 'Analytic Programming')
('CSS2', 'Data/Model Python')
sorting by the second element (index=1) of the pair with key=itemgetter(1):  
('CSS100', 'Analytic Programming')
('CSS2', 'Data/Model Python')
('CSS1', 'Intro Python')


Making a new sorted dictionary:

In [65]:
courses_by_title = dict()

for number,title in sorted(courses.items(), key=itemgetter(1)):
    courses_by_title[number] = title

print(courses_by_title)

{'CSS100': 'Analytic Programming', 'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python'}


Finally, for a very pithy, advanced syntax, we could do it with a dictionary comprehension:

In [67]:
courses_by_title = {k:v for k,v in sorted(courses.items(), key=itemgetter(1))}
print(courses_by_title)

{'CSS100': 'Analytic Programming', 'CSS2': 'Data/Model Python', 'CSS1': 'Intro Python'}
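
For comparison, here is a sketch of the anonymous-function alternative mentioned earlier.  A `lambda` that returns the second element of each pair gives the same ordering as `itemgetter(1)`.  The `reverse=True` argument to `sorted` is an extra, included only to show sorting in descending order.

```python
from operator import itemgetter

courses = {'CSS2': 'Data/Model Python',
           'CSS1': 'Intro Python',
           'CSS100': 'Analytic Programming'}

# lambda pair: pair[1] takes a (key, value) tuple and returns the value.
by_title_lambda = sorted(courses.items(), key=lambda pair: pair[1])
by_title_getter = sorted(courses.items(), key=itemgetter(1))
print(by_title_lambda == by_title_getter)   # True

# reverse=True sorts from the last title to the first.
titles_descending = {k: v for k, v in
                     sorted(courses.items(), key=itemgetter(1), reverse=True)}
print(list(titles_descending))              # ['CSS1', 'CSS2', 'CSS100']
```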


### Reading files

#### Finding a file

- Files are stored somewhere in your computer system's hard drive or storage area.

- That location is specified via a file **path**.

- A file path encodes the location of the file within the directory structure on the computer.

- *Absolute* file paths encode the location of the file relative to the root, or base directory, of the file system.  So let's say we have a file named `filename.ext` located in `folder2`, which is located inside `evul`, which is inside `Users`, which is in the base (root) directory of the file system.  On a Unix-like machine (such as macOS, Linux, etc.) this location is encoded as `/Users/evul/folder2/filename.ext`, where the slashes (`/`) separate directories or folders.  Windows machines use the backslash (`\`) and start from a drive letter, so that path would look like `C:\Users\evul\folder2\filename.ext`.

- *Relative* file paths encode the location of the file relative to the current location or path, where the program is running.  So if a program has been launched in the folder `/Users/evul/programs/`, then the relative path to `/Users/evul/folder2/filename.ext` would involve going up one directory in the file tree `..`, then down into `folder2` and getting `filename.ext`.  The full relative path would be `../folder2/filename.ext`.  The key part here is that `..` refers to the parent directory.
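
The relationship between the relative and absolute paths in this example can be checked in code.  The sketch below uses the standard `posixpath` module (the Unix-style version of `os.path`) so that it gives the same answer on any operating system; in everyday code you would more likely use `os.path` or `pathlib`.  The folders are the hypothetical ones from the text and do not need to exist.

```python
import posixpath

current_dir = '/Users/evul/programs/'
relative = '../folder2/filename.ext'

# Join the two, then resolve the '..' step up to the parent directory.
absolute = posixpath.normpath(posixpath.join(current_dir, relative))
print(absolute)        # /Users/evul/folder2/filename.ext

# Going the other way: the relative path from current_dir to the file.
back_to_relative = posixpath.relpath(absolute, start=current_dir)
print(back_to_relative)   # ../folder2/filename.ext
```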

#### Opening a file

- Files are accessed via the `open(file, mode)` function, specifying the (relative or absolute) path to the file, and the mode with which you want to open it (`'r'` for read, `'w'` for write, `'a'` for append; there are [more](https://docs.python.org/3/library/functions.html#open)).

- Opening a file creates a file object which can be read from (or written to).  Assuming we are dealing with text files, `file.read()` reads the entire content of the file as one string.  `file.readlines()` reads the entire content of the file as a list of strings, with each element corresponding to one line in the text file.  `file.readline()` reads one line at a time, and is helpful if your file is large, and you do not want to load its entirety into memory.

- When you open a file, your operating system keeps track of the fact that some program is using that file, and it may prevent other changes from being made to that file.  Consequently, it is important that you *close* the file after opening it.  In Python, the easiest way to make sure that this is done without errors is via the `with` keyword, which creates a temporary context in which the file is open, and then closes the file as soon as all the operations that need to be carried out on the file are completed.
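
The three reading methods and the automatic closing can be compared side by side.  This is a sketch only: it first writes a small scratch file (the name `demo.txt` and its contents are made up) into a temporary folder, so that it runs without needing any existing file.

```python
import os
import tempfile

# Create a small scratch file to read from (hypothetical example data).
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as fp:
    fp.write('first line\nsecond line\n')

with open(path, 'r') as fp:
    whole_text = fp.read()         # the entire file as one string

with open(path, 'r') as fp:
    all_lines = fp.readlines()     # a list with one string per line

with open(path, 'r') as fp:
    first = fp.readline()          # one line per call
    second = fp.readline()

print(repr(whole_text))            # 'first line\nsecond line\n'
print(all_lines)                   # ['first line\n', 'second line\n']
print(repr(first), repr(second))   # 'first line\n' 'second line\n'

# After the `with` block ends, the file has been closed for us.
print(fp.closed)                   # True
```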

To avoid overloading you with options and alternatives, we will presume that all text files are read as follows:

In [68]:
with open('../datasets/example.txt', 'r') as fp:
    file_contents = fp.readlines()

print(file_contents)

['This is line 1.\n', 'This is line 2, it contains the following phrase: "Hello!"\n', '"What?" is the first word on line 3.\n', '\n', '(the line above is blank)\n', 'This is the last line of the file.']


This creates the variable `file_contents`, which contains a list, with each element of that list being a line of the file (here, the file is `example.txt`, located in a sibling directory called `datasets`).  Note the escape sequence `'\n'` in the strings: this is the *newline* escape character, which is how we encode line breaks inside strings.
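
Since every line keeps its trailing `'\n'`, a common next step is to strip it off.  The sketch below does this with the string method `.rstrip('\n')`, using a shortened, hand-typed stand-in for `file_contents` rather than the real file.

```python
# A shortened stand-in for the list returned by fp.readlines().
file_contents = ['This is line 1.\n', '\n', 'This is the last line of the file.']

# Remove the trailing newline from each line.
clean_lines = [line.rstrip('\n') for line in file_contents]
print(clean_lines)   # ['This is line 1.', '', 'This is the last line of the file.']

# Blank lines become empty strings, which are easy to filter out.
non_blank = [line for line in clean_lines if line != '']
print(len(non_blank))   # 2
```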