# Even More Python Fundamentals

In [None]:
## Notebook settings 

# multiple lines of output per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

>### Today
>
> - Modules and Packages
>
>
> - The Python Standard Library

## Modules and Packages

Logically, Python modules are groupings of related code, structured so as to facilitate its re-use. 

Physically, modules are `.py` files implementing a set of **functions, classes or variables**, as well as **executable statements**, that can be accessed from other modules by using the `import` command.

The `import` command can be used both to import **the whole code** of a module, using the following syntax:

```python
import module
```

or just **specific attributes** (one or more functions, variables, classes or a combination of these) with the following syntax:

```python
from module import name1, name2, name3
```

For example, to find out our current working directory we can use the function `getcwd()`, available in the `os` module (see below), in two different ways:

#### [Method 1]: importing the whole module

In [None]:
import os
os.getcwd()

'/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks'

Note that all the other functions of the module are available as well:

In [None]:
# os.listdir(path) returns the names of the entries in the folder "path"
print(os.listdir(os.getcwd()))

['11_Sentiment_Analysis.ipynb', '.DS_Store', '7_WordNet.ipynb', '10_WordEmbeddings.ipynb', 'images', '8_Vector_Semantics.ipynb', '0_HelloWorld.ipynb', '3_EvenMoreFundamentals.ipynb', 'stuff', '2_MoreFundamentals.ipynb', '13_Clustering_TopicModelling.ipynb', '4_ScientificProgramming.ipynb', '6_WebScraping_APIs.ipynb', '1_Fundamentals.ipynb', '.ipynb_checkpoints', 'data', '4_RegularExpressions.ipynb', '9_ML.ipynb', '7_Distributions_in_text.ipynb', '5_NLP_pipelines.ipynb', '12_Recommender_Systems.ipynb']


**Important**: use the `dir(module_name)` function to list all the attributes available in a (loaded) module

In [None]:
import numpy
print(dir(numpy)[:10])



In [None]:
# NOTE: without arguments, `dir()` lists all the names (variables, modules, functions etc.) currently defined
foo = 10
print(dir())

['In', 'InteractiveShell', 'Out', '_', '_2', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i2', '_i3', '_i4', '_i5', '_ih', '_ii', '_iii', '_oh', 'exit', 'foo', 'get_ipython', 'numpy', 'os', 'quit']


#### [Method 2]: importing only the needed attributes

In [None]:
from os import getcwd
getcwd()

'/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks'

**Important**: Note that the syntax adopted in the import statement influences the way attributes are called:

- if the whole module is imported, then the attribute is called as **`module.attribute`** (e.g. `os.getcwd()`)


- if only the attribute is imported, then it can be accessed simply as **`attribute`** (e.g. `getcwd()`)

**Import as [local name]**

In [None]:
import numpy as np
np.array([1, 2, 3])

array([1, 2, 3])

#### Re-loading a module

The first time you import a module or a function in a module (i.e. no matter the `import` syntax you adopt), all the code in the module source file is run at once. If another module, or the user in the interactive shell, imports the module a second time, **nothing happens**. The Python interpreter won't load that code twice.

This can be seen clearly by importing the "Zen of Python" twice:

In [None]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [None]:
# the second time nothing happens
import this

This behavior can be bypassed by re-executing a previously imported module with the `reload()` function from `importlib`:

In [None]:
from importlib import reload
reload(this)

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


<module 'this' from '/anaconda3/envs/uva/lib/python3.8/this.py'>
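The same mechanism can be observed on a module of our own. The sketch below is only illustrative: `reload_demo` is a hypothetical throwaway module written to a temporary folder, and the folder is added to `sys.path` just for the duration of the demo. After the source file is changed on disk, a plain `import` has no effect, while `reload()` picks up the new code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical module, created only for this demo
    module_path = Path(tmp_dir) / "reload_demo.py"
    module_path.write_text('GREETING = "hello"\n')

    sys.path.insert(0, tmp_dir)  # make the temporary folder importable
    try:
        import reload_demo
        first = reload_demo.GREETING

        # Change the source on disk
        module_path.write_text('GREETING = "hello again"\n')

        import reload_demo  # nothing happens: the module is already loaded
        after_import = reload_demo.GREETING

        importlib.reload(reload_demo)  # the module source is executed again
        after_reload = reload_demo.GREETING
    finally:
        # Clean up so the demo leaves no trace in the session
        sys.path.remove(tmp_dir)
        sys.modules.pop("reload_demo", None)

print(first, "|", after_import, "|", after_reload)
# hello | hello | hello again
```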

### (Extra) Style Guide for Python Imports

> extracted from: [PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/#imports)

#### Imports should usually be on separate lines

```python
import os
import sys
```

- **don't** do this:

```python
import os, sys
```

- while this is ok:

```python
from subprocess import Popen, PIPE
```

#### Placement

Imports are always put **at the top of the file**, just after any module comments and docstrings, and before module globals and constants. They should be grouped in the following **order** (groups should be separated by blank lines):

1. future-imports
2. standard library imports
3. related third party imports
4. local application/library specific imports

For instance:

```python
from __future__ import division

import os
import sys

import numpy
import matplotlib

import my_measures
```

- Module level "dunders" (i.e. names with two leading and two trailing underscores) such as `__author__` and `__version__`, should be placed after the module docstring but before any import statements (except from `__future__` imports). For instance:

```python
""" This is a sample docstring

It doesn't do anything, really
"""

from __future__ import division

__version__ = '0.0'

__author__ = "Tom Marvolo Riddle"

import os
import sys
```

#### Absolute imports

PEP 8 recommends absolute imports whenever possible: they are usually more readable and tend to behave better (or at least give better error messages). The implicit relative imports that were common in Python 2 turned out to be a bad idea and are no longer available in Python 3.

```python
import mypkg.sibling
from mypkg import sibling
from mypkg.sibling import example
```

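Explicit relative imports look like this (PEP 8 accepts them as an alternative, especially in complex package layouts, but prefer absolute imports when in doubt): 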

```python
from . import sibling
```

#### AVOID wildcard imports

It is not uncommon to bump into code where modules are imported by using wildcards, such as in:

```python
from module import *
```

This practice can be harmful, in that it messes up the namespace (both for the human and the computer). 

While there can be a situation in which this practice can be defended (i.e. to republish an internal interface as part of a public API), in general it should be avoided.
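To see how a wildcard import can silently clobber names, the sketch below runs two wildcard imports inside a scratch dictionary namespace (an illustrative device chosen here to keep the demo contained, not something you would do in real code). Both `math` and `cmath` export a function called `sqrt`, so the second import replaces the first without any warning.

```python
import cmath
import math

# A scratch namespace, so the demo does not pollute our own globals
namespace = {}

exec("from math import *", namespace)
sqrt_after_first_import = namespace["sqrt"]

exec("from cmath import *", namespace)
sqrt_after_second_import = namespace["sqrt"]

print(sqrt_after_first_import is math.sqrt)    # True
print(sqrt_after_second_import is cmath.sqrt)  # True: math.sqrt was silently replaced
print(sqrt_after_second_import(-1))            # 1j, whereas math.sqrt(-1) raises a ValueError
```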

### Modules Location

After receiving an `import` instruction, the Python interpreter searches for the requested module in the following locations:

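- the current working directory (more precisely: the directory containing the script being run, or the current directory in an interactive session)


- (if not found) each directory in the environment variable `PYTHONPATH`


- (if nothing else works) the installation-dependent default path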

Want to know where the source file of a module you've imported is located on your disk? Use its `__file__` attribute:

In [None]:
import os
os.__file__

'/anaconda3/envs/uva/lib/python3.8/os.py'
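The locations listed above end up, in order, in the list `sys.path`, which you can inspect directly. The short sketch below also uses `importlib.util.find_spec()` to locate a module without importing it; the choice of the `csv` module is arbitrary.

```python
import importlib.util
import sys

# sys.path holds the locations searched by the import system, in order
for location in sys.path[:3]:
    print(repr(location))

# find_spec() reports where a module would be loaded from
spec = importlib.util.find_spec("csv")
print(spec.origin)
```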

### Writing Modules

In itself, writing a Python module is a trivial process: simply create a `[module_name].py` and import it in another file (or in the shell) by executing an `import [module_name]` statement. Try the following exercise:

- In your current working directory (`%pwd`), create a text file called `exemplar_module.py`


- Open this file and copy the following source code:

```python 
def schedule_meeting(week_day_number):
    weekdays = ["zondag", "maandag", "dinsdag", "woensdag", "donderdag", "vrijdag", "zaterdag"]
    print("the meeting is scheduled for " + weekdays[week_day_number])
```

- import the function `schedule_meeting()` from this module and call it.

As we've said before, modules can contain **executable statements**. These statements are executed **only the first time** the module is imported, and are executed **also when the module is run as a script** (i.e. from the shell).

When these statements are used to **initialize the module**, this behavior is not problematic (actually, that's why the interpreter works in this way).

This behavior is unwanted when we have portions of code that are intended to be run **solely when the module is run as a script**. The solution is to put these portions of code inside an `if __name__ == "__main__":` statement **at the bottom** of the script.

Try the following exercise:

- In your current working directory (`%pwd`), create a text file called `exemplar_module_main.py`


- Open this file and copy the following source code:

```python 
def schedule_meeting(week_day_number):
    weekdays = ["zondag", "maandag", "dinsdag", "woensdag", "donderdag", "vrijdag", "zaterdag"]
    print("the meeting is scheduled for " + weekdays[week_day_number])

if __name__ == "__main__":
    print("I won't show up when imported!")
```

- if you import the module, no message is printed;

- if you execute the script from the command line instead (`!python exemplar_module_main.py`), the message is printed.

**Explanation**: when the Python interpreter reads a source file, it automatically sets the variable `__name__` to:

- the module name if it is imported

- `"__main__"` if the module is run as the main program file
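
The two cases can be compared side by side. The following is a minimal sketch, assuming a hypothetical one-line module `name_demo.py` written to a temporary folder; it is run once as a script and once imported, each time in a fresh Python interpreter started with `subprocess`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical module that simply reports its own __name__
    script = Path(tmp_dir) / "name_demo.py"
    script.write_text('print("my __name__ is", __name__)\n')

    # 1) run the file as a script
    as_script = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # 2) import the file as a module (from the folder that contains it)
    as_module = subprocess.run(
        [sys.executable, "-c", "import name_demo"],
        cwd=tmp_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()

print(as_script)  # my __name__ is __main__
print(as_module)  # my __name__ is name_demo
```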

### Writing Packages

*"Packages are a way of structuring Python’s module namespace by using “dotted module names”. For example, the module name `A.B` designates a submodule named `B` in a package named `A`"* (source: [documentation](https://docs.python.org/2/tutorial/modules.html#packages)).

Simply put, packages are hierarchical directory structures organizing an environment composed of modules and subpackages. The Python interpreter recognizes a directory as a regular package when it **contains** a file called `__init__.py` (since Python 3.3, directories without it can still be imported as *namespace packages*, but that is a more advanced use case). This file is executed when the package is first imported; its content (e.g. initialization code, declarations of some sort) is up to you, and it may even be empty.

Try the following exercise:

- In your current working directory (`%pwd`), create a folder called `exemplar_package`


- In this subfolder, create an empty text file called `__init__.py`


- In this folder, create another text file called `exemplar_module_main.py`, in which you have to copy the following source code:

```python 
def schedule_meeting(week_day_number):
    weekdays = ["sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"]
    print("the meeting is scheduled for " + weekdays[week_day_number])

if __name__ == "__main__":
    print("I won't show up when imported!")
```

We can import this module and execute its only function in the following way:

```python
import exemplar_package.exemplar_module_main
exemplar_package.exemplar_module_main.schedule_meeting(4)
```

Note that we didn't override the Dutch version of `schedule_meeting` imported from the module in our working directory.

Functions can be imported directly as well (note that in this way we do override the `schedule_meeting` function from the older module):

```python
from exemplar_package.exemplar_module_main import schedule_meeting
schedule_meeting(4)
```

---

## Batteries Included: the Python Standard Library 

One of Python's most cited strengths is the so-called "batteries included" philosophy: in practice, the Python source distribution comes with a rich library of tools. 

### A Quick Tour of (some of) the "batteries" we will depend on

##### OS Interface

The `os` module provides a large set of functions, most of them **platform-independent**, that allow us to interact with the operating system.

Exemplar functions of this module include:

- `os.getcwd()`: Return the current working directory
- `os.chdir(path/to/directory)`: Change the current working directory to `path/to/directory`
- `os.rename(source/path, target/path)`: Rename a file or directory `source/path` as `target/path`
- `os.remove(path/to/file)`: Remove the `path/to/file` file
- `os.makedirs(path/to/folder)`: Create the `path/to/folder` folder
- `os.path.exists(path/to/file)`: Check if the `path/to/file` file exists
- `os.path.isdir(path/to/folder)`: Check if `path/to/folder` is an existing folder
- `os.path.split(a/path)`: Split `a/path` into a pair "(head, tail)", where "tail" is the last component (e.g. a filename) and "head" is the rest of the path
- `os.path.join(head/path, tail/path)`: Join `head/path` and `tail/path` using the path separator of the current OS

But the `os` module can be used also to retrieve system information, to manage processes, to create files and so forth.
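As a quick illustration, the path-related functions listed above can be combined as follows. This is only a sketch: it works inside a temporary folder, and the folder and file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Build a nested folder and a file inside it
    folder = os.path.join(base, "reports", "2021")
    os.makedirs(folder)
    file_path = os.path.join(folder, "notes.txt")
    with open(file_path, "w") as handle:
        handle.write("hello")

    is_folder = os.path.isdir(folder)        # True
    head, tail = os.path.split(file_path)    # (folder, "notes.txt")

    # Rename the file, then delete it
    new_path = os.path.join(folder, "notes_old.txt")
    os.rename(file_path, new_path)
    old_exists = os.path.exists(file_path)   # False: the old name is gone
    os.remove(new_path)
    new_exists = os.path.exists(new_path)    # False: the file was removed

print(is_folder, tail, old_exists, new_exists)
# True notes.txt False False
```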

The Python Standard Library includes higher level alternatives to this module, among which the `tempfile` module to create temporary files, the `shutil` module for high-level file and directory handling (e.g. to copy or move a file) and `glob` to perform directory wildcard searches.


##### [Case 1]: Create a new folder (by overwriting existing folder)

The function `os.makedirs(folder/path)` creates a new directory (together with any missing parent directories), but raises an `OSError` exception if the directory already exists:

In [None]:
folder_name = "tmp_folder"

In [None]:
try:
    os.makedirs(folder_name)
    print("created!")
except OSError as e:
    print (e)

created!


In [None]:
try:
    os.makedirs(folder_name)
    print("created!")
except OSError as e:
    print (e)

[Errno 17] File exists: 'tmp_folder'


The `shutil` module offers the function `rmtree(path/to/folder)` to delete an entire directory, together with its files and subdirectories

**Always double check!**

In [None]:
import shutil

In [None]:
try:
    os.makedirs(folder_name)
    print("created!")
except OSError:
    shutil.rmtree(folder_name)
    os.makedirs(folder_name)
    print("deleted and created!")

deleted and created!


**NOTES:**

- the module `os` offers the function `rmdir()` to delete a directory. Why didn't we use this? Because it only works **when the directory is empty**.


- we could have used the `os.path.exists()` function to check if a directory exists. However, there's a chance the directory can be created between the `os.path.exists()` and the `os.makedirs()` calls, resulting in an OSError.
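
When the goal is simply "make sure the folder is there", an alternative worth knowing is the `exist_ok=True` argument of `os.makedirs()` (available since Python 3.2), which avoids both the exception and the explicit existence check. The sketch below runs inside a temporary folder and also shows the delete-and-recreate pattern used above.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "tmp_folder")

    # With exist_ok=True, creating an existing folder is not an error
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)

    # Put a file in it, then wipe the folder and start from scratch
    with open(os.path.join(folder, "old.txt"), "w") as handle:
        handle.write("stale")
    shutil.rmtree(folder)
    os.makedirs(folder)
    contents = os.listdir(folder)

print(contents)  # []
```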

##### [Case 2]: List files and/or subfolders in a folder

There are at least three functions in the Python Standard Library that allow you to navigate a folder:

**[`os.listdir()`]** returns a list of all the entries (i.e. files, links, subfolders...) in a directory:

In [None]:
# let's see what's in our working directory
os.listdir(os.getcwd())

['tmp_folder',
 '11_Sentiment_Analysis.ipynb',
 '.DS_Store',
 '7_WordNet.ipynb',
 '10_WordEmbeddings.ipynb',
 'images',
 '8_Vector_Semantics.ipynb',
 '0_HelloWorld.ipynb',
 '3_EvenMoreFundamentals.ipynb',
 'stuff',
 '2_MoreFundamentals.ipynb',
 '13_Clustering_TopicModelling.ipynb',
 '4_ScientificProgramming.ipynb',
 '6_WebScraping_APIs.ipynb',
 '1_Fundamentals.ipynb',
 '.ipynb_checkpoints',
 'data',
 '4_RegularExpressions.ipynb',
 '9_ML.ipynb',
 '7_Distributions_in_text.ipynb',
 '5_NLP_pipelines.ipynb',
 '12_Recommender_Systems.ipynb']

In [None]:
# it may make sense to select only the file entries:
for entry in os.listdir(os.getcwd()):
    if os.path.isfile(entry):
        print (entry)

11_Sentiment_Analysis.ipynb
.DS_Store
7_WordNet.ipynb
10_WordEmbeddings.ipynb
8_Vector_Semantics.ipynb
0_HelloWorld.ipynb
3_EvenMoreFundamentals.ipynb
2_MoreFundamentals.ipynb
13_Clustering_TopicModelling.ipynb
4_ScientificProgramming.ipynb
6_WebScraping_APIs.ipynb
1_Fundamentals.ipynb
4_RegularExpressions.ipynb
9_ML.ipynb
7_Distributions_in_text.ipynb
5_NLP_pipelines.ipynb
12_Recommender_Systems.ipynb


**[`os.walk()`]** navigates the directory top-down (or bottom-up, if specified) and returns, for each directory, a 3-tuple `(dirpath, dirnames, filenames)`: `dirpath` is the path to the directory, `dirnames` is a list of the names of the subdirectories, `filenames` is a list of the names of the non-directory files.

In [None]:
for dirpath_infos in os.walk(os.getcwd()):
    print (dirpath_infos)

('/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks', ['tmp_folder', 'images', 'stuff', '.ipynb_checkpoints', 'data'], ['11_Sentiment_Analysis.ipynb', '.DS_Store', '7_WordNet.ipynb', '10_WordEmbeddings.ipynb', '8_Vector_Semantics.ipynb', '0_HelloWorld.ipynb', '3_EvenMoreFundamentals.ipynb', '2_MoreFundamentals.ipynb', '13_Clustering_TopicModelling.ipynb', '4_ScientificProgramming.ipynb', '6_WebScraping_APIs.ipynb', '1_Fundamentals.ipynb', '4_RegularExpressions.ipynb', '9_ML.ipynb', '7_Distributions_in_text.ipynb', '5_NLP_pipelines.ipynb', '12_Recommender_Systems.ipynb'])
('/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/tmp_folder', [], [])
('/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/images', [], ['supervised-classification.png', 'Indent.png', 'weighting-users.png', 'ipython.png', 'unicode.png', 'string-slicing.png', 'pathlen.png', 'notebook-ui.png', 'SA-techniques.png', 'tokens-types.png', 'a
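In practice the 3-tuple is usually unpacked directly in the `for` statement, and `dirpath` is joined with each file name to get a usable path. Here is a small sketch on a made-up folder tree (created in a temporary folder) that collects all the `.csv` files, wherever they are:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Made-up tree: base/readme.txt, base/data/a.csv, base/data/raw/b.csv
    os.makedirs(os.path.join(base, "data", "raw"))
    for relative in ("readme.txt",
                     os.path.join("data", "a.csv"),
                     os.path.join("data", "raw", "b.csv")):
        with open(os.path.join(base, relative), "w") as handle:
            handle.write("")

    csv_files = []
    for dirpath, dirnames, filenames in os.walk(base):
        for filename in filenames:
            if filename.endswith(".csv"):
                full_path = os.path.join(dirpath, filename)
                # store the path relative to the starting folder
                csv_files.append(os.path.relpath(full_path, base))

csv_files.sort()
print(csv_files)
```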

The `glob` module provides a function, **[`glob.glob()`]**, that returns all the directory entries that match a given pattern (without recursing into subdirectories). 

It can be used to find entries that have a given prefix, a given suffix or some key character sequences. 

In [None]:
import glob

The `glob()` function supports two wildcards and character ranges:

- `*`: **zero or more characters** in a segment of a name

In [None]:
# list all the .ipynb files in the current working directory
glob.glob(os.path.join(os.getcwd(), "*.ipynb"))

['/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/11_Sentiment_Analysis.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/7_WordNet.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/10_WordEmbeddings.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/8_Vector_Semantics.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/0_HelloWorld.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/3_EvenMoreFundamentals.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/2_MoreFundamentals.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/13_Clustering_TopicModelling.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/4_ScientificProgramming.ipynb',
 '/Users/giovannicolavizza/Drop

- `?`: any **single character** in a segment of a name

In [None]:
# list all the notebooks whose name starts with a single character followed by an underscore
glob.glob(os.path.join(os.getcwd(), "?_*.ipynb"))

['/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/7_WordNet.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/8_Vector_Semantics.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/0_HelloWorld.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/3_EvenMoreFundamentals.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/2_MoreFundamentals.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/4_ScientificProgramming.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/6_WebScraping_APIs.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/1_Fundamentals.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_projects/Teaching/AUC_TMCI_2021/notebooks/4_RegularExpressions.ipynb',
 '/Users/giovannicolavizza/Dropbox/db_project

- selected characters can be specified by means of **character sets**, i.e. by placing the characters of choice between square brackets (e.g. `[abc]`, or a range such as `[0-9]`)
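
The three kinds of pattern can be compared on a small set of made-up files. The sketch below creates them in a temporary folder; the helper `matching()` is a hypothetical convenience function written for this example, and `glob.escape()` is used only to protect the folder path from being interpreted as a pattern.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Made-up files to search through
    for name in ("1_intro.ipynb", "2_basics.ipynb", "10_extra.ipynb", "notes.txt"):
        with open(os.path.join(base, name), "w") as handle:
            handle.write("")

    def matching(pattern):
        """Return the sorted names of the entries in `base` matching `pattern`."""
        paths = glob.glob(os.path.join(glob.escape(base), pattern))
        return sorted(os.path.basename(path) for path in paths)

    all_notebooks = matching("*.ipynb")            # zero or more characters
    one_char = matching("?_*.ipynb")               # exactly one character before "_"
    two_digits = matching("[0-9][0-9]_*.ipynb")    # two characters, each in the range 0-9
    starts_with_2_or_n = matching("[2n]*")         # first character is "2" or "n"

print(all_notebooks)       # ['10_extra.ipynb', '1_intro.ipynb', '2_basics.ipynb']
print(one_char)            # ['1_intro.ipynb', '2_basics.ipynb']
print(two_digits)          # ['10_extra.ipynb']
print(starts_with_2_or_n)  # ['2_basics.ipynb', 'notes.txt']
```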

##### Command Line Arguments

When scripts are launched from the shell, command line arguments are stored in the `argv` attribute of the [`sys`](https://docs.python.org/2.7/library/sys.html) module. 

`sys.argv` is a list consisting of the script name followed by the command line arguments, in order.

Try the following exercise:

- In your current working directory (`%pwd`), create a text file called `exemplar_module_args.py`.


- Open this file and copy the following source code:

```python 
import sys

print(sys.argv)
```

- launch the following script from the command line: `!python exemplar_module_args.py arg_1 arg_2 arg_3`.

> See the **[getopt](https://docs.python.org/2.7/library/getopt.html)** and the **[argparse](https://docs.python.org/2.7/library/argparse.html)** modules to build more powerful and flexible command line processing tools
>
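To give an idea of what `argparse` looks like, here is a minimal sketch of a parser with one positional argument and two options. The argument names (`weekday`, `--room`, `--verbose`) are invented for illustration, and the arguments are passed as a list to `parse_args()` so that the example also runs inside a notebook; in a real script you would call `parse_args()` without arguments to read `sys.argv`.

```python
import argparse


def build_parser():
    """Build a small command line parser (illustrative arguments only)."""
    parser = argparse.ArgumentParser(description="Schedule a meeting.")
    parser.add_argument("weekday", type=int, help="day of the week as a number (0-6)")
    parser.add_argument("--room", default="A1", help="meeting room (default: A1)")
    parser.add_argument("--verbose", action="store_true", help="print extra details")
    return parser


# Equivalent to running: python some_script.py 4 --room B2
args = build_parser().parse_args(["4", "--room", "B2"])
print(args.weekday, args.room, args.verbose)  # 4 B2 False
```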

---

##### Reading and Writing CSV files

**CSV** (Comma Separated Values) is a **text file** format often used to import and export spreadsheets and databases. 

Notwithstanding its widespread use, the CSV is not a fully standardized file format, but it follows some conventions:

- each spreadsheet row is a record. Most of the time a row corresponds to one line of the input file, but sometimes it can span multiple lines;


- each record consists of fields (you can think of them as column cells), separated by a **delimiter**. This is often a comma “`,`”, but tabs “`\t`”, white spaces “` `” and semicolons “`;`” are widely used;


- sometimes the first line of the text file reports the column **headers**, but this is optional;


- **quote characters** (usually double quotes “`"`”) are used **at least** to handle ambiguous cases (e.g. strings containing the separator character), but many options are available.

CSV files are usually parsed by looping through the input file and splitting each row, while writing to CSV requires looping through an iterable and joining all the fields. The Python `csv` module takes over these tasks by supplying reader and writer objects that can be parameterized at will, as well as some common settings called *dialects*.

In [None]:
import csv

To **read** a CSV file, use the function `reader()` to generate an object that can be used to process the rows of the input file:

In [None]:
# open the file in read ("r") and text mode
with open("data/example_default_setting.csv", "r") as infile:
    scrubs_reader = csv.reader(infile)
    for line in scrubs_reader:
        print(line)

['Perry Cox', 'M.D.', 'John C. McGinley', '45']
['Carla Espinosa-Turk', 'RN', 'Judy Reyes', '36']
['Christopher Turk', 'M.D.', 'Donald Faison', '31']
['Bob Kelso', 'M.D.', 'Ken Jenkins', '70']
['Elliot Reid', 'M.D.', 'Sarah Chalke', '29']
['J.D.', 'M.D.', 'Zach Braff', '31']
['Janitor', 'unknown', 'Neil Flynn', '40']


To ensure flexibility, the `reader()` function accepts a wide range of parameters, among which the **delimiter** character, the character used for **quoting** and the quoting behavior (see the [documentation](https://docs.python.org/3/library/csv.html) for the full list).

In [None]:
with open("data/example_alternative_settings.csv", "r") as infile:
    scrubs_reader = csv.reader(infile, delimiter = '\t', quotechar = "'", quoting = csv.QUOTE_NONNUMERIC)
    header = next(scrubs_reader)  # let's extract the header
    for line in scrubs_reader:
        print (line)

['Perry Cox', 'M.D.', 'John C. McGinley', 45.0]
['Carla Espinosa-Turk', 'RN', 'Judy Reyes', 36.0]
['Christopher Turk', 'M.D.', 'Donald Faison', 31.0]
['Bob Kelso', 'M.D.', 'Ken Jenkins', 70.0]
['Elliot Reid', 'M.D.', 'Sarah Chalke', 29.0]
['J.D.', 'M.D.', 'Zach Braff', 31.0]
['Janitor', 'unknown', 'Neil Flynn', 40.0]


In [None]:
# to extract the header we used the built-in next() function to perform the first step of the iteration
print (header)

['character', 'credentials', 'actor', 'age']


CSV files can be written by using the `writer()` function to create an object for writing, and using `writerow()` to write each single row. For instance, to write the following list of lists:

In [None]:
list2write = [['Perry Cox', 'M.D.', 'John C. McGinley', 45.0],
              ['Carla Espinosa-Turk', 'RN', 'Judy Reyes', 36.0],
              ['Christopher Turk', 'M.D.', 'Donald Faison', 31.0],
              ['Bob Kelso', 'M.D.', 'Ken Jenkins', 70.0],
              ['Elliot Reid', 'M.D.', 'Sarah Chalke', 29.0],
              ['J.D.', 'M.D.', 'Zach Braff', 31.0],
              ['Janitor', 'unknown', 'Neil Flynn', 40.0]]

In [None]:
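# newline="" lets the csv module control the line endings (as recommended by the csv documentation)
with open("data/example_output.csv", "w", newline="") as outfile: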
    scrubs_writer = csv.writer(outfile, quoting = csv.QUOTE_ALL)
    for row in list2write:
        scrubs_writer.writerow(row)    

46

48

50

41

44

35

41
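Reading and writing can be checked together with a round trip: write some rows, read them back, and compare. The sketch below uses a temporary file and a made-up, shortened version of the table above; the semicolon delimiter is an arbitrary choice to show a non-default setting.

```python
import csv
import os
import tempfile

rows = [["character", "credentials", "age"],
        ["Perry Cox", "M.D.", 45],
        ["Carla Espinosa-Turk", "RN", 36]]

with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "scrubs.csv")

    # QUOTE_NONNUMERIC: the writer quotes every non-numeric field...
    with open(path, "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter=";", quoting=csv.QUOTE_NONNUMERIC)
        writer.writerows(rows)

    # ...and the reader converts every unquoted field to a float
    with open(path, "r", newline="") as infile:
        reader = csv.reader(infile, delimiter=";", quoting=csv.QUOTE_NONNUMERIC)
        header = next(reader)
        records = list(reader)

print(header)   # ['character', 'credentials', 'age']
print(records)  # [['Perry Cox', 'M.D.', 45.0], ['Carla Espinosa-Turk', 'RN', 36.0]]
```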

##### Counting

A `Counter` is a container that records how many times a given object is added. As implemented in the `collections` module, this data type is a subclass of `dict` where objects are dictionary keys and counts are values. 

In [None]:
import collections

Counters can be initialized in four different ways (note that the result is identical):

- from a sequence: 

In [None]:
collections.Counter(['t', 'e', 'x', 't', 'm', 'i', 'n', 'i', 'n', 'g'])

Counter({'t': 2, 'e': 1, 'x': 1, 'm': 1, 'i': 2, 'n': 2, 'g': 1})

- from a dictionary containing values and counts: 

In [None]:
collections.Counter({'e': 1, 'g': 1, 'i': 2, 'm': 1, 'n': 2, 't': 2, 'x': 1})

Counter({'e': 1, 'g': 1, 'i': 2, 'm': 1, 'n': 2, 't': 2, 'x': 1})

- using keyword arguments mapping string names to counts: 

In [None]:
collections.Counter(e = 1, g = 1, i = 2, m = 1, n = 2, t = 2, x = 1)

Counter({'e': 1, 'g': 1, 'i': 2, 'm': 1, 'n': 2, 't': 2, 'x': 1})

- by creating an empty counter and populating it via the `update()` method:

In [None]:
tm_counter = collections.Counter()
tm_counter.update("textmining")
print(tm_counter)

Counter({'t': 2, 'i': 2, 'n': 2, 'e': 1, 'x': 1, 'm': 1, 'g': 1})


Counters work like dictionaries, except that they return a **zero count for missing items** (rather than an error):

In [None]:
tm_counter["y"]

0

In addition to the usual dictionary methods, counters support the following three methods:

- The `elements()` method returns an iterator over the keys, repeating each key as many times as its count:

In [None]:
list(tm_counter.elements())

['t', 't', 'e', 'x', 'm', 'i', 'i', 'n', 'n', 'g']

- The `most_common(n)` method returns a sorted list of the most common *n* (key, count) pairs (when *n* is omitted, all the keys are returned)

In [None]:
tm_counter.most_common(3)

[('t', 2), ('i', 2), ('n', 2)]

In [None]:
tm_counter.most_common()

[('t', 2), ('i', 2), ('n', 2), ('e', 1), ('x', 1), ('m', 1), ('g', 1)]

- The `subtract()` method subtracts elements from an iterable or from another mapping:

In [None]:
# from an iterable
tm_counter.subtract("text")
print (tm_counter)

Counter({'i': 2, 'n': 2, 'm': 1, 'g': 1, 't': 0, 'e': 0, 'x': 0})


In [None]:
# from another mapping
tm_counter.subtract(collections.Counter(e = 1, t = 2, x = 1))
print (tm_counter)

Counter({'i': 2, 'n': 2, 'm': 1, 'g': 1, 'e': -1, 'x': -1, 't': -2})
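A typical text mining use of `Counter` is computing word frequencies. The following is a minimal sketch on a made-up sentence, with a deliberately naive tokenization (lowercasing and splitting on whitespace):

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Count the words of the (lowercased) text
word_counts = Counter(text.lower().split())

top_word = word_counts.most_common(1)   # [('the', 3)]
missing = word_counts["cat"]            # 0: no KeyError for unseen words
still_absent = "cat" in word_counts     # False: looking up a word does not add it

# Counts can be incremented later, e.g. when reading a text in chunks
word_counts.update(["fox", "fox"])
fox_count = word_counts["fox"]          # 3

print(top_word, missing, still_absent, fox_count)
# [('the', 3)] 0 False 3
```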


> #### Further Reading:
>
> Other relevant data structures in the `collections` module include `defaultdict` and `OrderedDict`.
>
> The best way to become familiar with the Python Standard Library is to read Doug Hellmann's [Python Module of the Week](https://pymotw.com/3/) series.

---

### Exercise 1.

Browse the functions offered by the `random` module, as described in the [official documentation](https://docs.python.org/3/library/random.html), to look for a way to organize the following students into 2 mutually exclusive groups of size $2 \leq s \leq 3$.

In [None]:
students = ["Katrien", "Daniel", "Bob", "York", "James"]

In [None]:
# your code here

### Exercise 2.

The following code doesn't work. Find the bug and fix it:

```python
from random import choice

outcomes = {'heads':0, 'tails':0}
outcomes_keys = list(outcomes.keys())


for i in range(10000):
    outcomes[random.choices(outcomes_keys)] += 1

print('Heads:', outcomes['heads'])
print('Tails:', outcomes['tails'])
```

In [None]:
# your code here

### Exercise 3.

Read the file `data/adams-hhgttg.txt` and:


- in a separate module, write a function that counts how many times a given word -- independently of its case -- occurs in a text


- import this function and count the occurrences of the words in `adams-hhgttg.txt` (hint: use a `set` to get the distinct words you want to apply the function on. You may filter this list somewhat to speed up the calculations).


- save these frequencies in a csv file called `data/hhgttg-frqs.csv`

In [None]:
# your code here

---