## Writing type-rich code

Python comes with a wealth of built-in types ranging from essential primitives like *numbers* and *strings* to collections such as *dictionaries*, *tuples*, and *lists*. By combining the types that come with the language you can write just about anything, and indeed that's the way many scripts are written. 

However, a general-purpose programming language like Python can only go so far in providing structures that make it easy and convenient to work with your specific problem.

If you are interested in dependency parsing, you often need to store and manipulate the dependencies between the tokens of a sentence, as well as properties of these tokens. 

<img src="depgraph0.png" width="450px">

You can do this with a set of parallel arrays:

#### Parallel array representation

```
id = "s01"
forms = ["John", "loves", ...]
heads = [1, 0, ...]
tags = ["NOUN", "VERB", ...]
```

But if you have more than one sentence, this quickly becomes unwieldy:

#### Instances as variables

```
id_s01 = "s01"
forms_s01 = ["John", "loves", ...]
heads_s01 = [1, 0, ...]
tags_s01 = ["NOUN", "VERB", ...]

id_s02 = "s02"
forms_s02 = ["Jesus", "wept"]
heads_s02 = [1, 0]
tags_s02 = ["NOUN", "VERB"]
```


To keep things manageable we could represent each sentence as a *dictionary* and put all of them in a *list*:



#### Instances as elements of a list

```
[{'id': "s01",
  'forms': ["John", "loves", ...],
  'heads': [1, 0, ...],
  'tags': ["NOUN", "VERB", ...]
  },
  {'id': "s02",
  'forms': ["Jesus", "wept"],
  'heads': [1, 0],
  'tags': ["NOUN", "VERB"]
  }]
```


Note that we no longer need the `_s01` and `_s02` suffixes and can access attributes of different sentences by the same name (e.g. `forms` or `heads`). Because each sentence has its own dictionary, the names do not collide. Another way to say this is that the sentences have private *namespaces*.
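As a minimal sketch of what this buys us, assume the list above is bound to a variable called `sentences` (a name not used in the original) and that the first sentence is cut down to the two tokens actually shown. The same key then works for every sentence:

```python
# Hypothetical variable name `sentences`. The first entry keeps only the two
# tokens shown in the text; the elided rest of that sentence is left out.
sentences = [
    {"id": "s01", "forms": ["John", "loves"]},
    {"id": "s02", "forms": ["Jesus", "wept"]},
]

# The same key, "forms", is looked up in each sentence's own dictionary,
# so the names never collide.
first_forms = [sentence["forms"][0] for sentence in sentences]
print(first_forms)  # ['John', 'Jesus']
```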


The idea of a namespace is pervasive in type-rich programming. In fact, it is common to have user-defined types that do nothing but contain data. Here is a simple example:

In [2]:
class Sentence():

    def __init__(self, sid, forms, heads, tags): 
        self.sid = sid
        self.forms = forms
        self.heads = heads
        self.tags = tags

We make one *instance* of the class in the cell below. This calls the special function `__init__`, which is passed the parameters and performs any necessary initialization.



#### Terminology 

- Instances of classes are called *objects* or *instances*.
- Functions defined in the context of a class are called *methods*. 
- Variables within classes are sometimes referred to as *member variables*.
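- The funny-looking `__init__` method is commonly called a *constructor*. Strictly speaking, it is an initializer: it sets up an object that Python has already created.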

In [3]:
short = Sentence(sid="s02", forms=["Jesus", "wept"], heads=[1, 0], tags=["NOUN", "VERB"])
short

<__main__.Sentence at 0x106ca37b8>

The member variables, which on the *inside* (within the class definition) are referenced using `self` (e.g. `self.forms`), can be accessed on the *outside* using the name of the instance. 

In [4]:
print(short.forms)
print(short.heads)

['Jesus', 'wept']
[1, 0]


### Encapsulation and data consistency

So far, custom types have at least one advantage over dictionaries: Python checks your code for (some types of) spelling mistakes. For instance, Python will raise an error if you misspell `forms` as `forns` when you access a member variable. In languages like C and Java, custom types are also much faster than dictionaries. A custom type limits your freedom to manipulate the data by *encapsulating* it. This is a good thing, because it prevents many errors.
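The spelling check can be sketched with a cut-down stand-in for the `Sentence` class. The class name `TinySentence` and the variable `caught` are ours, chosen only for illustration:

```python
class TinySentence:
    """Cut-down, hypothetical stand-in for the Sentence class above."""

    def __init__(self, forms):
        self.forms = forms


short = TinySentence(forms=["Jesus", "wept"])

try:
    short.forns  # misspelled on purpose
except AttributeError as error:
    # Reading an attribute that does not exist fails immediately.
    caught = type(error).__name__
    print(caught, "-", error)
```

Note that this check only covers *reading* an attribute. Assigning to a misspelled name, as in `short.forns = [...]`, silently creates a new attribute on a plain class like this one.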

Encapsulation enables you to go further in ensuring that your data is consistent. One useful pattern is to check for consistency in the initializer and fail to construct the object if the data is invalid. This obsession with errors may seem strange, since we want our programs to run without errors. But errors will always find a way to surface, and it is much better to catch an error as early as possible and as close as possible to the point where it was introduced, since that makes it far cheaper to find.

Furthermore, if you only manipulate the data through methods provided by the class, you can even maintain the consistency throughout the lifetime of an object through a *class invariant*.

In [5]:
class ConsistentSentence(object): 

    def __init__(self, sid, forms, heads, tags): 
        assert len(forms) == len(heads), "Forms and heads should be same length"
        assert len(heads) == len(tags), "Heads and tags should be same length"
        
        self.sid = sid
        self.forms = forms
        self.heads = heads
        self.tags = tags


In [6]:
short = ConsistentSentence(sid="s02", forms=["Jesus", "wept"], heads=[1, 0, 0], tags=["NOUN", "VERB"])
short

AssertionError: Forms and heads should be same length
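The class invariant mentioned above can be sketched as follows. This is one possible design, not the only one: the class name `GrowableSentence` and the methods `add_token` and `n_tokens` are hypothetical. It raises `ValueError` rather than using `assert`, because assert statements are skipped when Python runs with the `-O` flag.

```python
class GrowableSentence:
    """Hypothetical variant that keeps forms, heads and tags the same length."""

    def __init__(self, sid, forms, heads, tags):
        # Check consistency up front and refuse to build an invalid object.
        if not (len(forms) == len(heads) == len(tags)):
            raise ValueError("forms, heads and tags should be the same length")
        self.sid = sid
        # Copy the inputs so later changes to the caller's lists
        # cannot break the invariant behind our back.
        self._forms = list(forms)
        self._heads = list(heads)
        self._tags = list(tags)

    def add_token(self, form, head, tag):
        # All three lists grow together, so the invariant holds after every call.
        self._forms.append(form)
        self._heads.append(head)
        self._tags.append(tag)

    def n_tokens(self):
        return len(self._forms)


sentence = GrowableSentence("s02", [], [], [])
sentence.add_token("Jesus", 1, "NOUN")
sentence.add_token("wept", 0, "VERB")
print(sentence.n_tokens())  # 2
```

The leading underscore in `_forms` is only a convention signalling that outside code should go through the methods. Python does not enforce it.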

### Methods 

In [7]:
class SentenceWithMethods(object): 

    def __init__(self, sid, forms, heads, tags): 
        self.sid = sid
        self.forms = forms
        self.heads = heads
        self.tags = tags
        
    def number_of_chars(self):
        n_chars = sum(len(form) for form in self.forms) + len(self.forms) - 1
        return max(n_chars, 0)
    
    def return_myself(self):
        return self

In [8]:
short = SentenceWithMethods(sid="s02", forms=["Jesus", "wept"], heads=[1, 0], tags=["NOUN", "VERB"])
print(short.number_of_chars())
print(short.return_myself())
print(short)

10
<__main__.SentenceWithMethods object at 0x10734c7b8>
<__main__.SentenceWithMethods object at 0x10734c7b8>


## Modeling choices

When do you need to define custom types and what should they be?

- In machine learning you have *classifiers*, *preprocessing steps*, and *datasets*.

- Information retrieval deals with *documents* and has different *ranking strategies*.

- Physics simulations model *particles*. 

- A chess game has a *board* and a *game strategy*. 

## Duck-typing


The type system in Python is very lax in the sense that it rarely if ever insists on a specific type. What matters is that the type can do what it is asked to do, e.g. that it implements a method with a particular name.

In other words, **if it walks like a duck and quacks like a duck, it is a duck.**


<img src="duck_and_owl.jpg" width="300px"/>
<img src="lokkeand.jpg" width="300px"/>

For Python it doesn't matter whether the type is called `NaiveStrategy` or `MeanStrategy` as long as it implements, say, the `.get_next_move()` method.
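A minimal sketch of the idea follows. Only the two class names and `get_next_move` come from the text. The `legal_moves` parameter, the method bodies, and the `play_turn` helper are assumptions made up for illustration:

```python
class NaiveStrategy:
    def get_next_move(self, legal_moves):
        # Always pick the first legal move.
        return legal_moves[0]


class MeanStrategy:
    def get_next_move(self, legal_moves):
        # Always pick the last legal move (an arbitrary choice for this sketch).
        return legal_moves[-1]


def play_turn(strategy, legal_moves):
    # No type check here: anything with a get_next_move method will do.
    return strategy.get_next_move(legal_moves)


moves = ["e4", "d4", "Nf3"]
chosen = [play_turn(s, moves) for s in (NaiveStrategy(), MeanStrategy())]
print(chosen)  # ['e4', 'Nf3']
```

The two classes share no base class. `play_turn` works with both simply because each one provides the method it calls.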

Python uses this a lot. For instance, the `len` function can return the length of objects of many different types, including types that you define. 

In [26]:
import numpy as np
x = [1, 2, 3, 4]
y = {4, 5, 6}
z = np.array(x)

In [27]:
[len(x), len(y), len(z)]

[4, 3, 4]
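To make `len` work on a type you define, the type needs a `__len__` method. Here is a small sketch; the class name `CountedSentence` is hypothetical, and defining the length as the number of tokens is our choice:

```python
class CountedSentence:
    """Hypothetical minimal class; only the forms are stored."""

    def __init__(self, forms):
        self.forms = forms

    def __len__(self):
        # len(obj) calls this method; here the length is the number of tokens.
        return len(self.forms)


short = CountedSentence(["Jesus", "wept"])
print(len(short))  # 2
```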