# Python pitfalls for budding machine learning engineers

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2017 Florian Leitner. All rights reserved.

## The `defaultdict`

Quick review of Python's `defaultdict`:

In [1]:
from collections import defaultdict

# store integers mapped to some key
demo = defaultdict(int)

# NB: we did not define a value for 'defined'
demo['defined'] = demo['defined'] + 1
# But: it magically initialized itself!
demo['missing'], demo['defined']

(0, 1)

Practical as this is, it can also become a trap:

In [2]:
'missing' in demo

True

In [3]:
'really missing' in demo

False

This is because fetching the value for `'missing'` (via `demo['missing']`) actually added that key to the mapping!
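
If the goal is only to look at a value without growing the mapping, membership tests and `dict.get` do not trigger the default factory. A minimal sketch (the name `counts` is only illustrative):

```python
from collections import defaultdict

counts = defaultdict(int)
counts['seen'] += 1

# Neither a membership test nor .get() calls the default factory,
# so these lookups leave the mapping untouched.
is_known = 'unseen' in counts
value = counts.get('unseen', 0)
size_after_safe_lookups = len(counts)

# Indexing with [] does call the factory and inserts the key.
_ = counts['unseen']
size_after_indexing = len(counts)
```

After the safe lookups the mapping still holds one key; only the final indexing step adds the second one.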

## Banker's rounding & Notebook display precision

In Python, explicit rounding does not behave like type casting or integer division: explicit rounding uses "Banker's rounding".

Casting a `float` to an `int` chops off the decimals (as does integer division for positive numbers):

In [4]:
int(1.01), int(1.5), int(1.99), int(2.5), int(3.5)

(1, 1, 1, 2, 3)
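
For positive numbers, casting and integer division agree, but they differ for negative ones: `int()` truncates toward zero, while `//` rounds toward negative infinity. A short sketch to make that visible:

```python
import math

# int() drops the fractional part, i.e., it truncates toward zero.
truncated = [int(-1.5), int(1.5)]

# Floor division rounds toward negative infinity instead.
floored = [-3 // 2, 3 // 2]

# math.trunc and math.floor name the two behaviours explicitly.
explicit = [math.trunc(-1.5), math.floor(-1.5)]
```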

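But `round(float)` applies **Banker's rounding**: a value exactly halfway between two integers is rounded to the nearest *even* integer (so `1.5` and `2.5` both become `2`), while all other values round to the nearest integer as usual: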

In [5]:
round(1.01), round(1.5), round(1.99), round(2.5), round(3.5)

(1, 2, 2, 2, 4)

Therefore, if you use **`round(float)`**, you get:

In [6]:
[round(i/2) for i in range(1,5)]

[0, 1, 2, 2]

But if you use **integer division**, you get a *different result*:

In [7]:
[i//2 for i in range(1,5)]

[0, 1, 1, 2]

In case you are surprised by this behaviour: rounding ties to even is the default rounding mode of the [IEEE 754](https://en.wikipedia.org/wiki/Floating_point#Rounding_modes) floating-point standard. Arguably, it is the languages that *don't* exhibit this behaviour that are "wrong".
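
If a calculation needs the "schoolbook" rule, where ties always round away from zero, the standard library's `decimal` module lets you pick the rounding mode explicitly. A minimal sketch; the helper names `round_half_up` and `round_half_even` are made up for this example:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half_up(text):
    """Round a decimal string to an integer; ties go away from zero."""
    return int(Decimal(text).quantize(Decimal('1'), rounding=ROUND_HALF_UP))

def round_half_even(text):
    """Round a decimal string to an integer; ties go to the even neighbour."""
    return int(Decimal(text).quantize(Decimal('1'), rounding=ROUND_HALF_EVEN))

ties = ['0.5', '1.5', '2.5', '3.5']
half_up = [round_half_up(t) for t in ties]      # schoolbook rounding
half_even = [round_half_even(t) for t in ties]  # same rule as round()
```

The inputs are passed as strings so that `Decimal` sees the exact decimal value rather than its binary floating-point approximation.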

With that in mind, Jupyter provides formatting of floating-point numbers, e.g., scientific notation:

In [8]:
%precision %e

'%e'

In [9]:
from math import pi
pi**10

9.364805e+04

Or round to n digits after the decimal point:

In [10]:
%precision 3

'%.3f'

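Be aware that the notebook rounds the binary value that is actually stored, which is rarely an exact tie, so two numbers that both look like ties in decimal can end up rounded in different directions: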

In [11]:
1.0035, 2.0035

(1.004, 2.003)
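
Which way a value like these goes depends on the binary number that is actually stored, which is usually a hair above or below the decimal you typed. A sketch using the standard `decimal` module to make that visible (the `report` dictionary exists only for this example):

```python
from decimal import Decimal

report = {}
for x in (1.0035, 2.0035):
    exact = Decimal(x)          # exact value of the stored binary float
    written = Decimal(repr(x))  # the decimal number as it was typed
    side = 'above' if exact > written else 'below'
    report[x] = (side, format(x, '.3f'))
```

Here `report` records, for each number, whether the stored value lies above or below the apparent tie, together with its three-decimal rendering.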

## `hash() != id()`

The Python `hash` function generates an *almost* unique, "identifying" integer for any hashable Python object.

Calculate the (64 bit) integer hash of a string:

In [12]:
hash('str')

8360721101177055972
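
Two properties of `hash` matter in practice: equal values produce equal hashes, and the result fits into the platform's hash width. The sketch below checks both and maps a token to one of `n_buckets` buckets (a made-up name and size for this example). One caveat: CPython salts `str` hashes per process unless `PYTHONHASHSEED` is set, so the concrete numbers, and therefore the bucket indices, can differ between runs.

```python
import sys

# Width of hash values in bits (64 on typical 64-bit builds).
width = sys.hash_info.width

# Equal values hash equally, even when built in different ways.
a = 'text mining'
b = ' '.join(['text', 'mining'])
equal_hashes = hash(a) == hash(b)

# Hashes are signed integers that fit into `width` bits.
in_range = -2**(width - 1) <= hash(a) < 2**(width - 1)

# Hypothetical bucketing step: Python's % yields a non-negative result
# for a positive modulus, so negative hashes are not a problem.
n_buckets = 2**20
bucket = hash(a) % n_buckets
```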

Note that the built-in `id` function, which Python uses to tell objects apart, is not the same as `hash`. It does not hash anything: `id` returns an integer that is unique among all objects alive at the same time, and in CPython that integer is simply the object's memory address, so its values are unrelated to those of `hash`:

In [13]:
id('str')

4413646752

When hashing features, a small range such as 32 bits might be too small, so be aware that `hash` gives you a 64-bit range ("more buckets").

## Probabilities and underflows

Imagine we have the following three vectors of 100 probability scores:

In [14]:
probabilities = [
    [1e-3]*100, [5e-4]*100, [1e-5]*100
]

To give a possible real-life setting for where these vectors might come from: we are trying to label a document with one of three labels and will assign it the label with the highest joint probability over all features, as a multinomial Naive Bayes classifier does. The three vectors above are then the probabilities for each token and label pair. We simply assume we found the exact same probability for each token given the label, as that does not matter for our purposes.

But due to that simplification, you can see immediately that the document should be assigned the first label, as that will have the highest joint probability ($\prod_i p_i = 0.001^{100}$).

So let's try to calculate those joint probabilities for each of the three labels:

In [15]:
# show us "everything, please!"
%precision %e

'%e'

In [16]:
pow(1e-3, 100)

1.000000e-300

In [17]:
pow(5e-4, 100)

0.000000e+00

In [18]:
pow(1e-5, 100)

0.000000e+00

Whoops! We quickly reached the limit of our processing capabilities: two of the three cases give us a wrong answer (`P=0`).
This problem is known as a (precision) **underflow**.
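
The limits behind this behaviour can be inspected directly. A small sketch, using only the standard library, of where double-precision floats run out of range:

```python
import sys

# Smallest positive *normal* double, about 2.2e-308.
smallest_normal = sys.float_info.min

# Below that, subnormal numbers trade precision for range,
# down to about 4.9e-324.
smallest_subnormal = 5e-324

# Anything smaller is rounded to zero: this is the underflow.
underflowed = smallest_subnormal / 2

# A running product of small probabilities gets there quickly.
product = 1.0
for p in [5e-4] * 100:
    product *= p
```

The running product ends at exactly `0.0`, just like `pow(5e-4, 100)`.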

Now, this is typically remedied with a simple trick:
Work in log-transformed space if working with probabilities.

In [19]:
from math import log, exp

In [20]:
A = sum([log(1e-3)]*100); A

-6.907755e+02

In [21]:
B = sum([log(5e-4)]*100); B

-7.600902e+02

In [22]:
C = sum([log(1e-5)]*100); C

-1.151293e+03

Note, though, that we cannot move back into normal space, as we'd still get a zero result:

In [23]:
exp(C)

0.000000e+00

So how do we calculate the actual probability for each of the three labels after this log transformation? I.e., how do we get correct scores for the three labels that sum to 1? After all, we cannot transform the logs back directly, or we end up with that pesky underflow issue again, as we have just seen.

In [24]:
logP = [A, B, C]
#
# Here is the critical trick: subtract the max. log(P_i) value from all values:
#
normP = [i - max(logP) for i in logP]
#
# Result:
#
normP

[0.000000e+00, -6.931472e+01, -4.605170e+02]

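This means that the normalized `log(P_i)` of the label(s) with the maximum probability is now **zero**, and all other results are shifted accordingly. Subtracting a constant in log space divides every probability by the same factor, so the proportions between the labels are preserved.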

Now we can safely transform the normalized scores back into correct probabilities. First, we reverse the log transformation of the scores (i.e., take the exponent):

In [25]:
expP = list(map(exp, normP)); expP

[1.000000e+00, 7.888609e-31, 1.000000e-200]

(Remark: Even if a probability were to end up being zero now, you can under normal circumstances happily ignore that last underflow, as that label's probability is so infinitesimally small that for all practical concerns it can indeed be considered zero.)

Finally we can calculate the true label probabilities from the proportions we just found:

In [26]:
P = [i / sum(expP) for i in expP]; P

[1.000000e+00, 7.888609e-31, 1.000000e-200]

Note that there is still a tiny margin of floating-point error in this result: taken literally, the three values above do not sum to one, but to slightly more than one.

Here, that error is not even testable, because the two small terms vanish when they are added to 1.0. It might become noticeable if the three label probabilities had been a closer call, though, so usually you will need to compare floating-point numbers against some precision you wish to ensure, rather than testing for exact equality:

In [27]:
sum(P) == 1.0

True
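
The steps above can be collected into one helper. This is a sketch under the same assumptions as the cells above (log-probabilities in, normalized probabilities out); the name `normalize_log_probs` is invented for this example, and `math.isclose` stands in for the precision check just mentioned:

```python
from math import exp, fsum, isclose, log

def normalize_log_probs(log_ps):
    """Turn unnormalized log-scores into probabilities that sum to one."""
    top = max(log_ps)
    # Shift so the largest score becomes 0, i.e., exp() yields 1.0.
    shifted = [exp(lp - top) for lp in log_ps]
    total = fsum(shifted)  # fsum limits accumulated rounding error
    return [s / total for s in shifted]

log_scores = [100 * log(p) for p in (1e-3, 5e-4, 1e-5)]
P = normalize_log_probs(log_scores)

# Compare against a tolerance instead of testing exact equality.
sums_to_one = isclose(fsum(P), 1.0, rel_tol=1e-9)
best_label = P.index(max(P))
```

The tolerance `rel_tol=1e-9` is an arbitrary choice for illustration; pick whatever precision your application actually needs.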

Overall, unless you need correctness beyond this, the outcome of this approach is acceptable in most practical cases. But to conclude: working with floating-point numbers is probably still harder than you think, even now... FWIW, banks therefore typically prefer to use integers only.