# Deduplication of TF features

When TF reads a feature, it deduplicates the values, both integers and strings.

If several nodes have the same feature value, they all refer to the same Python object.

The code that does this for node features is near the end of the function `_readDataTf()`
in `tf.core.data`:

```
    seen = {}
    datax = {}
    for n, ms in data.items():
        if ms not in seen:
            seen[ms] = ms
        datax[n] = seen[ms]
    self.data = datax
```

The crucial thing to understand here is that when Python does

```
datax[n] = seen[ms]
```

it does not copy the value of `seen[ms]`: it retrieves the object stored at `seen[ms]`
and binds `datax[n]` to that very same object.
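A quick way to check this claim is the `is` operator, which tests object identity rather than equality. The following is a minimal sketch; the names `first`, `second` and `retrieved` are made up for illustration, and the strings are built at runtime so that they start out as two separate objects.

```python
# Build two equal strings at runtime so that they are distinct objects.
first = "".join(["Dirk", " Roorda"])
second = "".join(["Dirk", " Roorda"])
assert first == second      # equal values
assert first is not second  # but two separate objects

seen = {first: first}

# Looking up with `second` as key finds the entry stored under `first`,
# because dict lookup goes by hash and equality.
# What comes back is the stored object itself, not a copy of it.
retrieved = seen[second]

same_object = retrieved is first
is_lookup_key = retrieved is second
print(same_object, is_lookup_key)  # True False
```

So whichever equal string is used for the lookup, the result is always the one object that was stored first.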

You should not take this on trust, so here it is in action.

In [15]:
import pickle

In [3]:
original1 = {1: "Cody Kingham 999", 2: "Dirk Roorda 888", 3: "Cody Kingham 999", 4: "Dirk Roorda 888"}

In [4]:
for (k, v) in original1.items():
    print(f"{k} = ({id(v)}) {v}")

1 = (4477841648) Cody Kingham 999
2 = (4477845296) Dirk Roorda 888
3 = (4477841648) Cody Kingham 999
4 = (4477845296) Dirk Roorda 888


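Here the strings are already deduplicated: CPython typically stores equal string literals that occur in the same piece of code as a single object.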

In [10]:
originalString = """\
Cody Kingham 999
Dirk Roorda 888
Cody Kingham 999
Dirk Roorda 888"""

In [11]:
original2 = {i + 1: s for (i, s) in enumerate(originalString.split("\n"))}

In [12]:
for (k, v) in original2.items():
    print(f"{k} = ({id(v)}) {v}")

1 = (4484296048) Cody Kingham 999
2 = (4484292784) Dirk Roorda 888
3 = (4484292912) Cody Kingham 999
4 = (4484295984) Dirk Roorda 888


Now we have two copies of each string, because `split()` creates a fresh string object for each line.

Let's deduplicate them.

In [13]:
deduplicated = {}
seen = {}

for (k, v) in original2.items():
    if v not in seen:
        seen[v] = v
    deduplicated[k] = seen[v]

In [14]:
for (k, v) in deduplicated.items():
    print(f"{k} = ({id(v)}) {v}")

1 = (4484296048) Cody Kingham 999
2 = (4484292784) Dirk Roorda 888
3 = (4484296048) Cody Kingham 999
4 = (4484292784) Dirk Roorda 888


Neatly deduplicated.

In TF this is done for all features, both of type `int` and `str`.
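The same pattern can be wrapped in a small reusable function. This is only a sketch: `deduplicate_values` and `count_objects` are hypothetical helpers, not part of TF, and the sketch assumes that all values are hashable (as `int` and `str` are).

```python
def deduplicate_values(data):
    """Return a new dict in which equal values share one object.

    Hypothetical helper for illustration; values must be hashable.
    """
    seen = {}
    # setdefault stores v the first time it is seen
    # and returns the already stored object on every later occurrence
    return {k: seen.setdefault(v, v) for k, v in data.items()}


def count_objects(data):
    """Count the distinct objects (by identity) among the values."""
    return len({id(v) for v in data.values()})


text = "Cody Kingham 999\nDirk Roorda 888\nCody Kingham 999\nDirk Roorda 888"
sample = {i + 1: s for i, s in enumerate(text.split("\n"))}
sample_dedup = deduplicate_values(sample)

print(count_objects(sample), count_objects(sample_dedup))  # 4 2
```

Counting identities instead of printing them makes the effect easy to check on larger data: the number of distinct objects drops to the number of distinct values.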

Let's see whether this property survives pickling.

In [21]:
pickled = pickle.dumps(deduplicated, protocol=4)
unpickled = pickle.loads(pickled)

In [22]:
for (k, v) in unpickled.items():
    print(f"{k} = ({id(v)}) {v}")

1 = (4484833200) Cody Kingham 999
2 = (4484833328) Dirk Roorda 888
3 = (4484833200) Cody Kingham 999
4 = (4484833328) Dirk Roorda 888


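Not the same object identifiers, but the same deduplication: pickle writes each object once and refers back to it when the same object occurs again.

Let's check whether unpickling deduplicates on its own, starting from data that was not deduplicated.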

In [23]:
pickledDup = pickle.dumps(original2, protocol=4)
unpickledDup = pickle.loads(pickledDup)

In [24]:
for (k, v) in unpickledDup.items():
    print(f"{k} = ({id(v)}) {v}")

1 = (4484837936) Cody Kingham 999
2 = (4484838064) Dirk Roorda 888
3 = (4484838576) Cody Kingham 999
4 = (4484836016) Dirk Roorda 888


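No, unpickling does not deduplicate objects for you: it restores the sharing that was present when the data was pickled, but it does not merge objects that merely have equal values.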
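Both findings can be condensed into one self-contained check that uses `is` instead of comparing printed identifiers. This is a sketch with made-up variable names; it also compares the sizes of the two pickles, since an object that is shared needs to be written only once.

```python
import pickle

text = "Cody Kingham 999\nDirk Roorda 888\nCody Kingham 999\nDirk Roorda 888"
duplicated = {i + 1: s for i, s in enumerate(text.split("\n"))}

# Same deduplication pattern as before, written with setdefault.
seen = {}
shared = {k: seen.setdefault(v, v) for k, v in duplicated.items()}

pickled_duplicated = pickle.dumps(duplicated, protocol=4)
pickled_shared = pickle.dumps(shared, protocol=4)

# A shared string is written once and referenced afterwards,
# so the pickle of the deduplicated dict is smaller.
shared_is_smaller = len(pickled_shared) < len(pickled_duplicated)

restored_shared = pickle.loads(pickled_shared)
restored_duplicated = pickle.loads(pickled_duplicated)

# Sharing that existed before pickling survives the round trip ...
sharing_survives = restored_shared[1] is restored_shared[3]
# ... but sharing that did not exist is not created by unpickling.
sharing_created = restored_duplicated[1] is restored_duplicated[3]

print(shared_is_smaller, sharing_survives, sharing_created)  # True True False
```

In other words, deduplication has to happen before the data is pickled if the loaded data is to benefit from it.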