We need to make sure that:
- The parameters change as expected when testing the training algorithm,
- The training behaves differently depending on whether we set uniformly to True or False.

We first set up a toy example for which we know more or less how the training should behave. Then, we look at the real-world datasets used for testing.

In [1]:
# Preamble to run notebook in context of source package.
import sys
sys.path.insert(0, '../')

def print_params(uniformly, initial, final):
    print("\nuniformly is", uniformly)
    
    for i_machine, f_machine in zip(initial.machines, final.machines):    
        if i_machine.states != []:

            if i_machine.I != f_machine.I:
                print("\tMachine is", i_machine)
                print("\tInitial I", i_machine.I)
                print("\tFinal I  ", f_machine.I, '\n')
                
            if i_machine.T != f_machine.T:
                print("\tMachine is", i_machine)
                print("\tT's are not the same (omitted as it's quite large)")

            if i_machine.F != f_machine.F:
                print("\tMachine is", i_machine)
                print("\tInitial F", i_machine.F)
                print("\tFinal F  ", f_machine.F, '\n')
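
For automated checks it may be handier to get the changed parameters back as data instead of printed text. The following is a minimal sketch, not part of ptype: `changed_params` is a hypothetical helper, and it assumes only what `print_params` above already relies on (machines exposing `states`, `I`, `T` and `F`). The `SimpleNamespace` objects are stand-ins for the real models.

```python
from types import SimpleNamespace


def changed_params(initial, final):
    """Return {machine index: [names of parameters among I, T, F that differ]}."""
    changes = {}
    for idx, (i_machine, f_machine) in enumerate(zip(initial.machines, final.machines)):
        # Mirror print_params: skip machines without states.
        if i_machine.states == []:
            continue
        names = [
            name
            for name in ("I", "T", "F")
            if getattr(i_machine, name) != getattr(f_machine, name)
        ]
        if names:
            changes[idx] = names
    return changes


# Stand-in objects (hypothetical), shaped like the attributes print_params reads.
before = SimpleNamespace(
    machines=[SimpleNamespace(states=["q_0"], I={"q_0": 0.0}, T={}, F={0: -4.0})]
)
after = SimpleNamespace(
    machines=[SimpleNamespace(states=["q_0"], I={"q_0": 0.0}, T={}, F={0: -9.0})]
)

print(changed_params(before, after))  # {0: ['F']}
```

With the real models, the returned dictionary could be compared against an expected value in a test rather than inspected by eye.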

### Toy example

In [2]:
import pandas as pd

types = [
    "integer",
    "string",
    "float",
    "boolean",
    "date-iso-8601",
    "date-eu",
    "date-non-std-subtype",
    "date-non-std",
]

df_trainings, y_trainings = [], []

x = ['aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa']
column = 'string'

df_training = pd.DataFrame(x, dtype='str', columns=[column])
y_training = [key + 1 for key, value in enumerate(types) if value == 'integer']

df_trainings.append(df_training)
y_trainings.append(y_training)

df_training.head()

|   | string |
|---|--------|
| 0 | aaaa   |
| 1 | aaaa   |
| 2 | aaaa   |
| 3 | aaaa   |
| 4 | aaaa   |


In [3]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

from ptype.Ptype import Ptype
from ptype.Trainer import Trainer

uniformly = True
ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, df_trainings, y_trainings)
initial, final, training_error = trainer.train(20, uniformly)
    
print_params(uniformly, initial, final)

[427.29730177460397, 2.079441918315006]
[427.29730177460397, 2.079441918315006, 2.0794419182491954]

uniformly is True
	Machine is <ptype.Machine.Strings object at 0x7fd0183beb00>
	T's are not the same (omitted as it's quite large)
	Machine is <ptype.Machine.Strings object at 0x7fd0183beb00>
	Initial F {0: -1e+150, 1: -4.2626798770413155}
	Final F   {0: -1e+150, 1: -90.02593332795702} 





- Error changes,
- Parameter values change.

In [4]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

from ptype.Ptype import Ptype
from ptype.Trainer import Trainer

uniformly = False
ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, df_trainings, y_trainings)
initial, final, training_error = trainer.train(20, uniformly)

print_params(uniformly, initial, final)

[2.0794419182491954, 2.079441918183399]
[2.0794419182491954, 2.079441918183399, 2.079441918117631]

uniformly is False
	Machine is <ptype.Machine.Strings object at 0x7fd0183c1860>
	Initial F {0: -1e+150, 1: -90.02593332795702}
	Final F   {0: -1e+150, 1: -90.0262827648204} 



- Error changes,
- Parameter values partially change (T is not changed).

This confirms the two checks listed at the top: the error and the parameter values change with training, and the training behaves differently depending on the uniformly parameter. We still need to understand why T does not change when uniformly is set to False.
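
The "error changes" check can be made explicit with a tolerance. The sketch below is an illustration only: `error_changed` is a hypothetical helper, and it assumes that the lists printed during training above are the per-iteration errors (the notebook does not state this). The values are copied from those outputs.

```python
import math


def error_changed(errors, rel_tol=1e-12):
    """True if the last error differs from the first by more than rel_tol (relative)."""
    return not math.isclose(errors[0], errors[-1], rel_tol=rel_tol, abs_tol=0.0)


# Values copied from the toy-example outputs above.
errors_uniform = [427.29730177460397, 2.079441918315006, 2.0794419182491954]
errors_nonuniform = [2.0794419182491954, 2.079441918183399, 2.079441918117631]

print(error_changed(errors_uniform))                   # True
print(error_changed(errors_nonuniform))                # True
print(error_changed(errors_nonuniform, rel_tol=1e-9))  # False
```

The last line shows that the change with uniformly set to False is of the order of 1e-10, so whether it counts as a change depends on the tolerance chosen.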

### Real-world Datasets

When we train on accident2016, auto and data_gov_3397_1, we do not observe any change in the error or the parameter values, regardless of uniformly.

Let us add data_gov_10151_1 into the mix and see what happens.

In [5]:
from tests.test_ptype import get_inputs

dfs, ys = [], []
for dataset_name in ["accident2016", "auto", "data_gov_10151_1", "data_gov_3397_1"]:
    df, y = get_inputs(dataset_name, annotations_file="../annotations/annotations.json", data_folder="../data/")
    dfs.append(df)
    ys.append(y)

In [6]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, dfs, ys)
uniformly = True

initial, final, training_error = trainer.train(20, uniformly=uniformly)
print_params(uniformly, initial, final)

[1846.4669799594435, 1602.0794765644862]
[1846.4669799594435, 1602.0794765644862, 1602.0794765644862]

uniformly is True
	Machine is <ptype.Machine.Integers object at 0x7fd01dde4f28>
	T's are not the same (omitted as it's quite large)
	Machine is <ptype.Machine.Integers object at 0x7fd01dde4f28>
	Initial F {0: -1e+150, 1: -1e+150, 2: -2.3978952727983707}
	Final F   {0: -1e+150, 1: -1e+150, 2: -3.0396526443113636} 

	Machine is <ptype.Machine.Strings object at 0x7fd0397565f8>
	T's are not the same (omitted as it's quite large)
	Machine is <ptype.Machine.Strings object at 0x7fd0397565f8>
	Initial F {0: -1e+150, 1: -4.2626798770413155}
	Final F   {0: -1e+150, 1: -17.469224340580382} 

	Machine is <ptype.Machine.Floats object at 0x7fd039756668>
	T's are not the same (omitted as it's quite large)
	Machine is <ptype.Machine.Floats object at 0x7fd039756668>
	Initial F {0: -1e+150, 1: -1e+150, 2: -1e+150, 3: -2.70805020110221, 4: -2.639057329615259, 5: -2.5649493574615367, 6: -1e+150, 7: -2.70

- Error changes,
- Parameter values change.

This suggests that the other three datasets (accident2016, auto and data_gov_3397_1) on their own were not helping our model to learn better parameter values.

Let us also check whether setting uniformly to False would change the results.

In [7]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, dfs, ys)
uniformly = False

initial, final, training_error = trainer.train(20, uniformly=uniformly)
print_params(uniformly, initial, final)

[1602.0794765644862, 1602.0794765644862]
[1602.0794765644862, 1602.0794765644862, 1602.0794765644862]

uniformly is False


Neither the error nor the parameter values change. The behaviour therefore differs from the run with uniformly set to True, which is a good thing; however, we still need to check whether this is expected. Note that both runs arrive at the same error. Perhaps with uniformly set to True, we are already at an optimal point.

Let us now check whether this also occurs when we use data_gov_10151_1 only.

In [8]:
dfs, ys = [], []
for dataset_name in ["data_gov_10151_1"]:
    df, y = get_inputs(dataset_name, annotations_file="../annotations/annotations.json", data_folder="../data/")
    dfs.append(df)
    ys.append(y)

In [9]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, dfs, ys)
uniformly = True
    
initial, final, training_error = trainer.train(20, uniformly=uniformly, threshold=1e-20)
print_params(uniformly, initial, final)

[246.46697995944356, 2.0794415416798455]
[246.46697995944356, 2.0794415416798455, 2.0794415416798455]

uniformly is True
	Machine is <ptype.Machine.Integers object at 0x7fd02896cd68>
	T's are not the same (omitted as it's quite large)
	Machine is <ptype.Machine.Integers object at 0x7fd02896cd68>
	Initial F {0: -1e+150, 1: -1e+150, 2: -2.3978952727983707}
	Final F   {0: -1e+150, 1: -1e+150, 2: -0.7085893135075093} 

	Machine is <ptype.Machine.Booleans object at 0x7fd02896cc18>
	Initial I {'q_0': -1.0986122886681098, 'q_1': -1e+150, 'q_2': -1e+150, 'q_3': -1e+150, 'q_4': -1e+150, 'q_5': -1e+150, 'q_6': -1e+150, 'q_7': -1e+150, 'q_8': -1.0986122886681098, 'q_9': -1e+150, 'q_10': -1e+150, 'q_11': -1e+150, 'q_12': -1e+150, 'q_13': -1e+150, 'q_14': -1e+150, 'q_15': -1e+150, 'q_16': -1.0986122886681098, 'q_17': -1e+150, 'q_18': -1e+150}
	Final I   {'q_0': -0.7834380936745986, 'q_1': -1e+150, 'q_2': -1e+150, 'q_3': -1e+150, 'q_4': -1e+150, 'q_5': -1e+150, 'q_6': -1e+150, 'q_7': -1e+150, 'q_8':

- Both the error and parameter values change.

In [10]:
# NBVAL_IGNORE_OUTPUT
# to ignore convergence warning

ptype = Ptype(_types=types)
trainer = Trainer(ptype.machines, dfs, ys)
uniformly = False

initial, final, training_error = trainer.train(20, uniformly=uniformly, threshold=1e-20)
print_params(uniformly, initial, final)

[2.0794415416798455, 2.0794415416798455]
[2.0794415416798455, 2.0794415416798455, 2.0794415416798455]

uniformly is False


We observe the same behaviour here: with uniformly set to False, neither the error nor the parameter values change. This may be okay.
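
One numerical observation may help when deciding whether this is okay: the converged error for data_gov_10151_1 is very close to log(8), and the types list has 8 entries. The check below only verifies that arithmetic. Whether the match is meaningful depends on how Trainer defines the error, which this notebook does not show, so treat it as a hint and not as an explanation.

```python
import math

# Converged error copied from the data_gov_10151_1 outputs above.
converged_error = 2.0794415416798455

# Number of entries in the `types` list defined in the toy example.
n_types = 8

print(math.log(n_types))
print(math.isclose(converged_error, math.log(n_types), rel_tol=1e-12))  # True
```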