## Initialization of the programming environment

In [None]:
#Color printing
from termcolor import colored

#General data operations library
import math
import string
from datetime import datetime
import numpy as np

#The tensorflow library
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
import tensorflow  as tf

#Plotting libraries
import matplotlib as mpl
import matplotlib.pyplot as plt

#Increase plots font size
params = {'legend.fontsize': 'xx-large',
          'figure.figsize': (10, 7),
         'axes.labelsize': 'xx-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'xx-large',
         'ytick.labelsize':'xx-large'}
plt.rcParams.update(params) 

#append path with python modules
import importlib
import sys
sys.path.append("../modules")

#Private functions
import plotting_functions as plf
importlib.reload(plf);

#Hide GPU
#tf.config.set_visible_devices([], 'GPU')

<br/><br/>
<br/><br/>

<h1 align="center">
 Machine learning II
</h1>

<br/><br/>
<br/><br/>
<br/><br/>
<br/><br/>

<h1 align="right">
Artur Kalinowski <br>
University of Warsaw <br>
Faculty of Physics <br>    
</h1>

Storing data as a pair of `x, y` matrices is inefficient when the dataset is large or spread over many files.
TF provides a dedicated class to handle the input stream:
```Python
tf.data.Dataset(variant_tensor)
```

The `tf.data.Dataset` class supports advanced operations on data. These operations are implemented with parallel data processing to increase the throughput of the input stream, measured as the number of examples per second.
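
As a minimal sketch of what this parallelism looks like in practice, the snippet below applies a transformation with `num_parallel_calls` and overlaps data preparation with consumption using `prefetch`. The toy `range` dataset and all variable names are illustrative assumptions, not part of the lecture material.

```python
import tensorflow as tf

# Toy input: the integers 0..7 (illustrative stand-in for real examples)
numbers = tf.data.Dataset.range(8)

# Let tf.data choose the degree of parallelism for the map transformation
squares = numbers.map(lambda v: v * v, num_parallel_calls=tf.data.AUTOTUNE)

# Group into batches and prepare upcoming batches while the current one is consumed
pipeline = squares.batch(4).prefetch(tf.data.AUTOTUNE)

batches = [batch.numpy().tolist() for batch in pipeline]
print(batches)
```

The element order is preserved here because `map` is deterministic by default, even when it runs in parallel.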



A `tf.data.Dataset` object can be created:

* from a tensor or a NumPy array:

```Python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
```

* from a generator function:

```Python
dataset = tf.data.Dataset.from_generator(...)
```

* from a text file, e.g. a CSV file, read line by line:

```Python
dataset = tf.data.TextLineDataset(...)
```

* from files containing data in the dedicated `TFRecord` format:

```Python
dataset = tf.data.TFRecordDataset(["file1.tfrecords", "file2.tfrecords"])
```
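
The text-file variant yields one raw line per element, so parsing is left to the user. The sketch below writes a small CSV file to a temporary directory, skips the header and parses each line with `tf.io.decode_csv`. The file name, the column layout and the helper `parse_line` are hypothetical and chosen only for illustration.

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny CSV file (hypothetical layout: two features and one label)
csv_path = os.path.join(tempfile.mkdtemp(), "points.csv")
with open(csv_path, "w") as f:
    f.write("x1,x2,label\n")
    f.write("0.5,-0.5,0.5\n")
    f.write("1.0,2.0,5.0\n")

def parse_line(line):
    # One float32 default per column, so every field is decoded as float32
    fields = tf.io.decode_csv(line, record_defaults=[0.0, 0.0, 0.0])
    features = tf.stack(fields[:2])
    label = tf.stack(fields[2:])
    return features, label

# skip(1) drops the header line, map(...) turns each text line into (features, label)
csv_dataset = tf.data.TextLineDataset(csv_path).skip(1).map(parse_line)

csv_rows = [(f.numpy().tolist(), l.numpy().tolist()) for f, l in csv_dataset]
print(csv_rows)
```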



Constructing a `tf.data.Dataset` from a tensor (a NumPy array works the same way)

In [None]:
nExamples = 5
nFeatures = 3
epsilon = 0.01
x = tf.random.uniform((nExamples, nFeatures), minval=-1, maxval=1, dtype=tf.float32, name="features")
y = tf.math.reduce_sum(x**2, axis=1)
y = tf.reshape(y, (-1, 1))

print(colored("Features shape:", "blue"), x.shape)
print(colored("Labels shape:", "blue"), y.shape)

dataset = tf.data.Dataset.from_tensor_slices((x, y))
print(dataset)
print(colored("Dataset length:", "blue"), len(dataset))

A `tf.data.Dataset` behaves like a collection: you can iterate over it and easily choose the starting point and the number of elements to process:

In [None]:
print(colored("Iteration over the full dataset", "blue"))
for item in dataset:
    print(item)

print(colored("Iteration over n elements", "blue"))
n = 3
for item in dataset.take(n):
    print(item)

print(colored("Iteration over n elements starting from m", "blue"))
n = 3
m = 2
for item in dataset.skip(m).take(n):
    print(item)    

Constructing a `tf.data.Dataset` from a generator function.

In this case, in addition to the generator function, we need to provide the shape and type of the data it yields:
```Python
dataset = tf.data.Dataset.from_generator(
         generator,                                       # generating function
         output_signature=(                               # shape and type
             (tf.TensorSpec(shape=(3,), dtype=tf.float32),# of the data generated
             tf.TensorSpec(shape=(1,), dtype=tf.int32)))  # by the function
    )
```
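
A generator may never terminate, so TensorFlow cannot tell in advance how many elements such a dataset holds. The sketch below compares the reported cardinality of a dataset built from a list with one built from an endless generator. The generator `endless_generator` and the variable names are hypothetical.

```python
import tensorflow as tf

def endless_generator():
    # Hypothetical generator that never stops yielding
    while True:
        yield 1.0

from_slices = tf.data.Dataset.from_tensor_slices([1, 2, 3])
from_gen = tf.data.Dataset.from_generator(
    endless_generator,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.float32))

# Number of elements is known for a dataset built from a tensor
n_slices = int(from_slices.cardinality().numpy())

# For a generator-based dataset the size is reported as unknown
gen_unknown = bool(from_gen.cardinality().numpy() == tf.data.UNKNOWN_CARDINALITY)

# take(n) bounds the iteration over the otherwise endless dataset
n_taken = sum(1 for _ in from_gen.take(4))
print(n_slices, gen_unknown, n_taken)
```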

In [None]:
#Generator function definition
nFeatures = 3

def points3DGenerator():
    while True:
        x = tf.random.uniform(shape=(nFeatures,), minval=-1, maxval=1, dtype=tf.float32, name="features")
        y = tf.math.reduce_sum(x**2, axis=0)
        y = tf.reshape(y, (1))
        yield x,y

#Dataset from generator
dataset = tf.data.Dataset.from_generator(points3DGenerator,
         output_signature=(
             (tf.TensorSpec(shape=(nFeatures,), dtype=tf.float32, name="features"),
             tf.TensorSpec(shape=(1), dtype=tf.float32, name="labels")))
    )


In [None]:
print(colored("Iteration over n elements starting from m", "blue"))
n = 3
m = 2
for item in dataset.skip(m).take(n):
    print(item) 

Various transformations can be applied to a `tf.data.Dataset`:

```Python
dataset = dataset.repeat(n)   # repeats the data n times;
                              # when the argument is omitted,
                              # the data is repeated indefinitely
```

**Note:** it is not necessary to use `repeat` to get multiple epochs during training. The `model.fit(...)` function
manages multiple passes through the dataset itself.

```Python
dataset = dataset.batch(n)     # groups the elements into batches of size n;
                               # during training the batches are
                               # recognized automatically, and their
                               # size does not need to be supplied
```

```Python
dataset = dataset.skip(m)      # skips the first m examples
```

```Python
dataset = dataset.take(n)      # restricts the dataset to the first n examples
```

```Python
dataset = dataset.skip(m).take(n)      # skips the first m examples, then
                                       # takes the next n examples
```
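
The effect of these transformations, and of the order in which they are chained, is easiest to see on a tiny dataset. The following sketch uses `tf.data.Dataset.range(5)` as a stand-in for real data. The helper `to_list` is a hypothetical convenience function introduced only for this illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # elements 0, 1, 2, 3, 4

def to_list(dataset):
    # Materialise a small dataset as plain Python lists for printing
    return [element.tolist() for element in dataset.as_numpy_iterator()]

repeated = to_list(ds.repeat(2))                          # two passes over the data
batched = to_list(ds.batch(2))                            # last batch is smaller
batched_full = to_list(ds.batch(2, drop_remainder=True))  # incomplete batch dropped
window = to_list(ds.skip(1).take(3))                      # 3 elements starting from index 1

# After batch(...), take(...) and skip(...) count batches, not single examples
batch_then_take = to_list(ds.batch(2).take(1))
take_then_batch = to_list(ds.take(1).batch(2))

print(repeated, batched, batched_full, window, batch_then_take, take_then_batch)
```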

In [None]:
dataset_batched = dataset.batch(2)

#Access a single element (a batch in this case); fetch it once so that
#the features and labels printed below come from the same batch
item = next(iter(dataset_batched))
print(colored("Features shape:", "blue"), item[0].numpy().shape)
print(colored("Labels shape:", "blue"), item[1].numpy().shape)

print(colored("Iteration over n elements starting from m", "blue"))
m = 5
n = 1
for item in dataset_batched.skip(m).take(n):
    print(colored("\tFeatures:","blue"),item[0].numpy())
    print(colored("\tLabels:","blue"),item[1].numpy())

The dataset can also be subjected to a general transformation that changes the content of individual rows:
```Python
dataset_transformed = dataset.map(func) # func is a function that takes a given row and outputs the new one
```

In [None]:
def func(features, label):
    return features**2, label

dataset_transformed = dataset.map(func)

print(colored("Iteration over original dataset", "blue"))
for item in dataset.skip(m).take(n):
    print(colored("\tFeatures:","blue"),item[0].numpy())
    print(colored("\tLabels:","blue"),item[1].numpy())

print(colored("Iteration over transformed dataset", "blue"))
for item in dataset_transformed.skip(m).take(n):
    print(colored("\tFeatures:","blue"),item[0].numpy())
    print(colored("\tLabels:","blue"),item[1].numpy())

**Please:**

* copy the `discGenerator` function from the previous class.
* create a `tf.data.Dataset` of disc images using the generator directly. Please assume a resolution of 256 $\times$ 256.

In [None]:
def discGenerator(res=256):

    from skimage.draw import disk
    while True:
        center = tf.random.uniform([2], minval=0, maxval = res, dtype=tf.int32, name='center')
        radius = tf.random.uniform([1], minval=5, maxval = res//2, dtype=tf.int32, name='radius')        
        shape = (res, res)
        image = np.full(shape, 0)
        yy, xx = disk(center=center.numpy(), radius=radius.numpy()[0], shape=shape)
        image[xx,yy] = 1
        features = tf.concat(values=(center, radius), axis=0 )
        label = tf.constant(image, dtype=tf.int32, name='image')
        label = tf.reshape(label, (res, res, 1))
        yield  features, label

#BEGIN_SOLUTION
res = 256
dataset = tf.data.Dataset.from_generator(discGenerator,
         output_signature=(
             (tf.TensorSpec(shape=(3), dtype=tf.int32),
             tf.TensorSpec(shape=(res,res,1), dtype=tf.int32)))
    )
#END_SOLUTION

item = next(iter(dataset))
print(colored("Features shape:", "blue"), item[0].shape)
print(colored("Labels shape:", "blue"), item[1].shape)

**Please:**

* write a function `reading_benchmark(dataset)` that takes a dataset, iterates over all of its elements, then measures and prints the execution time
* inside the loop over the dataset elements, please insert a short pause:
```Python
time.sleep(1E-10)
```
* please call the function on a dataset with $10^{4}$ elements and note the execution time

In [None]:
import time

def reading_benchmark(dataset):
#BEGIN_SOLUTION 
    start_time = time.perf_counter()
    for sample in dataset:
        # Performing a training step
        time.sleep(1E-10)
    tf.print("Execution time: {:3.2f} s".format(time.perf_counter() - start_time))


reading_benchmark(dataset.take(int(1E4)))    
#END_SOLUTION    
pass

Generating the data anew every time the dataset is iterated over is costly:
it is better to generate the data once and keep it in a cache. The same applies to datasets read from disk and
subjected to costly transformations. A dataset can be cached with the `cache` method, which keeps the elements in memory or, when a file name is given, in a file:
```Python
dataset_cached = dataset.cache()
```
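
A minimal sketch of this behaviour, assuming a hypothetical generator `counting_generator` that records how many elements it has produced: without caching the generator runs on every pass, while a cached dataset runs it only during the first complete pass.

```python
import tensorflow as tf

calls = {"n": 0}

def counting_generator():
    # Hypothetical generator: counts every element it actually produces
    for i in range(3):
        calls["n"] += 1
        yield i

plain = tf.data.Dataset.from_generator(
    counting_generator,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))
cached = plain.cache()

# Two passes over the plain dataset: the generator runs twice
for _ in range(2):
    values_plain = [int(v.numpy()) for v in plain]
calls_plain = calls["n"]

# Two passes over the cached dataset: the generator runs only on the first pass
calls["n"] = 0
for _ in range(2):
    values_cached = [int(v.numpy()) for v in cached]
calls_cached = calls["n"]

print(calls_plain, calls_cached, values_cached)
```

Note that the cache is filled only when the first pass runs to the end of the dataset, so an endless dataset has to be limited with `take(...)` before it is cached.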

**Please:**

* call the `reading_benchmark` function twice on the `dataset_cached` collection.
* is there any difference in execution time?
* if so, where does it come from?

In [None]:
#BEGIN_SOLUTION
dataset_cached = dataset.take(int(1E4)).cache()

print(colored("First pass", "blue"))
reading_benchmark(dataset_cached)
print(colored("Second pass", "blue"))
reading_benchmark(dataset_cached)
#END_SOLUTION
pass

Each element of a `tf.data.Dataset` should contain both the features and the labels, so that the dataset can be passed directly to the function that trains the model:
```Python
model.fit(dataset, ...)  # Provide only tf.Dataset. 
                         # model.fit(...) method decomposes each row into features and labels by itself
```

If the `tf.data.Dataset` comes from a generator, it is best to provide a freshly generated dataset as validation data.
In this situation, you also need to specify the number of examples for both sets:

```Python
model.fit(dataset.batch(batchSize).take(nStepsPerEpoch),
          epochs=nEpochs, 
          validation_data=dataset.batch(batchSize).map(mapFunc).take(100))
```
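
The pattern above can be tried end to end on a toy problem. In the sketch below the generator `toy_generator`, the single-layer model and all sizes are hypothetical and kept small so that the example runs in seconds. It is not the model used in the exercise that follows.

```python
import numpy as np
import tensorflow as tf

def toy_generator():
    # Hypothetical endless generator: the label is the sum of three features
    rng = np.random.default_rng()
    while True:
        x = rng.uniform(-1, 1, size=3).astype(np.float32)
        yield x, np.array([x.sum()], dtype=np.float32)

toy_dataset = tf.data.Dataset.from_generator(
    toy_generator,
    output_signature=(tf.TensorSpec(shape=(3,), dtype=tf.float32),
                      tf.TensorSpec(shape=(1,), dtype=tf.float32)))

toy_model = tf.keras.Sequential([tf.keras.Input(shape=(3,)),
                                 tf.keras.layers.Dense(1)])
toy_model.compile(optimizer="sgd", loss="mse")

# take(...) after batch(...) counts batches, so it fixes the number of steps per epoch
toy_history = toy_model.fit(toy_dataset.batch(8).take(16),
                            epochs=2,
                            validation_data=toy_dataset.batch(8).take(4),
                            verbose=0)
print(sorted(toy_history.history.keys()))
```

Each iteration over `toy_dataset` restarts the generator, so the training and validation pipelines draw independent random examples.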

**Please:**

* train a **minimal** model that calculates the square of the distance of a point from the centre of the coordinate system.
* as training, validation and test sets, please use `tf.data.Dataset` objects filled with `points3DGenerator(...)`.
* use `tf.data.Dataset.map(...)` with a function that transforms the data into the form required by the model.
* adopt the following training parameters:
```Python
nEpochs = 5
nStepsPerEpoch = 4096
batchSize = 32
initial_learning_rate = 5E-2
```
* plot a training history
* print out the model weights in a way that allows interpretation
* calculate the fraction of examples from the test set for which the model result differs from the label by no more than 1%.

**Hint:**
label values can be extracted in the following (suboptimal) manner:
```Python
y = np.array([y.numpy() for x,y in dataset_test.unbatch()])
```

**Note:** training should take about 3 minutes.

Is the result on the test set as expected?

1) Preparation of data from the generator.

In [None]:
##BEGIN_SOLUTION
dataset = tf.data.Dataset.from_generator(points3DGenerator,
         output_signature=(
             (tf.TensorSpec(shape=(nFeatures,), dtype=tf.float32, name="features"),
             tf.TensorSpec(shape=(1), dtype=tf.float32, name="labels")))
    )
##END_SOLUTION
item = next(iter(dataset))
print(colored("Features shape:", "blue"), item[0].shape)
print(colored("Labels shape:", "blue"), item[1].shape)

2) data preparation

* splitting into batches
* modification of data content using `dataset.map(...)`.
* preparing the appropriate number of examples
* caching


In [None]:
#BEGIN_SOLUTION
nStepsPerEpoch = 4096
batchSize = 64
dataset_train = dataset.batch(batchSize).map(func).take(nStepsPerEpoch).cache()
#END_SOLUTION

item = next(iter(dataset_train))
print(colored("Features shape:", "blue"), item[0].shape)
print(colored("Labels shape:", "blue"), item[1].shape)

3) model definition

In [None]:
#BEGIN_SOLUTION
def getModel(inputShape, nNeurons, hiddenActivation="relu", outputActivation="linear", nOutputNeurons=1):
   
    inputs = tf.keras.Input(shape=inputShape, name="features")
    x = inputs
    
    for iLayer, n in enumerate(nNeurons):
        x = tf.keras.layers.Dense(n, activation=hiddenActivation, 
                                  kernel_initializer='glorot_uniform',
                                  bias_initializer=tf.keras.initializers.RandomUniform(minval=-1, maxval=1),
                                  kernel_regularizer=tf.keras.regularizers.L2(l2=0.001),
                                  name="layer_"+str(iLayer))(x)
                
    outputs = tf.keras.layers.Dense(nOutputNeurons, activation=outputActivation, name = "output")(x)   
    model = tf.keras.Model(inputs=inputs, outputs=outputs, name="DNN")
    return model

model = getModel(inputShape=(nFeatures,), nNeurons = np.array([]), 
                 hiddenActivation="relu", 
                 outputActivation="linear", 
                 nOutputNeurons=1)
#END_SOLUTION
model.summary()

4) training with all the standard elements:
* learning rate schedule
* early stopping
* plot of the loss function history
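
A learning rate schedule is a callable, so its values can be inspected before training. The sketch below evaluates an `ExponentialDecay` schedule at the start of consecutive epochs, using the initial rate and steps per epoch from the task above together with the decay settings of the solution cell in this notebook. The names `n_steps_per_epoch`, `schedule` and `rates` are illustrative.

```python
import tensorflow as tf

n_steps_per_epoch = 4096
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5E-2,
    decay_steps=3 * n_steps_per_epoch,
    decay_rate=0.95,
    staircase=True)

# Learning rate at the first step of each of the first 7 epochs
rates = [float(schedule(epoch * n_steps_per_epoch)) for epoch in range(7)]
for epoch, rate in enumerate(rates):
    print("epoch {}: learning rate {:.6f}".format(epoch, rate))
```

With `staircase=True` the rate is multiplied by `decay_rate` only once every `decay_steps` steps, so with these settings it changes once in the first five epochs.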

In [None]:
#BEGIN_SOLUTION
nEpochs = 5

initial_learning_rate = 5E-2
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate,
                decay_steps=nStepsPerEpoch*3,
                decay_rate=0.95,
                staircase=True)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
            loss=tf.keras.losses.MeanAbsolutePercentageError(),
            metrics=[])

early_stop_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, verbose=1, min_delta=1E-3)
callbacks = [early_stop_callback]
    
history = model.fit(dataset_train.skip(batchSize*4),
                    epochs=nEpochs, 
                    validation_data=dataset_train.take(batchSize*4),
                    callbacks=callbacks,
                    verbose=1)

model.evaluate(dataset_train.take(16))
plf.plotTrainHistory(history)

print(colored("Model weights:","blue"))
print(colored("output:","blue"), model.get_layer('output').weights[0].numpy()[:,0])
#END_SOLUTION
pass


5) estimation of model performance on test data

In [None]:
#BEGIN_SOLUTION
dataset_test = dataset.batch(128).map(func).take(16)
y_pred = model.predict(dataset_test)
y = np.array([y.numpy() for x,y in dataset_test.unbatch()])

pull = (y_pred - y)/y
pull = pull.flatten()
threshold = 1E-2

print(colored("Fraction of examples with abs(pull)<0.01:","blue"),"{:3.2f}".format(np.mean(np.abs(pull)<threshold)))
print(colored("Pull standard deviation:","blue"),"{:3.2f}".format(pull.std()))
#END_SOLUTION
pass

**Please:**

* solve the problem of the discrepancy between the results on the training and test sets.
* plot a histogram of the relative difference:

$$
{\huge
\mathrm{pull} = \frac{\mathrm{model} - \mathrm{true}}{\mathrm{true}}
}
$$
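
As a small worked example of this definition, the pull can be computed with NumPy on a handful of made-up values. The arrays `y_true` and `y_model` are hypothetical numbers, not model output.

```python
import numpy as np

# Made-up true values and model responses
y_true = np.array([1.0, 2.0, 4.0, 0.5])
y_model = np.array([1.005, 2.1, 3.98, 0.5])

# Relative difference between the model response and the true value
pull = (y_model - y_true) / y_true

# Fraction of examples where the model is within 1% of the true value
fraction_within_1pct = np.mean(np.abs(pull) < 1E-2)
print(pull, fraction_within_1pct)
```

Here three of the four examples lie within 1% of the true value. The definition is only meaningful when the predictions and the labels refer to the same examples, in the same order.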

In [None]:
#BEGIN_SOLUTION
dataset_test = dataset.batch(128).map(func).take(32).cache()
y_pred = model.predict(dataset_test)
y = np.array([y.numpy() for x,y in dataset_test.unbatch()])

pull = (y_pred - y)/y
pull = pull.flatten()
threshold = 1E-2

fig, axis = plt.subplots(1,1, figsize=(5, 5))
axis.hist(pull, bins=100, range=(-0.01, 0.01), color='b', alpha=0.7, label='Pull');
axis.set_xlabel(r'$\frac{model-true}{true}$')
axis.set_ylabel('Counts')
model.evaluate(dataset_test)
print(colored("Fraction of examples with abs(pull)<0.01:","blue"),"{:3.4f}".format(np.mean(np.abs(pull)<threshold)))
print(colored("Pull standard deviation:","blue"),"{:3.4f}".format(pull.std()))
#END_SOLUTION
pass

## TensorFlow datasets

The TensorFlow environment provides a convenient user interface for accessing public datasets (as do other packages):
[TensorFlow Datasets](https://www.tensorflow.org/datasets).


In [None]:
import tensorflow_datasets as tfds

#Create a dataset builder object
mnist_builder = tfds.builder('mnist')

#Download the dataset as a dictionary of tf.data.Datasets
data_dir = "../data/tensorflow_datasets/"

datasets, ds_info = tfds.load("mnist", 
                              data_dir = data_dir,
                              with_info=True)

#Download the dataset as a tuple of tf.data.Datasets
#datasets, ds_info = tfds.load("mnist", as_supervised=True, with_info=True)

# Load data from disk as tf.data.Datasets
train_dataset, test_dataset = datasets['train'], datasets['test']

# Fetch the first batch of the dataset
item = next(iter(train_dataset.batch(16)))

print(colored("Features shape:", "blue"), item['image'].shape)
print(colored("Labels shape:", "blue"), item['label'].shape)

The `tensorflow_datasets` library provides a useful function to test the performance of loading a dataset:

```Python
tfds.benchmark(train_dataset, # Object that provides the iterator interface.
                batch_size)   # A number used to normalise the number of examples loaded. 
                              # Batching has to be applied to the dataset explicitly.
```

**Please:**

* run the performance test twice on the MNIST set loaded using the `tensorflow_datasets` module for a batch size of `32`.

In [None]:
#BEGIN_SOLUTION
batchSize = 32
train_dataset_batched = train_dataset.batch(batchSize)

tfds.benchmark(train_dataset_batched, batch_size=batchSize)
tfds.benchmark(train_dataset_batched, batch_size=batchSize)
#END_SOLUTION
pass

The `tfds.show_examples(...)` function allows you to quickly display examples from the given set.

**Note:** the function requires a `dataset_info.DatasetInfo` object, such as the `ds_info` returned by `tfds.load(..., with_info=True)`.

In [None]:
fig = tfds.show_examples(train_dataset, ds_info, rows=2, cols=2)

# Homework

**Please:**

* write a function `load_wksf_dataset(filePath)` that loads and preprocesses a collection of text fragments in [Polish](https://drive.google.com/drive/folders/18vDJPEZd2C6_-TualBIhsR5zmbhDA00D?usp=drive_link) from the [Wzbogacony korpus słownika frekwencyjnego polszczyzny współczesnej](https://clarin-pl.eu/dspace/handle/11321/715)
* the function should perform the following steps:
  * loading all files in the directory given as `filePath` into a `tf.data.Dataset` object. 
  * processing the resulting `tf.data.Dataset` object to remove:
    * information about the source of the citation
    * references in the text
    * fragments of type `[/]`. 

* the function should be placed in the `text_functions.py` file.
* run the cell below
 
**Hint:**
* you can use the functions `tf.strings.regex_full_match(...)` and `tf.strings.regex_replace(...)` to filter lines or replace parts of strings
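
A minimal sketch of how these two functions combine with `filter` and `map`, using made-up lines: the `#` prefix marking a header line and both regular expressions are assumptions chosen for illustration and do not describe the actual markup of the corpus files.

```python
import tensorflow as tf

# Made-up input lines; the real corpus format may differ
lines = tf.data.Dataset.from_tensor_slices([
    "# hypothetical header line",
    "Ala ma [/] kota.",
    "Pies [/] szczeka.",
])

def is_text_line(line):
    # Keep only the lines that are NOT fully matched by the header pattern
    return tf.logical_not(tf.strings.regex_full_match(line, "#.*"))

def remove_markers(line):
    # Delete every "[/]" marker together with the whitespace that follows it
    return tf.strings.regex_replace(line, r"\[/\]\s*", "")

cleaned = lines.filter(is_text_line).map(remove_markers)

cleaned_lines = [item.numpy().decode("utf-8") for item in cleaned]
print(cleaned_lines)
```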
  

In [None]:
import text_functions as txtfunc
importlib.reload(txtfunc)

filePath = "../data/wksf/Korpus_surowy/"
dataset = txtfunc.load_wksf_dataset(filePath)

for item in dataset.take(5):
    print(colored("Item:","blue"), end=" ")
    print(item.numpy().decode("utf-8"))
