# What's New for TensorFlow 2.9

https://blog.tensorflow.org/2022/05/whats-new-in-tensorflow-29.html

TensorFlow 2.9 has been released! Highlights include performance improvements with oneDNN and the release of DTensor, a new API for model distribution that can be used to seamlessly move from data parallelism to model parallelism.

The release also brings improvements to the core library, including Eigen and tf.function unification, deterministic behavior, and new support for Windows' WSL2.

The oneDNN improvements are applicable to all Linux x86 packages and for CPUs with neural-network-focused hardware features found on 2nd Gen Intel Xeon Scalable processors and newer CPUs. Intel calls this performance optimization “software AI acceleration” and says it can make a measurable impact in certain cases. 

## Improved CPU performance: oneDNN by default

oneDNN is a performance library integrated with TensorFlow for Intel CPUs. TensorFlow has had experimental support for oneDNN since version 2.5, and it could provide up to a 4x performance improvement. In TensorFlow 2.9, oneDNN optimizations are turned on by default on Linux x86 packages and for CPUs with neural-network-focused hardware features such as AVX512_VNNI, AVX512_BF16, AMX, and others, which are found on Intel Cascade Lake and newer CPUs.

Users running TensorFlow with oneDNN optimizations enabled might observe slightly different numerical results from when the optimizations are off. This is because floating-point round-off approaches and order differ, and can create slight errors. If this causes issues for you, turn the optimizations off by setting TF_ENABLE_ONEDNN_OPTS=0 before running your TensorFlow programs. To enable or re-enable them, set TF_ENABLE_ONEDNN_OPTS=1 before running your TensorFlow program. To verify that the optimizations are on, look for a message beginning with "oneDNN custom operations are on" in your program log. 
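The round-off effect described above can be reproduced without TensorFlow at all. The following is a toy illustration in plain Python, not a reproduction of what oneDNN kernels do: it only shows that summing the same floating-point values in a different order can change the last bits of the result, which is why comparisons should use a tolerance rather than exact equality.

```python
import math

a, b, c = 0.1, 0.2, 0.3

# Same three values, two different summation orders.
left_to_right = (a + b) + c
right_to_left = a + (b + c)

# Exact equality fails because each order rounds differently.
print(left_to_right == right_to_left)

# A tolerance-based comparison treats the two results as equal.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-9))
```

If a test suite compares model outputs with exact equality, switching it to a tolerance-based check is usually a smaller change than disabling the optimizations.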


To enable oneDNN optimizations:

`export TF_ENABLE_ONEDNN_OPTS=1`

To disable oneDNN optimizations:

`export TF_ENABLE_ONEDNN_OPTS=0`
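The same variable can be set from Python instead of the shell. The sketch below uses a hypothetical helper, `set_onednn_opts`, which is not part of TensorFlow. It assumes the variable needs to be in place before TensorFlow is imported, which matches the guidance to set it before running the program, so it warns if `tensorflow` is already loaded.

```python
import os
import sys
import warnings


def set_onednn_opts(enabled: bool) -> str:
    """Set TF_ENABLE_ONEDNN_OPTS for this process and return the value written."""
    if "tensorflow" in sys.modules:
        # Assumption: the flag may already have been read, so this may have no effect.
        warnings.warn("tensorflow is already imported; restart the interpreter first")
    value = "1" if enabled else "0"
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = value
    return value


# Call this before `import tensorflow`.
print(set_onednn_opts(False))
```

In a notebook, this means restarting the kernel whenever the setting changes.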




### Example Implementation

Run the cell below and inspect its output. To compare settings, restart the session (Python kernel), change the environment variables, and run the cell again.

In [1]:
import os

SEED_INIT=7

os.environ["TF_ENABLE_ONEDNN_OPTS"]= "1"       #turn on oneDNN Intel optimizations (already on by default as of 2.9); use "0" to turn them off
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"      #hide all GPUs from TensorFlow
os.environ['TF_XLA_FLAGS']         = "--tf_xla_enable_xla_devices"
                                               #XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
os.environ["XLA_FLAGS"]            = "--xla_gpu_cuda_data_dir=/usr/local/cuda-11.8/lib64/libcudart.so.11.0"

    
import tensorflow as tf
tf.random.set_seed(SEED_INIT)

print("Version ",tf.__version__)

tf.config.set_soft_device_placement(False)      #True lets TensorFlow fall back to another device if the requested one is unavailable
tf.debugging.set_log_device_placement(False)    #True to log which device each op runs on
gpus = tf.config.experimental.list_physical_devices('GPU')
print("Num Physical GPU's Available: {} ".format(len(tf.config.experimental.list_physical_devices('GPU'))))
print("Num Logical  GPU's Available: {} ".format(len(tf.config.experimental.list_logical_devices('GPU'))))
print("Num CPU's Available: {} ".format(len(tf.config.experimental.list_physical_devices('CPU'))))

with tf.device("/job:localhost/replica:0/task:0/device:CPU:0"):
    #sample Model
    model = tf.keras.Sequential([
          tf.keras.layers.InputLayer(input_shape=(28, 28)),
          tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
          tf.keras.layers.Conv2D(32, 3, activation='relu'),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(10)
      ])

print(model.summary())

2022-12-16 17:50:56.567951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


Version  2.4.1
Num Physical GPU's Available: 0 
Num Logical  GPU's Available: 0 
Num CPU's Available: 1 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
reshape (Reshape)            (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
flatten (Flatten)            (None, 21632)             0         
_________________________________________________________________
dense (Dense)                (None, 128)               2769024   
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 2,770,634
Trainable params: 2,770,634
Non-trainable params: 0
_________________________________________________________________
None
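The parameter counts in the summary above can be checked by hand. The sketch below recomputes them in plain Python, assuming the Keras defaults used by the model (a 3x3 kernel with 'valid' padding and stride 1, and a bias term in every layer). The helper names are illustrative only.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


conv_out = 28 - 3 + 1                  # 'valid' padding, stride 1 -> 26
flat_units = conv_out * conv_out * 32  # Flatten output -> 21632

conv = conv2d_params(3, 3, 1, 32)      # conv2d
hidden = dense_params(flat_units, 128) # dense
logits = dense_params(128, 10)         # dense_1

print(conv, hidden, logits, conv + hidden + logits)
```

Nearly all of the parameters sit in the first Dense layer, because it connects every one of the 21632 flattened activations to 128 units.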


2022-12-16 17:50:57.852238: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-12-16 17:50:57.852291: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-12-16 17:50:57.852310: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (codespaces-0a9740): /proc/driver/nvidia/version does not exist
2022-12-16 17:50:57.853091: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-16 17:50:57.859780: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2793435000 Hz
2022-12-16 17:50:57.860518: I tensorflow/c
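Note that the captured log above, from a session reporting version 2.4.1, contains a `cpu_feature_guard` line that mentions oneDNN but not the "oneDNN custom operations are on" message used to confirm that the optimizations are active. A minimal sketch of a log check follows. The helper name `onednn_opts_reported` is hypothetical, and the sample text is copied from the log above.

```python
ONEDNN_MESSAGE = "oneDNN custom operations are on"


def onednn_opts_reported(log_text: str) -> bool:
    """Return True if any log line contains the oneDNN confirmation message."""
    return any(ONEDNN_MESSAGE in line for line in log_text.splitlines())


# Two lines taken from the captured output above.
captured = (
    "2022-12-16 17:50:57.852291: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] "
    "failed call to cuInit: UNKNOWN ERROR (303)\n"
    "2022-12-16 17:50:57.853091: I tensorflow/core/platform/cpu_feature_guard.cc:142] "
    "This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) "
    "to use the following CPU instructions in performance-critical operations:  "
    "SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA\n"
)

print(onednn_opts_reported(captured))
```

The check matches anywhere in the line because TensorFlow log lines carry a timestamp and source-file prefix before the message text.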

## Model parallelism with DTensor

DTensor is a new TensorFlow API for distributed model processing that allows models to seamlessly move from data parallelism to single program multiple data (SPMD) based model parallelism, including spatial partitioning. This means you have tools to easily train models where the model weights or inputs are so large they don’t fit on a single device. (If you are familiar with Mesh TensorFlow in TF1, DTensor serves a similar purpose.)
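To make the two modes concrete, here is a conceptual sketch in plain Python. It does not use the DTensor API, and the two "devices" are simulated with list slices. It shows that splitting the batch (data parallelism) and splitting the weight matrix (model parallelism) both reproduce the result of the unsharded matrix multiply.

```python
def matmul(x, w):
    """Plain-Python matrix multiply for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # batch of 4, 2 features
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]      # 2 x 4 weight matrix

full = matmul(x, w)

# Data parallelism: each device gets half the batch and a full copy of the weights.
data_parallel = matmul(x[:2], w) + matmul(x[2:], w)

# Model parallelism: each device gets the full batch and half the weight columns.
w_left = [row[:2] for row in w]
w_right = [row[2:] for row in w]
model_parallel = [l + r for l, r in zip(matmul(x, w_left), matmul(x, w_right))]

print(full == data_parallel == model_parallel)
```

In the model-parallel case no single device holds the whole weight matrix, which is the property that matters once the weights no longer fit on one device.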

DTensor is designed with the following principles at its core:

+ A device-agnostic API: This allows the same model code to be used on CPU, GPU, or TPU, including models partitioned across device types.
+ Multi-client execution: Removes the coordinator and leaves each task to drive its locally attached devices, allowing scaling a model with no impact to startup time.
+ A global perspective vs. per-replica: Traditionally with TensorFlow, distributed model code is written around replicas, but with DTensor, model code is written from the global perspective and per-replica code is generated and run by the DTensor runtime. Among other things, this means no uncertainty about whether batch normalization is happening at the global level or the per-replica level.
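The batch normalization point can be illustrated with a small plain-Python sketch. This is not DTensor code, and the two replicas are simulated with a list of shards. Statistics computed per replica differ from statistics computed over the global batch, so the normalized values differ as well.

```python
import statistics

# One global batch of 8 values, split across two simulated replicas.
shards = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
global_batch = [value for shard in shards for value in shard]

# Per-replica view: each replica normalizes with its own mean and variance.
per_replica_stats = [(statistics.mean(s), statistics.pvariance(s)) for s in shards]

# Global view: one mean and variance for the whole batch.
global_stats = (statistics.mean(global_batch), statistics.pvariance(global_batch))

print(per_replica_stats)
print(global_stats)
```

Writing the model from the global perspective fixes which of these two definitions applies, instead of leaving it to how the replicas happen to be set up.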

### Example Implementation

See the DTensor overview guide: https://www.tensorflow.org/guide/dtensor_overview

## Support for WSL2

The Windows Subsystem for Linux lets developers run a Linux environment directly on Windows, without the overhead of a traditional virtual machine or dual boot setup. TensorFlow now supports WSL2 out of the box, including GPU acceleration. Please see the documentation for more details about the requirements and how to install WSL2 on Windows.

## Optimized Training with Keras

In TensorFlow 2.9, a new experimental version of the Keras Optimizer API was released, tf.keras.optimizers.experimental. The API provides a more unified and expanded catalog of built-in optimizers which can be more easily customized and extended.

In a future release, tf.keras.optimizers.experimental.Optimizer (and subclasses) will replace tf.keras.optimizers.Optimizer (and subclasses), which means that workflows using the legacy Keras optimizer will automatically switch to the new optimizer. The current (legacy) tf.keras.optimizers.* API will still be accessible via tf.keras.optimizers.legacy.*, such as tf.keras.optimizers.legacy.Adam.

Here are some highlights of the new optimizer class:

+ Incrementally faster training for some models.
+ Easier to write customized optimizers.
+ Built-in support for moving average of model weights ("Polyak averaging").
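As a rough picture of what a moving average of model weights means, the sketch below keeps an exponential moving average alongside the raw weights in plain Python. It is a conceptual example only. The function name, the momentum value, and the choice of an exponential average are assumptions for illustration, not a description of the Keras implementation.

```python
def ema_update(average, weights, momentum):
    """Blend the running average with the latest weights."""
    return [momentum * a + (1.0 - momentum) * w for a, w in zip(average, weights)]


# Hypothetical values of a single weight after successive training steps.
weights_per_step = [[8.0], [4.0], [2.0], [1.0]]

average = list(weights_per_step[0])  # start the average at the initial weights
for step_weights in weights_per_step[1:]:
    average = ema_update(average, step_weights, momentum=0.5)

print(weights_per_step[-1], average)
```

The averaged weights lag behind the raw ones and smooth out step-to-step noise, which is why they are often used for evaluation rather than for the updates themselves.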

API reference: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/experimental

References:

+ https://blog.tensorflow.org/2022/05/whats-new-in-tensorflow-29.html
+ https://www.tensorflow.org/xla
+ https://github.com/oneapi-src/oneDNN
+ https://venturebeat.com/ai/tensorflow-now-defaults-to-intel-onednn-ai-optimizations/
+ https://www.intel.com/content/www/us/en/newsroom/news/intel-onednn-speeds-ai-optimizations-in-tensorflow.html#gs.kqp0t8