# Introduction to Jupyter Lab/Notebooks & Machine Learning

## What are the Big Steps for Machine Learning?

1. Gather your data, sanitize it, and prepare it for processing.
2. Design/build your neural model and train against your data.
  1. Define the model
  2. Compile the model
  3. Fit the model
  4. Evaluate the model
3. Create a predictive output.

## Terms

+ **neural networks** - are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input.  Some core capabilities:
    + Classification (images, categories, etc.)
    + Clustering (grouping like things)
    + Regression Analysis (finding non-linear relationships between the inputs and outputs)
+ **activation function** - In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" or "OFF", depending on input.
+ **optimization algorithm** - Given a function f(x), an optimization algorithm helps to either minimize or maximize the value of f(x). In the context of deep learning, we use optimization algorithms to train the neural network by minimizing the cost function J.
  + Keras Optimizers - https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
+ **loss/cost function** - It is a function that measures the performance of a Machine Learning model for given data. Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.  Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
  + Keras Loss Functions - https://www.tensorflow.org/api_docs/python/tf/keras/losses
+ **epoch** - One complete pass through the training dataset. The number of epochs is the number of such passes made during training.
+ **batch size** - The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.  If considering the tuning of batch size read this: https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/
+ **model fitting** - Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A model that is well-fitted produces more accurate outcomes. A model that is overfitted matches the training data too closely. A model that is underfitted doesn’t match it closely enough.
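To make the activation function and loss/cost function terms concrete, here is a minimal sketch in plain Python. The function names and sample values are illustrative assumptions, not part of Keras; mean squared error is just one of many possible loss functions.

```python
import math

def step(z):
    """Binary "ON"/"OFF" activation: 1 when the input is non-negative, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Smooth activation that squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectifier: passes positive inputs through and zeroes out the rest."""
    return max(0.0, z)

def mean_squared_error(expected, predicted):
    """Cost function: average squared difference, reported as one real number."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)

# Hypothetical expected values and two sets of predictions.
expected = [1.0, 0.0, 1.0]
perfect = [1.0, 0.0, 1.0]
poor = [0.2, 0.9, 0.4]

print(mean_squared_error(expected, perfect))  # zero loss for a perfect prediction
print(mean_squared_error(expected, poor))     # larger loss for a worse prediction
```

The loss is a single number, which is what lets an optimization algorithm compare one set of weights against another.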

## Diagram of a Sample Node

![title](img/perceptron_node.png)

## Neural Networks are typically layered

A node layer is a row of those neuron-like switches that turn on or off as the input is fed through the net. Each layer’s output is simultaneously the subsequent layer’s input, starting from an initial input layer receiving your data.

![title](img/mlp.png)
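The layering idea can be sketched in a few lines of plain Python. The weights, biases, and layer sizes below are made-up values chosen only for illustration; a real network would learn them during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of all inputs, adds its
    bias, and applies the activation function."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -0.2, 0.1],
                  [-0.3, 0.8, 0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

x = [1.0, 2.0, 3.0]
hidden = dense_layer(x, hidden_weights, hidden_biases)       # input layer -> hidden layer
output = dense_layer(hidden, output_weights, output_biases)  # hidden output feeds the next layer
print(len(hidden), len(output))
```

Note how `hidden` is both the output of the first layer and the input of the second, which is exactly the relationship described above.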

The essence of learning in deep learning is nothing more than adjusting a model’s weights in response to the error it produces, until you can’t reduce the error any more.
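As a rough illustration of "adjusting weights in response to the error", the sketch below fits a single weight with plain gradient descent. The toy data, learning rate, and epoch count are assumptions chosen so the result is easy to check by eye; it is not how Keras implements training.

```python
# Toy data generated from y = 2 * x, so the "right" weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0              # initial weight (arbitrary starting point)
learning_rate = 0.01

def mse(weight):
    """Mean squared error of the one-weight model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_error = mse(w)
for epoch in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # adjust the weight against the error
final_error = mse(w)

print(round(w, 3), final_error < initial_error)
```

Each pass nudges `w` in the direction that lowers the error, and the updates shrink as the error approaches its minimum.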

## Types of neural networks

  + Perceptron: the simplest and oldest model of a neuron as we know it. It takes some inputs, sums them up, applies an activation function, and passes the result to the output layer. No magic here.  
  ![title](img/perceptron.png)
  + Feed forward neural networks are also quite old; the approach originates from the 1950s. They generally follow these rules: (1) all nodes are fully connected, (2) activation flows from the input layer to the output without back loops, and (3) there is one layer between input and output (the hidden layer).  
  ![title](img/feedforward.png)
  + Deep Feed Forward (DFF) networks are feed forward neural networks with more than one hidden layer.  
  ![title](img/deepfeedforward.png)
  + Recurrent Neural Networks (RNN) introduce a different type of cell: the recurrent cell. The first network of this type was the so-called Jordan network, in which each hidden cell received its own output after a fixed delay of one or more iterations. Apart from that, it was like a common feed forward network.  
  ![title](img/rnn.png)
  + Long Short-Term Memory (LSTM) networks introduce a memory cell, a special cell that can process data when the data have time gaps (or lags). RNNs can process text by “keeping in mind” the ten previous words, while LSTM networks can process video frames “keeping in mind” something that happened many frames ago. LSTM networks are also widely used for writing and speech recognition.  
  ![title](img/ltsm.png)
  + Generative Adversarial Networks (GAN) represent a huge family of double networks composed of a generator and a discriminator. They constantly try to fool each other: the generator tries to generate some data, and the discriminator, receiving sample data, tries to tell generated data from samples. Constantly evolving, this type of neural network can generate real-life images, provided you are able to maintain the training balance between the two networks.  
  ![title](img/gan.png)
  + Variational autoencoders (VAE) have the same architecture as autoencoders (AEs) but are “taught” something else: an approximated probability distribution of the input samples. It’s a bit back to the roots, as they are a bit more closely related to Boltzmann machines (BMs) and restricted Boltzmann machines (RBMs). They do however rely on Bayesian mathematics regarding probabilistic inference and independence, as well as a re-parametrisation trick to achieve this different representation. The inference and independence parts make sense intuitively, but they rely on somewhat complex mathematics. The basics come down to this: take influence into account. If one thing happens in one place and something else happens somewhere else, they are not necessarily related. If they are not related, then the error propagation should consider that. This is a useful approach because neural networks are large graphs (in a way), so it helps if you can rule out influence from some nodes to other nodes as you dive into deeper layers.  
  ![title](img/vae.png)
  
References: https://www.asimovinstitute.org/
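What separates a recurrent cell from a feed forward node is that its previous output is fed back in as an extra input. The following is a simplified sketch of that idea, not a faithful Jordan network or LSTM; the weights `w_in` and `w_rec` and the `tanh` activation are assumptions chosen for illustration.

```python
import math

def run_recurrent_cell(inputs, w_in, w_rec, bias=0.0):
    """Feed a sequence through one recurrent cell and return every state."""
    state = 0.0  # nothing remembered before the first input
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        state = math.tanh(w_in * x + w_rec * state + bias)
        states.append(state)
    return states

sequence = [1.0, 0.0, 0.0, 0.0]  # a single pulse followed by silence

with_memory = run_recurrent_cell(sequence, w_in=1.0, w_rec=0.9)
without_memory = run_recurrent_cell(sequence, w_in=1.0, w_rec=0.0)

print(with_memory)     # the pulse keeps echoing through later steps
print(without_memory)  # with no feedback, the cell forgets immediately
```

Setting `w_rec` to zero removes the feedback loop, so the cell behaves like an ordinary feed forward node.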

# Hyper-parameters

Hyperparameters are the variables that determine the network structure (e.g., number of hidden units) and the variables that determine how the network is trained (e.g., learning rate).

Hyperparameters are set before training (before optimizing the weights and biases).

References:
+ https://machinelearningmastery.com/
+ https://towardsdatascience.com/hyperparameters-in-deep-learning-927f7b2084dd
+ https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a

## Hyperparameters related to Network structure

### Number of Hidden Layers and units

Hidden layers are the layers between input layer and output layer.

*“Very simple. Just keep adding layers until the test error does not improve anymore.”*

*"The number of the hidden units is the main measure of model’s learning capacity."*

Many hidden units within a layer, combined with regularization techniques, can increase accuracy. A smaller number of units may cause underfitting.

### Dropout

Dropout is a regularization technique that randomly cancels (drops) neurons during training to avoid overfitting, thus increasing validation accuracy and the generalizing power of the network.

Generally, use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.

Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
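The mechanics of dropout can be sketched in plain Python. This uses the commonly described "inverted dropout" formulation, where surviving activations are scaled up to compensate for the dropped ones; the function name, seed, and sample values are assumptions for illustration, and this is not the Keras implementation.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the survivors by
    1 / (1 - rate) so the expected sum stays the same (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(42)          # fixed seed so the run is repeatable
layer_output = [1.0] * 10        # stand-in for the outputs of one layer
dropped = dropout(layer_output, rate=0.2, rng=rng)
print(dropped)
```

Dropout is applied only during training; at prediction time every neuron is kept.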

### Network Weight Initialization

Ideally, it may be better to use different weight initialization schemes according to the activation function used on each layer.
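As one example of matching the initializer to the activation function, the sketch below computes the bounds of two widely used uniform schemes: Glorot (Xavier) uniform, often paired with sigmoid or tanh, and He uniform, often paired with the rectifier. The helper names and layer sizes are assumptions made for illustration.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Glorot (Xavier) uniform bound, often paired with sigmoid or tanh."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """He uniform bound, often paired with rectifier (ReLU) activations."""
    return math.sqrt(6.0 / fan_in)

def init_weights(fan_in, fan_out, limit, rng):
    """Draw a fan_out x fan_in weight matrix uniformly from [-limit, limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
limit = glorot_uniform_limit(fan_in=64, fan_out=32)
weights = init_weights(64, 32, limit, rng)
print(round(limit, 4))  # → 0.25
```

Both schemes shrink the range as layers get wider, which keeps the weighted sums from growing with the number of inputs.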

Mostly, a uniform distribution is used.

### Activation function

Activation functions are used to introduce nonlinearity to models, which allows deep learning models to learn nonlinear prediction boundaries.

Generally, the **rectifier** activation function is the most popular.  **Sigmoid** is used in the output layer while making binary predictions. **Softmax** is used in the output layer while making multi-class predictions.
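For the multi-class case, softmax turns the raw scores of the output layer into probabilities that sum to 1. A minimal sketch in plain Python, with made-up scores for three classes:

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]          # hypothetical raw outputs for three classes
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(predicted_class)            # → 0, the class with the largest score
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.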

## Hyperparameters related to Training Algorithm

### Learning Rate

The learning rate defines how quickly a network updates its parameters.

A low learning rate slows down the learning process but converges smoothly. A larger learning rate speeds up learning but may not converge.

Usually a decaying learning rate is preferred.

![title](img/learning_rate.png)

a) Your model will have hundreds or thousands of parameters, each with its own error curve, and the learning rate has to shepherd all of them.

b) Error curves are not clean U-shapes. They have more complex shapes with local minima.
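The trade-off between low and high learning rates can be seen on the simplest possible error curve, f(w) = w², whose minimum is at w = 0. The three rates below are assumptions picked to show the three behaviors; real error curves are far less tidy, as noted above.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Run gradient descent on the error curve f(w) = w**2 (gradient 2*w)."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

slow = descend(0.01)      # low rate: smooth but still far from the minimum
good = descend(0.1)       # moderate rate: close to the minimum at w = 0
diverged = descend(1.1)   # rate too large: every step overshoots further

print(abs(slow), abs(good), abs(diverged))
```

With the largest rate each update jumps past the minimum and lands further away than it started, which is what "may not converge" looks like in practice.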

### Momentum

Momentum uses knowledge of the previous steps to help determine the direction of the next step. It helps to prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.
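A minimal sketch of the classical momentum update, again on the toy error curve f(w) = w². The default values and the function name are assumptions for illustration; optimizers in real libraries add further refinements.

```python
def descend_with_momentum(learning_rate=0.1, momentum=0.9, steps=300, w=1.0):
    """Gradient descent on f(w) = w**2 with a classical momentum term."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2 * w
        # Blend the previous step (velocity) with the new gradient step.
        velocity = momentum * velocity - learning_rate * grad
        w += velocity
    return w

print(abs(descend_with_momentum()) < 1e-3)               # → True
print(abs(descend_with_momentum(momentum=0.0)) < 1e-3)   # → True (plain gradient descent)
```

The `velocity` term is the "knowledge of the previous steps": each update is a weighted mix of where the optimizer was already heading and where the current gradient points.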

### Number of epochs

Number of epochs is the number of times the whole training data is shown to the network while training.

Increase the number of epochs until the validation accuracy starts decreasing even while training accuracy is still increasing (overfitting).

![title](img/epochs.png)

Use EarlyStopping to ensure you don't overfit.
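The Keras `EarlyStopping` callback takes a `monitor` metric and a `patience` value. The logic behind it can be sketched in plain Python as follows; this is a simplified stand-in written for illustration, not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered: ran every epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
history = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # → 5 2
```

A larger `patience` tolerates more noisy epochs before giving up, at the cost of extra training time.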

### Batch size


+ Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
+ Batch, Stochastic, and Minibatch gradient descent are the three main flavors of the learning algorithm.
+ There is a tension between batch size and the speed and stability of the learning process.


+ Batch Gradient Descent. Batch size is set to the total number of examples in the training dataset.
+ Stochastic Gradient Descent. Batch size is set to one.
+ Minibatch Gradient Descent. Batch size is set to more than one and less than the total number of examples in the training dataset.


The minibatch size is the number of samples given to the network before a parameter update happens.

A good default for batch size might be 32. Also try 64, 128, 256, and so on.

![title](img/batch_size.png)
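The three flavors differ only in how the training examples are sliced before each parameter update. A small sketch, assuming a stand-in dataset of 10 examples and a hypothetical helper `iter_batches`:

```python
import math

def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, each at most `batch_size` long."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(10))  # stand-in for 10 training examples

for name, batch_size in [("batch", len(examples)), ("stochastic", 1), ("minibatch", 4)]:
    updates = math.ceil(len(examples) / batch_size)   # parameter updates per epoch
    sizes = [len(b) for b in iter_batches(examples, batch_size)]
    print(name, updates, sizes)
```

Smaller batches mean more updates per epoch, each based on a noisier estimate of the error gradient.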

# Well How Long does it take?

**It is very important to understand that getting an answer means you have already trained your neural layer (which takes a long time), and that your prediction (the software that invokes the layer, provides the inputs, and extracts an answer) is typically fast.  In the AEC project it took many hours (12+) to train each neural layer but only 15 minutes to generate an answer and package the result (most of which was I/O).**

## Machine Learning has three (3) major components that represent your time investment:

1. Data preparation

2. Neural Layer Architecture and Training

3. Prediction (getting the answer)

### Data Preparation

 - By far, data preparation is your biggest time sink.  ML takes an understanding of the data, actual gathering of the data (which is harder than most understand), inspection of the data, and ultimately cleaning the data to a state that is sufficient to support solid training.

 - Understanding the nature of the data has a direct impact on ML architecture (potentially).

### Neural Layer Architecture and Training

 - This is the second level of time investment with the majority of time spent in training.

 - There's no one recipe for any ML problem.  That said there are several pre-built tools and techniques that can support some of the more standard problems.

 - Design time of the neural layer is just like any other software development process.  Design, test, evaluate results. 

 - Training the actual layer once the architecture is defined and the data prepared is what takes a LOT of time on the computer.  Training a single neural layer for a particular capability / region might take 20+ hours to finish.  Note that once that layer is trained and persisted (its learned weights saved in HDF5 format) the actual prediction of the neural network takes minutes.

### Prediction

 - This is where you provide inputs into the trained neural layer and get an answer (typically very fast, especially in contrast to training).

 - The level of implementation (quality, adherence to requirements) has direct impact on the time spent developing the solution.  In other words...if this is a 6.4 transition it will take more time.

### GPUs

The Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU. Other computing devices, like FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.

This difference in capabilities between the GPU and the CPU exists because they are designed with different goals in mind. While the CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing thousands of them in parallel (amortizing the slower single-thread performance to achieve greater throughput).

The GPU is specialized for highly parallel computations and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. Figure 1 of the CUDA C Programming Guide (see the reference below) shows an example distribution of chip resources for a CPU versus a GPU.

  - If the code isn't structured properly it won't matter how many GPUs you have / use.

  - GPUs are nice / good but availability and ease of use of any computer system always trumps "cool stuff".

  - Why isn't the code already set up for GPU utilization?  It takes specific preparation (see Data Preparation) as well as an actual understanding of how GPUs are designed to use the data most effectively.  Constructing "GPU ready" data is sometimes counter-intuitive to how the data is structured or how some humans encapsulate that data in their mind.  If you're not guaranteed to get access to GPUs, why bother to structure the data like that?

  - How many GPUs do I need?  ALL OF THEM.  Seriously, using as many resources as you can get is appropriate.  We touched on the design / architecture / training aspects of ML, and training is where the computer time goes.  Perturbing hyper-parameters while seeking the best possible fit to your data also adds to the time spent.

  - Taking 43 hyper-parameters (# of neurons, batch size, activation function, loss function, etc.) and creating a grid-type search for the best fit to a very small dataset took 60+ hours on a laptop CPU with 16 GB of memory.  That's just to find out which of the many stock options available best fits a very small dataset, using a recognized (but not dominant) technique for the neural design.
  
Reference: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model
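The cost of a grid-type search grows multiplicatively with every hyper-parameter added, which is why it eats so much compute time. The sketch below counts the combinations for a small, made-up search space; the parameter names, values, and the assumed five minutes per fit are illustrative only and are not the settings used in the 60+ hour run described above.

```python
import itertools

# Hypothetical search space; the names and values are illustrative only.
grid = {
    "neurons": [16, 32, 64, 128],
    "batch_size": [32, 64, 128],
    "activation": ["relu", "sigmoid", "tanh"],
    "loss": ["mse", "mae"],
}

# Every combination of one value per hyper-parameter is a separate training run.
combinations = list(itertools.product(*grid.values()))
minutes_per_fit = 5  # assumed training time for one configuration

total_hours = len(combinations) * minutes_per_fit / 60
print(len(combinations), total_hours)  # → 72 6.0
```

Adding a fifth hyper-parameter with four candidate values would multiply both numbers by four.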


# TensorFlow + Keras and PyTorch

TensorFlow (https://www.tensorflow.org) - An end-to-end open source machine learning platform for everyone. Discover TensorFlow's flexible ecosystem of tools, libraries and community resources.  Latest version at the time of writing: 2.1.

Refer to https://www.tensorflow.org/guide/effective_tf2 for support on tensorflow capabilities in general.

Keras (https://keras.io/) - a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

What's a tensor?  
 + In mathematics, a tensor is an algebraic object that describes a (multilinear) relationship between sets of algebraic objects related to a vector space.  
 + A tensor is a container which can house data in N dimensions, along with its linear operations, though there is nuance in what tensors technically are and what we refer to as tensors in practice. 
 
![title](img/scalar-vector-matrix-tensor.jpg) 
 

Install:
  + conda install -c conda-forge pydot
  + Install the Graphviz package.

Reference: 
+ https://www.tensorflow.org/guide/keras/overview
+ https://en.wikipedia.org/wiki/Tensor
+ https://www.kdnuggets.com/2018/05/wtf-tensor.html
+ https://www.youtube.com/watch?v=f5liqUk0ZTw

Created by the Google Brain team, TensorFlow is a popular open source library for numerical computation and large-scale machine learning. While building TensorFlow, Google engineers originally opted for a static computational graph approach to building machine learning models: in TensorFlow 1.x you define the graph statically before a model can run, and all communication with the outer world is performed via the tf.Session object and tf.placeholder tensors, which are substituted by external data at runtime. TensorFlow 2.x, which this notebook uses, executes eagerly by default.

In [1]:
# Python 3.7.3
############################################
# INCLUDES
############################################
#libraries specific to this example
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.backend import clear_session
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

#a set of libraries that perhaps should always be in Python source
import os 
import datetime
import sys
import gc
import getopt
import inspect
import math
import warnings
import types

#Data Science Libraries
import numpy as np
import pandas as pd
import scipy as sp
import scipy.ndimage

#Plotting libraries
import matplotlib as matplt
import matplotlib.pyplot as plt

#a darn useful library for creating paths and one I recommend you load to your environment
from pathlib import Path

# can type in the python console `help(name of function)` to get the documentation
from pydoc import help                          

#Import a custom library, in this case a fairly useful logging framework
debug_lib_location = Path("./")
sys.path.append(str(debug_lib_location))
import debug

warnings.filterwarnings('ignore')               # don't print out warnings


root_location = ".." + os.sep + "data"

2022-12-14 21:49:47.948131: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


In [2]:
############################################
#JUPYTER NOTEBOOK OUTPUT CONTROL / FORMATTING
############################################
#set floating point to 4 places so things don't run loose
pd.options.display.float_format = '{:,.4f}'.format
np.set_printoptions(precision=4)

# Variable declaration

In [3]:
############################################
# GLOBAL VARIABLES
############################################
DEBUG = 1                            #General ledger output so you know what's happening.
DEBUG_DATA = 1                       #Extremely verbose output, change to zero (0) to suppress the volume of output.

# CODE CONSTRAINTS
VERSION_NAME    = "AEC"
VERSION_ACRONYM = "AEC"
VERSION_MAJOR   = 0
VERSION_MINOR   = 0
VERSION_RELEASE = 1
VERSION_TITLE   = VERSION_NAME + " (" + VERSION_ACRONYM + ") " + str(VERSION_MAJOR) + "." + str(VERSION_MINOR) + "." + str(VERSION_RELEASE) + " generated SEED."

ENCODING  ="utf-8"
############################################
# GLOBAL CONSTANTS
############################################
TEMPERATURE="Temp(C)"
SALINITY="Sal(PSU)"    

############################################
# APPLICATION VARIABLES
############################################

############################################
# GLOBAL CONFIGURATION
############################################
os.environ['PYTHONIOENCODING']=ENCODING


## Tensor Example in Python (data structures)

In [4]:
#######################################################################
#Scalar Example
#######################################################################
x = np.array(42)
print(x)
print('A scalar is of rank %d' %(x.ndim))
print(" ")
print(" ")
#######################################################################
#Vector Example
#######################################################################
x = np.array([1, 1, 2, 3, 5, 8])
print(x)
print('A vector is of rank %d' %(x.ndim))
print(" ")
print(" ")
#######################################################################
#Matrix Example
#######################################################################
x = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])
print(x)
print('A matrix is of rank %d' %(x.ndim))
print(" ")
print(" ")
#######################################################################
#3D Tensor
#######################################################################
x = np.array([[[1, 4, 7],
               [2, 5, 8],
               [3, 6, 9]],
              [[10, 40, 70],
               [20, 50, 80],
               [30, 60, 90]],
              [[100, 400, 700],
               [200, 500, 800],
               [300, 600, 900]]])
print(x)
print('This tensor is of rank %d' %(x.ndim))

42
A scalar is of rank 0
 
 
[1 1 2 3 5 8]
A vector is of rank 1
 
 
[[1 4 7]
 [2 5 8]
 [3 6 9]]
A matrix is of rank 2
 
 
[[[  1   4   7]
  [  2   5   8]
  [  3   6   9]]

 [[ 10  40  70]
  [ 20  50  80]
  [ 30  60  90]]

 [[100 400 700]
  [200 500 800]
  [300 600 900]]]
This tensor is of rank 3


## What is transfer learning?

Transfer learning makes use of the knowledge gained while solving one problem and applies it to a different but related problem.

For example, knowledge gained while learning to recognize cars can be used to some extent to recognize trucks.

### Pre-Training

When we train the network on a large dataset (for example, ImageNet), we train all the parameters of the neural network, and the model is thereby learned. It may take hours on your GPU.

### Fine Tuning

We can fine tune the pre-trained CNN on the new dataset. Suppose the new dataset is similar to the original dataset used for pre-training. Since the datasets are similar, the same weights can be used for extracting features from the new dataset.

   + If the new dataset is very small, it’s better to train only the final layers of the network to avoid overfitting, keeping all other layers fixed. So remove the final layers of the pre-trained network, add new layers, and retrain only the new layers.
   + If the new dataset is very large, retrain the whole network with initial weights from the pretrained model.

### How to fine tune if the new dataset is very different from the original dataset?

The earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors), but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.

The earlier layers can help to extract the features of the new data. So if you have only a small amount of data, it is a good idea to fix the earlier layers and retrain the rest of the layers.

If you have a large amount of data, you can retrain the whole network with weights initialized from the pre-trained network.

References:
+ https://towardsdatascience.com/what-is-transfer-learning-8b1a0fa42b4
+ Tensorflow example - https://www.tensorflow.org/guide/effective_tf2

# The following is an example and not to be executed

In [5]:
#example of transfer learning, not to be executed!
"""
trunk = tf.keras.Sequential([...])
head1 = tf.keras.Sequential([...])
head2 = tf.keras.Sequential([...])

path1 = tf.keras.Sequential([trunk, head1])
path2 = tf.keras.Sequential([trunk, head2])

# Train on primary dataset
for x, y in main_dataset:
  with tf.GradientTape() as tape:
    # training=True is only needed if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    prediction = path1(x, training=True)
    loss = loss_fn_head1(prediction, y)
  # Simultaneously optimize trunk and head1 weights.
  gradients = tape.gradient(loss, path1.trainable_variables)
  optimizer.apply_gradients(zip(gradients, path1.trainable_variables))

# Fine-tune second head, reusing the trunk
for x, y in small_dataset:
  with tf.GradientTape() as tape:
    # training=True is only needed if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    prediction = path2(x, training=True)
    loss = loss_fn_head2(prediction, y)
  # Only optimize head2 weights, not trunk weights
  gradients = tape.gradient(loss, head2.trainable_variables)
  optimizer.apply_gradients(zip(gradients, head2.trainable_variables))

# You can publish just the trunk computation for other people to reuse.
tf.saved_model.save(trunk, output_path)
"""


## Graphics Processing Units (GPUs)

Reference: https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/

The CPU (central processing unit) has been called the brains of a PC. The GPU its soul. 

GPUs have ignited a worldwide AI boom. They’ve become a key part of modern supercomputing. They’ve been woven into sprawling new hyperscale data centers. They’ve become accelerators speeding up all sorts of tasks from encryption to networking to AI.

And they continue to drive advances in gaming and pro graphics inside workstations, desktop PCs and a new generation of laptops.

### What Is a GPU?

While GPUs (graphics processing units) are now about a lot more than the PCs in which they first appeared, they remain anchored in a much older idea called parallel computing. And that’s what makes GPUs so powerful.

CPUs, to be sure, remain essential. Fast and versatile, CPUs race through a series of tasks requiring lots of interactivity. Calling up information from a hard drive in response to user’s keystrokes, for example.

By contrast, GPUs break complex problems into thousands or millions of separate tasks and work them out at once.

That makes them ideal for graphics, where textures, lighting and the rendering of shapes have to be done at once to keep images flying across the screen.

### CPU vs GPU

|CPU|GPU|
|---|---|
|Central Processing Unit|Graphics Processing Unit|
|Several cores|Many cores|
|Low latency|High throughput|
|Good for serial processing|Good for parallel processing|
|Can do a handful of operations at once|Can do thousands of operations at once|

Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.

GPUs deliver the once-esoteric technology of parallel computing. It’s a technology with an illustrious pedigree that includes names such as supercomputing genius Seymour Cray. But rather than taking the shape of hulking supercomputers, GPUs put this idea to work in the desktops and gaming consoles of more than a billion gamers.

## Some General Notes

Just a note regarding GPU utilization on Windows 10.  Firefox, Photos, Cortana, and a litany of other things can unexpectedly rob you of GPU memory and result in a very esoteric error message that doesn't make sense right away.  If your code worked before, the GPU is probably being clobbered and it's time to open the Task Manager or reboot.

In my case the real problem was watching a PMI video while trying to run the Python code, although I suspect that having added a third monitor this weekend might have helped the issue along somehow.  I removed a monitor and rebooted, only to have the same issue persist, which makes me think certain programs are greedy.

There's a push to use less power on PCs, and it seems GPUs are more power efficient, which means many apps (Firefox with video, for example) are being pushed to use the GPU rather than the CPU.

Note that Windows key + Ctrl + Shift + B resets your GPU / display adapter.  This won't help if you're still running the programs that caused the problem in the first place.

More GPUs mean faster processing, perhaps not massively but significantly for sure.  Even a lower-end video card can easily outpace a server with double the CPU.

TensorFlow doesn't automatically make use of all GPUs; look into tf.distribute.Strategy (for example, tf.distribute.MirroredStrategy).  TensorFlow does make use of multiple CPU cores but treats them all as a single device, CPU:0.




References:
+ https://pathmind.com/wiki/neural-network
+ http://karpathy.github.io/2019/04/25/recipe/
+ https://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/
+ https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss
+ https://www.youtube.com/c/3blue1brown