# Assignment 10 Solutions

Submitted By: ANSARI PARVEJ

#### 1.	What does a SavedModel contain? How do you inspect its content?

**Ans:**

A SavedModel is TensorFlow's serialization format for saving and loading models. It is a directory with two main components:

- The computation graph: Stored in the `saved_model.pb` file as serialized protocol buffers. It describes the architecture of the model, the operations it uses, their inputs and outputs, and the relationships between them.

- The variables: The learned parameters of the model, such as the weights and biases optimized during training. They are stored as checkpoint files in the `variables/` subdirectory.

In addition to the computation graph and variables, a SavedModel records the model's input and output signatures and may include an `assets/` subdirectory with extra files the graph needs, such as vocabulary files. Depending on how the model was exported, it may also carry metadata such as the training configuration.

To inspect the contents of a SavedModel, you can use the `saved_model_cli` command-line tool that is installed with TensorFlow. For example, the following command prints every tag set and signature in a SavedModel, together with the names, types, and shapes of their inputs and outputs:

- `saved_model_cli show --dir /path/to/saved_model --all`
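
As a complement to the command above, here is a minimal sketch that checks which of the conventional SavedModel entries exist in a directory. It uses only the Python standard library and looks at file names, not at the graph itself. The function name `describe_saved_model_dir` and the example path are hypothetical.

```python
import os

# Entries a SavedModel directory conventionally holds.
# "assets" is optional and is often absent.
EXPECTED_ENTRIES = {
    "saved_model.pb": "serialized graph(s) and signatures",
    "variables": "checkpointed variable values",
    "assets": "extra files such as vocabularies (optional)",
}


def describe_saved_model_dir(path):
    """Return {entry_name: bool} telling which expected entries exist under path."""
    return {
        name: os.path.exists(os.path.join(path, name))
        for name in EXPECTED_ENTRIES
    }


# Hypothetical path: replace it with a real export directory.
for name, present in describe_saved_model_dir("/path/to/saved_model").items():
    status = "found" if present else "missing"
    print(f"{name:15s} {status:8s} ({EXPECTED_ENTRIES[name]})")
```

This only confirms the directory layout; `saved_model_cli` is still the tool to use for reading signatures.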


#### 2.	When should you use TF Serving? What are its main features? What are some tools you can use to deploy it?

**Ans:**

TensorFlow Serving is a good fit for:

- Serving machine learning models in production: TensorFlow Serving is designed to be fast, scalable, and reliable, making it a good choice for serving machine learning models in production environments.

- Serving multiple versions of a model: TensorFlow Serving makes it easy to manage and serve multiple versions of a model, which can be useful for A/B testing or gradual rollouts of new models.

- Serving models over a network: TensorFlow Serving provides a flexible and efficient way to serve machine learning models over a network, making it easy to integrate models into distributed systems.

Some of the main features of TensorFlow Serving include:

- Flexible model serving: TensorFlow Serving exposes both a REST API and a gRPC API, and a single server can serve several models with different input and output signatures.

- Efficient serving: TensorFlow Serving can automatically batch incoming requests so that the hardware (especially a GPU) processes them together, which raises throughput.

- Model versioning and management: TensorFlow Serving can serve multiple versions of a model side by side. It watches the model directory, loads new versions as they appear, and unloads old ones without a restart.
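
To make the versioning feature concrete, the sketch below builds the URL and JSON body of a REST prediction request, optionally pinned to a specific model version. It only constructs the request and does not send it. The helper name `build_predict_request`, the host, and the model name `my_model` are assumptions for illustration; port 8501 is the REST port used in the TensorFlow Serving Docker examples.

```python
import json


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Build the URL and JSON body for a TF Serving REST predict call."""
    # Omitting the version lets the server pick the version it serves by default.
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({
        "signature_name": "serving_default",
        "instances": instances,
    })
    return url, body


# Hypothetical model with one three-feature input instance, pinned to version 2.
url, body = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0]], version=2)
print(url)   # http://localhost:8501/v1/models/my_model/versions/2:predict
print(body)
```

Pinning a version this way is one option for A/B testing: part of the traffic can target the new version while the rest keeps using the old one.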

To deploy TensorFlow Serving, there are several tools available, including:

- Docker: TensorFlow Serving provides Docker images that can be used to run the serving system in a containerized environment.

- Kubernetes: Kubernetes is a popular container orchestration system that can be used to deploy TensorFlow Serving at scale.

- Cloud platforms: Many cloud platforms, such as Google Cloud Platform and Amazon Web Services, provide managed services for serving machine learning models, including TensorFlow Serving.

#### 3.	How do you deploy a model across multiple TF Serving instances?

**Ans:**

Deploying a model across multiple TensorFlow Serving instances can be useful for load balancing and improving availability. Here are the high-level steps to deploy a model across multiple TF Serving instances:

- Start multiple instances of TensorFlow Serving: Start the instances on different machines or in different containers. Each instance needs its own address; a distinct port number is only required when several instances share the same host.

- Configure a load balancer: Configure a load balancer such as Nginx or HAProxy to distribute requests among the TF Serving instances. You can configure the load balancer to use different algorithms for distributing requests, such as round-robin, least connections, or IP hash.

- Give every instance the same model: In this setup each instance loads and serves the complete model, and the load balancer decides which instance handles each request. Point all instances at the same model base path (for example, a shared network or cloud storage location) or ship identical copies to each one, so that they all pick up new versions consistently.

- Test and monitor the deployment: Test the deployment by sending sample requests to the load balancer, and monitor the performance and availability of the TF Serving instances using metrics such as request latency, request rate, and CPU usage.

To implement these steps, you can use tools such as Docker, Kubernetes, Nginx, and Prometheus. Managed Kubernetes services, such as Google Kubernetes Engine and Amazon Elastic Kubernetes Service (formerly Amazon Elastic Container Service for Kubernetes), can also host the deployment.
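
The round-robin policy mentioned above can be sketched in a few lines. This is a toy, in-process illustration of how requests would be spread over instances, not a replacement for a real load balancer such as Nginx or HAProxy. The class name and the backend addresses are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Return the backend that should receive the next request."""
        return next(self._cycle)


# Three hypothetical TF Serving instances behind the balancer.
balancer = RoundRobinBalancer([
    "tf-serving-0:8501",
    "tf-serving-1:8501",
    "tf-serving-2:8501",
])
assignments = [balancer.next_backend() for _ in range(5)]
print(assignments)
```

With five requests and three instances, the fourth request wraps around to the first instance again.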

#### 4.	When should you use the gRPC API rather than the REST API to query a model served by TF Serving?

**Ans:**

Both APIs expose the same functionality, so the choice is mostly about performance and convenience. The REST API is simpler to use and debug, while the gRPC API is usually the better choice in these situations:

- Large requests or responses: The REST API exchanges JSON, a verbose text format, so large inputs such as images or big arrays produce large payloads and extra serialization work. The gRPC API exchanges compact binary protocol buffers, which lowers both bandwidth use and latency.

- High-throughput inference: gRPC runs over HTTP/2, which multiplexes many concurrent requests over a single persistent connection and reduces the overhead of creating and tearing down connections.

- Typed clients in several languages: gRPC client code is generated from the service's protocol buffer definitions for languages such as C++, Java, Python, and Go, so requests are checked against a schema. The trade-off is that clients need the generated stubs, whereas REST needs only an HTTP client.

Features such as compression and authentication are available with both APIs, so they are not a reason to prefer one over the other.
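
The payload-size difference between a text encoding and a binary encoding can be illustrated with the standard library alone. This is only a rough sketch: `struct` packing of 32-bit floats stands in for a binary protocol buffer payload, and a real gRPC message adds some framing overhead that is not modelled here.

```python
import json
import struct

# 1000 arbitrary float values standing in for one model input.
values = [i / 7.0 for i in range(1000)]

# REST-style payload: the numbers are written out as JSON text.
json_bytes = json.dumps({"instances": [values]}).encode("utf-8")

# Binary-style payload: each value is stored as a 4-byte float32.
binary_bytes = struct.pack(f"<{len(values)}f", *values)

print("JSON bytes:  ", len(json_bytes))
print("binary bytes:", len(binary_bytes))
```

The binary payload is exactly 4 bytes per value, while the JSON payload needs one character per digit plus separators, so the gap grows with the size of the input.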

#### 5.	What are the different ways TFLite reduces a model’s size to make it run on a mobile or embedded device?

**Ans:**

TensorFlow Lite (TFLite) relies on several techniques to shrink a model so that it fits on mobile or embedded devices. The most common ones are:

- Conversion and graph optimization: The TFLite converter removes operations that are not needed for inference (such as training operations), simplifies and fuses operations where possible, and writes the result in the compact FlatBuffers format.

- Quantization: TFLite supports both post-training quantization and quantization-aware training. Post-training quantization converts the weights (and optionally the activations) of a trained model to lower-precision data types, such as 8-bit integers; 8-bit weights take about a quarter of the space of 32-bit floats. Quantization-aware training simulates quantization while the model is trained or fine-tuned, so that it learns to tolerate the loss of precision.

- Weight pruning: Pruning, available through the TensorFlow Model Optimization Toolkit, sets small-magnitude weights to zero. The zeros do not shrink the file by themselves, but a sparse model compresses much better.

- Other model compression: Weight clustering (weight sharing) replaces the weights with a small number of shared values, and knowledge distillation trains a smaller model to mimic the behavior of a larger, more complex model. These techniques are applied before conversion rather than performed by TFLite itself.
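
The arithmetic behind 8-bit weight quantization can be sketched in plain Python. This is a simplified symmetric scheme (the range `[-m, +m]` is mapped to `[-127, +127]`), shown under the assumption that one scale is shared by all weights; it is not TFLite's actual implementation, and the function names are hypothetical.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats in [-m, +m] to integers in [-q_max, +q_max]."""
    q_max = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    m = max(abs(w) for w in weights)          # largest absolute weight
    scale = m / q_max if m > 0 else 1.0       # float value of one integer step
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [-1.5, -0.2, 0.0, 0.4, 0.8]
q, scale = quantize_symmetric(weights)
recovered = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, recovered))

print("integers: ", q)
print("scale:    ", scale)
print("max error:", max_error)
```

Each weight is stored as one small integer plus a shared scale, and the rounding error per weight is at most half a step (`scale / 2`).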

#### 6.	What is quantization-aware training, and why would you need it?

**Ans:**


Quantization-aware training (QAT) is a technique used to train models with the goal of using lower-precision data types from the outset, such as 8-bit integers, which can be more memory-efficient and faster to compute on hardware that supports it. QAT involves simulating the effect of quantization during training by introducing fake quantization nodes into the model graph, which represent the rounding of weights and activations to lower-precision data types.

The main reason to use QAT is to create models that can be deployed on hardware with limited computational resources, such as mobile or embedded devices, while still maintaining good accuracy. By training the model to be robust to quantization and other forms of numerical approximation, it can be better prepared to run on hardware that doesn't support full-precision data types, or where full-precision data types would be too memory-intensive or computationally expensive to use.

Like any quantized model, a model produced with QAT has a smaller memory footprint and can run faster on hardware that supports lower-precision arithmetic. What QAT adds over post-training quantization alone is that accuracy usually degrades less, because the model has already adapted to the quantization noise during training.
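
A fake quantization node can be sketched as a function that takes a float and returns a float snapped to one of the representable levels. This is a minimal illustration of the rounding effect only, assuming a fixed `[x_min, x_max]` range; it omits how gradients are handled during training, and the function name is hypothetical.

```python
def fake_quantize(x, x_min, x_max, num_bits=8):
    """Clamp x to [x_min, x_max] and snap it to the nearest quantization level.

    Assumes x_min < x_max. The result is still a float, which is why the
    operation is called "fake" quantization.
    """
    levels = 2 ** num_bits - 1                 # 255 steps for 8 bits
    scale = (x_max - x_min) / levels           # float width of one step
    clamped = min(max(x, x_min), x_max)
    step_index = round((clamped - x_min) / scale)
    return x_min + step_index * scale


activations = [-2.0, -0.5, 0.003, 0.5, 3.0]
simulated = [fake_quantize(a, -1.0, 1.0) for a in activations]
print(simulated)
```

Values outside the range saturate at the boundaries, and values inside it move by at most half a step, which is the noise the model learns to tolerate.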

#### 7.	What are model parallelism and data parallelism? Why is the latter generally recommended?

**Ans:**

Model parallelism and data parallelism are two common techniques used to train deep learning models on distributed systems with multiple processing units, such as multiple GPUs or CPUs.

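Model parallelism splits the model itself across the processing units, so each unit holds and runs only part of the model (for example, different layers). It is necessary when the model is too large to fit on a single device, but the units must exchange activations and gradients at every step, and that communication often limits the speed-up.

Data parallelism replicates the whole model on every unit and splits each batch of input data across them. Each replica computes the forward and backward passes on its own shard independently. The gradients computed on each unit are then aggregated and used to update the model parameters. Data parallelism is generally recommended because it is simpler to implement, works with almost any model, and needs less communication between devices: the replicas only synchronize gradients once per training step. Its main requirement is that the model fits on a single device.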
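
The gradient aggregation step of data parallelism can be checked with a toy example. The sketch below assumes a one-parameter linear model `y = w * x` with a mean squared error loss and equally sized shards; under those assumptions, averaging the per-replica gradients gives the same value as computing the gradient on the full batch. All names and data are made up for illustration.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def split_evenly(items, num_shards):
    """Split items into num_shards equal parts (length must be divisible)."""
    size = len(items) // num_shards
    return [items[i * size:(i + 1) * size] for i in range(num_shards)]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
num_replicas = 4

# Each "replica" computes the gradient on its own shard of the batch.
x_shards = split_evenly(xs, num_replicas)
y_shards = split_evenly(ys, num_replicas)
replica_grads = [mse_gradient(w, sx, sy) for sx, sy in zip(x_shards, y_shards)]

# Aggregation: average the replica gradients, then apply one shared update.
averaged_grad = sum(replica_grads) / num_replicas
full_batch_grad = mse_gradient(w, xs, ys)
w_new = w - 0.01 * averaged_grad

print(averaged_grad, full_batch_grad, w_new)
```

Because every replica applies the same averaged gradient, all copies of the model stay identical after each step.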

#### 8.	When training a model across multiple servers, what distribution strategies can you use? How do you choose which one to use?

**Ans:**

When training a model across multiple servers, there are several distribution strategies that can be used, including:

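- `MirroredStrategy`: This strategy creates a copy of the model on each GPU of a single machine and keeps the copies synchronized by aggregating the gradients at every step. It is the most common strategy for synchronous data parallelism, but on its own it does not span multiple servers.

- `MultiWorkerMirroredStrategy`: This strategy works like `MirroredStrategy`, but it trains across multiple workers (servers), each with one or more devices. All replicas stay synchronized at every step.

- `ParameterServerStrategy`: The model variables are stored on one or more parameter servers, and they can be partitioned across those servers when they are large. Each worker reads the current parameters, computes gradients on its own batches of data, and sends the updates back. Updates are typically asynchronous, so workers do not wait for each other.

Which one to use depends on the hardware and the tolerance for stale gradients. `MultiWorkerMirroredStrategy` is a reasonable default for synchronous training when the workers are similar and the network between them is fast. `ParameterServerStrategy` suits clusters where workers differ in speed or may fail, since a slow worker does not hold up the others. On TPUs, `TPUStrategy` is the matching choice. When it is unclear which will be faster, measuring each option on the actual workload is the most reliable way to decide.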
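
Multi-worker strategies learn the cluster layout from the `TF_CONFIG` environment variable, a JSON string with a `cluster` section and a `task` section. The sketch below builds that string with the standard library. The helper name `make_tf_config` and the host names are hypothetical, and the string is only printed here, not exported.

```python
import json


def make_tf_config(worker_hosts, task_index, ps_hosts=None, task_type="worker"):
    """Build the JSON string expected in the TF_CONFIG environment variable."""
    cluster = {"worker": list(worker_hosts)}
    if ps_hosts:
        # Parameter servers are only listed for parameter-server training.
        cluster["ps"] = list(ps_hosts)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Two hypothetical workers; this process is worker 0.
workers = ["worker0.example.com:2222", "worker1.example.com:2222"]
tf_config = make_tf_config(workers, task_index=0)
print(tf_config)

# In a real job, each process would set os.environ["TF_CONFIG"] to its own
# string (same cluster, different task index) before creating the strategy.
```

Every process gets the same `cluster` section and a different `task` section, which is how each one knows its role.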