<h2>What’s the Problem? </h2>

Semiconductor packages use a variety of insulating and conducting materials that are patterned in a layered structure. Those materials have different thermal conductivities and depending on their pattern within each layer, they will contribute differently to the overall effective thermal conductivity of the package. 

A customer is interested in choosing the optimal combination of materials per layer that maximizes the thermal conductivity in the “z” direction (shown in the figure below) and minimizes the materials cost, subject to constraints on the available choices of material. 

<b>The Ask </b>

Train an RL agent from scratch using Ray-RLLib that chooses the “optimal” set of materials per layer on a semiconductor package layout, given information about the cost per unit weight and available choices of materials per layer per example, and details about the spatial pattern of conducting and insulating materials in each layer. 

<b>What We’ll Look for</b>

- Resourcefulness - were you able to come up with a self-consistent toy environment on which to train your RL agent? How did you inform your environment design by pulling relevant domain expertise?
- Creativity - this problem is intentionally pretty open but quite technical, which mirrors most problems at Vinci. Did you come up with a novel way of solving it? Where were you blocked and how did you get around those blockers?
- Communication - How clearly can you summarize your approach, the solution, and its implicit and chosen shortcomings?

<b>Additional Details</b>

You will need to construct a custom RL environment that matches the problem statement. We don’t expect you to write an RL algorithm from scratch. Feel free to use out-of-the-box algorithm(s) from RLLib. You should think about how the optimal actions of the RL agent would be consumed downstream - what would serving this model in a product look like?

<h2> Approach </h2>
- We will consider three approaches to using RL to handle this problem.

- Single Step (Finite horizon of 1) - At each RL step, the "action" is to perturb one (or potentially more) parameters and run the evaluation to find the reward.  The reward is assumed to be deterministic.  That is, each action is an operation over the entire stack: setting the <b>entire</b> configuration of all the layers is considered a single action.

- Multistep (Horizon == Number of layers) - Each timestep, the action is to set the configuration of the next layer.  Layer 1 is set at t=1, layer 2 is set at t=2, given the configuration of layer 1, layer 3 is set at t=3 given the configurations of layers 1 and 2, etc...  Under this approach, reward can be calculated two ways:

    - Delayed reward.  The reward is 0 at each layer until the final configuration is determined.  Then the overall reward is calculated.  This approach tends to converge more slowly, as the reward after the last step has to be propagated back to determine the value of the decisions made for the initial layers (i.e., how much did the first layer affect the final outcome?).  This is known as the credit assignment problem.  It is analogous to using RL to solve a maze where the only reward is given at the very end, when the goal is reached.  The information then has to be propagated back to the initial steps to figure out the best actions to take.
  
    - Incremental reward - Calculate a reward at each layer based on some heuristic, for example, the cost and performance of the materials of the layer you just configured.  The heuristic typically requires some domain knowledge, as you are essentially already suggesting a path.  (The heuristic ends up acting like the value function.)  This information can lead to a result faster, but if the heuristic is poor it risks missing the best solution, getting stuck in a local optimum, or taking longer to find the optimal solution.  Using the maze analogy, this would be applying a reward at each step based on the decrease in Manhattan distance to the goal.  Encouraging actions that get closer to the goal in the short term may find a solution faster, but it makes it harder to find paths that have to move further away first yet are better overall.  

- RL is used in this problem as an optimizer.  The goal is to explore the space of configurations to find the best configuration under the objective function.  This differs from a typical training approach, which searches model parameter space to find the set of parameters that performs best over a large training set.

- The exploration conducted via RL algorithms (trying random actions) will be used to explore the configuration space.  
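Because RL is acting as an optimizer over a finite configuration space, it helps to know how large that space is. The sketch below is a minimal illustration, assuming the per-component option counts of the sample problem defined later in this notebook. `toy_score` is a hypothetical stand-in for the real objective and is not the reward used by the environment. For a space this small, exhaustive enumeration could serve as a baseline to check what the agent finds.

```python
import itertools
import math

# Options per component, in layer order (assumed from the sample problem):
# layer 0 (T1, T2, I1), layer 1 (T2, I2), layer 2 (T1, T3, I3)
max_sizes = [3, 2, 2, 2, 2, 3, 3, 3]

# Total number of distinct full-stack configurations
n_configs = math.prod(max_sizes)

def toy_score(config):
    """Hypothetical objective that prefers low option indices.
    Replace with the real reward when using this as a baseline."""
    return -sum(config)

# Enumerate every configuration and keep the best one under the toy objective
all_configs = itertools.product(*(range(n) for n in max_sizes))
best_config = max(all_configs, key=toy_score)

print("number of configurations:", n_configs)
print("best configuration under toy_score:", best_config)
```

Enumeration stops being practical as layers and material options grow, which is where a learned search policy becomes more attractive.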

For this implementation, we will use a delayed reward implementation.  The reward is only calculated and returned after the final layer of the environment is calculated.
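To make the credit assignment point concrete, the sketch below computes the discounted return seen at each step of a three-layer episode when the only non-zero reward arrives at the end. It is an illustration only: the terminal reward of 10.0 and the discount factor of 0.9 are made-up values, and `returns_to_go` is a hypothetical helper, not part of the environment or of any RL library.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Delayed reward: zero for every layer except the last one.
# The terminal value is a made-up number for illustration.
delayed_rewards = [0.0, 0.0, 10.0]
delayed_returns = returns_to_go(delayed_rewards, gamma=0.9)
print(delayed_returns)
```

The early steps only receive a learning signal through the discounted terminal reward, which is why the first-layer decisions are the slowest to be evaluated.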

<h3>Action Space</h3>

The action space is a 1-D array of integers where each position selects a material for one component (a thermal region or the insulator) of one layer.  The array has the same length at every step; at each step the environment reads only the entries it needs for the layer being configured and ignores the rest. 

action = [<i>layer0_config, layer1_config, layer2_config</i>]
<pre>
# Used by the RL library to select the actions
max_sizes = [3,2,2,    # layer 0 (T1, T2, I1)
             2,2,      # layer 1 (T2, I2)
             3,3,3]    # layer 2 (T1, T3, I3)

example_action = [1,0,0,1,0,2,1,0]
</pre>
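One way to read this layout is to split the flat vector into one slice per layer using running offsets. The sketch below is a minimal illustration under that assumption; `split_action` and `components_per_layer` are hypothetical names and are not part of the environment.

```python
# Number of components (thermal regions plus one insulator) per layer,
# matching the max_sizes layout above: 3, 2 and 3 entries.
components_per_layer = [3, 2, 3]

def split_action(action, components_per_layer):
    """Split a flat action vector into one list of choices per layer."""
    per_layer = []
    offset = 0
    for n in components_per_layer:
        per_layer.append(list(action[offset:offset + n]))
        offset += n
    return per_layer

example_action = [1, 0, 0, 1, 0, 2, 1, 0]
layer_actions = split_action(example_action, components_per_layer)
print(layer_actions)
```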


<h3>Observation Space</h3>

The observation space includes information about the current layer which is about to be set.  This includes:
- The layer thickness
- The number of thermal material options per thermal region
- The number of insulating material options
- The pattern of the layer

The observation at any time step is only for the layer we are about to configure.

<pre>
np.array([
    0.5,   # thickness
    3.0,   # number of options for T1
    2.0,   # number of options for T2
    2.0,   # number of insulating options
    -1., -1., -1., -1.,   # Pattern for that layer
    -1.,  0.,  0., -1.,
    -1.,  1.,  1., -1.,
    -1., -1., -1., -1.
], dtype=np.float32)
</pre>

(Note:  Some of these variables are included to simplify operations in the environment, e.g. the number of options is used as a mask when reading the action array.  As each layer is treated independently, most of the observation space could be calculated on demand from the problem specification.)



<h2>Sample Problem</h2>
We will define a set of candidate insulating and thermal (conducting) materials, each with its thermal conductivity (k) and cost.  Then, for each layer, we will specify which materials can be used and the pattern of the materials on the layer. 

In [1]:
import numpy as np

def make_sample_problem():
    """
    Create a semiconductor-package RL problem with
    explicit -1 cells for insulating regions.
    """
    return {
        "material_library": {
            # Insulators
            "d1": {"k": 0.2,  "cost": 1.0},
            "d2": {"k": 0.05, "cost": 0.6},
            "d3": {"k": 0.15, "cost": 1.2},
            "d4": {"k": 0.03, "cost": 0.5},
            # Conductors
            "c1": {"k": 200.0, "cost": 20.0},
            "c2": {"k": 50.0,  "cost": 8.0},
            "c3": {"k": 10.0,  "cost": 2.0},
            "c4": {"k": 5.0,   "cost": 1.0},
            "c5": {"k": 100.0, "cost": 12.0},
        },

        "layers": [
            {
                "thickness": 0.5,
                "insulating_type": "I1",
                "insulating_options": ["d2", "d4"],
                "thermal_regions": ["T1", "T2"],
                "thermal_options": {
                    "T1": ["c1", "c2", "c3"],
                    "T2": ["c2", "c4"],
                },
                # -1 marks insulating separator cells
                "pattern": np.array([
                    [-1, -1, -1, -1],
                    [-1,  0,  0, -1],
                    [-1,  1,  1, -1],
                    [-1, -1, -1, -1],
                ], dtype=np.int8),
            },
            {
                "thickness": 0.3,
                "insulating_type": "I2",
                "insulating_options": ["d1", "d2"],
                "thermal_regions": ["T2"],
                "thermal_options": {
                    "T2": ["c2", "c4"],
                },
                "pattern": np.array([
                    [-1, -1, -1, -1],
                    [-1,  0,  0, -1],
                    [-1,  0,  0, -1],
                    [-1, -1, -1, -1],
                ], dtype=np.int8),
            },
            {
                "thickness": 0.2,
                "insulating_type": "I3",
                "insulating_options": ["d1", "d3", "d4"],
                "thermal_regions": ["T1", "T3"],
                "thermal_options": {
                    "T1": ["c1", "c2", "c3"],
                    "T3": ["c1", "c3", "c5"],
                },
                "pattern": np.array([
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                ], dtype=np.int8),
            },
        ],
    }


In [2]:
make_sample_problem()

{'material_library': {'d1': {'k': 0.2, 'cost': 1.0},
  'd2': {'k': 0.05, 'cost': 0.6},
  'd3': {'k': 0.15, 'cost': 1.2},
  'd4': {'k': 0.03, 'cost': 0.5},
  'c1': {'k': 200.0, 'cost': 20.0},
  'c2': {'k': 50.0, 'cost': 8.0},
  'c3': {'k': 10.0, 'cost': 2.0},
  'c4': {'k': 5.0, 'cost': 1.0},
  'c5': {'k': 100.0, 'cost': 12.0}},
 'layers': [{'thickness': 0.5,
   'insulating_type': 'I1',
   'insulating_options': ['d2', 'd4'],
   'thermal_regions': ['T1', 'T2'],
   'thermal_options': {'T1': ['c1', 'c2', 'c3'], 'T2': ['c2', 'c4']},
   'pattern': array([[-1, -1, -1, -1],
          [-1,  0,  0, -1],
          [-1,  1,  1, -1],
          [-1, -1, -1, -1]], dtype=int8)},
  {'thickness': 0.3,
   'insulating_type': 'I2',
   'insulating_options': ['d1', 'd2'],
   'thermal_regions': ['T2'],
   'thermal_options': {'T2': ['c2', 'c4']},
   'pattern': array([[-1, -1, -1, -1],
          [-1,  0,  0, -1],
          [-1,  0,  0, -1],
          [-1, -1, -1, -1]], dtype=int8)},
  {'thickness': 0.2,
   'insu

<h2>The Environment</h2>
This section encodes the assumptions about the dynamics that are used to build the simulation environment in Gymnasium.  

The objective function should weigh the thermal performance against the cost.

This is where the core logic really lives. 
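As a compact summary of what `_compute_reward` in the next cell calculates, the reward can be written as follows. The symbols are introduced here for illustration only: $w_{\ell,m}$ is the weight the code gives material $m$ in layer $\ell$ based on the share of pattern cells it covers, $k_m$ and $c_m$ are the conductivity and cost from the material library, and $t_\ell$ is the layer thickness.

$$
k_\ell = \sum_{m} w_{\ell,m}\, k_m, \qquad c_\ell = \sum_{m} w_{\ell,m}\, c_m
$$

$$
r = \left( \sum_{\ell} \frac{t_\ell}{k_\ell} \right)^{-1} - 0.1 \sum_{\ell} c_\ell
$$

Within a layer the materials are mixed as parallel paths (a weighted average of $k$), while the layers themselves are combined in series along z (their resistances $t_\ell / k_\ell$ add). The factor 0.1 is the cost weight hard-coded in the environment.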

In [3]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class SemiconductorEnvV2(gym.Env):
    metadata = {"render_modes": []}

    def __init__(self, problem, obs_length=10):
        super().__init__()
        self.problem = problem

        # Keep a reference to the layers from the problem
        self.layers = problem["layers"]

        # Get the materials library
        self.material_library = problem["material_library"]

        # Get the number of layers
        self.n_layers = len(self.layers)

        # Determine pattern size (assume all layers share the same grid size)
        self.pattern_shape = self.layers[0]["pattern"].shape
        self.pattern_len   = self.pattern_shape[0] * self.pattern_shape[1]

        # Observation: info about the materials + flattened pattern.
        # Note: the obs_length argument is ignored; the length is recomputed here to guarantee enough room:
        # 1 thickness + 1 insulating option count + up to N thermal option counts + pattern_len
        self.obs_length = 1 + 1 + max(len(l["thermal_regions"]) for l in self.layers) \
                            + self.pattern_len

        self.observation_space = spaces.Box(
            low=0.0, high=1000.0, shape=(self.obs_length,), dtype=np.float32
        )

        # Max-size action space covering all layers
        # This is the maximum value of each entry.  How many different materials can this component select from?
        max_sizes = []
        for layer in self.layers:
            for r in layer["thermal_regions"]:
                max_sizes.append(len(layer["thermal_options"][r]))
            max_sizes.append(len(layer["insulating_options"]))

        # The action space is a MultiDiscrete array, i.e. one value is assigned to each insulating and thermal component.
        # This will be used by the training algorithms to select actions as part of exploration/learning
        self.action_space = spaces.MultiDiscrete(max_sizes)

        # Initialize the current layer and the current selected materials
        self.current_layer = 0
        self.selected_materials = []

    # Reset the problem
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_layer = 0
        self.selected_materials = []
        return self._get_obs(), {}

    # Get the observation.  This is really just the state of the world.  (No partial observability here.)
    # The observation describes only the next layer we need to configure.
    # Values in the observation vector include:
    # - Thickness
    # - Number of values each material can take (used for filtering out the action vector)
    # - The pattern of materials on the layer. Flattened.
    def _get_obs(self):
        layer = self.layers[self.current_layer]

        # Basic numeric info
        obs = [layer["thickness"]]
        for r in layer["thermal_regions"]:
            obs.append(len(layer["thermal_options"][r]))
        obs.append(len(layer["insulating_options"]))

        # Flatten pattern and normalize to [0,1] (convert -1 to 0 for padding/insulator indicator)
        pat = layer["pattern"].astype(np.float32)
        pat = np.where(pat < 0, 0, pat)  # -1 -> 0
        pat = pat / (pat.max() if pat.max() > 0 else 1.0)
        obs.extend(pat.flatten())

        # Zero-pad to full length if needed
        if len(obs) < self.obs_length:
            obs += [0.0] * (self.obs_length - len(obs))
        return np.array(obs, dtype=np.float32)

    # Define the step function describing how the environment operates. 
    # Treat this problem as a multistep learning problem.  Each layer, starting at the bottom, is a step.
    # Assign values at that layer, then step to the next layer.
    # This fundamentally defines the problem and the structure of the learning. 
    def step(self, action):
        if isinstance(action, np.ndarray) and action.ndim > 1:
            action = action.flatten()

        layer = self.layers[self.current_layer]
        chosen = {}
        num_regions = len(layer["thermal_regions"])
        num_options = num_regions + 1

        # The action vector concatenates the entries for every layer (see max_sizes in __init__).
        # Slice out only the entries that belong to the current layer; the rest are ignored at this step.
        offset = sum(len(l["thermal_regions"]) + 1 for l in self.layers[:self.current_layer])
        action = action[offset:offset + num_options]

        # Choose thermal materials
        for idx, r in enumerate(layer["thermal_regions"]):
            max_idx = len(layer["thermal_options"][r]) - 1
            chosen[r] = layer["thermal_options"][r][int(np.clip(action[idx], 0, max_idx))]

        # Choose insulating material
        max_idx = len(layer["insulating_options"]) - 1
        chosen[layer["insulating_type"]] = layer["insulating_options"][int(np.clip(action[num_regions], 0, max_idx))]

        self.selected_materials.append(chosen)
        self.current_layer += 1

        # Determine if we have reached the last layer
        done = self.current_layer >= self.n_layers

        # If so, calculate the reward. Otherwise, reward is just 0.  This is the delayed reward approach
        reward = self._compute_reward() if done else 0.0
        obs = self._get_obs() if not done else np.zeros(self.obs_length, dtype=np.float32)
        return obs, reward, done, False, {}

    
    # Goal is to maximize thermal conductivity in z-direction, while minimizing the cost of the materials used
    def _compute_reward(self):

        # Accumulated thermal resistance (thickness / k) of the layers stacked in series
        total_k_inv = 0.0

        # The total cost
        total_cost = 0.0

        # For each layer, calculate the reward of that layer
        for layer, chosen in zip(self.layers, self.selected_materials):
            pattern = layer["pattern"]
            total_cells = pattern.size  # every cell of the layer, thermal regions and insulator alike

            # accumulate weighted conductivity and cost
            layer_k = 0.0
            layer_cost = 0.0

            # Sum up the cost and thermal impact from the thermal regions,
            # weighted by the fraction of the whole layout they take up
            for r_idx, region in enumerate(layer["thermal_regions"]):
                region_cells = np.sum(pattern == r_idx)
                frac = region_cells / total_cells if total_cells > 0 else 0
                mat = chosen[region]
                layer_k += frac * self.material_library[mat]["k"]
                layer_cost += frac * self.material_library[mat]["cost"]

            # Same thing for the insulating material, using the same whole-grid denominator
            # so that the fractions within a layer sum to 1
            ins_cells = np.sum(pattern == -1)
            frac_ins = ins_cells / total_cells if total_cells > 0 else 0
            ins_mat = chosen[layer["insulating_type"]]
            layer_k += frac_ins * self.material_library[ins_mat]["k"]
            layer_cost += frac_ins * self.material_library[ins_mat]["cost"]

            # Account for the thickness of the layer
            k_eff = layer_k
            total_k_inv += layer['thickness'] / k_eff if k_eff > 0 else np.inf
            total_cost += layer_cost

        K_total = 1.0 / total_k_inv if total_k_inv > 0 else 0.0
        return K_total - 0.1 * total_cost


<h2>Training Loop</h2>

Given an environment, the training loop is relatively straightforward.  We will use the PPO implementation from Stable-Baselines3 to conduct the search.

In [4]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

problem = make_sample_problem()
env = DummyVecEnv([lambda: SemiconductorEnvV2(problem)])  # uses new class

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=5000)

raw_env = env.envs[0]
obs = raw_env.reset()[0]
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = raw_env.step(action)

print("Selected materials per layer:", raw_env.selected_materials)


Using cuda device




-----------------------------
| time/              |      |
|    fps             | 630  |
|    iterations      | 1    |
|    time_elapsed    | 3    |
|    total_timesteps | 2048 |
-----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 551        |
|    iterations           | 2          |
|    time_elapsed         | 7          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.02599395 |
|    clip_fraction        | 0.425      |
|    clip_range           | 0.2        |
|    entropy_loss         | -7.14      |
|    explained_variance   | 0.000477   |
|    learning_rate        | 0.0003     |
|    loss                 | 161        |
|    n_updates            | 10         |
|    policy_gradient_loss | -0.0617    |
|    value_loss           | 610        |
----------------------------------------
-----------------------------------------
| time/   

<h2>Next Steps</h2>

- A preference could be added to each of the materials, since more than cost may be involved in the choice.  This could give the RL system more of an interactive component: solutions could be presented to the customer to rank, and the preference would be learned (RLHF-style).

- The current environment model does not consider the interactions between the layers.  These could be added to SemiconductorEnvV2 to get more realistic behavior of the system.

- Experiments should be run using Single Step RL (configure all layers as a single action) and Multistep w/ Incremental Reward to compare the performance.
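As a starting point for the single-step comparison, the sketch below shows what a horizon-1 environment could look like. It is a plain Python class that only mirrors the `reset`/`step` return shapes of Gymnasium; it is not a `gym.Env` subclass and declares no spaces, so it would need to be adapted before being handed to a training library. `SingleStepStackEnv` and `score_fn` are hypothetical names, and the toy score used in the usage lines is not the environment's reward.

```python
import random

class SingleStepStackEnv:
    """Horizon-1 sketch: one action sets the configuration of every layer."""

    def __init__(self, max_sizes, score_fn):
        self.max_sizes = list(max_sizes)   # number of options per component
        self.score_fn = score_fn           # maps a full configuration to a reward

    def reset(self):
        # Nothing changes between episodes, so the observation is a constant placeholder
        return [0.0], {}

    def sample_action(self, rng=random):
        # Uniformly random full-stack configuration
        return [rng.randrange(n) for n in self.max_sizes]

    def step(self, action):
        if len(action) != len(self.max_sizes):
            raise ValueError("action length does not match the number of components")
        if any(not 0 <= a < n for a, n in zip(action, self.max_sizes)):
            raise ValueError("action entry out of range")
        reward = self.score_fn(action)
        # terminated=True: the episode ends after this single action
        return [0.0], reward, True, False, {}

# Usage with a made-up score that prefers low option indices
env = SingleStepStackEnv([3, 2, 2, 2, 2, 3, 3, 3], lambda a: -sum(a))
obs, info = env.reset()
action = env.sample_action(random.Random(0))
obs, reward, terminated, truncated, info = env.step(action)
print(action, reward, terminated)
```

With a single step per episode there is no credit assignment across layers, at the price of a much larger joint action for the policy to explore.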