<h2>What’s the Problem? </h2>

Semiconductor packages use a variety of insulating and conducting materials that are patterned in a layered structure. Those materials have different thermal conductivities and depending on their pattern within each layer, they will contribute differently to the overall effective thermal conductivity of the package. 

A customer is interested in choosing the combination of materials per layer that maximizes the thermal conductivity in the “z” direction (shown in the figure below) and minimizes the materials cost, subject to constraints on the available choices of material.

<h3>The Ask</h3>

Train an RL agent from scratch using Ray-RLLib that chooses the “optimal” set of materials per layer on a semiconductor package layout, given information about the cost per unit weight and available choices of materials per layer per example, and details about the spatial pattern of conducting and insulating materials in each layer.

<h3>What We’ll Look for</h3>

- Resourcefulness - were you able to come up with a self-consistent toy environment on which to train your RL agent? How did you inform your environment design by pulling relevant domain expertise?
- Creativity - this problem is intentionally pretty open but quite technical, which mirrors most problems at Vinci. Did you come up with a novel way of solving it? Where were you blocked and how did you get around those blockers?
- Communication - How clearly can you summarize your approach, the solution, and its implicit and chosen shortcomings?

<h3>Additional Details</h3>

You will need to construct a custom RL environment that matches the problem statement. We don’t expect you to write an RL algorithm from scratch. Feel free to use out-of-the-box algorithm(s) from RLLib. You should think about how the optimal actions of the RL agent would be consumed downstream: what would serving this model in a product look like?

<h2> Approach </h2>
- There are two approaches to using RL to handle this problem.

- Single Step (Finite horizon of 1) - Setting the *entire* configuration of all the layers is considered a single action, i.e. each action is an operation over the entire stack.  At each step of RL, the "action" perturbs one (or potentially more) parameters, and the evaluation is run to find the reward.  The reward is assumed to be deterministic.

- Multistep (Horizon == Number of layers) - Each timestep, the action is setting the configuration of the next layer.  Layer 1 is set at t=1, layer 2 is set at t=2, given the configuration of layer 1, layer 3 is set at t=3 given the configurations of layers 1 and 2, etc...  Under this approach, reward can be calculated two ways:

    - Delayed reward - The reward is 0 at each layer until the final configuration is determined, and only then is the overall reward calculated.  This approach tends to converge more slowly because the reward after the last step has to be propagated back to the decisions made for the initial layers (i.e. how much did the first layer affect the final outcome?).  This is known as the credit assignment problem.  It is analogous to using RL to solve a maze where the only reward is given at the very end, when the goal is reached.  That information then has to be propagated back to the initial steps to figure out the best actions to take.
  
    - Incremental reward - Calculate a reward at each layer based on some heuristic, for example the cost and performance of the materials of the layer that was just configured.  The heuristic typically requires some domain knowledge, as you are essentially already suggesting a path.  (The heuristic ends up acting like a value function.)  This information can lead to a result faster, but it risks missing the best solution if the heuristic is poor, and the search can get stuck in local optima.  In the maze analogy, this would be a reward at each step based on the decrease in Manhattan distance to the goal.  Encouraging actions that get closer to the goal may find a solution faster, but it makes it harder to find paths that must first move further away yet are better overall.

- RL is used here as an optimizer.  The goal is to explore the space of configurations to find the best one under the objective function.  This differs from a typical training approach, which searches model parameter space for the parameters that perform best over a large training set.

- The exploration conducted via RL algorithms (trying random actions) will be used to explore the configuration space.  

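For this implementation, we will use the multistep formulation.  As written, the environment below assigns the reward only after the last layer is configured, i.e. the delayed-reward variant; adding a per-layer heuristic reward in `step` would turn it into the incremental-reward variant.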
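
As a minimal sketch of the incremental-reward idea, the hypothetical helper below scores a single layer as soon as its materials are chosen. The name `layer_reward`, the `cost_weight` value, and the use of the negative layer resistance as the per-layer signal are assumptions made for illustration; they are not part of the environment defined later.

```python
def layer_reward(thickness, chosen_materials, material_library, cost_weight=0.1):
    """Hypothetical per-layer heuristic: penalize thermal resistance and cost.

    chosen_materials: list of material names picked for this layer.
    """
    ks = [material_library[m]["k"] for m in chosen_materials]
    cost = sum(material_library[m]["cost"] for m in chosen_materials)
    k_eff = sum(ks) / len(ks)        # unweighted mean conductivity of the layer
    resistance = thickness / k_eff   # this layer's contribution to the series stack
    return -resistance - cost_weight * cost


# Small illustrative library (values taken from the sample problem)
library = {
    "d2": {"k": 0.05, "cost": 0.6},
    "c1": {"k": 200.0, "cost": 20.0},
    "c3": {"k": 10.0, "cost": 2.0},
}

r_expensive = layer_reward(0.5, ["c1", "d2"], library)
r_cheap = layer_reward(0.5, ["c3", "d2"], library)
print(r_expensive, r_cheap)
```

Because the overall objective depends on the inverse of the summed resistance, the sum of these per-layer terms is only a proxy for it, which is exactly the risk of a heuristic reward described above.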

<h3>Action Space</h3>

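The action space is a 1-D array of integers (a `MultiDiscrete` space) with one entry per material slot, i.e. one entry for each thermal region plus one for the insulator in every layer.  At each step, only the entries needed for the current layer are read, and each is clipped to the valid option range for that layer; the remaining entries are ignored.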
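
The sketch below shows one way the integer entries for a single layer could be decoded into material names, assuming each index is clipped into that layer's valid range. `decode_layer_action`, its arguments, and the `"insulator"` key are hypothetical names used for illustration only.

```python
def decode_layer_action(action, thermal_regions, thermal_options, insulating_options):
    """Map one layer's action entries to material names, clipping invalid indices."""
    chosen = {}
    for i, region in enumerate(thermal_regions):
        options = thermal_options[region]
        idx = min(max(int(action[i]), 0), len(options) - 1)  # clip to a valid index
        chosen[region] = options[idx]
    ins_pos = len(thermal_regions)  # the insulator entry follows the thermal regions
    ins_idx = min(max(int(action[ins_pos]), 0), len(insulating_options) - 1)
    chosen["insulator"] = insulating_options[ins_idx]
    return chosen


chosen = decode_layer_action(
    action=[2, 5, 0, 1, 1],  # entries beyond the first three are ignored here
    thermal_regions=["T1", "T2"],
    thermal_options={"T1": ["c1", "c2", "c3"], "T2": ["c2", "c4"]},
    insulating_options=["d2", "d4"],
)
print(chosen)
```

Clipping keeps every action valid, at the price of mapping several out-of-range indices onto the same material.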

<h3>Observation Space</h3>

The observation space includes information about the current layer, which is about to be set: its thickness, the number of material options for each thermal region, and the number of insulating material options.  The vector is zero-padded to a fixed length.
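
A minimal sketch of such a per-layer observation follows, assuming it is encoded as the layer thickness followed by the number of available options per thermal region and for the insulator, zero-padded to a fixed length. The helper name `encode_layer` and its parameters are hypothetical.

```python
def encode_layer(thickness, thermal_option_counts, n_insulating_options, obs_length=10):
    """Build a fixed-length observation: thickness, option counts, zero padding."""
    obs = [float(thickness)]
    obs += [float(n) for n in thermal_option_counts]
    obs.append(float(n_insulating_options))
    if len(obs) > obs_length:
        raise ValueError("layer does not fit in obs_length")
    return obs + [0.0] * (obs_length - len(obs))


obs = encode_layer(0.5, [3, 2], 2, obs_length=6)
print(obs)  # [0.5, 3.0, 2.0, 2.0, 0.0, 0.0]
```

Raising on overflow makes a too-small `obs_length` visible immediately rather than letting it surface later as a shape mismatch.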


<h2>Sample Problem</h2>
We will define a set of candidate materials for the insulators and the thermal materials, each with its conductive property and cost.  Then, for each layer, we will specify which materials can be used, along with the topology of the materials on the layer.

In [None]:
import numpy as np  # the only dependency this cell needs

def make_sample_problem():
    """
    Create a semiconductor-package RL problem with
    explicit -1 cells for insulating regions.
    """
    return {
        "material_library": {
            # Insulators
            "d1": {"k": 0.2,  "cost": 1.0},
            "d2": {"k": 0.05, "cost": 0.6},
            "d3": {"k": 0.15, "cost": 1.2},
            "d4": {"k": 0.03, "cost": 0.5},
            # Conductors
            "c1": {"k": 200.0, "cost": 20.0},
            "c2": {"k": 50.0,  "cost": 8.0},
            "c3": {"k": 10.0,  "cost": 2.0},
            "c4": {"k": 5.0,   "cost": 1.0},
            "c5": {"k": 100.0, "cost": 12.0},
        },

        "layers": [
            {
                "thickness": 0.5,
                "insulating_type": "I1",
                "insulating_options": ["d2", "d4"],
                "thermal_regions": ["T1", "T2"],
                "thermal_options": {
                    "T1": ["c1", "c2", "c3"],
                    "T2": ["c2", "c4"],
                },
                # -1 marks insulating separator cells
                "pattern": np.array([
                    [-1, -1, -1, -1],
                    [-1,  0,  0, -1],
                    [-1,  1,  1, -1],
                    [-1, -1, -1, -1],
                ], dtype=np.int8),
            },
            {
                "thickness": 0.3,
                "insulating_type": "I2",
                "insulating_options": ["d1", "d2"],
                "thermal_regions": ["T2"],
                "thermal_options": {
                    "T2": ["c2", "c4"],
                },
                "pattern": np.array([
                    [-1, -1, -1, -1],
                    [-1,  0,  0, -1],
                    [-1,  0,  0, -1],
                    [-1, -1, -1, -1],
                ], dtype=np.int8),
            },
            {
                "thickness": 0.2,
                "insulating_type": "I3",
                "insulating_options": ["d1", "d3", "d4"],
                "thermal_regions": ["T1", "T3"],
                "thermal_options": {
                    "T1": ["c1", "c2", "c3"],
                    "T3": ["c1", "c3", "c5"],
                },
                "pattern": np.array([
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                    [-1,  0, -1,  1],
                ], dtype=np.int8),
            },
        ],
    }


In [None]:
make_sample_problem()
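
Since the problem is specified by hand, a quick consistency check can catch typos before training. The sketch below is one possible check, assuming pattern cells are either -1 (insulator) or an index into the layer's `thermal_regions` list; `validate_problem` is a hypothetical helper, and the `tiny` problem is deliberately broken for illustration.

```python
def validate_problem(problem):
    """Return a list of human-readable consistency errors (empty if none)."""
    errors = []
    library = problem["material_library"]
    for i, layer in enumerate(problem["layers"]):
        regions = layer["thermal_regions"]
        for name in layer["insulating_options"]:
            if name not in library:
                errors.append(f"layer {i}: unknown insulator {name}")
        for region in regions:
            for name in layer["thermal_options"].get(region, []):
                if name not in library:
                    errors.append(f"layer {i}: unknown material {name} in {region}")
        # Assumption: cells are -1 (insulator) or an index into thermal_regions.
        for row in layer["pattern"]:
            for cell in row:
                if not (-1 <= int(cell) < len(regions)):
                    errors.append(f"layer {i}: pattern index {int(cell)} has no region")
    return errors


tiny = {
    "material_library": {"d1": {"k": 0.2, "cost": 1.0}, "c1": {"k": 200.0, "cost": 20.0}},
    "layers": [{
        "thickness": 0.5,
        "insulating_options": ["d1"],
        "thermal_regions": ["T1"],
        "thermal_options": {"T1": ["c1", "c9"]},  # "c9" is deliberately missing
        "pattern": [[-1, 0], [1, -1]],            # index 1 has no matching region
    }],
}
errors = validate_problem(tiny)
print(errors)
```

The same function should accept the output of `make_sample_problem()`, since it only iterates over the pattern rows and cells.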

<h2>The Environment</h2>
The simulation environment is built with Gymnasium and rests on simplified assumptions about the thermal dynamics of the stack.

The objective function weighs the thermal performance against the material cost.

This is where the core logic really lives. 
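
Written out, the reward computed by `_compute_reward` in the cell below can be read as the following formula. The symbols are assumptions introduced here for illustration: $t_i$ is the thickness of layer $i$, $R_i$ its set of thermal regions, $k_{i,r}$ and $k_{i,\mathrm{ins}}$ the conductivities of the chosen thermal and insulating materials, and $c_{i,r}$ and $c_{i,\mathrm{ins}}$ their costs.

```latex
k_{\mathrm{eff},i} = \frac{\sum_{r \in R_i} k_{i,r} + k_{i,\mathrm{ins}}}{|R_i| + 1},
\qquad
K = \left( \sum_{i} \frac{t_i}{k_{\mathrm{eff},i}} \right)^{-1},
\qquad
\text{reward} = K - 0.1 \sum_{i} \Big( \sum_{r \in R_i} c_{i,r} + c_{i,\mathrm{ins}} \Big)
```

Each layer is treated as an unweighted average of its chosen materials, and the layers are combined in series like thermal resistances. The spatial pattern of the layer does not enter the calculation.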

In [None]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Gym environment for this task
class SemiconductorEnv(gym.Env):
    metadata = {"render_modes": []}

    # Initialization function
    def __init__(self, problem, obs_length=10):
        super().__init__()
        self.problem = problem

        # The layers
        self.layers = problem["layers"]

        # The types of materials for the insulation and thermal components which can be selected and their properties
        self.material_library = problem["material_library"]

        # The number of layers
        self.n_layers = len(self.layers)

        # Start at layer 0 (the bottom)
        self.current_layer = 0
        self.selected_materials = []

        
        self.obs_length = obs_length
        self.observation_space = spaces.Box(
            low=0.0, high=1000.0, shape=(self.obs_length,), dtype=np.float32
        )

        # Max-size action space covering all layers
        max_sizes = []
        for layer in self.layers:
            for r in layer["thermal_regions"]:
                max_sizes.append(len(layer["thermal_options"][r]))
            max_sizes.append(len(layer["insulating_options"]))

        # The action space is defined as a multi discrete array.  ie. a value of each insulation and thermal component. 
        self.action_space = spaces.MultiDiscrete(max_sizes)

    # Reset the problem
    def reset(self, seed=None, options=None):
        # Let Gymnasium seed its internal RNG (self.np_random)
        super().reset(seed=seed)

        # Back to layer 0
        self.current_layer = 0
        self.selected_materials = []
        return self._get_obs(), {}

    # The observation is the information about the next layer we need to configure.  
    def _get_obs(self):
        layer = self.layers[self.current_layer]
        obs = [layer["thickness"]] + \
              [len(layer["thermal_options"][r]) for r in layer["thermal_regions"]] + \
              [len(layer["insulating_options"])]
        obs += [0] * (self.obs_length - len(obs))  # pad
        return np.array(obs, dtype=np.float32)


    # Treat this problem as a multistep learning problem.  Each layer, starting at the bottom, is a step.
    # Assign values at that layer, then step to the next layer.

    # This fundamentally defines the problem and the structure of the learning.
    def step(self, action):
        # Flatten if VecEnv sends 2D array
        if isinstance(action, np.ndarray) and action.ndim > 1:
            action = action.flatten()
    
        layer = self.layers[self.current_layer]
        chosen = {}
    
        num_regions = len(layer["thermal_regions"])
        num_options = num_regions + 1  # thermal + insulating
    
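        # Take this layer's slice of the flat action vector.  The slots were laid out
        # layer by layer in __init__, so skip the slots of all previous layers;
        # reading action[:num_options] instead would reuse layer 0's slot sizes and
        # leave some options of later layers unreachable.
        offset = sum(len(prev["thermal_regions"]) + 1 for prev in self.layers[:self.current_layer])
        action = action[offset:offset + num_options]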
    
        # Clip each action to layer-specific max index
        for idx, r in enumerate(layer["thermal_regions"]):
            max_idx = len(layer["thermal_options"][r]) - 1
            clipped = int(np.clip(action[idx], 0, max_idx))
            chosen[r] = layer["thermal_options"][r][clipped]
    
        # Insulating
        max_idx = len(layer["insulating_options"]) - 1
        clipped = int(np.clip(action[num_regions], 0, max_idx))
        chosen[layer["insulating_type"]] = layer["insulating_options"][clipped]

        # Set the configuration of the current layer, and move the index to the next layer for the next step.
        self.selected_materials.append(chosen)
        self.current_layer += 1

        # Return done if we are at the last layer
        done = self.current_layer >= self.n_layers

        # Only compute the reward if this is the last layer
        reward = self._compute_reward() if done else 0.0
        obs = self._get_obs() if not done else np.zeros(self.obs_length, dtype=np.float32)
        return obs, reward, done, False, {}

    # Goal is to maximize thermal conductivity in z-direction, while minimizing the cost of the materials used
    
    def _compute_reward(self):
        # Running sum of per-layer thermal resistance (thickness / k_eff)
        total_k_inv = 0.0

        # Total cost of the selected materials
        total_cost = 0.0

        # For each layer, accumulate its cost and its thermal resistance
        for layer, chosen in zip(self.layers, self.selected_materials):
            k_sum = 0.0

            # Sum up the cost and thermal impact from the thermal regions
            for r in layer['thermal_regions']:
                mat = chosen[r]
                k_sum += self.material_library[mat]['k']
                total_cost += self.material_library[mat]['cost']

            # cost and thermal impact of the insulation
            ins_mat = chosen[layer['insulating_type']]
            k_sum += self.material_library[ins_mat]['k']
            total_cost += self.material_library[ins_mat]['cost']

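            # Effective conductivity of the layer: unweighted mean over its materials
            k_eff = k_sum / (len(layer['thermal_regions']) + 1)
            # Layers act in series, so add this layer's resistance (thickness / k_eff)
            total_k_inv += layer['thickness'] / k_eff

        # Overall z-direction conductance is the inverse of the total resistance
        K_total = 1.0 / total_k_inv
        # Trade thermal performance off against cost (cost weight 0.1)
        reward = K_total - 0.1 * total_cost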
        return reward


<h2>Training Loop</h2>

Given an environment, the training loop is relatively straightforward.  We will use the PPO implementation from Stable-Baselines3 to conduct the search.

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

problem = make_sample_problem()
env = DummyVecEnv([lambda: SemiconductorEnv(problem)])  
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=5000)

# Access the raw (unwrapped) environment to roll out the trained policy
raw_env = env.envs[0]
obs, _ = raw_env.reset()  # Gymnasium-style reset returns (obs, info)
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, _, _ = raw_env.step(action)  # Gymnasium API


print("Selected materials per layer:", env.envs[0].selected_materials)
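
Because RL is used here as an optimizer, it helps to know the true optimum for small problems. The sample problem has only 12 × 4 × 27 = 1,296 configurations, so exhaustive search is feasible as a baseline. The sketch below enumerates a smaller two-layer stack; the compact `slots` layout and the names `stack_reward` and `brute_force` are hypothetical, while the scoring follows the same series-resistance and cost formula as `_compute_reward`.

```python
from itertools import product


def stack_reward(layers, library, choice, cost_weight=0.1):
    """Score one full configuration; choice[i] lists the materials for layer i."""
    total_resistance, total_cost = 0.0, 0.0
    for layer, mats in zip(layers, choice):
        k_eff = sum(library[m]["k"] for m in mats) / len(mats)  # unweighted mean
        total_resistance += layer["thickness"] / k_eff          # layers in series
        total_cost += sum(library[m]["cost"] for m in mats)
    return 1.0 / total_resistance - cost_weight * total_cost


def brute_force(layers, library):
    """Enumerate every combination of per-slot options and return the best one."""
    per_layer = [list(product(*layer["slots"])) for layer in layers]
    best = max(product(*per_layer), key=lambda c: stack_reward(layers, library, c))
    return best, stack_reward(layers, library, best)


library = {
    "d1": {"k": 0.2, "cost": 1.0}, "d2": {"k": 0.05, "cost": 0.6},
    "c3": {"k": 10.0, "cost": 2.0}, "c4": {"k": 5.0, "cost": 1.0},
}
# Hypothetical compact layout: each layer lists the options for each slot.
layers = [
    {"thickness": 0.5, "slots": [["c3", "c4"], ["d1", "d2"]]},
    {"thickness": 0.3, "slots": [["c3", "c4"], ["d1", "d2"]]},
]
best, best_reward = brute_force(layers, library)
print(best, round(best_reward, 4))
```

Comparing the reward of the configuration found by PPO against this baseline gives a direct measure of how close the agent gets to the optimum.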


<h2>Next Steps</h2>

- A preference could be added to each of the materials.  This could add more of an interactive component to the RL system: solutions could be presented to the customer to rank, and the preference would be learned (RLHF-style).

- The current environment model does not consider the actual location of the materials on the layer, or the interactions between the layers.  This could be added to `SemiconductorEnv` to get more realistic behavior of the system.

- Run experiments comparing Single Step RL (configure all layers as a single action) with the Multistep delayed-reward and incremental-reward variants.
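
As a first step toward using the spatial pattern, the layer's effective conductivity could be weighted by how many cells each material occupies. The sketch below assumes equal-area cells that conduct heat straight through the layer in parallel, with no lateral spreading; `area_weighted_k` is a hypothetical helper, not part of `SemiconductorEnv`.

```python
def area_weighted_k(pattern, region_k, insulator_k):
    """Parallel-path estimate: mean conductivity weighted by cell count.

    pattern: 2-D grid where -1 marks insulator cells and i >= 0 marks thermal region i.
    region_k: conductivity of the material chosen for each thermal region, by index.
    """
    cells = [int(cell) for row in pattern for cell in row]
    total = 0.0
    for cell in cells:
        total += insulator_k if cell == -1 else region_k[cell]
    return total / len(cells)


# Pattern of the first layer in the sample problem
pattern = [
    [-1, -1, -1, -1],
    [-1,  0,  0, -1],
    [-1,  1,  1, -1],
    [-1, -1, -1, -1],
]
k_eff = area_weighted_k(pattern, region_k=[200.0, 50.0], insulator_k=0.05)
print(k_eff)
```

With 12 of the 16 cells being insulator, the estimate is far lower than the unweighted mean of the three materials, which shows how much the pattern can change the result.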