### Learning Spectrum Availability using Reinforcement Learning
We use a reinforcement learning technique known as Q-Learning to learn spectrum availability in a predefined spectrum environment. The assumptions about the spectrum environment are:
- The spectrum environment consists of slots.
- Each slot corresponds to a state.
- A transmitter in a slot may decide to transmit or stay idle.
- There is an associated penalty for not transmitting.
- Transmitting with no collision earns a reward.
- Transmitting with collision results in a penalty.
- A round of transmit-or-idle decisions across slots is termed an episode.
- An episode reaches its terminal state when a transmit action results in a collision.
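
For reference, the standard Q-Learning update that the `train` method applies can be written as follows. Here $\alpha$ corresponds to `learning_rate`, $\gamma$ to `discount_factor`, $s_t$ to the current slot, $a_t$ to the chosen action (idle or transmit), and $r_t$ to the reward or penalty from the list above:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$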





In [75]:
import numpy as np

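NOTRANSMIT_PENALTY = -1  # reward for staying idle in a slot
COLLISION_PENALTY = -5   # reward for transmitting in a busy slot
TRANSMIT_REWARD = 10     # reward for transmitting in a free slot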

class ChannelSpectrumEnv:
    def __init__(self, num_slots, busy_prob):
        self.num_slots = num_slots
        self.busy_prob = busy_prob
        self.state = 0
        self.actions = [0, 1]  # 0 = stay idle, 1 = transmit
        self.reward_table = np.zeros((num_slots, 2))
        self.q_table = np.zeros((num_slots,2))

        np.random.seed(42)
        
        for i in range(self.num_slots):
            self.reward_table[i][0] = NOTRANSMIT_PENALTY
            self.reward_table[i][1] = TRANSMIT_REWARD if np.random.uniform(0,1) > busy_prob else COLLISION_PENALTY

        # print(self.reward_table)

    def reset(self):
        self.state = 0
        return self.state
    
    def step(self, action):
        done = False
        reward = self.reward_table[self.state][action]

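        if action == 1 and reward == COLLISION_PENALTY:
            # collision: the episode ends and the transmitter stays in this slot
            done = True
            next_state = self.state
        else:
            # otherwise advance to the next slot, wrapping around after the last one
            next_state = (self.state + 1) % self.num_slots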

        self.state = next_state
        return next_state, reward, done
    
    def get_optimal_policy(self):
        return np.argmax(self.q_table, axis=1)
    

    def train(self, episodes, learning_rate=0.01, discount_factor=0.95, epsilon=0.1):
        for episode in range(episodes):
            state = self.reset()
            done = False
            while not done:
                # exploration vs exploitation
                if np.random.uniform(0,1) < epsilon:
                    action = np.random.choice(self.actions)
                else:
                    action = np.argmax(self.q_table[state])
                
                # take a step
                next_state, reward, done = self.step(action)

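                # update the Q-table using the Q-Learning equation; note that the
                # bootstrap term is also applied on the terminal (collision) step
                td_target = reward + discount_factor * np.max(self.q_table[next_state])
                self.q_table[state][action] += learning_rate * (td_target - self.q_table[state][action])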

                state = next_state
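
The update in `train` adds the discounted value of `next_state` even when the step ends the episode. A common variant drops that bootstrap term on terminal transitions, so a collision is valued at its penalty alone. The sketch below illustrates that variant; `q_update` is a hypothetical helper that is not part of the class above, and switching to it would change the Q-values the notebook prints.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, done,
             learning_rate=0.01, discount_factor=0.95):
    """One Q-Learning update that does not bootstrap from a terminal step.

    Hypothetical helper for illustration; it modifies q_table in place.
    """
    # On a terminal transition there is no future return to add.
    future = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + discount_factor * future
    q_table[state, action] += learning_rate * (td_target - q_table[state, action])
    return q_table

# Toy 2-slot table: a collision in slot 0 moves Q[0, 1] toward -5 only.
q = np.zeros((2, 2))
q_update(q, state=0, action=1, reward=-5, next_state=0, done=True, learning_rate=0.5)
print(q[0, 1])
```

With this variant the Q-value of transmitting in a busy slot converges toward `COLLISION_PENALTY` itself rather than toward the penalty plus the discounted value of staying in that slot.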


In [76]:
# Example usage:
num_slots = 10
busy_prob = 0.3
episodes = 10000

env = ChannelSpectrumEnv(num_slots, busy_prob)
env.train(episodes)

print("Optimal policy:", env.get_optimal_policy())
print(f"Q-table: {env.q_table}")
print(f'Rewards table: {env.reward_table}')

Optimal policy: [1 1 1 1 0 0 0 1 1 1]
Q-table: [[125.30819191 136.30819191]
 [121.95599148 132.95599148]
 [118.42735946 129.42735946]
 [114.71300996 125.71300996]
 [121.80316837 110.71300996]
 [129.26649303 117.80316837]
 [137.12262424 125.26649303]
 [134.39223604 145.39223604]
 [131.5181432  142.5181432 ]
 [128.49278231 139.49278232]]
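
One way to read the printed Q-table is to check that each entry equals its own one-step target, reward plus 0.95 times the best Q-value of the next slot. The sketch below copies the printed values and computes that residual. The set of busy slots, `{4, 5, 6}`, is an assumption inferred from the printed policy, since the rewards table output is not shown; `one_step_target` is a hypothetical helper that mirrors the `step` logic, including the fact that a collision keeps the transmitter in the same slot.

```python
import numpy as np

# Q-values copied from the printed output above.
q_table = np.array([
    [125.30819191, 136.30819191],
    [121.95599148, 132.95599148],
    [118.42735946, 129.42735946],
    [114.71300996, 125.71300996],
    [121.80316837, 110.71300996],
    [129.26649303, 117.80316837],
    [137.12262424, 125.26649303],
    [134.39223604, 145.39223604],
    [131.5181432, 142.5181432],
    [128.49278231, 139.49278232],
])

DISCOUNT = 0.95
BUSY_SLOTS = {4, 5, 6}  # assumption: inferred from the printed policy

def one_step_target(q, state, action):
    """Reward plus discounted best next-slot value, mirroring step()."""
    if action == 1 and state in BUSY_SLOTS:
        reward, next_state = -5, state          # collision: stay in the slot
    else:
        reward = 10 if action == 1 else -1      # free transmit or idle
        next_state = (state + 1) % len(q)       # advance with wrap-around
    return reward + DISCOUNT * q[next_state].max()

targets = np.array([[one_step_target(q_table, s, a) for a in (0, 1)]
                    for s in range(len(q_table))])
max_residual = np.abs(q_table - targets).max()
greedy_policy = np.argmax(q_table, axis=1)

print("Largest residual:", max_residual)
print("Greedy policy:", greedy_policy)
```

Under that assumption the residual is at the level of the print rounding, which suggests the table has converged to the fixed point of the update as implemented, and the greedy policy stays idle exactly in the slots assumed busy.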
