# **Spaceship Survival Game**

*Problem Statement*: You are developing a mobile game where players control a spaceship navigating through an asteroid field. The objective is to avoid collisions with the asteroids for as long as possible. The game environment is represented as a 2D grid, where the spaceship can move up, down, left, or right.

The spaceship must weave between the asteroids and survive for as long as possible. At each step it can move up (↑), down (↓), right (→), or left (←).

*Objective*: Design a deep neural network that takes the current state of the game environment (i.e., the positions of the spaceship and asteroids on the grid) as input and outputs the optimal action (i.e., move up, down, left, or right) to maximize the spaceship's survival time.

*Additional Information*:

- The game environment is dynamic, with asteroids moving randomly across the grid.
- The spaceship's movement speed and agility are constant.
- The reward system is based on the survival time, with higher rewards for longer survival durations.
- The neural network should use function approximation to learn the optimal policy for navigating the spaceship through the asteroid field.


**Explain how the described problem could be solved using a deep neural network, and outline the action plan for creating the game environment**

Environment Setup: Design a 2D grid environment where a spaceship navigates through moving asteroids.

State Representation: Represent the game state with spaceship and asteroid positions.

Action Space: Define discrete actions for the spaceship: up, down, left, and right.

Reward System: Implement a reward function based on survival time without collisions.

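Deep Q-Network (DQN):

- Neural Network: Create a neural network that maps a state to one Q-value per action.
- Experience Replay: Store past transitions and train on random mini-batches to decorrelate updates and stabilize training.
- Target Network: Use a separate, periodically synchronized copy of the network to compute stable Q-value targets.
- Q-Learning Update: Move the predicted Q-values toward the observed reward plus the discounted best Q-value of the next state.
- Epsilon-Greedy Policy: Balance exploration and exploitation by acting randomly with probability epsilon.

Training: Interact with the environment, updating the network to maximize rewards.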

Testing: Evaluate the trained model to measure performance metrics.

Fine-tuning: Refine the model and environment iteratively for better performance.

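Following this plan gives a reasonable baseline for navigating the spaceship through the asteroid field; how well the agent performs depends on the reward design, the hyperparameters, and the amount of training.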
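To make the Q-Learning update in the plan concrete, the sketch below computes the one-step targets that the online network would be trained toward: `reward + gamma * max(Q_target(next_state))` for ongoing transitions, and just `reward` for terminal ones. The function name `td_targets`, the discount factor `gamma=0.99`, and the example Q-values are illustrative assumptions, not values taken from this project.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return one-step Q-learning targets for a batch of transitions.

    rewards:       shape (batch,)
    next_q_values: shape (batch, num_actions), as produced by the target network
    dones:         shape (batch,), True where the episode ended
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)
    best_next = np.max(next_q_values, axis=1)
    # Terminal transitions have no future value, so the bootstrap term is masked out
    return rewards + gamma * best_next * (1.0 - dones)

# Hypothetical batch of two transitions: one ongoing, one ending in a collision
rewards = [-1.0, -10.0]
next_q = np.array([[0.5, 2.0, 1.0, 0.0],
                   [3.0, 1.0, 0.0, 2.0]], dtype=np.float32)
dones = [False, True]
targets = td_targets(rewards, next_q, dones)
print(targets)
```

The second target ignores the next-state Q-values entirely because the episode has ended; forgetting this mask is a common source of inflated Q-values.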

**Implementation**

In [None]:
#Defining the game environment and creating a custom OpenAI Gym environment called SpaceShipEnv.

import gym
from gym import spaces
import numpy as np

class SpaceShipEnv(gym.Env):
    def __init__(self, grid_size=(10, 10), num_asteroids=10):
        super(SpaceShipEnv, self).__init__()

        # Define grid size
        self.grid_size = grid_size
        self.num_rows, self.num_cols = self.grid_size

        # Define action and observation spaces
        self.action_space = spaces.Discrete(4)  # 4 possible actions: up, down, left, right
        self.observation_space = spaces.Box(low=0, high=2, shape=(self.num_rows, self.num_cols), dtype=np.uint8)

        # Define other parameters
        self.num_asteroids = num_asteroids
        self.asteroids = []

        self.reset()

    def reset(self):
        # Initialize spaceship position
        self.spaceship_pos = [np.random.randint(0, self.num_rows), np.random.randint(0, self.num_cols)]

        # Initialize asteroids
        self.asteroids = []
        for _ in range(self.num_asteroids):
            asteroid_pos = [np.random.randint(0, self.num_rows), np.random.randint(0, self.num_cols)]
            while asteroid_pos == self.spaceship_pos:
                asteroid_pos = [np.random.randint(0, self.num_rows), np.random.randint(0, self.num_cols)]
            self.asteroids.append(asteroid_pos)

        # Reset step count
        self.steps = 0

        # Return initial observation
        return self._get_observation()

    def step(self, action):
        # Execute action
        if action == 0:  # Up
            self.spaceship_pos[0] = max(0, self.spaceship_pos[0] - 1)
        elif action == 1:  # Down
            self.spaceship_pos[0] = min(self.num_rows - 1, self.spaceship_pos[0] + 1)
        elif action == 2:  # Left
            self.spaceship_pos[1] = max(0, self.spaceship_pos[1] - 1)
        elif action == 3:  # Right
            self.spaceship_pos[1] = min(self.num_cols - 1, self.spaceship_pos[1] + 1)

        # Check for collision with asteroids
        reward = -1
        done = False
        for asteroid_pos in self.asteroids:
            if asteroid_pos == self.spaceship_pos:
                reward = -10
                done = True
                break

        # Update step count
        self.steps += 1

        # Return next observation, reward, done, info
        return self._get_observation(), reward, done, {}

    def _get_observation(self):
        # Create grid with spaceship and asteroids
        observation = np.zeros((self.num_rows, self.num_cols), dtype=np.uint8)
        observation[self.spaceship_pos[0], self.spaceship_pos[1]] = 1
        for asteroid_pos in self.asteroids:
            observation[asteroid_pos[0], asteroid_pos[1]] = 2

        return observation


In this environment:

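- The state space is a 2D grid in which each cell is empty (value 0), contains the spaceship (value 1), or contains an asteroid (value 2).
- The action space consists of four discrete actions: 0 for moving up, 1 for moving down, 2 for moving left, and 3 for moving right. Moves that would leave the grid are clipped at the border.
- The reward is -1 for each step taken, and -10 if the spaceship collides with an asteroid. Because every surviving step costs -1, longer episodes accumulate a lower total reward, so this signal does not yet reward survival time as the problem statement asks; a positive per-step reward would.
- The episode terminates when the spaceship collides with an asteroid.
- The asteroids stay where reset() places them; step() does not move them, so this version does not yet model the randomly moving asteroids of the problem statement.

This environment can be used for training and evaluating the spaceship agent to avoid collisions with asteroids and maximize its survival time.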
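The problem statement calls for randomly moving asteroids and a reward that grows with survival time. The following is a minimal sketch of both ideas, assuming a one-cell random walk per step and a +1 reward for each step survived. The helper names `move_asteroids` and `survival_reward` are hypothetical and are not part of `SpaceShipEnv`; they only show what could be called from `step()`.

```python
import numpy as np

def move_asteroids(asteroids, grid_size, rng):
    """Move every asteroid one cell in a random direction, clipped to the grid."""
    moves = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])  # up, down, left, right
    asteroids = np.asarray(asteroids).reshape(-1, 2)
    steps = moves[rng.integers(0, 4, size=len(asteroids))]
    upper = np.array(grid_size) - 1
    return np.clip(asteroids + steps, 0, upper)

def survival_reward(collided, step_reward=1.0, collision_penalty=-10.0):
    """Assumed reward scheme: +1 per step survived, -10 on collision."""
    return collision_penalty if collided else step_reward

rng = np.random.default_rng(0)
old_positions = np.array([[0, 0], [9, 9], [5, 5]])
new_positions = move_asteroids(old_positions, (10, 10), rng)
print(new_positions)
```

With this reward, the return of an episode increases with its length, which matches the stated objective of maximizing survival time.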

In [None]:
#Creating a class called ReplayBuffer that stores experiences (state, action, reward, next state, terminal flag) and provides methods to add experiences and sample batches of experiences for training

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
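
One practical detail when sampling from a replay buffer: `random.sample` raises a `ValueError` if asked for more items than the buffer holds, which happens during the first few steps of training. The sketch below shows a simple guard; the helper name `can_sample` and the placeholder transitions are assumptions made for illustration.

```python
import random
from collections import deque

buffer = deque(maxlen=1000)
batch_size = 32

def can_sample(buffer, batch_size):
    """True once the buffer holds enough transitions for one full batch."""
    return len(buffer) >= batch_size

# Fill the buffer with a few placeholder transitions
# (state, action, reward, next_state, done)
for step in range(10):
    buffer.append((step, 0, -1, step + 1, False))

try:
    random.sample(buffer, batch_size)
    sampled_too_early = False
except ValueError:
    # random.sample refuses to draw more items than the buffer contains
    sampled_too_early = True

print(sampled_too_early, can_sample(buffer, batch_size))
```

A training step would typically return early while `can_sample` is false and only start updating the network once enough experience has been collected.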


In [None]:
#Designing the neural network architecture for the DQN using Convolutional Neural Networks.

import tensorflow as tf

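class DQN(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__()
        # Use small 3x3 kernels with 'same' padding so the 10x10 grid keeps its resolution.
        # Atari-sized unpadded kernels (8x8 stride 4, then 4x4 stride 2) would shrink
        # a 10x10 input to 1x1 after the first layer, and the second layer would fail.
        self.conv1 = tf.keras.layers.Conv2D(32, kernel_size=3, strides=1, padding='same', activation='relu')
        self.conv2 = tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, padding='same', activation='relu')
        self.conv3 = tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, padding='same', activation='relu')
        self.flatten = tf.keras.layers.Flatten()
        self.fc1 = tf.keras.layers.Dense(512, activation='relu')
        self.fc2 = tf.keras.layers.Dense(num_actions)  # Linear output: one Q-value per action

    def call(self, inputs):
        # Expects inputs of shape (batch, rows, cols, 1); the uint8 grid is cast to float32
        x = tf.cast(inputs, tf.float32)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.flatten(x)
        x = self.fc1(x)
        q_values = self.fc2(x)
        return q_values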


In this architecture:

- We use three convolutional layers (Conv2D) with ReLU activation functions to extract features from the input game state.

- The output of the last convolutional layer is flattened (Flatten) and passed through a fully connected (Dense) layer with 512 units and a ReLU activation function.

- The final fully connected layer has no activation function and outputs one Q-value for each possible action.
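
Whether a stack of convolutions fits a small grid can be checked with the usual output-size arithmetic: an unpadded ("valid") convolution yields `(size - kernel) // stride + 1` cells per dimension, while "same" padding yields `ceil(size / stride)`. The sketch below applies this to a 10×10 grid for two layer stacks; the helper names are hypothetical. Atari-sized kernels (8/4, 4/2, 3/1) without padding collapse the grid to 1×1 after the first layer, whereas 3×3 kernels with "same" padding preserve it.

```python
def conv_output_size(size, kernel, stride, padding="valid"):
    """Spatial output size of one convolution along a single dimension."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1

def stack_output_sizes(size, layers, padding="valid"):
    """Apply each (kernel, stride) pair in turn and record the resulting sizes."""
    sizes = []
    for kernel, stride in layers:
        if padding == "valid" and size < kernel:
            sizes.append(None)  # The kernel no longer fits in the feature map
            break
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

atari_style = [(8, 4), (4, 2), (3, 1)]
small_kernels = [(3, 1), (3, 1), (3, 1)]
print(stack_output_sizes(10, atari_style))            # [1, None]
print(stack_output_sizes(10, small_kernels, "same"))  # [10, 10, 10]
```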

In [None]:
#Epsilon-Greedy Exploration

import numpy as np

class DQNAgent:
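    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size  # Used by act() to sample random actions
        # Other initialization code...
        self.epsilon = 1.0  # Initial exploration rate
        self.epsilon_min = 0.01  # Lower bound on the exploration rate
        self.epsilon_decay = 0.995  # Multiplicative decay applied on each replay call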

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            # Explore: Select a random action
            return np.random.randint(self.action_size)
        else:
            # Exploit: Select the action with the highest Q-value
            q_values = self.model.predict(state)
            return np.argmax(q_values[0])

    def replay(self, batch_size):
        # Replay buffer and training code...
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay


In this implementation:

- During each action selection, with probability epsilon, the agent explores by selecting a random action.

- With probability 1 - epsilon, the agent exploits by selecting the action with the highest Q-value according to the current policy.

- The value of epsilon is decayed over time to gradually shift towards more exploitation and less exploration as training progresses.
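
To get a feel for how quickly this schedule stops exploring, the sketch below counts how many decay steps it takes for epsilon to fall from 1.0 to the 0.01 floor with a decay factor of 0.995, the values used above. The function name `steps_until_min_epsilon` is hypothetical.

```python
def steps_until_min_epsilon(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count multiplicative decay steps until epsilon reaches the floor."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon

steps, final_epsilon = steps_until_min_epsilon()
print(steps, final_epsilon)
```

With these values the floor is reached after 919 decay steps. If the decay is applied once per environment step rather than once per episode, exploration may therefore end within the first few dozen episodes, so the decay factor is worth tuning against typical episode length.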

In [None]:
#Implementing training loop for DQN Agent

class DQNAgent:
    def __init__(self, state_size, action_size):
        # Initialization code...
        self.state_size = state_size  # (rows, cols) of the grid observation
        self.action_size = action_size
        self.model = DQN(action_size)
        self.target_model = DQN(action_size)
        self.target_model.set_weights(self.model.get_weights())

    def _preprocess(self, observation):
        # Add batch and channel dimensions: (rows, cols) -> (1, rows, cols, 1)
        return np.reshape(observation, [1, *self.state_size, 1])

    def train(self, env, num_episodes, batch_size):
        for episode in range(num_episodes):
            state = self._preprocess(env.reset())
            total_reward = 0
            while True:
                action = self.act(state)
                next_state, reward, done, _ = env.step(action)
                next_state = self._preprocess(next_state)
                total_reward += reward

                self.remember(state, action, reward, next_state, done)

                state = next_state

                if done:
                    break

                self.replay(batch_size)

            # Sync the target network once per episode rather than after every step,
            # so that the Q-value targets stay fixed between syncs
            self.update_target_model()

            print(f"Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")

    def update_target_model(self):
        # Copy the online network's weights into the target network
        self.target_model.set_weights(self.model.get_weights())


In this implementation:

- The train method trains the DQN agent in the given environment for a specified number of episodes.

- Within each episode, it selects actions, executes them in the environment, and stores the resulting experiences in the replay buffer. The replay method is then responsible for sampling a batch of experiences, computing target Q-values, and updating the Q-network.

- The update_target_model method updates the weights of the target network to match the current weights of the Q-network. This update should happen only periodically, not after every gradient step, to stabilize training.
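
The effect of a periodic target update can be illustrated without TensorFlow. The sketch below uses a hypothetical `TinyQNetwork` class (a single NumPy weight matrix with the same `get_weights`/`set_weights` interface as a Keras model) and an assumed sync interval of 100 training steps. The constant added to the weights is only a placeholder for a gradient update.

```python
import numpy as np

class TinyQNetwork:
    """Hypothetical stand-in for a Keras model: a single weight matrix."""
    def __init__(self, num_inputs, num_actions, rng):
        self.weights = rng.normal(size=(num_inputs, num_actions))

    def get_weights(self):
        return [self.weights.copy()]

    def set_weights(self, weights):
        self.weights = weights[0].copy()

rng = np.random.default_rng(0)
online = TinyQNetwork(4, 2, rng)
target = TinyQNetwork(4, 2, rng)
target.set_weights(online.get_weights())  # Start from identical weights

sync_every = 100  # Assumed sync interval, in training steps
sync_steps = []
for step in range(1, 251):
    online.weights += 0.01  # Placeholder for one gradient update
    if step % sync_every == 0:
        # Hard update: copy the online weights into the target network
        target.set_weights(online.get_weights())
        sync_steps.append(step)

print(sync_steps)
print(np.max(np.abs(online.weights - target.weights)))
```

Between syncs the target network lags behind the online network, which keeps the regression targets fixed for a while instead of shifting with every update.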

In [None]:
#Training and Evaluation

import numpy as np

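# Placeholder network so that this cell runs on its own; it shadows the
# convolutional DQN defined earlier, and predict() returns None.
class DQN:
    def __init__(self, num_actions):
        pass

    def predict(self, state):
        pass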

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.model = DQN(action_size)
        self.target_model = DQN(action_size)
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return np.random.randint(self.action_size)
        else:
            q_values = self.model.predict(state)
            return np.argmax(q_values[0])

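    def remember(self, state, action, reward, next_state, done):
        # Placeholder: transitions are not stored yet
        pass

    def replay(self, batch_size):
        # Placeholder: no learning update and no epsilon decay happen here,
        # so epsilon stays at 1.0 and act() always picks a random action
        pass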

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

    def train(self, env, num_episodes, batch_size):
        for episode in range(num_episodes):
            state = env.reset()
            state = np.reshape(state, [1, *self.state_size])  # Adjusted for 2D state
            total_reward = 0
            while True:
                action = self.act(state)
                next_state, reward, done, _ = env.step(action)
                next_state = np.reshape(next_state, [1, *self.state_size])  # Adjusted for 2D state
                total_reward += reward

                self.remember(state, action, reward, next_state, done)

                state = next_state

                if done:
                    break

            print(f"Episode {episode + 1}/{num_episodes}, Total Reward: {total_reward}")

# Initialize environment and agent
env = SpaceShipEnv()
state_size = env.observation_space.shape
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)

# Train agent
num_episodes = 1000
batch_size = 32
agent.train(env, num_episodes, batch_size)

# Test agent
test_episodes = 10
total_rewards = []
for _ in range(test_episodes):
    state = env.reset()
    state = np.reshape(state, [1, *state_size])
    total_reward = 0
    while True:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        state = np.reshape(next_state, [1, *state_size])
        if done:
            break
    total_rewards.append(total_reward)

average_reward = sum(total_rewards) / len(total_rewards)
print(f"Average reward over {test_episodes} test episodes: {average_reward}")


Episode 1/1000, Total Reward: -18
Episode 2/1000, Total Reward: -20
Episode 3/1000, Total Reward: -10
Episode 4/1000, Total Reward: -13
Episode 5/1000, Total Reward: -10
Episode 6/1000, Total Reward: -57
Episode 7/1000, Total Reward: -13
Episode 8/1000, Total Reward: -19
Episode 9/1000, Total Reward: -21
Episode 10/1000, Total Reward: -15
Episode 11/1000, Total Reward: -15
Episode 12/1000, Total Reward: -54
Episode 13/1000, Total Reward: -61
Episode 14/1000, Total Reward: -43
Episode 15/1000, Total Reward: -14
Episode 16/1000, Total Reward: -21
Episode 17/1000, Total Reward: -28
Episode 18/1000, Total Reward: -76
Episode 19/1000, Total Reward: -32
Episode 20/1000, Total Reward: -74
Episode 21/1000, Total Reward: -11
Episode 22/1000, Total Reward: -12
Episode 23/1000, Total Reward: -12
Episode 24/1000, Total Reward: -39
Episode 25/1000, Total Reward: -25
Episode 26/1000, Total Reward: -64
Episode 27/1000, Total Reward: -20
Episode 28/1000, Total Reward: -26
Episode 29/1000, Total Reward

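Overall, the code implements the scaffolding of a DQN agent that learns to make decisions through interaction with an environment and reinforcement learning. In the final cell, `DQN`, `remember`, and `replay` are placeholders, so epsilon stays at 1.0 and the rewards logged above come from random actions rather than a learned policy. The intended roles of the pieces are: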

1. **DQN Class**: Represents the neural network architecture for the Deep Q-Network (DQN) used by the agent.

2. **DQNAgent Class**: Represents the agent that interacts with the environment, learns from experiences, and makes decisions.
   - **Initialization**: Sets up the agent with the given state and action sizes, creates neural networks, and initializes exploration parameters.
   - **act**: Chooses actions based on an epsilon-greedy policy.
   - **remember**: Stores experiences (state, action, reward, next state, done) in a replay buffer.
   - **replay**: Samples experiences from the replay buffer and updates the model.
   - **update_target_model**: Updates the target network's weights to match the main network's weights.
   - **train**: Trains the agent by interacting with the environment and updating the model.

3. **Training and Testing**:
   - Initializes the environment, state size, action size, and the agent.
   - Trains the agent for a specified number of episodes.
   - Tests the trained agent by running it for a certain number of test episodes and evaluating its performance based on the total rewards obtained.
