In [None]:
def maddpg(n_episodes=1000, print_every=100):
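    # Run n_episodes training episodes; env, brain_name, agent1, agent2, num_agents,
    # scores, average_score_range and solved_score are expected to exist as globals.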
    for i_episode in range(1, n_episodes+1):

        env_info = env.reset(train_mode=True)[brain_name]     

        states = env_info.vector_observations
        states = np.reshape(states, (1, 48))
        agent1.reset()
        agent2.reset()
        agent_scores = np.zeros(num_agents)

        while True:
            # Determine actions for the Unity ML-agents from the current state
            actions1 = agent1.act(states, add_noise=True)
            actions2 = agent2.act(states, add_noise=True)

            # Send the actions to the Unity ML-agents in the environment and receive environment information
            actions = np.concatenate((actions1, actions2), axis=0) 
            actions = np.reshape(actions, (1, 4))
            env_info = env.step(actions)[brain_name]        

            next_states = env_info.vector_observations   # Get the next states for each unity agent in the environment
            next_states = np.reshape(next_states, (1, 48))
            rewards = env_info.rewards                   # Get the rewards for each the Unity ML-agent in the environment
            dones = env_info.local_done                  # See if episode has finished for each the Unity ML-agent in the environment

            #Send (S, A, R, S') info to the training agent for Replay Buffer (memory) and network updates
            agent1.step(states, actions1, rewards[0], next_states, dones[0])
            agent2.step(states, actions2, rewards[1], next_states, dones[1])
            
            # Set new states to current states for determining next actions
            states = next_states

            # Update episode score for each Unity ML-agent
            agent_scores += rewards

            # End the episode as soon as any agent reports done
            if np.any(dones):
                break

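        # The episode score is the higher of the two agents' totals
        scores.append(np.max(agent_scores))
        # Average over the most recent average_score_range episodes (fewer early on)
        average_score = np.mean(scores[-average_score_range:])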

        print('\nEpisode {}\tEpisode Score: {:.3f}\tAverage Score: {:.3f}'.format(i_episode, scores[i_episode-1], average_score), end="")

        # Checkpoint both agents' local actor and critic weights after every episode
        an_filename = "actor_agent1.pth"
        torch.save(agent1.actor_local.state_dict(), an_filename)
        cn_filename = "critic_agent1.pth"
        torch.save(agent1.critic_local.state_dict(), cn_filename)
        
        an_filename = "actor_agent2.pth"
        torch.save(agent2.actor_local.state_dict(), an_filename)
        cn_filename = "critic_agent2.pth"
        torch.save(agent2.critic_local.state_dict(), cn_filename)

        # Check whether the task is solved (i.e., average_score >= solved_score after more than 100 episodes).
        # If yes, report it and end training; the weights were already saved above.
        if i_episode > 100 and average_score >= solved_score:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.3f}'.format(i_episode, average_score))
            break
    return scores

This code runs the MADDPG training loop for two agents in a Unity ML-Agents environment. The algorithm uses a replay buffer (memory) to store and sample experience tuples, which are used to update the agents' neural networks. The function takes two arguments, n_episodes and print_every (accepted, but not used in the body shown), and returns the list of per-episode scores.

The maddpg function loops over n_episodes. For each episode it resets the environment, reshapes the two agents' observations into a single 1x48 state vector that both agents see, resets both agents, and then enters a while loop to interact with the environment. Each agent chooses its action from the current state with exploration noise added; the two actions are concatenated into a 1x4 array and sent to the environment, which returns the next states, the rewards, and the done flags for each agent. Each agent then passes its own tuple (states, action, reward, next_states, done) to its step method, which stores it in the replay buffer and updates the networks. The loop ends as soon as any agent reports done, and the higher of the two agents' episode scores is appended to the scores list.
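
The agent class and its memory are not shown in this notebook, so the following is only a minimal sketch of what a replay buffer of this kind typically looks like. The names `ReplayBuffer` and `Experience` and the constructor arguments are assumptions for illustration, not the notebook's actual code.

```python
import random
from collections import deque, namedtuple

# Hypothetical container; the real agent's buffer is not shown in this notebook.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size FIFO store of experience tuples with uniform random sampling."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Append one (S, A, R, S', done) tuple."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Return batch_size experiences drawn uniformly without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

An agent would usually call `add` on every step and start calling `sample` only once `len(buffer)` is at least `batch_size`.
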

After each episode, the episode score is appended to the scores list and the average over the most recent average_score_range episodes is computed (100 according to the solved criterion). The episode score and the average are printed after every episode; print_every is not consulted. The local actor and critic weights of both agents are saved after every episode, and training stops once more than 100 episodes have run and the average score is greater than or equal to solved_score.
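
The averaging and stopping rule can be isolated in two small helpers. This is a sketch under stated assumptions: `rolling_average` and `is_solved` are hypothetical names, and the check uses "at least `window` episodes" where the loop above uses `i_episode > 100`.

```python
def rolling_average(scores, window=100):
    """Mean of the last `window` scores, or of all scores if fewer exist."""
    if not scores:
        return 0.0
    recent = scores[-window:]
    return sum(recent) / len(recent)


def is_solved(scores, solved_score, window=100):
    """True once a full window of episodes averages at least solved_score."""
    return len(scores) >= window and rolling_average(scores, window) >= solved_score
```

Requiring a full window avoids declaring the task solved on the strength of a few lucky early episodes.
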

In short, training ends either after n_episodes or as soon as the solved criterion is met, and the checkpoint files written after the last episode hold the most recent weights of both agents.

## Future Work

Here are a few possible directions for future work on MADDPG:

### Parameter tuning: 

MADDPG has several hyperparameters that need tuning, such as the learning rates, discount factor, batch size, and noise level. Tuning them could improve the performance of the algorithm. One possible approach is to use grid search or random search over a predefined range for each hyperparameter and keep the best-scoring configuration.
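
A random search over these settings could be sketched as follows. The hyperparameter names and ranges are illustrative assumptions, not the values used in this notebook, and `evaluate` is a hypothetical callback that would train with a configuration and return a score such as the final average score.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical hyperparameter configuration."""
    return {
        "lr_actor": 10 ** rng.uniform(-5, -3),   # log-uniform in [1e-5, 1e-3]
        "lr_critic": 10 ** rng.uniform(-5, -3),
        "gamma": rng.uniform(0.95, 0.999),
        "batch_size": rng.choice([64, 128, 256, 512]),
        "noise_sigma": rng.uniform(0.05, 0.3),
    }


def random_search(evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)  # e.g. average score after a short training run
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Learning rates are sampled on a log scale because useful values typically span several orders of magnitude.
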
### Incorporating hierarchical learning: 

Hierarchical learning structures the learning process so that an agent can solve complex tasks by breaking them down into smaller sub-tasks. MADDPG could be extended in this direction by adding a hierarchy of policies, each of which learns to solve one sub-task.

### Curriculum learning: 

Curriculum learning is a training technique in which the difficulty of the environment is increased gradually over time. Starting with simpler tasks and raising the complexity step by step can help the agents learn a better policy, and it is especially useful in environments where the reward signal is sparse or delayed.
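
One simple way to schedule the difficulty is to advance a level whenever the recent average score clears a threshold. The sketch below assumes the environment exposes some difficulty setting that can be indexed by `level`; the environment used in this notebook is not known to offer one, and the `Curriculum` class is hypothetical.

```python
class Curriculum:
    """Advance to the next difficulty level when the recent average is high enough."""

    def __init__(self, thresholds, window=20):
        self.thresholds = thresholds  # thresholds[i] is the average needed to leave level i
        self.window = window
        self.level = 0
        self.recent = []

    def record(self, score):
        """Register one episode score and return the level to use next."""
        self.recent.append(score)
        self.recent = self.recent[-self.window:]
        window_full = len(self.recent) == self.window
        if (self.level < len(self.thresholds) and window_full
                and sum(self.recent) / self.window >= self.thresholds[self.level]):
            self.level += 1
            self.recent = []  # restart the statistics at the new level
        return self.level
```

Clearing the score history on promotion keeps scores earned on an easier level from triggering the next promotion.
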
### Adapting to changing environments: 

In many real-world scenarios, the environment changes over time. MADDPG could be extended to adapt to such changes, for example by using meta-learning or by incorporating online learning.

### Multi-agent exploration:

One of the challenges in multi-agent reinforcement learning is exploration. If every agent follows a deterministic policy, the agents may get stuck in a suboptimal equilibrium. One possible solution is to give each agent its own exploration policy or noise process, so that the agents do not all explore in the same way and cover more of the environment.
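
As an illustration, each agent can own an independent noise process with its own seed. The sketch below uses an Ornstein-Uhlenbeck process, a common choice for DDPG-style agents; whether the agents in this notebook use it is an assumption, and the parameter values are illustrative.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that reverts to mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.size = size
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


# One independent process per agent; each agent here has a 2-dimensional action.
noise_per_agent = [OUNoise(size=2, seed=i) for i in range(2)]
```

Different seeds (or different `sigma` values per agent) keep the two agents from perturbing their actions in lockstep.
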


### Transfer learning:

Transfer learning is a technique where a model trained on one task is used to initialize the model for another related task. In multi-agent reinforcement learning, this can be used to transfer knowledge from a simpler environment to a more complex one. This can help reduce the training time and improve the performance of the agents.

### Adaptive noise scaling:

The current implementation uses a fixed level of noise during training to encourage exploration. However, this may not be optimal for all environments. Adaptive noise scaling can be used to dynamically adjust the level of noise based on the performance of the agents. For example, if the agents are performing well, the noise level can be reduced to focus on exploitation.
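
A minimal sketch of such a rule, assuming the noise added in `act` can be multiplied by a scalar `scale`: shrink the scale when the average score improved and grow it otherwise. The function name and the constants are hypothetical.

```python
def adapt_noise_scale(scale, recent_avg, previous_avg,
                      decay=0.95, growth=1.05, min_scale=0.01, max_scale=1.0):
    """Shrink the noise scale after an improvement, grow it otherwise, then clamp."""
    if recent_avg > previous_avg:
        scale *= decay    # agents are improving: lean toward exploitation
    else:
        scale *= growth   # progress stalled: explore more
    return max(min_scale, min(max_scale, scale))
```

The clamp keeps the scale from collapsing to zero or growing without bound during long plateaus.
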

### Population-based training: 
Population-based training is a technique where a population of models is trained in parallel. Periodically, poorly performing members copy the weights and hyperparameters of better-performing members and then perturb those hyperparameters. This tunes hyperparameters during training rather than before it and can improve exploration. One example of this is the Population-Based Training (PBT) algorithm developed by DeepMind.
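
The exploit-and-explore step at the heart of this approach can be sketched as follows. The population layout (dicts with `score`, `weights`, and `hparams` keys) and the perturbation factors are assumptions chosen for illustration.

```python
import copy
import random


def pbt_step(population, rng, perturb=(0.8, 1.2), fraction=0.25):
    """Bottom performers copy a top performer, then perturb its hyperparameters."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:n], ranked[-n:]
    for member in bottom:
        if any(member is t for t in top):
            continue  # only possible in very small populations
        donor = rng.choice(top)
        # Exploit: copy the donor's weights.
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: randomly scale each copied hyperparameter.
        member["hparams"] = {k: v * rng.choice(perturb)
                             for k, v in donor["hparams"].items()}
    return population
```

After a step, the replaced members continue training and their scores are re-evaluated before the next step.
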

### Multi-agent imitation learning: 
Imitation learning is a technique where agents learn to mimic an expert policy from demonstrations, in its simplest form (behavioral cloning) as a supervised learning problem. In multi-agent imitation learning, each agent learns from demonstrations of expert behavior in its role, while accounting for the other agents in the environment. This can be useful for environments where the reward signal is sparse or delayed.

### Hierarchical reinforcement learning: 
Hierarchical reinforcement learning is a technique where the task is decomposed into a hierarchy of subtasks, and policies are learned at each level. This can reduce the effective complexity of the problem and make it easier to learn a good policy. Feudal approaches, in which a high-level manager sets goals for lower-level workers, are one family of such methods.

### Decentralized training with centralized control (DTC): 
In DTC, each agent has its own local policy, but the policies are coordinated by a centralized controller that observes the global state of the environment. This is the reverse of the split in MADDPG as originally proposed, which uses centralized critics during training and decentralized actors at execution time. A centralized controller can improve coordination between the agents, at the cost of requiring global state at execution time.

### Reward shaping: 
Reward shaping is a technique where the reward function is modified to encourage desired behavior. This can be useful for environments where the reward signal is sparse or delayed. For example, in the Predator-Prey game, the reward for the predator can be shaped to encourage it to chase the prey, and the reward for the prey can be shaped to encourage it to evade the predator. A well-known variant is potential-based reward shaping, which adds the discounted change in a potential function to the reward and leaves the optimal policy unchanged.
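
For the predator in the Predator-Prey example, a potential-based shaping term can be sketched as below. The potential (negative distance to the prey), the 2-D positions, and the function names are assumptions for illustration; the shaped reward is r + gamma * phi(s') - phi(s).

```python
import math


def potential(predator_pos, prey_pos):
    """Higher (less negative) potential the closer the predator is to the prey."""
    return -math.dist(predator_pos, prey_pos)


def shaped_reward(reward, pos, next_pos, prey_pos, next_prey_pos,
                  gamma=0.99, done=False):
    """Add the discounted change in potential to the environment reward."""
    # The potential of a terminal state is taken to be zero.
    phi_next = 0.0 if done else potential(next_pos, next_prey_pos)
    return reward + gamma * phi_next - potential(pos, prey_pos)
```

For example, with the prey fixed at (3, 4), a predator moving from (0, 0) to (3, 0) reduces its distance from 5 to 4 and receives a positive shaping bonus; moving back the other way yields a negative one.
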

