# COGS 188 - Final Project

# Reinforcement Learning for Volleyball: A Comprehensive Analysis of Different RL Algorithms

## Group members

- Caleb Galdston
- Aden Harris
- Ayden Tabatabi
- Sia Khorsand

# Abstract 

In this project, we compare the performance of different reinforcement learning algorithms at playing slime volleyball, a two-dimensional volleyball game in which the objective is to land the ball on the opponent's side of the net. To train our agents, we used the SlimeVolleyballGym environment, which let us test multi-agent reinforcement learning algorithms and have different agents play against each other. This environment has a simple reward structure: the agent receives +1 if the ball lands on the opponent's side of the net, and -1 if it lands on the agent's own side. To compare our algorithms, we focus on the average reward each one receives over 1000 episodes against the baseline model defined in the environment, as well as the average reward each one receives when competing against the others. Each episode has a maximum score of 5, since it ends once one of the agents wins 5 rounds.

# Background

For our background, we researched some of the algorithms discussed in class in more depth, along with other approaches we wanted to attempt for this problem. Since this repository already has explanations and implementations of a few reinforcement learning algorithms, we focused on techniques that were not already implemented.

### DQN
Deep Q-Learning eliminates the need for a Q-table, which tabular Q-Learning uses to store an estimated value for every state-action pair. A Q-table becomes impractical in large or complex environments such as SlimeVolley, whose observations are continuous 12-dimensional feature vectors. Instead, Deep Q-Learning uses a deep neural network to approximate Q-values, which makes it a more scalable approach for such tasks. This method was first introduced by researchers at DeepMind in the 2013 paper *Playing Atari with Deep Reinforcement Learning* by Mnih et al. <a name="mnih"></a>[<sup>[1]</sup>](#mnihnote) The paper's goal was to combine deep learning with traditional reinforcement learning techniques such as Q-Learning, producing an algorithm that learns to play Atari games directly from raw pixel images. A key component of their approach was experience replay, where the agent stores past experiences in a memory pool and randomly samples from this pool to perform Q-Learning updates. This technique helps stabilize training by breaking the correlation between consecutive experiences. While our SlimeVolley environment doesn't rely on pixel images, we believe this strategy will still be effective in helping our agent learn good volleyball strategies.
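To make the experience replay idea concrete, here is a minimal sketch of a replay memory in plain Python. It is an illustration under our own assumptions rather than the exact code from our training runs: the `ReplayBuffer` and `Transition` names, the capacity, and the dummy 12-dimensional states are all hypothetical.

```python
import random
from collections import deque, namedtuple

# One stored interaction step; the field names are illustrative.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size memory pool with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the correlation
        # between consecutive steps of the same rally.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage with dummy 12-dimensional states (values are placeholders).
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(
        state=[float(t)] * 12,
        action=0,
        reward=0.0,
        next_state=[float(t + 1)] * 12,
        done=False,
    )
batch = buffer.sample(2)
```

Because the buffer is bounded, only the three most recent transitions survive in this toy example; a real run would use a much larger capacity so that each minibatch mixes experience from many different episodes.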

In addition to the foundational DQN approach, we explored more recent advancements that have demonstrated significant improvements. One such advancement is Double Q-Learning, introduced by van Hasselt et al. in their paper *Deep Reinforcement Learning with Double Q-learning* <a name="hasselt"></a>[<sup>[2]</sup>](#hasseltnote). This method addresses a known issue in DQN: the tendency to overestimate Q-values in certain scenarios. It extends the original tabular Double Q-Learning concept to the DQN framework. Traditional Q-Learning and DQN use the same values to both select and evaluate the next action, so any overestimated value is also the one most likely to be chosen, which compounds the error. Van Hasselt's solution decouples action selection from action evaluation, mitigating this issue. For our project, we aim to investigate whether implementing Double Q-Learning can improve the performance of our algorithm in the SlimeVolley environment.
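The difference between the two update targets can be sketched for a single transition as follows. This is a simplified illustration that assumes the Q-values for the next state are already available as plain lists from an online network and a target network; the function names and the numbers in the example are hypothetical.

```python
def dqn_target(reward, done, q_next_target, gamma=0.99):
    """Standard DQN: one set of values both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    # Index of the action the online network currently prefers.
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[best_action]


# Made-up Q-values for one next state with three actions.
q_online = [1.0, 2.0, 0.5]
q_target = [3.0, 1.0, 0.5]

standard = dqn_target(0.0, False, q_target, gamma=0.9)
double = double_dqn_target(0.0, False, q_online, q_target, gamma=0.9)
```

In this example the target network assigns a high value to action 0, which standard DQN takes at face value. Double DQN instead evaluates action 1, the online network's choice, so its target can never exceed the standard one.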

# Problem Statement

How can we compare the performance of different reinforcement learning approaches to the same problem and what are the key distinctions between these approaches that make some more suitable than others?

# Data

Detail how/where you obtained the data and cleaned it (if necessary)

If the data cleaning process is very long (e.g., elaborate text processing) consider describing it briefly here in text, and moving the actual cleaning process to another notebook in your repo (include a link here!).  The idea behind this approach: this is a report, and if you blow up the flow of the report to include a lot of code it makes it hard to read.

Please give the following information for each dataset you are using
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

To evaluate our agents, we first compared each one against the baseline model: after training each model individually, we had it play 500 games against the baseline and recorded its average reward across those games. We also had our models compete against each other to determine which earned the higher average reward when playing against the other algorithms we implemented.
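The metric itself is simple to compute. The sketch below shows one way to do it, assuming a hypothetical `play_game` callable that plays one full game and returns the agent's total reward. The `make_stub_game` helper is a stand-in for the real environment: it simulates points with a fixed win probability until one side reaches 5, and exists only so that the example runs on its own.

```python
import random
import statistics


def evaluate(play_game, n_games):
    """Return (mean, standard error) of the per-game reward over n_games."""
    rewards = [play_game() for _ in range(n_games)]
    mean = statistics.mean(rewards)
    # Standard error of the mean; undefined for a single game, so report 0.
    sem = statistics.stdev(rewards) / n_games ** 0.5 if n_games > 1 else 0.0
    return mean, sem


def make_stub_game(win_prob, seed=0):
    """Hypothetical stand-in for one game against a fixed opponent."""
    rng = random.Random(seed)

    def play_game():
        ours, theirs = 0, 0
        # The game ends once either side has won 5 points.
        while ours < 5 and theirs < 5:
            if rng.random() < win_prob:
                ours += 1
            else:
                theirs += 1
        # +1 per point won, -1 per point lost.
        return ours - theirs

    return play_game


mean_reward, std_error = evaluate(make_stub_game(win_prob=0.5), n_games=500)
```

Reporting the standard error next to the mean makes it easier to judge whether a gap between two agents is larger than the game-to-game noise.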

# Results

### Deep Q-Network

Our first implementation of the DQN algorithm was very simple. We discretized all possible actions and used a four-layer neural network to estimate Q-values, following closely what we did for the CartPole balancing task in assignment four. Even with such a basic architecture, the agent was able to beat the baseline model a few times. We quickly noticed a few issues. First, we didn't need to include every action, since some combinations, such as moving left and right at the same time, were unlikely to be useful in this environment. Second, the epsilon decay rate may have been too aggressive: the agent did not appear to explore many different actions early on, and it made almost no progress during the first few hundred iterations. Third, the environment's default reward function is sparse. The agent only receives a positive or negative reward after the ball hits the ground, so potentially helpful actions such as hitting the ball over the net are not directly encouraged. However, to keep our benchmarks consistent across the different algorithms, we decided to leave the reward function unchanged.
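The first two issues can be illustrated with a short sketch. It assumes that each action is a combination of three binary buttons in the order (forward, backward, jump), and the epsilon schedule values are placeholders rather than the hyperparameters we trained with; all names here are hypothetical.

```python
import itertools
import random

# Assumed button layout: (forward, backward, jump), each 0 or 1.
ALL_ACTIONS = list(itertools.product([0, 1], repeat=3))

# Drop combinations that press forward and backward at the same time.
USEFUL_ACTIONS = [a for a in ALL_ACTIONS if not (a[0] and a[1])]


def epsilon_at(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Exponential decay toward eps_end; a decay closer to 1 explores longer."""
    return max(eps_end, eps_start * decay ** episode)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice; returns an index into USEFUL_ACTIONS."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Greedy pick from made-up Q-values, one per remaining action.
index = select_action([0.1, 0.9, 0.3, 0.0, 0.0, 0.0], epsilon=0.0)
chosen_buttons = USEFUL_ACTIONS[index]
```

Pruning the contradictory combinations shrinks the network's output layer from 8 units to 6. The `decay` parameter controls how quickly exploration falls off, so a value that is too small reproduces the under-exploration described above.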

Here is a plot of the reward over the first 1000 episodes of our DQN training.

<img src="plots/dqn_one.png" width="500" height="300">

As we can see, the agent definitely showed signs of improvement over the last few hundred episodes. However, we wanted to see if we could do better.
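Per-episode rewards are noisy, so trends like this are easier to read after smoothing. The helper below is a minimal sketch of a sliding-window average; the function name and the window size are our own choices for illustration and are not necessarily what produced the plot above.

```python
def moving_average(values, window):
    """Mean of each full sliding window; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


# Made-up episode rewards, smoothed over 3 episodes.
smoothed = moving_average([-5, -5, -4, -5, -3, -2], window=3)
```

Plotting the smoothed series alongside the raw rewards keeps the overall trend visible without hiding the episode-to-episode variance.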

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Probably you should include a learning curve to demonstrate how much better the model gets as you increase the number of trials

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Generally reinforcement learning tasks may require a huge amount of training, so extensive grid search is unlikely to be possible. However exploring a few reasonable hyper-parameters may still be possible.  Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?  Or you compare a completely different approach/algorithm to the problem? Whatever, this stuff is just serving suggestions.



# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.


### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   


### Future work
Looking at the limitations and/or the toughest parts of the problem and/or the situations where the algorithm(s) did the worst... is there something you'd like to try to make these better?

### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name='mnihnote'></a>1.[^](#mnih): Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602.<br>
<a name="hasseltnote"></a>2.[^](#hasselt): van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10295<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
