This project implements a Q-Learning agent to solve the classic Taxi-v3 environment from OpenAI Gym. The agent learns to navigate a taxi efficiently within a grid-world, pick up a passenger, and drop them off at their designated destination. This repository showcases the Q-Learning algorithm, its training process, and performance evaluation through various scenarios. It also includes the generation of a dataset from the Q-table, which can be used for potential future work with neural networks.
- Introduction
- Environment: Taxi-v3
- Q-Learning Algorithm
- Project Structure
- Setup and Installation
- Usage
- Results
- Future Improvements
- License
- Contact
Reinforcement Learning (RL) allows an agent to learn by interacting with an environment, aiming to maximize cumulative rewards. This project applies Q-Learning, a model-free, off-policy RL algorithm, to the Taxi-v3 problem. The agent constructs a Q-table, which stores the expected future rewards for each state-action pair, thereby learning an optimal policy.
The agent's decision-making balances exploration (trying new actions) and exploitation (choosing known best actions) using an epsilon-greedy strategy: with probability epsilon the agent takes a random action, and otherwise it takes the action with the highest Q-value for the current state.
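For illustration, a minimal epsilon-greedy selection helper might look like the sketch below (the function name and signature are hypothetical, not taken from the script):

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon, n_actions=6):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: pick a random action
    return int(np.argmax(q_table[state]))    # exploit: best-known action for this state
```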
The Taxi-v3 environment is a well-known discrete Reinforcement Learning problem from OpenAI Gym.
- Grid World: A 5x5 grid represents the taxi's operational area.
- State Space: The environment's state is defined by the taxi's location (row, column), the passenger's location (one of four landmarks or inside the taxi), and the destination. This gives 500 discrete states (25 taxi positions × 5 passenger locations × 4 destinations), of which 404 are actually reachable during an episode.
- Action Space: The taxi has 6 possible discrete actions:
  - 0: Move South
  - 1: Move North
  - 2: Move East
  - 3: Move West
  - 4: Pick up passenger
  - 5: Drop off passenger
- Rewards:
  - +20: Successful passenger drop-off at the correct destination.
  - -10: Invalid pickup or drop-off attempt.
  - -1: For each step taken.
- Goal: The primary objective is for the agent to learn the most efficient sequence of actions to pick up and drop off the passenger.
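As a quick orientation, the sketch below creates the environment with Gymnasium (the maintained successor to OpenAI Gym, as listed in `requirements.txt`) and inspects its spaces; the `decode` helper on the unwrapped environment is part of the Taxi implementation at the time of writing:

```python
import gymnasium as gym

env = gym.make("Taxi-v3", render_mode="ansi")   # text-based rendering
print(env.observation_space)                    # Discrete(500)
print(env.action_space)                         # Discrete(6)

# Each integer state packs (taxi_row, taxi_col, passenger_location, destination);
# the underlying TaxiEnv exposes encode/decode helpers for this packing.
state, info = env.reset(seed=0)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)
```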
Q-Learning learns an action-value function, $Q(s, a)$, which estimates the expected cumulative reward of taking action $a$ in state $s$ and acting optimally thereafter.
The Q-table is iteratively updated using the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

- $Q(s, a)$: Current Q-value for state $s$ and action $a$.
- $\alpha$ (learning rate): Dictates how much new information influences the existing Q-value (range: 0 to 1).
- $R_{t+1}$: The immediate reward received from the environment.
- $\gamma$ (discount factor): Balances the importance of immediate vs. future rewards (range: 0 to 1).
- $\max_{a'} Q(s', a')$: The maximum Q-value achievable from the next state $s'$.
- $s'$: The state reached after performing action $a$.
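As a minimal sketch of this update rule with a NumPy Q-table (the hyperparameter values shown are illustrative, not the script's):

```python
import numpy as np

n_states, n_actions = 500, 6
q_table = np.zeros((n_states, n_actions))   # one row per state, one column per action

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-Learning update for the transition (state, action, reward, next_state)."""
    best_next = np.max(q_table[next_state])            # max_{a'} Q(s', a')
    td_target = reward + gamma * best_next             # R_{t+1} + gamma * max_{a'} Q(s', a')
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```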
```text
.
├── taxi_q_learning.py     # Main script: Q-Learning agent, training, and evaluation
├── q_learning_data.csv    # Generated CSV: Q-table derived data (output after training)
├── README.md              # This documentation file
└── requirements.txt       # Python package dependencies
```
To run this project, ensure you have Python installed. Using a virtual environment is highly recommended.
- Clone the repository:

  ```bash
  git clone https://github.com/AbhinavBugudi69/Taxi-Pathfinding-Q-Learning.git  # Update with your actual repo URL if different
  cd Taxi-Pathfinding-Q-Learning
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # On Windows:
  .\venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install dependencies: Create a `requirements.txt` file in your project's root directory with the following content:

  ```text
  gymnasium
  numpy
  pandas
  matplotlib
  ```

  Then install them:

  ```bash
  pip install -r requirements.txt
  ```
Execute the `taxi_q_learning.py` script to initiate the agent's training, data generation, simulation, and result plotting.
```bash
python taxi_q_learning.py
```

The `TaxiV3QLearningAgent` class encapsulates the training logic. Key hyperparameters are configured in the `if __name__ == "__main__":` block:

- `num_train_episodes`: Number of training iterations.
- `alpha` (learning rate): Impact of new learning on existing knowledge.
- `gamma` (discount factor): Preference for immediate vs. future rewards.
- `epsilon`: Initial exploration vs. exploitation balance.
- `epsilon_min`: The lowest exploration rate allowed.
- `epsilon_decay`: Rate at which exploration decreases.
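A hypothetical sketch of such a configuration block is shown below; the constructor signature and the values are assumptions, not taken from the script:

```python
if __name__ == "__main__":
    # Hypothetical configuration; the real script's argument names and values may differ.
    agent = TaxiV3QLearningAgent(
        num_train_episodes=10_000,  # training iterations
        alpha=0.1,                  # learning rate
        gamma=0.99,                 # discount factor
        epsilon=1.0,                # initial exploration rate
        epsilon_min=0.01,           # exploration floor
        epsilon_decay=0.995,        # per-episode decay of epsilon
    )
    agent.train()
```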
Upon completion of training, a `q_learning_data.csv` file is created. This CSV maps environment states (encoded taxi, passenger, and destination locations) to the optimal action determined by the trained Q-table. This dataset can serve as ground truth for training a supervised learning model, such as a neural network, to predict optimal actions.
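One way such a dataset could be derived from a trained Q-table is sketched below; the column names are illustrative and may differ from the script's actual output, and `env` and `q_table` are assumed from the earlier sketches:

```python
import numpy as np
import pandas as pd

rows = []
for state in range(q_table.shape[0]):
    # Unpack the encoded state into its components.
    taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
    rows.append({
        "taxi_row": taxi_row,
        "taxi_col": taxi_col,
        "passenger_index": passenger_loc,
        "destination_index": destination,
        "best_action": int(np.argmax(q_table[state])),   # greedy action from the Q-table
    })
pd.DataFrame(rows).to_csv("q_learning_data.csv", index=False)
```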
The `process_random_config` function demonstrates the trained agent's behavior in a new, random scenario. It prints the initial state's details, the sequence of actions taken, the rewards received, and the total steps until completion.
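A greedy rollout in that spirit could look like the following sketch (the actual `process_random_config` may differ in structure and output format):

```python
import numpy as np

def run_greedy_episode(env, q_table, max_steps=200):
    """Roll out one episode following the greedy policy and print each step."""
    state, _ = env.reset()
    total_reward = 0
    for step in range(1, max_steps + 1):
        action = int(np.argmax(q_table[state]))
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        print(f"Step {step}: Action={action}, Reward={reward}")
        if terminated or truncated:
            return total_reward, step
    return total_reward, max_steps
```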
The agent's performance is rigorously tested against two "extreme" pre-defined scenarios:
- Taxi starts farthest from the passenger.
- Passenger is picked up but farthest from the destination.
For each, the agent runs multiple test episodes (defaulting to 100), providing average rewards, steps, and penalties, which indicate the robustness of the learned policy.
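A sketch of how such an evaluation loop might aggregate these metrics is shown below. This version evaluates over random resets; pinning the exact pre-defined start configurations would additionally require setting the environment's internal state, which is omitted here:

```python
import numpy as np

def evaluate_policy(env, q_table, episodes=100):
    """Average reward, steps, and penalties (-10 rewards) over greedy test episodes."""
    totals = {"reward": 0.0, "steps": 0.0, "penalties": 0.0}
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = int(np.argmax(q_table[state]))
            state, reward, terminated, truncated, _ = env.step(action)
            totals["reward"] += reward
            totals["steps"] += 1
            if reward == -10:                 # illegal pickup/drop-off counts as a penalty
                totals["penalties"] += 1
            done = terminated or truncated
    return {metric: value / episodes for metric, value in totals.items()}
```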
Visual insights into the learning process are provided by two plots generated at the end of the script:
- Rewards Per Episode: Illustrates how the total reward evolves over training episodes.
- Steps Per Episode: Shows the number of steps taken by the agent in each training episode.
These plots should ideally show an increasing trend for rewards and a decreasing trend for steps, signifying effective learning.
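Assuming the per-episode rewards and step counts were recorded in lists during training (the variable names here are illustrative), the plots could be produced roughly like this:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(episode_rewards)    # total reward collected in each training episode
ax1.set(title="Rewards Per Episode", xlabel="Episode", ylabel="Total reward")
ax2.plot(episode_steps)      # number of steps taken in each training episode
ax2.set(title="Steps Per Episode", xlabel="Episode", ylabel="Steps")
plt.tight_layout()
plt.show()
```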
Executing `taxi_q_learning.py` will produce console output similar to this:

```text
!Training the Q-learning agent!
!Generating data for neural network training!
Data saved to 'q_learning_data.csv'.
Analyzing extreme configuration 1: {'taxi_row': 0, 'taxi_col': 0, 'passenger_index': 3, 'destination_index': 0}
Testing configuration: {'taxi_row': 0, 'taxi_col': 0, 'passenger_index': 3, 'destination_index': 0}
Average Reward: 8.31
Average Steps: 12.69
Average Penalties: 0.00
Analyzing extreme configuration 2: {'taxi_row': 4, 'taxi_col': 4, 'passenger_index': 0, 'destination_index': 3}
Testing configuration: {'taxi_row': 4, 'taxi_col': 4, 'passenger_index': 0, 'destination_index': 3}
Average Reward: 8.03
Average Steps: 12.97
Average Penalties: 0.00
Starting simulation for a random configuration :
Taxi Initial Location: (row=2, col=4)
Passenger Location: 0
Destination: 3
Step 1: Action=3, Reward=-1
Step 2: Action=3, Reward=-1
Step 3: Action=3, Reward=-1
Step 4: Action=1, Reward=-1
Step 5: Action=1, Reward=-1
Step 6: Action=3, Reward=-1
Step 7: Action=4, Reward=-1
Step 8: Action=2, Reward=-1
Step 9: Action=0, Reward=-1
Step 10: Action=0, Reward=-1
Step 11: Action=2, Reward=-1
Step 12: Action=2, Reward=-1
Step 13: Action=0, Reward=-1
Step 14: Action=0, Reward=-1
Step 15: Action=5, Reward=20
Simulation complete. Total Reward: 6, Steps: 15
```
The accompanying plots will graphically illustrate the agent's learning curve, showing rewards generally increasing and steps decreasing over training episodes, indicating successful policy convergence and efficient problem-solving. A notable aspect is the consistent average penalties of 0.00 across extreme test configurations, confirming the agent's ability to avoid illegal actions.
- Deep Q-Network (DQN) Implementation: Extend this project by training a neural network using the generated `q_learning_data.csv`. This would allow for handling much larger or continuous state spaces where explicit Q-tables become unmanageable.
- Advanced Hyperparameter Tuning: Implement automated hyperparameter optimization techniques (e.g., Grid Search, Random Search, or Bayesian Optimization) to systematically find the most effective `alpha`, `gamma`, and `epsilon` decay schedules.
- Custom Environment Design: Apply the Q-Learning framework to a more complex, custom-designed grid-world or a simulated robotics environment to test its adaptability.
- Real-time Visualization: Enhance the simulation by adding real-time visual rendering of the taxi's movements using `render_mode="human"` or external visualization libraries.
- Robust Evaluation: Introduce more advanced evaluation metrics beyond simple averages, such as success rates, episode completion times, and convergence speed comparisons.
This project is open-source and available under the MIT License.
For any questions, collaboration opportunities, or discussions about this project, please feel free to connect:
Abhinava Sai Bugudi
- Email: abhinavasaibugudi04@gmail.com
- LinkedIn: bugudi-abhinava-sai
- GitHub: AbhinavBugudi69