# Problem Setting

- Recent advances in reinforcement learning have been fueled by deep networks that take raw state representations as input (see AlphaGo, Atari DQN)
- Growing problem settings and architectures lead to an explosion of parameters
- Researchers do not want to handcraft features or waste compute time on large models that are ineffective

# Problem Statement
"Can a reduced set of raw state features serve as a proxy for the full state, or even improve the performance of reinforcement learning models compatible with neural networks?"

- Domain is Hearts, as its state decomposes into a small number of raw features
- 3 experiments:
    - First, investigate the setting and function approximation of raw features
    - Second, investigate the performance of the approximation with reduced combinations of raw state features
    - Lastly, attempt to improve on the upper bound in Hearts with RL methods

# Setting the Framework
## Game of Hearts
Hearts is a trick-taking game in which the highest card of the leading suit wins the trick

*Goal*: win as few points as possible (each Heart is 1 point, the Queen of Spades is 13)

The winner of the previous trick leads the next one and cannot lead with Hearts unless they have been broken.  Players must follow the leading suit if they hold a card of it
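
The rules above can be sketched in a few lines of Python. This is a minimal illustration and not the `gymhearts` implementation: the `(rank, suit)` tuple representation and the names `trick_winner`, `card_points`, and `legal_plays` are assumptions made for this example, and first-trick rules are left out.

```python
from typing import List, Optional, Tuple

Card = Tuple[str, str]  # (rank, suit), e.g. ("Q", "S"); representation assumed for this sketch

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "T", "J", "Q", "K", "A"]


def trick_winner(trick: List[Card]) -> int:
    """Return the index of the winning play: highest card of the leading suit."""
    lead_suit = trick[0][1]
    followers = [i for i, (_, suit) in enumerate(trick) if suit == lead_suit]
    return max(followers, key=lambda i: RANKS.index(trick[i][0]))


def card_points(card: Card) -> int:
    """Each Heart is worth 1 point, the Queen of Spades 13, everything else 0."""
    if card[1] == "H":
        return 1
    if card == ("Q", "S"):
        return 13
    return 0


def legal_plays(hand: List[Card], lead_suit: Optional[str], hearts_broken: bool) -> List[Card]:
    """Cards that may be played; lead_suit is None when this player leads."""
    if lead_suit is None:
        non_hearts = [c for c in hand if c[1] != "H"]
        # Hearts may be led once broken, or when the hand holds nothing else
        if hearts_broken or not non_hearts:
            return list(hand)
        return non_hearts
    following = [c for c in hand if c[1] == lead_suit]
    # Must follow suit when possible; otherwise any card may be played
    return following or list(hand)
```

For example, in the trick `[("5", "C"), ("K", "C"), ("A", "H"), ("Q", "S")]` the King of Clubs wins, and the winner takes the Heart and the Queen of Spades with it.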

## Formulation as an MDP:
 - States:  All combinations of hands, cards played, and scores.  Decomposes into:
    1.  cards in hand (in-hand)
    2.  cards in play during current trick (in-play)
    3.  cards which have been played in current round (played-cards)
    4.  cards which have been won by each player (cards-won)
    5.  scores of each player (scores)

 - Action: Choose a card to play from your hand (the choice must follow the rules); passing cards is not modeled
 - Reward: Negative of points won (the agent wants fewer points)
 - Time and space are both discrete (a trick is one time step, and the action space is discrete)
 - The real game is a multiagent POMDP; these aspects are relaxed to treat it as an MDP
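
One way to turn the card-set features above into a network input is a 52-dimensional binary vector per feature, concatenated in the order given by a feature list. The sketch below assumes that encoding, two-character card strings such as `"QS"`, and the helper names `encode_cards`, `build_features`, and `reward`; none of these are confirmed to match the `gymhearts` agents. The `scores` feature is not a card set and is left out here.

```python
from typing import Dict, List

SUITS = "CDHS"
RANKS = "23456789TJQKA"
DECK = [rank + suit for suit in SUITS for rank in RANKS]  # 52 cards, fixed order
CARD_INDEX = {card: i for i, card in enumerate(DECK)}


def encode_cards(cards: List[str]) -> List[float]:
    """Binary vector of length 52 with a 1.0 for each card present."""
    vec = [0.0] * len(DECK)
    for card in cards:
        vec[CARD_INDEX[card]] = 1.0
    return vec


def build_features(state: Dict[str, List[str]], feature_list: List[str]) -> List[float]:
    """Concatenate the encodings of the selected raw card-set features."""
    features: List[float] = []
    for name in feature_list:
        features.extend(encode_cards(state[name]))
    return features


def reward(points_won: int) -> float:
    """Reward is the negative of the points taken."""
    return -float(points_won)


# Hypothetical state, for illustration only
state = {"in_hand": ["2C", "QS"], "in_play": ["AH"], "played_cards": []}
x = build_features(state, ["in_hand", "in_play"])
print(len(x))
```

Under this assumed encoding, selecting a subset of features simply shortens the input vector, which is what makes reduced feature sets cheaper to train on.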

In [2]:
import gym
import multiprocessing
from gymhearts.Hearts import *
from gymhearts.Agent.agent_random import RandomAgent
from gymhearts.Agent.agent_mc_simple import MonteCarlo
from gymhearts.Agent.agent_mc_nn import MonteCarloNN
from gymhearts.Agent.agent_reinforce import REINFORCE_Agent
from gymhearts.Agent.utils_env import *
from gymhearts.Agent.utils_nn import *
from tqdm import tqdm_notebook



In [None]:
# ---------- EVALUATE MC SIMPLE AGENT --------------

# Number of episodes to run during model evaluation
NUM_EPISODES = 100

# Number of model evaluations to average together
NUM_TESTS = 1

# Max score for players to win the game
MAX_SCORE = 100

# Run testing on a random agent for comparison
run_random = False

# Name of the file that is saved :: {model_name}.th
model_name = 'final_mc_simple'

# Evaluation parameters for testing
mc_simple_config = {
    'print_info' : False,
    'load_model' : model_name
}

playersNameList = ['MonteCarlo', 'Rando', 'Randy', 'Randall']
agent_list = [0, 0, 0, 0]

agent_list[0] = MonteCarlo(playersNameList[0], mc_simple_config)
agent_list[1] = RandomAgent(playersNameList[1], {'print_info' : False})
agent_list[2] = RandomAgent(playersNameList[2], {'print_info' : False})
agent_list[3] = RandomAgent(playersNameList[3], {'print_info' : False})

In [None]:
# Function to test mc simple model with multiprocessing
def run_test(num_won):
    # Weird hack to make progress bars render properly
    print(' ', end='', flush=True)
    for i_ep in tqdm_notebook(range(NUM_EPISODES)):
        observation = env.reset()
        while True:
            now_event = observation['event_name']
            is_broadcast = observation['broadcast']
            action = None
            if is_broadcast:
                # Broadcast events go to every agent; no action is sent back
                for agent in agent_list:
                    agent.Do_Action(observation)
            else:
                # Only the player named in the event chooses an action
                player_name = observation['data']['playerName']
                for agent in agent_list:
                    if agent.name == player_name:
                        action = agent.Do_Action(observation)
            if now_event == 'GameOver':
                # Count the game if the evaluated agent is the reported winner
                num_won += int(observation['data']['Winner'] == 'MonteCarlo')
                break
            observation, reward, done, info = env.step(action)
    return num_won

In [None]:
env = gym.make('Hearts_Card_Game-v0')
env.__init__(playersNameList, MAX_SCORE)

mc_wins = [0] * NUM_TESTS
   
pool = multiprocessing.Pool(processes=NUM_TESTS)
mc_wins = pool.map(run_test, mc_wins)
pool.close()
pool.join()
print(f"Monte Carlo Simple won {sum(mc_wins)/len(mc_wins)} times on average :: {str(mc_wins)}")

In [None]:
# ----------- EVALUATE MC NN AGENT ---------------

# Features to include in model :: [in_hand, in_play, played_cards, won_cards, scores]
feature_list = ['in_hand', 'in_play']

# Name of the file that is saved :: {model_name}.th
model_name = 'final_mc_nn'

# Evaluation parameters for testing
mc_nn_config = {
    'print_info' : False,
    'load_model' : model_name,
    'feature_list' : feature_list
}

agent_list[0] = MonteCarloNN(playersNameList[0], mc_nn_config)


In [None]:
# Function to test mc nn model with multiprocessing
def run_test(num_won):
    # Weird hack to make progress bars render properly
    print(' ', end='', flush=True)
    for i_ep in tqdm_notebook(range(NUM_EPISODES)):
        observation = env.reset()
        while True:
            now_event = observation['event_name']
            is_broadcast = observation['broadcast']
            action = None
            if is_broadcast:
                # Broadcast events go to every agent; no action is sent back
                for agent in agent_list:
                    agent.Do_Action(observation)
            else:
                # Only the player named in the event chooses an action
                player_name = observation['data']['playerName']
                for agent in agent_list:
                    if agent.name == player_name:
                        action = agent.Do_Action(observation)
            if now_event == 'GameOver':
                # Count the game if the evaluated agent is the reported winner
                num_won += int(observation['data']['Winner'] == 'MonteCarlo')
                break
            observation, reward, done, info = env.step(action)
    return num_won

In [None]:
env = gym.make('Hearts_Card_Game-v0')
env.__init__(playersNameList, MAX_SCORE)

mc_wins = [0] * NUM_TESTS 

pool = multiprocessing.Pool(processes=NUM_TESTS)
mc_wins = pool.map(run_test, mc_wins)
pool.close()
pool.join()
print(f"Monte Carlo NN won {sum(mc_wins)/len(mc_wins)} times on average :: {str(mc_wins)}")

In [None]:
# ----------- EVALUATE REINFORCE AGENT ---------------

# Features to include in model :: [in_hand, in_play, played_cards, won_cards, scores]
feature_list = ['in_hand', 'in_play']

# Name of the file that is saved :: {model_name}.th
model_name = 'final_reinforce'

# Evaluation parameters for testing
reinforce_config = {
    'print_info' : False,
    'load_model' : model_name,
    'feature_list' : feature_list
}

playersNameList = ['REINFORCE', 'Rando', 'Randy', 'Randall']
agent_list[0] = REINFORCE_Agent(playersNameList[0], reinforce_config)

In [None]:
# Function to test reinforce model with multiprocessing
def run_test(num_won):
    # Weird hack to make progress bars render properly
    print(' ', end='', flush=True)
    for i_ep in tqdm_notebook(range(NUM_EPISODES)):
        observation = env.reset()
        while True:
            now_event = observation['event_name']
            is_broadcast = observation['broadcast']
            action = None
            if is_broadcast:
                # Broadcast events go to every agent; no action is sent back
                for agent in agent_list:
                    agent.Do_Action(observation)
            else:
                # Only the player named in the event chooses an action
                player_name = observation['data']['playerName']
                for agent in agent_list:
                    if agent.name == player_name:
                        action = agent.Do_Action(observation)
            if now_event == 'GameOver':
                # Count the game if the evaluated agent is the reported winner
                num_won += int(observation['data']['Winner'] == 'REINFORCE')
                break
            observation, reward, done, info = env.step(action)
    return num_won

In [None]:
env = gym.make('Hearts_Card_Game-v0')
env.__init__(playersNameList, MAX_SCORE)

reinforce_wins = [0] * NUM_TESTS

pool = multiprocessing.Pool(processes=NUM_TESTS)
reinforce_wins = pool.map(run_test, reinforce_wins)
pool.close()
pool.join()
print(f"REINFORCE won {sum(reinforce_wins)/len(reinforce_wins)} times on average :: {str(reinforce_wins)}")

# Results:
## Linear v NonLinear
 - The nonlinear model performs 5.12% better than the linear one, a statistically significant difference
 - Hyperparameters were studied for both models; the nonlinear model took 10 times as long
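
The notes do not say which significance test was used. As one possible check, win counts from two evaluation runs could be compared with a two-proportion z-test, sketched below with the standard library only. The function name `two_proportion_z_test` and the win counts in the usage line are made up for illustration and are not results from these experiments.

```python
import math
from typing import Tuple


def two_proportion_z_test(wins_a: int, games_a: int, wins_b: int, games_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two win rates."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    # Pooled win rate under the null hypothesis of equal win rates
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / games_a + 1.0 / games_b))
    if std_err == 0.0:
        raise ValueError("win rates are degenerate (all wins or all losses)")
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up counts, for illustration only
z, p = two_proportion_z_test(600, 1000, 500, 1000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A small p-value here would suggest that the gap in win rate is unlikely to come from evaluation noise alone.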

## Feature Combinations
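 - Features 1, 2 (in-hand, in-play) are the best pair at a 35% improvement (though not significantly different from using all features)
 - Features 1, 2, 5 (in-hand, in-play, scores) are the best triplet, and the result is significant
 - The 1, 2 pair is nearly half the size of the full set, so it trains significantly faster and takes less space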
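
The pairs and triplets compared above can be enumerated with `itertools.combinations`. The sketch below also reports an input dimension for each subset; the per-feature sizes in `FEATURE_SIZES` are assumptions chosen for illustration (a 52-card binary vector per card set and one score per player), not sizes taken from the trained models.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed input sizes per raw feature, for illustration only
FEATURE_SIZES: Dict[str, int] = {
    "in_hand": 52,
    "in_play": 52,
    "played_cards": 52,
    "won_cards": 52,
    "scores": 4,
}


def feature_subsets(k: int) -> List[Tuple[str, ...]]:
    """All subsets of exactly k raw features, in a stable order."""
    return list(combinations(FEATURE_SIZES, k))


def input_dim(subset: Tuple[str, ...]) -> int:
    """Total input dimension of a network fed the given features."""
    return sum(FEATURE_SIZES[name] for name in subset)


full_dim = input_dim(tuple(FEATURE_SIZES))
for subset in feature_subsets(2):
    print(subset, input_dim(subset), f"{input_dim(subset) / full_dim:.0%} of full")
```

Each subset can then be passed as the `feature_list` entry of an agent config so that every combination is trained and evaluated the same way.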

## REINFORCE
 - Lastly, we suspected the domain was better suited to policy gradient methods
 - Using REINFORCE with features 1, 2 reached a ~50% win rate
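
For reference, the core of REINFORCE is a loss that weights the log-probability of each chosen action by the return that followed it. The sketch below is a plain-Python illustration of that computation for one episode, assuming a softmax policy over action logits; it is not the `REINFORCE_Agent` code, and it omits the network, the masking of illegal cards, and any baseline.

```python
import math
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every time step."""
    returns: List[float] = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(logits_per_step: List[List[float]], actions: List[int],
                   rewards: List[float], gamma: float = 1.0) -> float:
    """Episode loss: -sum_t log pi(a_t | s_t) * G_t."""
    returns = discounted_returns(rewards, gamma)
    loss = 0.0
    for logits, action, g in zip(logits_per_step, actions, returns):
        log_prob = math.log(softmax(logits)[action])
        loss -= log_prob * g
    return loss
```

With rewards defined as the negative of points won, returns are never positive, so minimizing this loss lowers the probability of actions that preceded heavy point losses more than those that preceded light ones.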

# Conclusions:
 - Reduced raw feature sets can serve as a proxy for, and even perform better than, the full state
 - Researchers can start with smaller models and explore the raw state space without time-consuming feature engineering
 - Room for future research: many of the assumptions are very strict, agents were trained only against random opponents, and it is unclear how well an optimal player can do