## Reinforcement Learning Project - Interactive Fiction

Deterministic (the same state and action always lead to the same next state and reward)<br>
Model-free (we don't know all states or available actions in advance)<br>
Fully observable (we know what state we're in)<br>
TD updates (update after each step, not after a full episode)<br>
Finite (the game will end eventually)


## To do:
    -Agent methods to implement demo: With random actions, trained agent actions, walkthrough actions
    -Agent method to allow human input (? optional)
    -Outside function for training/printing every few intervals (every few training blocks)
    -Outside function to run (random agent) for a number of iterations and average their score/performance (to calc baseline)
    -Put outside functions into utils.py

In [13]:
from jericho import FrotzEnv  # Only FrotzEnv is used below; avoids a wildcard import
import numpy as np
import random
import re

### Utilities

In [221]:
def pretty_print_state(state):
    state = str(state)
    state = state.replace('\\\'', '\'')
    pattern = re.compile(r'\\n|b\'|b"')
    state = re.sub(pattern, ' ', state)
    state = state.strip().strip('\'').strip('\"')
    return state
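
An alternative sketch, assuming the raw state is a `bytes` object (as the `b'...'` outputs further down suggest): decoding the bytes avoids parsing the `repr` string, so an apostrophe after a "b" inside the game text cannot be mistaken for the `b'` prefix. `clean_state_text` is a hypothetical name, not part of Jericho.

```python
def clean_state_text(state):
    """Hypothetical helper: turn a raw game state (bytes or str) into one tidy line."""
    if isinstance(state, bytes):
        # Decode instead of parsing the repr, so no b'...' wrapper or escaped \n appears
        state = state.decode('utf-8', errors='replace')
    # split() with no arguments splits on any run of whitespace, including newlines
    return ' '.join(state.split())


print(clean_state_text(b'\nTaken.\n\n[Your score has just gone up by ten points.]\n'))
```

Unlike `pretty_print_state`, this collapses repeated whitespace to a single space, which may or may not be wanted for display.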

### Read in an interactive fiction work into Jericho environment
Currently: Detective

In [2]:
story_file = "z-machine-games-master/jericho-game-suite/detective.z5"

### RL problem setup:

States: Text at each step<br>
Actions: Sampled text (for Detective, either cardinal direction or [verb][noun] generally)

### Playthrough of game with agent making random actions

## Note: do this over many iterations and take average for baseline performance
## Have this as part of demo method

In [144]:
env = FrotzEnv(story_file)

In [145]:
env.reset()

('\n\n\n\n[Type "help" for more information about this version]\n\nDetective\nBy Matt Barringer.\nPorted by Stuart Moore.\nStuart_Moore@my-deja.com\nRelease 1 / Serial number 000715 / Inform v6.21 Library 6/10 SD\n\n<< Chief\'s office >>\nYou are standing in the Chief\'s office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in. "Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north or west.\n\nYou can see a piece of white paper here.\n\n[Your score has just gone up by ten points.]\n',
 {'moves': 1, 'score': 10})

In [215]:
env.get_valid_actions()

['north',
 'take note',
 'put wood down',
 'abstract wood to note',
 'abstract note to wood',
 'east']

In [216]:
env.step('take note')

('\nTaken.\n\n[Your score has just gone up by ten points.]\n',
 10,
 False,
 {'moves': 23, 'score': 70})

In [217]:
state_raw = env.get_state()[-1]
state_str = str(state_raw)

Taken.  [Your score has just gone up by ten points.] 


In [4]:
# Maximum possible score in game
print(f'Maximum possible score in game: {env.get_max_score()}')

Maximum possible score in game: 360


In [5]:
# Reset game to initial state S_0
env.reset()

while not (env.game_over() or env.victory()):
    # Here, I use the environment to check what are possible actions, and choose a random one.
    # I may or may not do this in the future (sample my own words for actions, or attempt that as a possible agent)
    available_actions = env.get_valid_actions()
    random_action = random.choice(available_actions)
    
    print(f'State: {env.get_state()[-1]}')
    print(f'Random action chosen: {random_action}')
    env.step(random_action)
    print()

State: b'\n\n\n\n[Type "help" for more information about this version]\n\nDetective\nBy Matt Barringer.\nPorted by Stuart Moore.\nStuart_Moore@my-deja.com\nRelease 1 / Serial number 000715 / Inform v6.21 Library 6/10 SD\n\n<< Chief\'s office >>\nYou are standing in the Chief\'s office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in. "Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north or west.\n\nYou can see a piece of white paper here.\n\n[Your score has just gone up by ten points.]\n'
Random action chosen: east

State: b'\nYou can\'t go east from here!\n\n<< Chief\'s office >>\nYou are standing in the Chief\'s office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in. "Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north

In [6]:
# Output final score of randomly played game
print(f'Final score of random choice game: {env.get_score()}')

Final score of random choice game: 30


### Playthrough of optimal choices in game (walkthrough)

In [7]:
env.reset()

for action in env.get_walkthrough():
    print(f'State: {env.get_state()[-1]}')
    print(f'Action from walkthrough: {action}')
    env.step(action)
    print()

State: b'\n\n\n\n[Type "help" for more information about this version]\n\nDetective\nBy Matt Barringer.\nPorted by Stuart Moore.\nStuart_Moore@my-deja.com\nRelease 1 / Serial number 000715 / Inform v6.21 Library 6/10 SD\n\n<< Chief\'s office >>\nYou are standing in the Chief\'s office. He is telling you "The Mayor was murdered yeaterday night at 12:03 am. I want you to solve it before we get any bad publicity or the FBI has to come in. "Yessir!" You reply. He hands you a sheet of paper. Once you have read it, go north or west.\n\nYou can see a piece of white paper here.\n\n[Your score has just gone up by ten points.]\n'
Action from walkthrough: TAKE PAPER

State: b'\nTaken.\n\n[Your score has just gone up by ten points.]\n'
Action from walkthrough: READ PAPER

State: b"\nCONFIDENTIAL:\nDetective was created by Matt Barringer.\nHe has worked hard on this so you better enjoy it.\nI did have fun making it though. But I'd REALLY appreciate it if you were kind enough to send a postcard or..

In [8]:
env.get_score()

360

### An agent that learns from the environment

Attempting Q-learning, based on Lab 4 Tic-Tac-Toe Agent
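
As a small worked example of the update rule used by the agent, the sketch below applies one tabular Q-learning step to a nested dictionary. `q_update` and the state names `'office'` and `'hallway'` are hypothetical and chosen only for illustration; the defaults `alpha=1.0` and `gamma=0.999` mirror the agent's defaults.

```python
def q_update(q, state, action, reward, next_state, alpha=1.0, gamma=0.999):
    """One tabular Q-learning step on a nested dict q[state][action]; unseen entries count as 0."""
    current = q.get(state, {}).get(action, 0.0)
    # Bootstrapped estimate of the best value reachable from the next state
    best_next = max(q.get(next_state, {}).values(), default=0.0)
    target = reward + gamma * best_next
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    q.setdefault(state, {})[action] = current + alpha * (target - current)
    return q[state][action]


q = {'hallway': {'north': 20.0}}
new_value = q_update(q, 'office', 'take paper', reward=10, next_state='hallway')
print(new_value)
```

With `alpha=1` the old estimate is replaced entirely by the target, which is reasonable for a deterministic game but would be too aggressive if transitions or rewards were noisy.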

# NOTE: To improve performance, memoize seen states and their .get_valid_actions() actions?
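
One way the memoization idea might be sketched is shown below. `ValidActionCache` and `_CountingEnv` are hypothetical names, and the sketch assumes the state key fully determines the valid actions, which may not hold if the same text can appear with different inventories.

```python
class ValidActionCache:
    """Hypothetical cache: remember the valid-action list for each state key."""

    def __init__(self, env):
        self.env = env
        self.cache = {}

    def get(self, state_key):
        # Only ask the environment the first time a state key is seen
        if state_key not in self.cache:
            self.cache[state_key] = list(self.env.get_valid_actions())
        return self.cache[state_key]


class _CountingEnv:
    """Stand-in environment that counts how often it is queried."""
    def __init__(self):
        self.calls = 0

    def get_valid_actions(self):
        self.calls += 1
        return ['north', 'take note']


env_stub = _CountingEnv()
cache = ValidActionCache(env_stub)
first = cache.get('office')
second = cache.get('office')  # Served from the cache, no second environment call
```

The caller has to pass the key for the state the environment is currently in, since the cache simply queries the environment on a miss.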

In [9]:
class Agent():
    '''A Q-learning agent with TD(0) updates, compatible with Jericho's FrotzEnv for interactive fiction'''
    
    def __init__(self, env, epsilon=0.1, alpha=1, gamma=0.999, default_actions=('n','s','e','w')):
        self.V = dict() # Build up the values of different states as we encounter them; Note the Markov assumption
        self.game = env  # Need to initialize as FrotzEnv("path-to-game-file.z3/5/8"); .reset() for new game
        self.epsilon = epsilon
        self.alpha = alpha  # Learning rate; proportion of updated Q-value that consists of the new Q-value
        self.gamma = gamma  # Discount factor on future rewards
        self.default_actions = default_actions  # For edge case where self.game.get_valid_actions() = [];
                                                # e.g. A direction gets agent out of loop in Detective
    
    def get_sa_value(self, state, action):
        '''Look up state-action value. If never seen state-action combo, then assume neutral.'''
        if state in self.V.keys():
            if action in self.V[state].keys():
                return self.V[state][action]
        return 0
    
    def put_sa_value(self, state, action, value):
        if state not in self.V.keys():
            self.V[state] = dict()
        self.V[state][action] = value
    
    def get_max_state_value(self, state):
        if state not in self.V.keys():  # If state not encountered yet, its max value is initial val of 0
            return 0
        else:
            return max(self.V[state].values())
    
    def find_best_action(self):
        '''Find best action when exploiting/maximizing expected value (greedy choice)'''
        state = self.game.get_state()[-1]
        available_actions = self.game.get_valid_actions()
        if not available_actions: available_actions = self.default_actions  # For edge case of no valid actions
        # Take the max over the currently available actions (unseen ones count as 0), so the
        # candidate list below can never be empty
        max_val = max(self.get_sa_value(state, a) for a in available_actions)
        max_val_actions = [a for a in available_actions if self.get_sa_value(state, a)==max_val]
        return random.choice(max_val_actions)
        
    def learn_select_action(self):
        '''Select best action with probability 1-epsilon (exploit), select random action with probability 
        epsilon (explore)'''
        best_action = self.find_best_action()
        available_actions = self.game.get_valid_actions()
        if not available_actions: available_actions = self.default_actions  # For edge case of no valid actions
        if np.random.uniform(0, 1) < self.epsilon:
            return random.choice(available_actions)
        else:
            return best_action
    
    def learn_from_action(self):
        "Q-learning"
        state = self.game.get_state()[-1]  # s: Current state
        action = self.learn_select_action()
        
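        Q_sa = self.get_sa_value(state, action)  # Q(s, a): Current state-action value. self.V[state][action]
        # NOTE: This steps the game forward to the next state!! Don't step again!
        _, reward, _, _ = self.game.step(action)
        # Key s' the same way as s: step() returns the observation as a str, which would never
        # match the bytes keys from get_state()[-1] stored in self.V
        state_prime = self.game.get_state()[-1]
        max_a_Q_sa = self.get_max_state_value(state_prime)  # max a  Q(s',a): Best value available from s'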
        
        # Update the Q-value with the current Q-value + (learning rate)*(new Q-value - current Q-value)
        # Q(s, a) <- Q(s, a) + alpha*(r + gamma*(max a  Q(s',a)) - Q(s, a))
        new_Q_sa = Q_sa + self.alpha*(reward + self.gamma*max_a_Q_sa - Q_sa)
        self.put_sa_value(state, action, new_Q_sa)
        
    def learn_from_episode(self):
        "Update Values based on reward."
        self.game.reset()
        while not (self.game.game_over() or self.game.victory()):
            self.learn_from_action()
        # NOTE: The line below is to update the value in V for last state, after the game ends.
        # Since we get points throughout the game, and since (e.g. in Detective) you get a lot of points for
        # winning the game, and value updates for those points are captured prior to the terminal state,
        # we do not need to save a reward for the terminal state. 
        # If we do, the reward is likely 0 
        # (or current state's .get_score() - previous state's .get_score(); configure in learn_from action)
        #self.V[self.game.get_state()[-1]][''] = 0
    
    def learn_game(self, n_episodes=1_000):
        "Let's learn through complete experience to get that reward."
        for episode in range(1, n_episodes+1):
            self.learn_from_episode()
            # NOTE: CLEAN UP LATER: Adding printing while troubleshooting
            #if episode%100==0:
            print(f'Episode {episode}')
            print(f'Score of game: {self.game.get_score()}')
    
    # Game demo functions (stubs, not yet implemented)
    def play_select_action(self):
        pass
    
    def demo_game(self):
        pass
    
    def interactive_game(self):
        pass
    
    def request_human_action(self):
        pass
    

In [10]:
agent = Agent(FrotzEnv(story_file))

In [11]:
agent.learn_game(10)

Episode 1
Score of game: 70
Episode 2
Score of game: 50
Episode 3
Score of game: 100
Episode 4
Score of game: 60
Episode 5
Score of game: 70
Episode 6
Score of game: 100
Episode 7
Score of game: 130
Episode 8
Score of game: 50
Episode 9
Score of game: 70
Episode 10
Score of game: 120


Go from end to end

Parking lot of ideas to implement if time<br>
    -Experience Replay: Keep working memory of what I've seen. Make updates to Q-function: Good for sparse rewards<br>
    -Use Q-learning agent for other interactive fiction stories<br>
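
For the experience replay idea above, a minimal buffer could look like the sketch below. `ReplayBuffer` and its method names are hypothetical, and the transition layout `(state, action, reward, next_state, done)` is an assumption. It only stores and samples transitions; replaying them through the Q-update is left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size memory of past transitions for experience replay."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen silently drops the oldest transition when full
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never ask for more transitions than are stored
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(f's{i}', 'north', 10 * i, f's{i + 1}', False)
batch = buffer.sample(2)
```

Replaying stored transitions that carried a reward lets a rare score increase update the Q-table more than once, which is why this tends to help with sparse rewards.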

But finish Q-learning first and have it perform above random


Rubric: 
Need at least 1 notebook (training notebook)<br>
Up to me if I want to move things into modules (scripts) and import


Explain: What is and is not possible (e.g. can't just throw into OpenAI)

### Agent from stable-baselines

Need to define a custom environment to connect to Stable Baselines agents (which themselves are based on OpenAI Gym)
https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html

In [None]:
import gym
from gym import spaces

class IFEnv(gym.Env, FrotzEnv):
    '''Interactive Fiction that follows gym interface and inherits from Jericho's FrotzEnv'''
    metadata = {'render.modes': ['human']}

    def __init__(self, story_file, seed=None):
        self.story_file = story_file
        self._seed = seed
        super().__init__(self.story_file, self._seed)
        
        self.action_dict = dict()
        self.observation_dict = dict()
        self.n_actions = -1
        self.n_states = -1
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
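        # NOTE: spaces.Discrete needs a positive size. n_actions/n_states hold the highest index
        # assigned so far (-1 = none yet), so the size is index + 1, floored at 1 as a placeholder
        self.action_space = spaces.Discrete(max(self.n_actions + 1, 1))
        self.observation_space = spaces.Discrete(max(self.n_states + 1, 1))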
        # Example for using image as input:
        #self.observation_space = spaces.Box(low=0, high=255,
                                            #shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
    
    # Feature hashing: Consistently map feature to hashed value in lookup table 
    
    def update_action_dictionary(self):
        available_actions = self.get_valid_actions()
        for a in available_actions:
            if a not in self.action_dict.keys():
                self.n_actions += 1
                self.action_dict[a] = self.n_actions
    
    def update_observation_dictionary(self):
        state = self.get_state()[-1]
        if state not in self.observation_dict.keys():
            self.n_states += 1
            self.observation_dict[state] = self.n_states
    
        
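    def step(self, action):
        '''Make one time step using given action. Return next obs, reward, done (is game over), info'''
        # Call FrotzEnv.step explicitly: self.step(action) would recurse forever, and
        # super().step would resolve to gym.Env.step, which comes first in the MRO
        observation, reward, done, info = FrotzEnv.step(self, action)
        return observation, reward, done, info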
    
    def reset(self):
        '''Reset game to initial state'''
        FrotzEnv.reset(self)  # Explicit parent call; self.reset() would recurse forever
        return self.get_state()[-1]  # reward, done, info can't be included
    
#     def render(self, mode='human'):
#     ...
    
#     def close (self):
#     ...

In [None]:
action_dictionary = {'north':0, 'south':1, 'east':2, 'west':3}
max(action_dictionary.values())
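
The lookup-table idea behind `action_dictionary` can be sketched as a small two-way table that assigns the next free integer to each new text item. `IndexTable` is a hypothetical name and is not part of gym or Jericho; the reverse lookup is included on the assumption that an integer action chosen by a Gym-style agent has to be turned back into a text command.

```python
class IndexTable:
    """Hypothetical two-way lookup between text items and consecutive integer ids."""

    def __init__(self, initial=()):
        self.index_of = {}
        self.items = []
        for item in initial:
            self.add(item)

    def add(self, item):
        # Assign the next free index the first time an item is seen
        if item not in self.index_of:
            self.index_of[item] = len(self.items)
            self.items.append(item)
        return self.index_of[item]

    def lookup(self, index):
        # Map an integer id back to its text
        return self.items[index]

    def __len__(self):
        return len(self.items)


actions = IndexTable(['north', 'south', 'east', 'west'])
take_note_id = actions.add('take note')
```

Using `len(self.items)` as the next index keeps the ids dense, so `len(actions)` is directly usable as the size of a discrete space.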

**TO DO:** Currently needs seed=int, breaks if seed=None. Need to fix

In [None]:
from stable_baselines.common.env_checker import check_env
from stable_baselines import A2C

In [None]:
env = IFEnv("z-machine-games-master/jericho-game-suite/detective.z5", 1)  

In [None]:
# It will check your custom environment and output additional warnings if needed
check_env(env)

In [None]:
# Then you can define and train a RL agent with:

# Define and Train the agent
# 'MlpPolicy' rather than 'CnnPolicy': the observations here are discrete indices, not images
model = A2C('MlpPolicy', env).learn(total_timesteps=1000)