In [1]:
import gym
import gym_blackjack
from monte_carlo_search import Monte_Carlo_Search

<h3>The Agent</h3>

In [2]:
# The agent is designed to be very easy
# to instantiate and begin learning. Hyperparameter
# selection and loading previous models are both
# available, but not required.

env = gym.make('blackjack-v0')
env.reset()
agent = Monte_Carlo_Search(env)

In [3]:
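# A single game is played here as a quick demonstration; the model
# evaluated below was trained on one million games.
for _ in range(1):
    agent.one_game()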

In [4]:
agent.value_fn.to_csv("trained_model.csv")

<h3>Evaluation</h3>

Blackjack is a solved game with a mathematically proven optimal strategy. We evaluate the RL agent by how closely its policy converges to this optimal strategy. To do this, we create a DataFrame containing every possible state and the correct play for each, then compare that with the action the model prefers in each state.

In [5]:
from itertools import product

states_list = list(product(range(12,22), range(2,11), range(2), [0]))
states_aces = list(product(range(12,22), [11], range(2), [1]))

states_list.extend(states_aces)

In [6]:
import pandas as pd

df = pd.DataFrame(data=states_list, columns=['Player Value', 'Dealer Upcard', 'Player Ace', 'Dealer Ace'])
df['State'] = df.apply(lambda x: tuple(x), axis=1)

In [7]:
df.head()

Unnamed: 0,Player Value,Dealer Upcard,Player Ace,Dealer Ace,State
0,12,2,0,0,"(12, 2, 0, 0)"
1,12,2,1,0,"(12, 2, 1, 0)"
2,12,3,0,0,"(12, 3, 0, 0)"
3,12,3,1,0,"(12, 3, 1, 0)"
4,12,4,0,0,"(12, 4, 0, 0)"


In [8]:
# Initialize the Correct column to hit (1) for every state;
# the cells below overwrite it with stay (0) where appropriate
df['Correct'] = 1

In [9]:
# Modify the Correct Action column to include when it 
# is appropriate to stay (no double down or splits)
# Taken from:
# https://www.blackjackapprenticeship.com/blackjack-strategy-charts/

# Hands without an ace: stay on 17+, on 13-16 against a dealer 2-6,
# and on 12 against a dealer 4-6
df.loc[(df['Player Value'] >= 17) & (df['Player Ace'] == 0), 'Correct'] = 0
df.loc[(df['Player Value'] >= 13) & (df['Player Ace'] == 0) &
       (df['Player Value'] <= 16) & (df['Dealer Upcard'] <= 6), 'Correct'] = 0
df.loc[(df['Player Value'] == 12) & (df['Player Ace'] == 0) &
       (df['Dealer Upcard'] >= 4) & (df['Dealer Upcard'] <= 6), 'Correct'] = 0

# Hands with an ace: stay on 19+, and on 18 against a dealer 2-8
df.loc[(df['Player Value'] >= 19) & (df['Player Ace'] == 1), 'Correct'] = 0
df.loc[(df['Player Value'] == 18) & (df['Player Ace'] == 1) &
       (df['Dealer Upcard'] <= 8), 'Correct'] = 0

In [10]:
df.drop(['Player Value', 'Dealer Upcard', 'Player Ace', 'Dealer Ace'], axis=1, inplace=True)

df.head()

Unnamed: 0,State,Correct
0,"(12, 2, 0, 0)",1
1,"(12, 2, 1, 0)",1
2,"(12, 3, 0, 0)",1
3,"(12, 3, 1, 0)",1
4,"(12, 4, 0, 0)",0
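The stay rules encoded with `df.loc` above can also be written as a plain function, which makes it easier to spot-check individual states. This is only a sketch: the name `basic_strategy` is hypothetical and not part of the notebook, and it assumes the same encoding as the DataFrame (1 = hit, 0 = stay, dealer ace stored as 11).

```python
def basic_strategy(player_value, dealer_upcard, player_ace):
    """Return 1 to hit or 0 to stay (no doubling down or splitting)."""
    if player_ace:
        # Hands with an ace: stay on 19+, and on 18 against a dealer 2-8.
        if player_value >= 19:
            return 0
        if player_value == 18 and dealer_upcard <= 8:
            return 0
        return 1
    # Hands without an ace.
    if player_value >= 17:
        return 0
    if 13 <= player_value <= 16 and dealer_upcard <= 6:
        return 0
    if player_value == 12 and 4 <= dealer_upcard <= 6:
        return 0
    return 1


# Spot checks against the first and last rows of the table above.
print(basic_strategy(12, 2, 0))  # hit
print(basic_strategy(12, 4, 0))  # stay
```

Applying the function to every state tuple should reproduce the `Correct` column, which gives an independent check on the chained `df.loc` assignments.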


In [11]:
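# Prepare the agent's action choices.
# Note: this rebinds `agent` from the Monte_Carlo_Search object
# to a DataFrame of per-state statistics.

agent = pd.read_csv("test_monte_carlo.csv")

# Win ratio for each action (1 = hit, 0 = stay)
agent['1_ratio'] = agent['1_win'] / agent['1_count']
agent['0_ratio'] = agent['0_win'] / agent['0_count']

# Default to stay; choose hit only when its ratio is strictly higher,
# so ties resolve to stay
agent['Agent'] = 0

agent.loc[agent['1_ratio'] > agent['0_ratio'], 'Agent'] = 1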

agent.head()

Unnamed: 0.1,Unnamed: 0,State,1_win,1_count,0_win,0_count,1_ratio,0_ratio,Agent
0,0,"(12, 2, 0, 0)",2684.5,7084,999.0,2837,0.378953,0.352133,1
1,1,"(12, 2, 1, 0)",258.5,448,22.0,54,0.577009,0.407407,1
2,2,"(12, 3, 0, 0)",2923.0,7549,363.0,974,0.387204,0.37269,1
3,3,"(12, 3, 1, 0)",1.0,1,1.0,1,1.0,1.0,0
4,4,"(12, 4, 0, 0)",813.5,2029,3871.0,9407,0.400936,0.411502,0
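In the output above, state (12, 3, 1, 0) was tried only once per action, so both of its ratios are 1.0 and the resulting choice carries little information. The sketch below flags such under-sampled states, using the counts copied from the rows above. The threshold `MIN_VISITS` is a hypothetical value chosen for illustration, not something the notebook defines.

```python
import pandas as pd

# Counts copied from the agent.head() output above.
sample = pd.DataFrame({
    "State": ["(12, 2, 0, 0)", "(12, 2, 1, 0)", "(12, 3, 0, 0)",
              "(12, 3, 1, 0)", "(12, 4, 0, 0)"],
    "1_count": [7084, 448, 7549, 1, 2029],
    "0_count": [2837, 54, 974, 1, 9407],
})

MIN_VISITS = 100  # hypothetical threshold, tune to taste

# A state counts as under-sampled if either action has been tried
# fewer than MIN_VISITS times.
sample["min_count"] = sample[["1_count", "0_count"]].min(axis=1)
under_sampled = sample[sample["min_count"] < MIN_VISITS]

print(under_sampled[["State", "min_count"]])
```

With this threshold the two ace-holding states in the sample are flagged, which is consistent with those hands being dealt less often.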


In [12]:
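# Compare the agent's choices with the mathematically optimal strategy
# and show every state where they disagree.
# The assignment below aligns on the row index, so it assumes the CSV
# lists the states in the same order as df.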

df['Agent'] = agent['Agent']

df[df['Correct'] != df['Agent']]

Unnamed: 0,State,Correct,Agent
3,"(12, 3, 1, 0)",1,0
8,"(12, 6, 0, 0)",0,1
109,"(18, 2, 1, 0)",0,1
117,"(18, 6, 1, 0)",0,1
193,"(18, 11, 1, 1)",1,0
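The comparison above copies the `Agent` column by row position, which only works if the CSV lists states in the same order as `df`. A more defensive variant joins the two tables on the state itself. The sketch below uses small stand-in frames (`reference`, `loaded`) and a helper column `state_key`, all hypothetical names, and assumes the CSV stores each state as a string such as "(12, 2, 0, 0)", as the output above suggests.

```python
import pandas as pd

# Stand-in for df: states as tuples with the correct action.
reference = pd.DataFrame({
    "State": [(12, 2, 0, 0), (12, 3, 1, 0), (12, 4, 0, 0)],
    "Correct": [1, 1, 0],
})

# Stand-in for the CSV: states as strings, deliberately in another order.
loaded = pd.DataFrame({
    "State": ["(12, 4, 0, 0)", "(12, 2, 0, 0)", "(12, 3, 1, 0)"],
    "Agent": [0, 1, 0],
})

# Build a string key in the same format the CSV uses.
reference["state_key"] = reference["State"].apply(
    lambda t: str(tuple(int(v) for v in t))
)

merged = reference.merge(
    loaded.rename(columns={"State": "state_key"}),
    on="state_key",
    how="left",
    validate="one_to_one",  # fail loudly on duplicated states
)

mismatches = merged.loc[merged["Correct"] != merged["Agent"],
                        ["State", "Correct", "Agent"]]
print(mismatches)
```

A left join also leaves `Agent` as NaN for any state missing from the CSV, so gaps show up as mismatches rather than being silently misaligned.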


<h3>Results</h3>

Using an agent trained on one million games, the learned policy matches the optimal strategy in 195 of the 200 possible game states (97.5%). Four of the five disagreements are states where the player holds an ace with a value of 12 or 18, which are dealt relatively rarely; the fifth is a 12 without an ace against a dealer 6. We can fix these by having the agent play more Search games when it arrives at those states. We could also fix them by having it play more games overall, but that would be less efficient: it relies on reaching each state by random chance rather than dwelling on it longer when it does arise.
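The 97.5% figure can be checked from the state enumeration and the comparison table: the notebook builds 200 states, and 5 of them appear in the mismatch output. A short sketch of that arithmetic follows; note that it weights every state equally, regardless of how often the state occurs in play.

```python
from itertools import product

# Same enumeration as the notebook: player 12-21, dealer 2-10 or ace (11),
# player ace flag 0/1.
states = list(product(range(12, 22), range(2, 11), range(2), [0]))
states += list(product(range(12, 22), [11], range(2), [1]))

total_states = len(states)
mismatched = 5  # rows listed in the comparison output

accuracy = 1 - mismatched / total_states
print(f"{total_states} states, accuracy {accuracy:.1%}")
```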