#### How to use
* This notebook builds (or loads) the list of all superleaves of 1 to 6 tiles, reads a log of moves from simulated games, and then estimates the expected value of each superleave from the scores of the racks that contained it.

#### To-do
* The frequency/count estimate for superleaves is currently calculated incorrectly.


#### Changelog
* 11/27/19 - wow, it's been a while. Stopped loading all moves into memory (yikes) and instead wrote a much faster version that can go through 50M moves on my local machine in ~3 hours.
* 1/27/19 - Determined that the speed of creation of the rack dataframes is a function of the length of the dataframe. From that, realized that we should organize leaves by least-frequent to most-frequent letter, such that sub-dataframes are created from the shortest racks possible.

In [1]:
import csv
from datetime import date
from itertools import combinations
import numpy as np
import pandas as pd
import pickle as pkl
import seaborn as sns
from string import ascii_uppercase
import time

%matplotlib inline

maximum_superleave_length = 1

log_file = 'log_20191205.csv'
# log_file = 'log_1m.csv'

todays_date = date.today().strftime("%Y%m%d")

In [2]:
todays_date

'20191208'

Create a dictionary of all possible 1- to 6-tile leaves. Also add support for sorting by an arbitrary key, which lets us put the rarest letters first.

In [3]:
# tilebag = ['A']*9+['B']*2+['C']*2+['D']*4+['E']*12+\
#           ['F']*2+['G']*3+['H']*2+['I']*9+['J']*1+\
#           ['K']*1+['L']*4+['M']*2+['N']*6+['O']*8+\
#           ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
#           ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
#           ['Z']*1+['?']*2

# No superleave is longer than 6 letters, and so we only need to include
# 6 each of the As, Es, Is and Os. This shortens the time it takes to find all of
# the superleaves by 50%!
truncated_tilebag = \
          ['A']*6+['B']*2+['C']*2+['D']*4+['E']*6+\
          ['F']*2+['G']*3+['H']*2+['I']*6+['J']*1+\
          ['K']*1+['L']*4+['M']*2+['N']*6+['O']*6+\
          ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
          ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
          ['Z']*1+['?']*2
            
tiles = list(ascii_uppercase) + ['?']

# potential future improvement: calculate optimal order of letters on the fly
# rarity_key = 'ZXKJQ?HYMFPWBCVSGDLURTNAOIE'
alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'
sort_func = lambda x: alphabetical_key.index(x)
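
To illustrate how the choice of key changes the canonical spelling of a leave, here is a minimal sketch. The helper name `sort_leave` and the sample leave are illustrative only; the two key strings are the ones shown in the cell above.

```python
# Both key strings come from the cell above; sort_leave is a hypothetical helper.
rarity_key = 'ZXKJQ?HYMFPWBCVSGDLURTNAOIE'
alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def sort_leave(leave, key):
    """Return the tiles of `leave` ordered by their position in `key`."""
    return ''.join(sorted(leave, key=key.index))

print(sort_leave('SATIRE', alphabetical_key))  # AEIRST
print(sort_leave('SATIRE', rarity_key))        # SRTAIE
```

Whichever key is used, it has to be applied consistently when leaves are generated and when racks are split into subleaves, so that the dictionary keys match.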

On my home machine, the following code takes about 7 minutes to run in its entirety.

In [4]:
# t0 = time.time()

# leaves = {i:sorted(list(set(list(combinations(truncated_tilebag,i))))) for i in 
#           range(1,maximum_superleave_length+1)}

# # turn leaves from lists of letters into strings
# # algorithm runs faster if leaves non-alphabetical!
# for i in range(1,maximum_superleave_length+1):
#     leaves[i] = [''.join(sorted(leave, key=sort_func))
#                  for leave in leaves[i]]

# t1 = time.time()
# print('Calculated superleaves up to length {} in {} seconds'.format(
#     maximum_superleave_length,t1-t0))

# pkl.dump(leaves,open('all_leaves.p','wb'))
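
The truncated-tilebag idea can be checked on a toy bag. The sketch below is not part of the notebook: the helper `unique_leaves` and both toy bags are made up for illustration. It shows that capping each tile's count at the maximum leave length does not change the set of distinct leaves.

```python
from itertools import combinations

def unique_leaves(bag, length):
    """Distinct leaves of a given length, each spelled in sorted order."""
    return sorted({''.join(sorted(c)) for c in combinations(bag, length)})

# Toy bags (hypothetical): five As are capped at two because leaves have length 2.
full_bag = ['A'] * 5 + ['B'] * 2 + ['?']
truncated_bag = ['A'] * 2 + ['B'] * 2 + ['?']

leaves_full = unique_leaves(full_bag, 2)
leaves_truncated = unique_leaves(truncated_bag, 2)

print(leaves_full)                      # ['?A', '?B', 'AA', 'AB', 'BB']
print(leaves_full == leaves_truncated)  # True
```

The truncated bag yields the same leaves while generating far fewer raw combinations, which is where the time saving comes from.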

In [5]:
with open('all_leaves.p', 'rb') as f:
    leaves = pkl.load(f)

How many superleaves are there of each length? See below:

In [6]:
for i in range(1,maximum_superleave_length+1):
    print(i,len(leaves[i]))

1 27


### Define the metrics we tally for each subleave
Currently, we track the following metrics for each new rack:
* Total points
* Count (how many times the subleave has appeared in the data set)
* Bingo count

In [7]:
all_leaves = []

for i in range(1,maximum_superleave_length+1):
    all_leaves.extend(leaves[i])

In [8]:
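def find_subleaves(rack, min_length=1, max_length=6, duplicates_allowed=False):
    # Sort the rack first so that every combination comes out in key order.
    # Otherwise permutations such as ('A','B') and ('B','A') both survive the
    # set() and the same subleave is returned twice for an unsorted rack.
    rack = sorted(rack, key=sort_func)
    lengths = range(min_length, max_length+1)
    if not duplicates_allowed:
        return [''.join(x) for i in lengths for x in set(combinations(rack, i))]
    return [''.join(x) for i in lengths for x in combinations(rack, i)]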
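
As a quick illustration of what splitting a rack into subleaves produces, here is a self-contained sketch. The function `subleaves_demo` is a simplified stand-in written for this example, not the notebook's `find_subleaves`, and it assumes the alphabetical key with the blank first.

```python
from itertools import combinations

alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def subleaves_demo(rack, min_length=1, max_length=6):
    """Distinct subleaves of `rack`, each spelled in key order (illustrative)."""
    ordered = sorted(rack, key=alphabetical_key.index)
    found = {''.join(c)
             for i in range(min_length, max_length + 1)
             for c in combinations(ordered, i)}
    return sorted(found)

print(subleaves_demo('AAB', max_length=2))  # ['A', 'AA', 'AB', 'B']
print(subleaves_demo('BAA', max_length=2))  # same result, whatever the rack order
```

Each distinct subleave appears once per rack, so a rack holding two As adds one to the count for `A`, not two.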

*tile_limit* below is the minimum number of tiles that must remain in the bag for a move to be factored into the superleave calculation. The idea is that moves made with the bag empty tend to be worth less and may not reflect the value of a letter during the rest of the game (most notably, if you hold the blank and the bag is empty, you often can't bingo!).

In [9]:
t0 = time.time()

tile_limit = 1
points = {x:0 for x in all_leaves}
count = {x:0 for x in all_leaves}
bingo_count = {x:0 for x in all_leaves}
total_points = 0
row_count = 0

with open(log_file,'r') as f:
    moveReader = csv.reader(f)
    next(moveReader)
    
    for i,row in enumerate(moveReader):
        if i%1000000==0:
            t = time.time()
            print('Processed {} rows in {} seconds'.format(i,t-t0))
        
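        if int(row[10]) >= tile_limit:

            # parse the score once per row instead of once per subleave
            score = int(row[5])
            total_points += score
            row_count += 1

            for subleave in find_subleaves(row[3],
                    max_length=maximum_superleave_length):
                points[subleave] += score
                count[subleave] += 1
                # row[7] == '7' flags a bingo; the boolean adds 0 or 1
                bingo_count[subleave] += int(row[7] == '7')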
                
t1 = time.time()
print('{} seconds to populate dictionaries'.format(t1-t0))

Processed 0 rows in 0.001096963882446289 seconds
Processed 1000000 rows in 10.318333864212036 seconds
Processed 2000000 rows in 20.298481941223145 seconds
Processed 3000000 rows in 30.274394750595093 seconds
Processed 4000000 rows in 40.33357286453247 seconds
Processed 5000000 rows in 50.536700963974 seconds
Processed 6000000 rows in 60.50753998756409 seconds
Processed 7000000 rows in 70.49202680587769 seconds
Processed 8000000 rows in 80.44721102714539 seconds
Processed 9000000 rows in 90.39992785453796 seconds
Processed 10000000 rows in 100.56511902809143 seconds
Processed 11000000 rows in 110.83525085449219 seconds
Processed 12000000 rows in 121.03592872619629 seconds
Processed 13000000 rows in 131.40489602088928 seconds
Processed 14000000 rows in 142.6002266407013 seconds
Processed 15000000 rows in 152.6085078716278 seconds
Processed 16000000 rows in 162.65071177482605 seconds
Processed 17000000 rows in 172.55124473571777 seconds
Processed 18000000 rows in 182.68831086158752 seconds
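
The tallying loop can be tried on a tiny in-memory log. This is only a sketch: the column names (`rack`, `score`, `tiles_played`, `tiles_in_bag`) and the three rows are invented for the example, and the real log's layout may differ. It is restricted to single-tile leaves to stay short.

```python
import csv
import io
from collections import defaultdict

# Hypothetical log; the column names and rows are assumptions for illustration.
toy_log = io.StringIO(
    "rack,score,tiles_played,tiles_in_bag\n"
    "AEINRST,70,7,50\n"
    "AQ,11,1,0\n"
    "EIQ,12,2,20\n"
)

tile_limit = 1
points = defaultdict(int)
count = defaultdict(int)
bingo_count = defaultdict(int)

for row in csv.DictReader(toy_log):
    if int(row['tiles_in_bag']) < tile_limit:
        continue  # skip moves made once the bag is empty
    score = int(row['score'])
    is_bingo = row['tiles_played'] == '7'
    for tile in set(row['rack']):  # single-tile leaves only
        points[tile] += score
        count[tile] += 1
        bingo_count[tile] += int(is_bingo)

print(points['E'], count['E'], bingo_count['E'])  # 82 2 1
```

The second row is dropped by the `tile_limit` filter, so `A` is counted once and `Q` picks up only the 12-point move.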

In [10]:
ev_df = pd.concat([pd.Series(points, name='points'),
                  pd.Series(count, name='count'),
                  pd.Series(bingo_count, name='bingo_count')],
                  axis=1)

In [11]:
mean_score = total_points/row_count

In [12]:
ev_df['mean'] = ev_df['points']/ev_df['count']
ev_df['bingo pct'] = 100*ev_df['bingo_count']/ev_df['count']
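# NOTE: the denominator here is the number of leaves (len(ev_df)), not the
# number of rows processed (row_count), so 'pct' is not a true percentage.
ev_df['pct'] = 100*ev_df['count']/len(ev_df)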
ev_df['ev'] = ev_df['mean']-mean_score
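
Written out, the quantity stored in `ev` is the mean score of the racks containing a leave minus the mean score over all counted rows. The symbols below are shorthand for the notebook's variables, with $\ell$ standing for a leave:

$$
\mathrm{ev}(\ell) = \frac{\mathrm{points}(\ell)}{\mathrm{count}(\ell)} - \frac{\mathrm{total\_points}}{\mathrm{row\_count}}
$$

For example, the blank's mean of 56.566841 and EV of 17.493787 imply an overall mean score of about 39.07 points per move.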

In [13]:
ev_df['ev'].sort_values(ascending=False)

?    17.493787
S     5.303323
Z     4.425721
X     2.273198
E     1.925689
R     1.504390
A     1.138602
H     1.096945
T     0.879131
N     0.869176
C     0.715040
D     0.700800
M     0.580767
I     0.497500
L     0.110617
P    -0.062431
K    -0.329802
O    -0.765613
Y    -1.002356
B    -1.815365
G    -2.001864
J    -2.048146
F    -2.057772
U    -2.979840
W    -3.147393
V    -3.891581
Q    -4.093954
Name: ev, dtype: float64

#### Handle missing leave values
If a given superleave of length n is never observed in the trial games, its EV is estimated from the EVs of its subleaves of length n-1, using one of three rules:
* if the majority of the subleaves are positive, take their maximum.
* if they are split evenly between positive and negative, take their average.
* if the majority of the subleaves are negative, take their minimum.
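
The three rules can be sketched as a small standalone function. The name `estimate_missing_ev` is hypothetical, and treating a subleave EV of exactly zero as neither positive nor negative is an assumption of this sketch.

```python
def estimate_missing_ev(sub_evs):
    """Guess the EV of an unseen leave from the EVs of its (n-1)-tile subleaves."""
    positives = sum(1 for x in sub_evs if x > 0)
    negatives = sum(1 for x in sub_evs if x < 0)
    if positives > negatives:
        return max(sub_evs)   # mostly good subleaves: take the best one
    if negatives > positives:
        return min(sub_evs)   # mostly bad subleaves: take the worst one
    return sum(sub_evs) / len(sub_evs)  # evenly split: take the average

print(estimate_missing_ev([1.0, 2.0, -3.0]))   # 2.0
print(estimate_missing_ev([-1.0, -2.0, 3.0]))  # -2.0
print(estimate_missing_ev([1.0, -3.0]))        # -1.0
```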

In [14]:
ev_dict = ev_df['ev'].to_dict()

In [15]:
t0 = time.time()

for leave in all_leaves:
    if pd.isnull(ev_dict[leave]):
        subleaves = find_subleaves(leave,
                                   min_length=len(leave)-1, 
                                   max_length=len(leave)-1,
                                   duplicates_allowed=True)
        sub_evs = [ev_dict[subleave] for subleave in subleaves]
        # np.sign avoids a ZeroDivisionError when a subleave EV is exactly 0
        signs = sum(np.sign(x) for x in sub_evs)

        if signs == 0:
            ev_dict[leave] = sum(sub_evs)/len(sub_evs)
        elif signs > 0:
            ev_dict[leave] = max(sub_evs)
        elif signs < 0:
            ev_dict[leave] = min(sub_evs)
        
t1 = time.time()
print('Filled in all NaN superleaves with best guesses in {} seconds'.format(t1-t0))

Filled in all NaN superleaves with best guesses in 0.0002548694610595703 seconds


In [16]:
ev_df = ev_df.drop('ev', axis=1)
ev_df = pd.concat([ev_df,pd.Series(ev_dict,name='ev')], axis=1)

#### Calculate rack synergy
In other words, how much better is the EV of this superleave than the value of each tile on its own?

In [17]:
t0 = time.time()

synergy = {x:0 for x in all_leaves}

for leave in all_leaves:
    if len(leave)>1:
        subleaves = find_subleaves(leave, min_length=1, max_length=1, duplicates_allowed=True)
        sub_evs = [ev_dict[subleave] for subleave in subleaves]
        synergy[leave] = ev_dict[leave]-sum(sub_evs)
        
t1 = time.time()
print('Calculated "synergy" in {} seconds'.format(t1-t0))

Calculated "synergy" in 0.00025272369384765625 seconds
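
A worked example of the synergy calculation, as a sketch only. The helper `synergy_of` is hypothetical. The single-tile EVs are taken from the results above, while the EV for the two-tile leave `?S` is a made-up number, since this run only covers one-tile leaves.

```python
def synergy_of(leave, ev):
    """EV of the whole leave minus the sum of its tiles' individual EVs."""
    return ev[leave] - sum(ev[tile] for tile in leave)

ev = {
    '?': 17.493787,  # from the single-tile results
    'S': 5.303323,   # from the single-tile results
    '?S': 30.0,      # made-up value, for illustration only
}

print(round(synergy_of('?S', ev), 6))  # 7.20289
```

A positive synergy means the tiles are worth more together than the sum of their individual values; a one-tile leave always has zero synergy by construction.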


In [18]:
ev_df = pd.concat([ev_df,pd.Series(synergy, name='synergy')], axis=1)

In [19]:
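# Index.rename returns a new Index; ev_df keeps its unnamed index unless the
# result is assigned back to ev_df.index.
ev_df.index.rename('superleave')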

Index(['?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype='object', name='superleave')

In [20]:
ev_df

|  | points | count | bingo_count | mean | bingo pct | pct | ev | synergy |
|---|---|---|---|---|---|---|---|---|
| ? | 770959947 | 13629185 | 7211720 | 56.566841 | 52.913802 | 50478460.0 | 17.493787 | 0 |
| A | 1974585841 | 49104812 | 9884912 | 40.211657 | 20.130231 | 181869700.0 | 1.138602 | 0 |
| B | 454411400 | 12196446 | 1573709 | 37.25769 | 12.903013 | 45172020.0 | -1.815365 | 0 |
| C | 567839046 | 14271582 | 2604518 | 39.788094 | 18.24968 | 52857710.0 | 0.71504 | 0 |
| D | 996651018 | 25057944 | 4756320 | 39.773854 | 18.981286 | 92807200.0 | 0.7008 | 0 |
| E | 2542616909 | 62016948 | 13616974 | 40.998743 | 21.956859 | 229692400.0 | 1.925689 | 0 |
| F | 428629303 | 11579793 | 1117336 | 37.015282 | 9.649015 | 42888120.0 | -2.057772 | 0 |
| G | 735957172 | 19852537 | 3151426 | 37.07119 | 15.874173 | 73527910.0 | -2.001864 | 0 |
| H | 473376806 | 11784337 | 1532401 | 40.169999 | 13.00371 | 43645690.0 | 1.096945 | 0 |
| I | 2086741109 | 52734695 | 10621377 | 39.570554 | 20.141156 | 195313700.0 | 0.4975 | 0 |


Save superleaves to an external file

In [21]:
ev_df['ev']

?    17.493787
A     1.138602
B    -1.815365
C     0.715040
D     0.700800
E     1.925689
F    -2.057772
G    -2.001864
H     1.096945
I     0.497500
J    -2.048146
K    -0.329802
L     0.110617
M     0.580767
N     0.869176
O    -0.765613
P    -0.062431
Q    -4.093954
R     1.504390
S     5.303323
T     0.879131
U    -2.979840
V    -3.891581
W    -3.147393
X     2.273198
Y    -1.002356
Z     4.425721
Name: ev, dtype: float64

In [24]:
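# NOTE: index=False (here and on the next line) drops the leave labels, so
# rows in the saved files are identified only by their order in ev_df.
ev_df['ev'].to_csv('leave_values_' + todays_date + '.csv', index=False)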
ev_df.to_csv('leave_summary_' + todays_date + '.csv', index=False)
