#### How to use
* Calculates the list of all superleaves of length 1-6 tiles (or loads this list in), loads in a log of moves from simulated games, and then calculates the expected value of each superleave based on how much was scored by each rack containing that superleave.

#### To-do
* The frequency/count estimate for superleaves is currently calculated incorrectly (5/8 - is this still true?)
* The synergy calculation is broken.

#### Changelog
* 5/8/20 - My superleave calculation was being too short-sighted - not considering the future value of keeping a blank on your rack, and failing to recognize the awfulness of gravity wells like UVWW. I added an adjustment factor that tracks the value of your leftover tiles when you make a play from a rack containing that superleave, which will hopefully help with recognizing the value of holding a ? and not holding awful combinations/the Q.
* 11/27/19 - Wow, it's been a while. Stopped loading all moves into memory (yikes) and instead wrote a much faster version that can go through 50M moves on my local machine in ~3 hours.
* 1/27/19 - Determined that the speed of creation of the rack dataframes is a function of the length of the dataframe. From that, realized that we should organize leaves by least-frequent to most-frequent letter, such that sub-dataframes are created from the shortest racks possible.

In [1]:
import csv
from datetime import date
from itertools import combinations
import numpy as np
import pandas as pd
import pickle as pkl
import seaborn as sns
from string import ascii_uppercase
import time as time

%matplotlib inline

maximum_superleave_length = 6

log_file = '../logs/log_20200514.csv'
# log_file = '../logs/log_1m.csv'

todays_date = date.today().strftime("%Y%m%d")

In [2]:
todays_date

'20200515'

Create a dictionary of all possible 1- to 6-tile leaves. Also add support for sorting by an arbitrary key, which lets us put the rarest letters first.

In [3]:
# tilebag = ['A']*9+['B']*2+['C']*2+['D']*4+['E']*12+\
#           ['F']*2+['G']*3+['H']*2+['I']*9+['J']*1+\
#           ['K']*1+['L']*4+['M']*2+['N']*6+['O']*8+\
#           ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
#           ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
#           ['Z']*1+['?']*2

# No superleave is longer than 6 letters, and so we only need to include
# 6 each of the As, Es, Is and Os. This shortens the time it takes to find all of
# the superleaves by 50%!
truncated_tilebag = \
          ['A']*6+['B']*2+['C']*2+['D']*4+['E']*6+\
          ['F']*2+['G']*3+['H']*2+['I']*6+['J']*1+\
          ['K']*1+['L']*4+['M']*2+['N']*6+['O']*6+\
          ['P']*2+['Q']*1+['R']*6+['S']*4+['T']*6+\
          ['U']*4+['V']*2+['W']*2+['X']*1+['Y']*2+\
          ['Z']*1+['?']*2
            
tiles = [x for x in ascii_uppercase] + ['?']

# potential future improvement: calculate optimal order of letters on the fly
# rarity_key = 'ZXKJQ?HYMFPWBCVSGDLURTNAOIE'
alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'
sort_func = lambda x: alphabetical_key.index(x)
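As a small illustration of the arbitrary sort key, the sketch below orders the tiles of one leave under both keys. The helper name `canonical` is hypothetical (it is not part of this notebook), and the rarity ordering is simply the one from the commented-out `rarity_key` above.

```python
# Illustrative sketch; `canonical` is a hypothetical helper, not notebook code.
alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'
rarity_key = 'ZXKJQ?HYMFPWBCVSGDLURTNAOIE'


def canonical(leave, key):
    """Return the tiles of `leave` ordered by their position in `key`."""
    return ''.join(sorted(leave, key=key.index))


by_alphabet = canonical('SATIRE?', alphabetical_key)
by_rarity = canonical('SATIRE?', rarity_key)
print(by_alphabet)  # ?AEIRST
print(by_rarity)    # ?SRTAIE
```

Whichever key is used, every rack and every leave has to be normalized with the same one, otherwise the dictionary lookups further down will miss.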

On my home machine, the following code takes about 7 minutes to run in its entirety.

In [4]:
# t0 = time.time()

# leaves = {i:sorted(list(set(list(combinations(truncated_tilebag,i))))) for i in 
#           range(1,maximum_superleave_length+1)}

# # turn leaves from lists of letters into strings
# # algorithm runs faster if leaves non-alphabetical!
# for i in range(1,maximum_superleave_length+1):
#     leaves[i] = [''.join(sorted(leave, key=sort_func))
#                  for leave in leaves[i]]

# t1 = time.time()
# print('Calculated superleaves up to length {} in {} seconds'.format(
#     maximum_superleave_length,t1-t0))

# pkl.dump(leaves,open('all_leaves.p','wb'))

In [5]:
leaves = pkl.load(open('all_leaves.p','rb'))

How many superleaves are there of each length? See below:

In [6]:
for i in range(1,maximum_superleave_length+1):
    print(i,len(leaves[i]))

1 27
2 373
3 3509
4 25254
5 148150
6 737311
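These counts can be cross-checked without enumerating any combinations. The sketch below is an alternative counting method, not what the notebook does: it counts distinct multisets of each size directly from the per-tile quantities in the commented-out `tilebag`. The function name `count_leaves` is made up for this example.

```python
# Full tile distribution, copied from the commented-out tilebag above.
tile_counts = {'A': 9, 'B': 2, 'C': 2, 'D': 4, 'E': 12, 'F': 2, 'G': 3,
               'H': 2, 'I': 9, 'J': 1, 'K': 1, 'L': 4, 'M': 2, 'N': 6,
               'O': 8, 'P': 2, 'Q': 1, 'R': 6, 'S': 4, 'T': 6, 'U': 4,
               'V': 2, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1, '?': 2}


def count_leaves(tile_counts, max_length):
    """ways[k] = number of distinct leaves (multisets of tiles) of size k."""
    ways = [1] + [0] * max_length
    for available in tile_counts.values():
        updated = [0] * (max_length + 1)
        for size in range(max_length + 1):
            # use 0..available copies of this tile, never more than `size`
            for taken in range(min(available, size) + 1):
                updated[size] += ways[size - taken]
        ways = updated
    return ways


leave_counts = count_leaves(tile_counts, 6)
for length in range(1, 7):
    print(length, leave_counts[length])
```

This runs instantly and should print the same six numbers as the cell above, which makes it a cheap sanity check on the pickled `all_leaves.p`.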


### Define the metrics we're tallying for each subleave
Currently, we track the following metrics with each new rack:
* Total points
* Total equity
* Count (how many times the subleave has appeared in the data set)
* Bingo count

In [7]:
all_leaves = []

for i in range(1,maximum_superleave_length+1):
    all_leaves.extend(leaves[i])

In [8]:
def find_subleaves(rack, min_length=1, max_length=6, duplicates_allowed=False):
    # Put every combination into canonical tile order first, so that
    # de-duplication also works when the rack itself is not sorted.
    subleaves = [''.join(sorted(x, key=sort_func)) for i in range(min_length, max_length+1)
        for x in combinations(rack, i)]
    if duplicates_allowed:
        return subleaves
    return list(set(subleaves))
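To make the subleave enumeration concrete, here is a self-contained sketch that can be run outside the notebook. `subleaves_of` is a hypothetical stand-in for `find_subleaves`. It de-duplicates after sorting each combination, so an unsorted rack such as `ABA` does not produce the same subleave twice.

```python
from itertools import combinations

alphabetical_key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'


def subleaves_of(rack, min_length=1, max_length=6):
    """Distinct subleaves of `rack`, each in canonical (key-sorted) order."""
    found = set()
    for size in range(min_length, max_length + 1):
        for combo in combinations(rack, size):
            found.add(''.join(sorted(combo, key=alphabetical_key.index)))
    return sorted(found)


small = subleaves_of('ABA', max_length=2)
full = subleaves_of('AEINRST')
print(small)      # ['A', 'AA', 'AB', 'B']
print(len(full))  # 126
```

A 7-tile rack with no repeated letters has 2^7 - 2 = 126 subleaves of length 1 to 6, and that is how many dictionary updates each logged move costs in the tallying loop.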

*tile_limit* below is the minimum number of tiles left on a rack for it to be factored into superleave calculation. The idea is that moves with the bag empty tend to be worth less, and may not reflect the value of a letter in the rest of the game (most notably, if you have the blank and the bag is empty, you often can't bingo!). Moves also tend to be worth a little bit less at the beginning of a game, when there are fewer juicy spots to play.

In [9]:
t0 = time.time()

tile_limit = 1

bingo_count = {x:0 for x in all_leaves}
count = {x:0 for x in all_leaves}
equity = {x:0 for x in all_leaves}
points = {x:0 for x in all_leaves}
row_count = 0
total_equity = 0
total_points = 0

with open(log_file,'r') as f:
    moveReader = csv.reader(f)
    next(moveReader)
    
    for i,row in enumerate(moveReader):
        if i%1000000==0:
            t = time.time()
            print('Processed {} rows in {} seconds'.format(i,t-t0))
        
#         if i<10:
#             print(i,row)
            
        try:    
            if int(row[10]) >= tile_limit:

                total_equity += float(row[9])
                total_points += int(row[5])
                row_count += 1

                for subleave in find_subleaves(row[3],
                        max_length=maximum_superleave_length):
                    bingo_count[subleave] += row[7] == '7'
                    count[subleave] += 1
                    equity[subleave] += float(row[9])
                    points[subleave] += int(row[5])
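        # malformed rows (missing or non-numeric fields) and unknown leaves are
        # printed and skipped; anything else should surface as a real error
        except (ValueError, IndexError, KeyError):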
            print(i,row)

t1 = time.time()
print('{} seconds to populate dictionaries'.format(t1-t0))

Processed 0 rows in 0.6810328960418701 seconds
Processed 1000000 rows in 299.34818387031555 seconds
Processed 2000000 rows in 567.5604648590088 seconds
Processed 3000000 rows in 837.5055487155914 seconds
Processed 4000000 rows in 1107.6123669147491 seconds
Processed 5000000 rows in 1378.0669567584991 seconds
Processed 6000000 rows in 1652.3204958438873 seconds
Processed 7000000 rows in 1923.1418929100037 seconds
Processed 8000000 rows in 2194.7655749320984 seconds
Processed 9000000 rows in 2466.54762673378 seconds
Processed 10000000 rows in 2738.933168888092 seconds
Processed 11000000 rows in 3011.0271508693695 seconds
Processed 12000000 rows in 3283.3966426849365 seconds
Processed 13000000 rows in 3557.1026887893677 seconds
Processed 14000000 rows in 3830.2991468906403 seconds
Processed 15000000 rows in 4103.08988070488 seconds
Processed 16000000 rows in 4375.417726993561 seconds
Processed 17000000 rows in 4648.062791824341 seconds
Processed 18000000 rows in 4920.394510746002 seconds
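For anyone who wants to see the tallying logic without waiting over an hour, the sketch below runs the same idea on a made-up three-move log held in memory. The column positions mirror the indices used in the loop above, but what the columns mean (3 = rack, 5 = score, 7 = tiles played, 9 = equity, 10 = tiles remaining) is my assumption from reading the code, and the rows are invented. It only tallies subleaves of length 1 and 2 to keep things small.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Hypothetical log; only columns 3, 5, 7, 9 and 10 are meaningful here.
toy_log = io.StringIO(
    "c0,c1,c2,rack,c4,score,c6,tiles_played,c8,equity,tiles_remaining\n"
    "x,x,x,AEINRST,x,70,x,7,x,70.0,50\n"
    "x,x,x,AEIOUUV,x,8,x,2,x,2.0,40\n"
    "x,x,x,AEINQRT,x,30,x,3,x,25.0,0\n"
)

tile_limit = 1
count = defaultdict(int)
bingo_count = defaultdict(int)
equity = defaultdict(float)
rows_used = 0

reader = csv.reader(toy_log)
next(reader)  # skip the header row
for row in reader:
    if int(row[10]) < tile_limit:
        continue  # the third row is skipped by this filter
    rows_used += 1
    rack = ''.join(sorted(row[3]))  # no blanks here, so plain sorting is enough
    subleaves = {''.join(c) for n in (1, 2) for c in combinations(rack, n)}
    for subleave in subleaves:
        count[subleave] += 1
        bingo_count[subleave] += row[7] == '7'
        equity[subleave] += float(row[9])

print(rows_used, count['AE'], bingo_count['AE'], equity['AE'] / count['AE'])
```

Using `defaultdict` here avoids pre-populating a key for every possible leave. The notebook pre-populates instead, which also keeps never-seen leaves in the final table with a count of 0.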


In [10]:
ev_df = pd.concat([pd.Series(points, name='points'),
                  pd.Series(equity, name='equity'),
                  pd.Series(count, name='count'),
                  pd.Series(bingo_count, name='bingo_count')],
                  axis=1)

In [11]:
mean_score = total_points/row_count
mean_equity = total_equity/row_count

In [12]:
ev_df['mean_score'] = ev_df['points']/ev_df['count']
ev_df['mean_equity'] = ev_df['equity']/ev_df['count']
ev_df['bingo pct'] = 100*ev_df['bingo_count']/ev_df['count']
# share of tallied racks that contain each leave
ev_df['pct'] = 100*ev_df['count']/row_count
ev_df['adjusted_mean_score'] = ev_df['mean_score']-mean_score
ev_df['ev'] = ev_df['mean_equity']-mean_equity
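The `ev` column is just "mean equity of racks containing the leave, minus mean equity over all racks". A tiny worked example with invented totals (none of these numbers come from the real log):

```python
# Hypothetical totals for two leaves and for the whole log.
equity_total = {'?': 5200.0, 'Q': 2900.0}
count_total = {'?': 100, 'Q': 100}
total_equity, row_count = 40000.0, 1000

mean_equity = total_equity / row_count  # 40.0 per move overall

ev = {leave: equity_total[leave] / count_total[leave] - mean_equity
      for leave in equity_total}
print(ev)  # {'?': 12.0, 'Q': -11.0}
```

With these made-up inputs, racks holding the blank average 12 points above a typical rack and racks holding the Q average 11 below.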

#### Smoothed superleaves
We calculate superleaves by having Macondo play itself millions of times, and then comparing how much the plays made from racks containing a given superleave score against the average over all plays (the difference is the "leave value"). However, some of the lower-probability superleaves are observed very infrequently and so end up with inaccurate values (for instance, if the one time you had DLPQX? you played QUADPLEX for 300+, you're going to incorrectly think that's a dream leave!).

To compensate for this, we "smooth out" the value of any superleave that was observed fewer than a cutoff number of times (maybe 50 or 100). We pool the statistics of all neighboring leaves (all leaves that differ by only 1 tile and contain the same number of blanks). The proper way of doing this is really with a better model like a neural net, but this gets pretty close and prevents "gravity wells" (when a superleave is valued way too high, and the fast player keeps trying to hold that superleave at all costs).
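Here is the pooling arithmetic on invented numbers, to show how one freak observation gets diluted. The leaves are written in the notebook's canonical order (blank first), and every figure below is hypothetical.

```python
# Hypothetical (total equity, times seen) for a rare leave and three neighbors.
stats = {
    '?DLPQX': (300.0, 1),  # seen once, with one enormous play
    '?DLPQW': (90.0, 3),
    '?DLPQY': (70.0, 2),
    '?ELPQX': (140.0, 4),
}

raw_mean = stats['?DLPQX'][0] / stats['?DLPQX'][1]

# Pool totals first, then divide, so well-observed neighbors carry more weight.
pooled_equity = sum(total for total, _ in stats.values())
pooled_count = sum(seen for _, seen in stats.values())
smoothed_mean = pooled_equity / pooled_count

print(raw_mean, smoothed_mean)  # 300.0 60.0
```

Dividing pooled totals (rather than averaging the per-leave means) is a count-weighted average, so a single 300-point observation cannot dominate.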

In [66]:
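# avoid tampering with ev_df above
# the leave column is named 'index' when the summary was saved via
# reset_index(), or 'Unnamed: 0' when the index was written without a name
summary_df = pd.read_csv('leave_summary_' + todays_date +'.csv').rename(
    columns={'Unnamed: 0':'leave', 'index':'leave'}).set_index('leave')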

In [None]:
count_dict = summary_df['count'].to_dict()
equity_dict = summary_df['equity'].to_dict()
mean_equity_dict = summary_df['mean_equity'].to_dict()
summary_df = summary_df.reset_index()
summary_df['leave_len'] = summary_df['leave'].apply(lambda x: len(x))

In [None]:
child_leaves = {leave:[''.join(sorted(leave+letter, key=sort_func)) for letter in alphabetical_key]
                for i in range(1,6) for leave in leaves[i]}
child_leaves[''] = [x for x in alphabetical_key]

In [None]:
def get_neighboring_leaves(original_leave):
    # every leave obtained by dropping one tile from the original
    subleaves = [''.join(x) for x in combinations(original_leave, len(original_leave)-1)]
    
    # add each possible tile back in, giving all leaves that differ by at most one tile
    neighbors = []
    for leave in subleaves:
        neighbors += child_leaves[leave]
    
    # filter neighbors to make sure they have the same number of blanks
    blank_count = original_leave.count('?')
    neighbors = [leave for leave in neighbors if leave.count('?')==blank_count]
    
    return neighbors


def calculate_smoothed_superleave(superleave):
    neighbors = get_neighboring_leaves(superleave)
    
    neighboring_equity = 0
    neighboring_count = 0
    equity_list = []
        
    for neighbor_leave in neighbors:
        neighboring_equity += equity_dict.get(neighbor_leave, 0)
        neighboring_count += count_dict.get(neighbor_leave, 0)
        equity_list.append(mean_equity_dict.get(neighbor_leave))
                
    equity_list = [x for x in equity_list if pd.notnull(x)]
    
#     print('Original:')
#     print(summary_df.loc[summary_df['leave']==superleave])
#     print(neighboring_equity, neighboring_count, neighboring_equity/neighboring_count)
#     print(np.mean(equity_list))
#     print(equity_list)
    
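    # if no neighbor was ever observed there is nothing to average
    if neighboring_count == 0:
        return np.nan
    
    return neighboring_equity/neighboring_count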
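To see what counts as a "neighbor", the sketch below rebuilds the idea in a self-contained form. `neighbors_of` is a hypothetical helper, and it differs from `get_neighboring_leaves` in one way: it returns a set, whereas the notebook version keeps repeats (so the original leave appears once per dropped tile).

```python
from itertools import combinations

key = '?ABCDEFGHIJKLMNOPQRSTUVWXYZ'


def neighbors_of(leave):
    """Distinct leaves that differ from `leave` by at most one tile
    and contain the same number of blanks."""
    blanks = leave.count('?')
    found = set()
    for kept in combinations(leave, len(leave) - 1):
        for tile in key:
            candidate = ''.join(sorted(kept + (tile,), key=key.index))
            if candidate.count('?') == blanks:
                found.add(candidate)
    return found


qu_neighbors = neighbors_of('QU')
print(len(qu_neighbors))  # 51
print('AQ' in qu_neighbors, '?Q' in qu_neighbors)  # True False
```

Impossible leaves such as `QQ` show up in this list too. That is harmless in the smoothing step, because a leave missing from the lookup dictionaries contributes 0 equity and 0 count.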

Show how many 5- and 6-tile superleaves were never seen, and how many were seen fewer than 10 times.

In [None]:
print(pd.notnull(summary_df.loc[summary_df['leave_len']==5])['ev'].value_counts())
print((summary_df.loc[summary_df['leave_len']==5]['count']<10).value_counts())

print(pd.notnull(summary_df.loc[summary_df['leave_len']==6])['ev'].value_counts())
print((summary_df.loc[summary_df['leave_len']==6]['count']<10).value_counts())

Show the strongest 5-tile superleaves. If your superleaves are unsmoothed, you'll likely see some weird, low-count superleaves at the top of this list.

In [None]:
summary_df.loc[summary_df['leave_len']==5].sort_values('ev', ascending=False)[:10]

In [None]:
summary_df['smoothed_ev'] = summary_df['ev']
summary_df['point_equity_diff'] = (summary_df['points']-summary_df['equity'])/summary_df['count']

If there's a big gap between the average points and the average equity scored with a given leave, that can be a sign that your existing EV for that superleave is too high.

In [None]:
summary_df.loc[summary_df['leave_len']==5].sort_values('point_equity_diff')[:10]

In [None]:
# What's the minimum number of times you want to see a superleave before you'll take the
# value as is, without smoothing?
five_tile_superleave_cutoff = 100
six_tile_superleave_cutoff = 50

In [67]:
leaves_to_smooth = list(summary_df.loc[(summary_df['leave_len']==5) & 
    (summary_df['count']<five_tile_superleave_cutoff)]['leave'].values)
print(len(leaves_to_smooth))

leaves_to_smooth += list(summary_df.loc[(summary_df['leave_len']==6) &
    (summary_df['count']<six_tile_superleave_cutoff)]['leave'].values)
print(len(leaves_to_smooth))


In [None]:
summary_df = summary_df.set_index('leave')
smooth_ev_dict = summary_df['ev'].to_dict()
ev_dict = summary_df['ev'].to_dict()

In [None]:
t0 = time.time()

for i,leave in enumerate(leaves_to_smooth):
    if (i+1)%1000==0:
        print(i, time.time()-t0)
    
    smooth_ev_dict[leave] = calculate_smoothed_superleave(leave) - mean_equity

In [68]:
smooth_ev_series = pd.Series(smooth_ev_dict)

#### Calculate rack synergy
In other words, how much better is the EV of this superleave than the value of each tile on its own? 

In [78]:
t0 = time.time()

synergy = {leave: smooth_ev_dict[leave]-sum([smooth_ev_dict[letter] for letter in leave]) 
    for leave in all_leaves}
        
t1 = time.time()
print('Calculated synergy in {} seconds'.format(t1-t0))

Calculated "synergy" in 1.1174757480621338 seconds
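A worked example of the synergy definition, using invented EVs that are only there to show the arithmetic (the standalone `synergy` function is also just for illustration):

```python
# Hypothetical EVs for a few single tiles and two 2-tile leaves.
ev = {'Q': -7.0, 'U': -3.0, 'QU': -2.5,
      'V': -5.0, 'W': -4.0, 'VW': -12.0}


def synergy(leave, ev):
    """EV of the whole leave minus the sum of its tiles' individual EVs."""
    return ev[leave] - sum(ev[tile] for tile in leave)


qu_synergy = synergy('QU', ev)
vw_synergy = synergy('VW', ev)
print(qu_synergy)  # 7.5
print(vw_synergy)  # -3.0
```

A positive number means the tiles are worth more together than apart, and a negative number means they get in each other's way.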


In [79]:
ev_df = pd.concat([ev_df,pd.Series(synergy, name='synergy')], axis=1)

Save superleaves to an external file

In [81]:
ev_df['ev'].to_csv('leave_values_' + todays_date + '_unsmoothed.csv')
ev_df.reset_index().to_csv('leave_summary_' + todays_date + '.csv', index=False)
smooth_ev_series.to_csv('leave_values_' + todays_date + '.csv')


In [84]:
ev_df['synergy'].sort_values().to_csv('leave_synergies.csv')


In [85]:
ev_df['synergy'].sort_values(ascending=False)[:100].to_csv('leave_synergies_top100.csv')
ev_df['synergy'].sort_values(ascending=True)[:100].to_csv('leave_synergies_bottom100.csv')
