# train_card2vec

This notebook covers the full card2vec workflow for creating card embeddings, including:
* Data download
* Preprocessing
* Model training (creating embeddings)


## !!! BEFORE YOU BEGIN !!!
1. Clone this repo.
2. Download 17Lands draft game data
    * https://www.17lands.com/public_datasets
    * (Must be the data from the 'Game Data' column, in a 'PremierDraft' row)
    * Extract it, e.g., with [7zip](https://www.7-zip.org/download.html)
    * Save the csv file in your repo 'data/draft/' folder
3. Specify the file name in the cell below (don't forget to run the cell after specifying)

In [1]:
# this is just an example file name, change it to the true name of your csv file
file_name = 'game_data_public.SNC.PremierDraft.csv'
set_name = 'SNC' # name this whatever you want (but short), it only impacts the file name of saved models

In [8]:
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import pandas as pd
import random
from DeckCorpus import DeckCorpus
from card2vec_preprocess import card2vec_preprocess
from card_ints_to_list import card_ints_to_list
from LossCallback import LossCallback
from draft_game_cleaner import draft_game_cleaner
from os import getcwd
from datetime import datetime

### Process 17Lands data
There are 2 data cleaning / preprocessing steps:
1. Clean the raw data into a multi-purpose form (usable for other purposes, not just card2vec)

In [3]:
# This saves the cleaned output in your data folder, so there is no need to run it more than once per 17Lands data file
# May take ~30s, maybe longer on slower machines
from os.path import join  # builds paths that work on any OS, not just Windows

file_path = join(getcwd(), 'data', 'draft', file_name)
draft_game_cleaner(file_path)

2. Trim the data down further to only what card2vec requires: a DataFrame where each row is a deck, each column is a card, and integer values represent card counts

In [4]:
# load decks for use in Word2Vec
# cleaned file will have 'CLEANED_' prefix
clean_path = file_path.replace(file_name,'CLEANED_' + file_name) 

df = card2vec_preprocess(clean_path)
unique_cards = len(df.columns)

Dropping  462  duplicate decks
126493  unique decks remain.


## Model Training (Creating Card Embeddings)
This relies on DeckCorpus, a generator that converts decks into a word2vec-compatible form before passing them to the model.
* It converts rows of integer card counts into lists of card names (strings), so output decks will be in the format:
    - ["Mountain", "Mountain", "Shock", ... ]
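The conversion described above can be sketched as follows. This is a minimal illustration, not the repo's DeckCorpus code: the helper name `counts_to_card_list` and the example deck are hypothetical, and a plain dict stands in for one DataFrame row.

```python
# Hypothetical stand-in for the per-deck conversion; not the repo's DeckCorpus.
def counts_to_card_list(deck_counts):
    """Expand a {card name: count} mapping into a flat list of card names."""
    cards = []
    for name, count in deck_counts.items():
        # A card with count 0 contributes nothing to the "sentence".
        cards.extend([name] * count)
    return cards

# One made-up deck row: card name -> number of copies in the deck.
example_deck = {"Mountain": 2, "Shock": 1, "Swamp": 0}
deck_as_sentence = counts_to_card_list(example_deck)
print(deck_as_sentence)
```

Each resulting list plays the role of a sentence for word2vec, with card names as the words.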

In [6]:
# hyperparameters - you may want to experiment
window_size = 40 # skipgram / CBOW window size
vector_size = 100 # size of resulting card embeddings
epochs = 5
skipgram = 1 # uses CBOW if 0
data_share = 1 # to test on a subset of the data, reduce this, e.g., 0.3 = the first 30% of decks (not randomized)

# These lines just implement the idea of the data_share variable above
last_idx = int(len(df)*data_share)
df_less = df.iloc[0:last_idx,:]

# Corpus (generator that yields decks)
deck_corpus = DeckCorpus(data=df_less, shuffle=True)

In [7]:
# This creates and trains the gensim word2vec model.
# Will take time, so try with 1 epoch first to get a sense.
model = Word2Vec(sentences = deck_corpus,
                 vector_size = vector_size,
                 window = window_size,
                 sg = skipgram,
                 callbacks = [LossCallback()], # reports the training loss after each epoch via gensim's callback mechanism
                 compute_loss = True,
                 epochs = epochs,
                )

Loss after epoch 0: 26947114.0
Loss after epoch 1: 14490494.0
Loss after epoch 2: 10625660.0
Loss after epoch 3: 10687884.0
Loss after epoch 4: 4600136.0
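The values above come from the repo's LossCallback, whose implementation is not shown in this notebook. As background, gensim's `get_latest_training_loss()` is generally understood to return a running total across epochs, so a callback that wants per-epoch values has to subtract the previous total. The sketch below shows only that arithmetic; the helper name and the numbers are made up for illustration.

```python
# Hypothetical helper: turn a running loss total into per-epoch losses.
def per_epoch_losses(cumulative_losses):
    """Return the difference between consecutive running totals."""
    previous = 0.0
    result = []
    for total in cumulative_losses:
        result.append(total - previous)
        previous = total
    return result

# Made-up running totals as they might be read at the end of each epoch.
running_totals = [100.0, 160.0, 190.0]
epoch_losses = per_epoch_losses(running_totals)
print(epoch_losses)
```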


### TODO: Example results (in a separate Jupyter notebook that loads pretrained embeddings)

In [10]:
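# Save the model under a timestamped file name
from os.path import join  # builds paths that work on any OS, not just Windows

now = datetime.now()
dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")

# '-' before the timestamp keeps the epoch count and the date from running together
save_name = f'card2vec-{set_name}-w{window_size}-v{vector_size}-e{epochs}-{dt_string}.model'
model.save(join('models', save_name))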

In [None]:
model.wv.most_similar('Black Market Tycoon', topn=5)

In [None]:
vect = model.wv['Call In a Professional'] - model.wv['Mountain'] + model.wv['Swamp']

In [None]:
model.wv.similar_by_vector(vect)
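The vector arithmetic in the cells above follows the usual word2vec analogy pattern: subtract one basic land, add another, then look for the nearest cards by cosine similarity. The toy example below illustrates the idea in plain Python. The 3-dimensional vectors and the names `red_card` and `black_card` are invented for illustration and are not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings (hypothetical values, not trained vectors).
toy_vectors = {
    "red_card": [1.0, 0.0, 1.0],
    "Mountain": [1.0, 0.0, 0.0],
    "Swamp": [0.0, 1.0, 0.0],
    "black_card": [0.0, 1.0, 1.0],
}

# red_card - Mountain + Swamp, computed element by element.
query = [
    r - m + s
    for r, m, s in zip(
        toy_vectors["red_card"], toy_vectors["Mountain"], toy_vectors["Swamp"]
    )
]

# Rank every card by similarity to the query vector, most similar first.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine(query, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

In this toy setup the top match is `black_card`, the black counterpart of `red_card`, which is the kind of result the analogy query hopes to find in the trained model.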

In [None]:
# deck_cols_dict = {deck_cols[i]:i for i in range(0,len(deck_cols))}
# deck_cols_dict_rev = {i:deck_cols[i] for i in range(0,len(deck_cols))}

In [None]:
vect = model.wv['Riveteers Initiate'] - model.wv['Mountain'] + model.wv['Swamp']