# train_card2vec

This notebook covers the full card2vec workflow for creating card embeddings, including:
* Data download
* Preprocessing
* Model training (creating embeddings)

**Before you begin:**
1. Clone this repo.
2. Specify the set you want to work with in the cell below. (The data will be downloaded automatically.)

In [None]:
set_abbreviation = 'ONE' # three-letter abbreviation of the set to work with

In [None]:
# standard library
import os
import random
from datetime import datetime
from os import getcwd

# third-party
import pandas as pd
import polars as pl
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

# local modules from this repo
import SetTools
from DeckCorpus import DeckCorpus
from LossCallback import LossCallback

In [None]:
file_name = f'game_data_public.{set_abbreviation}.PremierDraft.csv.gz'
file_path = os.path.join(getcwd(), 'data', set_abbreviation, file_name)

In [None]:
# Please be kind to the 17lands servers and don't overuse this. The download is skipped if existing .gz files are found.
SetTools.download_game_data(set_abbreviation)
SetTools.gz_to_parquet(set_abbreviation) # convert gzipped csv to parquet

In [None]:
df = SetTools.card2vec_preprocess(SetTools.parquet_path(set_abbreviation))

## Model Training (Creating Card Embeddings)
This relies on `DeckCorpus`, a generator that converts decks into a word2vec-compatible form before passing them to the model.
* It converts rows of integer card counts into lists of card names (strings). For example, output decks have the format:
    - ["Mountain", "Mountain", "Shock", ... ]
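
The conversion described above can be sketched roughly as follows. This is not the actual `DeckCorpus` implementation: the class name `CountRowCorpus`, the `seed` argument, and the assumption that each column is named after a card and holds an integer count are all hypothetical, as is the choice to shuffle card order within each deck. The one firm point is that gensim iterates over the corpus once per epoch, so the sketch defines `__iter__` instead of returning a one-shot generator.

```python
import random

import pandas as pd


class CountRowCorpus:
    """Hypothetical stand-in for DeckCorpus: yields one list of card names per deck."""

    def __init__(self, data, shuffle=False, seed=None):
        self.data = data            # assumed layout: one row per deck, one column per card
        self.shuffle = shuffle      # assumption: shuffle the card order within each deck
        self.rng = random.Random(seed)

    def __iter__(self):
        # A fresh pass starts on every call, so the corpus can be re-read each epoch.
        card_names = list(self.data.columns)
        for counts in self.data.itertuples(index=False, name=None):
            # Repeat each card name as many times as its count in this deck.
            deck = [name for name, n in zip(card_names, counts) for _ in range(int(n))]
            if self.shuffle:
                self.rng.shuffle(deck)
            yield deck


# Toy data: two decks over three cards.
toy = pd.DataFrame({"Mountain": [2, 0], "Shock": [1, 1], "Swamp": [0, 3]})
decks = list(CountRowCorpus(toy))
```

With the toy frame, the first deck becomes two copies of "Mountain" followed by one "Shock". Because a deck has no natural card order, a large window combined with shuffling is one way to let every card in a deck act as context for every other card.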

In [None]:
# hyperparameters - you may want to experiment
window_size = 40 # skipgram / CBOW window size
vector_size = 256 # size of resulting card embeddings
epochs = 1 # number of training epochs; start with 1 to gauge run time, then increase
skipgram = 1 # uses CBOW if 0
data_share = 1 # fraction of the data to use; reduce it to test on a subset, e.g., 0.3 = the first 30% of rows (not randomized)

# These lines just implement the idea of the data_share variable above
last_idx = int(len(df)*data_share)
df_less = df.iloc[0:last_idx,:]

# Corpus (generator that yields decks)
deck_corpus = DeckCorpus(data=df_less, shuffle=True)

In [None]:
# This creates and trains the gensim word2vec model.
# Training takes time, so try 1 epoch first to get a sense of the run time.
model = Word2Vec(sentences = deck_corpus,
                 vector_size = vector_size,
                 window = window_size,
                 sg = skipgram,
                 callbacks = [LossCallback('loss.log')], #Note that this is a gensim way of reporting training loss
                 compute_loss = True,
                 epochs = epochs,
                )

In [None]:
#save embeddings as csv
pd.DataFrame(model.wv[model.wv.index_to_key], index=model.wv.index_to_key).to_csv(f'{set_abbreviation}_embeddings.csv')

In [None]:
#save model
now = datetime.now()
dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")

save_name = 'card2vec-' + set_abbreviation + '-v' + str(vector_size) + '-e' + dt_string +'.model'
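os.makedirs('models', exist_ok=True) # make sure the output directory exists
model.save(os.path.join('models', save_name)) # portable path instead of a hard-coded backslash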

In [None]:
model.wv.most_similar('Black Market Tycoon', topn=5)

In [None]:
vect = model.wv['Call In a Professional'] - model.wv['Mountain'] + model.wv['Swamp']

In [None]:
model.wv.similar_by_vector(vect)

In [None]:
# deck_cols_dict = {deck_cols[i]:i for i in range(0,len(deck_cols))}
# deck_cols_dict_rev = {i:deck_cols[i] for i in range(0,len(deck_cols))}

In [None]:
vect = model.wv['Riveteers Initiate'] - model.wv['Mountain'] + model.wv['Swamp']