# train_card2vec

This notebook covers the full card2vec workflow for creating card embeddings:
* Data download
* Preprocessing
* Model training (creating embeddings)

**Before you begin:**
1. Clone this repo.
2. Specify the set you want to work with in the cell below. (Data will be downloaded automatically.)

In [None]:
set_abbreviation = 'ONE' # 3-letter abbreviation of the set to work with.

In [None]:
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import pandas as pd
from DeckCorpus import DeckCorpus
from LossCallback import LossCallback
import SetTools
import os

In [None]:
# import requests
# response = requests.get("https://api.scryfall.com/cards/search/?q=e=" + set_abbreviation)
# response_json = response.json()

In [None]:
# import json
# response_json

In [None]:
# with open('one.json', 'w') as f:
#     json.dump(response_json, f)

In [None]:
# Please be kind to the 17lands servers and don't overuse this. The download is skipped if existing .gz files are found.
SetTools.download_game_data(set_abbreviation)
SetTools.gz_to_parquet(set_abbreviation) # convert gzipped csv to parquet

In [None]:
df = SetTools.card2vec_preprocess(SetTools.parquet_path(set_abbreviation))

## Model Training (Creating Card Embeddings)
This relies on DeckCorpus, a generator that converts decks into a word2vec-compatible form before passing them to the model.
* It converts rows of integer card counts into lists of card names (strings), e.g., output decks will be in the format:
    - ["Mountain", "Mountain", "Shock", ... ]
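
The conversion described above can be sketched in a few lines. This is only an illustration under assumptions, not the actual DeckCorpus implementation: the function name `counts_to_deck`, its parameters, and the sample row are all hypothetical.

```python
import random

def counts_to_deck(card_counts, shuffle=False, seed=None):
    """Expand a mapping of card name -> number of copies into a flat list of card names.

    Hypothetical helper for illustration; DeckCorpus may work differently.
    """
    deck = []
    for name, count in card_counts.items():
        # Repeat each card name once per copy in the deck; zero counts add nothing.
        deck.extend([name] * int(count))
    if shuffle:
        # Shuffle in place so card order carries no meaning for the context window.
        random.Random(seed).shuffle(deck)
    return deck

# Hypothetical row of integer card counts for a single deck.
row = {"Mountain": 2, "Shock": 1, "Island": 0}
deck = counts_to_deck(row)
print(deck)  # ['Mountain', 'Mountain', 'Shock']
```

Each resulting list plays the role of a "sentence" for word2vec, with card names acting as the "words".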

In [None]:
# hyperparameters
epochs = 5
window_size = 40 # skipgram / CBOW window size
vector_size = 256 # size of resulting card embeddings
skipgram = 1 # uses CBOW if 0

# Corpus (generator that yields decks)
deck_corpus = DeckCorpus(data=df, shuffle=True)

model = Word2Vec(sentences = deck_corpus,
                 vector_size = vector_size,
                 window = window_size,
                 sg = skipgram,
                 callbacks = [LossCallback('loss.log')], # gensim callback for reporting training loss
                 compute_loss = True,
                 epochs = epochs,
                )

In [None]:
# Save embeddings as CSV
embed_dir = os.path.join(os.getcwd(), 'embeddings', set_abbreviation)

# Create the local directory if it doesn't exist
os.makedirs(embed_dir, exist_ok=True)

pd.DataFrame(model.wv[model.wv.index_to_key], index=model.wv.index_to_key).to_csv(f'{embed_dir}/{set_abbreviation}_embeddings.csv')
model.wv.save_word2vec_format(f'{embed_dir}/embed_gensim.txt')

In [None]:
# Save model
model_dir = os.path.join(os.getcwd(), 'models', set_abbreviation)

# Create the local directory if it doesn't exist
os.makedirs(model_dir, exist_ok=True)

save_name = f'{set_abbreviation}.model'
model.save(os.path.join(model_dir, save_name))
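
Once the embeddings CSV has been written, it can be read back and compared without gensim. The sketch below is a minimal, hedged example: it assumes the CSV layout pandas produces for the DataFrame above (a header row of column indices, then one row per card with the card name in the first column). The helper names `load_embeddings` and `cosine_similarity` and the toy vectors are hypothetical and not part of this repo.

```python
import csv
import io
import math

def load_embeddings(csv_file):
    """Read an embeddings CSV (card name in column 0) into a dict of name -> vector."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row of column indices
    return {row[0]: [float(x) for x in row[1:]] for row in reader}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-in for the real embeddings file; the values are made up.
toy_csv = io.StringIO(
    ',0,1\n'
    'Mountain,1.0,0.0\n'
    'Shock,0.8,0.6\n'
    '"Atraxa, Grand Unifier",0.0,1.0\n'
)
embeddings = load_embeddings(toy_csv)
print(cosine_similarity(embeddings["Mountain"], embeddings["Shock"]))
```

With a real file, replace `toy_csv` with an open file handle for the saved CSV. Card names that contain commas are quoted by pandas, which the `csv` module handles.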