# Card vectorization
## Introduction
In this notebook we will show howto vectorize cards in 3 ways:
1. Bag of Words
2. Tf-idf
3. Custom word2vec

We will show it along 100 cards (it's the same with all the cards, just much longer)

## Preparation
Execute `python -m spacy download en_core_web_sm` on the command line to install the English language module and import the libraries:

In [10]:
import psycopg2, re, string, gzip
from spacy import load
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

## Loading the data
I load it from a loacal Postgres DB where `mtg_local.sql.zip` has been loaded since `docker-compose run --rm db-util yarn initialize` doesn't work by me...

In [11]:
conn = psycopg2.connect(database="mtg", user="postgres", password="postgres", port=5432, host='localhost')
cur = conn.cursor()
cur.execute("select name,type_line,oracle_text from cards where exists( select 1 from jsonb_each_text(cards.legalities) j where j.value not like '%not_legal%') and lang='en' limit 100;")

cards = []
card = cur.fetchone()
 
while card is not None:
    card = cur.fetchone()
    cards.append(card)
 
cur.close()

## Preprocessing

We preprocess the cards by joining their `name`, `type_line` and `oracle_text`, extracting the words, removing the punctuation and the stop words:

In [12]:
def preprocess_card(card):
    card = ' '.join(card)
    card = re.split(r'\W+', card)
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    card = [word.translate(table) for word in card]
    # To lower case
    card = [word.lower() for word in card if word != '']
    # Remove stopwords
    card = list(set(card) - STOP_WORDS)
    card = ' '.join(card)
    return card

In [13]:
preprocessed_cards = []
for card in cards:
    if card is None:
        continue
    card = preprocess_card(card)
    preprocessed_cards.append(card)

## Vectorization
We will vectorize the cards through 3 distinct methods: bag of words, Tf-Idf and word2vec. For bag of words and tf-idf, the pipeline is roughly 1. train/test split, 2. fit on train 3. predict on test. For word2vec it's more free.
### Bag of words

In [14]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(preprocessed_cards)

print(X_train_counts.shape)

(99, 329)


### Tf-Idf
Tf-Idf is applied on the bag of word vectorization computed above:

In [15]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

print(X_train_tf.shape)

(99, 329)


## word2vec
*word2vec* is a method to represent words by vectors such that their cosine proximity reflects their semantic similarity (simpler: the closer the meaning of 2 words, the closer their vectors). There exists already files containing word2vec vectors for English words trained on large corpora. Here we will train a word2vec representation of the words in the cards based on the text in the cards. So we get a fully customized word2vec representation for *MGT*:

In [16]:
path = get_tmpfile("./data/word2vec.model")

model = Word2Vec(preprocessed_cards, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("../../data/word2vec.txt")

# gzip the model
f_in = open('../../data/word2vec.txt', 'rb')
f_out = gzip.open('../../data/word2vec.txt.gz', 'wb')
f_out.writelines(f_in)
f_out.close()
f_in.close()

create model by executing `python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz` on the command line.

Result:

```
✔ Successfully created model
33it [00:00, 15872.94it/s]a/word2vec.txt.gz
✔ Loaded vectors from data/word2vec.txt.gz
✔ Sucessfully compiled vocab
499 entries, 33 vectors

```
`

And vectorize the cards:

In [9]:
nlp_mtg = load('../../data/spacy.word2vec.model')
card_vectors = []
for preprocessed_card in preprocessed_cards:
    card_vector = nlp_mtg(preprocessed_card)
    card_vectors.append(card_vector)