# Card vectorization
## Introduction
In this notebook we will show howto vectorize cards in 3 ways:
1. Bag of Words
2. Tf-idf
3. Custom word2vec

## Preparation
Execute `python -m spacy download en_core_web_sm` on the command line to install the English language module and import the libraries:

In [17]:
import psycopg2, re, string, gzip
from numpy import array, mean
from numpy.random import choice
from os import listdir
from os.path import join
from spacy import load
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
from random import shuffle

## Loading the data
I load it from a loacal Postgres DB where `mtg_local.sql.zip` has been loaded since `docker-compose run --rm db-util yarn initialize` doesn't work by me...

In [2]:
conn = psycopg2.connect(database="mtg", user="postgres", password="postgres", port=5432, host='localhost')
cur = conn.cursor()
cur.execute("select oracle_id, name, type_line, oracle_text from cards where exists (select 1 from jsonb_each_text(cards.legalities) j where j.value not like '%not_legal%') and lang='en';")

cards = []
card = cur.fetchone()
 
while card is not None:
    card = cur.fetchone()
    cards.append(card)
 
cur.close()

cards = cards[:-1]

## Preprocessing

### joining the cards with their tag

In [3]:
PATH_TAGS = join(*['..', '..', 'data', 'cards-tags'])

oracleid2tag = {}
for filename in listdir(PATH_TAGS):
    tag_name = filename.split('.')[0]
    with open(join(PATH_TAGS, filename), 'r') as file:
        for i, line in enumerate(file):
            if i == 0:  # drop the first line of the file, because it just contains the header
                continue
            oracleid2tag[line[:-1]] = tag_name  # the last character in carriage return, remove

how many from the cards we fetched from the DB have a tag?

In [4]:
tagged_cards_id = list(set(oracleid2tag.keys()) & set(card[0] for card in cards))
print(f'{len(tagged_cards_id)}/{len(cards)} cards tagged')

3370/46147 cards tagged


We preprocess the cards by joining their `name`, `type_line` and `oracle_text`, extracting the words, removing the punctuation and the stop words:

In [5]:
def preprocess_card(card, all=True):
    """ if all the fields (True) or just the 'oracle_text' (False) is considered"""
    first_field = 1 if all else 3
    card = (c[first_field:] for c in card if c is not None)
    card = ' '.join(card)
    card = re.split(r'\W+', card)
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    card = [word.translate(table) for word in card]
    # To lower case
    card = [word.lower() for word in card if word != '']
    # Remove stopwords
    card = list(set(card) - STOP_WORDS)
    card = ' '.join(card)
    return card

In [6]:
preprocessed_cards = {}
for card in cards:
    if card is None:
        continue
    id = card[0]
    card = preprocess_card(card)
    preprocessed_cards[id] = card

## Vectorization
We will vectorize the cards through 3 distinct methods: bag of words, Tf-Idf and word2vec. For bag of words and tf-idf, the pipeline is roughly 1. train/test split, 2. fit on train 3. predict on test. For word2vec it's more free.
### Bag of words

In [7]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(preprocessed_cards.values())

print(X_train_counts.shape)

(19279, 85166)


### Tf-Idf
Tf-Idf is applied on the bag of word vectorization computed above:

In [8]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

print(X_train_tf.shape)

(19279, 85166)


### word2vec
*word2vec* is a method to represent words by vectors such that their cosine proximity reflects their semantic similarity (simpler: the closer the meaning of 2 words, the closer their vectors). There exists already files containing word2vec vectors for English words trained on large corpora. Here we will train a word2vec representation of the words in the cards based on the text in the cards. So we get a fully customized word2vec representation for *MGT*:

In [13]:
path = get_tmpfile("./data/word2vec.model")

model = Word2Vec(preprocessed_cards, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("../../data/word2vec.txt")

# gzip the model
f_in = open('../../data/word2vec.txt', 'rb')
f_out = gzip.open('../../data/word2vec.txt.gz', 'wb')
f_out.writelines(f_in)
f_out.close()
f_in.close()

create model by executing `python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz` on the command line.

Result:

```
✔ Successfully created model
33it [00:00, 15872.94it/s]a/word2vec.txt.gz
✔ Loaded vectors from data/word2vec.txt.gz
✔ Sucessfully compiled vocab
499 entries, 33 vectors

```
`

And vectorize the cards:

In [14]:
nlp_mtg = load('../../data/spacy.word2vec.model')
card_vectors = []
for preprocessed_card in preprocessed_cards:
    card_vector = nlp_mtg(preprocessed_card)
    card_vectors.append(card_vector)

## Modelization

In [9]:
tag2label = {tag: i for i, tag in enumerate(list(set(oracleid2tag.values())))}

X = [preprocessed_cards[card_id] for card_id in tagged_cards_id]
y = [tag2label[oracleid2tag[card_id]] for card_id in tagged_cards_id]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

Define a pipeline bag-of-words => tf-idf => classifier

In [11]:
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', MultinomialNB()),
])

Fit it on the train data:

In [12]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [13]:
predicted = text_clf.predict(X_test)
perf = mean(predicted == y_test)

print(f'{round(100 * perf, 2)}% of the cards in the test data correctly classified')

56.51% of the cards in the test data correctly classified


What is the baseline (the performance if we classify the cards randomly)?

In [20]:
labels = list(tag2label.values())
random_perfs = []
for _ in range(1000):
    predicted_random = choice(labels, size=len(y_test))
    random_perf = mean(predicted_random == y_test)
    random_perfs.append(random_perf)
random_baseline = mean(random_perfs)

print(f'baseline: random performance: {round(100 * random_baseline, 2)}%')

baseline: random performance: 16.67%


**TODO**
1. Train to tune the model in various ways (preprocess the cards with `all=False` or `all=True`, try systematically different hyperparemeters for the vectorizer, tf-idf, try different classifiers with different classifiers)
2. Diagnostic: which cards are misclassified? Why?
3. Try modelization with word2vec vectors