# Card vectorization
## Introduction
In this notebook we will:
1. Grab parsed oracle text for cards
2. Vectorize the oracle text using word2vec
3. Train XGBoost on the vectorizations

## Preparation
1. Use requirements.txt to install dependencies
2. Execute `python -m spacy download en_core_web_sm` on the command line to install the English language module 
3. Set up the database following the instructions in the readme
4. Run `python mtg-ml.py preprocess` to parse cards in the db

### Notebook Dependencies

In [14]:
import sys
sys.path.append('../src/') # needed for nlp import

import psycopg2, re, string, gzip
from numpy import array, mean
from numpy.random import choice
from os import listdir
from os.path import join
from nlp.oracle_text_parser import OracleTextParser
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
from random import shuffle
import pprint

print("Imported modules successfully")
pp = pprint.PrettyPrinter()

Imported modules successfully


### Loading the data
We query the database directly, assuming it was already populated and preprocessed.
We work only on a subset of all MTG cards: English-language cards whose legalities contain at least one value other than `not_legal`, that is, cards that are not illegal in every format.

In [15]:
conn = psycopg2.connect(database="mtg_local", user="mtg", password="mtg_pass", port=5432, host='localhost')
cur = conn.cursor()

# This pulls the card id, name, and actions (cost/effect pairs) for each card
cur.execute("select cards.id, cards.name, actions.cost, actions.effect from cards, actions where exists( select 1 from jsonb_each_text(cards.legalities) j where j.value not like '%not_legal%') and lang='en' and cards.id = actions.card_id;")
card_actions = cur.fetchall()
cards = {}
for card_id, name, cost, effect in card_actions:
    card_action = { 'name': name, 'cost': cost, 'effect': effect }
    # group all actions of the same card under its id
    cards.setdefault(card_id, []).append(card_action)

print(f"Number of cards in our dataset: {len(cards)}")
# show the actions of the first card only
for card_id in cards:
    pp.pprint(cards[card_id])
    break
# done with db connection
cur.close()
conn.close()

Number of cards in our dataset: 28265
[{'cost': "{'red': 0, 'blue': 0, 'black': 0, 'green': 0, 'white': 0, "
          "'colorless': 0, 'generic': 0, 'life': False, 'discard': False, "
          "'loyalty': False, 'sacrifice': False, 'hybrid': False, 'tap': "
          "False, 'untap': False}",
  'effect': "{'bigrams': [], 'effect': Indestructible, 'tokens': "
            "['Indestructible'], 'nouns': [], 'verbs': [], 'phrases': []}",
  'name': 'Athreos, Shroud-Veiled'},
 {'cost': "{'red': 0, 'blue': 0, 'black': 0, 'green': 0, 'white': 0, "
          "'colorless': 0, 'generic': 0, 'life': False, 'discard': False, "
          "'loyalty': False, 'sacrifice': False, 'hybrid': False, 'tap': "
          "False, 'untap': False}",
  'effect': "{'bigrams': ['As long', 'long devotion', 'devotion to', 'to "
            "white', 'white black', 'black less', 'less Athreos', 'Athreos "
            "creature'], 'effect': As long as your devotion to white and black "
            "is less than seven, 

## Preprocessing

### Joining the cards with their tags

In [20]:
PATH_TAGS = join(*['..', 'data', 'cards-tags'])

oracleid2tag = {}
for filename in listdir(PATH_TAGS):
    tag_name = filename.split('.')[0]
    tag_count = 0
    with open(join(PATH_TAGS, filename), 'r') as file:
        for i, line in enumerate(file):
            if i == 0:  # drop the first line of the file, because it just contains the header
                continue
            oracleid2tag[line.strip()] = tag_name  # strip the trailing newline
            tag_count += 1
    print(f"Number of cards tagged with {tag_name}: {tag_count}")

Number of cards tagged with discard-outlet: 409
Number of cards tagged with draw: 75
Number of cards tagged with ramp: 641
Number of cards tagged with removal: 1778
Number of cards tagged with sacrifice-outlet: 156
Number of cards tagged with sweeper: 495


Question: how many total cards in our corpus have a tag?

In [16]:
num_tagged_cards = len(set(oracleid2tag.keys()))
print(f'{num_tagged_cards} of {len(cards)} cards tagged')

3425 of 46177 cards tagged
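
The per-tag counts printed above add up to 3554, while only 3425 distinct ids are tagged. Assuming both outputs come from the same tag files, some ids appear in more than one file, and since `oracleid2tag` stores a single tag per id, a later file silently overwrites an earlier one. The sketch below illustrates the effect on toy data and shows one possible alternative that keeps every tag. The ids and the names `tag_files`, `single_tag` and `all_tags` are made up for illustration.

```python
from collections import defaultdict

# Toy stand-in for the tag files: tag name -> ids listed in that file.
# These ids are invented for the example.
tag_files = {
    'removal': ['id-1', 'id-2', 'id-3'],
    'sweeper': ['id-3', 'id-4'],
}

single_tag = {}               # same shape as oracleid2tag: one tag per id
all_tags = defaultdict(set)   # alternative: every tag seen for an id
for tag_name, ids in tag_files.items():
    for oracle_id in ids:
        single_tag[oracle_id] = tag_name     # overwrites an earlier tag
        all_tags[oracle_id].add(tag_name)    # keeps both tags

# ids that appear in more than one tag file
multi_tagged = sorted(i for i, tags in all_tags.items() if len(tags) > 1)
print(single_tag['id-3'])
print(multi_tagged)
```

Whether a card should keep one tag or several depends on the classifier: the pipeline later in this notebook predicts exactly one label per card.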


We use the preprocessed card data stored in the database, parsed into cost and effect pairs for each action found in the oracle text.
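
The card printed earlier suggests that each `cost` is stored as the string form of a Python dict. Here is a minimal sketch, assuming that format holds for every row, of turning such a string back into a dict with the standard library's `ast.literal_eval`. The helper name `parse_cost` is hypothetical, and the sample string is copied from the output above. The `effect` strings in that output contain unquoted text, so this approach would not work for them.

```python
import ast

def parse_cost(cost_text):
    """Parse a cost string such as "{'red': 0, 'tap': False}" into a dict.

    Hypothetical helper: assumes the string is a valid Python literal.
    Returns None when the text is missing or cannot be parsed.
    """
    if cost_text is None:
        return None
    try:
        parsed = ast.literal_eval(cost_text)
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, dict) else None

sample_cost = ("{'red': 0, 'blue': 0, 'black': 0, 'green': 0, 'white': 0, "
               "'colorless': 0, 'generic': 0, 'life': False, 'discard': False, "
               "'loyalty': False, 'sacrifice': False, 'hybrid': False, 'tap': "
               "False, 'untap': False}")
cost = parse_cost(sample_cost)
print(cost['red'], cost['tap'])
```

Returning `None` for missing text avoids the kind of `'NoneType' object has no attribute ...` failure that unguarded string handling produces.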

In [29]:
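preprocessed_cards = {}
# `cards` is a dict keyed by card id, so iterate over its items instead of indexing by position
for card_index, (card_id, card) in enumerate(cards.items()):
    if card is None:
        continue
    if card_index % 100 == 0:
        print(f"Preprocessed {card_index} cards so far...")
    # preprocess_card is not defined in this notebook; it is assumed to be provided by the project code
    preprocessed_cards[card_id] = preprocess_card(card)

print(f"Preprocessed {len(preprocessed_cards)} cards")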

Preprocessed 0 cards so far...
Preprocessed 100 cards so far...
Preprocessed 200 cards so far...
Preprocessed 300 cards so far...
Preprocessed 400 cards so far...
Preprocessed 500 cards so far...
Preprocessed 600 cards so far...
Preprocessed 700 cards so far...
Preprocessed 800 cards so far...
Preprocessed 900 cards so far...
Preprocessed 1000 cards so far...
Preprocessed 1100 cards so far...
Preprocessed 1200 cards so far...
Preprocessed 1300 cards so far...
Preprocessed 1400 cards so far...
Preprocessed 1500 cards so far...


AttributeError: 'NoneType' object has no attribute 'split'

## Vectorization
We will vectorize the cards with three distinct methods: bag of words, tf-idf, and word2vec. For bag of words and tf-idf, the pipeline is roughly: (1) train/test split, (2) fit on the training set, (3) predict on the test set. For word2vec the procedure is less constrained.

### Bag of words

In [None]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(preprocessed_cards.values())

print(X_train_counts.shape)
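
To make the bag-of-words idea concrete, here is a hand-rolled sketch on three toy texts (invented for the example). It builds a vocabulary and counts how often each vocabulary word occurs in each text. It is not a reimplementation of `CountVectorizer`: among other differences, `CountVectorizer` lowercases the text and, with its default token pattern, ignores one-character tokens such as "a".

```python
from collections import Counter

# Toy card texts, made up for illustration
docs = [
    "destroy target creature",
    "draw a card",
    "destroy target artifact draw a card",
]

# The vocabulary is the sorted set of all words; each word gets one column
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of every vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is why the shape printed above is (number of cards, vocabulary size).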

### Tf-Idf
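The transformer is applied to the bag-of-words counts computed above. Note that with `use_idf=False` it only produces normalized term frequencies; the idf weighting is applied when `use_idf=True` (the default):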

In [None]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

print(X_train_tf.shape)
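
The following sketch shows, on a toy count matrix, the difference between term frequencies only and full tf-idf. It follows the formula that scikit-learn documents for its default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The matrix and variable names are made up for the example; it is an illustration, not scikit-learn's code.

```python
import math

# Toy count matrix: rows are documents, columns are terms
counts = [
    [1, 1, 0],
    [1, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

tf_only = [l2_normalize(row) for row in counts]   # like use_idf=False
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(idf)
print(tf_only[0])
print(tfidf[0])
```

The first term occurs in every document, so its idf is the minimum value of 1; the rarer terms get a larger weight, which shifts each row toward its more distinctive words.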

### word2vec
*word2vec* is a method to represent words by vectors such that their cosine proximity reflects their semantic similarity (simpler: the closer the meaning of two words, the closer their vectors). Files containing word2vec vectors for English words trained on large corpora already exist. Here we will instead train a word2vec representation of the words in the cards, based on the text of the cards. This gives us a fully customized word2vec representation for *MTG*:

In [None]:
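# Word2Vec expects an iterable of token lists, so split each preprocessed card text into words
# (this assumes preprocess_card returns a whitespace-separated string)
sentences = [card_text.split() for card_text in preprocessed_cards.values()]

# note: gensim >= 4.0 renamed the `size` parameter to `vector_size`
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("../../data/word2vec.txt")

# gzip the model
with open('../../data/word2vec.txt', 'rb') as f_in, gzip.open('../../data/word2vec.txt.gz', 'wb') as f_out:
    f_out.writelines(f_in)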
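
To illustrate what "cosine proximity" means for word vectors, here is a small standard-library sketch. The three-dimensional vectors are invented for the example; real word2vec vectors come from the trained model and have many more dimensions (100 in the cell above).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors: 'destroy' and 'exile' point in a similar direction, 'draw' does not
toy_vectors = {
    'destroy': [0.9, 0.1, 0.0],
    'exile':   [0.8, 0.2, 0.1],
    'draw':    [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors['destroy'], toy_vectors['exile'])
sim_far = cosine_similarity(toy_vectors['destroy'], toy_vectors['draw'])
print(round(sim_close, 3), round(sim_far, 3))
```

On a trained gensim model, the same quantity should be available through `model.wv.similarity(word_a, word_b)`.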

Create the spaCy model by executing `python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz` on the command line.

Result:

```
✔ Successfully created model
33it [00:00, 15872.94it/s]
✔ Loaded vectors from data/word2vec.txt.gz
✔ Sucessfully compiled vocab
499 entries, 33 vectors

```

And vectorize the cards:

In [None]:
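import spacy

nlp_mtg = spacy.load('../../data/spacy.word2vec.model')
card_vectors = []
for preprocessed_card in preprocessed_cards.values():
    # Doc.vector is the average of the token vectors of the card text
    card_vector = nlp_mtg(preprocessed_card).vector
    card_vectors.append(card_vector)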
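
A common way to get one vector per card from word vectors is to average the vectors of its words, which is also what spaCy's `Doc.vector` does by default. The sketch below shows the averaging on made-up two-dimensional vectors. The function name `average_vector` is hypothetical, and skipping unknown words is a choice of this sketch that may differ from how spaCy treats out-of-vocabulary tokens.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up word vectors for illustration
word_vectors = {
    'destroy': [1.0, 0.0],
    'target':  [3.0, 2.0],
}

card_vector = average_vector(['destroy', 'target', 'creature'], word_vectors, dim=2)
print(card_vector)
```

Averaging discards word order, so two cards that use the same words in a different order get the same vector.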

## Modeling

In [None]:
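# sort the tags so that the tag -> label mapping is the same on every run
tag2label = {tag: i for i, tag in enumerate(sorted(set(oracleid2tag.values())))}

# keep only the preprocessed cards that have a tag
# (this assumes the card ids from the database match the ids in the tag files)
tagged_cards_id = [card_id for card_id in preprocessed_cards if card_id in oracleid2tag]
X = [preprocessed_cards[card_id] for card_id in tagged_cards_id]
y = [tag2label[oracleid2tag[card_id]] for card_id in tagged_cards_id]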

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

Define a bag-of-words => tf-idf => classifier pipeline:

In [None]:
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', MultinomialNB()),
])

Fit it on the train data:

In [None]:
text_clf.fit(X_train, y_train)

In [None]:
predicted = text_clf.predict(X_test)
perf = mean(predicted == y_test)

print(f'{round(100 * perf, 2)}% of the cards in the test data correctly classified')

What is the baseline (the performance if we classify the cards randomly)?

In [None]:
labels = list(tag2label.values())
random_perfs = []
for _ in range(1000):
    predicted_random = choice(labels, size=len(y_test))
    random_perf = mean(predicted_random == y_test)
    random_perfs.append(random_perf)
random_baseline = mean(random_perfs)

print(f'baseline: random performance: {round(100 * random_baseline, 2)}%')
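
The simulation above estimates the accuracy of uniform random guessing, which is expected to be 1 divided by the number of classes whatever the class distribution (about 16.67% with six tags). Because the tag counts printed earlier are imbalanced (1778 removal entries against 75 draw entries), always predicting the most frequent class is a stronger baseline worth comparing against. The sketch below computes both on toy labels; the function names and labels are made up for illustration, and for simplicity the majority class is taken from the same list it is evaluated on.

```python
from collections import Counter

def uniform_baseline(labels):
    """Expected accuracy when guessing uniformly among the distinct labels."""
    return 1 / len(set(labels))

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy, imbalanced labels
toy_labels = ['removal'] * 6 + ['ramp'] * 3 + ['draw'] * 1

print(round(uniform_baseline(toy_labels), 4))
print(majority_baseline(toy_labels))
```

scikit-learn offers the same baselines through `sklearn.dummy.DummyClassifier` with `strategy='uniform'` or `strategy='most_frequent'`.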

**TODO**
1. Try to tune the model in various ways (preprocess the cards with `all=False` or `all=True`, systematically try different hyperparameters for the vectorizer and tf-idf, try different classifiers)
2. Diagnostic: which cards are misclassified? Why?
3. Try modeling with word2vec vectors
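
For the diagnostic item, one possible starting point is to count which true label gets confused with which predicted label and to collect the positions of the misclassified cards. The sketch below does this with the standard library on toy labels; the function names and data are made up for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def misclassified_indices(y_true, y_pred):
    """Positions where the prediction differs from the true label."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Toy labels for illustration
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]

confusion = confusion_counts(y_true, y_pred)
wrong = misclassified_indices(y_true, y_pred)
print(confusion[(0, 1)], confusion[(1, 1)])
print(wrong)
```

With the notebook's variables, the indices in `wrong` could be used to look up the corresponding entries of `X_test` and read the card texts. `sklearn.metrics.confusion_matrix` returns the same counts as a matrix.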