# Finding Similar Magic: the Gathering Cards Using TF-IDF
This program finds Magic: the Gathering cards that contain similar text to a given card, using scikit-learn's TF-IDF module.  
  
TF-IDF stands for "term frequency * inverse document frequency." It is a weighting scheme that scores each word in a document by how often it occurs there, discounted by how many documents in the collection contain it, so relatively rare words are weighted more heavily than common ones. Words that appear often in a card's text but rarely in other cards are treated as the most descriptive keywords for that card, while words common to most cards receive little weight. In this way, we can describe a card's text as a vector with one weight per word, which can then be compared to other cards' vectors to find similar cards.  
  
You can learn more about TF-IDF here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf and here: https://monkeylearn.com/blog/what-is-tf-idf/.
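
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for three toy card texts. It uses the plain textbook formula `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each vector by default, so its numbers will differ. The toy texts and the helper name `tf_idf` are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

# Toy "card texts" (hypothetical, already lowercased and stripped of punctuation)
docs = [
    "target creature gets +3/+3 until end of turn",
    "target creature gains flying until end of turn",
    "draw a card",
]

def tf_idf(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Textbook weighting: term count * log(N / df)
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in counts.items()})
    return vectors

vectors = tf_idf(docs)
# "+3/+3" appears in one document only, so it outweighs "target", which appears in two
print(vectors[0]["+3/+3"] > vectors[0]["target"])
```

Each card becomes a dictionary of word weights, which is the sparse-vector idea the rest of this notebook relies on.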

## Imports
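This project uses scikit-learn for TF-IDF and NLTK for pre-processing text. NLTK's `word_tokenize` needs its tokenizer data downloaded once (the `punkt` resource, or `punkt_tab` in recent NLTK releases, via `nltk.download`).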

In [142]:
import json
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.stem.porter import PorterStemmer

## Get Card Texts from MTGJSON
You can find complete information on every Magic card thanks to the open-source project MTGJSON. We use the 'AtomicCards' JSON file found here: https://mtgjson.com/downloads/all-files/#atomiccards. Because the full file exceeds GitHub's file size limit, I made a smaller version containing only the necessary data, called AtomicCards_Small.json. The code should work just as well with the full AtomicCards.json file.  
  
We could analyze the other properties of a card, but for now we only care about the text similarity.  
  
The code below shows how to access the raw text of a card:

In [177]:
# Read the mtgJSON file 
with open('AtomicCards_Small.json', encoding="utf8") as atomicFile:
    cards_dict = json.load(atomicFile)

# Print some of the json file
jsonString = json.dumps(cards_dict, indent=4, sort_keys=True)
print(jsonString[:1123])

{
    "data": {
        "\"Ach! Hans, Run!\"": [
            {
                "legalities": {},
                "printings": [
                    "UNH"
                ],
                "text": "At the beginning of your upkeep, you may say \"Ach! Hans, run! It's the . . .\" and the name of a creature card. If you do, search your library for a card with that name, put it onto the battlefield, then shuffle your library. That creature gains haste. Exile it at the beginning of the next end step.",
                "types": [
                    "Enchantment"
                ]
            }
        ],
        "\"Rumors of My Death . . .\"": [
            {
                "legalities": {},
                "printings": [
                    "UST"
                ],
                "text": "{3}{B}, Exile a permanent you control with a League of Dastardly Doom watermark: Return a permanent card with a League of Dastardly Doom watermark from your graveyard to the battlefield.",
              

In [155]:
# Print the text for "Fire // Ice" (both sides)
print("CARD NAME: FIRE // ICE:\n")
print("FACE 1 TEXT: ")
print( cards_dict['data']['Fire // Ice'][0]['text'] + "\n")
print("FACE 2 TEXT: ")
print( cards_dict['data']['Fire // Ice'][1]['text'] )

CARD NAME: FIRE // ICE:

FACE 1 TEXT: 
Fire deals 2 damage divided as you choose among one or two targets.

FACE 2 TEXT: 
Tap target permanent.
Draw a card.


In [156]:
# Returns the raw text of a given card.
def getRawText(cardName_str):
    # Get text from all faces of the card and concatenate it all together.
    cardEntry = cards_dict['data'][cardName_str]
    return " ".join([ face['text'] for face in cardEntry if 'text' in face.keys()])

## Pre-processing Card Text
Text is always messy, so we need to clean it up.  
  
We remove most punctuation, except for meaningful ones: in Magic cards, '{}'s denote a symbol (tap, mana, etc.), '/'s are meaningful for power/toughness, and +/- are also meaningful. We also remove the card's own name from the text; for example, "Lightning Bolt" and "Lightning Mauler" both contain the rare word "Lightning" in their text, so they would be treated as similar even though they are very different cards. We don't want this. Finally, we make all words lowercase for consistency.  

In [157]:
# Punctuation removal: https://www.semicolonworld.com/question/62188/how-to-strip-string-from-punctuation-except-apostrophes-for-nlp
punct = "!\"#$%&'()*,.:;<=>?@[\\]^_`|~"  # deliberately omits { } / + - (backslash escaped explicitly)
translator = str.maketrans('', '', punct)

def preprocessText(input_text, cardName):
    # Delete card's name from card text (or both names if it's a split card)
    namelessText = input_text
    names = cardName.split(" // ")
    for name in names:
        namelessText = namelessText.replace(name, '')
    # Make text lowercase
    lowerText = namelessText.lower()
    # Remove punctuation, keeping {}, /, + and -. In Magic cards, {}s denote a symbol (tap, mana, etc.)
    noPunctText = lowerText.translate(translator)
    return noPunctText


# Example:
print("CLEANED TEXT FOR FIRE // ICE: ")
rawText = getRawText('Fire // Ice')
print(preprocessText(rawText, 'Fire // Ice'))

CLEANED TEXT FOR FIRE // ICE: 
 deals 2 damage divided as you choose among one or two targets tap target permanent
draw a card
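
The `replace` call above removes the card name wherever it occurs, including inside a longer word. A possible refinement, which is not what the code above does, is to match the name only when it is not attached to other word characters. The sketch below assumes a hypothetical helper `remove_name` and a made-up sentence; it is an illustration under those assumptions, not a drop-in fix.

```python
import re

def remove_name(text, card_name):
    # Split cards such as "Fire // Ice" carry one name per face
    for name in card_name.split(" // "):
        # Lookarounds ensure the name is not glued to other word characters
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        text = re.sub(pattern, "", text)
    return text

sample = "Fire deals 2 damage. Firebreathing stays."
plain = sample.replace("Fire", "")          # behaviour of the simple approach
bounded = remove_name(sample, "Fire // Ice")
print(plain)
print(bounded)
```

The simple approach turns "Firebreathing" into "breathing", while the bounded version leaves it intact. Whether that matters depends on how often card names appear as prefixes of other words in the corpus.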


We also want to exclude certain cards from our search. Magic has some weird card sets that we should disregard, like the "Un-" sets and 'Plane' cards from the Planechase sets. You could also just filter by format legality if desired ('commander', 'duel', 'legacy', 'modern', 'pauper', 'vintage').

In [158]:
def isExcluded(cardName_str):
    cardEntry = cards_dict['data'][cardName_str][0]
    # Filter out 'Plane' and 'Scheme' cards
    if 'Plane' in cardEntry['types'] or 'Scheme' in cardEntry['types']: return True
    # Filter out weird sets
    excludedSets = {"PVAN", "PCEL", "CMB1", "UNH", "UST", "UND", "UGL"}
    if not set(cardEntry['printings']).isdisjoint(excludedSets): return True
    # Alternatively, filter by format
    #if cardEntry['legalities'].get('modern') != 'Legal': return True
    return False

# Example:
print("Exclude the card 'Chaos Confetti' from the 'Unglued' set?")
print(isExcluded('Chaos Confetti'))
print("Exclude the card 'Lightning Bolt'?")
print(isExcluded('Lightning Bolt'))

Exclude the card 'Chaos Confetti' from the 'Unglued' set?
True
Exclude the card 'Lightning Bolt'?
False
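
As a sketch of the format-based alternative: some entries (the Un-set cards shown earlier) have an empty `legalities` dictionary, so indexing it directly would raise a `KeyError`. The example below uses made-up card entries and a hypothetical helper `is_legal_in` to show a lookup that tolerates missing keys.

```python
# Hypothetical entries shaped like the card faces in AtomicCards_Small.json
sample_cards = {
    "Legal Card": {"legalities": {"modern": "Legal", "legacy": "Legal"}},
    "Banned Card": {"legalities": {"modern": "Banned", "legacy": "Legal"}},
    "Silly Card": {"legalities": {}},  # no legalities listed at all
}

def is_legal_in(card_entry, card_format):
    # .get() returns None when a key is absent, so missing formats count as "not legal"
    return card_entry.get("legalities", {}).get(card_format) == "Legal"

for name, entry in sample_cards.items():
    print(name, is_legal_in(entry, "modern"))
```

A card would then be excluded whenever `is_legal_in` returns `False` for the chosen format.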


## Construct a Corpus
Now we construct our corpus: two parallel lists, where one holds the title of the document (card name) and the other holds the corresponding document (card text). *Note: A dictionary structure might be more organized, but harder to sort.*

In [159]:
def constructCorpus():
    # Parallel lists
    cardNames = []  # aka "titles" of our documents (list of strings)
    cardTexts = []  # aka our "documents" (list of strings)

    # For each card, clean up the text, then put its name & text into parallel lists
    card_names = cards_dict["data"].keys()
    for cardName in card_names:
        # Skip unwanted cards
        if isExcluded(cardName): continue

        # Get the card text (text from all faces of a multi-face card, joined into one string)
        text = getRawText(cardName)

        # Clean the text
        cleanText = preprocessText(text, cardName) 

        # Add card names and text to parallel lists
        cardNames.append(cardName)
        cardTexts.append(cleanText)
        
    return cardNames, cardTexts

# Make the corpus
cardNames, cardTexts = constructCorpus()

# Debugging method for accessing the processed text of a card
def getCleanText(cardName):
    return cardTexts[cardNames.index(cardName)]

# Example - print some of the corpus
for i in range(3):
    print("CARD: " + cardNames[i])
    print("TEXT: " + cardTexts[i] + "\n")
print("etc...")

CARD: Abandon Hope
TEXT: as an additional cost to cast this spell discard x cards
look at target opponents hand and choose x cards from it that player discards those cards

CARD: Abandon Reason
TEXT: up to two target creatures each get +1/+0 and gain first strike until end of turn
madness {1}{r} if you discard this card discard it into exile when you do cast it for its madness cost or put it into your graveyard

CARD: Abandoned Outpost
TEXT:  enters the battlefield tapped
{t} add {w}
{t} sacrifice  add one mana of any color

etc...
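
The `getCleanText` helper above scans the whole name list on every call via `list.index`. If lookups become frequent, one option is to keep the parallel lists and add a name-to-position dictionary next to them. The sketch below uses shortened stand-in lists and hypothetical names (`name_to_index`, `get_clean_text`); it is one possible arrangement, not the project's code.

```python
# Stand-ins for the cardNames / cardTexts lists built above (texts shortened for illustration)
card_names = ["Abandon Hope", "Abandon Reason", "Abandoned Outpost"]
card_texts = ["discard x cards", "target creatures get +1/+0", "enters the battlefield tapped"]

# Map each name to its position once, instead of scanning the list on every lookup
name_to_index = {name: i for i, name in enumerate(card_names)}

def get_clean_text(card_name):
    return card_texts[name_to_index[card_name]]

print(get_clean_text("Abandon Reason"))
```

The lists keep their order, which the TF-IDF matrix rows depend on, while the dictionary gives constant-time lookup by name.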


## Train TF-IDF Model
We can now tokenize our text, stem each token, and then use these tokens to train our TF-IDF model.
Try experimenting with the vectorizer's parameters to fine-tune the model (for example `min_df`/`max_df` for document-frequency cutoffs, or the stop-word list).  
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html  
Reference Tutorial: https://www2.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html  
*Note: this usually takes a few seconds.*
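
To give a feel for what document-frequency cutoffs do, here is a small standard-library sketch that keeps only words appearing in at least `min_count` documents and in at most a `max_fraction` share of them. This mirrors the idea behind `TfidfVectorizer`'s `min_df` and `max_df` parameters; the function `filter_vocabulary`, its parameter names, and the toy texts are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical card texts, already cleaned)
docs = [
    "target creature gets +3/+3 until end of turn",
    "target creature gains flying until end of turn",
    "destroy target creature",
    "draw a card",
]

def filter_vocabulary(docs, min_count=2, max_fraction=0.7):
    tokenized = [set(doc.split()) for doc in docs]
    # Number of documents each word appears in
    df = Counter(word for tokens in tokenized for word in tokens)
    limit = max_fraction * len(docs)
    # Drop words that are too rare or too widespread
    return sorted(w for w, n in df.items() if min_count <= n <= limit)

print(filter_vocabulary(docs))
```

Here "target" and "creature" are dropped for appearing in three of four documents, and one-off words such as "flying" are dropped for being too rare, leaving only the words of "until end of turn".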

In [160]:
# Helper for tfidf - breaks text into list of tokens.
def tokenize(text):
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

# Helper for tokenize - stemming means removing the suffix from a word to reduce it to its root word.
def stem_tokens(tokens, stemmer):
    return [stemmer.stem(item) for item in tokens]

# Common words in Magic cards that add little to a card's meaning. The list is handmade, so it may be missing some.
# Stop words are compared against stemmed tokens, hence entries like "thi" and "wa" (the stems of "this" and "was").
my_stop_words = ["a","an","and","and/or","are","as","at","be","by","could","dont","for","if","in","is","it","its","may","of","on","or","that","thats","the","their","them","then","they","thi","those","to","wa","where","with","you","your","—"]

# Train the tf-idf model: get vectors representing each card
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words=my_stop_words)
tfs = tfidf.fit_transform(cardTexts)

## Find Similar Cards
Finally, we can search for the most similar cards. Since each card has a TF-IDF vector, we can use cosine similarity to find vectors that are most similar to each other. 
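
As a minimal sketch of the measure itself: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. The example below stores vectors as `{term: weight}` dictionaries with made-up weights; the function name `cosine` is hypothetical, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    # u and v are sparse vectors stored as {term: weight} dicts
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

a = {"damage": 1.0, "divided": 2.0}
b = {"damage": 2.0, "divided": 4.0}  # same direction as a, twice as long
c = {"draw": 1.0, "card": 1.0}       # shares no terms with a

print(round(cosine(a, b), 6))
print(cosine(a, c))
```

Because only direction matters, a card whose text repeats the same sentence several times can score as high as a card containing that sentence once. That is likely why Bounty of Might ties with Giant Growth in Ex. 1 below.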

In [161]:
# Reference: https://stackoverflow.com/a/12128777 
def getSimilarTexts(cardText, n):
    # Calculate cosine similarities between the input text and every card in the corpus
    response = tfidf.transform([cardText])
    cosine_similarities = cosine_similarity(response, tfs).flatten()

    # Sort all the cosine similarities and keep the top n closest
    related_docs_indices = cosine_similarities.argsort()[:-n-1:-1]
    closest = [cardNames[idx] for idx in related_docs_indices]
    return closest #list of card names (strings)

def getSimilarCards(cardName, n):
    cardText = preprocessText(getRawText(cardName), cardName)
    return getSimilarTexts(cardText, n)

def printNClosest(cardName, n):
    print("CLOSEST " + str(n) + " CARDS TO " + cardName + ": \n")
    for name in getSimilarCards(cardName, n):
        print("CARD: " + name + "\nTEXT: " + getRawText(name) + "\n")

# Example:
print("RESULTS: ", getSimilarCards('Fire // Ice', 20), "and so on...\n")
printNClosest('Fire // Ice', 5)

RESULTS:  ['Fire // Ice', 'Electrolyze', 'Forked Bolt', "Chandra's Pyrohelix", 'Twin Bolt', 'Flames of the Firebrand', 'Arc Lightning', 'Forked Lightning', 'Aerial Volley', 'Fire at Will', 'Deft Dismissal', 'Gang of Devils', 'Arc Mage', 'Ignite Disorder', 'Rolling Thunder', 'Pyrotechnics', 'Boulderfall', 'Hail of Arrows', 'Spreading Flames', 'Inferno Titan'] and so on...

CLOSEST 5 CARDS TO Fire // Ice: 

CARD: Fire // Ice
TEXT: Fire deals 2 damage divided as you choose among one or two targets. Tap target permanent.
Draw a card.

CARD: Electrolyze
TEXT: Electrolyze deals 2 damage divided as you choose among one or two targets.
Draw a card.

CARD: Forked Bolt
TEXT: Forked Bolt deals 2 damage divided as you choose among one or two targets.

CARD: Chandra's Pyrohelix
TEXT: Chandra's Pyrohelix deals 2 damage divided as you choose among one or two targets.

CARD: Twin Bolt
TEXT: Twin Bolt deals 2 damage divided as you choose among one or two targets.



## Examples

### Ex. 1: Giant Growth

In [162]:
printNClosest('Giant Growth', 5)

CLOSEST 5 CARDS TO Giant Growth: 

CARD: Bounty of Might
TEXT: Target creature gets +3/+3 until end of turn.
Target creature gets +3/+3 until end of turn.
Target creature gets +3/+3 until end of turn.

CARD: Giant Growth
TEXT: Target creature gets +3/+3 until end of turn.

CARD: Brute Force
TEXT: Target creature gets +3/+3 until end of turn.

CARD: Seal of Strength
TEXT: Sacrifice Seal of Strength: Target creature gets +3/+3 until end of turn.

CARD: Sudden Strength
TEXT: Target creature gets +3/+3 until end of turn.
Draw a card.



### Ex. 2: Young Pyromancer

In [163]:
printNClosest("Young Pyromancer", 5)

CLOSEST 5 CARDS TO Young Pyromancer: 

CARD: Young Pyromancer
TEXT: Whenever you cast an instant or sorcery spell, create a 1/1 red Elemental creature token.

CARD: Blaze Commando
TEXT: Whenever an instant or sorcery spell you control deals damage, create two 1/1 red and white Soldier creature tokens with haste.

CARD: Tempt with Vengeance
TEXT: Tempting offer — Create X 1/1 red Elemental creature tokens with haste. Each opponent may create X 1/1 red Elemental creature tokens with haste. For each opponent who does, create X 1/1 red Elemental creature tokens with haste.

CARD: Scampering Scorcher
TEXT: When Scampering Scorcher enters the battlefield, create two 1/1 red Elemental creature tokens. Elementals you control gain haste until end of turn. (They can attack and {T} this turn.)

CARD: Seasoned Pyromancer
TEXT: When Seasoned Pyromancer enters the battlefield, discard two cards, then draw two cards. For each nonland card discarded this way, create a 1/1 red Elemental creature token.

### Ex. 3: A Made-up Card
This also works with arbitrary text; the input doesn't have to be a real card.  
Here I invented a card with the ability: "Flip a coin; if heads, you win the game immediately. If tails, you lose the game."

In [164]:
madeupCardText = "Flip a coin; if heads, you win the game immediately. If tails, you lose the game."
closest = getSimilarTexts(madeupCardText, 5)

print("CLOSEST 5 CARDS TO 'MY_MADEUP_CARD': \n")
for name in closest:
    print("CARD: " + name + "\nTEXT: " + getRawText(name) + "\n")

CLOSEST 5 CARDS TO 'MY_MADEUP_CARD': 

CARD: Platinum Angel
TEXT: Flying
You can't lose the game and your opponents can't win the game.

CARD: Abyssal Persecutor
TEXT: Flying, trample
You can't win the game and your opponents can't lose the game.

CARD: Platinum Angel Avatar
TEXT: If you control an artifact, a creature, an enchantment, and a land, you can't lose the game and your opponents can't win the game.

CARD: Mana Clash
TEXT: You and target opponent each flip a coin. Mana Clash deals 1 damage to each player whose coin comes up tails. Repeat this process until both players' coins come up heads on the same flip.

CARD: Amulet of Quoz
TEXT: Remove Amulet of Quoz from your deck before playing if you're not playing for ante.
{T}, Sacrifice Amulet of Quoz: Target opponent may ante the top card of their library. If they don't, you flip a coin. If you win the flip, that player loses the game. If you lose the flip, you lose the game. Activate this ability only during your upkeep.



## Thanks for checking out my project!