# MTG Price Predictor

## About/Goals: 

The idea of this project is to create an ML model that can take a card's data and return the value of the card should be, based on previous cards it has analysed. This requires NLP processing for the text box, and utilizes tensorflow to build the model.

Cards are evaluated purely based on what a person looking at it for the first time can see - year, text, color, etc.. nothing about format legalities, special flags, or anythingn else of the sort. Also, price is based on standard, nonfoil variant only.

# Data Importation and Cleaning

#### All imports cell

In [1447]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import re
import os
from sklearn.preprocessing import RobustScaler

In [1448]:
def save(file):
    file.to_pickle('data/updated_data.pkl')

def load():
    df = pd.read_pickle('data/updated_data.pkl')
    return df # band-aid work-around

In [1449]:
df = pd.read_json("data/default_cards_08_05_2025.json")

In [1450]:
df.columns

Index(['object', 'id', 'oracle_id', 'multiverse_ids', 'mtgo_id', 'arena_id',
       'tcgplayer_id', 'cardmarket_id', 'name', 'lang', 'released_at', 'uri',
       'scryfall_uri', 'layout', 'highres_image', 'image_status', 'image_uris',
       'mana_cost', 'cmc', 'type_line', 'oracle_text', 'colors',
       'color_identity', 'keywords', 'produced_mana', 'legalities', 'games',
       'reserved', 'game_changer', 'foil', 'nonfoil', 'finishes', 'oversized',
       'promo', 'reprint', 'variation', 'set_id', 'set', 'set_name',
       'set_type', 'set_uri', 'set_search_uri', 'scryfall_set_uri',
       'rulings_uri', 'prints_search_uri', 'collector_number', 'digital',
       'rarity', 'card_back_id', 'artist', 'artist_ids', 'illustration_id',
       'border_color', 'frame', 'full_art', 'textless', 'booster',
       'story_spotlight', 'prices', 'related_uris', 'purchase_uris',
       'mtgo_foil_id', 'power', 'toughness', 'flavor_text', 'edhrec_rank',
       'penny_rank', 'all_parts', 'promo_types

can drop so many of these columns, only need ones that are useful.

### Column Cleaning

In [1451]:
len(df)

108955

In [1452]:
df.value_counts("variation")
#df.value_counts("oversized")

variation
False    108867
True         88
Name: count, dtype: int64

In [1453]:
df = df[(df["variation"]==False) & (df["reprint"]==False) & (df["oversized"]==False) & (df["promo"]==False) & (df["full_art"]==False) & (df["textless"] == False) & (df["content_warning"]!=True)]

In [1454]:
len(df) # Dropped about 40k entries

41387

In [1455]:
doubles = df["card_faces"].dropna()
doubles.iloc[0] ## this is gonna be tough to work with

[{'object': 'card_face',
  'name': "Obyra's Attendants",
  'mana_cost': '{4}{U}',
  'type_line': 'Creature — Faerie Wizard',
  'oracle_text': 'Flying',
  'power': '3',
  'toughness': '4',
  'flavor_text': "Obyra's devoted servants shrieked as their sleeping mistress slashed at them, unseeing.",
  'artist': 'Andreas Zafiratos',
  'artist_id': 'e2f13a9a-57c5-40de-81d4-3b0723899cdf',
  'illustration_id': 'd1ea5321-62e2-4894-a79f-03b792daf2c8'},
 {'object': 'card_face',
  'name': 'Desperate Parry',
  'mana_cost': '{1}{U}',
  'type_line': 'Instant — Adventure',
  'oracle_text': 'Target creature gets -4/-0 until end of turn. (Then exile this card. You may cast the creature later from exile.)',
  'artist': 'Andreas Zafiratos',
  'artist_id': 'e2f13a9a-57c5-40de-81d4-3b0723899cdf'}]

For first iteration of model, will be removing the multi-faced cards

In [1456]:
df = df[df["card_faces"].isna()]

In [1457]:
def is_legal(x):
    if 'legal' in x.values():
        return True
    else:
        return False

In [1458]:
df["playable"] = df["legalities"].apply(is_legal)

In [1459]:
df = df[df["playable"] == True] # remove unnplayable cards

In [1460]:
unneeded = ['object', 'id', 'oracle_id', 'multiverse_ids', 'mtgo_id', 'arena_id',
       'tcgplayer_id', 'cardmarket_id', #'name',
         'lang', 'uri',
       'scryfall_uri', 'layout', 'highres_image', 'image_status', 'image_uris', 'legalities', 'games',
       'reserved', 'game_changer', 'finishes', 'oversized',
       'promo', 'reprint', 'variation', 'set_id', 'set', 'set_name',
       'set_type', 'set_uri', 'set_search_uri', 'scryfall_set_uri',
       'rulings_uri', 'prints_search_uri', 'collector_number', 'digital', 'card_back_id', 'artist', 'artist_ids', 'illustration_id',
       'border_color', 'frame', 'full_art', 'textless', 'booster',
       'story_spotlight', 'related_uris', 'purchase_uris',
       'mtgo_foil_id', 'flavor_text', 'edhrec_rank',
       'penny_rank', 'all_parts', 'promo_types', 'security_stamp', 'preview', 'watermark', 'frame_effects', 'loyalty',
       'printed_name', 'tcgplayer_etched_id', 'flavor_name',
       'attraction_lights', 'color_indicator', 'printed_type_line',
       'printed_text', 'variation_of', 'life_modifier', 'hand_modifier',
       'content_warning', 'defense', 'card_faces', 'foil', 'nonfoil'
       , 'playable', 'color_identity']
df = df.drop(unneeded, axis=1)

now that I've cleaned out cards and columns that aren't needed, I need to figure out the best way to transform this data into something that the ML model can actually use. 

For instance, the "type_line" column will have to be split up into various super and subtypes, probably using categorical encoding.

### Type Labeling + Encoding

In [1461]:
df["type_line"].describe() # has 4246 unique types currently

count       35098
unique       3120
top       Instant
freq         3592
Name: type_line, dtype: object

In [1462]:
"""
df["creature_type"] = df["type_line"].apply(lambda x: x[10:] if "Creature" in x else "NaN")
df["planeswalker_type"] = df["type_line"].apply(lambda x: x[24:] if "Planeswalker" in x else "NaN")
df["kindred_type"] = df["type_line"].apply(lambda x: x.split()[-1] if "Kindred" in x or "Tribal" in x else "NaN")"""
# Found a better way!

'\ndf["creature_type"] = df["type_line"].apply(lambda x: x[10:] if "Creature" in x else "NaN")\ndf["planeswalker_type"] = df["type_line"].apply(lambda x: x[24:] if "Planeswalker" in x else "NaN")\ndf["kindred_type"] = df["type_line"].apply(lambda x: x.split()[-1] if "Kindred" in x or "Tribal" in x else "NaN")'

In [1463]:
def filter_subtype(x):
    if "—" in x:
        ind = x.index("—")
        types = x[ind+1:].split()
        return types
    else:
        return []

In [1464]:
def filter_maintype(x):
    types = ["Artifact", "Land", 
             #"Battle", 
             "Creature", "Enchantment", "Planeswalker", "Instant", "Sorcery"]
    cur = []
    for type in types:
        if type in x:
            cur.append(type)
    return cur

In [1465]:
df = df[~df['type_line'].str.contains('Basic')] # remove basic lands

In [1466]:
df["legendary"] = df["type_line"].apply(lambda x: 1 if "Legendary" in x else 0)
df["subtype"] = df["type_line"].apply(filter_subtype)
df["main_type"]=df["type_line"].apply(filter_maintype)

In [1467]:
df["price"] = df["prices"].str["usd"].astype(float) 
df = df.drop("prices", axis=1)

In [1468]:
df = df.dropna(subset=["price"])
df = df[df["main_type"].map(len)>0]  # had to filter out bad cards with other types such as conspiracies and stickers

In [1469]:
#le = LabelEncoder()
#le.fit(["Artifact", "Land", "Battle", "Creature", "Enchantment", "Planeswalker", "Instant", "Sorcery"])

In [1470]:
#df["main_type"] = df["main_type"].apply(lambda x: le.transform(x)) 

#### Changed to One-Hot Encoding!

In [1471]:
from sklearn.preprocessing import MultiLabelBinarizer

In [1472]:
mlb = MultiLabelBinarizer()

In [1473]:
test = pd.DataFrame(mlb.fit_transform(df["main_type"]), columns=mlb.classes_, index=df.index)

In [1474]:
df = df.drop('main_type', axis=1)
df = df.join(test)

In [1475]:
le2 = LabelEncoder()

In [1476]:
df["subtype"].values

array([list(['Sliver']), list(['Kor', 'Soldier']),
       list(['Siren', 'Pirate']), ..., list([]),
       list(['Faerie', 'Rogue']), list(['Vampire', 'Soldier'])],
      dtype=object)

In [1477]:
subtypes = []
for unique in df["subtype"].values:
    if unique != []:
        for val in unique:
            if val not in subtypes:
                subtypes.append(val)

le2.fit(subtypes)


In [1478]:
df["subtype"] = df["subtype"].apply(lambda x: le2.transform(x)) # slow, probably better way to do this 

In [1479]:
df = df.drop("type_line", axis=1)

### Date -> Year

In [1480]:
df["year"] = df["released_at"].apply(lambda x: x.year)
df = df.drop("released_at", axis=1)

### Mana Cost Breakdown

- Number of Pips
- Is X spell?

In [1481]:
df["is_x"] = df["mana_cost"].apply(lambda x: 1 if r"{X}" in x else 0)
#df.loc[df["is_x"] == 1]

In [1482]:
#print("{1}{W/R}{G}{G}".replace("{", "").replace("}", " ").split())  ->  df

def pip_counter(x):
    new = x.replace("{", "").replace("}", " ").split()
    count = 0
    for x in new:
        if x.isdigit() == False and x != "X":
            count += 1

    return count

In [1483]:
df["pip_count"] = df["mana_cost"].apply(pip_counter)
df = df.drop("mana_cost", axis=1)
#df.sort_values(by="pip_count", ascending=False)

### Oracle Text Breakdown -> NLP ? Or can try a parsing method to turn text into columns 

- activated ability?
- etb effect?

In [1484]:
df["oracle_text"] = df["oracle_text"].apply(lambda x: x.lower().replace("\n", ". "))

#### Main Phrases

using vectorizer to figure out most common substrings

In [1485]:
vectorizer = CountVectorizer(ngram_range=(4, 7), lowercase=True, stop_words=None)

In [1486]:
X = vectorizer.fit_transform(df["oracle_text"])

In [1487]:
sum_words = X.sum(axis=0)
word_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
word_freq = sorted(word_freq, key=lambda x: x[1], reverse=True)

In [1488]:
common_phrases = pd.DataFrame(word_freq, columns=['phrase', 'count'])
common_phrases.iloc[:10]

Unnamed: 0,phrase,count
0,until end of turn,5990
1,at the beginning of,3261
2,when this creature enters,3049
3,gets until end of,2131
4,gets until end of turn,2130
5,the beginning of your,1869
6,at the beginning of your,1814
7,card from your graveyard,1593
8,creature gets until end,1570
9,creature gets until end of,1570


In [1489]:
"""
PHRASES = { # starter phrases
    r"when.*enters": "etb",
    r"until end of turn": "eot",
    r"beginning of .* upkeep": "b_o_u",
    r"search .* library": "tutor",
    r"without paying": "free",
    r"whenever .* attacks": "a_t",
    r"deals combat damage": "c_d_t",
    r"look at the tzo.*p": "s_s_t",
    r"return.*graveyard.*battlefield": "reanimate",
    r"when.*(this|a).*dies": "o_d_t",
    r"when.*(this|a).*leaves.": "l_b_t",
}"""

'\nPHRASES = { # starter phrases\n    r"when.*enters": "etb",\n    r"until end of turn": "eot",\n    r"beginning of .* upkeep": "b_o_u",\n    r"search .* library": "tutor",\n    r"without paying": "free",\n    r"whenever .* attacks": "a_t",\n    r"deals combat damage": "c_d_t",\n    r"look at the tzo.*p": "s_s_t",\n    r"return.*graveyard.*battlefield": "reanimate",\n    r"when.*(this|a).*dies": "o_d_t",\n    r"when.*(this|a).*leaves.": "l_b_t",\n}'

In [1490]:
#def canonicalize_text(str):
#    for phr, rep in PHRASES.items():
#        if re.search(phr, str) != None:
#            str = re.sub(phr, rep, str)
#    return str

In [1491]:
#df["oracle_text"] = df["oracle_text"].apply(canonicalize_text)

In [1492]:
def extract_math_expression_with_variables(input_string):
  pattern = r'[\d\s.+\-*/(){}:a-zA-Z]'  # Matches digits, dot, plus, minus, asterisk, slash, parentheses, and letters
  
  # Use re.findall() to find all non-overlapping matches of the pattern
  matches = re.findall(pattern, input_string)
  
  # Join the matched characters back into a string
  return "".join(matches)

In [None]:
df["oracle_text"] = df["oracle_text"].apply(lambda x: extract_math_expression_with_variables(x))

In [1494]:
def find_activated_abilities(input_string):
  patterns = r"pay {|:"  
    
    # Use re.findall() to find all non-overlapping matches of the pattern

  match = re.search(patterns, input_string)
  if match:
    str = input_string
  else:
    str = "None"

  return str

In [1495]:
def cost_calc(input):
    if input != "None":
        matches = re.findall(r'pay \{.*\}+', input)
        if matches == []:
            matches = re.findall(r'((?:\{\d?[a-z]?\/?[a-z]?+\}\s?+)+:)|((?:\{\d?[a-z]?\/?[a-z]?+\}\s?+)[\da-z\s].+?):', input)
                #r'(?:((?:\{\d?[a-z]?\/?[a-z]?+\})\s?)+:)|(?:((?:\{\d?[a-z]?\/?[a-z]?+\})\s?)+[\da-z\s].+?:)', input)
            if matches == []:
                matches = ["{0}"]
            else:
                i = 0
                while i <= len(matches)-1:
                    matches[i] = [capture for capture in matches[i] if capture != ''][0]
                    i += 1
                   
        else:
            matches = [matches[0][4:]]
    else:   
        matches = []
    return matches

In [1496]:
def cost_to_num(input):
    sum = []
    if input != []:
        for capt in input:
            i = 0
            parens = capt.find("(")
            if parens != -1:
                i = parens
            j = i + 1
            cost = 0
            while j <= len(capt)-1:
                if capt[i] == "{":
                    while capt[j] != "}":
                        j += 1
                    i += 1
                    if capt[i] == "t":
                        cost += 0.5
                    elif capt[i].isdigit() == False:
                        if capt[i] != "x":
                            cost += 1
                    else:
                        num = ''
                        while i != j:
                            if capt[i].isdigit():
                                num += capt[i]
                                i += 1
                            else:
                                i = j
                        cost += int(num)
                j += 1
                i = j-1
            sum.append(cost)
    return sum


In [1497]:
df["activated_ability"] = df["oracle_text"].apply(lambda x: find_activated_abilities(x))
df["has_activated_ability"] = df["activated_ability"].apply(lambda x: 1 if x != "None" else 0)

In [1498]:
df["activated_ability_cost"] = df["activated_ability"].apply(lambda x: cost_calc(x))

In [1499]:
def remove_cost(input):
    if input["has_activated_ability"] == 1:
        costs = input["activated_ability_cost"]
        for cost in costs:
            cap = input["oracle_text"].find(cost)
            end = cap + len(cost)
            input["oracle_text"] = input["oracle_text"][:cap] + input["oracle_text"][end:]
    return input

In [1500]:
df = df.apply(remove_cost, axis=1)

In [1501]:
df["activated_ability_cost"] = df["activated_ability_cost"].apply(lambda x: cost_to_num(x))

In [1502]:
text_vectorizer = TextVectorization(
    max_tokens=4000, # increased vocab size
    output_mode='int',
    ngrams = (2,6),
    encoding='utf-8'
)

In [1503]:
text_vectorizer.adapt(df["oracle_text"])

In [1504]:
df = df.drop("activated_ability", axis=1)

### Rarity Encoding

In [1505]:
df["rarity"].value_counts()

rarity
rare        11445
common       9878
uncommon     8967
mythic       1987
special         2
Name: count, dtype: int64

In [1506]:
df["rarity"] = df["rarity"].apply(lambda x: "common" if x == "special" else x) # the two special cards have common rarity on scryfall

In [1507]:
def rarity_encoding(x):
    if x == "common":
        return 0
    elif x == "uncommon":
        return 1
    elif x == "rare":
        return 2
    else:
        return 3

In [1508]:
df["rarity"] = df["rarity"].apply(lambda x: rarity_encoding(x))

### Keyword Encoding

In [1509]:
keywrd_encoder = LabelEncoder()

In [1510]:
keywords = [] # taking code from earlier
for unique in df["keywords"].values:
    if unique != []:
        for val in unique:
            if val not in keywords:
                keywords.append(val)

keywrd_encoder.fit(keywords)


In [1511]:
len(keywords)

604

In [1512]:
df["keywords"] = df["keywords"].apply(lambda x: keywrd_encoder.transform(x))

### Color Identity Encoding (one more time!)

definitely could've done a for loop for each column i wanted to encode but oh well it's a little late for that

In [1513]:
c_i = LabelEncoder()

In [1514]:
c_i.fit(["W", "G", "R", "B", "U"])

In [1515]:
c_i.transform(["W", "U"])

array([4, 3])

In [1516]:
df["colors"] = df["colors"].apply(lambda x: c_i.transform(x))

### Produced Mana 

changing this to binary value if it does(n't)

In [1517]:
df["produced_mana"] = df["produced_mana"].replace(pd.NA, 0)

In [1518]:
df["produced_mana"] = df["produced_mana"].apply(lambda x: 1 if x != 0 else x)

### Final Step: Replace Power/Toughness NaN with -1

In [1519]:
df["power"] = pd.to_numeric(df["power"], errors="coerce").fillna(-1)
df["toughness"] = pd.to_numeric(df["toughness"], errors="coerce").fillna(-1)

### Final Data Filtering

In [1520]:
df = df.loc[(df["price"] <= 500) & (df["price"] >= 0.05) & (df["year"] >= 2000)] # filter out cards that are too expensive/cheap/old

In [1521]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,name,cmc,oracle_text,colors,keywords,produced_mana,rarity,power,toughness,legendary,...,Enchantment,Instant,Land,Planeswalker,Sorcery,year,is_x,pip_count,has_activated_ability,activated_ability_cost
0,Fury Sliver,6.0,all sliver creatures have double strike.,[2],[],0,1,3.0,3.0,0,...,0,0,0,0,0,2006,0,1,0,[]
1,Kor Outfitter,2.0,when this creature enters you may attach targe...,[4],[],0,0,2.0,2.0,0,...,0,0,0,0,0,2009,0,2,0,[]
2,Siren Lookout,3.0,flying. when this creature enters it explores....,[3],"[230, 198]",0,0,1.0,2.0,0,...,0,0,0,0,0,2017,0,1,0,[]
3,Surge of Brilliance,2.0,paradox draw a card for each spell youve cast...,[3],"[390, 238]",0,1,-1.0,-1.0,0,...,0,1,0,0,0,2023,0,1,1,[2]
4,Venerable Knight,1.0,when this creature dies put a +1/+1 counter on...,[4],[],0,1,2.0,1.0,0,...,0,0,0,0,0,2019,0,1,0,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25060,Morkrut Banshee,5.0,morbid when this creature enters if a creatur...,[0],[366],0,1,4.0,4.0,0,...,0,0,0,0,0,2011,0,2,0,[]
25061,Deeproot Historian,4.0,merfolk and druid cards in your graveyard have...,[1],[],0,2,3.0,3.0,0,...,0,0,0,0,0,2023,0,1,0,[]
25062,Aggressive Biomancy,2.0,create x tokens that are copies of target crea...,"[1, 3]",[215],0,2,-1.0,-1.0,0,...,0,0,0,0,1,2024,1,2,0,[]
25063,Faerie Bladecrafter,3.0,flying. whenever one or more faeries you contr...,[0],[230],0,2,2.0,2.0,0,...,0,0,0,0,0,2023,0,1,0,[]


### Additional Rows Added Later

In [1522]:
df["num_colors"] = df["colors"].apply(lambda x: len(x))
#df["num_keywords"] = df["keywords"].apply(lambda x: len(x))
#df["num_subtypes"] = df["subtype"].apply(lambda x: len(x))
#df["num_maintypes"] = df["main_type"].apply(lambda x: len(x))
df["oracle_length"] = df["oracle_text"].apply(lambda x: len(x))
df["length per cmc"] = df["oracle_length"]/df["cmc"]
df["length per cmc"] = df["length per cmc"].replace(np.inf, 0)
df["length per cmc"] = df["length per cmc"].replace(np.nan, 0)

In [1523]:
df.drop_duplicates(subset="name", keep="first", inplace=True)

In [1524]:
save(df)

## Model Building

### Numerical Data

In [1525]:
numerical = df.drop(["name", "oracle_text", "colors", "keywords", "subtype", "price", "activated_ability_cost"], axis=1)

In [1526]:
numerical.columns

Index(['cmc', 'produced_mana', 'rarity', 'power', 'toughness', 'legendary',
       'Artifact', 'Creature', 'Enchantment', 'Instant', 'Land',
       'Planeswalker', 'Sorcery', 'year', 'is_x', 'pip_count',
       'has_activated_ability', 'num_colors', 'oracle_length',
       'length per cmc'],
      dtype='object')

In [1535]:
len(numerical)

21309

In [1527]:
scaler = RobustScaler()
scaled_data = scaler.fit_transform(numerical)

In [1528]:
number_inputs = keras.Input(shape=(20,), name="numerical")
normalized = keras.layers.Normalization()(number_inputs)

In [1583]:
price_scaler = RobustScaler()

In [1584]:
price_scaled = price_scaler.fit_transform(pd.DataFrame(df["price"]))

In [1585]:
price_scaled

array([[ 0.20833333],
       [-0.125     ],
       [-0.25      ],
       ...,
       [ 0.41666667],
       [-0.22916667],
       [-0.29166667]])

### Array Data

describing code below:
have to pad all the array inputs, tell keras the shape of the input, and create an embedding layer to vectorize said input

input_dim is equal to number of unique ids

In [None]:
colors_padded = keras.preprocessing.sequence.pad_sequences(df['colors'])
colors_input = keras.Input(shape=(5,), dtype="float32", name="colors")
colors_embed = keras.layers.Embedding(input_dim=5, output_dim=4)(colors_input)
#colors_norm = keras.layers.Normalization()(colors_embed)
colors_pooled = keras.layers.GlobalAveragePooling1D()(colors_embed)

keywords_padded = keras.preprocessing.sequence.pad_sequences(df['keywords'])
keywords_input = keras.Input(shape=(10,), dtype="float32", name="keywords")
keywords_embed = keras.layers.Embedding(input_dim=604, output_dim=8)(keywords_input)
#keywords_norm = keras.layers.Normalization()(keywords_embed)
keywords_pooled = keras.layers.GlobalAveragePooling1D()(keywords_embed)

subtypes_padded = keras.preprocessing.sequence.pad_sequences(df['subtype'])
subtypes_input = keras.Input(shape=(4,), dtype="float32", name="subtypes")
subtypes_embed = keras.layers.Embedding(input_dim=393, output_dim=8)(subtypes_input)
#subtypes_norm = keras.layers.Normalization()(subtypes_embed)
subtypes_pooled = keras.layers.GlobalAveragePooling1D()(subtypes_embed)

aac_padded = keras.preprocessing.sequence.pad_sequences(df['activated_ability_cost'])
aac_input = keras.Input(shape=(5,), dtype="float32", name="aac")
aac_embed = keras.layers.Embedding(input_dim=199, output_dim=8)(aac_input)
#aac_norm = keras.layers.Normalization()(aac_embed)
aac_pooled = keras.layers.GlobalAveragePooling1D()(aac_embed)

#main_type_padded = keras.preprocessing.sequence.pad_sequences(df['main_type'])
#main_type_input = keras.Input(shape=(2,), dtype="int32", name="main_type")
#main_type_embed = keras.layers.Embedding(input_dim=8, output_dim=4)(main_type_input)
#main_type_pooled = keras.layers.GlobalAveragePooling1D()(main_type_embed)


### Tokenized Data

In [1587]:
text_input = keras.Input(shape=(), dtype=tf.string, name="oracle_text")
oracle_vector = text_vectorizer(text_input)
oracle_embed = keras.layers.Embedding(input_dim=4000, output_dim=512)(oracle_vector) #increased output_dim from 256 -> 512
oracle_pooled = keras.layers.GlobalAveragePooling1D()(oracle_embed)

In [87]:
#oracle_array = text_vectorizer(df["oracle_text"].values).numpy()

In [1588]:
x_train = {
        "numerical": scaled_data[:18000],
        "oracle_text": df["oracle_text"].values[:18000],
        "keywords": keywords_padded[:18000],
        "colors": colors_padded[:18000],
        "subtypes": subtypes_padded[:18000],
        "aac": aac_padded[:18000]
    }
#y_train = df["price"][:18000]
y_train = price_scaled[:18000]
x_test = {
        "numerical": scaled_data[18000:],
        "oracle_text": df["oracle_text"].values[18000:],
        "keywords": keywords_padded[18000:],
        "colors": colors_padded[18000:],
        "subtypes": subtypes_padded[18000:],
        "aac": aac_padded[18000:]
    }
#y_test = df["price"][18000:]
y_test = price_scaled[18000:]

### Model Compiling

In [1589]:
all_inputs = [
    number_inputs,
    text_input,
    colors_input,
    keywords_input,
    subtypes_input,
    aac_input
]

all_features = keras.layers.concatenate([
    normalized,
    oracle_pooled,
    colors_pooled,
    keywords_pooled,
    subtypes_pooled,
    aac_pooled
])

In [1058]:
x = keras.layers.Dense(64, activation="tanh")(all_features)
x = keras.layers.Normalization()(x)
x = keras.layers.Dense(32, activation="tanh")(x)
x = keras.layers.Normalization()(x)
x = keras.layers.Dense(16, activation="tanh")(x)
output = keras.layers.Dense(1, activation="linear", name="price")(x)

In [1059]:
model = keras.Model(inputs=all_inputs, outputs=output)
model.compile(optimizer="adamax", loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.MeanAbsoluteError()])

In [1032]:
early_stopping_cb = keras.callbacks.EarlyStopping(patience=25, start_from_epoch=20, monitor="val_loss", mode='min')
#model_checkpoint_cb = keras.callbacks.ModelCheckpoint(f"models\\best_model", save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, 
             #model_checkpoint_cb,
               tensorboard_cb]

In [None]:
model.fit(x_train, y_train, epochs=75, # long ahh runtime fr
    batch_size=32,
    validation_split=0.25,
    callbacks=callbacks,
    verbose=1
)

In [1061]:
model.evaluate(x_test, y_test)

[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 41.9360 - mean_absolute_error: 2.1640


[41.93601608276367, 2.1640121936798096]

#### Initial Results:

Mean Absolute Error: 1.03

Rerun With One-Hot Encoding: literally the same (1.025)

In [1062]:
model.save("models\\best_model.keras")

In [1063]:
text_vectorizer.save_assets("models\\text_vectorizer")

## Model Improvement

In [1590]:
model2 = keras.models.load_model(
    "models\\best_model.keras",
    custom_objects={
        "TextVectorization": TextVectorization,
    }
)

In [1591]:
#numerical["price"] = df["price"]
numerical["price"] = np.concatenate((y_train,y_test))

In [1592]:
numerical.corr().sort_values(by="price", ascending=False)["price"]

price                    1.000000
rarity                   0.292915
legendary                0.152600
Land                     0.104526
produced_mana            0.083137
oracle_length            0.075567
Planeswalker             0.059103
Artifact                 0.051043
cmc                      0.030353
has_activated_ability    0.026905
Enchantment              0.010717
toughness                0.006657
is_x                     0.005707
power                    0.003431
year                    -0.004188
pip_count               -0.008071
length per cmc          -0.008794
Sorcery                 -0.031628
Instant                 -0.038415
Creature                -0.046203
num_colors              -0.053613
Name: price, dtype: float64

Might have to drop everything below cmc unless i can figure out a way to make power and toughness more relevant

### Changes:

1. Changed activation function on dense layers from relu -> tanh
2. tried adding new columns (num_subtypes, num_keywords) and removed them because low correlation
3. Added oracle_length/cmc to try and show that low cmc cards are often better and can be evaluated higher? 
4. Changed optimizer from adam to adamax
5. Removed non-mathematic symbols from oracle_text
6. Changed main_type array column to one-hot encoded columns
7. Added activated abilities columns - has_ + cost_
8. Testing increasing oracle_embed dims to 512
9. Tried scaling price (target variable)

#### Thinking of things:

- Drop categories with poor correlation
- Better categories? (Pow + Tough?)

####  For Loop Running

In [1593]:
def buildModel(testing, verbose=0, all_features=all_features, all_inputs=all_inputs, early_stop=False):
    num_nrns = testing["nrns"]
    activation = testing["act"]
    batch_size = testing["batch"]
    if_drop = testing["drop"]
    optimizer = testing["opt"]
    norm = testing["norm"]
    x = keras.layers.Dense(num_nrns[0], activation=activation)(all_features)
    if if_drop[0]:
        x = keras.layers.Dropout(rate=if_drop[1])(all_features)
    for num in num_nrns[1:]:
        x = keras.layers.Dense(num, activation=activation)(x)
        if norm == True:
            x = keras.layers.BatchNormalization()(x)
    output = keras.layers.Dense(1, activation="linear", name="price")(x)

    model = keras.Model(inputs=all_inputs, outputs=output)
    model.compile(optimizer=optimizer, loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.MeanAbsoluteError()])

    if early_stop:
        early_stopping_cb = keras.callbacks.EarlyStopping(patience=25, start_from_epoch=75, monitor="val_loss", mode='min')
        model_checkpoint_cb = keras.callbacks.ModelCheckpoint(f"models\\final_model.keras", save_best_only=True)
        run_index = 1
        run_logdir = os.path.join(os.curdir, "logs", "run_{:03d}".format(run_index))
        tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
        callbacks = [early_stopping_cb, 
                    model_checkpoint_cb,
                    tensorboard_cb]

    model.fit(x_train, y_train,
    epochs=testing["eps"],
    batch_size=batch_size,
    validation_split=0.25,
    verbose=verbose,
    callbacks=callbacks
    )
    return model, model.evaluate(x_test, y_test)


In [None]:
bad_grid_search = [
     {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [False], "norm": True, "opt":"adamax", "eps": 50}, # base model
     {"nrns": [64, 32, 16], "act": "tanh", "batch": 128, "drop": [False],"norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [64, 32, 16], "act": "relu", "batch": 32, "drop": [False],"norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [256, 128, 64, 32, 16], "act": "tanh", "batch": 32, "drop": [False],"norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [128, 64, 32, 16, 8], "act": "tanh", "batch": 32, "drop": [False],"norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [True, 0.4],"norm": True, "opt":"adamax", "eps": 50},
     {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [False], "norm": False, "opt":"adamax", "eps": 50},
]

In [None]:
scores = []
for tests in bad_grid_search:
    result = buildModel(tests)
    scores.append([tests,result])

[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 60.6505 - mean_absolute_error: 2.2429
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 74.1374 - mean_absolute_error: 2.5901
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 108.6173 - mean_absolute_error: 2.3501
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 86.2647 - mean_absolute_error: 2.5868
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - loss: 69.0666 - mean_absolute_error: 2.2747
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 62.4984 - mean_absolute_error: 2.0689
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - loss: 61.8308 - mean_absolute_error: 2.1096
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 46.6315 - mean_absolute_error: 2.0909


Best: normal with either normalization off or drop = 0.2

In [1079]:
bad_grid_search = [{"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False, "opt":"adamax", "eps": 50},
                   {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [False], "norm": False,
                     "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 50},
                    {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [False], "norm": False,
                     "opt": keras.optimizers.Adam(learning_rate=0.0001), "eps": 50},
                     {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": True,
                     "opt": keras.optimizers.Adam(learning_rate=0.0001), "eps": 50}
    ]

scores = []
for tests in bad_grid_search:
    result = buildModel(tests)
    scores.append(result)

[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 46.6985 - mean_absolute_error: 2.2765
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 39.8670 - mean_absolute_error: 1.5929
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 42.0418 - mean_absolute_error: 2.1846
[1m379/379[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 49.1113 - mean_absolute_error: 2.0438


In [1080]:
scores

[(<Functional name=functional_22, built=True>,
  [46.698486328125, 2.2765324115753174]),
 (<Functional name=functional_23, built=True>,
  [39.866981506347656, 1.5928760766983032]),
 (<Functional name=functional_24, built=True>,
  [42.041786193847656, 2.1845853328704834]),
 (<Functional name=functional_25, built=True>,
  [49.111331939697266, 2.043816328048706])]

In [1081]:
scores.sort(key=lambda x: x[1][1])

In [1082]:
scores[0]

(<Functional name=functional_23, built=True>,
 [39.866981506347656, 1.5928760766983032])

Best: {"nrns": [64, 32, 16], "act": "tanh", "batch": 32, "drop": [False], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 50}

In [1087]:
scores[0][0].save("models\\best_model.keras")

In [None]:
model = buildModel({"nrns": [256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 100})

[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 10.2612 - mean_absolute_error: 0.9792


In [1332]:
model[0].save("models\\100_eps.keras")

### Text Vectorization Testing

In [1594]:
def buildVectors(tokens, ngrams, dim):
    text_vectorizer = TextVectorization(
        max_tokens=tokens, # increased vocab size
        output_mode='int',
        ngrams = ngrams,
        encoding='utf-8'
    )
    text_vectorizer.adapt(df["oracle_text"])

    text_input = keras.Input(shape=(), dtype=tf.string, name="oracle_text")
    oracle_vector = text_vectorizer(text_input)
    oracle_embed = keras.layers.Embedding(input_dim=tokens, output_dim=dim)(oracle_vector) 
    oracle_pooled = keras.layers.GlobalAveragePooling1D()(oracle_embed)

    all_inputs = [
    number_inputs,
    text_input,
    colors_input,
    keywords_input,
    subtypes_input,
    aac_input
]

    all_features = keras.layers.concatenate([
    normalized,
    oracle_pooled,
    colors_pooled,
    keywords_pooled,
    subtypes_pooled,
    aac_pooled
    ])
    return text_vectorizer, all_features, all_inputs

In [1372]:
ts = [1000, 500]
ngram = [(2,7), (2,8)]
dim = [64, 128]

tests = []
for t in ts:
    for ng in ngram:
        for d in dim:
            tests.append([t, ng, d])
len(tests)

8

In [1373]:
models = []
for i, build in enumerate(tests):
    tv = buildVectors(build[0], build[1], build[2])
    text_vectorizer = tv[0]
    features = tv[1]
    inputs = tv[2]
    result = buildModel({"nrns": [256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 10}, verbose=1, all_features=features, all_inputs=inputs)
    models.append([build, text_vectorizer, result[1]])
    print(i)

Epoch 1/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - loss: 19.4270 - mean_absolute_error: 1.7225 - val_loss: 13.3685 - val_mean_absolute_error: 1.5779
Epoch 2/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 18.3709 - mean_absolute_error: 1.6940 - val_loss: 13.3852 - val_mean_absolute_error: 1.5834
Epoch 3/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 17.6067 - mean_absolute_error: 1.6749 - val_loss: 13.5112 - val_mean_absolute_error: 1.4997
Epoch 4/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - loss: 17.3966 - mean_absolute_error: 1.6056 - val_loss: 13.6741 - val_mean_absolute_error: 1.6169
Epoch 5/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - loss: 17.1610 - mean_absolute_error: 1.5855 - val_loss: 13.5512 - val_mean_absolute_error: 1.4193
Epoch 6/10
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m

In [1374]:
models.sort(key=lambda x: x[2][1])
models[0]
#models[0][1].save_assets("models\\text_vectorizer_tested")

[[1000, (2, 8), 64],
 <TextVectorization name=text_vectorization_113, built=True>,
 [10.76006031036377, 1.1727124452590942]]

Best: [[3000, (2, 7), 128],
 <TextVectorization name=text_vectorization_90, built=True>,
 [10.209338188171387, 1.020315408706665]]

### Final Tests:

In [1595]:
tv = buildVectors(3000, (2,7), 128)
text_vectorizer = tv[0]
features = tv[1]
inputs = tv[2]

In [1378]:
final_options = [{"nrns": [256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15},
      {"nrns": [512, 256], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15},
      {"nrns": [256, 128], "act": "tanh", "batch": 32, "drop": [False], "norm": True,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15},
      {"nrns": [512, 256], "act": "tanh", "batch": 32, "drop": [False], "norm": True,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15},
      {"nrns": [256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": True,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15},
      {"nrns": [512, 256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 15}
      ]
models = []
for opt in final_options:
    model = buildModel(opt, verbose=0, all_features=features, all_inputs=inputs)
    models.append(model)

[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 10.7473 - mean_absolute_error: 1.3294
[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 12.8841 - mean_absolute_error: 1.7112
[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 12.6426 - mean_absolute_error: 1.0533
[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 13.9944 - mean_absolute_error: 1.2350
[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 14.1156 - mean_absolute_error: 1.2351
[1m166/166[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 10.9325 - mean_absolute_error: 1.0893


In [1596]:
final_model = buildModel({"nrns": [512, 256, 128], "act": "tanh", "batch": 32, "drop": [True, 0.2], "norm": False,
      "opt": keras.optimizers.Adam(learning_rate=0.001), "eps": 200}, verbose=1, all_features=features, all_inputs=inputs, early_stop=True)

Epoch 1/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 8ms/step - loss: 85.2111 - mean_absolute_error: 3.6141 - val_loss: 59.6167 - val_mean_absolute_error: 3.6689
Epoch 2/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - loss: 83.5866 - mean_absolute_error: 3.6286 - val_loss: 57.1025 - val_mean_absolute_error: 2.8989
Epoch 3/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - loss: 81.8530 - mean_absolute_error: 3.4457 - val_loss: 58.6450 - val_mean_absolute_error: 2.9697
Epoch 4/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - loss: 80.3817 - mean_absolute_error: 3.2109 - val_loss: 56.2052 - val_mean_absolute_error: 2.6881
Epoch 5/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 7ms/step - loss: 77.6023 - mean_absolute_error: 3.1100 - val_loss: 55.9655 - val_mean_absolute_error: 3.2873
Epoch 6/200
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0

In [1597]:
final = keras.models.load_model("models\\final_model.keras", custom_objects={
        "TextVectorization": TextVectorization,
    })

In [1598]:
preds=final.predict(x_test)

[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step


In [1630]:
x_test["oracle_text"][1108]

'flying. if damage would be dealt to this creature prevent that damage. when damage is prevented this way this creature deals that much damage to any other target.'

In [1631]:
unscaled = price_scaler.inverse_transform(preds)
unscaled[1108]

array([1.4336947], dtype=float32)

In [1633]:
import joblib

In [1636]:
joblib.dump(price_scaler, 'models\\price_scaler.gz')
joblib.dump(le2, "models\\SubtypeEncoder.gz")
joblib.dump(mlb,'models\\MultiLabelB.gz')
joblib.dump(keywrd_encoder, 'models\\keywordEncoder.gz')
joblib.dump(c_i, 'models\\c_i_encoder.gz')

['models\\c_i_encoder.gz']

In [1637]:
price_scaler.inverse_transform(y_test[1108].reshape(1,-1))

array([[2.21]])