# MTG Price Predictor

## About/Goals: 

The idea of this project is to create an ML model that can take a card's data and return the value of the card should be, based on previous cards it has analysed. This requires NLP processing for the text box, and utilizes tensorflow to build the model.

Cards are evaluated purely based on what a person looking at it for the first time can see - year, text, color, etc.. nothing about format legalities, special flags, or anythingn else of the sort. Also, price is based on standard, nonfoil variant only.

# Data Importation and Cleaning

#### All imports cell

In [485]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
from sklearn.preprocessing import MinMaxScaler
import re
import os
from sklearn.preprocessing import RobustScaler

In [486]:
def save(file):
    file.to_pickle('data/updated_data.pkl')

def load():
    df = pd.read_pickle('data/updated_data.pkl')
    return df # band-aid work-around

In [487]:
df = pd.read_json("data/default_cards_08_05_2025.json")

In [488]:
df.columns

Index(['object', 'id', 'oracle_id', 'multiverse_ids', 'mtgo_id', 'arena_id',
       'tcgplayer_id', 'cardmarket_id', 'name', 'lang', 'released_at', 'uri',
       'scryfall_uri', 'layout', 'highres_image', 'image_status', 'image_uris',
       'mana_cost', 'cmc', 'type_line', 'oracle_text', 'colors',
       'color_identity', 'keywords', 'produced_mana', 'legalities', 'games',
       'reserved', 'game_changer', 'foil', 'nonfoil', 'finishes', 'oversized',
       'promo', 'reprint', 'variation', 'set_id', 'set', 'set_name',
       'set_type', 'set_uri', 'set_search_uri', 'scryfall_set_uri',
       'rulings_uri', 'prints_search_uri', 'collector_number', 'digital',
       'rarity', 'card_back_id', 'artist', 'artist_ids', 'illustration_id',
       'border_color', 'frame', 'full_art', 'textless', 'booster',
       'story_spotlight', 'prices', 'related_uris', 'purchase_uris',
       'mtgo_foil_id', 'power', 'toughness', 'flavor_text', 'edhrec_rank',
       'penny_rank', 'all_parts', 'promo_types

can drop so many of these columns, only need ones that are useful.

### Column Cleaning

In [489]:
len(df)

108955

In [490]:
df.value_counts("variation")
#df.value_counts("oversized")

variation
False    108867
True         88
Name: count, dtype: int64

In [491]:
df = df[(df["variation"]==False) & (df["reprint"]==False) & (df["oversized"]==False) & (df["promo"]==False) & (df["full_art"]==False) & (df["textless"] == False) & (df["content_warning"]!=True)]

In [492]:
len(df) # Dropped about 40k entries

41387

In [493]:
doubles = df["card_faces"].dropna()
doubles.iloc[0] ## this is gonna be tough to work with

[{'object': 'card_face',
  'name': "Obyra's Attendants",
  'mana_cost': '{4}{U}',
  'type_line': 'Creature — Faerie Wizard',
  'oracle_text': 'Flying',
  'power': '3',
  'toughness': '4',
  'flavor_text': "Obyra's devoted servants shrieked as their sleeping mistress slashed at them, unseeing.",
  'artist': 'Andreas Zafiratos',
  'artist_id': 'e2f13a9a-57c5-40de-81d4-3b0723899cdf',
  'illustration_id': 'd1ea5321-62e2-4894-a79f-03b792daf2c8'},
 {'object': 'card_face',
  'name': 'Desperate Parry',
  'mana_cost': '{1}{U}',
  'type_line': 'Instant — Adventure',
  'oracle_text': 'Target creature gets -4/-0 until end of turn. (Then exile this card. You may cast the creature later from exile.)',
  'artist': 'Andreas Zafiratos',
  'artist_id': 'e2f13a9a-57c5-40de-81d4-3b0723899cdf'}]

For first iteration of model, will be removing the multi-faced cards

In [494]:
df = df[df["card_faces"].isna()]

In [495]:
def is_legal(x):
    if 'legal' in x.values():
        return True
    else:
        return False

In [496]:
df["playable"] = df["legalities"].apply(is_legal)

In [497]:
df = df[df["playable"] == True] # remove unnplayable cards

In [498]:
unneeded = ['object', 'id', 'oracle_id', 'multiverse_ids', 'mtgo_id', 'arena_id',
       'tcgplayer_id', 'cardmarket_id', #'name',
         'lang', 'uri',
       'scryfall_uri', 'layout', 'highres_image', 'image_status', 'image_uris', 'legalities', 'games',
       'reserved', 'game_changer', 'finishes', 'oversized',
       'promo', 'reprint', 'variation', 'set_id', 'set', 'set_name',
       'set_type', 'set_uri', 'set_search_uri', 'scryfall_set_uri',
       'rulings_uri', 'prints_search_uri', 'collector_number', 'digital', 'card_back_id', 'artist', 'artist_ids', 'illustration_id',
       'border_color', 'frame', 'full_art', 'textless', 'booster',
       'story_spotlight', 'related_uris', 'purchase_uris',
       'mtgo_foil_id', 'flavor_text', 'edhrec_rank',
       'penny_rank', 'all_parts', 'promo_types', 'security_stamp', 'preview', 'watermark', 'frame_effects', 'loyalty',
       'printed_name', 'tcgplayer_etched_id', 'flavor_name',
       'attraction_lights', 'color_indicator', 'printed_type_line',
       'printed_text', 'variation_of', 'life_modifier', 'hand_modifier',
       'content_warning', 'defense', 'card_faces', 'foil', 'nonfoil'
       , 'playable', 'color_identity']
df = df.drop(unneeded, axis=1)

now that I've cleaned out cards and columns that aren't needed, I need to figure out the best way to transform this data into something that the ML model can actually use. 

For instance, the "type_line" column will have to be split up into various super and subtypes, probably using categorical encoding.

### Type Labeling + Encoding

In [499]:
df["type_line"].describe() # has 4246 unique types currently

count       35098
unique       3120
top       Instant
freq         3592
Name: type_line, dtype: object

In [500]:
"""
df["creature_type"] = df["type_line"].apply(lambda x: x[10:] if "Creature" in x else "NaN")
df["planeswalker_type"] = df["type_line"].apply(lambda x: x[24:] if "Planeswalker" in x else "NaN")
df["kindred_type"] = df["type_line"].apply(lambda x: x.split()[-1] if "Kindred" in x or "Tribal" in x else "NaN")"""
# Found a better way!

'\ndf["creature_type"] = df["type_line"].apply(lambda x: x[10:] if "Creature" in x else "NaN")\ndf["planeswalker_type"] = df["type_line"].apply(lambda x: x[24:] if "Planeswalker" in x else "NaN")\ndf["kindred_type"] = df["type_line"].apply(lambda x: x.split()[-1] if "Kindred" in x or "Tribal" in x else "NaN")'

In [501]:
def filter_subtype(x):
    if "—" in x:
        ind = x.index("—")
        types = x[ind+1:].split()
        return types
    else:
        return []

In [502]:
def filter_maintype(x):
    types = ["Artifact", "Land", 
             #"Battle", 
             "Creature", "Enchantment", "Planeswalker", "Instant", "Sorcery"]
    cur = []
    for type in types:
        if type in x:
            cur.append(type)
    return cur

In [503]:
df = df[~df['type_line'].str.contains('Basic')] # remove basic lands

In [504]:
df["legendary"] = df["type_line"].apply(lambda x: 1 if "Legendary" in x else 0)
df["subtype"] = df["type_line"].apply(filter_subtype)
df["main_type"]=df["type_line"].apply(filter_maintype)

In [505]:
df["price"] = df["prices"].str["usd"].astype(float) 
df = df.drop("prices", axis=1)

In [506]:
df = df.dropna(subset=["price"])
df = df[df["main_type"].map(len)>0]  # had to filter out bad cards with other types such as conspiracies and stickers

In [507]:
#le = LabelEncoder()
#le.fit(["Artifact", "Land", "Battle", "Creature", "Enchantment", "Planeswalker", "Instant", "Sorcery"])

In [508]:
#df["main_type"] = df["main_type"].apply(lambda x: le.transform(x)) 

#### Changed to One-Hot Encoding!

In [509]:
from sklearn.preprocessing import MultiLabelBinarizer

In [510]:
mlb = MultiLabelBinarizer()

In [511]:
test = pd.DataFrame(mlb.fit_transform(df["main_type"]), columns=mlb.classes_, index=df.index)

In [512]:
df = df.drop('main_type', axis=1)
df = df.join(test)

In [513]:
le2 = LabelEncoder()

In [514]:
df["subtype"].values

array([list(['Sliver']), list(['Kor', 'Soldier']),
       list(['Siren', 'Pirate']), ..., list([]),
       list(['Faerie', 'Rogue']), list(['Vampire', 'Soldier'])],
      dtype=object)

In [515]:
subtypes = []
for unique in df["subtype"].values:
    if unique != []:
        for val in unique:
            if val not in subtypes:
                subtypes.append(val)

le2.fit(subtypes)


In [516]:
df["subtype"] = df["subtype"].apply(lambda x: le2.transform(x)) # slow, probably better way to do this 

In [517]:
df = df.drop("type_line", axis=1)

### Date -> Year

In [518]:
df["year"] = df["released_at"].apply(lambda x: x.year)
df = df.drop("released_at", axis=1)

### Mana Cost Breakdown

- Number of Pips
- Is X spell?

In [519]:
df["is_x"] = df["mana_cost"].apply(lambda x: 1 if r"{X}" in x else 0)
#df.loc[df["is_x"] == 1]

In [520]:
#print("{1}{W/R}{G}{G}".replace("{", "").replace("}", " ").split())  ->  df

def pip_counter(x):
    new = x.replace("{", "").replace("}", " ").split()
    count = 0
    for x in new:
        if x.isdigit() == False and x != "X":
            count += 1

    return count

In [521]:
df["pip_count"] = df["mana_cost"].apply(pip_counter)
df = df.drop("mana_cost", axis=1)
#df.sort_values(by="pip_count", ascending=False)

### Oracle Text Breakdown -> NLP ? Or can try a parsing method to turn text into columns 

- activated ability?
- etb effect?

In [522]:
df["oracle_text"] = df["oracle_text"].apply(lambda x: x.lower().replace("\n", ". "))

#### Main Phrases

using vectorizer to figure out most common substrings

In [523]:
vectorizer = CountVectorizer(ngram_range=(4, 7), lowercase=True, stop_words=None)

In [524]:
X = vectorizer.fit_transform(df["oracle_text"])

In [525]:
sum_words = X.sum(axis=0)
word_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
word_freq = sorted(word_freq, key=lambda x: x[1], reverse=True)

In [526]:
common_phrases = pd.DataFrame(word_freq, columns=['phrase', 'count'])
common_phrases.iloc[:10]

Unnamed: 0,phrase,count
0,until end of turn,5990
1,at the beginning of,3261
2,when this creature enters,3049
3,gets until end of,2131
4,gets until end of turn,2130
5,the beginning of your,1869
6,at the beginning of your,1814
7,card from your graveyard,1593
8,creature gets until end,1570
9,creature gets until end of,1570


In [527]:
"""
PHRASES = { # starter phrases
    r"when.*enters": "etb",
    r"until end of turn": "eot",
    r"beginning of .* upkeep": "b_o_u",
    r"search .* library": "tutor",
    r"without paying": "free",
    r"whenever .* attacks": "a_t",
    r"deals combat damage": "c_d_t",
    r"look at the tzo.*p": "s_s_t",
    r"return.*graveyard.*battlefield": "reanimate",
    r"when.*(this|a).*dies": "o_d_t",
    r"when.*(this|a).*leaves.": "l_b_t",
}"""

'\nPHRASES = { # starter phrases\n    r"when.*enters": "etb",\n    r"until end of turn": "eot",\n    r"beginning of .* upkeep": "b_o_u",\n    r"search .* library": "tutor",\n    r"without paying": "free",\n    r"whenever .* attacks": "a_t",\n    r"deals combat damage": "c_d_t",\n    r"look at the tzo.*p": "s_s_t",\n    r"return.*graveyard.*battlefield": "reanimate",\n    r"when.*(this|a).*dies": "o_d_t",\n    r"when.*(this|a).*leaves.": "l_b_t",\n}'

In [528]:
#def canonicalize_text(str):
#    for phr, rep in PHRASES.items():
#        if re.search(phr, str) != None:
#            str = re.sub(phr, rep, str)
#    return str

In [529]:
#df["oracle_text"] = df["oracle_text"].apply(canonicalize_text)

In [530]:

def extract_math_expression_with_variables(input_string):
  pattern = r'[\d\s.+\-*/()a-zA-Z]'  # Matches digits, dot, plus, minus, asterisk, slash, parentheses, and letters
  
  # Use re.findall() to find all non-overlapping matches of the pattern
  matches = re.findall(pattern, input_string)
  
  # Join the matched characters back into a string
  return "".join(matches)

In [531]:
df["oracle_text"] = df["oracle_text"].apply(lambda x: extract_math_expression_with_variables(x))
df

Unnamed: 0,name,cmc,oracle_text,colors,keywords,produced_mana,rarity,power,toughness,legendary,...,Artifact,Creature,Enchantment,Instant,Land,Planeswalker,Sorcery,year,is_x,pip_count
1,Fury Sliver,6.0,all sliver creatures have double strike.,[R],[],,uncommon,3,3,0,...,0,1,0,0,0,0,0,2006,0,1
2,Kor Outfitter,2.0,when this creature enters you may attach targe...,[W],[],,common,2,2,0,...,0,1,0,0,0,0,0,2009,0,2
4,Siren Lookout,3.0,flying. when this creature enters it explores....,[U],"[Flying, Explore]",,common,1,2,0,...,0,1,0,0,0,0,0,2017,0,1
7,Surge of Brilliance,2.0,paradox draw a card for each spell youve cast...,[U],"[Paradox, Foretell]",,uncommon,,,0,...,0,0,0,1,0,0,0,2023,0,1
9,Venerable Knight,1.0,when this creature dies put a +1/+1 counter on...,[W],[],,uncommon,2,1,0,...,0,1,0,0,0,0,0,2019,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108944,Morkrut Banshee,5.0,morbid when this creature enters if a creatur...,[B],[Morbid],,uncommon,4,4,0,...,0,1,0,0,0,0,0,2011,0,2
108947,Deeproot Historian,4.0,merfolk and druid cards in your graveyard have...,[G],[],,rare,3,3,0,...,0,1,0,0,0,0,0,2023,0,1
108949,Aggressive Biomancy,2.0,create x tokens that are copies of target crea...,"[G, U]",[Fight],,rare,,,0,...,0,0,0,0,0,0,1,2024,1,2
108952,Faerie Bladecrafter,3.0,flying. whenever one or more faeries you contr...,[B],[Flying],,rare,2,2,0,...,0,1,0,0,0,0,0,2023,0,1


In [532]:
text_vectorizer = TextVectorization(
    max_tokens=4000, # increased vocab size
    output_mode='int',
    ngrams = (2,6),
    encoding='utf-8'
)

In [533]:
text_vectorizer.adapt(df["oracle_text"])

### Rarity Encoding

In [534]:
df["rarity"].value_counts()

rarity
rare        11445
common       9878
uncommon     8967
mythic       1987
special         2
Name: count, dtype: int64

In [535]:
df["rarity"] = df["rarity"].apply(lambda x: "common" if x == "special" else x) # the two special cards have common rarity on scryfall

In [536]:
def rarity_encoding(x):
    if x == "common":
        return 0
    elif x == "uncommon":
        return 1
    elif x == "rare":
        return 2
    else:
        return 3

In [537]:
df["rarity"] = df["rarity"].apply(lambda x: rarity_encoding(x))

### Keyword Encoding

In [538]:
keywrd_encoder = LabelEncoder()

In [539]:
keywords = [] # taking code from earlier
for unique in df["keywords"].values:
    if unique != []:
        for val in unique:
            if val not in keywords:
                keywords.append(val)

keywrd_encoder.fit(keywords)


In [540]:
len(keywords)

604

In [541]:
df["keywords"] = df["keywords"].apply(lambda x: keywrd_encoder.transform(x))

### Color Identity Encoding (one more time!)

definitely could've done a for loop for each column i wanted to encode but oh well it's a little late for that

In [542]:
df

Unnamed: 0,name,cmc,oracle_text,colors,keywords,produced_mana,rarity,power,toughness,legendary,...,Artifact,Creature,Enchantment,Instant,Land,Planeswalker,Sorcery,year,is_x,pip_count
1,Fury Sliver,6.0,all sliver creatures have double strike.,[R],[],,1,3,3,0,...,0,1,0,0,0,0,0,2006,0,1
2,Kor Outfitter,2.0,when this creature enters you may attach targe...,[W],[],,0,2,2,0,...,0,1,0,0,0,0,0,2009,0,2
4,Siren Lookout,3.0,flying. when this creature enters it explores....,[U],"[230, 198]",,0,1,2,0,...,0,1,0,0,0,0,0,2017,0,1
7,Surge of Brilliance,2.0,paradox draw a card for each spell youve cast...,[U],"[390, 238]",,1,,,0,...,0,0,0,1,0,0,0,2023,0,1
9,Venerable Knight,1.0,when this creature dies put a +1/+1 counter on...,[W],[],,1,2,1,0,...,0,1,0,0,0,0,0,2019,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108944,Morkrut Banshee,5.0,morbid when this creature enters if a creatur...,[B],[366],,1,4,4,0,...,0,1,0,0,0,0,0,2011,0,2
108947,Deeproot Historian,4.0,merfolk and druid cards in your graveyard have...,[G],[],,2,3,3,0,...,0,1,0,0,0,0,0,2023,0,1
108949,Aggressive Biomancy,2.0,create x tokens that are copies of target crea...,"[G, U]",[215],,2,,,0,...,0,0,0,0,0,0,1,2024,1,2
108952,Faerie Bladecrafter,3.0,flying. whenever one or more faeries you contr...,[B],[230],,2,2,2,0,...,0,1,0,0,0,0,0,2023,0,1


In [543]:
c_i = LabelEncoder()

In [544]:
c_i.fit(["W", "G", "R", "B", "U"])

In [545]:
c_i.transform(["W", "U"])

array([4, 3])

In [546]:
df["colors"] = df["colors"].apply(lambda x: c_i.transform(x))

### Produced Mana 

changing this to binary value if it does(n't)

In [547]:
df["produced_mana"] = df["produced_mana"].replace(pd.NA, 0)

In [548]:
df["produced_mana"] = df["produced_mana"].apply(lambda x: 1 if x != 0 else x)

### Final Step: Replace Power/Toughness NaN with -1

In [549]:
df["power"] = pd.to_numeric(df["power"], errors="coerce").fillna(-1)
df["toughness"] = pd.to_numeric(df["toughness"], errors="coerce").fillna(-1)

### Final Data Filtering

In [550]:
df = df.loc[(df["price"] <= 500) & (df["price"] >= 0.10) & (df["year"] >= 2000)] # filter out cards that are too expensive or too cheap

In [551]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,name,cmc,oracle_text,colors,keywords,produced_mana,rarity,power,toughness,legendary,...,Artifact,Creature,Enchantment,Instant,Land,Planeswalker,Sorcery,year,is_x,pip_count
0,Fury Sliver,6.0,all sliver creatures have double strike.,[2],[],0,1,3.0,3.0,0,...,0,1,0,0,0,0,0,2006,0,1
1,Kor Outfitter,2.0,when this creature enters you may attach targe...,[4],[],0,0,2.0,2.0,0,...,0,1,0,0,0,0,0,2009,0,2
2,Surge of Brilliance,2.0,paradox draw a card for each spell youve cast...,[3],"[390, 238]",0,1,-1.0,-1.0,0,...,0,0,0,1,0,0,0,2023,0,1
3,Venerable Knight,1.0,when this creature dies put a +1/+1 counter on...,[4],[],0,1,2.0,1.0,0,...,0,1,0,0,0,0,0,2019,0,1
4,Wall of Vipers,3.0,defender (this creature cant attack.). 3 destr...,[0],[139],0,1,2.0,4.0,0,...,0,1,0,0,0,0,0,2000,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19842,White Sun's Twilight,2.0,you gain x life. create x 1/1 colorless phyrex...,[4],[],0,2,-1.0,-1.0,0,...,0,0,0,0,0,0,1,2023,1,2
19843,Tezzeret's Gambit,4.0,(u/p can be paid with either u or 2 life.). dr...,[3],[415],0,1,-1.0,-1.0,0,...,0,0,0,0,0,0,1,2011,0,1
19844,Deeproot Historian,4.0,merfolk and druid cards in your graveyard have...,[1],[],0,2,3.0,3.0,0,...,0,1,0,0,0,0,0,2023,0,1
19845,Aggressive Biomancy,2.0,create x tokens that are copies of target crea...,"[1, 3]",[215],0,2,-1.0,-1.0,0,...,0,0,0,0,0,0,1,2024,1,2


### Additional Rows Added Later

In [552]:
df["num_colors"] = df["colors"].apply(lambda x: len(x))
#df["num_keywords"] = df["keywords"].apply(lambda x: len(x))
#df["num_subtypes"] = df["subtype"].apply(lambda x: len(x))
#df["num_maintypes"] = df["main_type"].apply(lambda x: len(x))
df["oracle_length"] = df["oracle_text"].apply(lambda x: len(x))
df["length per cmc"] = df["oracle_length"]/df["cmc"]
df["length per cmc"] = df["length per cmc"].replace(np.inf, 0)
df["length per cmc"] = df["length per cmc"].replace(np.nan, 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["num_colors"] = df["colors"].apply(lambda x: len(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["oracle_length"] = df["oracle_text"].apply(lambda x: len(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["length per cmc"] = df["oracle_length"]/df["cmc"]
A value is trying to be set on a

In [553]:
save(df)

## Model Building

### Numerical Data

In [456]:
numerical = df.drop(["name", "oracle_text", "colors", "keywords", "subtype", "price"], axis=1)

In [457]:
numerical.columns

Index(['cmc', 'produced_mana', 'rarity', 'power', 'toughness', 'legendary',
       'Artifact', 'Creature', 'Enchantment', 'Instant', 'Land',
       'Planeswalker', 'Sorcery', 'year', 'is_x', 'pip_count', 'num_colors',
       'oracle_length', 'length per cmc'],
      dtype='object')

In [458]:
scaler = RobustScaler()
scaled_data = scaler.fit_transform(numerical)

In [459]:
number_inputs = keras.Input(shape=(19,), name="numerical")
normalized = keras.layers.Normalization()(number_inputs)

### Array Data

describing code below:
have to pad all the array inputs, tell keras the shape of the input, and create an embedding layer to vectorize said input

input_dim is equal to number of unique ids

In [566]:
colors_padded = keras.preprocessing.sequence.pad_sequences(df['colors'])
colors_input = keras.Input(shape=(5,), dtype="int32", name="colors")
colors_embed = keras.layers.Embedding(input_dim=5, output_dim=4)(colors_input)
colors_pooled = keras.layers.GlobalAveragePooling1D()(colors_embed)

keywords_padded = keras.preprocessing.sequence.pad_sequences(df['keywords'])
keywords_input = keras.Input(shape=(10,), dtype="int32", name="keywords")
keywords_embed = keras.layers.Embedding(input_dim=604, output_dim=8)(keywords_input)
keywords_pooled = keras.layers.GlobalAveragePooling1D()(keywords_embed)

subtypes_padded = keras.preprocessing.sequence.pad_sequences(df['subtype'])
subtypes_input = keras.Input(shape=(4,), dtype="int32", name="subtypes")
subtypes_embed = keras.layers.Embedding(input_dim=393, output_dim=8)(subtypes_input)
subtypes_pooled = keras.layers.GlobalAveragePooling1D()(subtypes_embed)

#main_type_padded = keras.preprocessing.sequence.pad_sequences(df['main_type'])
#main_type_input = keras.Input(shape=(2,), dtype="int32", name="main_type")
#main_type_embed = keras.layers.Embedding(input_dim=8, output_dim=4)(main_type_input)
#main_type_pooled = keras.layers.GlobalAveragePooling1D()(main_type_embed)


### Tokenized Data

In [567]:
text_input = keras.Input(shape=(), dtype=tf.string, name="oracle_text")
oracle_vector = text_vectorizer(text_input)
oracle_embed = keras.layers.Embedding(input_dim=4000, output_dim=256)(oracle_vector)
oracle_pooled = keras.layers.GlobalAveragePooling1D()(oracle_embed)

In [87]:
#oracle_array = text_vectorizer(df["oracle_text"].values).numpy()

### Model Compiling

In [568]:
all_inputs = [
    number_inputs,
    text_input,
    colors_input,
    keywords_input,
    subtypes_input,
    #main_type_input
]

all_features = keras.layers.concatenate([
    normalized,
    oracle_pooled,
    colors_pooled,
    keywords_pooled,
    subtypes_pooled,
    #main_type_pooled
])

In [569]:
x = keras.layers.Dense(64, activation="tanh")(all_features)
x = keras.layers.Normalization()(x)
x = keras.layers.Dense(32, activation="tanh")(x)
x = keras.layers.Normalization()(x)
x = keras.layers.Dense(16, activation="tanh")(x)
output = keras.layers.Dense(1, activation="linear", name="price")(x)

In [570]:
model = keras.Model(inputs=all_inputs, outputs=output)
model.compile(optimizer="adamax", loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.MeanAbsoluteError()])

In [572]:
early_stopping_cb = keras.callbacks.EarlyStopping(patience=25, start_from_epoch=20, monitor="val_loss", mode='min')
#model_checkpoint_cb = keras.callbacks.ModelCheckpoint(f"models\\best_model", save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = os.path.join(os.curdir, "logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, 
             #model_checkpoint_cb,
               tensorboard_cb]

In [469]:
model.fit(
    {
        "numerical": scaled_data,
        "oracle_text": df["oracle_text"].values,
        "keywords": keywords_padded,
        "colors": colors_padded,
        "subtypes": subtypes_padded,
        #"main_type": main_type_padded
    },
    y=df["price"],
    epochs=75, # long ahh runtime fr
    batch_size=32,
    validation_split=0.25,
    callbacks=callbacks
)

Epoch 1/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 22.1311 - mean_absolute_error: 1.8813 - val_loss: 22.8824 - val_mean_absolute_error: 1.9238
Epoch 2/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 21.6031 - mean_absolute_error: 1.8190 - val_loss: 22.4618 - val_mean_absolute_error: 1.8023
Epoch 3/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 21.1460 - mean_absolute_error: 1.7608 - val_loss: 22.2447 - val_mean_absolute_error: 1.8929
Epoch 4/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 20.6726 - mean_absolute_error: 1.7192 - val_loss: 21.7112 - val_mean_absolute_error: 1.6848
Epoch 5/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 8ms/step - loss: 20.1481 - mean_absolute_error: 1.6728 - val_loss: 21.4762 - val_mean_absolute_error: 1.7875
Epoch 6/75
[1m466/466[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m

<keras.src.callbacks.history.History at 0x20bfa208980>

#### Initial Results:

Mean Absolute Error: 1.03

Rerun With One-Hot Encoding: literally the same (1.025)

In [470]:
model.save("models\\best_model.keras")

In [471]:
text_vectorizer.save_assets("models\\text_vectorizer")

## Model Improvement

In [482]:
model2 = keras.models.load_model(
    "models\\best_model.keras",
    custom_objects={
        "TextVectorization": TextVectorization,
    }
)

In [575]:
len(scaled_data)

19847

In [576]:
model2.evaluate({
        "numerical": scaled_data[18000:],
        "oracle_text": df["oracle_text"].values[18000:],
        "keywords": keywords_padded[18000:],
        "colors": colors_padded[18000:],
        "subtypes": subtypes_padded[18000:],
        #"main_type": main_type_padded
    },
    y=df["price"][18000:])

[1m58/58[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 18.0292 - mean_absolute_error: 1.5781


[18.029186248779297, 1.5780930519104004]

In [554]:
df = load()

In [557]:
df.iloc[2021]

name                                                Boon Reflection
cmc                                                             5.0
oracle_text       if you would gain life you gain twice that muc...
colors                                                          [4]
keywords                                                         []
produced_mana                                                     0
rarity                                                            2
power                                                          -1.0
toughness                                                      -1.0
legendary                                                         0
subtype                                                          []
price                                                         11.04
Artifact                                                          0
Creature                                                          0
Enchantment                                     

In [558]:
model2.predict({
    "numerical": scaled_data[2021:2022],
    "oracle_text": np.array([df["oracle_text"].values[2021]], dtype=object),
    "keywords": np.array([keywords_padded[2021]]),
    "colors": np.array([colors_padded[2021]]),
    "subtypes": np.array([subtypes_padded[2021]]),
})

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step


array([[25.963308]], dtype=float32)

In [559]:
numerical["price"] = df["price"]

In [560]:
numerical.corr().sort_values(by="price", ascending=False)["price"]

price             1.000000
rarity            0.258572
legendary         0.117534
Land              0.093359
produced_mana     0.083089
Artifact          0.058638
oracle_length     0.050182
Planeswalker      0.039558
year              0.037574
cmc               0.026806
toughness         0.012695
power             0.009655
Enchantment       0.003491
is_x             -0.007006
length per cmc   -0.021276
pip_count        -0.029423
Creature         -0.035072
Instant          -0.036775
Sorcery          -0.040772
num_colors       -0.067478
Name: price, dtype: float64

Might have to drop everything below cmc unless i can figure out a way to make power and toughness more relevant

### Changes:

1. Changed activation function on dense layers from relu -> tanh
2. tried adding new columns (num_subtypes, num_keywords) and removed them because low correlation
3. Added oracle_length/cmc to try and show that low cmc cards are often better and can be evaluated higher? 
4. Changed optimizer from adam to adamax
5. Removed non-mathematic symbols from oracle_text
6. Changed main_type array column to one-hot encoded columns

#### Thinking of things:

- Drop categories with poor correlation
- Better categories? (Pow + Tough?)

####  For Loop Running

In [588]:
x_train = {
        "numerical": scaled_data[:16000],
        "oracle_text": df["oracle_text"].values[:16000],
        "keywords": keywords_padded[:16000],
        "colors": colors_padded[:16000],
        "subtypes": subtypes_padded[:16000],
        #"main_type": main_type_padded
    }
y_train = df["price"][:16000]
x_test = {
        "numerical": scaled_data[16000:],
        "oracle_text": df["oracle_text"].values[16000:],
        "keywords": keywords_padded[16000:],
        "colors": colors_padded[16000:],
        "subtypes": subtypes_padded[16000:],
        #"main_type": main_type_padded
    }
y_test = df["price"][16000:]

In [619]:
def buildModel(testing):
    num_nrns = testing[0]
    activation = testing[1]
    batch_size = testing[2]
    if_drop = testing[3]
    optimizer = testing[4]
    x = keras.layers.Dense(num_nrns[0], activation=activation)(all_features)
    if if_drop[0]:
        x = keras.layers.Dropout(rate=if_drop[1])(all_features)
    for num in num_nrns[1:]:
        x = keras.layers.Dense(num, activation=activation)(x)
        x = keras.layers.Normalization()(x)
    output = keras.layers.Dense(1, activation="linear", name="price")(x)

    model = keras.Model(inputs=all_inputs, outputs=output)
    model.compile(optimizer=optimizer, loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.MeanAbsoluteError()])

    model.fit(x_train, y_train,
    epochs=testing[5],
    batch_size=batch_size,
    validation_split=0.25)
    return model.evaluate(x_test, y_test)


In [None]:
bad_grid_search = [[[64, 32, 16], "tanh", 32, [False], "adamax", 20], # base model
         [[64, 32, 16], "tanh", 64, [False], "adamax", 20],
         [[64, 32, 16], "tanh", 128, [False], "adamax",20],
         [[64, 32, 16], "relu", 32, [False], "adamax",20],
         [[256, 128, 64, 32, 16], "tanh", 32, [False], "adamax",20],
         [[128, 64, 32, 16, 8], "tanh", 32, [False], "adamax",20],
         [[64, 32, 16], "tanh", 32, [True, 0.2], "adamax",20],
         [[64, 32, 16], "tanh", 32, [True, 0.3], "adamax",20]
    ]

In [None]:
scores = []
for tests in enumerate(bad_grid_search):
    result = buildModel(tests)
    scores.append([tests,result])

Epoch 1/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 9ms/step - loss: 24.0013 - mean_absolute_error: 1.9356 - val_loss: 18.3057 - val_mean_absolute_error: 1.9015
Epoch 2/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - loss: 21.8360 - mean_absolute_error: 1.8075 - val_loss: 17.5055 - val_mean_absolute_error: 1.7694
Epoch 3/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - loss: 20.3591 - mean_absolute_error: 1.6278 - val_loss: 16.8571 - val_mean_absolute_error: 1.7007
Epoch 4/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - loss: 19.2827 - mean_absolute_error: 1.4722 - val_loss: 16.6060 - val_mean_absolute_error: 1.6867
Epoch 5/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - loss: 18.4677 - mean_absolute_error: 1.3988 - val_loss: 16.3400 - val_mean_absolute_error: 1.5373
Epoch 6/15
[1m375/375[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m

In [596]:
scores

[[0, [21.26566505432129, 1.6448439359664917]],
 [1, [21.5150146484375, 1.5950630903244019]],
 [2, [21.484586715698242, 1.5725528001785278]],
 [3, [18.76144790649414, 1.6255497932434082]],
 [4, [21.38492202758789, 1.5958555936813354]],
 [5, [21.660072326660156, 1.6740405559539795]],
 [6, [21.42899513244629, 1.6170096397399902]],
 [7, [21.21075439453125, 1.5883523225784302]]]

Best: [[64, 32, 16], "tanh", 128, [False], "adamax"]

In [None]:
bad_grid_search = [[[64, 32, 16], "tanh", 128, [False], "adamax",20],
    [[64, 32, 16], "tanh", 128, [True, 0.3], "adamax",20],
    [[64, 32, 16], "tanh", 128, [True, 0.4], "adamax",20],
    [[128, 64, 32], "tanh", 128, [True, 0.3], "adamax",20],
    [[128, 64, 32], "tanh", 128, [False], "adamax",20],
    [[512, 256, 32], "tanh", 128, [False], "adamax",20],
    [[64, 32, 16], "tanh", 128, [False], keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),20],
    [[64, 32, 16], "tanh", 128, [False], keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999),20],
    [[64, 32, 16], "tanh", 128, [False], keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999),20]
    ]

scores = []
for tests in bad_grid_search:
    result = buildModel(tests)
    scores.append([tests, result])

Epoch 1/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 21ms/step - loss: 24.8248 - mean_absolute_error: 1.9834 - val_loss: 18.9123 - val_mean_absolute_error: 1.9131
Epoch 2/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 20ms/step - loss: 22.6983 - mean_absolute_error: 1.8401 - val_loss: 17.9671 - val_mean_absolute_error: 1.8519
Epoch 3/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 20ms/step - loss: 21.3851 - mean_absolute_error: 1.7478 - val_loss: 17.4954 - val_mean_absolute_error: 1.6873
Epoch 4/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 20ms/step - loss: 20.3519 - mean_absolute_error: 1.6143 - val_loss: 16.9864 - val_mean_absolute_error: 1.6524
Epoch 5/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 20ms/step - loss: 19.4754 - mean_absolute_error: 1.5062 - val_loss: 16.7028 - val_mean_absolute_error: 1.5904
Epoch 6/20
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 

In [607]:
scores

[[[[64, 32, 16], 'tanh', 128, [False], 'adamax'],
  [21.796321868896484, 1.6607264280319214]],
 [[[64, 32, 16], 'tanh', 128, [True, 0.3], 'adamax'],
  [20.942089080810547, 1.6713695526123047]],
 [[[64, 32, 16], 'tanh', 128, [True, 0.4], 'adamax'],
  [21.265302658081055, 1.660178303718567]],
 [[[128, 64, 32], 'tanh', 128, [True, 0.3], 'adamax'],
  [20.586200714111328, 1.6721023321151733]],
 [[[128, 64, 32], 'tanh', 128, [False], 'adamax'],
  [22.609840393066406, 1.7944376468658447]],
 [[[512, 256, 32], 'tanh', 128, [False], 'adamax'],
  [20.753948211669922, 1.5533264875411987]],
 [[[64, 32, 16],
   'tanh',
   128,
   [False],
   <keras.src.optimizers.adam.Adam at 0x20b0f67dc10>],
  [20.608652114868164, 1.6331120729446411]],
 [[[64, 32, 16],
   'tanh',
   128,
   [False],
   <keras.src.optimizers.adam.Adam at 0x20b0f66dc70>],
  [22.389524459838867, 1.7746586799621582]],
 [[[64, 32, 16],
   'tanh',
   128,
   [False],
   <keras.src.optimizers.adam.Adam at 0x20b0f67fbc0>],
  [21.3890190124

In [613]:
scores.sort(key=lambda x: x[1][1])

In [614]:
scores[0]

[[[64, 32, 16],
  'tanh',
  128,
  [False],
  <keras.src.optimizers.adam.Adam at 0x20b0f67fbc0>],
 [21.389019012451172, 1.5364429950714111]]

In [None]:
scores[0][0][4].get_config() 

{'name': 'adam',
 'learning_rate': 9.999999747378752e-05,
 'weight_decay': None,
 'clipnorm': None,
 'global_clipnorm': None,
 'clipvalue': None,
 'use_ema': False,
 'ema_momentum': 0.99,
 'ema_overwrite_frequency': None,
 'loss_scale_factor': None,
 'gradient_accumulation_steps': None,
 'beta_1': 0.9,
 'beta_2': 0.999,
 'epsilon': 1e-07,
 'amsgrad': False}

Best: [[64, 32, 16],
  'tanh',
  128,
  [False],
  keras.optimizers.Adam(learning_rate=0.0001)]

In [621]:
def buildModelUpdated(testing):
    num_nrns = testing[0]
    activation = testing[1]
    batch_size = testing[2]
    if_drop = testing[3]
    optimizer = testing[4]
    x = keras.layers.Dense(num_nrns[0], activation=activation)(all_features)
    if if_drop[0]:
        x = keras.layers.Dropout(rate=if_drop[1])(all_features)
    for num in num_nrns[1:]:
        x = keras.layers.Dense(num, activation=activation)(x)
        x = keras.layers.Normalization()(x)
    output = keras.layers.Dense(1, activation="linear", name="price")(x)

    early_stopping_cb = keras.callbacks.EarlyStopping(patience=25, start_from_epoch=20, monitor="val_loss", mode='min')
    #model_checkpoint_cb = keras.callbacks.ModelCheckpoint(f"models\\best_model", save_best_only=True)
    run_index = 1 # increment every time you train the model
    run_logdir = os.path.join(os.curdir, "logs", "run_{:03d}".format(run_index))
    tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
    callbacks = [early_stopping_cb, 
                #model_checkpoint_cb,
               tensorboard_cb]

    model = keras.Model(inputs=all_inputs, outputs=output)
    model.compile(optimizer=optimizer, loss=keras.losses.MeanSquaredError(), metrics=[keras.metrics.MeanAbsoluteError()])

    model.fit(x_train, y_train,
    epochs=testing[5],
    batch_size=batch_size,
    validation_split=0.25,
    callbacks=callbacks)

    return model, model.evaluate(x_test, y_test)


In [623]:
model = buildModelUpdated([[64, 32, 16],'tanh',128,[False],keras.optimizers.Adam(learning_rate=0.0001),75])

Epoch 1/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 22ms/step - loss: 25.9893 - mean_absolute_error: 2.0447 - val_loss: 20.3734 - val_mean_absolute_error: 2.0298
Epoch 2/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 21ms/step - loss: 24.4228 - mean_absolute_error: 1.9745 - val_loss: 19.4856 - val_mean_absolute_error: 1.8084
Epoch 3/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 21ms/step - loss: 23.2118 - mean_absolute_error: 1.8400 - val_loss: 18.9083 - val_mean_absolute_error: 1.7703
Epoch 4/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 21ms/step - loss: 22.3780 - mean_absolute_error: 1.7927 - val_loss: 18.5876 - val_mean_absolute_error: 1.7665
Epoch 5/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 20ms/step - loss: 21.8055 - mean_absolute_error: 1.7662 - val_loss: 18.2785 - val_mean_absolute_error: 1.7736
Epoch 6/75
[1m94/94[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 

In [624]:
model[0].save("models\\best_model.keras")
print(model[1])

[20.646961212158203, 1.5887808799743652]
