# Web Scraping scryfall.com

In [None]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

# A Better Way

So, I tried a certain approach when I first started, which you can see at the end. I'm leaving it in for academic purposes. But I've since discovered a better way, which you can see here.

The sets page has some good stuff! Like number of cards and release date! I need to get that!!!

Let's inspect the page in Chrome to see which tag has the info I want.

It looks like each row is contained within a \<tr> tag. So let's dissect one of these.

In [None]:
def clean_text(text):
    '''Clean up text by removing line breaks, carriage returns, and leading or trailing whitespace'''

#     text = re.sub(r'([\n])|[\r]|(^\s+)|(\s+$)', '', text) # Meant to be a combined regex, not working
    text = re.sub(r'[\n]', '', text) # Delete line breaks
    text = re.sub(r'[\r]', '', text) # Delete carriage returns
    text = re.sub(r'(^\s+)|(\s+$)', '', text) # Delete all leading or trailing whitespace
    return text

# Ya, I could probably combine all these into one regex, and compile it. But I'm in a hurry
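
The combined, precompiled regex mentioned in the comment above could look something like this sketch. `clean_text_compiled` and `_LINE_BREAKS` are hypothetical names, and it swaps the anchored whitespace regex for `str.strip()`, which should behave the same on ordinary text.

```python
import re

# Hypothetical name: one precompiled pattern for line breaks and carriage returns
_LINE_BREAKS = re.compile(r'[\r\n]')

def clean_text_compiled(text):
    '''Remove line breaks and carriage returns, then trim leading and trailing whitespace'''
    return _LINE_BREAKS.sub('', text).strip()

sample = '\n  Unsanctioned UND \r\n'
cleaned = clean_text_compiled(sample)
print(repr(cleaned))  # 'Unsanctioned UND'
```

Compiling once mostly matters when the function runs thousands of times, which is the case when it is applied to every table cell.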

In [None]:
r = requests.get(r'https://www.scryfall.com/sets')

sets_data = r.text

sets_soup = BeautifulSoup(sets_data, 'html.parser') # Name the parser explicitly to avoid bs4's GuessedAtParserWarning

In [None]:
# I read that it's MUCH more efficient to put all the data into a list or dict, then append to df or create a df

lst = []

for row in sets_soup.find_all('tr')[1:]:
    boxes = row.find_all('td')
    info = boxes[0].text.rsplit(' ', 1) + [boxes[1].text, boxes[2].text, boxes[1].find('a').get('href')]
    info = [clean_text(text) for text in info]
    lst.append(info)

lst
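
The `rsplit(' ', 1)` call in the loop above relies on the set abbreviation being the last space-separated word in the first cell. Here is a small sketch of that idea on its own, using `rpartition` so that a cell with no space doesn't raise an error. `split_name_and_code` is a hypothetical helper, and the sample strings are made up to mimic the cell text.

```python
def split_name_and_code(cell_text):
    '''Split "Set Name ABC" into ("Set Name", "ABC"), splitting on the last space only'''
    # rpartition splits on the LAST space, so multi-word names stay intact
    name, _, code = cell_text.strip().rpartition(' ')
    return name, code

print(split_name_and_code('  Ravnica Allegiance RNA \n'))  # ('Ravnica Allegiance', 'RNA')
print(split_name_and_code('UND'))                          # ('', 'UND')
```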

In [None]:
sets = pd.DataFrame(lst, columns = ['name', 'abbreviation', 'num_cards', 'release_date', 'url'])
sets.num_cards = sets.num_cards.astype('int64')
sets.release_date = pd.to_datetime(sets.release_date)
sets

A few entries don't have full URLs, so let's fix that.

In [None]:
mask = sets.url.apply(lambda url: url[:5] != 'https')
sets.loc[mask, 'url'] = sets.loc[mask, 'url'].apply(lambda url: 'https://scryfall.com' + url)
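
As an alternative to checking the prefix by hand, the standard library's `urljoin` handles both cases: it leaves absolute URLs alone and resolves relative ones against a base. This is only a sketch; `BASE_URL` and `full_url` are names I made up, and the base is the same prefix the cell above adds.

```python
from urllib.parse import urljoin

BASE_URL = 'https://scryfall.com'  # assumed base, same prefix as in the cell above

def full_url(href):
    '''Return an absolute URL, whether href is already absolute or site-relative'''
    return urljoin(BASE_URL, href)

print(full_url('/sets/und'))                      # https://scryfall.com/sets/und
print(full_url('https://scryfall.com/sets/und'))  # https://scryfall.com/sets/und
```

With the data frame above, this would be applied as `sets['url'] = sets['url'].apply(full_url)`, with no mask needed.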

I could add info about which are "subsets", like promos and tokens, as those tags have an "indent" class. But I won't worry about that now.

Once I learn SQL, that might be a better way to keep track of this stuff.
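
For what it's worth, here is a rough sketch of how the set/subset relationship might look in SQL, using Python's built-in `sqlite3` and an in-memory database. The table layout, the `parent_code` column, and the two rows are all placeholders I made up for illustration, not data from the site.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database

# Hypothetical schema: a subset points at its parent set through parent_code
conn.execute('''
    CREATE TABLE sets (
        code        TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        parent_code TEXT REFERENCES sets(code)
    )
''')

# Placeholder rows, not real scraped data
rows = [
    ('abc', 'Example Set', None),
    ('tabc', 'Example Set Tokens', 'abc'),
]
conn.executemany('INSERT INTO sets VALUES (?, ?, ?)', rows)

# Find every subset of a given set
subsets = conn.execute(
    'SELECT code, name FROM sets WHERE parent_code = ?', ('abc',)
).fetchall()
print(subsets)  # [('tabc', 'Example Set Tokens')]
```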

# UND

Ok, let's try to grab all the info we can for one set. Then we'll generalize that technique for all sets. I guess we'll store each in a separate data frame for now. Though I really am starting to think I need to use SQL.

Each set page has pictures of every card (by default). I can activate a checklist view by adding '?as=checklist', which makes it easier to get the data I want.
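
Here is a small sketch of building that checklist URL from a set URL without worrying about stray trailing slashes. `checklist_url` is a hypothetical helper; the only thing taken from the page itself is the `as=checklist` query parameter.

```python
from urllib.parse import urlencode

def checklist_url(set_url):
    '''Append the checklist query string to a set page URL'''
    query = urlencode({'as': 'checklist'})
    # Drop any trailing slash so the result doesn't end up as ".../und/?as=checklist"
    return f"{set_url.rstrip('/')}?{query}"

print(checklist_url('https://scryfall.com/sets/und'))  # https://scryfall.com/sets/und?as=checklist
```

With `requests`, passing `params={'as': 'checklist'}` to `requests.get` should have the same effect.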

We end up with a similar HTML structure to the sets page, so I'll reuse most of that code.

In [None]:
r = requests.get(r'https://scryfall.com/sets/und?as=checklist')

data = r.text

soup = BeautifulSoup(data, 'html.parser') # Name the parser explicitly to avoid bs4's GuessedAtParserWarning

In [None]:
lst = []

for row in soup.find_all('tr')[1:]:
    boxes = row.find_all('td')
    info = [boxes[i].text for i in range(11)]
    info = [clean_text(text) for text in info]
    lst.append(info)

lst

In [None]:
cards = pd.DataFrame(lst, columns = ['set', 'collectors_num', 'name', 'cost', 'type', 'rarity', 'language', 'artist', 'usd', 'eur', 'tix'])
cards

We did it!! Now, time to get ALL THE CARDS!!!

# ALL THE CARDS!

In [None]:
start_time = time.time()

lst = []
for index, site in sets.url.items(): # Series.iteritems() was removed in pandas 2.0
    r = requests.get(f'{site}/?as=checklist')
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    
    for row in soup.find_all('tr')[1:]:
        boxes = row.find_all('td')
        info = [boxes[i].text for i in range(11)]
        info = [clean_text(text) for text in info]
        lst.append(info)
        
cards = pd.DataFrame(lst, columns = ['set', 'collectors_num', 'name', 'cost', 'type', 'rarity', 'language', 'artist', 'usd', 'eur', 'tix'])   
run_time = time.time() - start_time
print(f'--- This program took {run_time} seconds to run. ---')
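
A loop like the one above fires off hundreds of requests back to back, and one bad page stops the whole run. Below is a sketch of a gentler version: it pauses between requests and skips pages that fail. `fetch_all` and `fake_fetch` are hypothetical names, the 0.1 second default is an arbitrary placeholder rather than a documented limit, and the demo uses a fake fetch function so it runs without touching the network.

```python
import time

def fetch_all(urls, fetch, delay_seconds=0.1):
    '''Call fetch(url) for each URL, pausing between calls and skipping failures'''
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as err:
            # Keep going; the failed URL simply won't appear in the result
            print(f'Skipping {url}: {err}')
        time.sleep(delay_seconds)
    return pages

def fake_fetch(url):
    '''Stand-in for a real download, used only for this demo'''
    if url.endswith('bad'):
        raise ValueError('simulated failure')
    return f'<html>{url}</html>'

pages = fetch_all(['https://example.com/a', 'https://example.com/bad'], fake_fetch, delay_seconds=0)
print(list(pages))  # ['https://example.com/a']
```

For the real thing, the fetch argument could be something like `lambda u: requests.get(u).text`.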

In [None]:
cards.to_csv('D:/code/Data/scryfall/cards.csv', header = True)

In [None]:
cards

# Data Cleaning and Exploration

Let's get our data into the correct data types and clean up some things, like weird symbols.

In [51]:
cards = pd.read_csv('D:/code/Data/scryfall/cards.csv')

In [52]:
cards.isnull().sum()

Unnamed: 0            0
set                   0
collectors_num        0
name                  0
cost               7962
type                 12
rarity                0
language              0
artist               95
usd                9966
eur               15569
tix               23108
dtype: int64
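
The `Unnamed: 0` column in the output above is the old index, which `to_csv` writes by default and `read_csv` then reads back as an ordinary column. Here is a sketch of a round trip that avoids it with `index=False`, using a tiny made-up frame and an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Made-up stand-in for the cards data frame
demo = pd.DataFrame({'name': ['card a', 'card b'], 'usd': ['$0.14', None]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))  # ['name', 'usd']
```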

In [53]:
cards.type.value_counts()

Creature            20365
Instant              6427
Sorcery              5941
Enchantment          5284
Artifact             3652
                    ...  
יצור                    1
Enchantment Land        1
伝説のクリーチャー・エンチャント        1
アーティファクト                1
Animal                  1
Name: type, Length: 95, dtype: int64

In [54]:
cards.groupby('artist').count().name.sort_values(ascending = False)

artist
John Avon                                  1103
Kev Walker                                 1027
Mark Tedin                                  685
Dan Frazier                                 677
Greg Staples                                627
                                           ... 
Susan Garfield                                1
Svetlin Velinov & Jared Blando                1
Sydney Adams                                  1
Christopher Moeller & Anthony S. Waters       1
Nick Bartoletti                               1
Name: name, Length: 915, dtype: int64

In [55]:
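def to_float(x):
    """Convert the strings in the price columns to float; return NaN if a value can't be parsed"""
    try:
        return float(re.split(r'[$€]', x)[-1])
    except (TypeError, ValueError):
        # Missing prices come through as NaN floats, which re.split can't handle
        print(x)
        return np.nan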
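
As an alternative to converting row by row, pandas can do the same job on a whole column at once. This is a sketch on a made-up series, assuming the prices are strings with a leading `$` or `€` and that missing values should become NaN.

```python
import pandas as pd

# Made-up sample of what a price column might hold
raw_prices = pd.Series(['$0.14', '€1.20', None])

# Strip the currency symbol, then let to_numeric turn anything unparseable into NaN
prices = pd.to_numeric(
    raw_prices.str.replace(r'[$€]', '', regex=True),
    errors='coerce',
)
print(prices.tolist())  # [0.14, 1.2, nan]
```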

In [56]:
# Clean up data types
cards.loc[:, 'usd'] = cards.loc[:, 'usd'].apply(to_float)
cards.loc[:, 'eur'] = cards.loc[:, 'eur'].apply(to_float)

$0.14
$0.13
$0.16
$0.26
$0.58
$0.13
$0.14
$0.13
$0.19
$0.16
$0.12
$0.12
$0.17
$0.18
$0.30
$0.51
$0.30
$1.00
$0.15
$0.42
$0.16
$0.15
$1.89
$0.11
$0.15
$1.70
$0.16
$0.14
$0.19
$0.15
$0.40
$0.17
$0.19
$0.13
$0.13
$0.32
$0.16
$0.27
$0.30
$0.35
$0.14
$0.16
$0.13
$0.18
$0.16
$0.12
$0.17
$0.97
$0.15
$0.22
$0.24
$0.13
$0.21
$0.18
$0.09
$0.12
$0.19
$0.82
$0.13
$0.37
$0.25
$0.13
$0.13
$0.13
$0.14
$0.14
$0.38
$0.18
$0.16
$0.15
$0.60
$0.47
$0.37
$0.16
$0.45
$0.17
$0.19
$0.82
$0.13
$0.19
$0.27
$0.19
$0.19
$0.57
$0.17
$0.20
$0.16
$0.16
$0.28
$0.29
$0.27
$0.27
$0.18
$0.18
$0.29
$0.28
nan
nan
nan
nan
nan
nan
$0.34
$0.05
$1.29
$0.20
$0.37
$0.03
$0.07
$0.04
$0.26
$0.04
$0.03
$0.35
$2.37
$5.33
$0.03
$0.04
$0.03
$15.49
$2.98
$0.04
$0.06
$0.02
$0.03
$3.01
$0.04
$0.12
$0.07
$0.03
$0.05
$0.05
$0.06
$0.03
$0.06
$0.04
$0.04
$0.12
$1.58
$0.03
$0.16
$0.05
$0.04
$0.06
$0.28
$0.03
$0.08
$0.04
$0.04
$0.05
$0.03
$0.06
$0.02
$2.24
$0.10
$0.03
$0.86
$0.04
$0.04
$0.25
$0.06
$0.20
$0.03
$0.06
$0.02
$0.05
$0.04
$0.03
$0.

In [57]:
cards.loc[:, 'collectors_num'] = cards.loc[:, 'collectors_num'].apply(lambda x: int(x))

ValueError: invalid literal for int() with base 10: '347★'
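
So some collector numbers carry extra symbols, like the `★` in the error above. One way to deal with that, sketched below, is to split each value into its leading digits and whatever follows. `split_collector_number` is a hypothetical helper, and treating the leading digits as "the number" is my assumption about what's useful here.

```python
import re

# Leading run of digits, followed by any suffix
_COLLECTOR = re.compile(r'^(\d+)(.*)$')

def split_collector_number(value):
    '''Return (number, suffix); number is None when there are no leading digits'''
    match = _COLLECTOR.match(str(value))
    if match is None:
        return None, str(value)
    return int(match.group(1)), match.group(2)

print(split_collector_number('347★'))  # (347, '★')
print(split_collector_number('12'))    # (12, '')
```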

In [None]:
cards

In [None]:


# Flag collector numbers that aren't plain digits, e.g. '347★'
mask = [not str(x).isdigit() for x in cards.collectors_num]

cards.loc[mask]

In [None]:
cards.usd.plot.hist(bins = 100)

In [None]:
cards.dtypes

# My original, not that great, approach

In [None]:
url = r'https://www.scryfall.com'

r = requests.get(url)

data = r.text

soup = BeautifulSoup(data, 'html.parser') # Name the parser explicitly to avoid bs4's GuessedAtParserWarning

In [None]:
print(soup.prettify()[:1000])

In [None]:
links = soup.find_all('a')

display(f'There are {len(links)} links')

for link in links:
    print(link.get('href'))

Ok, that's a lot of pages. The "sets" page looks promising. I think that will have every card. Let's go there!

In [None]:
r = requests.get(f'{url}/sets')

sets_data = r.text

sets_soup = BeautifulSoup(sets_data, 'html.parser') # Name the parser explicitly to avoid bs4's GuessedAtParserWarning

In [None]:
# print(sets_soup.prettify())

In [None]:
sets_links = sets_soup.find_all('a')

display(f'There are {len(sets_links)} links')

for link in sets_links[:100]:
    print(link.get('href'))

Hmm, a lot of repeats and a lot of different formats. Let's fix all the links to be the full link, then create a list without duplicates.

In [None]:
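def fix_link(link):
    '''Return the full URL for a link; return None for anything that isn't an https or site-relative link'''
    try:
        if link[:5] == 'https':
            return link
        elif link[0] == '/':
            return f'{url}{link}'
    except (TypeError, IndexError):
        # link is None (an <a> tag with no href) or an empty string
        print('Empty Link')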

In [None]:
sets_list = [link.get('href') for link in sets_links]

fixed_sets_list = [fix_link(string) for string in sets_list]
fixed_sets_list = list(set(fixed_sets_list))
fixed_sets_list

Looking good. A few more things to work on:
1. Some have www and some don't. Let's add that to all of them.
2. There will be more duplicates after we do that. (I suspect that we could just drop all that don't have www, but it doesn't hurt to be careful.)
3. There are a few bad ones, like reddit and None. Let's lose those as well.

In [None]:
fixed_sets_list.remove(None)

In [None]:
fixed_sets_list

In [None]:
# Join them all into one string
# Seems easier to do regex when it's all one string.
# Should investigate this more. Am I just bad, or is this a good way to do it?

csv = ','.join(fixed_sets_list)

csv = re.sub(r'/scryfall', r'/www.scryfall', csv)

fixed_sets_list = csv.split(',')

In [None]:
fixed_sets_list = list(set(fixed_sets_list))
fixed_sets_list

Ok, almost there. There are still a few unwanted sites in there. Let's get rid of every site that doesn't have "www.scryfall.com/sets" in it.

In [None]:
csv = ','.join(fixed_sets_list)

regex = r'[^,]*www\.scryfall\.com/sets[^,]*' # Escape the dots so they only match literal dots

csv = re.findall(regex, csv)

fixed_sets_list = csv
fixed_sets_list

Now, I know this isn't yet correct, because on that page it says that there are 579 sets. So what's gone wrong?

Well, some of the set pages have an additional "/" after the set designation. So it's some kind of subpage.

Oh, I see, it's for each language. Well, I believe the default for each is English, so let's get rid of everything after the set designation.

In [None]:
csv = ','.join(fixed_sets_list)

regex = r'(/sets/[^/,]*?)/[^/,]*?,'

# re.findall(regex, csv, flags = re.VERBOSE)

csv = re.sub(regex, r'\1,', csv, flags = re.VERBOSE)

final_sets_list = csv.split(',')

final_sets_list

YES! Looking good! Now I just need to remove the duplicates again (could have waited until now, but I just like cleaning things).

In [None]:
final_sets_list = list(set(final_sets_list))
final_sets_list
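
On the question of whether joining everything into one string is a good way to do it: another option is to handle each link on its own with `urllib.parse`, which avoids regexes entirely. The sketch below does all the steps from this section (full links, the www mismatch, unwanted sites, language subpages) in one function. `set_page_url` and `BASE_URL` are hypothetical names, and the sample links, including the `/sets/und/ja` shape for a language subpage, are my assumptions about what the hrefs look like.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://scryfall.com'  # assumed canonical form, without www

def set_page_url(href):
    '''Return the canonical set page URL for a link, or None if it isn't a set page'''
    if not href:
        return None
    parsed = urlparse(urljoin(BASE_URL, href))
    host = parsed.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]  # treat www.scryfall.com and scryfall.com as the same site
    parts = [part for part in parsed.path.split('/') if part]
    if host != 'scryfall.com' or len(parts) < 2 or parts[0] != 'sets':
        return None
    # Keep only /sets/<code>, dropping any language subpage after it
    return f'{BASE_URL}/sets/{parts[1]}'

# Made-up sample of the kinds of links described above
hrefs = ['/sets/und', 'https://www.scryfall.com/sets/und/ja',
         'https://www.reddit.com/r/example', None, '/sets']
unique_sets = sorted({u for u in map(set_page_url, hrefs) if u})
print(unique_sets)  # ['https://scryfall.com/sets/und']
```

Building a set of the results removes the duplicates in the same pass, so there is no need to deduplicate after every step.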

So close!!! We ended up with 574 links. But there are actually 579 sets (according to the bottom of the page). Well, I'm not sure what I missed, but I think there's a better way to handle this stuff.