# Scraping Epic Seven Character Data

## Imports

In [1]:
# necessary imports
import requests
import re
import time
import random

## Scraping

### Character Names

First I start by scraping character names. Once again, [Regex101](https://regex101.com/) made the endeavor quite simple.

In [2]:
name_url = 'https://epic7x.com/characters/'

# API Call
name_response_string = requests.get(name_url).text

# Get name of all current units
name_pattern = r'"name":"([\w ]+)","icon"'  # raw string avoids invalid-escape warnings for \w
name_list = re.findall(name_pattern, name_response_string)

# Display check
print(name_list[:5])
print(len(name_list))

['Achates', 'Adin', 'Adlay', 'Adventurer Ras', 'Ainos']
281


As of 4/17/2023, there are a total of $281$ characters in Epic Seven, so the extraction was successful. Moving on to each individual character page:

### Character Pages

The names in `name_list` contain whitespace, which needs to be replaced with `"-"` to match each unit's URL. Apostrophes are also dropped and the name is lowercased.
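
As a small illustration of that name-to-URL step, here is a minimal sketch that mirrors the cleaning done in the multi-scrape function further down (lowercase, drop apostrophes, spaces to hyphens). The helper name `name_to_url` and its `base` parameter are my own for illustration and do not appear in the notebook.

```python
def name_to_url(name, base="https://epic7x.com/character/"):
    """Build a character page URL from a display name (hypothetical helper)."""
    # Lowercase, drop apostrophes, and turn spaces into hyphens.
    slug = name.lower().replace("'", "").replace(" ", "-")
    return base + slug + "/"


print(name_to_url("Assassin Cidd"))
# https://epic7x.com/character/assassin-cidd/
```

Whether every character page follows this slug scheme is an assumption; names containing other punctuation may need extra handling.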

#### Single Scrape - Success

The following cells are the intermediate code I used to build up my multi-scrape function, but this is also where an interesting issue manifests. Below is an example of a successful scrape:

In [3]:
response_check = requests.get("https://epic7x.com/character/angelica/").text
print(response_check[1500:2000]) # when a request is blocked, the error message shows up around the 1500:2000 range

dystatechange=function(){var t;4==s.readyState&&(200<=s.status&&s.status<300||304==s.status)&&((t=e.createElement("script")).type="text/javascript",t.text=s.responseText,e.head.appendChild(t))},s.send(null)}(document);
        (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
        new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
        j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
        'https://www.googletagmanager.com/gtm.js?id='+


In [4]:
substats_blob_pattern = r"<span class='f-16'><b>([\w\W]{0,900})</div>"  # raw string; "/" needs no escaping
substats_blob_list = re.findall(substats_blob_pattern, response_check)
print(len(substats_blob_list))
print(substats_blob_list[0:100])

1
["Hp%</b></span>,                                             <!-- <span class='f-16'><b>Def%</b></span>  <img class='lazyloaded character-rarity ml-15' src='https://epic7x.com/wp-content/themes/epic7x/assets/img/arrow.png'>  -->\n                        <span class='f-16'><b>Def%</b></span>,                                             <!-- <span class='f-16'><b>Speed</b></span>  <img class='lazyloaded character-rarity ml-15' src='https://epic7x.com/wp-content/themes/epic7x/assets/img/arrow.png'>  -->\n                        <span class='f-16'><b>Speed</b></span>,                                             <!-- <span class='f-16'><b>Effect Resistance%</b></span>  -->\n                        <span class='f-16'><b>Effect Resistance%</b></span>                                        </div>\n                "]


In [5]:
split_list = substats_blob_list[0].split('<b>')
stat_list = []
for stat_blob in split_list:
    stat = stat_blob.split("</b>")[0] # 0th element is the stat
    #print(stat) # TYPE str lit
    if stat not in stat_list: # don't add dupes
        stat_list.append(stat)
print(stat_list)

['Hp%', 'Def%', 'Speed', 'Effect Resistance%']
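
The split-and-loop step above could also be written with a single regex. The sketch below is an alternative, not the code used in this notebook: it assumes every stat name sits directly before a closing `</b>` tag, as in the blob printed earlier. The name `parse_substats` is hypothetical, and `sample_blob` is an abbreviated version of that output.

```python
import re


def parse_substats(blob):
    """Return the unique stat names in a substats blob, in first-seen order."""
    # Each stat name is the tag-free text immediately before a closing </b>.
    stats = re.findall(r"([^<>]+)</b>", blob)
    # dict.fromkeys drops duplicates (the commented-out copies) but keeps order.
    return list(dict.fromkeys(stats))


# Abbreviated from the blob printed above.
sample_blob = (
    "Hp%</b></span>, <!-- <span class='f-16'><b>Def%</b></span> --> "
    "<span class='f-16'><b>Def%</b></span>, "
    "<span class='f-16'><b>Speed</b></span></div>"
)
print(parse_substats(sample_blob))
# ['Hp%', 'Def%', 'Speed']
```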


#### Single Scrape - Fail

And here is the same code run against a page where the scrape fails:

In [6]:
response_check = requests.get("https://epic7x.com/character/assassin-cidd/").text
print(response_check[1500:2000]) # error message around 1500:2000 range

.net/wordpress-speed/wpx-hosting-promo-coupon-discount-code/" target="_blank"><img src="https://cf.wpx.net/img/No-Website-Installed.png" style="width: 450px;" /></a>
        <br />
        &nbsp;
        <br />
        <span style="font-weight: 500;">You have reached an error page on WPX.net</span>
        <br />
        &nbsp;
        <br />
        If you are seeing this page, it is because there is no website installed on this domain yet.
        <br />
        &nbsp;
        <br /


In [7]:
substats_blob_pattern = r"<span class='f-16'><b>([\w\W]{0,900})</div>"  # raw string; "/" needs no escaping
substats_blob_list = re.findall(substats_blob_pattern, response_check)
print(len(substats_blob_list))
print(substats_blob_list[0:100])

0
[]
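
Since the failed response contains a recognizable WPX.net error message, a blocked page can be detected explicitly instead of inferring it from an empty regex result. This is a minimal sketch, assuming every blocked request returns that same error page, which I have not verified; `is_blocked_page` and `WPX_ERROR_MARKER` are hypothetical names.

```python
# Text taken from the error page printed above.
WPX_ERROR_MARKER = "You have reached an error page on WPX.net"


def is_blocked_page(html):
    """Return True if the response body looks like the WPX.net error page."""
    return WPX_ERROR_MARKER in html


blocked_html = '<span style="font-weight: 500;">You have reached an error page on WPX.net</span>'
print(is_blocked_page(blocked_html))            # True
print(is_blocked_page("<html>normal</html>"))   # False
```

A check like this makes it possible to separate "blocked" from "page loaded but the pattern did not match" when counting failures.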


#### Multi Scrape

Despite knowing failures exist, I implemented a Python function for scraping the entire character database:

In [8]:
def extract_all_character_substats(name_list):
    # Helper function to extract unit
    def extract_substats(name):
        # attempt at trying to get around scrape blocker
        # wait_time = random.randrange(1, 5, 1)
        # time.sleep(wait_time)
        # print('Time waited: {}'.format(wait_time))
        
        # print(name) # DEBUG

        # create correct URL
        name_cleaned = name.replace(" ", "-").replace("\'", "").lower()
        name_url = "https://epic7x.com/character/" + name_cleaned + "/"
        # print(name_url) # DEBUG
        substats_blob_response_string = requests.get(name_url).text # API call
        # print(substats_blob_response_string[:100]) # DEBUG
        # print(substats_blob_response_string) # DEBUG

        # instantiate pattern
        substats_blob_pattern = r"<span class='f-16'><b>([\w\W]{0,1300})</div>"  # raw string; "/" needs no escaping
        substats_blob_list = re.findall(substats_blob_pattern, substats_blob_response_string)

        # split `substats_blob_list` so that 0th index is the substat
        try:
            split_substats_list = substats_blob_list[0].split('<b>')
        except IndexError:  # no match found, so the page was most likely blocked
            # print("Blocked, scraping next character.") # DEBUG
            # print("<------------------------->\n") # DEBUG
            return [False, []] # extract failed, return false and empty

        # extract substats from `split_substats_list`, 0th element 
        substats_list = []
        for stat_blob in split_substats_list:
            stat = stat_blob.split("</b>")[0] # 0th element is the stat
            #print(stat) # TYPE str lit
            if stat not in substats_list: # don't add dupes
                substats_list.append(stat)

        # display checks            
        # print(substats_list) # DEBUG
        # print("<------------------------->\n") # DEBUG
        return [True, (name, substats_list)] # extract success, list containing true and unit tuple

    ### START OF PARENT FUNCTION ###
    # instantiate variables for statistic display
    num_success = 0
    num_fail = 0
    num_requests = 0

    # instantiate character list
    character_list = []

    # perform extraction
    for name in name_list:
        if name != "Architect Laika":
            # 0th index is true/false, 1st index is the unit-substats tuple
            bool_unit_pair = extract_substats(name)
            is_success = bool_unit_pair[0]
            if (is_success):
                num_success += 1
                # add to character list
                character_list.append(bool_unit_pair[1])
            else:
                num_fail += 1
    
            num_requests += 1

    # display extraction stats
    print("num requests: {}".format(num_requests))
    print("num success: {}".format(num_success))
    print("num fail: {}".format(num_fail))
    print("success rate: {:.4f}".format(num_success/num_requests))
    return character_list

In [9]:
extracted_character_list = extract_all_character_substats(name_list)

num requests: 280
num success: 71
num fail: 209
success rate: 0.2536


Despite certain character pages getting blocked, I still scraped what I could. A roughly $25\%$ success rate isn't terrible, but it's not nearly enough to build a tool I would actually use. In my README I mentioned a cheap tool, but it needs to be *usable*.

## Issue - Data Scraping Block


Looking at the `robots.txt` for [Epic7x](https://epic7x.com/):

<div>
    <a href="https://epic7x.com/robots.txt">
        <img src='img/epic7x_robot.png' width=500>
    </a>
</div>
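
A `robots.txt` file can also be checked programmatically with Python's standard `urllib.robotparser`. The sketch below only shows the mechanics: the rules in `example_rules` and the `example.com` URLs are made up for illustration and are not the contents of the Epic7x file, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, NOT the real Epic7x robots.txt.
example_rules = ["User-agent: *", "Disallow: /wp-admin/"]
print(is_allowed(example_rules, "https://example.com/character/angelica/"))
print(is_allowed(example_rules, "https://example.com/wp-admin/"))
```

To check the live file instead, `RobotFileParser` can be pointed at it with `set_url(...)` followed by `read()`. Note that `robots.txt` only states what crawlers are asked to do; it does not reveal what blocking a server actually enforces.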

I'm not a web developer, but from Googling around and reading [this article](https://www.codementor.io/@scrapingdog/10-tips-to-avoid-getting-blocked-while-scraping-websites-16papipe62), I believe there are some mechanisms in place that are stopping me from scraping the character pages.

I tried following some of the tips listed in the article to get around this issue, like inserting a random wait time between page requests and using a different header, but to no avail. I also didn't want to bother changing my IP address or doing VPN-related things, as that's outside the realm of what I'm interested in (I'm also too lazy).
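
For reference, here is a minimal sketch of what a random-wait plus custom-header attempt can look like. It is not the exact code used in this notebook, and as noted above this approach did not get past the block here. `fetch_with_retries` and `make_fetcher` are hypothetical names, and the retry count, wait bounds, and 10-second timeout are arbitrary assumptions.

```python
import random
import time


def fetch_with_retries(url, fetch, is_blocked, max_attempts=3,
                       min_wait=1.0, max_wait=5.0, sleep=time.sleep):
    """Call fetch(url) up to max_attempts times, waiting a random interval
    between attempts. Returns the page text, or None if every try is blocked."""
    for attempt in range(max_attempts):
        html = fetch(url)
        if not is_blocked(html):
            return html
        if attempt < max_attempts - 1:
            # Random pause so requests are not evenly spaced.
            sleep(random.uniform(min_wait, max_wait))
    return None


def make_fetcher(user_agent):
    """Build a fetch function that sends a custom User-Agent header."""
    import requests  # imported lazily; only needed for real network calls

    headers = {"User-Agent": user_agent}

    def fetch(url):
        return requests.get(url, headers=headers, timeout=10).text

    return fetch
```

Usage would look like `fetch_with_retries(name_url, make_fetcher("Mozilla/5.0"), is_blocked)`, where `is_blocked` is any function that recognizes the error page, for example a check for the WPX.net message seen in the failed scrape.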