# DnD Monsters: Dice and Data
As a Dungeon Master, it is very important to understand the strength of the monsters you pit against your players. Too weak and they are bored; too strong and they die or, worse, they don't have fun. The current method, known as Challenge Rating (CR), is a numerical system used to determine how difficult an enemy is for a party of 4 players. Challenge Ratings range from 0 to 30. Unfortunately, this method is very basic and often does not hold true for every encounter.

One thing that isn't accounted for is action economy. This is the biggest destroyer of players, the strongest weapon in your arsenal. If your players are facing 100 monsters, that's 100 turns. Even if the players manage to kill a good chunk of them, the majority will make it through, and some of them will land critical hits. This is a much more difficult encounter than a single monster worth the same XP with, say, 2 attacks.
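
To put rough numbers on that, here is a minimal sketch of the expected hits and critical hits from a large group of attackers. The attack bonus of +4 and the target AC of 18 are made-up values for illustration, and `expected_hits` is a hypothetical helper name. The only rules assumed are that a natural 20 always hits and crits, and a natural 1 always misses.

```python
def expected_hits(n_attacks, attack_bonus, target_ac):
    """Return (expected hits, expected crits) for n_attacks d20 attack rolls."""
    # Count the d20 faces that hit: a natural 20 always hits, a natural 1 always misses.
    hitting_faces = sum(
        1
        for roll in range(1, 21)
        if roll == 20 or (roll != 1 and roll + attack_bonus >= target_ac)
    )
    hits = n_attacks * hitting_faces / 20
    crits = n_attacks * 1 / 20  # only a natural 20 is a critical hit
    return hits, crits


# 100 weak monsters (+4 to hit, an assumed value) attacking an AC 18 character
print(expected_hits(100, 4, 18))  # (35.0, 5.0)
```

Even against a well-armored character, that is roughly 35 hits in a single round, 5 of them critical.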

Wizards of the Coast not only provide a guideline for how much XP you should have per level per day, but they also show you how much a party of 4 at X level can stomach during one encounter. They also provide an XP multiplier that takes multiple monsters into consideration. For example, 10 monsters get a x2.5 XP multiplier, causing their total XP rating to jump up for the encounter, potentially making them deadly. Action Economy rules all. 
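
As a rough sketch of how that multiplier works, the snippet below sums the XP of every monster and scales the total by the size of the group. The multiplier brackets are the ones I recall from the Dungeon Master's Guide encounter-building table, so treat them as an assumption and check them against the book. `encounter_multiplier` and `adjusted_xp` are hypothetical helper names, and the 50 XP per monster is a made-up value.

```python
def encounter_multiplier(monster_count):
    """Return the XP multiplier for a group of monsters (assumed DMG brackets)."""
    if monster_count < 1:
        raise ValueError("an encounter needs at least one monster")
    # (largest group size in the bracket, multiplier)
    brackets = [(1, 1.0), (2, 1.5), (6, 2.0), (10, 2.5), (14, 3.0)]
    for upper_bound, multiplier in brackets:
        if monster_count <= upper_bound:
            return multiplier
    return 4.0  # 15 or more monsters


def adjusted_xp(monster_xp_values):
    """Sum the XP of every monster and apply the group multiplier."""
    return sum(monster_xp_values) * encounter_multiplier(len(monster_xp_values))


# Ten monsters worth 50 XP each: 500 XP of monsters counts as 1250 XP of difficulty.
print(adjusted_xp([50] * 10))  # 1250.0
```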

CR is unfortunately not a great method for measuring a monster's strength. It uses AC, HP, attack bonus, damage per round (DPR), and save DC as a general guideline. It doesn't take into account legendary actions, at-will spells, special abilities that cause status ailments, or any other boosting abilities.

There are two CRs, Defensive and Offensive, which are used to calculate the total CR of a monster. For the Defensive CR, you use the chart provided to find the average of the CRs indicated by the HP and AC. The Offensive CR does the same thing but uses DPR and attack bonus. Then, by averaging the two CRs, we get our final monster Challenge Rating. As you can see, this doesn't take into account any of the strong abilities a monster may have. Similarly, a physically weak monster that relies on spells may end up with a CR vastly lower than it should be.
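
The averaging described above can be sketched in a few lines. This assumes the chart lookups have already been done, so the four inputs are the CRs that the chart indicates for each stat. The function names and the sample values are hypothetical.

```python
def mean(a, b):
    """Average of two challenge ratings."""
    return (a + b) / 2


def challenge_rating(cr_hp, cr_ac, cr_dpr, cr_attack_bonus):
    """Combine the four chart lookups into one CR, as described in the text."""
    defensive_cr = mean(cr_hp, cr_ac)
    offensive_cr = mean(cr_dpr, cr_attack_bonus)
    return mean(defensive_cr, offensive_cr)


# Made-up chart results: defensive CR is 5, offensive CR is 7, final CR is 6.
print(challenge_rating(cr_hp=6, cr_ac=4, cr_dpr=9, cr_attack_bonus=5))  # 6.0
```

Nothing in this calculation looks at spells, legendary actions, or status effects, which is exactly the gap this investigation is interested in.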

WoTC has augmented this system by applying multipliers or increases based on other features, traits, or abilities the monster may have. 

www.dndbeyond.com/monsters has many pages of monster listings. Each listing has a dropdown that has a monster table associated with it. This contains stats, abilities, and other important details. 

Unfortunately, dndbeyond has shut down the ability to scrape it by using automation detection software. I don't intend to break the ToS, so I will use the SRD from the dandwiki.com page instead. 

The goal of this investigation is to learn more about monsters' abilities in relation to the CR system: to understand whether there are correlations among any of the stats, abilities, environments, sizes, etc.; to see if we can classify monsters based on any of these traits; to create a dashboard that pits monsters against each other for comparison; and finally, to see if there is a way to improve the CR system by using abilities, traits, features, and spells in a more cohesive manner.


## DnDWiki: HTML instead of DnDBeyond's JavaScript
DnDBeyond requires JavaScript parsing, which is more advanced than the knowledge I currently want to practice. I will try
to work with DnDWiki instead since it is served entirely as HTML.

### Libraries for Parsing
First we need to gain access to our monster data sheet. As stated above, dndbeyond.com has a great repository of monster data. This will need to be scraped from their site. Unfortunately, each of the monster pages is hidden behind an accordion dropdown and will need to be extracted. This is something I have not yet done, so I am excited to try. We will start out using Requests and BeautifulSoup since I am most comfortable with these.

In [23]:
#Import libraries for scraping
from bs4 import BeautifulSoup as bs
import requests as rq
import pandas as pd

### Get Request for Monster Names

In [24]:
#Fetching HTML
url = "https://www.dandwiki.com/wiki/5e_SRD:Monsters"
Request = rq.get(url).text

soup = bs(Request, 'html.parser')

### Collect Names of All Monsters in a List 
Unfortunately, dndwiki is not well crafted, which meant I needed to get creative. There weren't distinguishing classes, names, or ids. The styles between tables were a bit different, so I used that to gather the information needed.

In [30]:
#Find the main content div and extract it for processing
#This involves finding the list items that are only housed within the parent table that has a width of 100%.
tables = soup.findAll('table',{'style':"width: 100%;"})
monster_names_dndwiki=[]

for table in tables:
    li_table = table.findAll('li')
    for name in li_table:
        monster_names_dndwiki.append(name.text)

### Clean up data
We need to remove duplicates and non-monsters from the list 

In [26]:
#Remove the non-monster data

#Remove duplicate monsters if there are any (the scraped names live in monster_names_dndwiki)
monster_names = list(set(monster_names_dndwiki))
monster_list=[]
#filter through and replace spaces with dashes to format for urls
for name in monster_names:
    if not(name.strip().isdigit()):
        new_name = name.replace(' ','-')
        monster_list.append(new_name)
    else:
        monster_list.append(name)



### Dictionary of URLs to parse
We will iterate through the monster names, knowing that dandwiki uses a uniform URL pattern for all monster pages: www.dandwiki.com/wiki/5e_SRD:'MonsterName'.

In [6]:
monster_url=[]
for name in monster_list:
    monster_url.append('https://www.dndbeyond.com/monsters/'+name)


### Website Structure is Disgusting
There are still some things on here that are not monsters (they summon monsters), for example the Deck of Many Things. These will break any analysis or modeling we try to do, so we need to remove them. We can look at all the things monsters have in common that these other objects do not. Unfortunately, the DoMT and the figures of power also contain niche "monster" stats for their monsters. We will include these in our table. Zombies and Dinosaurs, however, will not be included, since they are just categories of many monsters, all of which are already in the list. 



In [8]:
from collections import defaultdict

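#function to make sure each get request is functioning properly and to parse the url
def Run_Soup_If_Status_Ok(url):
    request = rq.get(url)
    #only parse the page when the request succeeded; otherwise return None
    if not request.ok:
        return None
    soup = bs(request.text, 'html.parser')
    return soup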


monster_dict=defaultdict(list)

#append dictionary with monster name and the soupy information
for name,url in zip(monster_names,monster_url):
    monster_dict[name].append(Run_Soup_If_Status_Ok(url))


## DNDBeyond: Testing Selenium webdriver on DnDBeyond with a single Monster
DnDWiki is frankly just very unhelpful in terms of web structure. There are no defining classes, ids, names, or elements on any of the information, which makes parsing a nightmare. I will move to DnDBeyond using Selenium.
First we will grab all the information for the Mummy Lord in the 'mon-stat-block' class and the footer information, which contains all our tags like source book, environment, and monster tags. We are testing on a single monster
to begin writing the scraping scripts. 

In [20]:
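from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

url = 'https://www.dndbeyond.com/monsters/mummy-lord'


#executable_path is deprecated in Selenium 4, so the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service('../env/chromedriver.exe'))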

driver.get(url)

driver.implicitly_wait(5)

soup = BeautifulSoup(driver.page_source, 'lxml')

stat_block = soup.find('div',{'class':'mon-stat-block'})
Environment = soup.find('footer')






### Column Names: Parsing for headings, labels, and tags
Unfortunately, I don't know of any one monster that contains every single type of column we are looking for. The Mummy Lord is a strong enemy that includes a lot of information.
I added any column names to the start of the list if they weren't included in the Mummy Lord's stat blocks.

Then we create for loops looking for classes that end with label or heading, or start with environment-tags (later I will decide to expand this to all tags).

In [21]:

column_names = ['Monster Name','Size','Type', 'Alignment','Traits', 'Damage Resistances', 'Monster Tags:', 'Mythic Actions', 'Reactions','Source']
#First set of column names from 'label span'
for headers in stat_block.findAll('span',{'class': lambda e: e.endswith('label') if e else False}):    
    column_names.append(headers.text)
    
for headers in stat_block.findAll('div',{'class': lambda e: e.endswith('heading') if e else False}):    
    column_names.append(headers.text)

for headers in Environment.findAll('p',{'class': lambda e: e.startswith('environment-tags') if e else False}):    
    column_names.append(headers.contents[0].strip())


### Create Empty Dictionary with Keys from the Extracted Column Names
Iterate over the column list, filling a dictionary with a key and empty list value

In [22]:
monster_dict = dict.fromkeys(column_names)

#Initialize monster_dict so that the value for every key is an empty list
for column in column_names:
    monster_dict[column] = []

monster_dict

{'Monster Name': [],
 'Size': [],
 'Type': [],
 'Alignment': [],
 'Traits': [],
 'Damage Resistances': [],
 'Monster Tags:': [],
 'Mythic Actions': [],
 'Reactions': [],
 'Source': [],
 'Armor Class': [],
 'Hit Points': [],
 'Speed': [],
 'Saving Throws': [],
 'Skills': [],
 'Damage Vulnerabilities': [],
 'Damage Immunities': [],
 'Condition Immunities': [],
 'Senses': [],
 'Languages': [],
 'Challenge': [],
 'Proficiency Bonus': [],
 'STR': [],
 'DEX': [],
 'CON': [],
 'INT': [],
 'WIS': [],
 'CHA': [],
 'Actions': [],
 'Legendary Actions': [],
 'Environment:': []}
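
The same empty-list dictionary can also be built in one step. This is a minimal sketch using a shortened stand-in for `column_names`. Note that `dict.fromkeys(column_names, [])` would not be equivalent, because every key would then share one list object.

```python
column_names = ['Monster Name', 'Size', 'Type']  # shortened stand-in for the full list

# A dict comprehension gives every key its own, independent empty list.
monster_dict = {column: [] for column in column_names}

monster_dict['Size'].append('Medium')
print(monster_dict)  # {'Monster Name': [], 'Size': ['Medium'], 'Type': []}

# dict.fromkeys with a mutable default shares a single list between all keys.
shared = dict.fromkeys(column_names, [])
shared['Size'].append('Medium')
print(shared['Type'])  # ['Medium']
```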

### Add Values of Mummy Data into our Dictionary
Here is our big show stopper. This will be turned into a function to be used in the main scrape

In [None]:
# Monster Name
monster_name = stat_block.find('div', {'class':'mon-stat-block__name'}).text
monster_dict['Monster Name'].append(' '.join(str(monster_name).split())) 


#This next set (Size,Alignment, and Type) will split the single meta text using split() and replace() functions
monster_subinfo = stat_block.find('div', {'class':'mon-stat-block__meta'})
monster_subinfo=monster_subinfo.text

# Size (first word)
monster_size = monster_subinfo.split()[0]
monster_dict['Size'].append(monster_size) 
# Alignment (after comma)
monster_alignment = monster_subinfo.split(', ')[-1]
monster_dict['Alignment'].append(monster_alignment) 
# Type (remaining words). The sublist will remove the above two variables from the text, as well as the loose comma.
#It will also create a list for the type, as sometimes there are sub-types associated with monsters (e.g Titan)
sub_list=(monster_size,monster_alignment, ', ')
monster_type = monster_subinfo
for substring in sub_list:
    monster_type = monster_type.replace(substring,'')
monster_type=monster_type.split()
monster_dict['Type'].append(monster_type) 

#find all attribute metrics
attribute_data = stat_block.findAll('span',{'class':'mon-stat-block__attribute-data-value'})

# Armor Class
monster_ac = ' '.join(str(attribute_data[0].text).split())
monster_dict['Armor Class'].append(monster_ac)
# Hit Points
monster_hp = ' '.join(str(attribute_data[1].text).split())
monster_dict['Hit Points'].append(monster_hp)
# Speed
monster_speed = ' '.join(str(attribute_data[2].text).split())
monster_dict['Speed'].append(monster_speed)


#find all tidbit  metrics
tidbit_label = stat_block.findAll('span', {'class':'mon-stat-block__tidbit-label'})

for label in tidbit_label:    
    '''
    The tidbit rows shift from monster to monster, so we can't index them by position.
    Instead, we loop through the monster's tidbit labels (e.g. Skills, Saving Throws, etc.) and, if they exist,
    take the sibling data (i.e. the actual data corresponding to each label) and deposit it into the dictionary.
    Any columns not in the monster data will be left blank for now. Each if statement is labeled with the corresponding tidbit.
    '''
    if label.text == "Saving Throws":
        monster_saving_throw = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Saving Throws'].append(monster_saving_throw)
    elif label.text == "Skills":
        monster_skills = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Skills'].append(monster_skills)
    elif label.text == "Damage Vulnerabilities":    
        monster_damage_vulnerability = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Damage Vulnerabilities'].append(monster_damage_vulnerability)
    elif label.text == "Damage Immunities":
        monster_damage_immunity = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Damage Immunities'].append(monster_damage_immunity)
    elif label.text == 'Condition Immunities':
        monster_condition_immunity = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Condition Immunities'].append(monster_condition_immunity)
    elif label.text == 'Senses':
        monster_senses = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Senses'].append(monster_senses)
    elif label.text == 'Languages':
        monster_languages = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Languages'].append(monster_languages)
    elif label.text == 'Challenge':
        monster_challenge= ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Challenge'].append(monster_challenge)
    elif label.text == 'Proficiency Bonus':
        monster_proficiency = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Proficiency Bonus'].append(monster_proficiency)
    elif label.text == 'Damage Resistances':
        monster_damage_resistence = ' '.join(str(label.find_next_sibling('span').text).split())
        monster_dict['Damage Resistances'].append(monster_damage_resistence)


#find all ability score metrics
ability_scores = stat_block.findAll('span',{'class':'ability-block__score'})
    # STR Score
monster_str = ability_scores[0].text
monster_dict['STR'].append(monster_str)
    # DEX Score
monster_dex = ability_scores[1].text
monster_dict['DEX'].append(monster_dex)
    # CON Score
monster_con = ability_scores[2].text
monster_dict['CON'].append(monster_con)
    # INT Score
monster_int = ability_scores[3].text
monster_dict['INT'].append(monster_int)
    # WIS Score
monster_wis = ability_scores[4].text
monster_dict['WIS'].append(monster_wis)
    # CHA Score
monster_cha = ability_scores[5].text
monster_dict['CHA'].append(monster_cha)    
    
# Traits: because traits doesn't contain any defining HTML or any headings such as Actions or Legendary Actions
# I searched through all the description blocks of the text. If they don't contain the div 'heading' then we print
# This allows us to only print the traits and to place them in a list if need be for later wrangling and analysis. 
             
trait_list = []
description_block = stat_block.findAll('div', {'class':'mon-stat-block__description-block'})
for block in description_block:
     if not block.findAll('div',{'class':'mon-stat-block__description-block-heading'}):
        for p in block.findAll('p'):
            trait_list.append(p.text)

#Remaining descriptions that had headings
description_heading = stat_block.findAll('div', {'class':'mon-stat-block__description-block-heading'})
action_list=[]
for heading in description_heading:    
    '''
    The description rows shift from monster to monster, so we can't index them by position.
    Instead, we loop through the monster's description headings (e.g. Actions, Legendary Actions, etc.) and, if they exist,
    take the sibling data (i.e. the actual data corresponding to each heading) and deposit it into the dictionary.
    Any columns not in the monster data will be left blank for now. Each if statement is labeled with the corresponding heading.
    '''
    action_list=[]
    if heading.text == "Actions":
        monster_actions = heading.find_next_sibling('div')
        for p in monster_actions.findAll('p'):
           action_list.append(p.text.strip())
        monster_dict['Actions'].append(action_list)
    elif heading.text == "Legendary Actions":
        monster_legendary_actions = heading.find_next_sibling('div')
        for p in monster_legendary_actions.findAll('p'):
           action_list.append(p.text.strip())
        monster_dict['Legendary Actions'].append(action_list)
    elif heading.text == "Mythic Actions":
        monster_mythic_actions = heading.find_next_sibling('div')
        for p in monster_mythic_actions.findAll('p'):
           action_list.append(p.text.strip())
        monster_dict['Mythic Actions'].append(action_list)
    elif heading.text == "Reactions":
        monster_reactions = heading.find_next_sibling('div')
        for p in monster_reactions.findAll('p'):
           action_list.append(p.text.strip())
        monster_dict['Reactions'].append(action_list)
         
#These final traits are either referring to the environment it lives in (can be multiple), the sub type its classified as,
# or the source book it came from. all of these or none of these may be represented in the monster sheet.
monster_tags = Environment.findAll('span') 

for tag in Environment.find_all("p"):
       
    if (tag.contents[0].strip()) == "Environment:":
       monster_dict['Environment:'].append(monster_tags[0].text)
    elif (tag.contents[0].strip()) == "Monster Tags:":
        monster_dict['Monster Tags:'].append(monster_tags[1].text)
    else:
        monster_dict['Source'].append(tag.contents[0].strip())
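
The Size, Type, and Alignment split near the top of this cell can be pulled into a small helper. This is only a sketch: `parse_meta` is a hypothetical name, and it assumes the meta line always looks like 'Size type, alignment' with a one-word size, which may not hold for every monster.

```python
def parse_meta(meta_text):
    """Split a stat block meta line such as 'Medium undead, lawful evil'."""
    meta_text = ' '.join(meta_text.split())  # collapse stray whitespace
    # Everything after the last ', ' is the alignment.
    before_comma, _, alignment = meta_text.rpartition(', ')
    # The first word is the size; the remaining words are the type and any sub-types.
    size, _, type_text = before_comma.partition(' ')
    return size, type_text.split(), alignment


print(parse_meta('Medium undead, lawful evil'))
# ('Medium', ['undead'], 'lawful evil')
print(parse_meta('Gargantuan monstrosity (titan), unaligned'))
# ('Gargantuan', ['monstrosity', '(titan)'], 'unaligned')
```

Splitting by position avoids calling `replace()` on the whole line, which could remove a matching substring from the wrong place.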


### Turn the dictionary into a dataframe
If there are any missing values, replace them with NaN

In [178]:
#ensure listlengths are the same
import pandas as pd


monster_dict = dict([ (k,pd.Series(v)) for k,v in monster_dict.items()])
monster_dict
#list_length = []

#for col in monster_dict:
#    list_length.append(len(monster_dict[col]))
#print(list_length)
#
monster_df = pd.DataFrame(monster_dict)
#
monster_df



| | Monster Name | Size | Type | Alignment | Traits | Damage Resistances | Monster Tags | Mythic Actions | Reactions | Source | ... | Proficiency Bonus | STR | DEX | CON | INT | WIS | CHA | Actions | Legendary Actions | Environment: |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mummy Lord | Medium | [undead] | lawful evil | | | | | | Basic Rules | ... | 5 | 18 | 10 | 17 | 11 | 18 | 16 | [Multiattack. The mummy can use its Dreadful G... | [The mummy lord can take 3 legendary actions, ... | Desert |


### Export to csv

In [179]:
monster_df.to_csv('../data/raw/MummyTest.csv')

## Test Over Time to Iterate
1. We will change our naming database since DnDBeyond is now active for us. We will first need to iterate through each of the pages of monster listings.
2. Then we will need to read each monster on each page and place it into our monster_list.
3. Next we will replace any spaces in the monster names with '-'; this is necessary for the URLs.
4. We will append each name to the monster URL and add it to the monster_url list, which we will then iterate over for our above test. 

### Parsing Request class and Selenium Function
We want our final request clean and clear, so we will create a reusable Request class with a get_selenium function.
This function will randomize our user agent to help protect against throttling/halting the scrape. We will 
also run this headless so as not to tax our computer. The function looks for a certain class and waits up to a certain
amount of time. If it sees the class, the function will return the page_source information; otherwise it will close the 
browser.

In [82]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem

from time import sleep

from bs4 import BeautifulSoup

ser = Service('../env/chromedriver.exe')

class Request:

    def __init__(self, url):
        self.url = url

    def get_selenium(self, class_name):
        '''
        Loads self.url in a headless Chrome browser with a randomized user agent.
        Returns the page source once an element with the given class name is present,
        or None if the wait times out or the driver raises an error.
        '''
        software_names = [SoftwareName.CHROME.value]
        operating_systems = [OperatingSystem.WINDOWS.value,
                             OperatingSystem.LINUX.value]
        user_agent_rotator = UserAgent(software_names=software_names,
                                       operating_systems=operating_systems,
                                       limit=100)
        user_agent = user_agent_rotator.get_random_user_agent()
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--window-size=1420,1080')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument(f'user-agent={user_agent}')
        browser = webdriver.Chrome(service=ser, options=chrome_options)
        browser.get(self.url)
        time_to_wait = 30
        try:
            WebDriverWait(browser, time_to_wait).until(
                EC.presence_of_element_located((By.CLASS_NAME, class_name))
            )
        except (TimeoutException, WebDriverException):
            #the class never appeared, so close the browser and signal failure
            browser.quit()
            return None
        else:
            browser.maximize_window()
            page_html = browser.page_source
            browser.quit()
            return page_html
                            
               


In [25]:
page_html = Request('https://www.dndbeyond.com/monsters/adult-green-dragon').get_selenium("mon-stat-block__name")


### DnD Monster Page Iteration
The listing pages all follow the same pattern, 'https://www.dndbeyond.com/monsters?page=', so we just need to iterate from 1 to 106 (the last page).

In [9]:

url = 'https://www.dndbeyond.com/monsters?page='

monster__name= []

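for i in range(1,107):
    
    #the page number must be converted to a string before it is joined to the url
    page_html = Request(url+str(i)).get_selenium('mon-stat-block__name')
    #get_selenium returns nothing when the wait times out, so skip that page
    if page_html is None:
        continue
    soup = BeautifulSoup(page_html, 'lxml')
    page_names = soup.find_all('span',{'class':'name'})

    for span in page_names:
        monster__name.append(span.text.strip())
        
    sleep(6)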
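
The page URLs can also be built up front, which keeps the string formatting separate from the scraping loop. This is a minimal sketch: the `page_urls` name is hypothetical, and the page count of 106 is the one stated in this section.

```python
base_url = 'https://www.dndbeyond.com/monsters?page='
last_page = 106

# An f-string converts the integer page number to text for us.
page_urls = [f'{base_url}{page}' for page in range(1, last_page + 1)]

print(len(page_urls))   # 106
print(page_urls[0])     # https://www.dndbeyond.com/monsters?page=1
print(page_urls[-1])    # https://www.dndbeyond.com/monsters?page=106
```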



In [4]:
monster_nospace=[]

#filter through and replace spaces with dashes to format for urls

for name in monster__name:
    if not(name.strip().isdigit()):
        new_name = name.replace(' ','-')
        monster_nospace.append(new_name)
    else:
        monster_nospace.append(name)


In [5]:
monster_name_preurl = []
#filter and replace '()' with nothing
for name in monster_nospace:
    if not(name.strip().isdigit()):
        new_name = name.replace('(','')
        final_name = new_name.replace(')','')
        monster_name_preurl.append(final_name)
    else:
        monster_name_preurl.append(name)
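
The two cleaning passes above can be combined into a single helper that also builds the final URL. This is a minimal sketch, where `to_url_name` and `monster_page_url` are hypothetical names. It only handles spaces and parentheses, as in the cells above, so any other punctuation in a name (apostrophes, commas) is left untouched.

```python
BASE_URL = 'https://www.dndbeyond.com/monsters/'


def to_url_name(name):
    """Replace spaces with dashes and drop parentheses."""
    # strip() is an extra safety step here; the scraped names were already stripped.
    return name.strip().replace(' ', '-').replace('(', '').replace(')', '')


def monster_page_url(name):
    """Build the monster page URL from a display name."""
    return BASE_URL + to_url_name(name)


print(to_url_name('Adult Green Dragon'))
# Adult-Green-Dragon
print(monster_page_url('Deep Gnome (Svirfneblin)'))
# https://www.dndbeyond.com/monsters/Deep-Gnome-Svirfneblin
```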


### Function to pull all data from DndBeyond
Using our test function from the Mummy, we will iterate over all the monsters in monster_name_preurl
to parse each monster page for their data and slam it into the dictionary!
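
One way to keep every column the same length across monsters is to collect the labels found on a page into a small dict first, then append either the value or NaN for each expected column. This is a sketch of that idea rather than the function used in this notebook; `append_row`, the shortened column list, and the sample values are all hypothetical.

```python
import math

# Shortened, hypothetical column list for illustration.
TIDBIT_COLUMNS = ['Saving Throws', 'Skills', 'Senses', 'Languages', 'Challenge']


def append_row(table, found, columns=TIDBIT_COLUMNS):
    """Append one monster's values to table, using NaN for missing labels."""
    for column in columns:
        # dict.get falls back to NaN when the monster has no such label;
        # labels in `found` that are not expected columns are simply ignored.
        table[column].append(found.get(column, math.nan))


table = {column: [] for column in TIDBIT_COLUMNS}

# Made-up sample data for two monsters with different labels present.
append_row(table, {'Senses': 'passive Perception 14', 'Challenge': '15 (13,000 XP)'})
append_row(table, {'Skills': 'History +5', 'Challenge': '1/4 (50 XP)'})

print([len(values) for values in table.values()])  # [2, 2, 2, 2, 2]
print(table['Challenge'])  # ['15 (13,000 XP)', '1/4 (50 XP)']
```

Because every column receives exactly one entry per monster, the lists stay aligned row by row.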

We saved our previous variables using store magic, so we don't need to rerun the monster names or column names each time.

In [50]:
%store -r monster_dict
%store -r monster__name
%store -r monster_name_preurl

In [47]:
import numpy as np  #needed for the NaN placeholders used inside the function
import pandas as pd

def monster_stat_gathering(soup):

    stat_block = soup.find('div',{'class':'mon-stat-block'}) 
    tags = soup.find('footer')

    # Monster Name
    monster_name = stat_block.find('div', {'class':'mon-stat-block__name'}).text
    monster_dict['Monster Name'].append(' '.join(str(monster_name).split())) 
    
    
    #This next set (Size,Alignment, and Type) will split the single meta text using split() and replace() functions
    monster_subinfo = stat_block.find('div', {'class':'mon-stat-block__meta'})
    monster_subinfo=monster_subinfo.text
    
    # Size (first word)
    monster_size = monster_subinfo.split()[0]
    monster_dict['Size'].append(monster_size) 
    # Alignment (after comma)
    monster_alignment = monster_subinfo.split(', ')[-1]
    monster_dict['Alignment'].append(monster_alignment) 
    # Type (remaining words). The sublist will remove the above two variables from the text, as well as the loose comma.
    #It will also create a list for the type, as sometimes there are sub-types associated with monsters (e.g Titan)
    sub_list=(monster_size,monster_alignment, ', ')
    monster_type = monster_subinfo
    for substring in sub_list:
        monster_type = monster_type.replace(substring,'')
    monster_type=monster_type.split()
    monster_dict['Type'].append(monster_type) 
    
    
    #find all attribute metrics
    attribute_data = stat_block.findAll('span',{'class':'mon-stat-block__attribute-data-value'})
    
    # Armor Class
    monster_ac = ' '.join(str(attribute_data[0].text).split())
    monster_dict['Armor Class'].append(monster_ac)
    # Hit Points
    monster_hp = ' '.join(str(attribute_data[1].text).split())
    monster_dict['Hit Points'].append(monster_hp)
    # Speed
    monster_speed = ' '.join(str(attribute_data[2].text).split())
    monster_dict['Speed'].append(monster_speed)
    
    
    #find all tidbit  metrics
    tidbit_label = stat_block.findAll('span', {'class':'mon-stat-block__tidbit-label'})
    tidbit_list = []
    for label in tidbit_label:    
        '''
        The tidbit rows shift from monster to monster, so we can't index them by position.
        Instead, we loop through the monster's tidbit labels (e.g. Skills, Saving Throws, etc.) and, if they exist,
        take the sibling data (i.e. the actual data corresponding to each label) and deposit it into the dictionary.
        Any columns not in the monster data will be left blank for now. Each if statement is labeled with the corresponding tidbit.
        '''
        tidbit_list.append(label.text)
        if label.text == "Saving Throws":
            monster_saving_throw = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Saving Throws'].append(monster_saving_throw)
        elif label.text == "Skills":
            monster_skills = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Skills'].append(monster_skills)
        elif label.text == "Damage Vulnerabilities":    
            monster_damage_vulnerability = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Damage Vulnerabilities'].append(monster_damage_vulnerability)
        elif label.text == "Damage Immunities":
            monster_damage_immunity = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Damage Immunities'].append(monster_damage_immunity)
        elif label.text == 'Condition Immunities':
            monster_condition_immunity = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Condition Immunities'].append(monster_condition_immunity)
        elif label.text == 'Senses':
            monster_senses = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Senses'].append(monster_senses)
        elif label.text == 'Languages':
            monster_languages = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Languages'].append(monster_languages)
        elif label.text == 'Challenge':
            monster_challenge= ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Challenge'].append(monster_challenge)
        elif label.text == 'Proficiency Bonus':
            monster_proficiency = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Proficiency Bonus'].append(monster_proficiency)
        elif label.text == 'Damage Resistances':
            monster_damage_resistence = ' '.join(str(label.find_next_sibling('span').text).split())
            monster_dict['Damage Resistances'].append(monster_damage_resistence)

    #start with the full list of tidbits; each one found on this monster is removed from it
    missing_tidbit_list=["Saving Throws", "Skills", "Damage Vulnerabilities", "Damage Immunities", "Condition Immunities", "Senses", "Languages", "Challenge", "Proficiency Bonus", "Damage Resistances"]
    
    for tidbit in tidbit_list:
        #skip labels outside the expected list so remove() cannot raise a ValueError
        if tidbit in missing_tidbit_list:
            missing_tidbit_list.remove(tidbit)

    #add NaN value to all missing tidbits for this monster
    for tidbit in missing_tidbit_list:
        monster_dict[tidbit].append(np.nan)

    
    #find all ability score metrics
    ability_scores = stat_block.findAll('span',{'class':'ability-block__score'})
        # STR Score
    monster_str = ability_scores[0].text
    monster_dict['STR'].append(monster_str)
        # DEX Score
    monster_dex = ability_scores[1].text
    monster_dict['DEX'].append(monster_dex)
        # CON Score
    monster_con = ability_scores[2].text
    monster_dict['CON'].append(monster_con)
        # INT Score
    monster_int = ability_scores[3].text
    monster_dict['INT'].append(monster_int)
        # WIS Score
    monster_wis = ability_scores[4].text
    monster_dict['WIS'].append(monster_wis)
        # CHA Score
    monster_cha = ability_scores[5].text
    monster_dict['CHA'].append(monster_cha)    
        
    # Traits: because traits doesn't contain any defining HTML or any headings such as Actions or Legendary Actions
    # I searched through all the description blocks of the text. If they don't contain the div 'heading' then we print
    # This allows us to only print the traits and to place them in a list if need be for later wrangling and analysis. 
                 
    trait_list = []
    description_block = stat_block.findAll('div', {'class':'mon-stat-block__description-block'})
    for block in description_block:
         if not block.findAll('div',{'class':'mon-stat-block__description-block-heading'}):
            for p in block.findAll('p'):
                trait_list.append(p.text)
    #if no traits are found, create a NaN value
    if not trait_list:
        trait_list.append(np.nan)
    monster_dict["Traits"].append(trait_list)
    
    #Remaining descriptions that had headings
    description_heading = stat_block.findAll('div', {'class':'mon-stat-block__description-block-heading'})
    action_list=[]
    action_headings=[]
    for heading in description_heading:    
        '''
        The description rows shift from monster to monster, so we can't index them by position.
        Instead, we loop through the monster's description headings (e.g. Actions, Legendary Actions, etc.) and, if they exist,
        take the sibling data (i.e. the actual data corresponding to each heading) and deposit it into the dictionary.
        Any columns not in the monster data will be left blank for now. Each if statement is labeled with the corresponding heading.
        '''
        #create a list of heading actions to use for missing actions 
        action_headings.append(heading.text)

        # the four sections are parsed identically: collect the <p> texts of the sibling div
        if heading.text in ("Actions", "Legendary Actions", "Mythic Actions", "Reactions"):
            section = heading.find_next_sibling('div')
            action_list = [p.text.strip() for p in section.findAll('p')]
            monster_dict[heading.text].append(action_list)

    # like the tidbits, start from the full list of action categories and remove the ones
    # this monster has; whatever remains is missing and gets a NaN
    missing_actions_list = ["Actions", "Legendary Actions", "Mythic Actions", "Reactions"]

    for action in action_headings:
        # skip headings outside the four categories (or repeated ones), which would
        # otherwise make list.remove raise ValueError
        if action in missing_actions_list:
            missing_actions_list.remove(action)

    # add a NaN value for every action category this monster lacks
    for action in missing_actions_list:
        monster_dict[action].append(np.nan)



    # These final fields give the environments the monster lives in (can be several), the monster tags
    # (sub-type) it is classified under, and the source book it came from. Any or none of them may appear.
    # There may be more than one environment or monster tag, so we iterate through the span children and append them
    monster_env_list=[]
    monster_tag_list=[]
    for tag in tags.find_all("p"): 
        if (tag.contents[0].strip()) == "Environment:":
            monster_env = tag.findChildren("span",recursive=False)
            for span in monster_env:
                monster_env_list.append(span.text)
            monster_dict['Environment:'].append(monster_env_list)
        elif (tag.contents[0].strip()) == "Monster Tags:":
            monster_tag = tag.findChildren("span",recursive=False)
            for span in monster_tag:
                monster_tag_list.append(span.text)
            monster_dict['Monster Tags:'].append(monster_tag_list)
        else:
            monster_dict['Source'].append(tag.contents[0].strip())
    
    # find out whether either of the two tags is missing for this monster and fill it with NaN
    monster_tags=[]
    missing_tags=['Environment:',"Monster Tags:"]
    
    for tag in tags.find_all("p",{'class':'tags'}):
        monster_tags.append(tag.contents[0].strip())

    for tag in monster_tags:
        # guard against labels other than the two above, which would make remove raise ValueError
        if tag in missing_tags:
            missing_tags.remove(tag)
    
    for tag in missing_tags:
        monster_dict[tag].append(np.nan)
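
The NaN bookkeeping used for the action categories and tags (fill every category a monster lacks so that all columns stay the same length) can be sketched with a membership check instead of `list.remove`. This is a minimal illustration, not the notebook's own function: `pad_missing`, `EXPECTED_HEADINGS`, and `monster_dict_demo` are hypothetical names, and the stored text is a placeholder.

```python
import math

# Hypothetical helper for illustration; not part of the original notebook.
EXPECTED_HEADINGS = ["Actions", "Legendary Actions", "Mythic Actions", "Reactions"]

def pad_missing(record, found_headings, expected=EXPECTED_HEADINGS, fill=math.nan):
    """Append `fill` to every expected column that was not found for this monster."""
    found = set(found_headings)
    for heading in expected:
        if heading not in found:
            record[heading].append(fill)
    return record

# Toy dictionary with one monster that only has an "Actions" section.
monster_dict_demo = {heading: [] for heading in EXPECTED_HEADINGS}
monster_dict_demo["Actions"].append(["Bite. (placeholder text)"])

# An unexpected heading such as "Bonus Actions" is simply ignored here.
pad_missing(monster_dict_demo, ["Actions", "Bonus Actions"])
print({name: len(values) for name, values in monster_dict_demo.items()})
```

Because the helper only checks membership, a heading outside the expected list cannot raise an error, and every expected column grows by exactly one entry per monster.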

In [48]:
monster_dict

{'Monster Name': ['Adult Green Dragon',
  'Adult Silver Dragon',
  'Adult White Dragon',
  'Air Elemental',
  'Ape',
  'Assassin',
  'Azer',
  'Black Pudding',
  'Blink Dog',
  'Blood Hawk',
  'Boar',
  'Bone Devil',
  'Bulette',
  'Chain Devil',
  'Couatl',
  'Crab',
  'Crocodile',
  'Cultist',
  'Death Dog',
  'Deva',
  'Dire Wolf',
  'Diseased Giant Rat',
  'Draft Horse',
  'Dragon Turtle',
  'Dryad',
  'Duergar',
  'Earth Elemental',
  'Efreeti',
  'Faerie Dragon (Older)',
  'Fire Giant',
  'Flying Snake',
  'Giant Centipede',
  'Giant Constrictor Snake',
  'Giant Crocodile',
  'Giant Elk',
  'Giant Fire Beetle',
  'Goblin',
  'Gorgon',
  'Griffon',
  'Grimlock',
  'Guardian Naga',
  'Guardian Wolf',
  'Hezrou',
  'Hill Giant',
  'Hippogriff',
  'Invisible Stalker',
  'Lamia',
  'Lemure',
  'Mage',
  'Magma Mephit',
  'Minotaur',
  'Minotaur Skeleton',
  'Mule',
  'Mummy',
  'Mummy Lord',
  'Night Hag',
  'Nyxborn lynx',
  'Polar Bear',
  'Raegrin Mau',
  'Rat',
  'Riding Horse',
 

### Log in to retrieve our paid content
Some of the data is behind a login paywall. We only want to grab the data we have paid for,
so we will have Selenium log in for us before parsing.

### Iterate over monster pages
Don't grab any info that we don't have access to.

In [13]:
monster__name

['Adult Deep Dragon',
 'Adult Emerald Dragon',
 'Adult Gold Dragon',
 'Adult Green Dragon',
 'Adult Kruthik',
 'Adult Moonstone Dragon',
 'Adult Oblex',
 'Adult Red Dracolich',
 'Adult Red Dragon',
 'Adult Sapphire Dragon',
 'Adult Silver Dragon',
 'Adult Topaz Dragon',
 'Adult White Dragon',
 'Aeorian Absorber',
 'Aeorian Nullifier',
 'Aeorian Reverser',
 'Aerisi Kalinoth',
 'Agdon Longscarf',
 'Ahmaergo',
 'Air Elemental',
 'Air Elemental Myrmidon',
 'Akroan Hoplite',
 'Alagarthas',
 'Albino Dwarf Spirit Warrior',
 'Albino Dwarf Warrior',
 'Aldani (Lobsterfolk)',
 'Alhoon',
 'Alkilith',
 'Allip',
 'Allosaurus',
 'Almiraj',
 'Alseid',
 'Amarith Coppervein',
 'Amble',
 'Ambush Drake',
 'Amethyst Dragon Wyrmling',
 'Amethyst Greatwyrm',
 'Amidor the Dandelion',
 'Ammalia Cassalanter',
 'Amnizu',
 'Animated Knife',
 'Animated Staff of Frost',
 'Animated Statue',
 'Animated Stove',
 'Animated Table',
 'Animated Tree',
 'Ankheg',
 'Ankylosaurus',
 'Ankylosaurus Zombie',
 'Annis Hag',
 'Anv

In [None]:
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup


url = 'https://www.dndbeyond.com/monsters/'
j=0

# iterate through monster names and add each to the url
for i in monster_name_preurl[0:21]:
  page_html = None
  # request the html using our selenium function
  page_html = Request(url+i).get_selenium('mon-stat-block__name')
  j+=1
  # if we get a monster page, the html will not be None (the function looks for a certain element)
  # use beautifulsoup and our own monster extraction function to place the information into the dictionary
  if page_html is not None:
      soup = BeautifulSoup(page_html, 'lxml')
      monster_stat_gathering(soup)
  sleep(60)
  print(j)
%store monster_dict

In [16]:
monster_dict
%store monster_dict

Stored 'monster_dict' (dict)


In [75]:
#ensure every column list has the same length before building the DataFrame
list_length = []

for col in monster_dict:
    list_length.append(len(monster_dict[col]))
print(list_length)

monster_df = pd.DataFrame(monster_dict)


[159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159, 159]
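
When the lengths do not all match, a bare list of numbers does not say which column is off. A small sketch that names the offending columns is shown below; `find_ragged_columns` and the toy `demo` dictionary are hypothetical and only illustrate the idea.

```python
from collections import Counter

def find_ragged_columns(columns):
    """Return {column_name: length} for columns whose length differs from the most common length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}

# Toy data: "Reactions" is one entry short.
demo = {
    "Monster Name": ["Ape", "Wolf"],
    "Size": ["Medium", "Medium"],
    "Reactions": [float("nan")],
}
ragged = find_ragged_columns(demo)
print(ragged)  # {'Reactions': 1}
```

An empty result means every column has the same length, which is what `pd.DataFrame` needs when it is built from a dictionary of lists.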


In [77]:
monster_df

Unnamed: 0,Monster Name,Size,Type,Alignment,Traits,Damage Resistances,Monster Tags:,Mythic Actions,Reactions,Source,...,Proficiency Bonus,STR,DEX,CON,INT,WIS,CHA,Actions,Legendary Actions,Environment:
0,Adult Green Dragon,Huge,[dragon],lawful evil,[Amphibious. The dragon can breathe air and wa...,,,,,Basic Rules,...,+5,23,12,21,18,15,17,[Multiattack. The dragon can use its Frightful...,"[The dragon can take 3 legendary actions, choo...",[Forest]
1,Adult Silver Dragon,Huge,[dragon],lawful good,[Legendary Resistance (3/Day). If the dragon f...,,,,,Basic Rules,...,+5,27,10,25,16,13,21,[Multiattack. The dragon can use its Frightful...,"[The dragon can take 3 legendary actions, choo...","[Mountain, Urban]"
2,Adult White Dragon,Huge,[dragon],chaotic evil,[Ice Walk. The dragon can move across and clim...,,,,,Basic Rules,...,+5,22,10,22,8,12,12,[Multiattack. The dragon can use its Frightful...,"[The dragon can take 3 legendary actions, choo...",[Arctic]
3,Air Elemental,Large,[elemental],neutral,[Air Form. The elemental can enter a hostile c...,"Lightning, Thunder; Bludgeoning, Piercing, and...",,,,Basic Rules,...,+3,14,20,14,6,10,6,[Multiattack. The elemental makes two slam att...,,"[Desert, Mountain]"
4,Ape,Medium,[beast],unaligned,[nan],,[Misc Creature],,,Basic Rules,...,+2,16,14,14,6,12,7,"[Multiattack. The ape makes two fist attacks.,...",,[Forest]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154,Wolf,Medium,[beast],unaligned,[Keen Hearing and Smell. The wolf has advantag...,,[Misc Creature],,,Basic Rules,...,+2,12,15,12,3,12,6,"[Bite. Melee Weapon Attack: +4 to hit, reach 5...",,"[Forest, Grassland, Hill]"
155,Young Brass Dragon,Large,[dragon],chaotic good,[nan],,,,,Basic Rules,...,+3,19,10,17,12,11,15,[Multiattack. The dragon makes three attacks: ...,,[Desert]
156,Young Copper Dragon,Large,[dragon],chaotic good,[nan],,,,,Basic Rules,...,+3,19,12,17,16,13,15,[Multiattack. The dragon makes three attacks: ...,,[Hill]
157,Young Red Dragon,Large,[dragon],chaotic evil,[nan],,,,,Basic Rules,...,+4,23,10,21,14,11,19,[Multiattack. The dragon makes three attacks: ...,,"[Hill, Mountain]"


In [76]:
monster_df.to_csv('../data/raw/Partial_Monster_Data4.csv')
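
One thing to keep in mind when this CSV is read back: columns that hold Python lists (Traits, Actions, Environment: and so on) are written as their string form, so they come back as strings rather than lists. The sketch below shows one possible way to turn such a cell back into a list. `parse_list_cell` is a hypothetical helper, and the special case for `[nan]` is an assumption about how the NaN placeholder in the Traits column is rendered.

```python
import ast
import math

def parse_list_cell(cell):
    """Turn a stringified list such as "['Forest', 'Hill']" back into a Python list."""
    if not isinstance(cell, str) or not cell.strip().startswith("["):
        return cell  # leave NaN floats and plain strings untouched
    if cell.strip() == "[nan]":
        # a bare nan is not a Python literal, so ast.literal_eval would reject it
        return [math.nan]
    return ast.literal_eval(cell)

print(parse_list_cell("['Forest', 'Grassland', 'Hill']"))  # ['Forest', 'Grassland', 'Hill']
print(parse_list_cell("Basic Rules"))  # Basic Rules
```

Applied column by column (for example with `Series.apply`), this would restore the list columns after loading the file.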

# Round 2: Reparse
That run captured only about half of the monsters. Our wait time may be too short, or the connection may have dropped intermittently. I will consolidate the results of all the test runs into one list of parsed names, then remove those names from the master list so we don't rerun them on the next parse.
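
If short waits or connection blips are the cause, another option besides a full rerun is to retry a page a few times with a longer pause after each miss. The sketch below assumes a fetch function that returns `None` on failure, as the selenium request in this notebook appears to do. `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the stub stands in for the real request.

```python
import time

def fetch_with_retries(fetch, name, attempts=3, base_wait=1.0, sleep=time.sleep):
    """Call fetch(name) until it returns something other than None, waiting longer after each miss."""
    for attempt in range(attempts):
        html = fetch(name)
        if html is not None:
            return html
        if attempt < attempts - 1:
            sleep(base_wait * 2 ** attempt)  # 1x, 2x, 4x ... the base wait
    return None

# Stub that fails twice before succeeding, standing in for a slow page load.
calls = {"n": 0}
def flaky_fetch(name):
    calls["n"] += 1
    return None if calls["n"] < 3 else "<html>" + name + "</html>"

waits = []  # record the requested pauses instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "goblin", sleep=waits.append)
print(result, waits)  # <html>goblin</html> [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` would be used with a much larger `base_wait`.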

In [1]:
%store -r monster_dict

In [78]:
#Create a list of monster names that we have already parsed for comparison.
monster_name_parsed = list(monster_dict["Monster Name"])
monster_name_parsed

['Adult Green Dragon',
 'Adult Silver Dragon',
 'Adult White Dragon',
 'Air Elemental',
 'Ape',
 'Assassin',
 'Azer',
 'Black Pudding',
 'Blink Dog',
 'Blood Hawk',
 'Boar',
 'Bone Devil',
 'Bulette',
 'Chain Devil',
 'Couatl',
 'Crab',
 'Crocodile',
 'Cultist',
 'Death Dog',
 'Deva',
 'Dire Wolf',
 'Diseased Giant Rat',
 'Draft Horse',
 'Dragon Turtle',
 'Dryad',
 'Duergar',
 'Earth Elemental',
 'Efreeti',
 'Faerie Dragon (Older)',
 'Fire Giant',
 'Flying Snake',
 'Giant Centipede',
 'Giant Constrictor Snake',
 'Giant Crocodile',
 'Giant Elk',
 'Giant Fire Beetle',
 'Goblin',
 'Gorgon',
 'Griffon',
 'Grimlock',
 'Guardian Naga',
 'Guardian Wolf',
 'Hezrou',
 'Hill Giant',
 'Hippogriff',
 'Invisible Stalker',
 'Lamia',
 'Lemure',
 'Mage',
 'Magma Mephit',
 'Minotaur',
 'Minotaur Skeleton',
 'Mule',
 'Mummy',
 'Mummy Lord',
 'Night Hag',
 'Nyxborn lynx',
 'Polar Bear',
 'Raegrin Mau',
 'Rat',
 'Riding Horse',
 'Roc',
 'Roper',
 'Rug of Smothering',
 'Rust Monster',
 'Saber-Toothed Tiger

In [80]:
#Using our original list from DnDWiki, let's see which monsters we are missing from the basic rules

#DnDWiki Difference
monster_names_not_in_dndwiki = list(name for name in monster_name_parsed  if name not in monster_names_dndwiki)

print(len(monster_names_not_in_dndwiki), "not in DnDWiki")


#DnDBeyond difference
monster_names_not_parsed_from_DnDBeyond = list(name for name in monster_names_dndwiki if name not in monster_name_parsed)

print(len(monster_names_not_parsed_from_DnDBeyond), "on the DnDWiki list we haven't parsed")

75 not in DnDWiki
147 on the DnDWiki list we haven't parsed
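
The same two-way comparison can be sketched with sets, which also gives the overlap directly. `compare_name_lists` is a hypothetical helper and the inputs are toy data, not the real lists. Note that sets drop duplicate names, so the counts can differ from the list-based version if a name appears twice.

```python
def compare_name_lists(parsed, reference):
    """Split two name lists into names only in `parsed`, only in `reference`, and in both."""
    parsed_set, reference_set = set(parsed), set(reference)
    return {
        "parsed_only": sorted(parsed_set - reference_set),
        "reference_only": sorted(reference_set - parsed_set),
        "both": sorted(parsed_set & reference_set),
    }

# Toy inputs for illustration only.
demo = compare_name_lists(["Ape", "Goblin", "Mule"], ["Ape", "Goblin", "Rat"])
print(demo)
# {'parsed_only': ['Mule'], 'reference_only': ['Rat'], 'both': ['Ape', 'Goblin']}
```

With unique names, `len(both) + len(parsed_only)` equals the number of parsed monsters and `len(both) + len(reference_only)` equals the size of the reference list, which is a quick sanity check on the printed counts.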


## Interesting Results
With 101 monsters parsed, 173 of the 231 monsters on the dndwiki list had not been parsed, so only 58 of our parsed monsters appear on that list. The other 43 (101 - 58) are available on DnDBeyond but are not on the dndwiki page. (The counts printed above, 75 and 147, appear to come from a later rerun with more monsters parsed.)

This means we shouldn't use the dndwiki list, since it will clearly miss cool monsters.

## Can we make a DnDBeyond friendly list?
I don't want to parse 1300 or even 1200 pages again. Is there a way to parse only the friendly (accessible) info?

Or, what if I iterated through the dropdown on the monster page with Selenium clicks? That is how I got the monster names, and I would still only parse what I have access to.

In [81]:
#turn the monster parsed list into a "preurl" list

monster_nospace=[]

#filter through and replace spaces with dashes to format for urls

for name in monster_name_parsed:
    if not(name.strip().isdigit()):
        new_name = name.replace(' ','-')
        monster_nospace.append(new_name)
    else:
        monster_nospace.append(name)

monster_name_preurl_parsed = []
#filter and replace '()' with nothing
for name in monster_nospace:
    if not(name.strip().isdigit()):
        new_name = name.replace('(','')
        final_name = new_name.replace(')','')
        monster_name_preurl_parsed.append(final_name)
    else:
        monster_name_preurl_parsed.append(name)



#remove the parsed list from the full preurl list
monster_name_preurl_second_round = list(name for name in monster_name_preurl if name not in monster_name_preurl_parsed)
len(monster_name_preurl_second_round) 

1221
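
The two formatting passes above (spaces to dashes, then dropping parentheses) can be sketched as a single helper, with a set for the membership test. `to_url_slug` and `remaining_slugs` are hypothetical names. The sketch mirrors the transformation used here and makes no claim about what URLs the site actually accepts.

```python
def to_url_slug(name):
    """Replace spaces with dashes and drop parentheses, as in the two loops above."""
    return name.replace(" ", "-").replace("(", "").replace(")", "")

def remaining_slugs(all_slugs, parsed_names):
    """Return the slugs from `all_slugs` whose monsters have not been parsed yet."""
    done = {to_url_slug(name) for name in parsed_names}
    return [slug for slug in all_slugs if slug not in done]

print(to_url_slug("Faerie Dragon (Older)"))  # Faerie-Dragon-Older

todo = remaining_slugs(
    ["Adult-Green-Dragon", "Air-Elemental", "Faerie-Dragon-Older"],
    ["Adult Green Dragon", "Faerie Dragon (Older)"],
)
print(todo)  # ['Air-Elemental']
```

Looking names up in a set keeps the filter fast even with over a thousand monsters, while the list comprehension preserves the original order of the master list.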

## Rerun
We have removed the 101 monsters we have already parsed and will rerun with the remaining 1279. I will increase the wait time to 20 seconds; hopefully this will improve capture.

## Rerun 2
We have removed the additional 58 monsters captured by the second run and will rerun with the remaining 1221. I will increase the wait time to 30 seconds.

In [73]:
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup


url = 'https://www.dndbeyond.com/monsters/'
j=0

# iterate through the remaining monster names and add each to the url
for i in monster_name_preurl_second_round[0:700]:
  page_html = None
  # request the html using our selenium function
  page_html = Request(url+i).get_selenium('mon-stat-block__name')
  j+=1
  # if we get a monster page, the html will not be None (the function looks for a certain element)
  # use beautifulsoup and our own monster extraction function to place the information into the dictionary
  if page_html is not None:
      soup = BeautifulSoup(page_html, 'lxml')
      monster_stat_gathering(soup)
  sleep(60)
  print(j)
%store monster_dict

1
2
3
...
277


159