# The Mushroom Pokédex - The Mokédex

## The aim of this project is to build a deep learning algorithm that can predict a species of mushroom from a photo, returning the species, common name, and whether it is edible, poisonous, or psychedelic. 

To run this project you need the correct packages installed. You can install them however you like, based on the modules imported in this notebook. If you retrieved this notebook from my GitHub repo, you will notice a requirements.txt file that lists all the packages used in my fastai-based projects.
Here is a link to the relevant repo folder: https://github.com/alexmarsian/Personal-Projects/tree/master/FastAI/Week1

Yes, it would be cleaner to create a new virtual environment for each separate project, but I suspect I will be returning to the same modules over and over again.

Note as well that this is not built out to its full capacity; this project was meant for me to learn about building a deep learning model using the fastai library. I have thought about how to create a more complete Mokédex, and have archived that for the time being in the mokédex-side-project directory in the fastai folder on my repo.

### Part 1 - Collecting image URLs

At the end of this document we will have generated folders containing CSV files of image URLs and a list of all classes (labels). Downloading the images and using the fastai library will continue in another document, 'week1-cnn'.

I had to split this project into two documents because of difficulty uploading the chromedriver exe to paperspace gradient (where I have a GPU linked notebook). Thus in part 1 I will scrape the data, and in part 2 I will upload the CSV files to paperspace, download the images using fastAI, and train the CNN. 

### Initial Setup

Setting up the jupyter notebook environment:

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

First we need to scrape the data on various mushroom species. To do this I am going to use Selenium WebDriver to browse the relevant websites in Google Chrome, BeautifulSoup to parse data from the HTML of a webpage, and the fastai datablock to store our information.

In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup, SoupStrainer

First configure the Selenium WebDriver to use Google Chrome. You need to download the relevant chromedriver exe for your OS from here: https://sites.google.com/a/chromium.org/chromedriver/. I saved the chromedriver exe in the same dir as this notebook, so the path to it can be built from the current working directory:

In [3]:
import os.path
path = os.getcwd() + '/chromedriver'

### Scraping Mushroom Data

Now we are going to start scraping data on mushrooms using this website: http://www.foragingguide.com/
I will first scrape all the data on edible mushrooms, and then retrieve data on poisonous mushrooms. We should be able to use the same scripts to scrape each set of data. 

For each mushroom we want to store:
* Species (Latin Name)
* Common Name
* Type: Edible, Poisonous, Psychedelic
* Frequency

So here is a mushroom class for storing this information:

In [4]:
class Mushroom:
    """Each mushroom contains a species, common name, frequency, and edibility"""
    def __init__(self, species, name, mushroom_type, frequency):
        self.spec = species
        self.name = name
        self.type = mushroom_type
        self.freq = frequency

Open the page containing edible mushrooms, sorted A-Z by Latin name, retrieve the HTML content of the page, and parse the HTML using BeautifulSoup4.

In [5]:
def get_html(url):
    driver = webdriver.Chrome(executable_path=path)
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html
edible_html = get_html("http://www.foragingguide.com/mushrooms/edible_by_latin_name")

To parse a website's html we need to understand the structure of the html. To get an idea for this, I opened the webpage http://www.foragingguide.com/mushrooms/edible_by_latin_name, right clicked and chose 'inspect'. Looking at how the information for the first mushroom 'Agaricus augustus' is stored in the html, we see this structure in the inspect column:

<img src='mushroom-html-example.jpg'>

So all the information we want for each individual mushroom is stored in strings within a div class=info. Each div class=info is within a div class=list_div. There are a number of ways to retrieve the data we want, but I am going to try to use the most memory-efficient method.
* We will use SoupStrainer to parse only div class="info" elements, and then use the .stripped_strings method from BeautifulSoup to create a generator that returns descendant strings (whitespace stripped). 
* For each div class=info, we expect 3 strings to be returned.  
    1. The latin name (species)
    2. The common name, from which we will strip the opening '(' and closing ')'
    3. The frequency and edibility in the format 'frequency, edibility' which we can split into two strings 'frequency' and 'edibility'. 
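
The three-string layout described above can be sketched without a browser. This is a minimal illustration, assuming the strings always arrive in the order listed; `parse_info_strings` and the sample data are hypothetical names used only for this example and are not part of the notebook.

```python
def parse_info_strings(strings):
    """Group a flat sequence of strings into (species, name, frequency, edibility) tuples.

    Assumes every mushroom contributes exactly three strings, in the order
    species, '(common name)', 'frequency, edibility'.
    """
    records = []
    it = iter(strings)
    for species in it:
        # Second string: common name wrapped in brackets
        name = next(it).strip("()")
        # Third string: split once on the comma, then tidy each half
        frequency, edibility = (part.strip().capitalize()
                                for part in next(it).split(",", 1))
        records.append((species, name, frequency, edibility))
    return records


# Hypothetical sample mirroring the first entry on the edible page
sample = ["Agaricus augustus", "(The Prince)", "occasional, edible good"]
records = parse_info_strings(sample)
print(records)
```

Pulling the second and third strings with `next` inside the loop keeps everything as a single pass over the iterator, which is the same idea the notebook relies on with `.stripped_strings`.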

In [6]:
soup = BeautifulSoup(edible_html, "html.parser", parse_only=SoupStrainer(attrs='info'))

Now I'm going to define a function that goes through each string in the parsed info divs, strips the brackets from the common name, splits the frequency and edibility into two strings, and stores the information in a new Mushroom instance, returning a list of Mushroom objects:

In [7]:
# BeautifulSoup Object + List -> ListOfMushroom
# Creates a list of mushroom objects given the parsed html content
def get_mushies(soup, list_of_mushies):
    all_strings = soup.stripped_strings
    for string in all_strings:
        species = string
        name = next(all_strings)
        name = name.replace(')', '').replace('(', '')
        freq_and_type = next(all_strings).split(',')
        frequency = freq_and_type[0].strip().capitalize()
        mushroom_type = freq_and_type[1].strip().capitalize()
        list_of_mushies.append(Mushroom(species, name, mushroom_type, frequency))
    return list_of_mushies

mushies = get_mushies(soup, [])

To see that it has worked:

In [8]:
for m in mushies:
    if m.spec == "Agaricus augustus":
        print("Species: " + m.spec)
        print("Common name: " + m.name)
        print("Type: " + m.type.capitalize())
        print("Frequency: " + m.freq.capitalize())

Species: Agaricus augustus
Common name: The Prince
Type: Edible good
Frequency: Occasional


Now this process can be repeated for the poisonous mushrooms:

In [9]:
# Get html content for the poisonous mushrooms
poisonous_html = get_html("http://www.foragingguide.com/mushrooms/poisonous_by_latin_name")
# Again parse in only the info tags
soup = BeautifulSoup(poisonous_html, "html.parser", parse_only=SoupStrainer(attrs='info'))
# Call get_mushies handing in the previously built list 
mushies = get_mushies(soup, mushies)

In [10]:
for m in mushies:
    print(m.spec + ': ' + m.type)

Agaricus augustus: Edible good
Agaricus campestris: Edible excellent
Agaricus Langei: Edible good
Agaricus silvicola: Edible good
Aleuria aurantia: Edible
Amanita rubescens: Edible
Armillaria mellea: Edible
Auricularia auricula-judae: Edible
Boletus appendiculatus: Edible good
Boletus badius: Edible good
Boletus chrysenteron: Edible
Boletus Edulis: Edible excellent
Boletus Luridiformis: Edible
Boletus luridus: Edible
Boletus pruinatus: Edible
Calocybe gambosa: Edible
Calvatia giantea: Edible good
Camarophyllus pratensis: Edible good
Cantharellus cibarius: Edible excellent
Cantharellus tubaeformis: Edible good
Clitocybe geotropa: Edible
Clitocybe gibba: Edible
Clitocybe odora: Edible
Clitopilus prunulus: Edible good
Coprinus comatus: Edible
Coprinus micaceus: Edible
Cuphophyllus pratensis: Edible good
Fistulina hepatica: Edible
Flammulina velutipes: Edible
Grifola frondosa: Edible
Handkea excipuliformis: Edible
Handkea utriformis: Edible
Hydnum repandum: Edible good
Hydnum rufescens: Ed

There are a few species of psychedelic mushroom that were not added to this list. I am going to add them manually, as there are only 3. Notice also that Psilocybe semilanceata was included under the type 'Poisonous'. We will edit that to be psychedelic as well.

In [11]:
mushies.append(Mushroom("Psilocybe azurescens", "Stamets and Gartz", "Psychedelic", "Common"))
mushies.append(Mushroom("Psilocybe cubensis", "Golden Cap", "Psychedelic", "Common"))
mushies.append(Mushroom("Psilocybe cyanescens", "Wavy Cap", "Psychedelic", "Common"))
for m in mushies:
    if m.spec == "Psilocybe semilanceata":
        m.type = "Psychedelic"

In [12]:
for m in mushies:
    print(m.spec + ': ' + m.type)
print(str(len(mushies)) + ' Mushrooms foraged from the internet!')

Agaricus augustus: Edible good
Agaricus campestris: Edible excellent
Agaricus Langei: Edible good
Agaricus silvicola: Edible good
Aleuria aurantia: Edible
Amanita rubescens: Edible
Armillaria mellea: Edible
Auricularia auricula-judae: Edible
Boletus appendiculatus: Edible good
Boletus badius: Edible good
Boletus chrysenteron: Edible
Boletus Edulis: Edible excellent
Boletus Luridiformis: Edible
Boletus luridus: Edible
Boletus pruinatus: Edible
Calocybe gambosa: Edible
Calvatia giantea: Edible good
Camarophyllus pratensis: Edible good
Cantharellus cibarius: Edible excellent
Cantharellus tubaeformis: Edible good
Clitocybe geotropa: Edible
Clitocybe gibba: Edible
Clitocybe odora: Edible
Clitopilus prunulus: Edible good
Coprinus comatus: Edible
Coprinus micaceus: Edible
Cuphophyllus pratensis: Edible good
Fistulina hepatica: Edible
Flammulina velutipes: Edible
Grifola frondosa: Edible
Handkea excipuliformis: Edible
Handkea utriformis: Edible
Hydnum repandum: Edible good
Hydnum rufescens: Ed

### Collecting URLs of Mushroom Images

So now we have a non-exhaustive list of mushroom species, some of them edible, some of them poisonous, and a few psychedelic. We have their common name and frequency as well. We could really build out this dataset and include more information to genuinely create a mushroom equivalent of a Pokédex, but I will save that project for another time. The main goal of this is to get hands-on with the fastai package and train a deep learning model.

The next step is to collect URLs of images of each species using Google Images. Using Google Images is not perfect: we cannot be sure that these mushrooms are correctly labelled. For that we would need access to a better source, some kind of visual encyclopedia of forageable mushrooms, and such a thing does not exist yet. Furthermore, my first attempt at building a deep learning model to predict all 84 species was only 50% accurate, so this time I'm attempting the simpler task of classifying non-psychedelic and psychedelic species.
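
As a sketch of how such a search URL can be assembled, the snippet below uses `urllib.parse.urlencode` from the standard library so that quotes and spaces in the species name are percent-encoded. The function name `build_image_search_url` and its `exclude` parameter are assumptions made for illustration; the `q`, `tbm=isch` and `safe=off` parameters and the excluded terms mirror the query used in this notebook's scraping cell.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_image_search_url(species, exclude=("cartoon", "drawing", "diagram", "art", "text")):
    """Return a Google Images search URL for an exact-phrase species query."""
    # Exact phrase in double quotes, followed by the excluded terms
    terms = ['"' + species + '"'] + ["-" + word for word in exclude]
    params = {"q": " ".join(terms), "tbm": "isch", "safe": "off"}
    # urlencode percent-encodes the quotes and turns spaces into '+'
    return "https://www.google.com/search?" + urlencode(params)


url = build_image_search_url("Boletus edulis")
print(url)

# Round-trip check: decoding the query string recovers the original search terms
decoded = parse_qs(urlparse(url).query)
print(decoded["q"][0])
```

Letting the standard library do the encoding avoids surprises if a species or common name ever contains characters such as `&` or `'`.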

In [13]:
import csv, time
from fastai.vision import *

# String -> URL
# Create a google image search URL for an exact-phrase query string,
# excluding results tagged as cartoon, drawing, diagram, art or text
def make_google_image_URL(query):
    query = '"' + query.replace(' ', '+') + '"'
    return "https://www.google.com/"\
           "search?q="+query+"+-cartoon+-drawing+-diagram+-art+-text&tbm=isch&safe=off" # No cartoon images


# String Integer -> ListOfURLs
# Given a google image search query and a number of images, return a list of URLs that is num_imgs long
# Original function idea from here: https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d
# The below function is a heavily modified version of the original:
def fetch_image_urls(species, query:str, num_imgs:int, sleep_between_interactions:int=1):
        
    # Fn to scroll to end of google images results page (loads more images)
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
        
    # Instantiate Selenium WebDriver and load query page
    wd = webdriver.Chrome(executable_path=path)
    wd.get(query)
      
    # Go through all img tags and add source to a list of image URLs 
    # until desired number of URLs have been gathered
    
    image_urls = set() # Avoids retrieval of duplicate URLs
    image_count = 0
    attempts = 0 # Allows for breaking of loop if can't find desired number of links
    while image_count < num_imgs:
        
        # Scroll first so that the page source includes any newly loaded images
        scroll_to_end(wd)
        html = wd.page_source
            
        # Parse only the img tags from the html using BeautifulSoup
        images = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer('img'))
        img_tags = images.find_all('img')
        
        number_results = len(img_tags)
        
        if number_results == 0:
            print("No images retrieved. Check your query or html selectors.")
            break
        
        for img in img_tags:
            # extract image urls
            if img.has_attr('src'):
                image_urls.add(img['src'])   

        image_count = len(image_urls)

        if image_count >= num_imgs:
            print(f"Found {image_count} image links for {species}.")
            break
        try:
            wd.find_element_by_xpath('/html/body/div[2]/c-wiz/div[3]/div[1]/div/div/div/div/div[5]/input').click()
        except Exception:
            # Element not found or not clickable yet: retry until the attempt limit is hit
            if attempts > 1000:
                print(f"Could only find {image_count} image links for {species}.")
                break
            else:
                attempts += 1
    
    wd.quit()
    
    return list(image_urls)


# ListOfObjects Attribute Integer -> CSV files in folders: objectclass-images/attribute/attribute.csv
# Function to collect google image URLs based on an object attribute from a list of objects
# In this case we have a list of mushroom objects
# If using this in another setting, you may need to change paths to folders, files, chromedriver etc.

def get_URL_CSVs(list_of_objs, attr, num_imgs):
    
    # Make a folder to store the CSVs (if it doesn't already exist)
    # Will create a dir like this: ./nameofobjectclass-images
    folder = type(list_of_objs[0]).__name__.lower()
    out_dir = folder+'-images'
    os.makedirs(out_dir, exist_ok=True)
    
    for o in list_of_objs:
        
        # Create folder labels and csv filename
        # Will create nameofobjectclass/attribute/attribute.csv
        query = str(getattr(o,attr))
        n_dir = query.replace(' ', '-').lower()
        new_dir = out_dir + '/' + n_dir
        filename = n_dir + '.csv'
        
        # If file containing URLs already exists go to next object: 
        if os.path.isfile(new_dir+'/'+filename):
            continue
        else:
            # Get list of URLs
            query_url = make_google_image_URL(query)
            URLs = fetch_image_urls(query, query_url, num_imgs)
                        
            # Write list of URLs to CSV in a new folder labelled by the attribute 
            os.makedirs(new_dir, exist_ok=True)
            with open(new_dir+'/'+filename, 'w') as f:
                URLs = map(lambda x:x+'\n', URLs)
                f.writelines(URLs)
        
get_URL_CSVs(mushies, 'spec', 300)


So we have scraped our mushroom image URLs from google images and stored them in CSV files that are in separate folders labelled according to their species. Each species folder is contained in a folder named after the Mushroom class ('mushroom-images'). Now to simplify our folder and class structure, I will combine all the CSVs from non-psilocybe species into one CSV in one folder called 'non-psychedelic'. We will upload the entire 'mushroom-images' folder to paperspace. 
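
The folder layout described above can be expressed compactly. The sketch below is illustrative only: `csv_path_for` is a hypothetical helper, and it assumes the same naming rule as the notebook (lower-case species name with spaces replaced by hyphens).

```python
from pathlib import Path

def csv_path_for(species, root="mushroom-images"):
    """Return the CSV path for a species: root/<label>/<label>.csv."""
    label = species.replace(" ", "-").lower()
    return Path(root) / label / (label + ".csv")


p = csv_path_for("Psilocybe cubensis")
print(p.as_posix())
```

Using `pathlib` rather than string concatenation keeps the separators correct on any OS, which may matter when moving the folder between a local machine and a hosted notebook.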

In [14]:
# Generate the 5 classes, and a list of all directories in the mushroom-images folder
dirs = []
classes = ['non-psychedelic']
for m in mushies:
    if m.spec.startswith('Psilocybe'):
        classes.append(m.spec.replace(' ', '-').lower())
    dirs.append(m.spec.replace(' ', '-').lower())
print(classes)
print(dirs)

['non-psychedelic', 'psilocybe-semilanceata', 'psilocybe-azurescens', 'psilocybe-cubensis', 'psilocybe-cyanescens']
['agaricus-augustus', 'agaricus-campestris', 'agaricus-langei', 'agaricus-silvicola', 'aleuria-aurantia', 'amanita-rubescens', 'armillaria-mellea', 'auricularia-auricula-judae', 'boletus-appendiculatus', 'boletus-badius', 'boletus-chrysenteron', 'boletus-edulis', 'boletus-luridiformis', 'boletus-luridus', 'boletus-pruinatus', 'calocybe-gambosa', 'calvatia-giantea', 'camarophyllus-pratensis', 'cantharellus-cibarius', 'cantharellus-tubaeformis', 'clitocybe-geotropa', 'clitocybe-gibba', 'clitocybe-odora', 'clitopilus-prunulus', 'coprinus-comatus', 'coprinus-micaceus', 'cuphophyllus-pratensis', 'fistulina-hepatica', 'flammulina-velutipes', 'grifola-frondosa', 'handkea-excipuliformis', 'handkea-utriformis', 'hydnum-repandum', 'hydnum-rufescens', 'hygrocybe-coccinea', 'hygrocybe-pratensis', 'hygrocybe-punicea', 'hygrocybe-virginea', 'hygrophorus-pratensis', 'kuehneromyces-mutab

### Only continue if you want to simplify the classes from 84 different species into 5: non-psychedelic and the 4 psilocybe species. 

In [23]:
# Create a new mushroom-images dir with a 'non-psychedelic' sub-dir
import shutil
main = os.getcwd()
non_psyche = main + '/' + 'mushroom-images-v2' + '/' + 'non-psychedelic'
# exist_ok=True avoids a FileExistsError when this cell is re-run
os.makedirs(non_psyche, exist_ok=True)
# Copy all CSVs from the non-psilocybe species into the non-psychedelic dir
# Copy the psilocybe dirs into the mushroom-images-v2 folder
for d in dirs:
    if not d.startswith('psilocybe'):
        shutil.copy(main + '/mushroom-images' + '/' + d + '/' + d + '.csv', non_psyche)
    else:
        dest = main + '/' + 'mushroom-images-v2' + '/' + d
        # copytree fails if the destination already exists, so skip dirs copied on an earlier run
        if not os.path.isdir(dest):
            shutil.copytree(main + '/mushroom-images' + '/' + d, dest)

As originally written (os.makedirs without exist_ok=True), re-running this cell raised:

FileExistsError: [Errno 17] File exists: '/Users/alexmars/Documents/CS/Personal-Projects/FastAI/Week1/mushroom-images-v2/non-psychedelic'

In [32]:
# Combine all the CSVs in the non-psychedelic dir into one CSV
with open(non_psyche + '/' + 'non-psychedelic.csv', 'a') as newfile:
    for d in dirs:
        if not d.startswith('psilocybe'):
            with open(non_psyche+'/' + d + '.csv') as f:
                for line in f:
                    newfile.write(line)
            # Remove the per-species CSV once its lines have been merged
            os.remove(non_psyche+'/' + d + '.csv')

So now we have only 5 classes: non-psychedelic, and the 4 psilocybe species organised into 5 individual CSVs in 5 directories, all within the mushroom-images-v2 folder. We will upload this entire folder to paperspace. 
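
Because the same image can turn up under several searches, it may be worth dropping duplicate URLs when merging. The following is a minimal sketch rather than the method used above: `merge_url_files` is a hypothetical helper, and the demonstration writes to a temporary directory using made-up example.com URLs.

```python
import tempfile
from pathlib import Path

def merge_url_files(csv_paths, out_path):
    """Concatenate one-URL-per-line files into out_path, skipping blanks and duplicates."""
    seen = set()
    with open(out_path, "w") as out:
        for path in csv_paths:
            with open(path) as f:
                for line in f:
                    url = line.strip()
                    # Write each URL only the first time it is seen
                    if url and url not in seen:
                        seen.add(url)
                        out.write(url + "\n")
    return len(seen)


# Demonstration with placeholder URLs in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    a.write_text("http://example.com/1.jpg\nhttp://example.com/2.jpg\n")
    b.write_text("http://example.com/2.jpg\nhttp://example.com/3.jpg\n")
    merged = Path(tmp) / "merged.csv"
    count = merge_url_files([a, b], merged)
    merged_lines = merged.read_text().splitlines()

print(count, merged_lines)
```

This sketch opens the output in write mode, so running it twice gives the same result, whereas the append mode used in the cell above adds to whatever is already in the file.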

### Continued in 'week1-cnn' which requires a GPU linked notebook instance