# The Mushroom Pokédex - The Mokédex

## The aim of this project is to build a deep learning algorithm that can predict a species of mushroom from a photo, returning the species, common name, and whether it is edible, poisonous, or psychedelic. 

To run this project you need the correct packages installed. You may install them however you like, based on the modules imported in this notebook, but if you retrieved this notebook from my GitHub repo, you will notice a requirements.txt file which lists all the packages used in my fastai-based projects.
Here is a link to the relevant repo folder: https://github.com/alexmarsian/Personal-Projects/tree/master/FastAI/Week1

Yes, it would be more efficient to create a new virtual environment for each separate project, but I suspect I will be returning to the same modules over and over again.

### Initial Setup

Setting up the jupyter notebook environment:

In [6]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

First we need to scrape data on various mushroom species from the web. To do this I am going to use Selenium WebDriver to browse through relevant websites in Google Chrome, BeautifulSoup to parse data from the HTML of a webpage, and the fastai datablock to store our information.

In [7]:
from selenium import webdriver
from bs4 import BeautifulSoup, SoupStrainer

First configure the Selenium webdriver to use Google Chrome. You need to download the relevant chromedriver executable for your OS from here: https://sites.google.com/a/chromium.org/chromedriver/. I saved the chromedriver executable in the same directory as this notebook, so the path handed to the Selenium webdriver is built from the current working directory:

In [8]:
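import os
# chromedriver sits next to this notebook (on Windows the file may be named chromedriver.exe)
path = os.path.join(os.getcwd(), 'chromedriver')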

### How to gather Mushroom data?

Now we are going to start scraping data on mushrooms using Wikipedia. 
I will first scrape all the data on edible mushrooms which is less well organised, and then retrieve data on poisonous, deadly, and psychedelic mushrooms. 
The pages to be used for scraping are:
* Edible: https://en.wikipedia.org/wiki/Edible_mushroom
* Poisonous: https://en.wikipedia.org/wiki/List_of_poisonous_fungus_species
* Deadly: https://en.wikipedia.org/wiki/List_of_deadly_fungus_species
* Psychedelic: There are only 4, so we will retrieve their individual pages:
    * Psilocybe azurescens: https://en.wikipedia.org/wiki/Psilocybe_azurescens
    * Psilocybe cubensis: https://en.wikipedia.org/wiki/Psilocybe_cubensis
    * Psilocybe cyanescens: https://en.wikipedia.org/wiki/Psilocybe_cyanescens
    * Psilocybe semilanceata: https://en.wikipedia.org/wiki/Psilocybe_semilanceata

Given the variety of webpages to be used here, we will not be able to write one script to scrape everything. Unfortunately there is no online encyclopedia of mushrooms that contains all the information for every type of mushroom; such a source would have allowed this process to be automated.

Before scraping any data, we need to consider what information we want to collect, and how we should store it. 
I am going to create a mushroom class that will store:
* Species (latin name)
* Common Name
* Type: Edible, Poisonous, Deadly, or Psychedelic
* Info: A short description of the mushroom
* Similar: Known look alikes

In [14]:
class Mushroom:
    """Each mushroom stores a species, common name, type, short description, and known look-alikes"""
    def __init__(self, species, name, mushroom_type, info, similar):
        self.spec = species
        self.name = name
        self.type = mushroom_type
        self.info = info
        self.simi = similar
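
As a side note, the same record could also be written as a dataclass, which gives a readable `repr` for free and lets the fields that are filled in later default to `None`. This is only a sketch: the name `MushroomRecord` and the `None` defaults are my own assumptions and are not used elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MushroomRecord:
    """Hypothetical dataclass variant of the Mushroom class (illustration only)."""
    species: str                         # Latin name
    name: Optional[str] = None           # common name, if known
    mushroom_type: Optional[str] = None  # Edible, Poisonous, Deadly, or Psychedelic
    info: Optional[str] = None           # short description, gathered later
    similar: Optional[str] = None        # known look-alikes


# Only the fields known up front are supplied; the rest stay None for now.
example = MushroomRecord(species="Psilocybe cubensis", mushroom_type="Psychedelic")
print(example)
```

Keeping the optional fields as `None` makes it easy to see which records still need a description or look-alikes after each scraping pass.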

Now we need to consider where that information can be retrieved from, and whether any parts of the data collection are uniform across all mushroom types, i.e. can be contained within a single function.

For the edible mushrooms, every entry on the Wikipedia page follows a similar format: "Species Name" - Common Name or Info.

<img src='edible-mushroom-eg.jpg'>
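
A minimal sketch of how one such entry could be split into its two parts, assuming the text has already been extracted as a single string. The helper name `parse_edible_entry` and the sample string are hypothetical (the sample is typed by hand, not scraped), and the real page may need extra cleaning.

```python
def parse_edible_entry(text):
    """Split 'Species name - common name or info' into (species, rest).

    Returns (species, None) when no separator is present.
    """
    # Treat an en dash surrounded by spaces the same as a plain hyphen.
    normalised = text.replace(" – ", " - ")
    species, sep, rest = normalised.partition(" - ")
    return species.strip(), (rest.strip() if sep else None)


# Hand-typed sample string in the format described above.
print(parse_edible_entry("Boletus edulis - porcini"))
```

Splitting on the hyphen with its surrounding spaces, rather than on a bare hyphen, leaves hyphenated names intact.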

The pages for poisonous and deadly mushrooms, by contrast, share almost exactly the same tabular format, with columns for species name, common name, active agent, distribution, and similar edible species.

<img src='toxic-mushroom-eg.jpg'>
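
Because both tables share one column layout, a single helper can map a row onto the arguments of the `Mushroom` constructor. The sketch below assumes each row has already been reduced to a list of cell strings in the column order listed above; `COLUMNS`, `row_to_fields`, and the placeholder row are hypothetical names and values, not scraped data.

```python
# Column order as described above: species, common name, active agent,
# distribution, similar edible species.
COLUMNS = ("species", "name", "agent", "distribution", "similar")


def row_to_fields(cells, mushroom_type):
    """Map one table row (list of cell strings) onto Mushroom constructor arguments."""
    row = dict(zip(COLUMNS, (cell.strip() for cell in cells)))
    return {
        "species": row.get("species"),
        "name": row.get("name") or None,        # empty cell -> None
        "mushroom_type": mushroom_type,         # "Poisonous" or "Deadly", set by the caller
        "info": None,                           # description is gathered later
        "similar": row.get("similar") or None,  # empty cell -> None
    }


# Placeholder row, only to show the shape of the data.
fields = row_to_fields(["Genus species", "a common name", "an agent", "a region", ""], "Poisonous")
print(fields)
```

The resulting dictionary uses the constructor's parameter names, so it can be passed on as `Mushroom(**fields)`.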

Notably, neither page really provides a description, but we can obtain one by going to the Wikipedia page of a particular species and reading its Description section. The following is the Wikipedia description for Gyromitra esculenta:

<img src='mushroom-info-eg.jpg'>
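
The species pages linked in this notebook all follow the pattern `https://en.wikipedia.org/wiki/Genus_species`, so the page address can be built from the species name. A small sketch, assuming that pattern holds for every species we collect (redirects and disambiguation pages are not handled); `species_url` is a hypothetical helper name.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def species_url(species):
    """Build the Wikipedia page URL for a species name such as 'Gyromitra esculenta'."""
    # Wikipedia uses underscores in place of spaces; quote() escapes anything unusual.
    return WIKI_BASE + quote(species.strip().replace(" ", "_"))


print(species_url("Gyromitra esculenta"))
```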

### Summary of Required Tasks for Data Scraping

* We can gather species name, common name, and type for edible mushrooms from a single Wikipedia page
* We can gather species name, common name, type, and look-alikes for both poisonous and deadly mushrooms using the same method, as the data is organised identically on Wikipedia
* Once we have all the species names we want, we can use the same method to gather a description of every mushroom species based on the Wikipedia description
* We can manually add the psychedelic species as there are only 4
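
For the last point, the four psychedelic species can simply be written out by hand. The sketch below builds one dictionary of constructor arguments per species; `PSYCHEDELIC_SPECIES` and `psychedelic_fields` are hypothetical names, and the common name, description, and look-alikes are left as `None` because they have not been collected yet.

```python
# The four species listed earlier in this notebook.
PSYCHEDELIC_SPECIES = [
    "Psilocybe azurescens",
    "Psilocybe cubensis",
    "Psilocybe cyanescens",
    "Psilocybe semilanceata",
]


def psychedelic_fields(species_names):
    """Return one dict of Mushroom constructor arguments per species name."""
    return [
        {
            "species": species,
            "name": None,                 # common name not collected yet
            "mushroom_type": "Psychedelic",
            "info": None,                 # description is gathered later
            "similar": None,
        }
        for species in species_names
    ]


psychedelic = psychedelic_fields(PSYCHEDELIC_SPECIES)
print(len(psychedelic), psychedelic[0])
```

Each dictionary can then be turned into an object with `Mushroom(**fields)`.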

### Collecting Data on Edible Mushrooms

Open the page containing edible mushrooms and retrieve the HTML content of the page.

In [9]:
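def get_html(url):
    """Open url in Chrome and return the rendered page source."""
    # executable_path is the Selenium 3 style argument; newer releases may expect a Service object
    driver = webdriver.Chrome(executable_path=path)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Close the browser even if the page fails to load
        driver.quit()
html_content = get_html("https://en.wikipedia.org/wiki/Edible_mushroom")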

In [20]:
# Only parse the elements marked 'info'; the rest of the page is skipped
parse_only = SoupStrainer(attrs='info')
soup = BeautifulSoup(html_content, "html.parser", parse_only=parse_only)

Now I'm going to define a function that goes through each string in the parsed info divs, strips the brackets from the common name, splits the frequency and edibility into two strings, and stores the result in a Mushroom object, returning a list of Mushroom objects:

In [22]:
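# BeautifulSoup Object + List -> ListOfMushroom
# Creates a list of mushroom objects given the parsed html content
def get_mushies(soup, list_of_mushies):
    all_strings = soup.stripped_strings
    for string in all_strings:
        species = string
        # Default to '' so an incomplete final entry does not raise StopIteration
        common_name = next(all_strings, '')
        common_name = common_name.replace(')', '').replace('(', '')
        freq_and_edib = next(all_strings, '').split(',')
        frequency = freq_and_edib[0].strip()
        edibility = freq_and_edib[1].strip() if len(freq_and_edib) > 1 else ''
        # Mushroom takes (species, name, mushroom_type, info, similar).
        # Assumption: edibility is stored as the type; frequency is parsed but not
        # stored, since the class has no field for it. info and similar come later.
        list_of_mushies.append(Mushroom(species, common_name, edibility, None, None))
    return list_of_mushies

mushies = get_mushies(soup, [])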
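
The function above relies on the `for` loop and the `next()` calls sharing one iterator, so each pass through the loop consumes three consecutive strings. A stripped-down sketch of that pattern on a plain list (`group_in_threes` is a hypothetical name used only for this illustration):

```python
def group_in_threes(strings):
    """Group a flat sequence into (first, second, third) tuples using one shared iterator."""
    it = iter(strings)
    groups = []
    for first in it:             # the for loop takes the first string of each group
        second = next(it, None)  # next() takes the following two from the same iterator;
        third = next(it, None)   # the None default avoids StopIteration on a short tail
        groups.append((first, second, third))
    return groups


print(group_in_threes(["a", "b", "c", "d", "e"]))
# → [('a', 'b', 'c'), ('d', 'e', None)]
```

Without the default argument, a call to `next()` on an exhausted iterator raises `StopIteration` from inside the loop body, which is why an incomplete final group needs to be handled explicitly.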